CN103440287B

CN103440287B - Web question-answering retrieval system based on structured product information

Info

Publication number: CN103440287B
Application number: CN201310354888.4A
Authority: CN
Inventors: 郝志峰; 温雯; 蔡瑞初; 王鸿飞; 张奇; 张鑫; 刘建明; 王宗武
Original assignee: Guangdong University of Technology
Current assignee: BEIMING SOFTWARE CO., LTD.; Guangdong University of Technology; Foshan University
Priority date: 2013-08-14
Filing date: 2013-08-14
Publication date: 2016-12-28
Anticipated expiration: 2033-08-14
Also published as: CN103440287A

Abstract

The invention provides a Web question-answering retrieval system based on structured product information. The system comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database. The system obtains the latest state of online product information in real time and, through the information extraction and integration modules, promptly updates the structured product data already in the database or adds new structured product data, so that the system adapts to changes in online product information. In addition, the invention collects product information from multiple information sources and, through the information extraction and integration modules, integrates the information that different websites publish about the same product: contradictory information is adjudicated, and missing information is filled in from the other sources, which ensures the completeness and authenticity of the retrieved information. The invention is thus a Web question-answering retrieval system based on structured product information with high retrieval efficiency.

Description

Web question-answering retrieval system based on structured product information

Technical field

The invention relates to the field of extracting, modeling, and searching unstructured and semi-structured Internet information, and specifically to a Web question-answering retrieval system and method based on structured product information. It is an improvement on Web question-answering retrieval systems based on structured product information.

Background art

The 21st century is the information age, and the network has become an indispensable part of people's lives. With the rapid development of the Internet, people's demand for online information grows daily, and a massive amount of information exists on the Internet. However, because of inherent characteristics of the Internet such as its large capacity and dynamic nature, this information is often fragmented and unorganized and contains a large amount of invalid data, which reduces the efficiency with which people can use these rich information resources. To solve this "information overload" problem, many companies and research institutions have turned to research on automatic question answering systems.

A question answering system (Question Answering System, QA) is an advanced form of information retrieval system. It answers questions that users pose in natural language with accurate, concise natural language. Research on such systems arose mainly from people's need to obtain information quickly and accurately. Question answering is currently a closely watched research direction with broad prospects in artificial intelligence and natural language processing.

By knowledge domain, existing question answering systems can be divided into "closed-domain" and "open-domain" systems. Closed-domain systems focus on answering questions in a specific domain, and most current question answering systems are closed-domain. Open-domain systems aim to place no restriction on the scope of the questions and are considerably harder to build.

Existing closed-domain question answering systems include application No. 200810233734 of Kunming University of Science and Technology, titled "Answer extraction method for a tourism question answering system based on ontology reasoning". That method concerns answer extraction for a tourism question answering system. First, the concepts, attributes, and relations of the tourism domain are defined manually, a tourism-domain ontology knowledge base is built manually, and the consistency of the ontology is then checked. Second, the semantic information in the ontology knowledge base is used to semantically disambiguate the user's question. Third, the semantic rules of the tourism domain are defined manually. Fourth, based on the question analysis results after disambiguation, answers are extracted from the ontology knowledge base by combining reasoning over the corresponding semantic rules with information retrieval. Finally, a corresponding answer extraction algorithm is designed for each question type to improve the response speed and recall of the system.

As can be seen, the method of that invention requires extensive manual intervention: building the knowledge base, defining the concepts and attributes, and formulating the semantic rules all require human participation. Excessive manual involvement raises labor costs and requires dedicated personnel to maintain and update the system.

Summary of the invention

The object of the invention is to address the above problems by providing a Web question-answering retrieval system based on structured product information that ensures the completeness and authenticity of the retrieved information and has high retrieval efficiency.

The technical solution of the invention is as follows. The Web question-answering retrieval system based on structured product information comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database, wherein:

The user interface handles all communication between the Web question answering system and the user: it obtains the product-related natural language question entered by the user and passes it to the question processing module, and it returns the corresponding search results and related web pages to the user;

The product information crawling module crawls web pages at fixed time intervals, stores the crawled pages, and passes them to the information extraction module for processing;

The information extraction module processes the unstructured web page information crawled by the product information crawling module, converts it into structured information, connects to the structured product information database through the database interface, and stores the processed structured information in the database;

The inverted index building module extracts key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages;

The database interface provides a unified interface for database operations such as accessing and updating structured product data, together with access permission control;

The information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface, and saves the integrated structured data in the database;

The question processing module converts the natural language question entered by the user into a structured statement. It connects to the user through the user interface to obtain the natural language question, connects to the database through the database interface, queries the database with the converted statement, and returns the query result to the user through the user interface.

The question processing module converts the natural language question in two steps: it first classifies the question with a trained naive Bayes classifier, and then uses a skip-chain CRF model to recognize and extract the named entities in the question.

The named entities are mobile phone names and mobile phone attributes.

The skip-chain CRF model is developed from the linear conditional random field (Linear CRF) model and is one kind of conditional random field (CRF) model.

Conventional named entity recognition methods ignore the role that conjunctions such as "and" and "or" play in a sentence. The skip-chain CRF model establishes a link between the words before and after a conjunction, which helps improve the final precision. For the recognition model used to extract named entities from query questions, the skip-chain CRF model is trained on a training set to obtain the named entity recognition and decision criteria for product information, and the question is then converted into keywords and product attributes that are meaningful for retrieval.

The information integration module first derives an attribute mapping table from the attribute values in the two tables to be processed, mapping to each other the attribute names that have the same meaning in the two tables but may be named differently, which facilitates the subsequent integration. It then creates a target table according to the mapping and rearranges the columns of the two tables into a common order. The primary key value, which uniquely identifies a record, determines whether corresponding records in the two tables are comparable: records with equal primary keys are considered comparable. For comparable records, the information in the two tables is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are inserted into the target table one by one, which yields the integrated target table. If there are more than two tables, two tables are processed at a time and the method is repeated until the final result is obtained.

The product information crawling module crawls, at fixed time intervals, the web pages that present digital product details on large digital-product websites such as pconline and PCPOP (泡泡网), stores the crawled pages, and passes them to the information extraction module for processing.

The question processing module converts the natural language question entered by the user into a structured SQL statement. It connects to the user through the user interface to obtain the natural language question, connects to the structured product information database through the database interface, queries the database with the converted SQL statement, and returns the query result of the SQL statement to the user through the user interface.

The invention is directed to an analysis system for unstructured and semi-structured product information. It integrates information about the same product from multiple sources, ensuring that the information is authentic and complete, and it uses a classification algorithm and a named entity recognition algorithm to convert natural language questions into structured database query statements. For the fine-grained sentiment analysis system of product review information, an algorithm based on instance similarity is used to integrate information about the same product from different sources.

The instance-similarity-based algorithm integrates information in two steps, mapping and merging. In the mapping step, the algorithm computes the similarity between corresponding elements of the two tables; in the merging step, the two tables are merged according to the result of the mapping step. For the fine-grained sentiment analysis system of product review information, the question is first classified, a recognition model is then built to extract the named entities in the question, and finally, according to the results of the first two steps, the natural language question is converted into an SQL statement with the corresponding rules.

The Web question-answering retrieval system based on structured product information has the following advantages. 1) The invention adapts well to changing product information on the Internet. The periodic information update and collection technique proposed for the system collects changes to product information on the Internet promptly, so the latest state of online product information is obtained in real time. Through the information extraction and integration modules, the structured product data already in the database can be updated, or new structured product data added, in time, so that the system adapts to changes in online product information. 2) The product information collected by the invention is more complete and more authentic. The invention collects product information from multiple information sources and, through the information extraction and integration modules, integrates the information that different websites publish about the same product: contradictory information is adjudicated, and missing information is filled in from the other sources, which ensures the completeness and authenticity of the information. 3) The invention has high retrieval efficiency. Unlike a traditional information retrieval system, which returns web pages related to keywords, the invention, while also providing related web pages, passes the user's natural language question through the question processing module, which applies a series of steps such as question classification and named entity recognition, converts the question into a structured SQL statement, and finally queries the database with that statement and returns the most concise result to the user. The invention is a convenient and practical Web question-answering retrieval system based on structured product information. It is an advanced form of information retrieval that answers questions posed in natural language with accurate, concise language.

Brief description of the drawings

Fig. 1 is the architecture diagram of the Web question answering system of the invention;

Fig. 2 is a schematic diagram of the implementation of the inverted index building module of the invention;

Fig. 3 is a schematic diagram of the implementation of the data integration module of the invention;

Fig. 4 is a schematic diagram of the implementation of the question processing module of the invention;

Fig. 5 is a schematic diagram of the implementation of question classification in the question processing module of the invention;

Fig. 6 is a schematic diagram of the implementation of named entity recognition in the question processing module of the invention;

Fig. 7 is the graph structure of the Linear-CRF model, taking the named entity recognition task as an example.

Detailed description of the invention

Embodiment:

To make the objects, technical solutions, and advantages of the invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Fig. 1 shows the architecture of the Web question answering system based on structured product information.

With reference to Fig. 1, the Web question answering system of the invention comprises a user interface, a question processing module, a database interface, a structured product information database, an information integration module, an information extraction module, a product information crawling module, and an inverted index building module.

The user interface handles all communication between the Web question answering system and the user: it obtains the product-related natural language question entered by the user and passes it to the question processing module, and it returns the corresponding search results and related web pages to the user.

The product information crawling module crawls, at fixed time intervals, the web pages that present details of digital products such as mobile phones and computers on large digital-product websites such as pconline and PCPOP (泡泡网), stores the crawled pages, and passes them to the information extraction module for processing.

The information extraction module processes the unstructured web page information crawled by the product information crawling module, such as a mobile phone's clock frequency and screen size. It converts this unstructured information into structured information, connects to the structured product information database through the database interface, and stores the processed structured information in the database.

The inverted index building module extracts key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages.

The database interface provides a unified interface for database operations such as accessing and updating structured product data, together with access permission control.

The information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface, and saves the integrated structured data in the database. The invention first derives an attribute mapping table from information such as the attribute values in the two tables to be processed, mapping to each other the attribute names that have the same meaning in the two tables but may be named differently, which facilitates the subsequent integration. It then creates a target table according to the mapping and rearranges the columns of the two tables into a common order. The primary key value, which uniquely identifies a record, determines whether corresponding records in the two tables are comparable: records with equal primary keys are considered comparable. For comparable records, the information in the two tables is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are inserted into the target table one by one, which yields the integrated target table. If there are more than two tables, two tables are processed at a time and the method is repeated until the final result is obtained.

The question processing module converts the natural language question entered by the user into a structured SQL statement. It connects to the user through the user interface to obtain the natural language question, connects to the structured product information database through the database interface, queries the database with the converted SQL statement, and returns the query result of the SQL statement to the user through the user interface. The invention converts the natural language question in two steps: it first classifies the question with a trained naive Bayes classifier, and then uses a skip-chain CRF model to recognize and extract the named entities in the question, such as mobile phone names and mobile phone attributes. The skip-chain CRF model is developed from the Linear CRF (linear conditional random field) model and is one kind of CRF (conditional random field) model. Conventional named entity recognition methods generally ignore the role that conjunctions such as "and" and "or" play in a sentence. The skip-chain CRF model establishes a link between the words before and after a conjunction, which helps improve the final precision.

The invention uses an algorithm based on similarity computation to integrate information about the same product from different sources. The product information crawling module crawls multiple digital-product websites so that the collected product information is as complete and rich as possible. However, different websites may use different names for the same attribute or give different attribute values, so information about the same product from different sources may be redundant or contradictory. The similarity-based algorithm adopted by the invention effectively integrates such redundant or contradictory information, ensuring both the completeness of the data and a high degree of correctness.

The present invention uses the method for Question Classification and name Entity recognition to be converted into structurized by natural language question sentence SQL statement.Question sentence is carried out classification and question sentence can be carried out finer process, different classes of question sentence is used different turning Change rule, it is possible to increase the system understandability to natural language question sentence.Name entity in natural language question sentence is known Being the most substantially to be identified the main body in question sentence or object, main body and object in a Rational Solutions question sentence could be in conjunction with concrete Transformation rule carry out question sentence conversion.The natural language question sentence of plurality of classes can be entered by the question sentence converting algorithm that the present invention uses Row converts, and is able to ensure that higher accuracy rate.

In summary, the main modules of the system are the question processing module, the data integration module, and the inverted index building module. These three modules are described in further detail below with reference to the drawings.

Fig. 2 is a schematic diagram of the implementation of the inverted index building module. With reference to Fig. 2, this module extracts key content from the web pages crawled by the product information crawling module, builds an inverted index over these pages, and stores it. The index construction process can be divided into three parts:

1) Preprocessing stage: Htmlparser is used to extract the key content of each web page and remove the noise information in the page, which improves the accuracy of later retrieval. The extracted data are used to construct Lucene Document objects and the corresponding Field objects.

2) Analysis stage: the data are passed to Lucene for indexing by calling the addDocument(Document) method of the index writer (IndexWriter). When indexing, Lucene first analyzes the data to make them more suitable for indexing.

3) Index writing: after the input data have been analyzed, the results are written to the index file, and the input data are stored in the inverted index data structure.
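The data structure that these three stages produce can be illustrated with a small, library-free sketch. It does not use Lucene or Htmlparser; the functions `tokenize`, `build_inverted_index`, and `search` are hypothetical stand-ins, and the simple tokenizer handles only ASCII letters and digits.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (stand-in for an analyzer)."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]


def build_inverted_index(pages):
    """Map each term to the sorted list of page ids whose text contains it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokenize(text):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of pages that contain every query term (boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return []
    postings = [set(index.get(t, ())) for t in terms]
    return sorted(set.intersection(*postings))


# Illustrative page texts, assumed to be already stripped of HTML noise.
pages = {
    1: "Nokia 5230 touch screen phone",
    2: "Nokia N8 camera phone",
    3: "Laptop with large screen",
}
index = build_inverted_index(pages)
```

Looking up a term then costs one dictionary access instead of a scan over every stored page, which is the reason for keeping the index alongside the crawled pages.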

Fig. 3 is a schematic diagram of the implementation of the data integration module. With reference to Fig. 3, this module integrates the structured information from multiple data sources output by the information extraction module and stores the integrated structured data in the database. The module can be divided into two submodules:

1) Derive an attribute mapping table from information such as the attribute values in the two tables to be processed, mapping to each other the attribute names that have the same meaning in the two tables but may be named differently, which facilitates the subsequent integration.

2) Create a target table according to the mapping obtained in step 1, and rearrange the columns of the two tables into a common order. The primary key value, which uniquely identifies a record, determines whether corresponding records in the two tables are comparable: records with equal primary keys are considered comparable. For comparable records, the information in the two tables is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked. Finally, the unmarked records are inserted into the target table one by one, which yields the integrated target table.

If there are more than two tables, two tables are processed at a time, and steps 1 and 2 above are repeated until the final result is obtained.

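The detailed steps for building the mapping table and integrating the information are:

1) Steps for obtaining the mapping table:

1. Obtain the attribute values in the two tables and store them in result1 and result2 respectively, for example

\mathrm{result1} = \mathrm{List}\langle a_{1}, a_{2}, a_{3}, \ldots, a_{m} \rangle, \quad a_{i} = \langle a_{i1}, a_{i2}, a_{i3}, \ldots, a_{in} \rangle, \quad i = 1, 2, 3, \ldots, m.

where m is the number of attribute columns of the first table and n is the number of rows in those columns. That is, each column of the first table is stored in a_1, a_2, a_3, ..., a_m respectively. Similarly:

\mathrm{result2} = \mathrm{List}\langle b_{1}, b_{2}, b_{3}, \ldots, b_{m} \rangle, \quad b_{i} = \langle b_{i1}, b_{i2}, b_{i3}, \ldots, b_{in} \rangle, \quad i = 1, 2, 3, \ldots, m.

2. Segment a_1, a_2, a_3, ..., a_m and b_1, b_2, b_3, ..., b_m into words with the Chinese Academy of Sciences word segmentation tool imdict-chinese-analyzer, and store the results in result1SegmentFilter and result2SegmentFilter respectively:

\mathrm{result1SegmentFilter} = \mathrm{List}\langle a'_{1}, a'_{2}, a'_{3}, \ldots, a'_{m} \rangle, \quad a'_{i} = \langle a'_{i1}, a'_{i2}, \ldots, a'_{ik} \rangle

\mathrm{result2SegmentFilter} = \mathrm{List}\langle b'_{1}, b'_{2}, b'_{3}, \ldots, b'_{m} \rangle, \quad b'_{i} = \langle b'_{i1}, b'_{i2}, \ldots, b'_{ik} \rangle

3. Convert each of a'_1, a'_2, a'_3, ..., a'_m and each of b'_1, b'_2, b'_3, ..., b'_m to a set, that is, remove repeated values, and store the results in result1Set and result2Set:

\mathrm{result1Set} = \mathrm{List}\langle a''_{1}, a''_{2}, a''_{3}, \ldots, a''_{m} \rangle, \quad a''_{i} = \langle a''_{i1}, a''_{i2}, \ldots, a''_{iL_{i}} \rangle

where L_i is the number of words in a''_i, and

\mathrm{result2Set} = \mathrm{List}\langle b''_{1}, b''_{2}, b''_{3}, \ldots, b''_{m} \rangle, \quad b''_{i} = \langle b''_{i1}, b''_{i2}, \ldots, b''_{iL'_{i}} \rangle

where L'_i is the number of words in b''_i.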

4. Compute the similarity of every pair of elements a''_i and b''_j from result1Set and result2Set:

A) If the numbers of words in a''_i and b''_j differ little, compute the similarity of a''_i and b''_j directly. In the similarity formula (not reproduced in this text), the function same counts the words that a''_i and b''_j have in common. The result is stored in M(i, j). For each i, find the j that maximizes M(i, j); this j is the index of the column in the second table that most likely corresponds to column i of the first table. If M(i, j) exceeds a given threshold, columns i and j are considered to correspond, and the pair is written to the mapping table.

B) If the numbers of words in a''_i and b''_j differ greatly, preprocess whichever of the two has more words: count the word frequencies, sort the words by frequency from high to low, and truncate at a suitable position to obtain new a''_i and b''_j. Then go to step A.
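The mapping step can be sketched as below. The similarity formula is not reproduced in this text, so the sketch assumes a Jaccard-style measure (shared words divided by distinct words in either column); that choice, the whitespace tokenizer standing in for imdict-chinese-analyzer, the threshold value, and all function names are assumptions. The frequency-based truncation of step B is omitted.

```python
def column_word_set(values):
    """Split every cell of a column into words and drop duplicates (steps 2 and 3)."""
    words = set()
    for value in values:
        words.update(str(value).lower().split())
    return words


def similarity(words_a, words_b):
    """Assumed measure: shared words divided by distinct words in either column."""
    union = words_a | words_b
    if not union:
        return 0.0
    return len(words_a & words_b) / len(union)


def map_attributes(table1, table2, threshold=0.3):
    """For each column of table1, pick the most similar column of table2."""
    sets1 = {name: column_word_set(col) for name, col in table1.items()}
    sets2 = {name: column_word_set(col) for name, col in table2.items()}
    mapping = {}
    for name1, words1 in sets1.items():
        best_name, best_score = None, 0.0
        for name2, words2 in sets2.items():
            score = similarity(words1, words2)  # plays the role of M(i, j)
            if score > best_score:
                best_name, best_score = name2, score
        # Keep the pair only if the best score clears the threshold.
        if best_name is not None and best_score > threshold:
            mapping[name1] = best_name
    return mapping


# Illustrative tables: column name -> list of cell values.
table1 = {
    "model": ["Phone A", "Phone B"],
    "screen": ["3.2 inch", "3.5 inch"],
}
table2 = {
    "display size": ["3.2 inch", "3.5 inch", "4.0 inch"],
    "product name": ["Phone A", "Phone B", "Phone C"],
}
mapping = map_attributes(table1, table2)
```

The columns are matched by the values they contain rather than by their names, which is why "screen" and "display size" can be paired even though the names share no words.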

2) Steps for integrating the information in the two tables:

1. Using the primary key of the two tables (the attribute value that uniquely identifies a record), integrate the records whose primary keys are identical, for example by removing redundancy, completing missing information, and resolving conflicts, and mark the processed records in each of the two tables. Insert the processed records into the target table.

2. After the traversal of the key values of the first table is complete, find the unmarked records in the two tables and insert them into the target table. This completes the integration of the two tables. To integrate more than two tables, simply treat the integrated table obtained by the above method as an ordinary table to be integrated.
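The two integration steps can be sketched as follows. The record layout (lists of dictionaries), the function name `integrate`, and the conflict policy (the first table wins, and empty values are filled from the second table) are assumptions, because the patent does not state how contradictory values are adjudicated. The sample values are illustrative only.

```python
def integrate(records1, records2, mapping, key):
    """Merge two lists of dict records that share a primary key.

    `mapping` renames attributes of records2 to the names used in records1.
    """
    renamed2 = [{mapping.get(k, k): v for k, v in r.items()} for r in records2]
    by_key2 = {r[key]: r for r in renamed2}
    target, marked = [], set()
    for r1 in records1:
        merged = dict(r1)
        r2 = by_key2.get(r1[key])
        if r2 is not None:
            marked.add(r1[key])  # mark the matching record of the second table
            for attr, value in r2.items():
                # Fill in missing attributes; keep the first source on conflict.
                if merged.get(attr) in (None, ""):
                    merged[attr] = value
        target.append(merged)
    # Append the records of the second table that were never matched.
    target.extend(r for r in renamed2 if r[key] not in marked)
    return target


records1 = [
    {"model": "Phone A", "screen": "3.2 inch", "cpu": ""},
    {"model": "Phone B", "screen": "3.5 inch", "cpu": "1 GHz"},
]
records2 = [
    {"product name": "Phone A", "display size": "3.2 inch", "cpu": "800 MHz"},
    {"product name": "Phone C", "display size": "4.0 inch", "cpu": "1 GHz"},
]
result = integrate(
    records1,
    records2,
    mapping={"product name": "model", "display size": "screen"},
    key="model",
)
```

To integrate a third source, the returned list can be passed back in as `records1`, which mirrors the repeated pairwise processing described above.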

Fig. 4 is a schematic diagram of the implementation of the question processing module. With reference to Fig. 4, this module converts the natural language question entered by the user into a structured SQL statement. In this module the conversion is divided into three steps: text preprocessing, question classification, and named entity recognition. The text preprocessing step mainly performs word segmentation and part-of-speech tagging of the question. The question classification and named entity recognition steps are described in detail here.

Fig. 5 is a schematic diagram of the implementation of question classification in the invention. With reference to Fig. 5, the invention classifies natural language questions with the naive Bayes algorithm and selects the final class according to the maximum likelihood criterion. Let the set of classes be C = {C_1, C_2, ..., C_n}, and let the result of segmenting the input question be X = {x_1, x_2, ..., x_m}, where each x_j is a word in the question. The probability that the question belongs to each class is computed from the training data with the following formula:

P(C_{i} \mid X) = \frac{P(x_{1} \mid C_{i}) \times P(x_{2} \mid C_{i}) \times \cdots \times P(x_{m} \mid C_{i})}{P(X)}, \quad 1 \le i \le n

Here P(X) is the same for every class for a given natural language question, so only P(x_1|C_i) × P(x_2|C_i) × ... × P(x_m|C_i) needs to be computed, and the class with the highest probability is selected as the final class.
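A minimal sketch of this classification rule is given below. It follows the product of word likelihoods stated above and works in log space to avoid underflow. The add-one smoothing, the class labels, the tiny training set, and the function names are assumptions added for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """Count word occurrences per class from (tokens, label) pairs."""
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in examples:
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return word_counts, vocabulary


def classify(tokens, word_counts, vocabulary):
    """Return the class that maximises the product of P(x_j | C_i)."""
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        total = sum(counts.values())
        score = 0.0
        for token in tokens:
            # Add-one smoothing (an assumption) keeps unseen words from
            # driving the whole product to zero.
            score += math.log((counts[token] + 1) / (total + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical pre-segmented training questions and class labels.
examples = [
    (["what", "is", "the", "screen", "size", "of", "phone"], "attribute"),
    (["what", "is", "the", "price", "of", "phone"], "attribute"),
    (["which", "is", "better", "phone", "or", "phone"], "comparison"),
    (["which", "phone", "is", "better"], "comparison"),
]
word_counts, vocabulary = train(examples)
label = classify(["which", "one", "is", "better"], word_counts, vocabulary)
```

Because P(X) is shared by all classes, the sketch never computes it; comparing the summed log likelihoods is enough to pick the class.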

Fig. 6 is a schematic diagram of the implementation of named entity recognition in the invention. With reference to Fig. 6, the invention uses a CRF model with a skip-chain structure to recognize the named entities in a question. This model is the key to the conversion of natural language questions in the invention, so the structure, principle, and advantages of the skip-chain CRF model are described in detail below.

By observing and analyzing a large number of product-related questions, we found that many questions contain two or more named entities at once, for example "Nokia5230 and Nokia N8, which is better?" or "Nokia 5230 and Nokia 5233, which is better?". In such questions the entity names are usually joined by conjunctions such as "and" and "or". Consequently, if the word before a conjunction is judged to be an entity name, the word after the conjunction is very likely to be an entity name as well. This phenomenon has been mentioned in earlier work, but no good solution has been proposed. The invention constructs a CRF model with skip-chains that links the words before and after a conjunction, so that this information is taken into account during the decision, which helps improve the accuracy of named entity recognition. In Fig. 6, the label T1 denotes the front part of an entity word, T2 denotes the rear part of an entity word, and O denotes other words.

Skip-chain CRF is a special kind of CRF model. A CRF is an undirected graphical model that, given a feature set, models the conditional probability distribution of a label sequence. Taking the most basic Linear-CRF as an example, the conditional probability of a label sequence given an observation sequence can be formally written as:

P(Y \mid X) = \frac{1}{Z(X)} \prod_{i=1}^{I} \psi_{i}(y_{i}, y_{i-1}, X)

where ψ_i is a potential function in the sense of undirected graphical models, and Z(X) = \sum_{Y} \prod_{i=1}^{I} \psi_{i}(y_{i}, y_{i-1}, X) is the normalization factor over all possible label sequences of length I. The potential function ψ_i can be decomposed into the following form, where the f_k are the defined feature functions.

\psi_{i}(y_{i}, y_{i-1}, X) = \exp\left\{ \sum_{k} \lambda_{k} f_{k}(y_{i}, y_{i-1}, X, i) \right\}

The corresponding graphical model structure is shown in Fig. 7. Taking the named entity recognition task as an example, the preprocessed text is input and the corresponding Linear-CRF model is built. Linear-CRF models the conditional probability of the label sequence directly. Unlike directed graphical models such as the HMM (hidden Markov model), it can introduce rich features without making independence assumptions between them. It can also be regarded as a globally normalized MEMM (maximum entropy Markov model), which avoids the label bias problem of the MEMM. Linear-CRF therefore achieves good results on sequence labelling problems such as named entity recognition.

The skip-chain CRF is a CRF model that improves on Linear-CRF. As shown in the skip-chain CRF graphical model in Fig. 6, in addition to the linear-chain edges between adjacent nodes of the Linear-CRF, its structure introduces skip-chain edges between the two words before and after a conjunction, thereby adding to Linear-CRF the dependency between the labels of the words on either side of the conjunction.
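The skip-chain edges can be located with a single pass over the tokenized question. The sketch below assumes that each entity name has already been segmented into a single token and uses the English words "and" and "or" as stand-ins for the conjunctions; the function name and the conjunction list are illustrative assumptions.

```python
# Assumed stand-ins for the conjunctions that join entity names.
CONJUNCTIONS = {"and", "or"}


def skip_edges(tokens):
    """Return the index pairs (j, j+2) of the words on either side of a conjunction."""
    edges = []
    # A conjunction at position c links positions c-1 and c+1, so it
    # cannot be the first or the last token.
    for c in range(1, len(tokens) - 1):
        if tokens[c].lower() in CONJUNCTIONS:
            edges.append((c - 1, c + 1))
    return edges
```

For the tokens ["Nokia5230", "and", "NokiaN8", "which", "better"], this returns [(0, 2)], that is, one skip-chain edge connecting the two product names.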

The formal description of the skip-chain CRF is as follows:

P(Y | X) = \frac{1}{Z(X)} \prod_{i=1}^{I} Ψ_{i}(y_{i}, y_{i-1}, X) \prod_{(j, j+2) \in S} φ_{j, j+2}(y_{j}, y_{j+2}, X)

where Ψ_i is the potential function defined on adjacent label nodes, φ_{j,j+2} is the potential function defined on a skip-chain, and S = {(j, j+2)} is the set of all skip-chains. They are defined as follows:

Ψ_{i}(y_{i}, y_{i-1}, X) = \exp\{ \sum_{k} λ_{k} f_{k}(y_{i}, y_{i-1}, X, i) \}

φ_{j, j+2}(y_{j}, y_{j+2}, X) = \exp\{ \sum_{l} η_{l} f_{l}(y_{j}, y_{j+2}, X, j, j+2) \}

where f_k(y_i, y_{i-1}, X, i) is the feature function defined on the linear-chain, and f_l(y_j, y_{j+2}, X, j, j+2) is the feature function defined on the skip-chain.
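The effect of the skip-chain potential can be illustrated with a small brute-force computation. This is a sketch under stated assumptions, not the model of the invention: the label set is simplified to ENT (entity) and O, there is a single hypothetical linear-chain feature with weight 1.0 and a single hypothetical skip-chain feature with weight 2.0, and the normalization is done by enumeration rather than by an inference algorithm.

```python
import itertools
import math

LABELS = ["ENT", "O"]  # simplified label set (assumption)


def linear_potential(y, y_prev, X, i):
    """Psi_i with one hypothetical feature: a token containing a digit is an entity."""
    f = 1.0 if any(ch.isdigit() for ch in X[i]) and y == "ENT" else 0.0
    return math.exp(1.0 * f)  # lambda = 1.0 (assumed)


def skip_potential(y_j, y_j2, X, j, j2):
    """phi_{j,j+2} with one hypothetical feature: both linked words are entities."""
    f = 1.0 if y_j == "ENT" and y_j2 == "ENT" else 0.0
    return math.exp(2.0 * f)  # eta = 2.0 (assumed)


def unnormalized(Y, X, S):
    """Product of linear-chain potentials times skip-chain potentials."""
    score = 1.0
    for i in range(len(X)):
        y_prev = Y[i - 1] if i > 0 else "START"
        score *= linear_potential(Y[i], y_prev, X, i)
    for j, j2 in S:
        score *= skip_potential(Y[j], Y[j2], X, j, j2)
    return score


def marginal(X, S, position, label):
    """P(y_position = label | X), computed by enumerating all label sequences."""
    numerator = 0.0
    Z = 0.0
    for Y in itertools.product(LABELS, repeat=len(X)):
        s = unnormalized(Y, X, S)
        Z += s
        if Y[position] == label:
            numerator += s
    return numerator / Z
```

For the tokens ["Nokia5230", "and", "Lumia", "which", "better"], the word "Lumia" carries no evidence of its own under the assumed linear feature, so its marginal probability of being an entity is 0.5 when S is empty. Adding the skip-chain edge (0, 2) raises that probability to roughly 0.85, because the word before the conjunction is likely an entity.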

During model training, the present invention uses the L-BFGS algorithm to train the unrolled skip-chain CRF model and learn the model parameters λ_k and η_l.

The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention in detail. It should be understood that the foregoing are merely specific embodiments of the present invention and do not limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A Web question and answer searching system based on product information structure, characterized in that it comprises a user interface, a product information crawling module, an information extraction module, an inverted index building module, a database interface, an information integration module, a question processing module, and a database, wherein:

the user interface handles all communication between the Web question answering system and the user, including obtaining the product-related natural language question entered by the user and passing it to the question processing module, and returning the corresponding search results and related web pages to the user;

the product information crawling module crawls web pages at fixed time intervals, stores the crawled pages, and passes them to the information extraction module for processing;

the information extraction module processes the unstructured web page information in the pages crawled by the product information crawling module, converts this unstructured information into structured information, connects to the structured product information database through the database interface, and stores the processed structured information in the database;

the inverted index building module extracts key content from the web pages crawled by the product information crawling module and builds an inverted index over these pages;

the database interface provides a unified interface for accessing and updating the structured product data, together with access permission control;

the information integration module integrates the structured information from multiple data sources output by the information extraction module, connects to the database through the database interface, and saves the integrated structured data in the database;

the question processing module converts the natural language question entered by the user into a structured statement; this module connects to the user through the user interface to obtain the natural language question, connects to the database through the database interface, queries the database with the converted statement, and returns the query result to the user through the user interface;

the information integration module first obtains an attribute mapping table from the attribute value information in the two tables to be processed, mapping attribute names in the two tables that have the same meaning but possibly different names, to facilitate the subsequent integration; it then creates a target table according to the resulting mapping and rearranges the column names of the two tables into the same order; using the primary key value that uniquely identifies a record, it determines whether corresponding records in the two tables are comparable, the records being considered comparable if the values are equal; if they are comparable, the information in the two tables is merged or de-duplicated, the result is inserted into the target table, and the corresponding records in the original tables are marked; finally, the unmarked records are inserted into the target table one by one, yielding an integrated target table; if there are more than two tables, two tables are processed at a time and the above method is repeated to obtain the final result.
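The integration steps recited above (apply the attribute mapping, match records on the primary key, merge, mark, then insert the unmarked records) can be sketched as follows. This is a minimal illustration only: the representation of a table as a list of dictionaries, the function name, and the rule that a non-missing value from the first table wins over the second are assumptions; the claim does not fix how contradictory values are resolved.

```python
def integrate(table_a, table_b, attr_map, key):
    """Merge two record lists into one target table.

    attr_map maps attribute names of table_b to the equivalent names in
    table_a; key is the primary-key attribute (in table_a naming).
    """
    # Step 1: rename table_b attributes so both tables share one schema.
    renamed_b = [{attr_map.get(k, k): v for k, v in rec.items()}
                 for rec in table_b]
    index_b = {rec[key]: n for n, rec in enumerate(renamed_b)}

    target = []
    marked_b = set()
    for rec_a in table_a:
        n = index_b.get(rec_a[key])
        if n is None:
            # No comparable record in table_b: copy the record as it is.
            target.append(dict(rec_a))
            continue
        # Step 2: comparable records (equal primary key) are merged.
        # Assumed rule: non-missing values from table_a take precedence,
        # and gaps are filled from table_b.
        merged = dict(renamed_b[n])
        for k, v in rec_a.items():
            if v is not None:
                merged[k] = v
            else:
                merged.setdefault(k, None)
        target.append(merged)
        marked_b.add(n)

    # Step 3: insert the unmarked records of table_b one by one.
    for n, rec_b in enumerate(renamed_b):
        if n not in marked_b:
            target.append(dict(rec_b))
    return target
```

To integrate more than two tables, the result of one call can be passed as `table_a` to the next call, which corresponds to processing two tables at a time.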

2. The Web question and answer searching system based on product information structure according to claim 1, characterized in that the question processing module converts the natural language question in two steps: it first classifies the natural language question using a trained Naive Bayes classifier, and then uses a skip-chain CRF model to recognize and extract the named entities in the natural language question.
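The first of the two steps, question classification, can be sketched with a small multinomial Naive Bayes classifier. The class names, the training questions, and the use of Laplace (add-one) smoothing are assumptions made for this illustration; the claim only states that a trained Naive Bayes classifier is used.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_logp = None, -math.inf
        for label, n in self.class_counts.items():
            # log prior + sum of smoothed log likelihoods
            logp = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for t in tokens:
                logp += math.log((self.word_counts[label][t] + 1) / denom)
            if logp > best_logp:
                best_label, best_logp = label, logp
        return best_label


# Hypothetical training questions and question types.
TRAIN_DOCS = [
    ["which", "is", "better", "A", "or", "B"],
    ["compare", "A", "and", "B"],
    ["what", "is", "the", "price", "of", "A"],
    ["how", "big", "is", "the", "screen", "of", "B"],
]
TRAIN_LABELS = ["comparison", "comparison", "attribute", "attribute"]

classifier = NaiveBayes().fit(TRAIN_DOCS, TRAIN_LABELS)
```

With this toy training set, `classifier.predict(["which", "is", "better", "C", "and", "D"])` yields "comparison", and the predicted question type could then select the rule used to build the structured statement.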

3. The Web question and answer searching system based on product information structure according to claim 2, characterized in that the named entities are mobile phone names and mobile phone attributes.

4. The Web question and answer searching system based on product information structure according to claim 2, characterized in that the skip-chain CRF model is developed on the basis of the linear conditional random field model and is one kind of conditional random field model.

5. The Web question and answer searching system based on product information structure according to claim 2, characterized in that, in the named entity recognition method, the role that conjunctions such as "and" and "or" play in the sentence is ignored, and the skip-chain CRF model establishes a link between the two words before and after the conjunction, which helps improve the final precision; for the recognition model used to extract named entities from query questions, the skip-chain CRF model is trained on a training set to obtain the decision criteria for recognizing named entities and product information, and the question is then converted into keywords and product attributes that are meaningful for retrieval.

6. The Web question and answer searching system based on product information structure according to claim 1, characterized in that the product information crawling module crawls, at fixed time intervals, the web pages that present detailed digital product information on large digital product websites such as pconline and "Bubble", stores the crawled pages, and passes them to the information extraction module for processing.

7. The Web question and answer searching system based on product information structure according to claim 1, characterized in that the question processing module converts the natural language question entered by the user into a structured SQL statement; this module connects to the user through the user interface to obtain the natural language question, connects to the structured product information database through the database interface, queries the database with the converted SQL statement, and returns the query result of the SQL statement to the user through the user interface.
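The final conversion from a recognized entity and attribute to an SQL query can be sketched with Python's built-in sqlite3 module. The table name `phones`, its columns, the attribute whitelist, and the demo data are hypothetical; the claim does not specify a schema or a database engine. The entity value is passed as a bound parameter, and the column name is taken from a fixed whitelist, so user text is never concatenated into the statement.

```python
import sqlite3

# Hypothetical mapping from recognized attribute words to column names.
ATTRIBUTE_COLUMNS = {"price": "price", "screen size": "screen_size"}


def build_query(entity, attribute):
    """Return (sql, params) for a question about one attribute of one product."""
    column = ATTRIBUTE_COLUMNS[attribute]  # raises KeyError for unknown attributes
    sql = f"SELECT name, {column} FROM phones WHERE name = ?"
    return sql, (entity,)


def answer(conn, entity, attribute):
    """Run the generated query and return the matching rows."""
    sql, params = build_query(entity, attribute)
    return conn.execute(sql, params).fetchall()


def make_demo_db():
    """Create an in-memory database with illustrative (made-up) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE phones (name TEXT PRIMARY KEY, price INTEGER, screen_size TEXT)"
    )
    conn.executemany(
        "INSERT INTO phones VALUES (?, ?, ?)",
        [("Nokia 5230", 999, "3.2"), ("Nokia N8", 1999, "3.5")],
    )
    return conn
```

For example, `answer(make_demo_db(), "Nokia 5230", "price")` runs `SELECT name, price FROM phones WHERE name = ?` with the entity name bound as the parameter.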

8. The Web question and answer searching system based on product information structure according to claim 1, characterized in that, for the analysis system for unstructured and semi-structured product information, the information from multiple sources about a product is integrated to ensure that the information is true and complete; at the same time, a classification algorithm and a named entity recognition algorithm are used to convert the natural language question into a structured database query statement; for the fine-grained sentiment analysis system for product review information, an instance-similarity-based algorithm is used to integrate the information about the same product from different sources.

9. The Web question and answer searching system based on product information structure according to claim 8, characterized in that the instance-similarity-based algorithm integrates information in two steps, mapping and merging: in the mapping step, the instance-similarity-based algorithm computes the similarity between corresponding elements of the two tables, and in the merging step the two tables are merged according to the result of the previous step; for the fine-grained sentiment analysis system for product review information, the question is first classified, a recognition model is then built to extract the named entities in the question, and finally the natural language question is converted into an SQL statement using the corresponding rules according to the results of the first two steps.
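The mapping step can be illustrated by comparing the values (instances) stored under each pair of attribute names. The sketch below uses Jaccard similarity of the value sets with a fixed threshold; both the similarity measure and the threshold of 0.5 are assumptions chosen for illustration, since the claim does not name a specific measure.

```python
def jaccard(values_a, values_b):
    """Jaccard similarity of two collections of attribute values."""
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_attributes(table_a, table_b, threshold=0.5):
    """Map each column of table_b to the most similar column of table_a.

    Tables are given as {column name: list of values}. Columns whose best
    similarity falls below the threshold are left unmapped.
    """
    mapping = {}
    for col_b, vals_b in table_b.items():
        best_col, best_sim = None, threshold
        for col_a, vals_a in table_a.items():
            sim = jaccard(vals_a, vals_b)
            if sim >= best_sim:
                best_col, best_sim = col_a, sim
        if best_col is not None:
            mapping[col_b] = best_col
    return mapping
```

The resulting dictionary plays the role of the attribute mapping table: it records which differently named columns hold the same kind of value, and the merging step can then align the two tables on it.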