WO2015058604A1 - Apparatus and method for obtaining the degree of association of question and answer pairs and optimizing search ranking - Google Patents

Info

Publication number
WO2015058604A1
Authority
WO
WIPO (PCT)
Prior art keywords
question
answer
word
analyzed
category
Prior art date
Application number
PCT/CN2014/086838
Other languages
English (en)
French (fr)
Inventor
孙林
陈培军
秦吉胜
Original Assignee
北京奇虎科技有限公司
奇智软件(北京)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201310495856.6A (CN103577557B)
Priority claimed from CN201310495641.4A (CN103577556B)
Priority claimed from CN201310495881.4A (CN103577558B)
Application filed by 北京奇虎科技有限公司, 奇智软件(北京)有限公司
Publication of WO2015058604A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis

Definitions

  • The present invention relates to the field of network data communication technologies, and in particular to an apparatus and method for obtaining the degree of association of a question and answer pair, an apparatus and method for optimizing the search ranking of question and answer pairs, and an apparatus and method for determining the crawl frequency of a network resource point.
  • A Q&A community is a web application built on user-generated content.
  • In its basic form, users ask questions according to their own needs and other users give answers. This form provides a new channel for users to access information on the web.
  • The quality of information in a Q&A community varies widely, and such communities contain a large number of low-quality Q&A pairs. This makes it inconvenient for users to find information and lowers the quality of the Q&A community.
  • Prior-art methods for judging question and answer quality rely mostly on non-text features of the question and answer pair, which limits their versatility.
  • Prior-art methods for setting the crawl frequency of a network resource point rely mostly on link analysis of websites. When such methods are used for question and answer search, they cannot analyze question and answer pairs semantically and cannot adjust the crawl frequency (or crawl granularity) according to the quality of the network resource point, which affects the accuracy and versatility of search results.
  • The present invention has been made in view of the above problems, in order to provide an apparatus and method for obtaining the degree of association of a question and answer pair, an apparatus and method for optimizing the search ranking of question and answer pairs, and an apparatus and method for determining the crawl frequency of a network resource point, which overcome the above problems or at least partially solve them.
  • An apparatus for obtaining the degree of association of a question and answer pair, comprising: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a word extraction unit adapted to perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed; and an association degree calculation unit adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed from the selected question and answer knowledge record.
  • An apparatus for optimizing the search ranking of question and answer pairs, comprising: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a search unit adapted to receive a user's search request and to obtain, according to that request, a plurality of question and answer pairs to be analyzed that match it; a calculation unit adapted to obtain the degree of association of each question and answer pair to be analyzed from the question and answer knowledge base; and a search ranking unit adapted to optimize the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
  • An apparatus for determining the crawl frequency of a network resource point, comprising: a question and answer knowledge base adapted to store a plurality of question and answer knowledge records; a resource analysis unit adapted to crawl a plurality of question and answer pairs to be analyzed from a network resource point; a calculation unit adapted to obtain the degree of association of each question and answer pair to be analyzed from the question and answer knowledge base; and a crawl frequency determining unit adapted to determine the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
  • A method for obtaining the degree of association of a question and answer pair, comprising the steps of: performing a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed; and selecting, according to the question word to be analyzed and the answer word to be analyzed, at least one question and answer knowledge record from a question and answer knowledge base containing a plurality of question and answer knowledge records, and calculating the degree of association of the question and answer pair to be analyzed from the selected question and answer knowledge record.
  • A method for optimizing the search ranking of question and answer pairs, comprising the steps of: receiving a user's search request and obtaining, according to that request, a plurality of question and answer pairs to be analyzed that match it; obtaining the degree of association of each question and answer pair to be analyzed from a question and answer knowledge base containing a plurality of question and answer knowledge records; and optimizing the search ranking of the question and answer pairs to be analyzed according to their degrees of association.
  • A method for determining the crawl frequency of a network resource point, comprising the steps of: crawling a plurality of question and answer pairs to be analyzed from a network resource point; obtaining the degree of association of each question and answer pair to be analyzed from a question and answer knowledge base containing a plurality of question and answer knowledge records; and determining the crawl frequency of the network resource point according to the degrees of association of the question and answer pairs to be analyzed.
  • Multiple question and answer pairs are extracted from webpages containing question and answer pairs, and a question and answer knowledge base containing multiple question and answer knowledge records is constructed from the extracted pairs. A word extraction operation is performed on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed. At least one question and answer knowledge record is then selected from the knowledge base according to those words, and the degree of association of the pair to be analyzed is calculated from the selected records. This evaluates the quality of the question and answer pair at the semantic level and solves the problem that the prior art evaluates only at the lexical level.
  • Further, obtaining the degree of association of each question and answer pair to be analyzed from the question and answer knowledge base, and optimizing the search ranking of those pairs according to their degrees of association, evaluates the quality of the pairs at the semantic level and solves the problem of poor ranking in the prior art, which relies on the webpage containing the question and answer.
  • Further, crawling a plurality of question and answer pairs to be analyzed from a network resource point, obtaining the degree of association of each pair from the question and answer knowledge base, and determining the crawl frequency of the network resource point according to those degrees of association allows the crawl frequency to be set by evaluating the quality of the network resource point, and solves the problem that the prior art cannot select the crawl frequency according to the quality of the network resource point.
  • The solution of the present application is easy to implement and has high versatility.
  • FIG. 1 shows a flow chart of a method of obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention;
  • FIG. 2 shows a detailed flow chart for building the question and answer knowledge base;
  • FIG. 3 shows a schematic diagram of the explanatory model of the question and answer knowledge base obtained using the steps shown in FIG. 2;
  • FIG. 4 shows a detailed flow chart of step S200 in FIG. 1;
  • FIG. 5 shows a block diagram of an apparatus for obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention;
  • FIG. 6 shows a flow chart of a method for optimizing the search ranking of question and answer pairs, in accordance with one embodiment of the present invention;
  • FIG. 7 shows a block diagram of an apparatus for optimizing the search ranking of question and answer pairs, in accordance with one embodiment of the present invention;
  • FIG. 8 shows a flow chart of a method of determining the crawl frequency of a network resource point, in accordance with one embodiment of the present invention;
  • FIG. 9 shows a block diagram of an apparatus for determining the crawl frequency of a network resource point, in accordance with one embodiment of the present invention;
  • FIG. 10 shows a block diagram of an application server for performing the method according to the invention; and
  • FIG. 11 shows a storage unit for holding or carrying program code implementing the method according to the invention.
  • The existing method of obtaining the degree of association of question and answer pairs uses text features and non-text features to describe the question and the answer of each pair.
  • the existing method for obtaining a search ranking of a question and answer pair is to use a text feature and a non-text feature to describe the question and answer pair to rank the question and answer pair, or to answer questions based on the question and answer.
  • Text features mainly include textual visual features (such as punctuation density, average word length, text entropy, etc.) and text content features (such as text content word scale, question word density, related word coverage, etc.), and extract Chinese automatic errors widely used.
  • Non-text features include user weight indicators, the answer status of the question, answer time, user interaction features, and so on.
  • A question quality prediction model and an answer quality prediction model are learned separately on a training set, and the outputs of the two models are used to evaluate the quality of the question and answer.
  • The related-word coverage feature used to describe how well the question and the answer match operates only at the lexical level and does not consider the semantic match between question and answer.
  • the semantic matching of questions and answers is precisely the core of question and answer.
  • For example, suppose the question is “Where is the capital of China?”, answer 1 is “Beijing”, and answer 2 is “China's capital is Shanghai”. After word segmentation and stop-word removal, the question becomes “where is the capital of China”, the segmentation result of answer 1 is “Beijing”, and that of answer 2 is “China Capital Shanghai”.
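  • The weakness of a purely lexical feature can be illustrated with a small sketch. The coverage definition below (the fraction of question words that reappear in the answer) and the English token lists are assumptions made for illustration; the source does not give the exact feature definition. Under this definition the incorrect answer 2 scores higher than the correct answer 1.

```python
def related_word_coverage(question_words, answer_words):
    """Fraction of question words that also appear in the answer (purely lexical)."""
    question = set(question_words)
    if not question:
        return 0.0
    return len(question & set(answer_words)) / len(question)

# Hypothetical token lists standing in for the segmentation results above.
question = ["China", "capital", "where"]
answer_1 = ["Beijing"]                       # correct answer
answer_2 = ["China", "capital", "Shanghai"]  # incorrect answer

coverage_1 = related_word_coverage(question, answer_1)
coverage_2 = related_word_coverage(question, answer_2)

# The lexical feature prefers the wrong answer because it repeats question words.
print(coverage_1, coverage_2)
```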
  • FIG. 1 shows a flow chart of a method of obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention.
  • The method of obtaining the degree of association of a question and answer pair comprises the following steps S100 and S200:
  • S100: Perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
  • The word extraction operation specifically includes word segmentation, stop-word removal, word merging (word join), and extraction of entity words (such as nouns and verbs) on the question content and the answer content of the question and answer pair to be analyzed. At least one question word to be analyzed is thereby obtained from the question content, and at least one answer word to be analyzed from the answer content.
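  • A minimal sketch of the word extraction operation, assuming English text, a tiny hypothetical stop-word list, and a caller-supplied list of phrases to merge. Entity-word extraction (part-of-speech filtering) is omitted, and a real system for Chinese text would use a proper word segmenter instead of a regular expression.

```python
import re

# Hypothetical stop-word list; a real system would use a full list for its language.
STOP_WORDS = {"the", "is", "of", "a", "an", "to", "for", "what", "and"}

def extract_words(text, merge_phrases=()):
    """Segment text, merge listed multi-word phrases, and drop stop words."""
    lowered = text.lower()
    # Word merging: join each listed phrase into a single token before splitting.
    for phrase in merge_phrases:
        lowered = lowered.replace(phrase, phrase.replace(" ", "_"))
    tokens = re.findall(r"[a-z0-9_]+", lowered)
    # Stop-word removal, keeping first-seen order and dropping duplicates.
    seen, words = set(), []
    for token in tokens:
        if token not in STOP_WORDS and token not in seen:
            seen.add(token)
            words.append(token)
    return words

question_words = extract_words("What is the treatment for a cold?")
answer_words = extract_words("Take cold granules and rest.",
                             merge_phrases=["cold granules"])
```

    Here `question_words` plays the role of the question words to be analyzed and `answer_words` the answer words to be analyzed.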
  • S200: According to the question word to be analyzed and the answer word to be analyzed, select at least one question and answer knowledge record from a question and answer knowledge base containing a plurality of question and answer knowledge records, and calculate the degree of association of the question and answer pair to be analyzed from the selected question and answer knowledge record.
  • Using the question and answer knowledge base, the question content and the answer content of the question and answer pair to be analyzed can be analyzed at the semantic level to obtain the pair's degree of association; the evaluation is more effective and easy to implement.
  • The question and answer knowledge base containing a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs in advance from webpages containing question and answer pairs and constructing the knowledge base from the extracted pairs.
  • The category corresponding to each question and answer pair is also crawled.
  • Each question and answer knowledge record is constructed from a question and answer pair and the category corresponding to that pair.
  • Each question and answer knowledge record in the resulting knowledge base corresponds to a category and includes a question word (QW), an answer word (AW), and the semantic relevance between the question word and the answer word.
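  • One possible in-memory representation of a question and answer knowledge record is sketched below. The class name, field names, index helper, and sample values are all hypothetical; the source only states that a record corresponds to a category and holds a question word, an answer word, and their semantic relevance.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass(frozen=True)
class KnowledgeRecord:
    category: str        # e.g. "medical health"
    question_word: str   # QW
    answer_word: str     # AW
    relevance: float     # semantic relevance between QW and AW in this category

def index_by_question_word(records):
    """Group records by question word so lookups avoid scanning the whole base."""
    index = defaultdict(list)
    for record in records:
        index[record.question_word].append(record)
    return index

# Hypothetical records; the relevance values are made up for illustration.
knowledge_base = [
    KnowledgeRecord("medical health", "cold", "cough", 0.8),
    KnowledgeRecord("medical health", "cold", "treatment", 0.6),
    KnowledgeRecord("geography", "capital", "Beijing", 0.9),
]
index = index_by_question_word(knowledge_base)
```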
  • FIG. 2 shows a detailed flow chart for building the question and answer knowledge base. Specifically, it includes the following steps S310, S320, and S330:
  • S310: Data may be crawled from webpages on the Internet that contain high-quality question and answer pairs, and the question and answer pairs extracted from it, which ensures the quality of the extracted pairs.
  • Webpages containing high-quality question and answer pairs include cQA (community question answering) communities, major professional forums, and the like.
  • Because such webpages include the category information corresponding to each question and answer pair, the category corresponding to a question and answer pair can be crawled together with the pair itself.
  • S320: A word extraction operation is performed on the question content and the answer content of each question and answer pair extracted in step S310. Specifically, this includes word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of the pair.
  • At least one question word is obtained from the question content of each question and answer pair and at least one answer word from its answer content, and the category set $\langle C_1, \ldots, C_k, \ldots, C_p \rangle$ corresponding to the question and answer pair can be obtained.
  • Step S330 in this embodiment may be performed on the massive set of information records obtained by applying the word extraction operation of step S320 to the massive number of question and answer pairs crawled from webpages.
  • Semantic relevance computed from massive information records is more accurate.
  • Calculating the probability that the answer word belongs to the category includes:
  • Calculating the degree of specificity with which each answer word explains the question word in the category includes:
  • Calculating the strength with which the question word is explained by each answer word in the category specifically includes:
  • $\ldots \mid C = C_k) \cdot \mathrm{interpret}(QW_i, AW_j \mid C = C_k)$;
  • $P(C_k)$ represents the probability of occurrence of the category $C_k$;
  • $P(AW_j)$ represents the probability that the answer word is $AW_j$;
  • $P(AW_j \mid C_k)$ represents the probability that the answer word is $AW_j$ within the category $C_k$;
  • $\#(QW_i, AW_j)$ represents the number of times the question word is $QW_i$ and the answer word is $AW_j$;
  • $\#(AW_j)$ represents the number of times the answer word is $AW_j$.
  • a question and answer knowledge record can be obtained to construct a question and answer knowledge base.
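  • The source describes the semantic relevance as the product of three factors: the probability that the answer word belongs to the category, the degree of specificity, and the strength of explanation. The sketch below only illustrates that product. Estimating the first factor with Bayes' rule from $P(AW_j \mid C_k)$, $P(C_k)$, and $P(AW_j)$ is an assumption, the specificity and strength are passed in as plain numbers because their formulas are not reproduced here, and all input values are made up.

```python
def category_probability(p_answer_given_category, p_category, p_answer):
    """Assumed Bayes estimate: P(C_k | AW_j) = P(AW_j | C_k) * P(C_k) / P(AW_j)."""
    if p_answer == 0:
        return 0.0
    return p_answer_given_category * p_category / p_answer

def semantic_relevance(probability, specificity, strength):
    """Product of the three factors named in the text."""
    return probability * specificity * strength

# Hypothetical inputs for one (question word, answer word, category) record.
probability = category_probability(
    p_answer_given_category=0.02, p_category=0.1, p_answer=0.004
)
relevance = semantic_relevance(probability, specificity=0.6, strength=0.5)
```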
  • FIG. 3 shows a schematic diagram of the explanatory model of the question and answer knowledge base obtained using the steps shown in FIG. 2. For each question word $QW_i$, $n$ question and answer knowledge records can be obtained for each category in the category set $\langle C_1, \ldots, C_k, \ldots, C_p \rangle$.
  • If the calculated semantic relevance is 0, the corresponding question and answer knowledge record can be deleted. Further, if the knowledge base holds so many question and answer knowledge records that the overhead of storing them and of calculating the degree of association of the question and answer pairs to be analyzed becomes too large, a threshold can be preset and the records whose semantic relevance is below the threshold deleted to reduce the overhead.
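  • The pruning rule can be sketched as a simple filter. The dictionary layout, the function name, and the threshold value are hypothetical.

```python
def prune_records(records, threshold=0.0):
    """Drop records with zero relevance and records below the preset threshold."""
    return [
        r for r in records
        if r["relevance"] > 0 and r["relevance"] >= threshold
    ]

# Hypothetical records with made-up relevance values.
records = [
    {"answer_word": "cough", "relevance": 0.8},
    {"answer_word": "oral", "relevance": 0.05},
    {"answer_word": "rest", "relevance": 0.0},
]
kept = prune_records(records, threshold=0.1)
```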
  • FIG. 4 shows a detailed flow chart of step S200 in FIG. 1.
  • Step S200 specifically includes the following steps S210, S220, and S230:
  • S210: Select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
  • A question word matches a question word to be analyzed when the two are identical or the question word to be analyzed is a substring of the question word; an answer word matches an answer word to be analyzed when the two are identical or the answer word to be analyzed is a substring of the answer word.
  • A field matching or field search method is used to select from the question and answer knowledge base the subset of question and answer knowledge records related to the question and answer pair to be analyzed.
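  • The matching rule (identical, or the word to be analyzed is a substring of the record's word) can be sketched as follows. The dictionary layout and sample records are hypothetical, and a linear scan stands in for whatever field matching or field search method an implementation would actually use.

```python
def words_match(record_word, analyzed_word):
    """True if the analyzed word equals the record word or is a substring of it."""
    return bool(analyzed_word) and analyzed_word in record_word

def select_records(records, question_words, answer_words):
    """Keep records whose question word and answer word both match."""
    return [
        r for r in records
        if any(words_match(r["question_word"], q) for q in question_words)
        and any(words_match(r["answer_word"], a) for a in answer_words)
    ]

# Hypothetical knowledge records with made-up relevance values.
records = [
    {"category": "medical health", "question_word": "cold",
     "answer_word": "cold granules", "relevance": 0.7},
    {"category": "medical health", "question_word": "cold",
     "answer_word": "antibiotics", "relevance": 0.2},
    {"category": "geography", "question_word": "capital",
     "answer_word": "Beijing", "relevance": 0.9},
]
selected = select_records(records, ["cold"], ["granules", "cough"])
```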
  • From the selected question and answer knowledge records that correspond to the same category, the degree of association of the question and answer pair to be analyzed is obtained for each category. Specifically, the semantic relevance values of the selected records corresponding to the same category are weighted and added, giving the degree of association of the pair for each category.
  • The records selected in step S210 are grouped by category, so that records corresponding to the same category fall in one group. The semantic relevance values of each group are weighted (for example, with a weight of 1 or 100) and added to obtain the degree of association of the question and answer pair to be analyzed for that category. This yields at least one degree of association; in this embodiment, the number of degrees of association equals the number of categories corresponding to the question and answer pair to be analyzed.
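  • Grouping by category and adding the weighted semantic relevance values can be sketched as below. Using a single uniform weight for every record is an assumption based on the "weight of 1 or 100" example; the record layout and values are hypothetical.

```python
from collections import defaultdict

def association_by_category(selected_records, weight=1.0):
    """Sum weight * relevance over the selected records of each category."""
    totals = defaultdict(float)
    for record in selected_records:
        totals[record["category"]] += weight * record["relevance"]
    return dict(totals)

# Hypothetical selected records with made-up relevance values.
selected = [
    {"category": "medical health", "relevance": 0.5},
    {"category": "medical health", "relevance": 0.25},
    {"category": "daily life", "relevance": 0.125},
]
degrees = association_by_category(selected)
```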
  • Figure 5 illustrates a block diagram of an apparatus for obtaining the degree of association of a question and answer pair, in accordance with one embodiment of the present invention.
  • The apparatus includes a question and answer knowledge base 100, a word extraction unit 200, and an association degree calculation unit 300.
  • The question and answer knowledge base 100 is adapted to store a plurality of question and answer knowledge records; in this embodiment it can be constructed by crawling a large number of question and answer pairs from webpages.
  • The word extraction unit 200 is adapted to perform a word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
  • Specifically, the word extraction unit 200 is adapted to perform word segmentation, stop-word removal, word merging (word join), and extraction of entity words (for example, nouns and verbs) on the question content and the answer content of the question and answer pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
  • The association degree calculation unit 300 is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed from the selected question and answer knowledge record.
  • Specifically, the association degree calculation unit 300 is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
  • A question word matches a question word to be analyzed when the two are identical or the question word to be analyzed is a substring of the question word; an answer word matches an answer word to be analyzed when the two are identical or the answer word to be analyzed is a substring of the answer word.
  • From the selected question and answer knowledge records that correspond to the same category, the unit obtains the degree of association of the question and answer pair to be analyzed for each category; more specifically, it weights (for example, with a weight of 1 or 100) and adds the semantic relevance values of the selected records corresponding to the same category. In this embodiment, the number of degrees of association equals the number of categories corresponding to the question and answer pair to be analyzed.
  • The maximum of the degrees of association obtained for the individual categories is then selected as the degree of association of the question and answer pair to be analyzed.
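  • Selecting the maximum per-category value as the overall degree of association can be sketched as follows. The function name and the fallback for an empty input are assumptions; the input values are made up.

```python
def overall_association(degrees_by_category):
    """Return (category, degree) for the category with the largest degree."""
    if not degrees_by_category:
        # Assumed fallback: no matching records means no association.
        return None, 0.0
    best = max(degrees_by_category, key=degrees_by_category.get)
    return best, degrees_by_category[best]

# Hypothetical per-category degrees of association.
category, degree = overall_association(
    {"medical health": 0.75, "daily life": 0.125}
)
```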
  • With the question and answer knowledge base 100, the word extraction unit 200, and the association degree calculation unit 300, at least one question and answer knowledge record is selected from the knowledge base using the question words and answer words to be analyzed, and the degree of association of the question and answer pair to be analyzed is calculated from the selected records. The pair can thus be analyzed at the semantic level; the evaluation is more effective and easier to implement, and the approach has a wider scope of application and greater versatility.
  • The apparatus further includes a question and answer knowledge base construction unit 400, adapted to extract a plurality of question and answer pairs in advance from webpages containing question and answer pairs and to construct, from the extracted pairs, a question and answer knowledge base containing a plurality of question and answer knowledge records.
  • The question and answer knowledge base may be an existing one. Because the amount of information on the network keeps growing and its content changes rapidly, the content of the knowledge base often needs to be updated. Adding the construction unit 400 to build (or update) the knowledge base keeps its content timely and reliable.
  • The question and answer knowledge base construction unit 400 also crawls the category corresponding to each question and answer pair.
  • Data may be crawled from webpages on the Internet that contain high-quality question and answer pairs, and the question and answer pairs extracted from it, which ensures the quality of the extracted pairs; such webpages include cQA communities, major professional forums, and the like. Because these webpages include the category information corresponding to each question and answer pair, the construction unit 400 can crawl the category corresponding to a question and answer pair together with the pair itself.
  • The question and answer knowledge base construction unit 400 is adapted to perform the following operations on each question and answer pair: perform a word extraction operation on the question content and the answer content of the pair to obtain a question word set and an answer word set. Specifically, the construction unit 400 performs word segmentation, stop-word removal, word merging, and entity-word extraction on the question content and the answer content of each extracted question and answer pair to obtain the question words and the answer words. Each question word in the question word set and each answer word in the answer word set then form an information record for each category corresponding to the question and answer pair.
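  • Forming one information record per combination of question word, answer word, and category amounts to a Cartesian product, sketched below. The tuple layout and the sample words are hypothetical.

```python
from itertools import product

def build_information_records(question_words, answer_words, categories):
    """One (question word, answer word, category) record per combination."""
    return list(product(question_words, answer_words, categories))

# Hypothetical word sets and category for a single question and answer pair.
information_records = build_information_records(
    ["cold"], ["cough", "treatment"], ["medical health"]
)
```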
  • For each information record, the question and answer knowledge base construction unit 400 is adapted to calculate the probability that the answer word belongs to the category, the degree of specificity with which the answer word explains the question word in the category, and the strength with which the question word is explained by the answer word in the category, and to multiply the probability, the degree of specificity, and the strength; the product is the semantic relevance of the answer word and the question word. The question word, the answer word, and their semantic relevance form a question and answer knowledge record corresponding to the category.
  • The question and answer knowledge base construction unit 400 is adapted to calculate the probability that the answer word belongs to the category according to the following method:
  • The question and answer knowledge base construction unit 400 is adapted to calculate the degree of specificity with which each answer word explains the question word in the category according to the following method:
  • The question and answer knowledge base construction unit 400 is adapted to calculate the strength with which the question word is explained by each answer word in the category according to the following method:
  • The question and answer knowledge base construction unit 400 is adapted to multiply the above probability, degree of specificity, and strength according to the following method:
  • $\ldots \mid C = C_k) \cdot \mathrm{interpret}(QW_i, AW_j \mid C = C_k)$;
  • $P(C_k)$ represents the probability of occurrence of the category $C_k$;
  • $P(AW_j)$ represents the probability that the answer word is $AW_j$;
  • $P(AW_j \mid C_k)$ represents the probability that the answer word is $AW_j$ within the category $C_k$;
  • $\#(QW_i, AW_j)$ represents the number of times the question word is $QW_i$ and the answer word is $AW_j$;
  • $\#(AW_j)$ represents the number of times the answer word is $AW_j$.
  • The question words to be analyzed and the answer words to be analyzed are as follows:
  • In the first step, an existing question and answer knowledge base may be retrieved, or one may be constructed by crawling question and answer pairs from cQA communities and major professional forums;
  • In the second step, the word extraction operation is performed on the question and answer pair to be analyzed.
  • This yields the question word set to be analyzed and the answer word set to be analyzed <symptoms, drugs, treatment, anti-virus, pediatric cold particles, description, dosage, cough, Chinese medicine, granules, antibiotics, amoxicillin, amoxicillin granules, granules, oral, roxithromycin, efficacy>; the category of the question and answer pair to be analyzed is “medical health”;
  • A plurality of question and answer knowledge records matching the question words to be analyzed are selected from the question and answer knowledge base, yielding the following answer words and semantic relevance values (for ease of reading, the semantic relevance values in the table have been normalized):
  • The question and answer knowledge records whose answer words match the answer words to be analyzed are then selected, and the semantic relevance of the selected records is obtained.
  • In this example, the answer words to be analyzed that match answer words in the question and answer knowledge records include: <oral, Kechuan, pediatric cold particles, examination, cough, treatment, flu symptoms, cold particles>.
  • From these, the degree of association of the question and answer pair to be analyzed can be calculated; it reaches 0.9 (on a scale where the degree of association ranges from 0 to 1).
  • FIG. 6 shows a flow chart of a method of optimizing a search ranking of a question and answer pair, in accordance with one embodiment of the present invention.
  • the method includes the following steps S610, S620, and S630:
  • S610 Receive a search request of the user, and obtain a plurality of question and answer pairs to be analyzed that match the search request according to the search request of the user.
  • the network search technology may be used, for example, using a question and answer pair search engine to obtain a question and answer pair to be analyzed according to the user's search request.
  • S620 Obtain an association degree of each question and answer pair to be analyzed according to a Q&A knowledge base including a plurality of Q&A knowledge records.
  • the question content and the answer content of the question and answer pair may be analyzed from the semantic aspect by using the question and answer knowledge base to obtain the correlation degree of the question and answer pair to be analyzed, and the evaluation effect is better and easy to implement.
  • the method of step S620 of this embodiment is substantially the same as the method of obtaining the degree of association of a question and answer pair described above, and is not repeated here.
  • the question and answer knowledge base including a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs from a webpage having a question and answer pair in advance, and constructing according to the extracted question and answer pairs.
  • the category corresponding to the question and answer pair is captured.
  • the question and answer knowledge record is constructed according to the question and answer pair and the category corresponding to the question and answer pair.
  • Each question and answer knowledge record in the obtained question and answer knowledge base corresponds to a category, which includes a question word (QW), an answer word (AW), and a semantic relevance between the question word and the answer word.
  • the semantic relevance between the question words and answer words of multiple Q&A knowledge records can be obtained by learning from massive information; and because the question and answer knowledge base is built from information extracted from web pages, the scope of application is broader and the method is more general.
  • the method of this embodiment further includes a step of constructing the question and answer knowledge base. The construction process is substantially the same as the process shown in FIG. 2, and the interpretation model of the question and answer knowledge base of this embodiment is substantially the same as the interpretation model described above, so neither is repeated here.
  • the search ranking of the question and answer pair to be analyzed can be optimized by using the degree of association, and the ranking effect is better.
  • the specific method may be to arrange the search ranking of the question and answer pairs to be analyzed in order of their degree of association, that is, question and answer pairs with a higher degree of association are ranked first; alternatively, the websites to which the question and answer pairs to be analyzed belong may first be given a preliminary ordering by a search ranking technique, and the search ranking of each pair is then calculated from the sequence number of that preliminary ordering and the degree of association of the pair, for example by multiplying the degree of association of the pair by the preliminary ordering of the website to which it belongs and using the order of the products as the search ranking. Because the pairs are sorted by combining the quality of each pair with the ranking of the website to which it belongs, users obtain better-ordered results when performing question and answer searches.
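The two ranking strategies above can be sketched as follows. This is a minimal illustration, not the patented implementation: the dictionary keys `association` and `site_position` are hypothetical names, and the text does not say how the preliminary sequence number is oriented. The sketch assumes position 1 is the best website and therefore multiplies the degree of association by 1/position so that a larger product is better.

```python
def rank_by_association(pairs):
    """Strategy 1: order Q&A pairs by degree of association, highest first."""
    return sorted(pairs, key=lambda p: p["association"], reverse=True)


def rank_with_site_order(pairs):
    """Strategy 2: combine the degree of association with the preliminary
    position of the website the pair belongs to.

    Assumption: site_position 1 is best, so the association is multiplied
    by 1 / site_position before sorting in descending order.
    """
    return sorted(
        pairs,
        key=lambda p: p["association"] * (1.0 / p["site_position"]),
        reverse=True,
    )


pairs = [
    {"id": "a", "association": 0.9, "site_position": 3},
    {"id": "b", "association": 0.5, "site_position": 1},
    {"id": "c", "association": 0.2, "site_position": 2},
]
by_association = [p["id"] for p in rank_by_association(pairs)]
by_combined = [p["id"] for p in rank_with_site_order(pairs)]
```

With the sample data, the first strategy favours the pair with the highest degree of association, while the second lets a well-ranked website lift a pair with a moderate degree of association.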
  • the device includes a question and answer knowledge base 710, a search unit 720, a calculation unit 730, and a search ranking unit 740.
  • the question and answer knowledge base 710 is adapted to store a plurality of question and answer knowledge records.
  • the question and answer knowledge base 710 of the present embodiment can be constructed by crawling a massive question and answer pair in a web page.
  • the searching unit 720 is adapted to receive a search request of the user, and obtain a plurality of question and answer pairs to be analyzed that match the search request according to the search request of the user.
  • the search unit 720 may be a question and answer pair search engine that obtains the question and answer pairs to be analyzed according to the user's search request; for example, the search unit 720 is a web search engine for question and answer search that receives the search request entered by the user through a browser and obtains the question and answer pairs to be analyzed.
  • the calculating unit 730 is adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base 710.
  • the calculation unit 730 of the present invention can use the question and answer knowledge base to analyze, from the semantic aspect, the question content and the answer content of the question and answer pair to be analyzed, and thereby obtain its degree of association; the evaluation effect is better and the approach is easy to implement.
  • the question and answer knowledge base 710 is constructed from a large number of high-quality question and answer pairs extracted from web pages and includes a plurality of question and answer knowledge records, so the semantic relevance between the question words and answer words of those records can be obtained by learning from massive information.
  • the search ranking unit 740 is adapted to optimize the search ranking of the question and answer pair to be analyzed according to the degree of association of the question and answer pairs to be analyzed.
  • the specific method may be to arrange the search ranking of the question and answer pairs to be analyzed in order of their degree of association, that is, question and answer pairs with a higher degree of association are ranked first; alternatively, the websites to which the question and answer pairs to be analyzed belong may first be given a preliminary ordering by a search ranking technique, and the search ranking of each pair is then calculated from the sequence number of that preliminary ordering and the degree of association of the pair, for example by multiplying the degree of association of the pair by the preliminary ordering of the website to which it belongs and using the order of the products as the search ranking.
  • the apparatus further includes a question and answer knowledge base construction unit 750, which is adapted to extract a plurality of question and answer pairs in advance from web pages containing question and answer pairs and to construct, from the extracted pairs, a question and answer knowledge base including a plurality of question and answer knowledge records. In the device shown in FIG. 7, the question and answer knowledge base 710 already exists. Because the volume of information on the actual network keeps growing and its content changes rapidly, the content of the question and answer knowledge base 710 often needs to be updated.
  • having the question and answer knowledge base construction unit 750 construct (or update) the question and answer knowledge base 710 ensures that its content is timely and reliable.
  • the question and answer knowledge base construction unit 750 of the present embodiment is the same as the question and answer knowledge base construction unit 400 shown in FIG. 5, and the description thereof will not be repeated here.
  • the calculation unit 730 in FIG. 7 specifically includes a word extraction subunit and an association degree calculation subunit (not shown).
  • the word extraction subunit is adapted to perform the word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, and obtain at least one question word to be analyzed and at least one answer word to be analyzed.
  • the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction (e.g., nouns and verbs) on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
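The word extraction operation described above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses English text and a regular expression in place of a real word segmenter, `STOP_WORDS` and `PHRASES` are small hypothetical tables, and the entity-word step (keeping nouns, verbs, and so on) is omitted because it would require a part-of-speech tagger.

```python
import re

# Hypothetical stop-word list and phrase table, for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "of", "to", "what", "for", "how"}
PHRASES = {("cold", "granules"): "cold granules"}


def extract_words(text, stop_words=STOP_WORDS, phrases=PHRASES):
    """Segment text, drop stop words, then join adjacent tokens that form
    a known phrase (the word join step)."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())      # segmentation
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if pair in phrases:                               # word join
            merged.append(phrases[pair])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


question_words = extract_words("What is the dosage of cold granules?")
```

For Chinese text, the regular expression would have to be replaced by a dedicated word segmenter; the rest of the pipeline keeps the same shape.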
  • the association degree calculation subunit is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge records.
  • the association degree calculation subunit is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
  • a question word matches a question word to be analyzed when the question word to be analyzed is the same as the question word or is a substring of the question word; an answer word matches an answer word to be analyzed when the answer word to be analyzed is the same as the answer word or is a substring of the answer word. The degree of association of the question and answer pair to be analyzed for each category is obtained from the selected question and answer knowledge records that correspond to that category; more specifically, the semantic relevance values of the selected records of the same category are added with weights (for example, a weight of 1 or 100) to obtain the degree of association of the pair for that category. At least one degree of association is thereby obtained (in this embodiment, the number of degrees of association equals the number of categories). The maximum of these per-category degrees of association is selected as the degree of association of the question and answer pair to be analyzed.
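The matching and aggregation rule above can be sketched as follows. This is a minimal illustration, not the authors' code: the record layout `(question_word, answer_word, category, relevance)` and the sample values are assumptions, and no normalization to the 0 to 1 range is applied.

```python
from collections import defaultdict


def matches(analyzed_word, record_word):
    """True when the analyzed word equals the record word or is a
    substring of it (Python's `in` covers both cases for strings)."""
    return analyzed_word in record_word


def association_degree(question_words, answer_words, knowledge_base, weight=1.0):
    """Sum the weighted semantic relevance of matching records per
    category, then return the largest per-category sum."""
    per_category = defaultdict(float)
    for qw, aw, category, relevance in knowledge_base:
        question_hit = any(matches(w, qw) for w in question_words)
        answer_hit = any(matches(w, aw) for w in answer_words)
        if question_hit and answer_hit:
            per_category[category] += weight * relevance
    return max(per_category.values(), default=0.0)


# Hypothetical knowledge records.
knowledge_base = [
    ("cold medicine", "cough", "medical health", 0.25),
    ("cold", "oral", "medical health", 0.5),
    ("cold", "weather", "climate", 0.3),
]
degree = association_degree(["cold"], ["cough", "oral"], knowledge_base)
```

Here the two "medical health" records match and their relevance values are added, while the "climate" record is skipped because none of the analyzed answer words matches "weather".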
  • FIG. 8 illustrates a flow chart of a method of determining a crawl frequency of a network resource point, in accordance with one embodiment of the present invention.
  • the method includes the following steps S810, S820, and S830:
  • the plurality of question and answer pairs to be analyzed are captured from the network resource point.
  • specifically, for a network resource point whose crawl frequency needs to be determined, for example a Q&A community, floor identification technology may be used to extract the question and answer pairs to be analyzed, treating the post of the thread starter (i.e., the user who first posts about a question) as the question and the replies on the 1st floor, 2nd floor, and so on (i.e., from the users who reply to the post in order) as the answers.
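The floor-based extraction described above can be sketched as follows, assuming a thread has already been fetched and split into post texts ordered by floor. The function name and input layout are hypothetical; identifying floors in real forum HTML is site-specific and is not shown.

```python
def extract_qa_pairs(thread_posts):
    """Treat the first post (the thread starter) as the question and every
    later floor as one answer, yielding one Q&A pair per reply."""
    if len(thread_posts) < 2:
        return []  # no replies, so there is no Q&A pair to analyze
    question, *replies = thread_posts
    return [(question, reply) for reply in replies]


thread = [
    "What should a child take for a cold?",
    "Pediatric cold granules, taken orally.",
    "See a doctor if the cough persists.",
]
qa_pairs = extract_qa_pairs(thread)
```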
  • the question content and the answer content of the question and answer pair to be analyzed may be analyzed semantically by using the question and answer knowledge base to obtain the degree of association of the pair; the evaluation effect is better and the approach is easier to implement.
  • the method of step S820 of this embodiment is substantially the same as the method of obtaining the degree of association of a question and answer pair described above, and is not repeated here.
  • the question and answer knowledge base including a plurality of question and answer knowledge records is obtained by extracting a plurality of question and answer pairs from a webpage having a question and answer pair in advance, and constructing according to the extracted question and answer pairs.
  • the category corresponding to the question and answer pair is captured.
  • the question and answer knowledge record is constructed according to the question and answer pair and the category corresponding to the question and answer pair.
  • Each question and answer knowledge record in the obtained question and answer knowledge base corresponds to a category, which includes a question word (QW), an answer word (AW), and a semantic relevance between the question word and the answer word.
  • the semantic relevance between the question words and answer words of multiple Q&A knowledge records can be obtained by learning from massive information; and because the question and answer knowledge base is built from information extracted from web pages, the scope of application is broader and the method is more general.
  • the method of this embodiment further includes a step of constructing the question and answer knowledge base. The construction process is substantially the same as the process shown in FIG. 2, and the interpretation model of the question and answer knowledge base of this embodiment is substantially the same as the interpretation model shown in FIG. 3, so neither is repeated here.
  • the quality of the network resource point can be determined from the degrees of association of the plurality of question and answer pairs to be analyzed, and the crawl frequency of the network resource point is determined accordingly.
  • the specific method may be to use the average of the degrees of association of the question and answer pairs to be analyzed as the crawl frequency of the network resource point, that is, a network resource point with a larger average degree of association (i.e., better quality) is crawled more frequently (for example, a spider crawler crawls it at a higher frequency); alternatively, a spider crawler may be used to obtain an initial crawl frequency for the network resource point, the average of the degrees of association of the question and answer pairs to be analyzed is calculated, and that average is used to adjust the initial crawl frequency, for example by weighting the initial crawl frequency with the average (including multiplication, normalization, and the like), so that high-quality network resource points are crawled more frequently and search quality is thereby optimized.
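Both ways of determining the crawl frequency can be sketched in one small function. This is an illustrative sketch: the function name and the choice of plain multiplication as the weighting are assumptions (the text also allows normalization and other operations), and the units of the frequency are whatever the caller uses.

```python
def crawl_frequency(association_degrees, initial_frequency=None):
    """Return a crawl frequency for a network resource point.

    Without an initial frequency, the mean degree of association is used
    directly. With one, the initial frequency is weighted by the mean
    (multiplication is assumed here).
    """
    if not association_degrees:
        raise ValueError("at least one degree of association is required")
    mean = sum(association_degrees) / len(association_degrees)
    if initial_frequency is None:
        return mean
    return initial_frequency * mean


direct = crawl_frequency([0.5, 1.0])
adjusted = crawl_frequency([0.5, 1.0], initial_frequency=8)
```

A resource point whose Q&A pairs have a high average degree of association keeps most of its initial frequency, while a low-quality one is crawled less often.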
  • by analyzing the degree of association of the question and answer pairs captured from the network resource point and determining the crawl frequency of the network resource point according to that degree of association, the accuracy of the crawl results can be improved.
  • the apparatus includes a question and answer knowledge base 910, a resource analysis unit 920, a calculation unit 930, and a crawl frequency determining unit 940.
  • the Q&A knowledge base 910 is adapted to store a plurality of Q&A knowledge records.
  • the question and answer knowledge base 910 of the present embodiment can be constructed by crawling a large number of question and answer pairs in a web page.
  • the resource analysis unit 920 is adapted to capture a plurality of question and answer pairs to be analyzed from the network resource point.
  • specifically, for a network resource point whose crawl frequency needs to be determined, for example a Q&A community, the resource analysis unit 920 may use floor identification technology to extract the question and answer pairs to be analyzed, treating the post of the thread starter (i.e., the user who first posts about a question) as the question and the replies on the 1st floor, 2nd floor, and so on (i.e., from the users who reply to the post in order) as the answers.
  • the calculating unit 930 is adapted to obtain the degree of association of each question and answer pair to be analyzed according to the question and answer knowledge base.
  • the calculation unit 930 of the present invention can use the question and answer knowledge base to analyze, from the semantic aspect, the question content and the answer content of the question and answer pair to be analyzed, and thereby obtain its degree of association; the evaluation effect is better and the approach is easy to implement.
  • the Q&A knowledge base 910 is constructed from a large number of high-quality Q&A pairs extracted from web pages and includes a plurality of Q&A knowledge records, so the semantic relevance between the question words and answer words of those records can be obtained by learning from massive information.
  • the capture frequency determining unit 940 is adapted to determine a crawling frequency of the network resource point according to the correlation degree of the question and answer pair to be analyzed.
  • the quality of the network resource point can be determined from the degrees of association of the plurality of question and answer pairs to be analyzed, and the crawl frequency of the network resource point is determined accordingly.
  • the specific method may be to use the average of the degrees of association of the question and answer pairs to be analyzed as the crawl frequency of the network resource point, that is, a network resource point with a larger average degree of association (i.e., better quality) is crawled more frequently (for example, a spider crawler crawls it at a higher frequency); alternatively, a spider crawler may be used to obtain an initial crawl frequency for the network resource point, the average of the degrees of association of the question and answer pairs to be analyzed is calculated, and that average is used to adjust the initial crawl frequency, for example by weighting the initial crawl frequency with the average (including multiplication, normalization, and the like), so that high-quality network resource points are crawled more frequently and search quality is thereby optimized.
  • the apparatus further includes a question and answer knowledge base construction unit 950, which is adapted to extract a plurality of question and answer pairs in advance from web pages containing question and answer pairs and to construct, from the extracted pairs, a question and answer knowledge base including a plurality of question and answer knowledge records. The Q&A knowledge base 910 already exists. Because the volume of information on the actual network keeps growing and its content changes rapidly, the content of the Q&A knowledge base 910 often needs to be updated.
  • having the question and answer knowledge base construction unit 950 construct (or update) the Q&A knowledge base ensures that its content is timely and reliable.
  • the question and answer knowledge base construction unit 950 of the present embodiment is the same as the question and answer knowledge base construction unit 400 shown in FIG. 5, and the description thereof will not be repeated here.
  • the calculation unit 930 in FIG. 9 specifically includes a word extraction subunit and an association degree calculation subunit (not shown).
  • the word extraction subunit is adapted to perform the word extraction operation on the question content and the answer content of the question and answer pair to be analyzed, and obtain at least one question word to be analyzed and at least one answer word to be analyzed.
  • the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction (e.g., nouns and verbs) on the question content and the answer content of the question and answer pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
  • the association degree calculation subunit is adapted to select at least one question and answer knowledge record from the question and answer knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the question and answer pair to be analyzed according to the selected question and answer knowledge records.
  • the association degree calculation subunit is adapted to select the question and answer knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed.
  • a question word matches a question word to be analyzed when the question word to be analyzed is the same as the question word or is a substring of the question word; an answer word matches an answer word to be analyzed when the answer word to be analyzed is the same as the answer word or is a substring of the answer word. The degree of association of the question and answer pair to be analyzed for each category is obtained from the selected question and answer knowledge records that correspond to that category; more specifically, the semantic relevance values of the selected records of the same category are added with weights (for example, a weight of 1 or 100) to obtain the degree of association of the pair for that category. At least one degree of association is thereby obtained (in this embodiment, the number of degrees of association equals the number of categories). The maximum of these per-category degrees of association is selected as the degree of association of the question and answer pair to be analyzed.
  • the various component embodiments of the present invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It should be understood by those skilled in the art that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all of the functions of some or all of the components of the device for obtaining the degree of association of a question and answer pair, the device for optimizing the search ranking of question and answer pairs, and the device for determining the crawl frequency of a network resource point according to embodiments of the present invention.
  • the invention can also be implemented as a device or device program (e.g., a computer program and a computer program product) for performing some or all of the methods described herein. Such a program implementing the invention may be stored on a computer readable medium or may be in the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
  • FIG. 10 illustrates a server, such as an application server, for performing the method of obtaining the degree of association of a question and answer pair, the method of optimizing the search ranking of question and answer pairs, and the method of determining the crawl frequency of a network resource point according to the present invention.
  • the application server conventionally includes a processor 1010 and a computer program product or computer readable medium in the form of a memory 1020.
  • the memory 1020 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read Only Memory), an EPROM, a hard disk, or a ROM.
  • the memory 1020 has a storage space 1030 for program code 1031 for performing any of the above method steps.
  • storage space 1030 for program code may include various program code 1031 for implementing various steps in the above methods, respectively.
  • the program code can be read from or written to one or more computer program products.
  • These computer program products include program code carriers such as hard disks, compact disks (CDs), memory cards or floppy disks.
  • Such a computer program product is typically a portable or fixed storage unit as described with reference to FIG.
  • the storage unit may have a storage section, a storage space, and the like arranged similarly to the storage 1020 in the application server of FIG.
  • the program code can be compressed, for example, in an appropriate form.
  • the storage unit includes computer readable code 1131', i.e., code that can be read by a processor such as the processor 1010, which, when executed by a server, causes the server to perform each step of the methods described above.

Abstract

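An apparatus and method for obtaining the degree of association of a question-and-answer (Q&A) pair, an apparatus and method for optimizing the search ranking of Q&A pairs, and an apparatus and method for determining the crawl frequency of a network resource point. The method of obtaining the degree of association of a Q&A pair includes the following steps: performing a word extraction operation on the question content and the answer content of the Q&A pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and, according to the question words to be analyzed and the answer words to be analyzed, selecting at least one Q&A knowledge record from a Q&A knowledge base that includes a plurality of Q&A knowledge records, and calculating the degree of association of the Q&A pair to be analyzed according to the selected records. The apparatus and method for obtaining the degree of association of a Q&A pair can evaluate the quality of a Q&A pair semantically, give better evaluation results, and are easy to implement and highly general.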

Description

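Apparatus and Method for Obtaining the Degree of Association of Question-and-Answer Pairs and Optimizing Search Ranking
Technical Field
The present invention relates to the field of network data communication technologies, and in particular to an apparatus and method for obtaining the degree of association of a Q&A pair, an apparatus and method for optimizing the search ranking of Q&A pairs, and an apparatus and method for determining the crawl frequency of a network resource point.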
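Background
A Q&A community is a web application built on user-generated content. In its basic form, users ask questions according to their own needs and other users answer them. This gives users a new channel for obtaining information on the web. However, because any user can create content freely, the quality of information in Q&A communities varies greatly, and large numbers of low-quality Q&A pairs appear. This makes it harder for users to find information and lowers the quality of the community. In addition, existing methods of judging Q&A pair quality rely mostly on non-textual features of the pair, which limits their generality.
Furthermore, when existing search techniques are used for Q&A search, some of the retrieved results are low-quality Q&A pairs, and existing methods of sorting search results rely mostly on the website to which a Q&A pair belongs and on the pair's non-textual features, which harms the accuracy and generality of the search results.
Likewise, when existing search techniques are used for Q&A search, it is difficult to judge the quality of a Q&A community as a network resource point. Existing methods of setting the crawl frequency of a network resource point (for example, spider crawlers) rely mostly on analyzing the links of the Q&A website. Applied to Q&A search, such methods can neither analyze Q&A pairs semantically nor adjust the crawl frequency (also called crawl granularity or crawl rate) according to the quality of the network resource point, which harms the accuracy and generality of the search results.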
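Summary of the Invention
In view of the above problems, the present invention is proposed to provide an apparatus and method for obtaining the degree of association of a Q&A pair, an apparatus and method for optimizing the search ranking of Q&A pairs, and an apparatus and method for determining the crawl frequency of a network resource point, which overcome the above problems or at least partially solve them.
According to one aspect of the present invention, an apparatus for obtaining the degree of association of a Q&A pair is provided, the apparatus including: a Q&A knowledge base adapted to store a plurality of Q&A knowledge records; a word extraction unit adapted to perform a word extraction operation on the question content and the answer content of a Q&A pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and an association degree calculation unit adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question words to be analyzed and the answer words to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed according to the selected records.
According to another aspect of the present invention, an apparatus for optimizing the search ranking of Q&A pairs is provided, the apparatus including: a Q&A knowledge base adapted to store a plurality of Q&A knowledge records; a search unit adapted to receive a user's search request and, according to that request, obtain a plurality of Q&A pairs to be analyzed that match it; a calculation unit adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base; and a search ranking unit adapted to optimize the search ranking of the Q&A pairs to be analyzed according to their degrees of association.
According to yet another aspect of the present invention, an apparatus for determining the crawl frequency of a network resource point is provided, the apparatus including: a Q&A knowledge base adapted to store a plurality of Q&A knowledge records; a resource analysis unit adapted to capture a plurality of Q&A pairs to be analyzed from the network resource point; a calculation unit adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base; and a crawl frequency determining unit that determines the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed.
According to another aspect of the present invention, a method of obtaining the degree of association of a Q&A pair is provided, the method including the following steps: performing a word extraction operation on the question content and the answer content of the Q&A pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed; and, according to the question words to be analyzed and the answer words to be analyzed, selecting at least one Q&A knowledge record from a Q&A knowledge base that includes a plurality of Q&A knowledge records, and calculating the degree of association of the Q&A pair to be analyzed according to the selected records.
According to yet another aspect of the present invention, a method of optimizing the search ranking of Q&A pairs is provided, the method including the following steps: receiving a user's search request and, according to that request, obtaining a plurality of Q&A pairs to be analyzed that match it; obtaining the degree of association of each Q&A pair to be analyzed according to a Q&A knowledge base that includes a plurality of Q&A knowledge records; and optimizing the search ranking of the Q&A pairs to be analyzed according to their degrees of association.
According to still another aspect of the present invention, a method of determining the crawl frequency of a network resource point is provided, the method including the following steps: capturing a plurality of Q&A pairs to be analyzed from the network resource point; obtaining the degree of association of each Q&A pair to be analyzed according to a Q&A knowledge base that includes a plurality of Q&A knowledge records; and determining the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed.
According to the technical solution of the present invention, a plurality of Q&A pairs are extracted from web pages containing Q&A pairs, and a Q&A knowledge base including a plurality of Q&A knowledge records is constructed from the extracted pairs; a word extraction operation is performed on the question content and the answer content of a Q&A pair to be analyzed to obtain at least one question word to be analyzed and at least one answer word to be analyzed; at least one Q&A knowledge record is then selected from the Q&A knowledge base according to those words, and the degree of association of the Q&A pair to be analyzed is calculated according to the selected records. The quality of a Q&A pair can thus be evaluated semantically, which solves the problem of poor evaluation results caused by the prior art evaluating Q&A pair quality only at the lexical level. In addition, where a plurality of Q&A pairs to be analyzed that match a user's search request are obtained according to that request, the degree of association of each pair is obtained according to the Q&A knowledge base and the search ranking of the pairs is optimized according to their degrees of association. The quality of the pairs to be analyzed can thus be evaluated semantically, which solves the problem of poor sorting results caused by the prior art relying on the web page to which a Q&A pair belongs and on the pair's non-textual features. Further, by capturing a plurality of Q&A pairs to be analyzed from a network resource point, obtaining the degree of association of each pair according to the Q&A knowledge base, and determining the crawl frequency of the network resource point according to those degrees of association, the crawl frequency can be determined by evaluating the quality of the network resource point, which solves the problem of poor search results caused by the prior art being unable to adjust the crawl frequency according to the quality of the network resource point. Moreover, the solution of the present application is easy to implement and highly general.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.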
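Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 shows a flowchart of a method of obtaining the degree of association of a Q&A pair according to one embodiment of the present invention;
FIG. 2 shows a detailed flowchart of constructing the Q&A knowledge base;
FIG. 3 shows a schematic diagram of an interpretation model of the Q&A knowledge base obtained using the steps shown in FIG. 2;
FIG. 4 shows a detailed flowchart of step S200 in FIG. 1;
FIG. 5 shows a block diagram of an apparatus for obtaining the degree of association of a Q&A pair according to one embodiment of the present invention;
FIG. 6 shows a flowchart of a method of optimizing the search ranking of Q&A pairs according to one embodiment of the present invention;
FIG. 7 shows a block diagram of an apparatus for optimizing the search ranking of Q&A pairs according to one embodiment of the present invention;
FIG. 8 shows a flowchart of a method of determining the crawl frequency of a network resource point according to one embodiment of the present invention;
FIG. 9 shows a block diagram of an apparatus for determining the crawl frequency of a network resource point according to one embodiment of the present invention;
FIG. 10 shows a block diagram of an application server for performing the method according to the present invention; and
FIG. 11 shows a storage unit for holding or carrying program code that implements the method according to the present invention.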
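Embodiments
Existing methods of obtaining the degree of association of a Q&A pair describe the question and the answer using textual and non-textual features. Similarly, existing methods of obtaining the search ranking of Q&A pairs either rank the pairs using textual and non-textual features of the question and answer, or rank them by the ranking of the website to which they belong. Textual features mainly include visual features of the text (for example punctuation density, average word length, and text entropy) and content features of the text (for example the proportion of content words, the density of interrogative words, and related-word coverage), together with features widely used in automatic Chinese error detection (for example single-character density). Non-textual features include the user's authority index, the status of the answer and question, the answer time, and user interaction features. After features are extracted for questions and answers separately, a question quality prediction model and an answer quality prediction model are each learned on a training set, and the outputs of the two models are used to evaluate the quality of the Q&A pair. However, when existing methods evaluate answer quality, they describe the semantic match between question and answer using only the related-word coverage feature. This stays at the lexical level and does not actually consider the semantic match between question and answer, which is precisely the core of Q&A pair quality. For example, suppose the question is "Where is the capital of China?", answer 1 is "Beijing", and answer 2 is "The capital of China is Shanghai". After word segmentation and stop-word removal, the question becomes "China capital where", answer 1 becomes "Beijing", and answer 2 becomes "China capital Shanghai". In the prior art, the semantic match can be defined as the number of words that occur in both the question and the answer divided by the number of all words in the question and the answer. The semantic match between the question and answer 1 is then 0/4 = 0, and between the question and answer 2 it is 2/4 = 0.5. The prior art would therefore consider answer 2 the better match for the question, which is clearly wrong.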
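The prior-art lexical measure criticized above can be reproduced in a few lines. This sketch assumes that "all words in the question and the answer" means the distinct words of both, which is the reading that yields the denominators of 4 used in the example; the function name is hypothetical and the tokens are English renderings of the segmented Chinese words.

```python
def lexical_overlap(question_words, answer_words):
    """Words shared by question and answer divided by all distinct words
    appearing in either of them."""
    q, a = set(question_words), set(answer_words)
    union = q | a
    return len(q & a) / len(union) if union else 0.0


question = ["China", "capital", "where"]
answer_1 = ["Beijing"]                       # the correct answer
answer_2 = ["China", "capital", "Shanghai"]  # the wrong answer

score_1 = lexical_overlap(question, answer_1)
score_2 = lexical_overlap(question, answer_2)
```

The wrong answer scores higher than the correct one purely because it repeats the words of the question, which is the weakness the knowledge-base approach is meant to address.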
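Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed fully to those skilled in the art.
FIG. 1 shows a flowchart of a method of obtaining the degree of association of a Q&A pair according to one embodiment of the present invention. According to another aspect of the invention, a method of obtaining the degree of association of a Q&A pair is provided, the method including the following steps S100 and S200:
S100: Perform a word extraction operation on the question content and the answer content of the Q&A pair to be analyzed, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the present invention, performing the word extraction operation on the question content and the answer content of the Q&A pair to be analyzed specifically includes word segmentation, stop-word removal, word joining (word join), and extraction of entity words (for example nouns and verbs). At least one question word to be analyzed is thus obtained from the question content of the pair, and at least one answer word to be analyzed is obtained from its answer content.
S200: According to the question words to be analyzed and the answer words to be analyzed, select at least one Q&A knowledge record from a Q&A knowledge base that includes a plurality of Q&A knowledge records, and calculate the degree of association of the Q&A pair to be analyzed according to the selected records.
In step S200 of this embodiment, the Q&A knowledge base can be used to analyze the question content and the answer content of the Q&A pair semantically and so obtain its degree of association; the evaluation effect is better and the approach is easy to implement.
Further, the Q&A knowledge base including a plurality of Q&A knowledge records is obtained by extracting a plurality of Q&A pairs in advance from web pages containing Q&A pairs and constructing the knowledge base from the extracted pairs. In one embodiment of the present invention, when the Q&A pairs are extracted from the web pages, the category corresponding to each pair is also captured. When the knowledge base is constructed from the extracted pairs, the Q&A knowledge records are then built from the Q&A pairs and their corresponding categories. Each Q&A knowledge record in the resulting knowledge base corresponds to one category and includes a question word (QW), an answer word (AW), and the semantic relevance between the question word and the answer word.
By constructing the Q&A knowledge base from massive, high-quality Q&A pairs extracted from web pages, the semantic relevance between the question words and answer words of the Q&A knowledge records can be obtained by learning from massive information; and because the knowledge base is built from information extracted from web pages, the scope of application is broader and the method is more general.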
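FIG. 2 is a detailed flowchart of building the Q&A knowledge base. It comprises the following steps S310, S320, and S330:
S310: Extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs, and crawl the category corresponding to each Q&A pair.
In this embodiment, a web crawler may be used to fetch data from web pages on the Internet that contain high-quality Q&A pairs and to extract the pairs, which ensures the quality of the extracted pairs. Such pages include cQA (community question answering) sites and major specialist forums. Floor recognition (identifying the position of each post in a thread) may be used to extract Q&A pairs: the post by the thread starter (the user who first posts about a question) is taken as the question, and the replies on floor 1, floor 2, and so on (the users who reply in order) are taken as answers. Because these pages include category information for each Q&A pair, the corresponding category can be crawled together with the pair.
S320: For each Q&A pair, perform the word extraction operation on its question content and answer content to obtain a question word set and an answer word set; let each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair.
In one embodiment of the invention, the word extraction operation performed on the question content and answer content of each Q&A pair extracted in step S310 specifically comprises word segmentation, stop-word removal, word joining, and entity-word extraction.
At least one question word is obtained from the question content of each Q&A pair and at least one answer word from its answer content, giving for that pair a category set <C1,…,Ck,…,Cp>, a question word set <QW1,…,QWi,…,QWm>, and an answer word set <AW1,…,AWj,…,AWn>.
Letting each question word (QWi) and each answer word (AWj) form one information record under each category (Ck) corresponding to the pair, for example <QWi,AWj,Ck>, yields m*n*p information records.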
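The record-forming step of S320 is a Cartesian product of the three sets. A minimal sketch, assuming the sets have already been produced by word extraction; the function name `build_information_records` and the sample words are illustrative only:

```python
from itertools import product

def build_information_records(question_words, answer_words, categories):
    """Form one <QWi, AWj, Ck> record for every combination of the three sets."""
    return list(product(question_words, answer_words, categories))

question_words = ["孩子", "咳嗽"]           # m = 2
answer_words = ["止咳", "口服", "治疗"]     # n = 3
categories = ["医疗健康"]                   # p = 1

records = build_information_records(question_words, answer_words, categories)
print(len(records))  # 6, i.e. m * n * p
print(records[0])    # ('孩子', '止咳', '医疗健康')
```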
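S330: For each information record, perform the following operations: calculate the probability that the answer word belongs to the category; calculate the specificity with which the answer word explains the question word in the category; calculate the strength with which the question word is explained by the answer word in the category; multiply the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word; and let the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category, <QWi,AWj,weight(QWi,AWj)> or <QWi,AWj,Ck,weight(QWi,AWj)>. Step S330 of this embodiment may be carried out after the word extraction of step S320 has been applied to the massive number of Q&A pairs crawled from web pages, so that it operates on a massive number of information records; semantic relevance obtained from that many records is more accurate.
Preferably, calculating the probability that the answer word belongs to the category specifically comprises:
Figure PCTCN2014086838-appb-000001
Calculating the specificity with which each answer word explains the question word in the category specifically comprises:
Figure PCTCN2014086838-appb-000002
Calculating the strength with which the question word is explained by each answer word in the category specifically comprises:
Figure PCTCN2014086838-appb-000003
Multiplying the probability, specificity, and strength specifically comprises:
weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
#(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
#(AWj) denotes the number of times the answer word is AWj.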
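The product in the weight formula can be sketched as follows. The three component formulas are given only as figures in the source, so this sketch does not reproduce them: `category_given_answer` is an assumed frequency estimate of P(Ck|AWj) over information records, and the specificity and strength values are passed in as plain numbers. All function names are hypothetical.

```python
def category_given_answer(records, answer_word, category):
    """Assumed estimate of P(Ck | AWj): share of records with this answer word
    that fall under the category. Not the formula from the patent figures."""
    with_answer = [c for _, aw, c in records if aw == answer_word]
    if not with_answer:
        return 0.0
    return with_answer.count(category) / len(with_answer)

def semantic_weight(p_category_given_answer, specificity, strength):
    """weight = P(Ck|AWj) * specific(...) * interpret(...), as in the text."""
    return p_category_given_answer * specificity * strength

# Information records <QWi, AWj, Ck>; the data is made up for illustration.
records = [
    ("咳嗽", "止咳", "医疗健康"),
    ("鼻涕", "止咳", "医疗健康"),
    ("咳嗽", "止咳", "育儿"),
    ("孩子", "止咳", "育儿"),
]
p = category_given_answer(records, "止咳", "医疗健康")
print(p)                             # 0.5
print(semantic_weight(p, 0.5, 0.5))  # 0.125
```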
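Steps S310, S320, and S330 yield the Q&A knowledge records from which the knowledge base is built. FIG. 3 is a schematic diagram of an interpretation model of the knowledge base obtained with the steps shown in FIG. 2. For each question word QWi, n Q&A knowledge records can be obtained for each category in the category set <C1,…,Ck,…,Cp>. Those skilled in the art will appreciate that a record whose calculated semantic relevance is 0 may be deleted. Furthermore, if the knowledge base holds so many records that storing them and calculating the degree of association become too costly, a threshold may be preset and records whose semantic relevance is below the threshold deleted to reduce the cost.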
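The pruning described above amounts to a simple filter. A minimal sketch, assuming each Q&A knowledge record is stored as a tuple `(QW, AW, category, weight)`; the function name and the sample threshold are illustrative:

```python
def prune_records(knowledge_records, threshold=0.0):
    """Drop records whose semantic relevance is 0 or below the preset threshold."""
    return [r for r in knowledge_records if r[3] > 0 and r[3] >= threshold]

knowledge_records = [
    ("咳嗽", "止咳", "医疗健康", 0.5),
    ("咳嗽", "口服", "医疗健康", 0.05),
    ("咳嗽", "说明", "医疗健康", 0.0),
]
print(len(prune_records(knowledge_records)))                 # 2
print(len(prune_records(knowledge_records, threshold=0.1)))  # 1
```

A higher threshold trades recall of weakly related word pairs for a smaller knowledge base and faster lookups.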
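FIG. 4 is a detailed flowchart of step S200 in FIG. 1. After at least one question word to be analyzed and at least one answer word to be analyzed have been obtained in step S100, step S200 comprises the following steps S210, S220, and S230:
S210: Select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the word to be analyzed is identical to the question word or is a substring of it; an answer word matches an answer word to be analyzed when the word to be analyzed is identical to the answer word or is a substring of it. Step S210 uses field matching or field search to pick out from the knowledge base the records relevant to the Q&A pair to be analyzed.
S220: From the selected records that correspond to the same category, obtain the degree of association of the Q&A pair to be analyzed for each category. Specifically, the semantic relevance values of the selected records corresponding to the same category are weighted and summed to give the degree of association of the pair for that category.
In this embodiment, the records selected in step S210 are grouped by category, records of the same category forming one group. The semantic relevance values of each group are weighted (for example with a weight of 1 or 100) and summed to give the degree of association of the pair for that category. This yields at least one degree of association (in this embodiment, as many as there are categories corresponding to the Q&A pair to be analyzed).
S230: Take the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the pair.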
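Steps S210 to S230 can be sketched as one function. This is an illustration under stated assumptions rather than the patented implementation: records are tuples `(QW, AW, category, weight)`, a single uniform `scale` stands in for the weighting of 1 or 100, and a pair with no matching record is given 0.0 (the source does not say what happens in that case).

```python
from collections import defaultdict

def matches(analyzed_word, kb_word):
    """S210 rule: the analyzed word equals the knowledge-base word or is a substring of it."""
    return analyzed_word in kb_word

def degree_of_association(question_words, answer_words, knowledge_base, scale=1.0):
    per_category = defaultdict(float)
    for qw, aw, category, weight in knowledge_base:
        # S210: both the question word and the answer word of the record must match.
        if any(matches(w, qw) for w in question_words) and \
           any(matches(w, aw) for w in answer_words):
            # S220: weighted sum of semantic relevance within each category.
            per_category[category] += scale * weight
    # S230: the largest per-category value is the pair's degree of association.
    return max(per_category.values(), default=0.0)

knowledge_base = [
    ("咳嗽", "止咳", "医疗健康", 0.5),
    ("咳嗽", "感冒颗粒", "医疗健康", 0.25),  # matched by the substring "颗粒"
    ("咳嗽", "止咳", "育儿", 0.125),
    ("孩子", "玩具", "育儿", 0.5),           # answer word does not match
]
print(degree_of_association(["孩子", "咳嗽"], ["止咳", "颗粒"], knowledge_base))  # 0.75
```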
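FIG. 5 is a block diagram of an apparatus for obtaining the degree of association of a Q&A pair according to one embodiment of the present invention. The apparatus comprises a Q&A knowledge base 100, a word extraction unit 200, and an association degree calculation unit 300.
The Q&A knowledge base 100 is adapted to store a plurality of Q&A knowledge records; in this embodiment it can be built by crawling a massive number of Q&A pairs from web pages.
The word extraction unit 200 is adapted to perform a word extraction operation on the question content and answer content of the Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the invention, the word extraction unit 200 is adapted to perform word segmentation, stop-word removal, word joining, and extraction of entity words (for example nouns and verbs) on the question content and answer content of the pair, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
The association degree calculation unit 300 is adapted to select at least one Q&A knowledge record from the knowledge base according to the question words and answer words to be analyzed, and to calculate the degree of association of the pair from the selected records.
In one embodiment of the invention, the association degree calculation unit 300 is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the word to be analyzed is identical to the question word or is a substring of it; an answer word matches an answer word to be analyzed when the word to be analyzed is identical to the answer word or is a substring of it. From the selected records that correspond to the same category, the unit obtains the degree of association of the pair for each category; more specifically, it weights (for example with a weight of 1 or 100) and sums the semantic relevance values of the selected records of the same category, yielding at least one degree of association (in this embodiment, as many as there are categories corresponding to the pair to be analyzed). The unit then takes the maximum of the per-category degrees of association as the degree of association of the pair.
With the knowledge base 100, the word extraction unit 200, and the association degree calculation unit 300, at least one record is selected from the knowledge base using the question and answer words to be analyzed, and the degree of association of the pair is calculated from the selected records. The pair is thus analyzed semantically; the evaluation is more effective and easy to implement, and building the knowledge base from information extracted from web pages makes the approach more widely applicable and more general.
In this embodiment, the apparatus further comprises a Q&A knowledge base construction unit 400, adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build from them a knowledge base comprising a plurality of Q&A knowledge records. In the apparatus shown in FIG. 5 the knowledge base already exists; but because the amount of information on real networks keeps growing and its content changes quickly, the knowledge base often needs updating. Adding the construction unit 400 to build (or update) the knowledge base keeps its content current and reliable.
Preferably, when extracting a plurality of Q&A pairs from web pages containing Q&A pairs, the construction unit 400 crawls the category corresponding to each pair. In this embodiment, a web crawler may be used to fetch data from web pages on the Internet that contain high-quality Q&A pairs and to extract the pairs, which ensures the quality of the extracted pairs; such pages include cQA sites and major specialist forums. Because these pages include category information for each Q&A pair, the construction unit 400 can crawl the corresponding category together with the pair.
In this embodiment, the construction unit 400 is adapted to perform the following for each Q&A pair: perform the word extraction operation on the question content and answer content of the pair to obtain a question word set and an answer word set (specifically, the construction unit 400 applies word segmentation, stop-word removal, word joining, and entity-word extraction to the question content and answer content of each extracted pair to obtain the question words and answer words); and let each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the pair. The construction unit 400 is further adapted to perform the following for each information record: calculate the probability that the answer word belongs to the category; calculate the specificity with which the answer word explains the question word in the category; calculate the strength with which the question word is explained by the answer word in the category; multiply the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word; and let the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.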
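More specifically, the construction unit 400 is adapted to calculate the probability that the answer word belongs to the category as follows:
Figure PCTCN2014086838-appb-000004
More specifically, the construction unit 400 is adapted to calculate the specificity with which each answer word explains the question word in the category as follows:
Figure PCTCN2014086838-appb-000005
More specifically, the construction unit 400 is adapted to calculate the strength with which the question word is explained by each answer word in the category as follows:
Figure PCTCN2014086838-appb-000006
More specifically, the construction unit 400 is adapted to multiply the probability, specificity, and strength as follows:
weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
#(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
#(AWj) denotes the number of times the answer word is AWj.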
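The effect achievable with embodiments of the invention is illustrated by the following example. Consider the following Q&A pair, whose category is "医疗健康" (health care):
Figure PCTCN2014086838-appb-000007
Figure PCTCN2014086838-appb-000008
Word segmentation yields the following question words and answer words to be analyzed:
Figure PCTCN2014086838-appb-000009
The segmentation result shows no related-word coverage between question and answer, so the prior art would tend to judge the pair weakly associated and of low quality. A human reader, however, can plainly see that this is a high-quality Q&A pair.
To process this pair with the method and apparatus of the invention, first, an existing Q&A knowledge base is retrieved, or a knowledge base is built by crawling Q&A pairs from cQA sites and major specialist forums.
Second, the word extraction operation on the pair yields the question word set <孩子,咳嗽,鼻涕>, the answer word set <症状,药物,治疗,抗病毒,小儿感冒颗粒,说明,剂量,止咳,中药,冲剂,抗生素,阿莫西林,阿莫西林颗粒,颗粒,口服,罗红霉素,疗效>, and the category "医疗健康" for the pair.
Third, according to each question word to be analyzed and the category, Q&A knowledge records whose question word matches a question word to be analyzed are selected from the knowledge base, giving the following answer words and semantic relevance values (for readability, the values in the table below have been suitably normalized):
Figure PCTCN2014086838-appb-000010
Figure PCTCN2014086838-appb-000011
Fourth, according to the answer words to be analyzed, the records selected in the third step are filtered down to those whose answer word matches an answer word to be analyzed, and the semantic relevance of the filtered records is obtained. Analysis shows that in this example the answer words to be analyzed that match answer words in the Q&A knowledge records include <口服,咳喘,小儿感冒颗粒,检查,止咳,治疗,流感症状,感冒颗粒>.
Calculating the degree of association of the pair then gives 0.9 (on a scale from 0 to 1).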
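FIG. 6 is a flowchart of a method for optimizing the search ranking of Q&A pairs according to one embodiment of the present invention. The method comprises the following steps S610, S620, and S630:
S610: Receive a user's search request and, according to it, obtain a plurality of Q&A pairs to be analyzed that match the request.
In one embodiment of the invention, web search technology, for example a Q&A pair search engine, may be used to obtain the pairs according to the user's search request.
S620: Obtain the degree of association of each Q&A pair to be analyzed according to a Q&A knowledge base comprising a plurality of Q&A knowledge records.
In step S620 of this embodiment, the Q&A knowledge base is used to analyze the question content and answer content of each pair semantically and so obtain its degree of association; the evaluation is more effective and easy to implement.
More specifically, the way step S620 obtains the degree of association is substantially the same as the method shown in FIGS. 1 and 4 and is not repeated here.
Further, the Q&A knowledge base comprising a plurality of Q&A knowledge records is built in advance from a plurality of Q&A pairs extracted from web pages containing Q&A pairs. In one embodiment of the invention, the category corresponding to each Q&A pair is crawled when the pairs are extracted, and the Q&A knowledge records are then built from the pairs and their corresponding categories. Each Q&A knowledge record in the resulting knowledge base corresponds to one category and comprises one question word (QW), one answer word (AW), and the semantic relevance between the question word and the answer word. Because the knowledge base is built from a massive number of high-quality Q&A pairs extracted from web pages, the semantic relevance between the question word and answer word of each record is learned from a large body of information; and because the knowledge base is built from information extracted from web pages, the method applies more widely and is more general.
More specifically, the method of this embodiment further comprises a step of building the knowledge base; the procedure is substantially the same as that shown in FIG. 2, and the interpretation model of the knowledge base is substantially the same as that shown in FIG. 3. Neither is repeated here.
S630: Optimize the search ranking of the Q&A pairs to be analyzed according to their degrees of association.
Because the degree of association of a pair reflects its quality, it can be used to optimize the search ranking of the pairs, giving a better ranking.
Specifically, the order of the degrees of association may be used directly as the search ranking, so that pairs with a higher degree of association rank higher. Alternatively, the websites to which the pairs belong may first be given a preliminary order by a search ranking technique, and the search ranking of each pair calculated from that preliminary order number and the pair's degree of association; for example, the preliminary order number of the pair's website may be multiplied by the pair's degree of association and the order of the products used as the search ranking. Combining the quality of a pair with the ranking of its website in this way gives users better-ordered results when searching Q&A pairs.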
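The first ranking option described above (order the results by degree of association) can be sketched as a plain sort. The function name and the `(pair_id, degree)` tuple layout are assumptions for illustration; the second option, which combines the site's preliminary order number with the degree of association, is not sketched here because the source does not say how the product should be ordered.

```python
def rank_by_association(scored_pairs):
    """Order Q&A pairs so that a higher degree of association ranks first.

    scored_pairs: list of (pair_id, degree_of_association) tuples.
    sorted() is stable, so ties keep the order returned by the search engine.
    """
    return sorted(scored_pairs, key=lambda item: item[1], reverse=True)

search_results = [("pair-a", 0.25), ("pair-b", 0.9), ("pair-c", 0.5)]
for rank, (pair_id, degree) in enumerate(rank_by_association(search_results), start=1):
    print(rank, pair_id, degree)
# 1 pair-b 0.9
# 2 pair-c 0.5
# 3 pair-a 0.25
```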
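FIG. 7 is a block diagram of an apparatus for optimizing the search ranking of Q&A pairs according to one embodiment of the present invention. The apparatus comprises a Q&A knowledge base 710, a search unit 720, a calculation unit 730, and a search ranking unit 740.
The Q&A knowledge base 710 is adapted to store a plurality of Q&A knowledge records. In this embodiment it can be built by crawling a massive number of Q&A pairs from web pages.
The search unit 720 is adapted to receive a user's search request and, according to it, obtain a plurality of Q&A pairs to be analyzed that match the request.
In one embodiment of the invention, the search unit 720 may be a Q&A pair search engine that obtains the pairs according to the user's search request; for example, the search unit 720 is a web search engine for Q&A pair search that receives the search request entered by the user through a browser and obtains the pairs to be analyzed.
The calculation unit 730 is adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base 710.
The calculation unit 730 can use the knowledge base to analyze the question content and answer content of each pair semantically and so obtain its degree of association; the evaluation is more effective and easy to implement. The knowledge base 710 is built from a massive number of high-quality Q&A pairs extracted from web pages and comprises a plurality of Q&A knowledge records, so the semantic relevance between the question word and answer word of each record is learned from a large body of information.
The search ranking unit 740 is adapted to optimize the search ranking of the Q&A pairs to be analyzed according to their degrees of association.
Because the degree of association of a pair reflects its quality, it can be used to optimize the search ranking of the pairs, giving a better ranking. Specifically, the order of the degrees of association may be used directly as the search ranking, so that pairs with a higher degree of association rank higher. Alternatively, the websites to which the pairs belong may first be given a preliminary order by a search ranking technique, and the search ranking of each pair calculated from that preliminary order number and the pair's degree of association; for example, the preliminary order number of the pair's website may be multiplied by the pair's degree of association and the order of the products used as the search ranking.
In this embodiment, the apparatus further comprises a Q&A knowledge base construction unit 750, adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build from them a knowledge base comprising a plurality of Q&A knowledge records. In the apparatus shown in FIG. 7 the knowledge base 710 already exists; but because the amount of information on real networks keeps growing and its content changes quickly, the knowledge base 710 often needs updating. Adding the construction unit 750 to build (or update) the knowledge base 710 keeps its content current and reliable. The construction unit 750 of this embodiment is the same as the construction unit 400 shown in FIG. 5 and is not described again here.
The calculation unit 730 in FIG. 7 specifically comprises a word extraction subunit and an association degree calculation subunit (not shown).
The word extraction subunit is adapted to perform a word extraction operation on the question content and answer content of the Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the invention, the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and extraction of entity words (for example nouns and verbs) on the question content and answer content of the pair, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
The association degree calculation subunit is adapted to select at least one Q&A knowledge record from the knowledge base according to the question words and answer words to be analyzed, and to calculate the degree of association of the pair from the selected records.
In one embodiment of the invention, the association degree calculation subunit is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the word to be analyzed is identical to the question word or is a substring of it; an answer word matches an answer word to be analyzed when the word to be analyzed is identical to the answer word or is a substring of it. From the selected records that correspond to the same category, the subunit obtains the degree of association of the pair for each category; more specifically, it weights (for example with a weight of 1 or 100) and sums the semantic relevance values of the selected records of the same category, yielding at least one degree of association (in this embodiment, as many as there are categories corresponding to the pair to be analyzed). The subunit then takes the maximum of the per-category degrees of association as the degree of association of the pair.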
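FIG. 8 is a flowchart of a method for determining the crawl frequency of a network resource point according to one embodiment of the present invention. The method comprises the following steps S810, S820, and S830:
S810: Crawl a plurality of Q&A pairs to be analyzed from the network resource point.
In one embodiment of the invention, for a particular network resource point whose crawl frequency is to be determined, for example a Q&A community, floor recognition may be used to extract the Q&A pairs to be analyzed: the post by the thread starter (the user who first posts about a question) is taken as the question, and the replies on floor 1, floor 2, and so on (the users who reply in order) are taken as answers.
S820: Obtain the degree of association of each Q&A pair to be analyzed according to a Q&A knowledge base comprising a plurality of Q&A knowledge records.
In step S820 of this embodiment, the Q&A knowledge base is used to analyze the question content and answer content of each pair semantically and so obtain its degree of association; the evaluation is more effective and easy to implement.
More specifically, the way step S820 obtains the degree of association is substantially the same as the method shown in FIGS. 1 and 4 and is not repeated here.
Further, the Q&A knowledge base comprising a plurality of Q&A knowledge records is built in advance from a plurality of Q&A pairs extracted from web pages containing Q&A pairs. In one embodiment of the invention, the category corresponding to each Q&A pair is crawled when the pairs are extracted, and the Q&A knowledge records are then built from the pairs and their corresponding categories. Each Q&A knowledge record in the resulting knowledge base corresponds to one category and comprises one question word (QW), one answer word (AW), and the semantic relevance between the question word and the answer word. Because the knowledge base is built from a massive number of high-quality Q&A pairs extracted from web pages, the semantic relevance between the question word and answer word of each record is learned from a large body of information; and because the knowledge base is built from information extracted from web pages, the method applies more widely and is more general.
More specifically, the method of this embodiment further comprises a step of building the knowledge base; the procedure is substantially the same as that shown in FIG. 2, and the interpretation model of the knowledge base is substantially the same as that shown in FIG. 3. Neither is repeated here.
S830: Determine the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed.
Because the degree of association of a pair reflects its quality, the degrees of association of a plurality of pairs can be used to determine the quality of the network resource point and, from that, its crawl frequency.
Specifically, the average of the degrees of association of the pairs may be used as the crawl frequency of the network resource point, so that a resource point with a higher average (that is, better quality) is crawled more often (for example, a spider crawls it more frequently). Alternatively, a spider may be used to obtain an initial crawl frequency for the resource point, the average degree of association of the pairs calculated, and the initial frequency adjusted with that average to determine the crawl frequency; for example, the initial crawl frequency obtained by the spider may be weighted by the average degree of association (by multiplication, normalization, and the like), so that high-quality resource points are crawled more often and search quality is improved.
By analyzing the degrees of association of the Q&A pairs crawled from a network resource point and determining the crawl frequency of the resource point from them, this embodiment improves the accuracy of the crawl results.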
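The second option of step S830 (weighting an initial crawl frequency by the mean degree of association) can be sketched as follows. The function name, the unit of crawls per day, and the fallback for a resource point with no analyzed pairs are assumptions for illustration; the source only says the initial frequency is weighted, for example by multiplication.

```python
from statistics import mean

def crawl_frequency(association_degrees, initial_frequency):
    """Scale the spider's initial crawl frequency by the mean degree of association.

    association_degrees: degrees of association (assumed 0..1) of the pairs
    crawled from one network resource point.
    initial_frequency: initial crawl frequency, e.g. in crawls per day (assumed unit).
    """
    if not association_degrees:
        # Assumption: with nothing to analyze, keep the initial frequency unchanged.
        return initial_frequency
    return initial_frequency * mean(association_degrees)

print(crawl_frequency([0.5, 1.0, 0.75], initial_frequency=8))  # 6.0
print(crawl_frequency([0.25, 0.25], initial_frequency=8))      # 2.0
```

A resource point whose pairs score higher on average keeps more of its initial frequency, so better sources are revisited more often.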
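FIG. 9 is a block diagram of an apparatus for determining the crawl frequency of a network resource point according to one embodiment of the present invention. The apparatus comprises a Q&A knowledge base 910, a resource analysis unit 920, a calculation unit 930, and a crawl frequency determination unit 940.
The Q&A knowledge base 910 is adapted to store a plurality of Q&A knowledge records. In this embodiment it can be built by crawling a massive number of Q&A pairs from web pages.
The resource analysis unit 920 is adapted to crawl a plurality of Q&A pairs to be analyzed from the network resource point.
In one embodiment of the invention, for a particular network resource point whose crawl frequency is to be determined, for example a Q&A community, the resource analysis unit 920 may use floor recognition to extract the Q&A pairs to be analyzed: the post by the thread starter (the user who first posts about a question) is taken as the question, and the replies on floor 1, floor 2, and so on (the users who reply in order) are taken as answers.
The calculation unit 930 is adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base.
The calculation unit 930 can use the knowledge base to analyze the question content and answer content of each pair semantically and so obtain its degree of association; the evaluation is more effective and easy to implement. The knowledge base 910 is built from a massive number of high-quality Q&A pairs extracted from web pages and comprises a plurality of Q&A knowledge records, so the semantic relevance between the question word and answer word of each record is learned from a large body of information.
The crawl frequency determination unit 940 is adapted to determine the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed.
Because the degree of association of a pair reflects its quality, the degrees of association of a plurality of pairs can be used to determine the quality of the network resource point and, from that, its crawl frequency. Specifically, the average of the degrees of association of the pairs may be used as the crawl frequency of the network resource point, so that a resource point with a higher average (that is, better quality) is crawled more often (for example, a spider crawls it more frequently). Alternatively, a spider may be used to obtain an initial crawl frequency for the resource point, the average degree of association of the pairs calculated, and the initial frequency adjusted with that average to determine the crawl frequency; for example, the initial crawl frequency obtained by the spider may be weighted by the average degree of association (by multiplication, normalization, and the like), so that high-quality resource points are crawled more often and search quality is improved.
In this embodiment, the apparatus further comprises a Q&A knowledge base construction unit 950, adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build from them a knowledge base comprising a plurality of Q&A knowledge records. In the apparatus shown in FIG. 9 the knowledge base 910 already exists; but because the amount of information on real networks keeps growing and its content changes quickly, the knowledge base 910 often needs updating. Adding the construction unit 950 to build (or update) the knowledge base keeps its content current and reliable. The construction unit 950 of this embodiment is the same as the construction unit 400 shown in FIG. 5 and is not described again here.
The calculation unit 930 in FIG. 9 specifically comprises a word extraction subunit and an association degree calculation subunit (not shown).
The word extraction subunit is adapted to perform a word extraction operation on the question content and answer content of the Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed.
In one embodiment of the invention, the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and extraction of entity words (for example nouns and verbs) on the question content and answer content of the pair, to obtain at least one question word to be analyzed and at least one answer word to be analyzed.
The association degree calculation subunit is adapted to select at least one Q&A knowledge record from the knowledge base according to the question words and answer words to be analyzed, and to calculate the degree of association of the pair from the selected records.
In one embodiment of the invention, the association degree calculation subunit is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed. In this embodiment, a question word matches a question word to be analyzed when the word to be analyzed is identical to the question word or is a substring of it; an answer word matches an answer word to be analyzed when the word to be analyzed is identical to the answer word or is a substring of it. From the selected records that correspond to the same category, the subunit obtains the degree of association of the pair for each category; more specifically, it weights (for example with a weight of 1 or 100) and sums the semantic relevance values of the selected records of the same category, yielding at least one degree of association (in this embodiment, as many as there are categories corresponding to the pair to be analyzed). The subunit then takes the maximum of the per-category degrees of association as the degree of association of the pair.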
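The component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the apparatus for obtaining the degree of association of a Q&A pair, the apparatus for optimizing the search ranking of Q&A pairs, and the apparatus for determining the crawl frequency of a network resource point according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example a computer program or computer program product) for carrying out part or all of the methods described here. Such a program may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
For example, FIG. 10 is a block diagram of a server, for example an application server, for executing the method for obtaining the degree of association of a Q&A pair, the method for optimizing the search ranking of Q&A pairs, and the method for determining the crawl frequency of a network resource point according to the invention. The application server conventionally comprises a processor 1010 and a computer program product or computer-readable medium in the form of a memory 1020. The memory 1020 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. The memory 1020 has a storage space 1030 for program code 1031 for executing any of the method steps described above. For example, the storage space 1030 may contain individual pieces of program code 1031, each implementing one of the steps of the above methods. The program code can be read from or written to one or more computer program products. These computer program products comprise program code carriers such as hard disks, compact discs (CD), memory cards, or floppy disks. Such a computer program product is usually a portable or fixed storage unit as described with reference to FIG. 11. The storage unit may have storage segments, storage space, and so on arranged similarly to the memory 1020 in the application server of FIG. 10. The program code may, for example, be compressed in a suitable form. Typically the storage unit contains computer-readable code 1131', that is, code that can be read by a processor such as the processor 1010 and that, when run by a server, causes the server to execute the steps of the methods described above.
References here to "one embodiment", "an embodiment", or "one or more embodiments" mean that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. Note also that instances of the phrase "in one embodiment" do not necessarily all refer to the same embodiment.
Numerous specific details are set out in the description provided here. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.
Furthermore, the language used in this specification has been chosen mainly for readability and instructional purposes rather than to delineate or circumscribe the inventive subject matter. Many modifications and variations will therefore be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. The disclosure of the invention is illustrative and not restrictive as to its scope, which is defined by the appended claims.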

Claims (52)

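  1. An apparatus for obtaining the degree of association of a Q&A pair, the apparatus comprising:
    a Q&A knowledge base, adapted to store a plurality of Q&A knowledge records;
    a word extraction unit, adapted to perform a word extraction operation on the question content and answer content of a Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed;
    an association degree calculation unit, adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record.
  2. The apparatus according to claim 1, wherein the apparatus further comprises a Q&A knowledge base construction unit,
    the Q&A knowledge base construction unit being adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build, from the extracted Q&A pairs, a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    the Q&A knowledge base construction unit being further adapted, when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, to crawl the category corresponding to each Q&A pair;
    the Q&A knowledge base construction unit being further adapted, when building the Q&A knowledge base from the extracted Q&A pairs, to build the Q&A knowledge records from the Q&A pairs and the categories corresponding to the Q&A pairs; each Q&A knowledge record corresponding to one category and comprising one question word, one answer word, and the semantic relevance between the question word and the answer word.
  3. The apparatus according to claim 1 or 2, wherein
    the association degree calculation unit is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; to obtain, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category; and to take the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the Q&A pair to be analyzed.
  4. The apparatus according to claim 2, wherein
    the Q&A knowledge base construction unit is adapted to perform the following operations for each Q&A pair:
    performing the word extraction operation on the question content and answer content of the Q&A pair to obtain a question word set and an answer word set; and letting each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair;
    the Q&A knowledge base construction unit is adapted to perform the following operations for each information record:
    calculating the probability that the answer word belongs to the category, calculating the specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category; multiplying the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word; and letting the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.
  5. The apparatus according to any one of claims 1 to 4, wherein
    the association degree calculation unit is adapted to weight and sum the semantic relevance values of the selected Q&A knowledge records corresponding to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
  6. The apparatus according to any one of claims 1 to 5, wherein
    optionally, the word extraction unit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of the Q&A pair to be analyzed.
  7. The apparatus according to any one of claims 1 to 6, wherein
    the Q&A knowledge base construction unit is adapted to calculate the probability that the answer word belongs to the category as follows:
    Figure PCTCN2014086838-appb-100001
    the Q&A knowledge base construction unit is adapted to calculate the specificity with which each answer word explains the question word in the category as follows:
    Figure PCTCN2014086838-appb-100002
    the Q&A knowledge base construction unit is adapted to calculate the strength with which the question word is explained by each answer word in the category as follows:
    Figure PCTCN2014086838-appb-100003
    the Q&A knowledge base construction unit is adapted to multiply the probability, specificity, and strength as follows:
    weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
    where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
    #(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
    #(AWj) denotes the number of times the answer word is AWj.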
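  8. An apparatus for optimizing the search ranking of Q&A pairs, the apparatus comprising:
    a Q&A knowledge base, adapted to store a plurality of Q&A knowledge records;
    a search unit, adapted to receive a user's search request and, according to the user's search request, obtain a plurality of Q&A pairs to be analyzed that match the search request;
    a calculation unit, adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base;
    a search ranking unit, adapted to optimize the search ranking of the Q&A pairs to be analyzed according to the degrees of association of the Q&A pairs to be analyzed.
  9. The apparatus according to claim 8, wherein the calculation unit comprises:
    a word extraction subunit, adapted to perform a word extraction operation on the question content and answer content of a Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed;
    an association degree calculation subunit, adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record.
  10. The apparatus according to claim 8 or 9, wherein
    the search ranking unit is adapted to use the order of the degrees of association of the Q&A pairs to be analyzed as the search ranking of the Q&A pairs to be analyzed.
  11. The apparatus according to any one of claims 8 to 10, wherein the apparatus further comprises a Q&A knowledge base construction unit,
    the Q&A knowledge base construction unit being adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build, from the extracted Q&A pairs, a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    the Q&A knowledge base construction unit being further adapted, when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, to crawl the category corresponding to each Q&A pair;
    the Q&A knowledge base construction unit being further adapted, when building the Q&A knowledge base from the extracted Q&A pairs, to build the Q&A knowledge records from the Q&A pairs and the categories corresponding to the Q&A pairs; each Q&A knowledge record corresponding to one category and comprising one question word, one answer word, and the semantic relevance between the question word and the answer word.
  12. The apparatus according to any one of claims 8 to 11, wherein
    the association degree calculation subunit is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; to obtain, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category; and to take the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the Q&A pair to be analyzed.
  13. The apparatus according to any one of claims 8 to 12, wherein
    the association degree calculation subunit is adapted to weight and sum the semantic relevance values of the selected Q&A knowledge records corresponding to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
  14. The apparatus according to any one of claims 8 to 13, wherein
    the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of the Q&A pair to be analyzed.
  15. The apparatus according to any one of claims 8 to 14, wherein
    the Q&A knowledge base construction unit is adapted to perform the following operations for each Q&A pair: performing the word extraction operation on the question content and answer content of the Q&A pair to obtain a question word set and an answer word set; and letting each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair;
    the Q&A knowledge base construction unit is adapted to perform the following operations for each information record: calculating the probability that the answer word belongs to the category, calculating the specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category; multiplying the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word; and letting the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.
  16. The apparatus according to any one of claims 8 to 15, wherein
    the Q&A knowledge base construction unit is adapted to calculate the probability that the answer word belongs to the category as follows:
    Figure PCTCN2014086838-appb-100004
    the Q&A knowledge base construction unit is adapted to calculate the specificity with which each answer word explains the question word in the category as follows:
    Figure PCTCN2014086838-appb-100005
    the Q&A knowledge base construction unit is adapted to calculate the strength with which the question word is explained by each answer word in the category as follows:
    Figure PCTCN2014086838-appb-100006
    the Q&A knowledge base construction unit is adapted to multiply the probability, specificity, and strength as follows:
    weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
    where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
    #(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
    #(AWj) denotes the number of times the answer word is AWj.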
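  17. An apparatus for determining the crawl frequency of a network resource point, the apparatus comprising:
    a Q&A knowledge base, adapted to store a plurality of Q&A knowledge records;
    a resource analysis unit, adapted to crawl a plurality of Q&A pairs to be analyzed from the network resource point;
    a calculation unit, adapted to obtain the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base;
    a crawl frequency determination unit, which determines the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed.
  18. The apparatus according to claim 17, wherein the calculation unit comprises:
    a word extraction subunit, adapted to perform a word extraction operation on the question content and answer content of a Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed;
    an association degree calculation subunit, adapted to select at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and to calculate the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record.
  19. The apparatus according to claim 17 or 18, wherein
    the crawl frequency determination unit is adapted to use the average of the degrees of association of the Q&A pairs to be analyzed as the crawl frequency of the network resource point; or to use a spider to obtain an initial crawl frequency of the network resource point, calculate the average of the degrees of association of the Q&A pairs to be analyzed, and adjust the initial crawl frequency with the average to determine the crawl frequency of the network resource point.
  20. The apparatus according to any one of claims 17 to 19, wherein the apparatus further comprises a Q&A knowledge base construction unit,
    the Q&A knowledge base construction unit being adapted to extract a plurality of Q&A pairs in advance from web pages containing Q&A pairs and to build, from the extracted Q&A pairs, a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    the Q&A knowledge base construction unit being further adapted, when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, to crawl the category corresponding to each Q&A pair;
    the Q&A knowledge base construction unit being further adapted, when building the Q&A knowledge base from the extracted Q&A pairs, to build the Q&A knowledge records from the Q&A pairs and the categories corresponding to the Q&A pairs; each Q&A knowledge record corresponding to one category and comprising one question word, one answer word, and the semantic relevance between the question word and the answer word.
  21. The apparatus according to any one of claims 17 to 20, wherein
    the association degree calculation subunit is adapted to select the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed; to obtain, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category; and to take the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the Q&A pair to be analyzed.
  22. The apparatus according to any one of claims 17 to 21, wherein
    the association degree calculation subunit is adapted to weight and sum the semantic relevance values of the selected Q&A knowledge records corresponding to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
  23. The apparatus according to any one of claims 17 to 22, wherein
    the word extraction subunit is adapted to perform word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of the Q&A pair to be analyzed.
  24. The apparatus according to any one of claims 17 to 23, wherein
    the Q&A knowledge base construction unit is adapted to perform the following operations for each Q&A pair: performing the word extraction operation on the question content and answer content of the Q&A pair to obtain a question word set and an answer word set; and letting each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair;
    the Q&A knowledge base construction unit is adapted to perform the following operations for each information record: calculating the probability that the answer word belongs to the category, calculating the specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category; multiplying the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word; and letting the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.
  25. The apparatus according to any one of claims 17 to 24, wherein
    the Q&A knowledge base construction unit is adapted to calculate the probability that the answer word belongs to the category as follows:
    Figure PCTCN2014086838-appb-100007
    the Q&A knowledge base construction unit is adapted to calculate the specificity with which each answer word explains the question word in the category as follows:
    Figure PCTCN2014086838-appb-100008
    the Q&A knowledge base construction unit is adapted to calculate the strength with which the question word is explained by each answer word in the category as follows:
    Figure PCTCN2014086838-appb-100009
    the Q&A knowledge base construction unit is adapted to multiply the probability, specificity, and strength as follows:
    weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
    where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
    #(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
    #(AWj) denotes the number of times the answer word is AWj.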
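  26. A method for obtaining the degree of association of a Q&A pair, the method comprising the following steps:
    performing a word extraction operation on the question content and answer content of a Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed;
    selecting, according to the question word to be analyzed and the answer word to be analyzed, at least one Q&A knowledge record from a Q&A knowledge base comprising a plurality of Q&A knowledge records, and calculating the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record.
  27. The method according to claim 26, wherein the method further comprises:
    extracting a plurality of Q&A pairs in advance from web pages containing Q&A pairs, and building, from the extracted Q&A pairs, a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, crawling the category corresponding to each Q&A pair;
    when building the Q&A knowledge base from the extracted Q&A pairs, building the Q&A knowledge records from the Q&A pairs and the categories corresponding to the Q&A pairs;
    each Q&A knowledge record corresponding to one category and comprising one question word, one answer word, and the semantic relevance between the question word and the answer word.
  28. The method according to claim 26 or 27, wherein
    selecting at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and calculating the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record, specifically comprises:
    selecting the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed;
    obtaining, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category;
    taking the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the Q&A pair to be analyzed.
  29. The method according to any one of claims 26 to 28, wherein
    obtaining, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category specifically comprises:
    weighting and summing the semantic relevance values of the selected Q&A knowledge records corresponding to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
  30. The method according to any one of claims 26 to 29, wherein building the Q&A knowledge base from the Q&A pairs and the categories corresponding to the Q&A pairs specifically comprises:
    for each Q&A pair, performing the word extraction operation on the question content and answer content of the Q&A pair to obtain a question word set and an answer word set;
    letting each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair;
    for each information record, performing the following operations:
    calculating the probability that the answer word belongs to the category, calculating the specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category;
    multiplying the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word;
    letting the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.
  31. The method according to any one of claims 26 to 30, wherein
    calculating the probability that the answer word belongs to the category specifically comprises:
    Figure PCTCN2014086838-appb-100010
    calculating the specificity with which each answer word explains the question word in the category specifically comprises:
    Figure PCTCN2014086838-appb-100011
    calculating the strength with which the question word is explained by each answer word in the category specifically comprises:
    Figure PCTCN2014086838-appb-100012
    multiplying the probability, specificity, and strength specifically comprises:
    weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
    where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
    #(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
    #(AWj) denotes the number of times the answer word is AWj.
  32. The method according to any one of claims 26 to 31, wherein
    performing the word extraction operation on the question content and answer content of the Q&A pair to be analyzed specifically comprises: performing word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of the Q&A pair to be analyzed.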
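  33. A method for optimizing the search ranking of Q&A pairs, the method comprising the following steps:
    receiving a user's search request and, according to the user's search request, obtaining a plurality of Q&A pairs to be analyzed that match the search request;
    obtaining the degree of association of each Q&A pair to be analyzed according to a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    optimizing the search ranking of the Q&A pairs to be analyzed according to the degrees of association of the Q&A pairs to be analyzed.
  34. The method according to claim 33, wherein obtaining the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base comprising a plurality of Q&A knowledge records comprises performing the following operations for each Q&A pair to be analyzed:
    performing a word extraction operation on the question content and answer content of the Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed;
    selecting at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and calculating the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record.
  35. The method according to claim 33 or 34, wherein optimizing the search ranking of the Q&A pairs to be analyzed according to the degrees of association of the Q&A pairs to be analyzed specifically comprises:
    using the order of the degrees of association of the Q&A pairs to be analyzed as the search ranking of the Q&A pairs to be analyzed.
  36. The method according to any one of claims 33 to 35, wherein the method further comprises:
    extracting a plurality of Q&A pairs in advance from web pages containing Q&A pairs, and building, from the extracted Q&A pairs, a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, crawling the category corresponding to each Q&A pair;
    when building the Q&A knowledge base from the extracted Q&A pairs, building the Q&A knowledge records from the Q&A pairs and the categories corresponding to the Q&A pairs;
    each Q&A knowledge record corresponding to one category and comprising one question word, one answer word, and the semantic relevance between the question word and the answer word.
  37. The method according to any one of claims 33 to 36, wherein
    selecting at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and calculating the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record, specifically comprises:
    selecting the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed;
    obtaining, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category;
    taking the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the Q&A pair to be analyzed.
  38. The method according to any one of claims 33 to 37, wherein
    obtaining, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category specifically comprises:
    weighting and summing the semantic relevance values of the selected Q&A knowledge records corresponding to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
  39. The method according to any one of claims 33 to 38, wherein
    performing the word extraction operation on the question content and answer content of the Q&A pair to be analyzed specifically comprises:
    performing word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of the Q&A pair to be analyzed.
  40. The method according to any one of claims 33 to 39, wherein
    building the Q&A knowledge base from the Q&A pairs and the categories corresponding to the Q&A pairs specifically comprises:
    for each Q&A pair, performing the word extraction operation on the question content and answer content of the Q&A pair to obtain a question word set and an answer word set;
    letting each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair;
    for each information record, performing the following operations:
    calculating the probability that the answer word belongs to the category, calculating the specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category;
    multiplying the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word;
    letting the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.
  41. The method according to any one of claims 33 to 40, wherein
    calculating the probability that the answer word belongs to the category specifically comprises:
    Figure PCTCN2014086838-appb-100013
    calculating the specificity with which each answer word explains the question word in the category specifically comprises:
    Figure PCTCN2014086838-appb-100014
    calculating the strength with which the question word is explained by each answer word in the category specifically comprises:
    Figure PCTCN2014086838-appb-100015
    multiplying the probability, specificity, and strength specifically comprises:
    weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
    where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
    #(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
    #(AWj) denotes the number of times the answer word is AWj.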
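  42. A method for determining the crawl frequency of a network resource point, the method comprising the following steps:
    crawling a plurality of Q&A pairs to be analyzed from the network resource point;
    obtaining the degree of association of each Q&A pair to be analyzed according to a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    determining the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed.
  43. The method according to claim 42, wherein obtaining the degree of association of each Q&A pair to be analyzed according to the Q&A knowledge base comprising a plurality of Q&A knowledge records comprises performing the following operations for each Q&A pair to be analyzed:
    performing a word extraction operation on the question content and answer content of the Q&A pair to be analyzed, obtaining at least one question word to be analyzed and at least one answer word to be analyzed;
    selecting at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and calculating the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record.
  44. The method according to claim 42 or 43, wherein determining the crawl frequency of the network resource point according to the degrees of association of the Q&A pairs to be analyzed specifically comprises:
    using the average of the degrees of association of the Q&A pairs to be analyzed as the crawl frequency of the network resource point;
    or,
    using a spider to obtain an initial crawl frequency of the network resource point, calculating the average of the degrees of association of the Q&A pairs to be analyzed, and adjusting the initial crawl frequency with the average to determine the crawl frequency of the network resource point.
  45. The method according to any one of claims 42 to 44, wherein the method further comprises:
    extracting a plurality of Q&A pairs in advance from web pages containing Q&A pairs, and building, from the extracted Q&A pairs, a Q&A knowledge base comprising a plurality of Q&A knowledge records;
    when extracting the plurality of Q&A pairs from the web pages containing Q&A pairs, crawling the category corresponding to each Q&A pair;
    when building the Q&A knowledge base from the extracted Q&A pairs, building the Q&A knowledge records from the Q&A pairs and the categories corresponding to the Q&A pairs;
    each Q&A knowledge record corresponding to one category and comprising one question word, one answer word, and the semantic relevance between the question word and the answer word.
  46. The method according to any one of claims 42 to 45, wherein
    selecting at least one Q&A knowledge record from the Q&A knowledge base according to the question word to be analyzed and the answer word to be analyzed, and calculating the degree of association of the Q&A pair to be analyzed from the selected Q&A knowledge record, specifically comprises:
    selecting the Q&A knowledge records whose question word matches a question word to be analyzed and whose answer word matches an answer word to be analyzed;
    obtaining, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category;
    taking the maximum of the per-category degrees of association of the Q&A pair to be analyzed as the degree of association of the Q&A pair to be analyzed.
  47. The method according to any one of claims 42 to 46, wherein
    obtaining, from the selected Q&A knowledge records corresponding to the same category, the degree of association of the Q&A pair to be analyzed for each category specifically comprises:
    weighting and summing the semantic relevance values of the selected Q&A knowledge records corresponding to the same category, obtaining the degree of association of the Q&A pair to be analyzed for each category.
  48. The method according to any one of claims 42 to 47, wherein
    performing the word extraction operation on the question content and answer content of the Q&A pair to be analyzed specifically comprises:
    performing word segmentation, stop-word removal, word joining, and entity-word extraction on the question content and answer content of the Q&A pair to be analyzed.
  49. The method according to any one of claims 42 to 48, wherein
    building the Q&A knowledge base from the Q&A pairs and the categories corresponding to the Q&A pairs specifically comprises:
    for each Q&A pair, performing the word extraction operation on the question content and answer content of the Q&A pair to obtain a question word set and an answer word set;
    letting each question word in the question word set and each answer word in the answer word set form one information record under each category corresponding to the Q&A pair;
    for each information record, performing the following operations:
    calculating the probability that the answer word belongs to the category, calculating the specificity with which the answer word explains the question word in the category, and calculating the strength with which the question word is explained by the answer word in the category;
    multiplying the probability, specificity, and strength, the product being the semantic relevance of the answer word and the question word;
    letting the question word, the answer word, and their semantic relevance form one Q&A knowledge record corresponding to the category.
  50. The method according to any one of claims 42 to 49, wherein
    calculating the probability that the answer word belongs to the category specifically comprises:
    Figure PCTCN2014086838-appb-100016
    calculating the specificity with which each answer word explains the question word in the category specifically comprises:
    Figure PCTCN2014086838-appb-100017
    calculating the strength with which the question word is explained by each answer word in the category specifically comprises:
    Figure PCTCN2014086838-appb-100018
    multiplying the probability, specificity, and strength specifically comprises:
    weight(QWi,AWj|C=Ck)=P(Ck|AWj)*specific(QWi,AWj|C=Ck)*interpret(QWi,AWj|C=Ck);
    where P(Ck) denotes the probability that category Ck occurs; P(AWj) denotes the probability that the answer word is AWj; P(AWj|Ck) denotes the probability of answer word AWj within category Ck;
    #(QWi,AWj) denotes the number of times the question word is QWi and the answer word is AWj;
    #(AWj) denotes the number of times the answer word is AWj.
  51. A computer program comprising computer-readable code which, when run on a computing device, causes the computing device to execute the method according to any one of claims 26 to 50.
  52. A computer-readable medium storing the computer program according to claim 51.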
PCT/CN2014/086838 2013-10-21 2014-09-18 获取问答对相关联程度、优化搜索排名的装置和方法 WO2015058604A1 (zh)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
CN201310495881.4 2013-10-21
CN201310495856.6A CN103577557B (zh) 2013-10-21 2013-10-21 一种确定网络资源点的抓取频率的装置和方法
CN201310495641.4 2013-10-21
CN201310495856.6 2013-10-21
CN201310495641.4A CN103577556B (zh) 2013-10-21 2013-10-21 一种获取问答对的相关联程度的装置和方法
CN201310495881.4A CN103577558B (zh) 2013-10-21 2013-10-21 一种优化问答对的搜索排名的装置和方法

Publications (1)

Publication Number Publication Date
WO2015058604A1 true WO2015058604A1 (zh) 2015-04-30

Family

ID=52992233

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/086838 WO2015058604A1 (zh) 2013-10-21 2014-09-18 获取问答对相关联程度、优化搜索排名的装置和方法

Country Status (1)

Country Link
WO (1) WO2015058604A1 (zh)

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14856111

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14856111

Country of ref document: EP

Kind code of ref document: A1