WO2016101812A1 - Method and device for processing search data - Google Patents

Method and device for processing search data

Info

Publication number
WO2016101812A1
Authority
WO
WIPO (PCT)
Prior art keywords
query sequence
entity information
historical query
information corresponding
candidate entity
Prior art date
Application number
PCT/CN2015/097481
Other languages
English (en)
French (fr)
Inventor
谢朋峻
周鑫
郎君
Original Assignee
阿里巴巴集团控股有限公司
谢朋峻
周鑫
郎君
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴集团控股有限公司, 谢朋峻, 周鑫, 郎君
Priority to US15/538,727 (patent US10635678B2)
Priority to JP2017532636A (patent JP6728178B2)
Publication of WO2016101812A1
Priority to US16/822,431 (patent US11347758B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2457 Query processing with adaptation to user needs
    • G06F16/24578 Query processing with adaptation to user needs using ranking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Definitions

  • the present application relates to the field of communications and computers, and in particular, to a method and apparatus for processing search data.
  • Search is a habitual shopping entry point for many users, who enter a variety of queries of interest in the search box. After the user enters a query, the shopping website provides relevant shopping guide information to help clarify the user's intent.
  • There are two commonly used forms of guidance on the search results page: navigation and related search.
  • The navigation area lets users narrow down, step by step, the items they need to purchase, which is an effective way to help users determine their shopping intent.
  • Navigation comprehensively considers the query keywords together with historical factors such as clicks, purchases, and the number of items related to the query terms, provides the categories or attributes most relevant to the search intent, and ultimately helps clarify the user's intent in the form of navigation.
  • Related search provides queries similar or related to the current query, which the user can jump to after searching for a query.
  • For example, the patent application with publication No. CN 103279486 A, entitled "A Method and Apparatus for Providing Related Searches", takes other queries in the same session as the current query as candidate recommendations for the current query, and clusters the candidate recommendations to obtain candidate recommendation clusters for the current query.
  • The recommended query clusters are obtained by taking the semantics of the input query into account, and queries are finally recommended to the user according to the number of searches of each candidate in each cluster.
  • Current navigation (such as product navigation) essentially recalls results (such as products) through the query keywords, then calculates the importance of each CPV (category, attribute, attribute value) of the product collection based on the users' feedback on the recalled results, and recommends CPVs to the user according to their importance.
  • The drawback of this approach is that it relies entirely on the set of recalled results (such as products) and on the category and attribute system of the results themselves. When the query containing a knowledge requirement is long and few results are recalled, or when the category attributes of the results are broad, the information provided by the navigation area is very poor.
  • For example, when the query containing a knowledge requirement is "gift for boyfriend", the category attributes of the recalled products are broad; as shown in Figure 2, when the query containing a knowledge requirement is "Hangzhou specialty products", few products are recalled, and the information provided by the navigation area is not ideal.
  • The recommendation candidates for related search come from queries entered by users and are therefore limited by what users already know. As shown in FIG. 3, when searching for a query containing a knowledge requirement, related search presents similar queries, which cannot satisfy the user's need to obtain an answer.
  • The purpose of the present application is to provide a method and a device for processing search data that mine entity information for historical query sequences containing knowledge requirements and recommend it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user.
  • This solves the problem that search results for historical query sequences containing knowledge requirements are poor, for example that the shopping guide information for knowledge shopping queries is poor.
  • the present application provides a method for processing search data, including:
  • the entity information corresponding to the historical query sequence is determined according to the candidate entity information corresponding to each historical query sequence.
  • extracting the candidate entity information corresponding to the historical query sequence from the search result information corresponding to each historical query sequence includes:
  • the candidate entity information corresponding to the historical query sequence is extracted from the search result information corresponding to the historical query sequence according to the manner in which the candidate entity information is extracted.
  • All candidate entity information corresponding to each historical query sequence is used as the entity information corresponding to the historical query sequence.
  • the search result information corresponding to each historical query sequence obtained includes the text content, the website, the support number and the anti-number of the answer corresponding to the historical query sequence.
  • the candidate entity information corresponding to the historical query sequence is extracted from the text content of the answer corresponding to each historical query sequence.
  • determining the entity information corresponding to the historical query sequence according to the candidate entity information corresponding to each historical query sequence includes:
  • the entity information corresponding to the historical query sequence is filtered from the candidate entity information corresponding to each historical query sequence.
  • the method further includes:
  • the score of the candidate entity information corresponding to each historical query sequence is calculated according to the following formula:
  • entity1 represents an entity word
  • m represents the total number of sites
  • i represents a site in m sites
  • n represents the total number of responses for a site i
  • j represents one of the n answers
  • E_ij indicates whether entity1 appears in answer j of site i: 1 if it appears, 0 if it does not
  • weight1_i represents the weight of site i
  • weight2_j represents the weight of answer j
  • weight2_j is determined from the support number and the anti-number of the answer; it is a positive integer greater than or equal to 1, and its default value is 1.
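  • The formula itself is not reproduced in this text. One reading that is consistent with the symbol definitions above, offered as a reconstruction under that assumption rather than as the authoritative claim language, is a doubly weighted count of the answers that mention the entity:

```latex
\mathrm{score}(\mathit{entity1})
  = \sum_{i=1}^{m} \sum_{j=1}^{n} E_{ij} \cdot \mathit{weight1}_i \cdot \mathit{weight2}_j
```

  • Under this reading, n is the number of answers on site i, so the inner sum runs over the answers of one site and the outer sum over all m sites.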
  • filtering the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence includes:
  • filtering, according to the score of each candidate entity information, the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence.
  • the method further includes:
  • a score of the filtered entity information is obtained according to the score of each candidate entity information.
  • the method further includes:
  • the method further includes:
  • the application also provides an apparatus for processing search data, including:
  • a first device configured to acquire search result information corresponding to each history query sequence that includes a knowledge requirement
  • a second device configured to extract, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to the historical query sequence;
  • the third device is configured to determine, according to candidate entity information corresponding to each historical query sequence, entity information corresponding to the historical query sequence.
  • the second device includes:
  • a first unit configured to determine, according to a type of each historical query sequence, a manner of extracting candidate entity information corresponding to the historical query sequence
  • a second unit configured to extract, according to the manner of extracting candidate entity information determined for each historical query sequence, the candidate entity information corresponding to the historical query sequence from the search result information corresponding to that historical query sequence.
  • the third device is configured to use, as the entity information corresponding to the historical query sequence, all candidate entity information corresponding to each historical query sequence.
  • the search result information corresponding to each historical query sequence acquired by the first device includes the text content, the website, the support number, and the anti-number of the answer corresponding to the historical query sequence.
  • the second device extracts candidate entity information corresponding to the historical query sequence from the text content of the response corresponding to each historical query sequence
  • the third device filters the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence.
  • the fourth device is further configured to calculate a score of the candidate entity information corresponding to each historical query sequence.
  • the fourth device calculates a score of the candidate entity information corresponding to each historical query sequence according to the following formula:
  • entity1 represents an entity word
  • m represents the total number of sites
  • i represents a site in m sites
  • n represents the total number of responses for a site i
  • j represents one of the n answers
  • E_ij indicates whether entity1 appears in answer j of site i: 1 if it appears, 0 if it does not
  • weight1_i represents the weight of site i
  • weight2_j represents the weight of answer j
  • weight2_j is determined from the support number and the anti-number of the answer; it is a positive integer greater than or equal to 1, and its default value is 1.
  • the third device is configured to filter, according to the score of each candidate entity information, the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence.
  • the third device is further configured to obtain a score of the filtered corresponding entity information according to the score of each candidate entity information.
  • a fifth device configured to search for a corresponding historical query sequence according to a current query sequence that includes a knowledge requirement
  • the sixth device is configured to obtain entity information corresponding to the searched historical query sequence.
  • the sixth device is further configured to obtain a score of the entity information corresponding to the searched historical query sequence, and sort the entity information according to the score of each entity information.
  • For historical query sequences containing knowledge requirements, the present application extracts entity information for the historical query sequence and provides it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for query sequences containing knowledge requirements are currently poor.
  • The present application filters the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence, so as to delete incorrect or imprecise candidate entity information and select the accurate candidate entity information as entity information, thereby providing the user with better and more accurate entity information.
  • The present application calculates a score of the candidate entity information corresponding to each historical query sequence, so that entity information can be further filtered from the candidate entity information according to the scores, or the filtered entity information can be sorted before being provided to the user.
  • FIG. 1 shows a search result graph of an existing navigation method
  • FIG. 2 is a diagram showing another search result of the existing navigation method
  • FIG. 3 shows a search result graph of an existing related search method
  • FIG. 4 illustrates a flow chart of a method for processing search data in accordance with an aspect of the present application
  • Figure 5 shows a search result graph of the present application
  • Figure 6 shows another search result graph of the present application
  • FIG. 7 shows a flow chart of a method for processing search data in a preferred embodiment of the present application
  • FIG. 8 shows a flow chart of a method for processing search data in another preferred embodiment of the present application.
  • FIG. 9 shows a schematic diagram of an apparatus for processing search data in another aspect of the present application.
  • FIG. 10 is a schematic diagram of an apparatus for processing search data according to a preferred embodiment of the present application.
  • FIG. 11 shows a schematic diagram of an apparatus for processing search data in another preferred embodiment of the present application.
  • Figure 12 shows a schematic diagram of an apparatus for processing search data in yet another preferred embodiment of the present application.
  • the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • the memory may include non-persistent memory, random access memory (RAM), and/or non-volatile memory in a computer readable medium, such as read only memory (ROM) or flash memory.
  • Memory is an example of a computer readable medium.
  • Computer readable media includes both permanent and non-persistent, removable and non-removable media.
  • Information storage can be implemented by any method or technology.
  • the information can be computer readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), electrically erasable programmable read only memory (EEPROM), flash memory or other memory technology, compact disk read only memory (CD-ROM), digital versatile disk (DVD) or other optical storage,
  • computer readable media does not include transitory computer readable media, such as modulated data signals and carrier waves.
  • the present application provides a method for processing search data, including:
  • Step S1 acquiring search result information corresponding to each historical query sequence containing knowledge requirements
  • Step S2 extracting candidate entity information corresponding to the historical query sequence from the search result information corresponding to each historical query sequence;
  • Step S3 Determine entity information corresponding to the historical query sequence according to candidate entity information corresponding to each historical query sequence.
  • For historical query sequences containing knowledge requirements, the present application mines entity information for the historical query sequence and provides it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for historical query sequences containing knowledge requirements are currently poor.
  • The present application can adopt an information extraction approach: first identify the historical queries containing knowledge requirements, then fetch from external community websites the search result information related to those historical queries, and mine the desired entity information from the search result information and store it as answers in a knowledge base. Subsequently, when a user's current query sequence containing a knowledge requirement matches a historical query sequence, the entity information corresponding to that historical query sequence can be recommended to the user from the knowledge base.
  • the entity information may be information of an object that exists objectively and can be distinguished from each other.
  • the entity information may be information about a concrete person, event, or thing, or about an abstract concept or relationship.
  • the historical query sequence containing the knowledge requirements may be a knowledge shopping query, such as "a practical gift for parents" in FIG. 5, or "a gift to a boyfriend" as shown in FIG.
  • the method of the present application can be used to extract entity information from the community data of websites as an answer to the user, so as to improve the accuracy of the entity information recommended to the user; the corresponding entity information is a targeted product recommendation, which can solve the problem that shopping guide information for knowledge shopping queries is currently poor.
  • the user may obtain N levels of entity information, where N is a positive integer, and the entity information of each level is obtained from the entity information of the previous level. For example, the entity information obtained at each of the first N-1 levels can serve as a new historical query sequence containing a knowledge requirement, so that the next level of entity information is obtained from the historical query sequence of the previous level.
  • Except for the entity information of the Nth level, the entity information of each level is therefore also a historical query sequence, from which the next level of entity information is obtained, and so on, until the entity information of level N-1 (in this case, a historical query sequence) yields specific entity information of the Nth level, such as commodity information.
  • The entity information obtained at the first N-1 levels can be displayed to the user in the form of multi-level recommendation labels. When the user clicks a recommendation label of a certain level, the user jumps to the recommendation labels of the next level.
  • This step-by-step jumping leads the user to the exact entity information desired, until the final Nth-level specific entity information, such as specific product information, is obtained.
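  • The multi-level label navigation described above can be sketched as a lookup in which each label of one level is itself a key for the next level. This is a minimal sketch: the `KNOWLEDGE_BASE` contents, the sub-labels under "watch", and the function names are hypothetical and only illustrate the idea.

```python
from typing import Dict, List

# Hypothetical knowledge base: each (historical) query maps to the entity
# labels mined for it. A label that is itself a key has a further level.
KNOWLEDGE_BASE: Dict[str, List[str]] = {
    "gift for boyfriend": ["watch", "wallet"],
    "watch": ["mechanical watch", "sports watch"],
}


def next_level(label: str) -> List[str]:
    """Return the next-level labels for a clicked label.

    An empty list means the label is at the final level (a specific item).
    """
    return KNOWLEDGE_BASE.get(label, [])


def drill_down(start: str, clicks: List[int]) -> List[str]:
    """Follow a sequence of click indices and return the labels visited."""
    path = [start]
    current = start
    for index in clicks:
        labels = next_level(current)
        if not labels:  # final level reached, nothing left to click
            break
        current = labels[index]
        path.append(current)
    return path
```

  • For instance, `drill_down("gift for boyfriend", [0, 1])` walks from the query to "watch" and then to "sports watch", which has no further level in this toy data.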
  • step S2 of FIG. 4 includes:
  • Step S21 Determine, according to the type of each historical query sequence, a manner of extracting candidate entity information corresponding to the historical query sequence;
  • Step S22 Extract candidate entity information corresponding to the historical query sequence from the search result information corresponding to the historical query sequence according to the manner of extracting candidate entity information corresponding to each historical query sequence.
  • Before step S21, all historical query sequences may be analyzed and summarized to derive the different types of historical query sequences containing knowledge requirements; then, in step S21, the manner of extracting candidate entity information corresponding to each historical query sequence is determined according to its type. For example, the types of historical query sequences containing knowledge requirements can be classified into the following categories:
  • Place name + "specialty product" indicates that the user wants to learn the specialties of a certain place
  • "Send" + form of address + "gift" indicates that the user wants to learn which gifts to give to the person addressed
  • Category word + "brand" indicates that the user wants to learn the best-selling brands of a certain category
  • Category word + "accessory" indicates that the user wants to learn the other accessories of a certain category.
  • Correspondingly, for a historical query sequence of the place name + "specialty product" type, the manner of extracting the candidate entity information is to extract the names of specialty products as the entity information; for the "send" + form of address + "gift" type, to extract the names of gifts; for the category word + "brand" type, to extract the names of brands; and for the category word + "accessory" type, to extract the names of accessories.
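  • The mapping from query type to extraction manner in step S21 can be sketched with pattern matching. This is a minimal sketch under stated assumptions: the English regular expressions stand in for the four query types named in the text, and `QUERY_TYPES` and `extraction_manner` are hypothetical names rather than anything defined by the application.

```python
import re
from typing import Optional, Tuple

# Hypothetical English patterns standing in for the four query types.
# Each entry: (type name, pattern with one slot, kind of entity to extract).
QUERY_TYPES = [
    ("place name + specialty product",
     re.compile(r"^(?P<slot>.+) specialty products?$"), "specialty name"),
    ("send + form of address + gift",
     re.compile(r"^gift for (?P<slot>.+)$"), "gift name"),
    ("category word + brand",
     re.compile(r"^(?P<slot>.+) brands?$"), "brand name"),
    ("category word + accessory",
     re.compile(r"^(?P<slot>.+) accessor(?:y|ies)$"), "accessory name"),
]


def extraction_manner(query: str) -> Optional[Tuple[str, str, str]]:
    """Return (query type, slot value, entity kind to extract), or None
    when the query matches none of the knowledge-requirement types."""
    normalized = query.strip().lower()
    for type_name, pattern, target in QUERY_TYPES:
        match = pattern.match(normalized)
        if match:
            return type_name, match.group("slot"), target
    return None
```

  • A query that matches no pattern returns `None`, which a caller could treat as "no knowledge requirement detected".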
  • step S3 of FIG. 4 all candidate entity information corresponding to each history query sequence is used as entity information corresponding to the history query sequence.
  • all the candidate entity information can be directly recommended as the entity information to the user without being deleted, so as to save data processing amount and improve the recommendation speed.
  • the search result information corresponding to each historical query sequence obtained includes the text content, the website, the support number, and the anti-number of the answer corresponding to the historical query sequence.
  • General crawling technology can be used to fetch, from community websites such as Baidu, Soso Wenwen, and Taobao Q&A, the search result information corresponding to each historical query sequence (query) containing a knowledge requirement.
  • The crawled search result information is parsed as web page data: not only the text content of each answer is parsed, but also information such as the website, the support number, and the anti-number, for use in the subsequent extraction and scoring of candidate entity information.
  • An example of the crawled result data is shown in Table 1:
  • The above search result information is only an example; other existing or future search result information, if applicable to the present application, should also fall within the protection scope of the present application and is incorporated herein by reference.
  • step S2 of FIG. 4 extracts candidate entity information corresponding to the historical query sequence from the text content of the answer corresponding to each historical query sequence.
  • Candidate entity information can be identified from the text content of the answers by, for example, rule-based methods, hidden Markov model-based methods, or conditional random field-based methods.
  • In a specific application scenario, in order to solve the shopping guide problem of knowledge shopping queries, category entities need to be selected from the text content of the answers; the form of the resulting candidate entity information can be as shown in Table 2:
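  • Of the approaches listed, the rule-based one is the simplest to illustrate. The sketch below matches answer text against a small dictionary of category words; the dictionary contents, the naive plural handling, and the function name are assumptions made for illustration, and a production system would more likely rely on a product taxonomy or a trained sequence model.

```python
import re
from typing import List

# Hypothetical category dictionary used as the extraction rule.
CATEGORY_WORDS = ["watch", "wallet", "lighter", "belt", "scarf", "razor"]


def extract_candidates(answer_text: str,
                       dictionary: List[str] = CATEGORY_WORDS) -> List[str]:
    """Return the dictionary words mentioned in an answer, ordered by first
    appearance and without duplicates."""
    text = answer_text.lower()
    hits = []
    for word in dictionary:
        # Whole-word match, tolerating a simple "s"/"es" plural ending.
        match = re.search(r"\b" + re.escape(word) + r"(?:s|es)?\b", text)
        if match:
            hits.append((match.start(), word))
    return [word for _, word in sorted(hits)]
```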
  • step S3 of FIG. 4 includes step S31, and the entity information corresponding to the historical query sequence is filtered from the candidate entity information corresponding to each historical query sequence.
  • the candidate entity information may be checked against the historical query sequence, incorrect or imprecise candidate entity information deleted, and the accurate candidate entity information selected as the entity information, so that better and more accurate entity information is provided to the user.
  • the method further includes:
  • the search result information similar to that of each query shown in Table 1 and the candidate entity information extracted from the search result information similar to those shown in Table 2 can be used to score the candidate entity information.
  • the entity information is further filtered from the candidate entity information according to the scoring, or the filtered entity information is sorted and then provided to the user. For example, a score similar to the candidate entity information corresponding to each historical query sequence shown in Table 3 can be obtained:
  • the score of the candidate entity information corresponding to each historical query sequence can be calculated according to the following formula:
  • entity1 represents an entity word
  • m represents the total number of sites
  • i represents a site in m sites
  • n represents the total number of responses for a site i
  • j represents one of the n answers
  • E_ij indicates whether entity1 appears in answer j of site i: 1 if it appears, 0 if it does not
  • weight1_i represents the weight of site i
  • weight2_j represents the weight of answer j
  • weight2_j is determined from the support number and the anti-number of the answer; it is a positive integer greater than or equal to 1, and its default value is 1.
  • The value of weight2_j is obtained by subtracting the anti-number from the support number. If the support number minus the anti-number is less than or equal to zero, weight2_j takes the default value 1.
  • weight1_i can take a default value or be obtained based on the PageRank algorithm.
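  • The scoring described by these definitions can be sketched as follows. The sketch assumes that the score is the sum, over all sites and answers, of E_ij multiplied by the site weight and the answer weight, and that the default site weight is 1.0; the `Answer` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Answer:
    site: str                      # website the answer was crawled from
    entities: List[str] = field(default_factory=list)  # extracted candidates
    support: int = 0               # support number
    oppose: int = 0                # anti-number


def answer_weight(answer: Answer) -> int:
    """weight2_j: support number minus anti-number, or 1 if that is <= 0."""
    diff = answer.support - answer.oppose
    return diff if diff > 0 else 1


def score_entities(answers: List[Answer],
                   site_weights: Optional[Dict[str, float]] = None
                   ) -> Dict[str, float]:
    """Accumulate weight1_i * weight2_j for every answer mentioning an entity."""
    site_weights = site_weights or {}
    scores: Dict[str, float] = {}
    for answer in answers:
        w1 = site_weights.get(answer.site, 1.0)  # assumed default site weight
        w2 = answer_weight(answer)
        # E_ij is 0 or 1, so an entity counts once per answer.
        for entity in set(answer.entities):
            scores[entity] = scores.get(entity, 0.0) + w1 * w2
    return scores
```

  • Site weights could instead come from a PageRank-style computation, as the text suggests; here they are simply passed in as a dictionary.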
  • the entity information corresponding to the historical query sequence is filtered from the candidate entity information corresponding to each historical query sequence according to the score of each candidate entity information.
  • the candidate entity information with higher scores may be filtered out from the candidate entity information of each historical query sequence as the entity information corresponding to the historical query sequence.
  • the method further includes:
  • a score of the filtered entity information is obtained according to the score of each candidate entity information.
  • For example, suppose the candidate entity information and scores are "watch: 55; wallet: 46; lighter: 32; belt: 22; scarf: 22; razor: 20; bracelet: 18; belt: 18; tie: 18".
  • If the filtered entity information and scores are "watch: 55; wallet: 46; lighter: 32; belt: 22; scarf: 22; razor: 20", the scores of the retained candidate entity information are used as the scores of the filtered entity information.
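  • The text does not state the exact cut-off rule, only that higher-scoring candidates are kept. A minimal sketch, assuming a hypothetical `min_score` threshold, reproduces the example above when the threshold is 20:

```python
from typing import List, Tuple

ScoredEntity = Tuple[str, int]


def filter_entities(scored: List[ScoredEntity],
                    min_score: int) -> List[ScoredEntity]:
    """Keep candidates whose score reaches min_score, highest score first.

    Retained candidates keep their original scores. Python's sort is stable,
    so equally scored candidates stay in their input order.
    """
    kept = [(name, score) for name, score in scored if score >= min_score]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# Candidate scores taken from the example in the text.
candidates = [
    ("watch", 55), ("wallet", 46), ("lighter", 32), ("belt", 22),
    ("scarf", 22), ("razor", 20), ("bracelet", 18), ("belt", 18), ("tie", 18),
]
filtered = filter_entities(candidates, min_score=20)
```

  • A list of pairs is used instead of a dictionary because the example contains two "belt" entries with different scores.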
  • Acquiring the above search result information, candidate entity information, entity information, and scores involves large-scale data processing and calls for large-scale parallel computing; specifically, it can be implemented on a cloud computing platform.
  • the method further includes:
  • Step S4 searching for a corresponding historical query sequence according to a current query sequence containing knowledge requirements
  • Step S5 Obtain entity information corresponding to the searched historical query sequence.
  • The processes of step S4 and step S5 can be implemented by an online server, with the historical query sequences and the corresponding entity information pre-stored in a knowledge base. Through a terminal, the user submits to the online server a request to search for the historical query sequence corresponding to the current query sequence containing a knowledge requirement.
  • If the online server finds the corresponding historical query sequence in the knowledge base, it directly presents the corresponding entity information to the user in the navigation area in the form of labels, and the user can click a label to continue with network operations such as shopping.
  • the online server may split the current query sequence including the knowledge requirement into multiple keyword sequences, and then search the corresponding historical query sequence according to the multiple keyword sequences to improve the hit rate of the historical query sequence.
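  • The online lookup of steps S4 and S5 can be sketched as a dictionary lookup with a keyword fallback. This is a sketch under stated assumptions: the matching rule (all keywords of the current query must occur in the historical query), the whitespace tokenization, and the names `KnowledgeBase` and `lookup` are illustrative choices that the text does not specify; Chinese queries would need a word segmenter instead of `split()`.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory knowledge base:
# historical query -> list of (entity, score) pairs.
KnowledgeBase = Dict[str, List[Tuple[str, int]]]


def lookup(current_query: str, kb: KnowledgeBase) -> List[str]:
    """Return the entity labels for a query, highest score first.

    First try an exact hit on the historical query; otherwise split the
    current query into keywords and accept the first historical query that
    contains all of them. Returns an empty list when nothing matches.
    """
    query = current_query.strip().lower()
    entry = kb.get(query)
    if entry is None:
        keywords = set(query.split())
        for historical_query, entities in kb.items():
            if keywords and keywords <= set(historical_query.split()):
                entry = entities
                break
    if entry is None:
        return []
    ranked = sorted(entry, key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked]
```

  • In a deployment, the dictionary would typically be replaced by a key-value store that supports real-time queries.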
  • the method further includes:
  • the process of searching for a corresponding historical query sequence and the corresponding entity information may be implemented by a key-value system that supports real-time queries.
  • an apparatus 100 for processing search data includes:
  • the first device 1 is configured to obtain search result information corresponding to each historical query sequence containing knowledge requirements
  • the second device 2 is configured to extract, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to the historical query sequence;
  • the third device 3 is configured to determine, according to candidate entity information corresponding to each historical query sequence, entity information corresponding to the historical query sequence.
  • For historical query sequences containing knowledge requirements, the present application mines entity information for the historical query sequence and provides it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for historical query sequences containing knowledge requirements are currently poor.
  • the application can adopt the method of information extraction, firstly identifying the historical query containing the knowledge requirement, and then extracting the search result information from the external network community data related to the historical query containing the knowledge requirement, and mining the desired entity from the search result information. Information is deposited as an answer to a knowledge base. Therefore, when the subsequent user can search for the corresponding historical query sequence according to the current query sequence including the knowledge requirement, the entity information corresponding to the searched historical query sequence can be recommended to the user based on the knowledge base.
  • the entity information may be information of an object that exists objectively and can be distinguished from each other.
  • the entity information may be information of a specific person, thing, or thing, or may be an abstract concept or contact information.
  • the historical query sequence containing the knowledge requirements may be for an intellectual shopping query, such as "a practical gift for parents" in FIG. 5, or "a gift to a boyfriend" as shown in FIG.
  • the method of the present application can be used to extract the entity information from the community data of the website as an answer to the user, so as to improve the accuracy of the entity information recommended to the user, and the corresponding entity information is a targeted recommendation product, which can solve the current knowledge. Sex shopping query guide information is poor.
  • the user may obtain N-level entity information, where N is a positive integer, and the entity information of the latter stage is obtained by the entity information of the previous level, for example, the entity obtained by the previous N-1 level.
  • the information can be a new historical query sequence containing knowledge requirements, so that the next level of entity information is obtained according to the historical query sequence of the previous level, except for the entity information of the Nth level, the entity information of the next level.
  • the information is also a historical query sequence, and then the next level of the historical query sequence to obtain the next level of entity information, and so on, until the N-1 level of entity information (in this case, a historical query sequence)
  • a specific entity information of the Nth level, such as the commodity information, the entity information obtained by the first N-1 level may be displayed to the user in the form of a multi-level recommendation label, and may be hopped when the user clicks on the recommendation label of a certain level.
  • the second device 2 includes:
  • a first unit 21 configured to determine, according to a type of each historical query sequence, a manner of extracting candidate entity information corresponding to the historical query sequence;
  • the second unit 22 is configured to extract candidate entity information corresponding to the historical query sequence from the search result information corresponding to the historical query sequence according to the manner of extracting candidate entity information corresponding to each historical query sequence.
  • all historical query sequences may first be analyzed and summarized to derive the different types of historical query sequences containing knowledge requirements; the first unit 21 then determines, according to the type of each historical query sequence, the manner of extracting candidate entity information for it. For example, the types may be divided into the following categories:
  • place name + "specialty": the user wants to learn about the specialties of a place;
  • "send" + form of address + "gift": the user wants shopping guidance for gift giving;
  • category word + "brand": the user wants to learn the best-selling brands of a category;
  • category word + "accessory": the user wants to learn about other accessories for a category.
  • for a historical query sequence of the place name + "specialty" type, the manner of extraction is to extract the names of specialties as entity information; for the "send" + form of address + "gift" type, to extract the names of gifts; for the category word + "brand" type, to extract the names of brands; and for the category word + "accessory" type, to extract the names of accessories.
  • the third device 3 is configured to use all candidate entity information corresponding to each historical query sequence as the entity information corresponding to the historical query sequence.
  • all candidate entity information corresponding to each history query sequence is used as entity information corresponding to the history query sequence.
  • the search result information corresponding to each historical query sequence acquired by the first device 1 includes the text content of the answers corresponding to the historical query sequence, the website, the support count, and the opposition count.
  • general-purpose crawler technology can be used to fetch, from community websites such as Baidu Zhidao, Soso Q&A, and Taobao Q&A, the search result information corresponding to a historical query sequence (query) containing a knowledge requirement, and to parse the fetched search result information (for example, web page data). Parsing covers not only the text content of each answer but also the answer's website, support count, opposition count, and similar information, for later use in extracting and scoring candidate entity information.
  • an example of the crawled result data is shown in Table 1:
  • the above description of search result information is only an example; other existing or future forms of search result information, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.
  • the second device 2 extracts candidate entity information corresponding to the historical query sequence from the text content of the answer corresponding to each historical query sequence.
  • given structured search result information for each query like that shown in Table 1, the required candidate entity information still has to be extracted from it; here it can be extracted from the text content of the answers corresponding to each historical query sequence.
  • there are many ways to identify candidate entity information in the text content of an answer, such as rule-based methods, methods based on hidden Markov models, and methods based on conditional random fields.
  • candidate entity information extracted from answer text can be of many types.
  • in a specific scenario, for example shopping guidance for knowledge-oriented shopping queries, the category entities need to be selected; the resulting candidate entity information may look like Table 2:
  • the third device 3 filters the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence.
  • the candidate entity information may be checked against the historical query sequence: inaccurate or insufficiently accurate candidates are deleted and the accurate ones are kept as the entity information, yielding better and more accurate entity information to provide to the user.
  • the device further includes a fourth device 4, configured to calculate a score of candidate entity information corresponding to each historical query sequence.
  • given structured search result information for each query like that shown in Table 1, and candidate entity information extracted from it like that shown in Table 2, the candidate entity information can also be scored, so that entity information can later be further selected from the candidates according to the scores, or the selected entity information can be sorted before being provided to the user. For example, scores like those shown in Table 3 can be obtained for the candidate entity information corresponding to each historical query sequence:
  • the fourth device 4 calculates a score of candidate entity information corresponding to each historical query sequence according to the following formula:
  • $$\mathrm{Score}(\mathrm{entity1}) = \sum_{i=1}^{m} \sum_{j=1}^{n} E_{ij} \cdot \mathrm{Weight1}_{i} \cdot \mathrm{Weight2}_{j}$$
  • entity1 denotes an entity word
  • m denotes the total number of websites
  • i denotes one of the m websites
  • n denotes the total number of answers on website i
  • j denotes one of the n answers
  • E_ij indicates whether entity1 appears in answer j of website i: 1 if it appears, 0 if it does not
  • Weight1_i denotes the weight of website i
  • Weight2_j denotes the weight of answer j
  • the value of Weight2_j is determined by the support count and opposition count of answer j; Weight2_j is a positive integer greater than or equal to 1, and its default value is 1.
  • for example, Weight2_j is obtained by subtracting the opposition count from the support count; if the result is less than or equal to zero, Weight2_j takes the default value 1.
  • Weight1_i can be preset or obtained with the PageRank algorithm.
  • the third device 3 is configured to filter, according to the score of each candidate entity information, the entity information corresponding to the historical query sequence from the candidate entity information corresponding to each historical query sequence.
  • the candidate entity information with higher scores may be filtered out from the candidate entity information of each historical query sequence as the entity information corresponding to the historical query sequence.
  • the third device 3 is further configured to obtain the scores of the selected entity information from the scores of the corresponding candidate entity information.
  • for example, if the candidate entity information and scores are "watch: 55; wallet: 46; lighter: 32; waist belt: 22; scarf: 22; razor: 20; bracelet: 18; leather belt: 18; tie: 18" and the selected entity information and scores are "watch: 55; wallet: 46; lighter: 32; waist belt: 22; scarf: 22; razor: 20", the retained candidates and their scores are used as the selected entity information and scores.
  • obtaining the search result information, the candidate entity information, and the entity information and scores involves large-scale data processing and calls for large-scale parallel computing; in an embodiment of the present application this can be done on a cloud computing platform.
  • the device further includes:
  • a fifth device 5 configured to search for a corresponding historical query sequence according to a current query sequence that includes a knowledge requirement
  • the sixth device 6 is configured to obtain entity information corresponding to the searched historical query sequence.
  • the functions of the fifth device 5 and the sixth device 6 can be implemented by an online server. The historical query sequences and their entity information are pre-stored in a knowledge base; through a terminal, the user submits to the online server a request to look up the historical query sequence matching a current query sequence containing a knowledge requirement. If the online server finds a matching historical query sequence in the knowledge base, it presents the corresponding entity information to the user directly in the navigation area as labels, and the user can click a label to continue with a network operation such as shopping.
  • the online server may split the current query sequence including the knowledge requirement into multiple keyword sequences, and then search the corresponding historical query sequence according to the multiple keyword sequences to improve the hit rate of the historical query sequence.
  • the sixth device 6 is further configured to obtain the scores of the entity information corresponding to the historical query sequence found, and to sort the entity information by score. For example, higher-scoring entity information can be placed first and lower-scoring entity information later when it is provided to the user, making it more efficient for the user to choose.
  • the process by which the fifth device 5 and the sixth device 6 look up the corresponding historical query sequence and its entity information may be implemented by a key-value system that supports real-time queries.
  • in a specific application example, to address the poor shopping-guide information for knowledge-oriented shopping queries, the search result information (such as "website", "answer text", "support count", and "opposition count") corresponding to a historical query sequence such as "gifts for a boyfriend" can first be fetched from community websites such as Baidu Zhidao, Soso Q&A, and Taobao Q&A, as shown in Table 1. Candidate entity information such as "clothes, tie, belt, watch, briefcase, pen" is then extracted from the "answer text" of that search result information. Each candidate is scored and the candidates are selected by score; if, for example, the score of "pen" is very low, "pen" is deleted, giving selected entity information and scores like those in Table 3, for example the entity information "clothes, tie, belt, watch, briefcase". The entity information can then be sorted by score so that higher-scoring items appear nearer the front, where the user can see and select them more easily, improving recommendation accuracy.
  • in summary, for historical query sequences containing knowledge requirements, the present application can mine entity information for each historical query sequence and recommend it to the user as an answer, which improves the accuracy of the recommended entity information and addresses the currently poor search results for such query sequences.
  • the present application selects the entity information corresponding to each historical query sequence from its candidate entity information, deleting inaccurate or insufficiently accurate candidates and keeping the accurate ones as entity information, thereby obtaining better and more accurate entity information to provide to the user.
  • the present application calculates a score of the candidate entity information corresponding to each historical query sequence, so as to further filter the entity information from the candidate entity information according to the scoring, or sort the filtered entity information and provide the information to the user.
  • the present application can be implemented in software and/or a combination of software and hardware, for example, using an application specific integrated circuit (ASIC), a general purpose computer, or any other similar hardware device.
  • the software program of the present application can be executed by a processor to implement the steps or functions described above.
  • the software programs (including related data structures) of the present application can be stored in a computer readable recording medium such as a RAM memory, a magnetic or optical drive or a floppy disk and the like.
  • some of the steps or functions of the present application may be implemented in hardware, for example, as a circuit that cooperates with a processor to perform various steps or functions.
  • a portion of the present application can be applied as a computer program product, such as computer program instructions, which, when executed by a computer, can invoke or provide a method and/or technical solution in accordance with the present application.
  • the program instructions for invoking the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted as a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions.
  • an embodiment in accordance with the present application includes an apparatus comprising a memory for storing computer program instructions and a processor for executing them; when the computer program instructions are executed by the processor, the apparatus is triggered to operate based on the methods and/or technical solutions of the embodiments of the present application described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

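A method and device for processing search data. For historical query sequences containing knowledge requirements, entity information for each historical query sequence is mined and recommended to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for historical query sequences containing knowledge requirements are currently poor.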

Description

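Method and Device for Processing Search Data
Technical Field
The present application relates to the field of communications and computers, and in particular to a method and device for processing search data.
Background
With the growing popularity of e-commerce applications, online shopping has gradually become part of ordinary users' daily lives. Search is the habitual shopping entry point for many users, who type all kinds of query sequences (queries) of interest into the search box. After a user enters a query, the shopping website provides related shopping-guide information to help clarify the user's intent. Search result pages commonly offer two forms of shopping guidance: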
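1. Navigation
The navigation area lets the user narrow down, step by step through filtering, the product to be purchased, and is an effective way to help the user settle on a shopping intent. For example, the patent application with publication number CN 103218719 A, entitled "E-commerce website navigation method and system", draws on the strengths of category-click navigation and category-product-count navigation, and takes into account historical factors such as clicks and purchases associated with the query keywords, as well as the number of products related to the query terms, to provide the categories or attributes most relevant to the search intent, ultimately helping the user clarify that intent in the form of navigation.
2. Related search
Related search offers, after the user searches for a query, queries that are similar or related to the current query for the user to jump to. In the patent application with publication number CN 103279486 A, entitled "Method and apparatus for providing related searches", other queries that co-occur with the current query in the same session form the candidate recommendations for the current query; the candidates are then clustered by similarity to obtain candidate recommendation clusters for the current query. At online recommendation time, the recommended query clusters are obtained using the semantics of the input query, and queries are finally recommended to the user according to the search counts of the candidates within each cluster.
For ordinary queries with an explicit intent, both of the existing schemes above can provide good shopping-guide information. For queries containing a knowledge requirement, however, neither existing scheme satisfies the user's intent well.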
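1. Shortcomings of navigation
Current navigation (such as product navigation) essentially recalls results (such as products) by the query keywords, then computes the importance of different CPVs (category, property, property value) from users' click feedback on the CPVs of the recalled result set, and makes recommendations to the user by importance. The drawback is that it depends entirely on the set of recalled results and on the results' own category-attribute system. When a query containing a knowledge requirement is long, so that few results are recalled, or when the category attributes of the results are broad, the information offered by the navigation area provides poor guidance. For example, as shown in FIG. 1, for the knowledge-requirement query "gifts for a boyfriend" the category attributes of the recalled products are broad; and as shown in FIG. 2, for the knowledge-requirement query "what are the specialties of Hangzhou" few products are recalled. In both cases the guidance offered by the navigation area is unsatisfactory.
2. Shortcomings of related search
The recommendation candidates of related search come from queries entered by users, and for that very reason they are limited by what users already know. As shown in FIG. 3, when a query containing a knowledge requirement is searched, related search presents only similar queries, which cannot satisfy the user's need for an answer.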
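Summary
The purpose of the present application is to provide a method and device for processing search data which, for historical query sequences containing knowledge requirements, mine entity information for each historical query sequence and recommend it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for historical query sequences containing knowledge requirements are currently poor, such as the poor shopping-guide information currently offered for knowledge-oriented shopping queries.
In view of this, the present application provides a method for processing search data, comprising:
obtaining search result information corresponding to each historical query sequence containing a knowledge requirement;
extracting, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to that historical query sequence;
determining, according to the candidate entity information corresponding to each historical query sequence, entity information corresponding to that historical query sequence.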
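Further, extracting, from the search result information corresponding to each historical query sequence, the candidate entity information corresponding to that historical query sequence comprises:
determining, according to the type of each historical query sequence, a manner of extracting candidate entity information for that historical query sequence;
extracting, in the manner determined for each historical query sequence, the candidate entity information corresponding to that historical query sequence from its search result information.
Further, in determining the entity information corresponding to each historical query sequence according to its candidate entity information,
all candidate entity information corresponding to each historical query sequence is used as the entity information corresponding to that historical query sequence.
Further, in obtaining the search result information corresponding to each historical query sequence containing a knowledge requirement,
the obtained search result information corresponding to each historical query sequence includes the text content of the answers corresponding to that historical query sequence, the website, the support count, and the opposition count.
Further, in extracting the candidate entity information corresponding to each historical query sequence from its search result information,
the candidate entity information is extracted from the text content of the answers corresponding to each historical query sequence.
Further, determining the entity information corresponding to each historical query sequence according to its candidate entity information comprises:
selecting the entity information corresponding to each historical query sequence from its candidate entity information.
Further, after extracting the candidate entity information corresponding to each historical query sequence from its search result information, the method further comprises:
calculating a score for the candidate entity information corresponding to each historical query sequence.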
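Further, the score of the candidate entity information corresponding to each historical query sequence is calculated according to the following formula:
$$\mathrm{Score}(\mathrm{entity1}) = \sum_{i=1}^{m} \sum_{j=1}^{n} E_{ij} \cdot \mathrm{Weight1}_{i} \cdot \mathrm{Weight2}_{j}$$
where entity1 denotes an entity word, m denotes the total number of websites, i denotes one of the m websites, n denotes the total number of answers on website i, j denotes one of the n answers, E_ij indicates whether entity1 appears in answer j of website i (1 if it appears, 0 if not), Weight1_i denotes the weight of website i, and Weight2_j denotes the weight of answer j. The value of Weight2_j is determined by the support count and opposition count of answer j; Weight2_j is a positive integer greater than or equal to 1, and its default value is 1.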
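Further, in selecting the entity information corresponding to each historical query sequence from its candidate entity information,
the entity information is selected from the candidate entity information corresponding to each historical query sequence according to the score of each item of candidate entity information.
Further, after selecting the entity information according to the score of each item of candidate entity information, the method further comprises:
obtaining the scores of the selected entity information from the scores of the corresponding candidate entity information.
Further, after determining the entity information corresponding to each historical query sequence according to its candidate entity information, the method further comprises:
looking up a corresponding historical query sequence according to a current query sequence containing a knowledge requirement;
obtaining the entity information corresponding to the historical query sequence found.
Further, after obtaining the entity information corresponding to the historical query sequence found, the method further comprises:
obtaining the scores of the entity information corresponding to the historical query sequence found, and sorting the entity information by score.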
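In another aspect, the present application further provides a device for processing search data, comprising:
a first apparatus, configured to obtain search result information corresponding to each historical query sequence containing a knowledge requirement;
a second apparatus, configured to extract, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to that historical query sequence;
a third apparatus, configured to determine, according to the candidate entity information corresponding to each historical query sequence, entity information corresponding to that historical query sequence.
Further, the second apparatus comprises:
a first unit, configured to determine, according to the type of each historical query sequence, a manner of extracting candidate entity information for that historical query sequence;
a second unit, configured to extract, in the manner determined for each historical query sequence, the candidate entity information corresponding to that historical query sequence from its search result information.
Further, the third apparatus is configured to use all candidate entity information corresponding to each historical query sequence as the entity information corresponding to that historical query sequence.
Further, the search result information corresponding to each historical query sequence obtained by the first apparatus includes the text content of the answers corresponding to that historical query sequence, the website, the support count, and the opposition count.
Further, the second apparatus extracts the candidate entity information corresponding to each historical query sequence from the text content of the answers corresponding to that historical query sequence.
Further, the third apparatus selects the entity information corresponding to each historical query sequence from its candidate entity information.
Further, the device comprises a fourth apparatus, configured to calculate a score for the candidate entity information corresponding to each historical query sequence.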
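Further, the fourth apparatus calculates the score of the candidate entity information corresponding to each historical query sequence according to the following formula:
$$\mathrm{Score}(\mathrm{entity1}) = \sum_{i=1}^{m} \sum_{j=1}^{n} E_{ij} \cdot \mathrm{Weight1}_{i} \cdot \mathrm{Weight2}_{j}$$
where entity1 denotes an entity word, m denotes the total number of websites, i denotes one of the m websites, n denotes the total number of answers on website i, j denotes one of the n answers, E_ij indicates whether entity1 appears in answer j of website i (1 if it appears, 0 if not), Weight1_i denotes the weight of website i, and Weight2_j denotes the weight of answer j. The value of Weight2_j is determined by the support count and opposition count of answer j; Weight2_j is a positive integer greater than or equal to 1, and its default value is 1.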
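Further, the third apparatus is configured to select, according to the score of each item of candidate entity information, the entity information corresponding to each historical query sequence from its candidate entity information.
Further, the third apparatus is also configured to obtain the scores of the selected entity information from the scores of the corresponding candidate entity information.
Further, the device comprises:
a fifth apparatus, configured to look up a corresponding historical query sequence according to a current query sequence containing a knowledge requirement;
a sixth apparatus, configured to obtain the entity information corresponding to the historical query sequence found.
Further, the sixth apparatus is also configured to obtain the scores of the entity information corresponding to the historical query sequence found, and to sort the entity information by score.
Compared with the prior art, for historical query sequences containing knowledge requirements, the present application can mine entity information for each historical query sequence and recommend it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for historical query sequences containing knowledge requirements are currently poor.
Further, the present application selects the entity information corresponding to each historical query sequence from its candidate entity information, deleting inaccurate or insufficiently accurate candidates and keeping the accurate ones as entity information, thereby obtaining better and more accurate entity information to provide to the user.
Further, the present application calculates a score for the candidate entity information corresponding to each historical query sequence, so that entity information can later be further selected from the candidates according to the scores, or the selected entity information can be sorted before being provided to the user, giving the user more accurate recommendations.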
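Brief Description of the Drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
FIG. 1 shows a search result page of an existing navigation approach;
FIG. 2 shows another search result page of an existing navigation approach;
FIG. 3 shows a search result page of an existing related-search approach;
FIG. 4 shows a flowchart of a method for processing search data according to one aspect of the present application;
FIG. 5 shows a search result page of the present application;
FIG. 6 shows another search result page of the present application;
FIG. 7 shows a flowchart of a method for processing search data according to a preferred embodiment of the present application;
FIG. 8 shows a flowchart of a method for processing search data according to another preferred embodiment of the present application;
FIG. 9 shows a schematic diagram of a device for processing search data according to another aspect of the present application;
FIG. 10 shows a schematic diagram of a device for processing search data according to a preferred embodiment of the present application;
FIG. 11 shows a schematic diagram of a device for processing search data according to another preferred embodiment of the present application;
FIG. 12 shows a schematic diagram of a device for processing search data according to yet another preferred embodiment of the present application.
The same or similar reference numerals in the drawings denote the same or similar components.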
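Detailed Description
In a typical configuration of the present application, the terminal, the device of the service network, and the trusted party each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent storage in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.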
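As shown in FIG. 4, the present application provides a method for processing search data, comprising:
Step S1: obtaining search result information corresponding to each historical query sequence containing a knowledge requirement;
Step S2: extracting, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to that historical query sequence;
Step S3: determining, according to the candidate entity information corresponding to each historical query sequence, entity information corresponding to that historical query sequence. Specifically, for historical query sequences containing knowledge requirements, the present application can mine entity information for each historical query sequence and recommend it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for such query sequences are currently poor. The application can use information extraction: it first identifies historical queries containing knowledge requirements, then extracts search result information from external community-website data related to those queries, and mines the desired entity information from the search result information, depositing it as answers in a knowledge base. When a user later looks up, online, the historical query sequence matching a current query sequence that contains a knowledge requirement, the entity information corresponding to the matched historical query sequence can be recommended to the user from the knowledge base.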
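Here, the entity information may be information about things that exist objectively and can be distinguished from one another; it may be information about a specific person, event, or object, or about an abstract concept or relationship. In a shopping scenario, the historical query sequence containing a knowledge requirement may be a knowledge-oriented shopping query, such as "practical gifts for parents" in FIG. 5 or "gifts for a boyfriend" in FIG. 6. The method of the present application can mine entity information from the community data of websites and recommend it to the user as an answer, improving the accuracy of the recommended entity information; the resulting entity information is a set of targeted product recommendations, which addresses the currently poor shopping-guide information for knowledge-oriented shopping queries. In another scenario, the user may obtain N levels of entity information in turn, where N is a positive integer and each level is obtained from the level before it. For example, each item of entity information at levels 1 to N-1 may itself be a new historical query sequence containing a knowledge requirement, so the next level of entity information is obtained from the historical query sequence of the previous level; except at level N, that next-level entity information is again a historical query sequence, which in turn yields the level after it, and so on, until the entity information at level N-1 (at that point still a historical query sequence) yields a specific item of entity information at level N, such as specific product information. The entity information at the first N-1 levels may be shown to the user as multi-level recommendation labels: when the user clicks a label at one level, the page may jump to the labels of the next level, until the specific entity information at level N, such as specific product information, is reached. This level-by-level jumping guides the user to exactly the entity information they want. Those skilled in the art will understand that the above description of application scenarios is only an example; other existing or future application scenarios, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.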
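The N-level recommendation labels described above can be pictured as repeated lookups in which each non-final entity is itself a query. The following is a minimal sketch under stated assumptions: `tag_graph`, `drill_down`, `pick_first`, and the two-level sample data are hypothetical illustrations, not part of the application.

```python
def drill_down(tag_graph, query, choose, max_levels=10):
    """Follow recommendation labels level by level until a leaf entity is reached.

    tag_graph maps a query (or a non-final entity used as a query) to its labels.
    choose stands in for the user's click: it picks one label from a list.
    """
    path = [query]
    for _ in range(max_levels):          # guard against accidental cycles
        labels = tag_graph.get(path[-1])
        if not labels:                   # no further level: this is the level-N entity
            break
        path.append(choose(labels))
    return path


def pick_first(labels):
    """Simulated user who always clicks the first label."""
    return labels[0]


# Made-up two-level data, for illustration only.
tag_graph = {
    "gifts for a boyfriend": ["watch", "wallet"],
    "watch": ["mechanical watch", "smart watch"],
}

print(drill_down(tag_graph, "gifts for a boyfriend", pick_first))
```

Here the first two elements of the returned path play the role of levels that are still queries, and the last element is the specific entity at level N.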
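As shown in FIG. 7, in a preferred embodiment of the present application, step S2 of FIG. 4 includes:
Step S21: determining, according to the type of each historical query sequence, a manner of extracting candidate entity information for that historical query sequence;
Step S22: extracting, in the manner determined for each historical query sequence, the candidate entity information corresponding to that historical query sequence from its search result information.
Here, before step S21, all historical query sequences may be analyzed and summarized to derive the different types of historical query sequences containing knowledge requirements; then, in step S21, the manner of extracting candidate entity information for each historical query sequence is determined according to its type. For example, the types of historical query sequences containing knowledge requirements may be divided into the following categories:
(1) place name + "specialty": the user wants to learn about the specialties of a place;
(2) "send" + form of address + "gift": the user wants shopping guidance for gift giving;
(3) category word + "brand": the user wants to learn the best-selling brands of a category;
(4) category word + "accessory": the user wants to learn about other accessories for a category.
For a historical query sequence of the place name + "specialty" type, the manner of extraction is to extract the names of specialties as entity information; for the "send" + form of address + "gift" type, to extract the names of gifts as entity information; for the category word + "brand" type, to extract the names of brands as entity information; and for the category word + "accessory" type, to extract the names of accessories as entity information. Those skilled in the art will understand that the above description of manners of extracting candidate entity information is only an example; other existing or future manners, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.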
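One way to picture step S21 is a small pattern table that maps a query to its type and to the kind of entity to extract. The sketch below is an assumption-laden illustration: the regular expressions, the type labels, and the function name `extraction_mode` are hypothetical, and a real system would need proper word segmentation rather than these simple patterns.

```python
import re

# Hypothetical pattern table: (query type, pattern, kind of entity to extract).
# The Chinese keywords are the ones named in the four query types above:
# 特产 = specialty, 送...礼物 = send...gift, 品牌 = brand, 配件 = accessory.
QUERY_TYPES = [
    ("place+specialty", re.compile(r"^(?P<slot>.+?)特产"), "specialty name"),
    ("send+address+gift", re.compile(r"^送(?:给)?(?P<slot>.+?)(?:的.*)?礼物$"), "gift name"),
    ("category+brand", re.compile(r"^(?P<slot>.+?)品牌"), "brand name"),
    ("category+accessory", re.compile(r"^(?P<slot>.+?)配件"), "accessory name"),
]


def extraction_mode(query):
    """Return (query type, slot value, entity kind to extract), or None if no type matches."""
    for query_type, pattern, target in QUERY_TYPES:
        match = pattern.search(query)
        if match:
            return query_type, match.group("slot"), target
    return None


print(extraction_mode("杭州特产有哪些"))    # "what are the specialties of Hangzhou"
print(extraction_mode("送给男朋友的礼物"))  # "gifts for a boyfriend"
```

The slot value (the place name, the person addressed, or the category word) is returned alongside the type because later extraction steps would likely need it; that choice is part of the sketch, not something the application specifies.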
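In a preferred embodiment of the present application, in step S3 of FIG. 4, all candidate entity information corresponding to each historical query sequence is used as the entity information corresponding to that historical query sequence. Here, if the volume of candidate entity information is not large and it is accurate enough, all of it can be recommended to the user directly as entity information without selection, which reduces data processing and speeds up recommendation.
In a preferred embodiment of the present application, in step S1 of FIG. 4, the obtained search result information corresponding to each historical query sequence includes the text content of the answers corresponding to that historical query sequence, the website, the support count, and the opposition count. Here, general-purpose crawler technology can be used to fetch, from community websites such as Baidu Zhidao, Soso Q&A, and Taobao Q&A, the search result information corresponding to a historical query sequence (query) containing a knowledge requirement, and to parse the fetched search result information (for example, web page data). Parsing covers not only the text content of each answer but also the answer's website, support count, opposition count, and similar information, for later use in extracting and scoring candidate entity information. An example of the crawled result data is shown in Table 1:
Figure PCTCN2015097481-appb-000003
Figure PCTCN2015097481-appb-000004
Table 1
Those skilled in the art will understand that the above description of search result information is only an example; other existing or future forms of search result information, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.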
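In a more preferred embodiment of the present application, in step S2 of FIG. 4, the candidate entity information corresponding to each historical query sequence is extracted from the text content of the answers corresponding to that historical query sequence. Given structured search result information for each query like that shown in Table 1, the required candidate entity information still has to be extracted from it; here it can be extracted from the text content of the answers. There are many ways to identify candidate entity information in the text content of an answer, such as rule-based methods, methods based on hidden Markov models, and methods based on conditional random fields. Candidate entity information extracted from answer text can be of many types; in a specific scenario, for example shopping guidance for knowledge-oriented shopping queries, the category entities need to be selected, and the resulting candidate entity information may look like Table 2:
Figure PCTCN2015097481-appb-000005
Table 2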
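Of the extraction approaches listed above, the rule-based one is the simplest to illustrate. The following sketch matches answer text against a category lexicon; the lexicon, the sample sentence, and the function name `extract_candidates` are hypothetical, and whole-word matching on English text is used here only as a stand-in for whatever segmentation a real system would apply.

```python
import re


def extract_candidates(answer_text, lexicon):
    """Return the lexicon terms found in the answer, in order of first appearance."""
    if not lexicon:
        return []
    # Longest terms first so that multi-word terms win over their prefixes.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)
    found = []
    for match in pattern.finditer(answer_text):
        term = match.group(1).lower()
        if term not in found:            # keep each entity once
            found.append(term)
    return found


# Made-up answer text and lexicon, for illustration only.
answer = "A watch or a wallet is safe; a tie works too. Watch out for cheap pens."
print(extract_candidates(answer, {"watch", "wallet", "tie", "pen"}))
```

A dictionary lookup like this cannot find entities missing from the lexicon, which is one reason statistical sequence models such as hidden Markov models or conditional random fields are also mentioned as options.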
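In a preferred embodiment of the present application, step S3 of FIG. 4 includes step S31: selecting the entity information corresponding to each historical query sequence from its candidate entity information. Here, the candidate entity information may be checked against the historical query sequence: inaccurate or insufficiently accurate candidates are deleted and the accurate ones are kept as the entity information, yielding better and more accurate entity information to provide to the user.
In a preferred embodiment of the present application, after step S3 of FIG. 4 the method further includes:
calculating a score for the candidate entity information corresponding to each historical query sequence. Here, given structured search result information for each query like that shown in Table 1, and candidate entity information extracted from it like that shown in Table 2, the candidate entity information can also be scored, so that entity information can later be further selected from the candidates according to the scores, or the selected entity information can be sorted before being provided to the user. For example, scores like those shown in Table 3 can be obtained for the candidate entity information corresponding to each historical query sequence:
Figure PCTCN2015097481-appb-000006
Table 3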
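With the candidate entity information available, it can be scored by combining the quality of the website hosting an answer with the answer's degree of support (weight), for example degree of support = support count - opposition count. In a more preferred embodiment of the present application, the score of the candidate entity information corresponding to each historical query sequence can be calculated according to the following formula:
$$\mathrm{Score}(\mathrm{entity1}) = \sum_{i=1}^{m} \sum_{j=1}^{n} E_{ij} \cdot \mathrm{Weight1}_{i} \cdot \mathrm{Weight2}_{j}$$
where entity1 denotes an entity word, m denotes the total number of websites, i denotes one of the m websites, n denotes the total number of answers on website i, j denotes one of the n answers, E_ij indicates whether entity1 appears in answer j of website i (1 if it appears, 0 if not), Weight1_i denotes the weight of website i, and Weight2_j denotes the weight of answer j. The value of Weight2_j is determined by the support count and opposition count of answer j; Weight2_j is a positive integer greater than or equal to 1, and its default value is 1. For example, Weight2_j is obtained by subtracting the opposition count from the support count; if the result is less than or equal to zero, Weight2_j takes the default value 1. Weight1_i can be preset or obtained with the PageRank algorithm.
Those skilled in the art will understand that the above description of calculating the score of the candidate entity information corresponding to each historical query sequence is only an example; other existing or future ways of calculating that score, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.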
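The scoring rule defined above (an indicator E_ij multiplied by a website weight Weight1_i and an answer weight Weight2_j, summed over all websites and answers) can be sketched as follows. The `Answer` record, the function names, and the sample data and weights are hypothetical; preset website weights are assumed here rather than PageRank-derived ones.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    site: str            # website the answer was crawled from
    entities: set        # candidate entities extracted from the answer text
    support: int = 0     # support count
    oppose: int = 0      # opposition count


def answer_weight(answer):
    """Weight2_j: support count minus opposition count, with a default of 1 when <= 0."""
    diff = answer.support - answer.oppose
    return diff if diff > 0 else 1


def score(entity, answers, site_weights):
    """Sum Weight1_i * Weight2_j over every answer in which the entity appears (E_ij = 1)."""
    total = 0
    for answer in answers:
        if entity in answer.entities:
            # Websites without a preset weight fall back to 1 (an assumption of this sketch).
            total += site_weights.get(answer.site, 1) * answer_weight(answer)
    return total


# Made-up answers and preset website weights, for illustration only.
answers = [
    Answer("siteA", {"watch", "wallet"}, support=5, oppose=1),  # Weight2 = 4
    Answer("siteA", {"watch"}, support=0, oppose=3),            # Weight2 = 1 (default)
    Answer("siteB", {"wallet"}, support=2, oppose=0),           # Weight2 = 2
]
site_weights = {"siteA": 2, "siteB": 3}

print(score("watch", answers, site_weights))   # 2*4 + 2*1 = 10
print(score("wallet", answers, site_weights))  # 2*4 + 3*2 = 14
```

Iterating over a flat list of answers is equivalent to the double sum over websites and their answers, since each answer carries its own website.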
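Correspondingly, in a preferred embodiment of the present application, in step S31 the entity information corresponding to each historical query sequence is selected from its candidate entity information according to the score of each item of candidate entity information. Here, the higher-scoring candidates can be selected from the candidate entity information of each historical query sequence as the entity information corresponding to that historical query sequence.
In a more preferred embodiment of the present application, after step S31 the method further includes:
obtaining the scores of the selected entity information from the scores of the corresponding candidate entity information. Specifically, as shown in Table 3, if the candidate entity information and scores are "watch: 55; wallet: 46; lighter: 32; waist belt: 22; scarf: 22; razor: 20; bracelet: 18; leather belt: 18; tie: 18" and the selected entity information and scores are "watch: 55; wallet: 46; lighter: 32; waist belt: 22; scarf: 22; razor: 20", then the retained candidates and their scores are used as the selected entity information and scores.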
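The selection in the example above keeps the candidates scoring 20 or more and carries their scores over unchanged. A minimal sketch of that step follows; the cutoff of 20 is an assumption read off the example, since the text does not say how the threshold is chosen, and `select_by_score` is a hypothetical name.

```python
def select_by_score(candidate_scores, min_score):
    """Keep candidates whose score is at least min_score; retained scores are unchanged."""
    return {entity: s for entity, s in candidate_scores.items() if s >= min_score}


# Candidate scores from the example above.
candidate_scores = {
    "watch": 55, "wallet": 46, "lighter": 32, "waist belt": 22, "scarf": 22,
    "razor": 20, "bracelet": 18, "leather belt": 18, "tie": 18,
}

# min_score=20 is an assumed cutoff that reproduces the example's result.
print(select_by_score(candidate_scores, min_score=20))
```

A top-k rule would be an equally plausible reading of "higher-scoring candidates"; with this data, keeping the top six gives the same result.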
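Obtaining the search result information, the candidate entity information, and the entity information and scores involves large-scale data processing and calls for large-scale parallel computing; in an embodiment of the present application this can be done on a cloud computing platform.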
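As shown in FIG. 8, in a preferred embodiment of the present application, after step S3 of FIG. 4 the method further includes:
Step S4: looking up a corresponding historical query sequence according to a current query sequence containing a knowledge requirement;
Step S5: obtaining the entity information corresponding to the historical query sequence found. Here, steps S4 and S5 can be implemented by an online server. The historical query sequences and their entity information are pre-stored in a knowledge base; through a terminal, the user submits to the online server a request to look up the historical query sequence matching a current query sequence containing a knowledge requirement. If the online server finds a matching historical query sequence in the knowledge base, it presents the corresponding entity information to the user directly in the navigation area as labels, and the user can click a label to continue with a network operation such as shopping. In addition, the online server may split the current query sequence containing a knowledge requirement into multiple keyword sequences and then look up the corresponding historical query sequence by those keyword sequences, which raises the hit rate for historical query sequences.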
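Steps S4 and S5, including the keyword-splitting fallback, can be sketched with an in-memory stand-in for the knowledge base. Everything named below is hypothetical: the `KnowledgeBase` class, the whitespace-and-stopword `split_keywords` function (a placeholder for real query segmentation), the overlap-count matching rule, and the sample entries.

```python
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "the", "for", "to", "what", "of"}


def split_keywords(query):
    """Placeholder segmentation: lowercase, split on whitespace, drop stopwords."""
    return [word for word in query.lower().split() if word not in STOPWORDS]


class KnowledgeBase:
    """In-memory stand-in for a key-value store of historical queries and their entities."""

    def __init__(self):
        self._entities = {}                  # historical query -> entity information
        self._index = defaultdict(set)       # keyword -> historical queries containing it

    def put(self, query, entities):
        self._entities[query] = entities
        for keyword in split_keywords(query):
            self._index[keyword].add(query)

    def lookup(self, query):
        if query in self._entities:          # exact hit on a historical query
            return self._entities[query]
        votes = Counter()                    # otherwise match by shared keywords
        for keyword in split_keywords(query):
            for historical in self._index.get(keyword, ()):
                votes[historical] += 1
        if not votes:
            return None
        # Most shared keywords wins; ties are broken alphabetically for determinism.
        best = min(votes, key=lambda q: (-votes[q], q))
        return self._entities[best]


# Made-up knowledge base entries, for illustration only.
kb = KnowledgeBase()
kb.put("gifts for a boyfriend", ["watch", "wallet"])
kb.put("gifts for parents", ["scarf"])

print(kb.lookup("what gifts to give a boyfriend"))
```

The current query here has no exact match, so it is resolved through its keywords; a query sharing no keyword with any historical query returns `None`, in which case no labels would be shown.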
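In a more preferred embodiment of the present application, after step S5 of FIG. 8 the method further includes:
obtaining the scores of the entity information corresponding to the historical query sequence found, and sorting the entity information by score. For example, higher-scoring entity information can be placed first and lower-scoring entity information later when it is provided to the user, making it more efficient for the user to choose.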
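The ordering step is a plain descending sort on the stored scores. In this short sketch the function name `rank_entities` and the alphabetical tie-break are assumptions; the text only says that higher scores come first.

```python
def rank_entities(scores):
    """Return entity names ordered by descending score, ties broken alphabetically."""
    return sorted(scores, key=lambda entity: (-scores[entity], entity))


# A few scores taken from the earlier example, deliberately out of order.
print(rank_entities({"razor": 20, "watch": 55, "scarf": 22, "waist belt": 22}))
```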
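In an embodiment of the present application, the above process of looking up the corresponding historical query sequence and its entity information can be implemented by a key-value system that supports real-time queries.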
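As shown in FIG. 9, according to another aspect of the present application, a device 100 for processing search data is further provided, comprising:
a first apparatus 1, configured to obtain search result information corresponding to each historical query sequence containing a knowledge requirement;
a second apparatus 2, configured to extract, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to that historical query sequence;
a third apparatus 3, configured to determine, according to the candidate entity information corresponding to each historical query sequence, entity information corresponding to that historical query sequence. Specifically, for historical query sequences containing knowledge requirements, the present application can mine entity information for each historical query sequence and recommend it to the user as an answer, so as to improve the accuracy of the entity information recommended to the user and to solve the problem that search results for such query sequences are currently poor. The application can use information extraction: it first identifies historical queries containing knowledge requirements, then extracts search result information from external community-website data related to those queries, and mines the desired entity information from the search result information, depositing it as answers in a knowledge base. When a user later looks up, online, the historical query sequence matching a current query sequence that contains a knowledge requirement, the entity information corresponding to the matched historical query sequence can be recommended to the user from the knowledge base.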
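Here, the entity information may be information about things that exist objectively and can be distinguished from one another; it may be information about a specific person, event, or object, or about an abstract concept or relationship. In a shopping scenario, the historical query sequence containing a knowledge requirement may be a knowledge-oriented shopping query, such as "practical gifts for parents" in FIG. 5 or "gifts for a boyfriend" in FIG. 6. The method of the present application can mine entity information from the community data of websites and recommend it to the user as an answer, improving the accuracy of the recommended entity information; the resulting entity information is a set of targeted product recommendations, which addresses the currently poor shopping-guide information for knowledge-oriented shopping queries. In another scenario, the user may obtain N levels of entity information in turn, where N is a positive integer and each level is obtained from the level before it. For example, each item of entity information at levels 1 to N-1 may itself be a new historical query sequence containing a knowledge requirement, so the next level of entity information is obtained from the historical query sequence of the previous level; except at level N, that next-level entity information is again a historical query sequence, which in turn yields the level after it, and so on, until the entity information at level N-1 (at that point still a historical query sequence) yields a specific item of entity information at level N, such as specific product information. The entity information at the first N-1 levels may be shown to the user as multi-level recommendation labels: when the user clicks a label at one level, the page may jump to the labels of the next level, until the specific entity information at level N, such as specific product information, is reached. This level-by-level jumping guides the user to exactly the entity information they want. Those skilled in the art will understand that the above description of application scenarios is only an example; other existing or future application scenarios, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.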
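As shown in FIG. 10, in a preferred embodiment of the present application, the second apparatus 2 comprises:
a first unit 21, configured to determine, according to the type of each historical query sequence, a manner of extracting candidate entity information for that historical query sequence;
a second unit 22, configured to extract, in the manner determined for each historical query sequence, the candidate entity information corresponding to that historical query sequence from its search result information.
Here, all historical query sequences may first be analyzed and summarized to derive the different types of historical query sequences containing knowledge requirements; the first unit 21 then determines, according to the type of each historical query sequence, the manner of extracting candidate entity information for it. For example, the types of historical query sequences containing knowledge requirements may be divided into the following categories:
(1) place name + "specialty": the user wants to learn about the specialties of a place;
(2) "send" + form of address + "gift": the user wants shopping guidance for gift giving;
(3) category word + "brand": the user wants to learn the best-selling brands of a category;
(4) category word + "accessory": the user wants to learn about other accessories for a category.
For a historical query sequence of the place name + "specialty" type, the manner of extraction is to extract the names of specialties as entity information; for the "send" + form of address + "gift" type, to extract the names of gifts as entity information; for the category word + "brand" type, to extract the names of brands as entity information; and for the category word + "accessory" type, to extract the names of accessories as entity information. Those skilled in the art will understand that the above description of manners of extracting candidate entity information is only an example; other existing or future manners, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.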
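In a preferred embodiment of the present application, the third apparatus 3 is configured to use all candidate entity information corresponding to each historical query sequence as the entity information corresponding to that historical query sequence. Here, if the volume of candidate entity information is not large and it is accurate enough, all of it can be recommended to the user directly as entity information without selection, which reduces data processing and speeds up recommendation.
In a preferred embodiment of the present application, the search result information corresponding to each historical query sequence obtained by the first apparatus 1 includes the text content of the answers corresponding to that historical query sequence, the website, the support count, and the opposition count. Here, general-purpose crawler technology can be used to fetch, from community websites such as Baidu Zhidao, Soso Q&A, and Taobao Q&A, the search result information corresponding to a historical query sequence (query) containing a knowledge requirement, and to parse the fetched search result information (for example, web page data). Parsing covers not only the text content of each answer but also the answer's website, support count, opposition count, and similar information, for later use in extracting and scoring candidate entity information. An example of the crawled result data is shown in Table 1:
Figure PCTCN2015097481-appb-000008
Table 1
Those skilled in the art will understand that the above description of search result information is only an example; other existing or future forms of search result information, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.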
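Correspondingly, in a preferred embodiment of the present application, the second apparatus 2 extracts the candidate entity information corresponding to each historical query sequence from the text content of the answers corresponding to that historical query sequence. Given structured search result information for each query like that shown in Table 1, the required candidate entity information still has to be extracted from it; here it can be extracted from the text content of the answers. There are many ways to identify candidate entity information in the text content of an answer, such as rule-based methods, methods based on hidden Markov models, and methods based on conditional random fields. Candidate entity information extracted from answer text can be of many types; in a specific scenario, for example shopping guidance for knowledge-oriented shopping queries, the category entities need to be selected, and the resulting candidate entity information may look like Table 2:
Figure PCTCN2015097481-appb-000009
Figure PCTCN2015097481-appb-000010
Table 2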
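In a preferred embodiment of the present application, the third apparatus 3 selects the entity information corresponding to each historical query sequence from its candidate entity information. Here, the candidate entity information may be checked against the historical query sequence: inaccurate or insufficiently accurate candidates are deleted and the accurate ones are kept as the entity information, yielding better and more accurate entity information to provide to the user.
In a preferred embodiment of the present application, as shown in FIG. 11, the device further comprises a fourth apparatus 4, configured to calculate a score for the candidate entity information corresponding to each historical query sequence. Here, given structured search result information for each query like that shown in Table 1, and candidate entity information extracted from it like that shown in Table 2, the candidate entity information can also be scored, so that entity information can later be further selected from the candidates according to the scores, or the selected entity information can be sorted before being provided to the user. For example, scores like those shown in Table 3 can be obtained for the candidate entity information corresponding to each historical query sequence:
Figure PCTCN2015097481-appb-000011
Table 3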
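With the candidate entity information available, it can be scored by combining the quality of the website hosting an answer with the answer's degree of support (weight), for example degree of support = support count - opposition count. In a more preferred embodiment of the present application, the fourth apparatus 4 calculates the score of the candidate entity information corresponding to each historical query sequence according to the following formula:
$$\mathrm{Score}(\mathrm{entity1}) = \sum_{i=1}^{m} \sum_{j=1}^{n} E_{ij} \cdot \mathrm{Weight1}_{i} \cdot \mathrm{Weight2}_{j}$$
where entity1 denotes an entity word, m denotes the total number of websites, i denotes one of the m websites, n denotes the total number of answers on website i, j denotes one of the n answers, E_ij indicates whether entity1 appears in answer j of website i (1 if it appears, 0 if not), Weight1_i denotes the weight of website i, and Weight2_j denotes the weight of answer j. The value of Weight2_j is determined by the support count and opposition count of answer j; Weight2_j is a positive integer greater than or equal to 1, and its default value is 1. For example, Weight2_j is obtained by subtracting the opposition count from the support count; if the result is less than or equal to zero, Weight2_j takes the default value 1. Weight1_i can be preset or obtained with the PageRank algorithm.
Those skilled in the art will understand that the above description of calculating the score of the candidate entity information corresponding to each historical query sequence is only an example; other existing or future ways of calculating that score, where applicable to the present application, also fall within its scope of protection and are incorporated herein by reference.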
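In a preferred embodiment of the present application, the third apparatus 3 is configured to select, according to the score of each item of candidate entity information, the entity information corresponding to each historical query sequence from its candidate entity information. Here, the higher-scoring candidates can be selected from the candidate entity information of each historical query sequence as the entity information corresponding to that historical query sequence.
In a more preferred embodiment of the present application, the third apparatus 3 is also configured to obtain the scores of the selected entity information from the scores of the corresponding candidate entity information. Specifically, as shown in Table 3, if the candidate entity information and scores are "watch: 55; wallet: 46; lighter: 32; waist belt: 22; scarf: 22; razor: 20; bracelet: 18; leather belt: 18; tie: 18" and the selected entity information and scores are "watch: 55; wallet: 46; lighter: 32; waist belt: 22; scarf: 22; razor: 20", then the retained candidates and their scores are used as the selected entity information and scores.
Obtaining the search result information, the candidate entity information, and the entity information and scores involves large-scale data processing and calls for large-scale parallel computing; in an embodiment of the present application this can be done on a cloud computing platform.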
如图12所示,本申请一优选的实施例中,所述设备还包括:
第五装置5,用于根据包含知识需求的当前查询序列查找对应的历史查询序列;
第六装置6,用于获取查找到的历史查询序列所对应的实体信息。在此,第五装置5和第六装置6的功能可通过一在线服务器实现,历史查询序列及对应的实体信息已经预存于一知识库中,用户可通过终端向在所述线服务器提交搜索包含知识需求的当前查询序列查找对应的历史查询序列的请求,在线服务器如从所述知识库中查找到对应的历史查询序列,就直接将对应的实体信息以标签的形式在导航区呈现给用户,用户可以点击标签继续进行网络操作行为如购物行为。另外,所述在线服务器可将包含知识需求的当前查询序列拆分为多个关键字序列,然后根据多个关键字序列查找对应的历史查询序列,以提高历史查询序列的命中率。
本申请一更优的实施例中,所述第六装置6,还用于获取查找到的历史查询序列所对应的实体信息的分数,根据每个实体信息的分数高低对实体信息进行排序。例如,可以将分数高的实体信息排在前面,将分数低的实体信 息排在后面提供给用户,以提高用户选择实体信息的效率。
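In an embodiment of the present application, the process by which the fifth device 5 and the sixth device 6 look up the corresponding historical query sequence and its entity information can be implemented with a key-value system that supports real-time queries.
The method and apparatus for processing search data described in the present application are further explained below with reference to a specific application embodiment.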
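In a specific application embodiment, to address the current problem of poor shopping-guidance information for knowledge-based shopping queries, the search result information (such as "website", "answer text", "number of supports", and "number of opposes") corresponding to a historical query sequence (query) containing a knowledge need, such as "a gift for my boyfriend", can first be crawled from community websites such as Baidu Zhidao (百度知道), Soso Q&A (搜搜问答), and Taobao Q&A (淘宝问答), as shown in Table 1. The candidate entity information corresponding to that historical query sequence, such as "clothes, tie, belt, watch, briefcase, pen", is then extracted from the "answer text" of the search result information in Table 1. Next, each piece of candidate entity information is scored, and the candidates "clothes, tie, belt, watch, briefcase, pen" are filtered by score; if, for example, the score for "pen" is very low, "pen" is deleted. This yields filtered entity information and scores similar to Table 3, for example the entity information "clothes, tie, belt, watch, briefcase". The entity information can subsequently be sorted by the scores of "clothes, tie, belt, watch, briefcase", with higher-scoring entity information placed nearer the front so that the user can see and select it more easily, which improves the accuracy of the recommendation.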
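In summary, for historical query sequences that contain a knowledge need, the present application mines entity information for each historical query sequence and recommends it to the user as an answer. This improves the accuracy of the entity information recommended to the user and solves the current problem of poor search results for historical query sequences that contain a knowledge need.
Further, the present application filters the entity information corresponding to each historical query sequence out of the candidate entity information corresponding to that historical query sequence, deleting inaccurate or insufficiently accurate candidates and keeping the accurate ones as entity information, so that better optimized and more accurate entity information is provided to the user.
Further, the present application calculates a score for the candidate entity information corresponding to each historical query sequence, so that the scores can later be used to further filter entity information out of the candidate entity information, or to sort the filtered entity information before it is provided to the user, thereby giving the user more accurate recommendation results.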
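Obviously, those skilled in the art can make various changes and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
It should be noted that the present application can be implemented in software and/or in a combination of software and hardware; for example, it can be implemented using an application-specific integrated circuit (ASIC), a general-purpose computer, or any other similar hardware device. In one embodiment, the software program of the present application can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present application (including related data structures) can be stored in a computer-readable recording medium, such as RAM, a magnetic or optical drive, a floppy disk, or a similar device. In addition, some steps or functions of the present application can be implemented in hardware, for example as a circuit that cooperates with a processor to perform the individual steps or functions.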
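In addition, part of the present application can be applied as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present application through the operation of that computer. The program instructions that invoke the method of the present application may be stored in a fixed or removable recording medium, and/or transmitted as a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, an embodiment according to the present application includes a device comprising a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the device is triggered to run the methods and/or technical solutions according to the foregoing embodiments of the present application.
It is obvious to those skilled in the art that the present application is not limited to the details of the above exemplary embodiments and that it can be implemented in other specific forms without departing from its spirit or essential characteristics. The embodiments should therefore be regarded in every respect as exemplary and non-limiting. The scope of the present application is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and range of equivalents of the claims are therefore intended to be embraced in the present application. No reference sign in the claims should be construed as limiting the claim concerned. Furthermore, the word "comprising" clearly does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in an apparatus claim can also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not indicate any particular order.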

Claims (24)

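  1. A method for processing search data, comprising:
    obtaining search result information corresponding to each historical query sequence that contains a knowledge need;
    extracting, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to that historical query sequence;
    determining, according to the candidate entity information corresponding to each historical query sequence, entity information corresponding to that historical query sequence.
  2. The method according to claim 1, wherein extracting, from the search result information corresponding to each historical query sequence, the candidate entity information corresponding to that historical query sequence comprises:
    determining, according to the type of each historical query sequence, a manner of extracting candidate entity information corresponding to that historical query sequence;
    extracting, according to the manner of extracting candidate entity information corresponding to each historical query sequence, the candidate entity information corresponding to that historical query sequence from the search result information corresponding to that historical query sequence.
  3. The method according to claim 1 or 2, wherein, in determining the entity information corresponding to each historical query sequence according to the candidate entity information corresponding to that historical query sequence,
    all of the candidate entity information corresponding to each historical query sequence is taken as the entity information corresponding to that historical query sequence.
  4. The method according to any one of claims 1 to 3, wherein, in obtaining the search result information corresponding to each historical query sequence that contains a knowledge need,
    the obtained search result information corresponding to each historical query sequence comprises the text content of the answers, the website, the number of supports, and the number of opposes corresponding to that historical query sequence.
  5. The method according to claim 4, wherein, in extracting, from the search result information corresponding to each historical query sequence, the candidate entity information corresponding to that historical query sequence,
    the candidate entity information corresponding to each historical query sequence is extracted from the text content of the answers corresponding to that historical query sequence.
  6. The method according to claim 4 or 5, wherein determining the entity information corresponding to each historical query sequence according to the candidate entity information corresponding to that historical query sequence comprises:
    filtering the entity information corresponding to each historical query sequence out of the candidate entity information corresponding to that historical query sequence.
  7. The method for processing search data according to claim 6, further comprising, after extracting, from the search result information corresponding to each historical query sequence, the candidate entity information corresponding to that historical query sequence:
    calculating a score of the candidate entity information corresponding to each historical query sequence.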
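  8. The method for processing search data according to claim 7, wherein the score of the candidate entity information corresponding to each historical query sequence is calculated according to the following formula:
    Figure PCTCN2015097481-appb-100001
    where entity1 denotes an entity word; m denotes the total number of websites; i denotes one of the m websites; n denotes the total number of answers on website i; j denotes one of the n answers; Eij indicates whether entity1 appears in answer j of website i, being 1 if it appears and 0 if it does not; Weight1i denotes the weight of website i; Weight2j denotes the weight of answer j; the value of Weight2j is determined by the number of supports and the number of opposes of answer j; Weight2j is a positive integer greater than or equal to 1; and the default value of Weight2j is 1.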
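  9. The method according to claim 7 or 8, wherein, in filtering the entity information corresponding to each historical query sequence out of the candidate entity information corresponding to that historical query sequence,
    the entity information corresponding to each historical query sequence is filtered out of the candidate entity information corresponding to that historical query sequence according to the score of each piece of candidate entity information.
  10. The method according to claim 9, further comprising, after filtering the entity information corresponding to each historical query sequence out of the candidate entity information corresponding to that historical query sequence according to the score of each piece of candidate entity information:
    obtaining the scores of the filtered entity information from the scores of the candidate entity information.
  11. The method according to claim 10, further comprising, after determining the entity information corresponding to each historical query sequence according to the candidate entity information corresponding to that historical query sequence:
    looking up the corresponding historical query sequence according to a current query sequence that contains a knowledge need;
    obtaining the entity information corresponding to the historical query sequence that was found.
  12. The method according to claim 11, further comprising, after obtaining the entity information corresponding to the historical query sequence that was found:
    obtaining the scores of the entity information corresponding to the historical query sequence that was found, and sorting the entity information by score.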
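  13. An apparatus for processing search data, comprising:
    a first device, configured to obtain search result information corresponding to each historical query sequence that contains a knowledge need;
    a second device, configured to extract, from the search result information corresponding to each historical query sequence, candidate entity information corresponding to that historical query sequence;
    a third device, configured to determine, according to the candidate entity information corresponding to each historical query sequence, entity information corresponding to that historical query sequence.
  14. The apparatus according to claim 13, wherein the second device comprises:
    a first unit, configured to determine, according to the type of each historical query sequence, a manner of extracting candidate entity information corresponding to that historical query sequence;
    a second unit, configured to extract, according to the manner of extracting candidate entity information corresponding to each historical query sequence, the candidate entity information corresponding to that historical query sequence from the search result information corresponding to that historical query sequence.
  15. The apparatus according to claim 13 or 14, wherein the third device is configured to take all of the candidate entity information corresponding to each historical query sequence as the entity information corresponding to that historical query sequence.
  16. The apparatus according to any one of claims 13 to 15, wherein the search result information corresponding to each historical query sequence obtained by the first device comprises the text content of the answers, the website, the number of supports, and the number of opposes corresponding to that historical query sequence.
  17. The apparatus according to claim 16, wherein the second device extracts the candidate entity information corresponding to each historical query sequence from the text content of the answers corresponding to that historical query sequence.
  18. The apparatus according to claim 16 or 17, wherein the third device filters the entity information corresponding to each historical query sequence out of the candidate entity information corresponding to that historical query sequence.
  19. The apparatus according to claim 18, further comprising a fourth device, configured to calculate a score of the candidate entity information corresponding to each historical query sequence.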
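  20. The apparatus according to claim 19, wherein the fourth device calculates the score of the candidate entity information corresponding to each historical query sequence according to the following formula:
    Figure PCTCN2015097481-appb-100002
    where entity1 denotes an entity word; m denotes the total number of websites; i denotes one of the m websites; n denotes the total number of answers on website i; j denotes one of the n answers; Eij indicates whether entity1 appears in answer j of website i, being 1 if it appears and 0 if it does not; Weight1i denotes the weight of website i; Weight2j denotes the weight of answer j; the value of Weight2j is determined by the number of supports and the number of opposes of answer j; Weight2j is a positive integer greater than or equal to 1; and the default value of Weight2j is 1.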
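  21. The apparatus according to claim 19 or 20, wherein the third device is configured to filter the entity information corresponding to each historical query sequence out of the candidate entity information corresponding to that historical query sequence according to the score of each piece of candidate entity information.
  22. The apparatus according to claim 21, wherein the third device is further configured to obtain the scores of the filtered entity information from the scores of the candidate entity information.
  23. The apparatus according to claim 22, further comprising:
    a fifth device, configured to look up the corresponding historical query sequence according to a current query sequence that contains a knowledge need;
    a sixth device, configured to obtain the entity information corresponding to the historical query sequence that was found.
  24. The apparatus according to claim 23, wherein the sixth device is further configured to obtain the scores of the entity information corresponding to the historical query sequence that was found, and to sort the entity information by score.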
PCT/CN2015/097481 2014-12-23 2015-12-15 Method and apparatus for processing search data WO2016101812A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US15/538,727 US10635678B2 (en) 2014-12-23 2015-12-15 Method and apparatus for processing search data
JP2017532636A JP6728178B2 (ja) 2014-12-23 2015-12-15 検索データを処理するための方法及び装置
US16/822,431 US11347758B2 (en) 2014-12-23 2020-03-18 Method and apparatus for processing search data

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410836116.9 2014-12-23
CN201410836116.9A CN105786936A (zh) 2014-12-23 2014-12-23 Method and apparatus for processing search data

Related Child Applications (2)

Application Number Title Priority Date Filing Date
US15/538,727 A-371-Of-International US10635678B2 (en) 2014-12-23 2015-12-15 Method and apparatus for processing search data
US16/822,431 Continuation US11347758B2 (en) 2014-12-23 2020-03-18 Method and apparatus for processing search data

Publications (1)

Publication Number Publication Date
WO2016101812A1 true WO2016101812A1 (zh) 2016-06-30

Family

ID=56149237

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2015/097481 WO2016101812A1 (zh) 2014-12-23 2015-12-15 Method and apparatus for processing search data

Country Status (4)

Country Link
US (2) US10635678B2 (zh)
JP (2) JP6728178B2 (zh)
CN (1) CN105786936A (zh)
WO (1) WO2016101812A1 (zh)

Also Published As

Publication number Publication date
CN105786936A (zh) 2016-07-20
US11347758B2 (en) 2022-05-31
US10635678B2 (en) 2020-04-28
JP2018504686A (ja) 2018-02-15
US20180011857A1 (en) 2018-01-11
JP6966158B2 (ja) 2021-11-10
JP2020170538A (ja) 2020-10-15
US20200226142A1 (en) 2020-07-16
JP6728178B2 (ja) 2020-07-22

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15871880

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017532636

Country of ref document: JP

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 15538727

Country of ref document: US

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 15871880

Country of ref document: EP

Kind code of ref document: A1