WO2017024884A1 - Search intention identification method and device - Google Patents

Search intention identification method and device

Info

Publication number
WO2017024884A1
WO2017024884A1 PCT/CN2016/085338 CN2016085338W WO2017024884A1 WO 2017024884 A1 WO2017024884 A1 WO 2017024884A1 CN 2016085338 W CN2016085338 W CN 2016085338W WO 2017024884 A1 WO2017024884 A1 WO 2017024884A1
Authority
WO
WIPO (PCT)
Prior art keywords
search
special
category
preset
sentence
Prior art date
Application number
PCT/CN2016/085338
Other languages
French (fr)
Chinese (zh)
Inventor
康昭委
李亚楠
曾洪雷
Original Assignee
广州神马移动信息科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 广州神马移动信息科技有限公司
Publication of WO2017024884A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor

Definitions

  • the present invention relates to the field of Internet technologies, and in particular, to a search intent identification method and apparatus.
  • Existing general-purpose search engines all provide a vertical search function. Unlike traditional universal search, vertical search only retrieves web pages within a specific category that are related to the search sentence (a word, a phrase, etc.) entered by the user, for example music search, video search, or novel search.
  • To support vertical search, the search engine needs a search intent recognition function: given the target search sentence, it identifies the specific category that the user wants to search. For example, if the target search sentence is "Dragon", search intent recognition may identify the corresponding special category as video or novel, and the search engine then performs a video search and a novel search respectively.
  • Existing intent identification methods are usually based on a white list, combined with fuzzy matching and pattern matching.
  • For example, for the novel category, a white list covering as many search sentences (words, phrases, etc.) of that category as possible is set in advance; on this basis, a fuzzy query threshold and pattern matching keywords related to the special category (such as "free reading", "free download", "txt download", "online reading") can also be set.
  • At search time, the user's search intention is determined by at least one of the following checks: whether the target search sentence exists in the white list of the special category; whether the similarity between the target search sentence and some search sentence in the white list exceeds the fuzzy query threshold; or whether the target search sentence contains a pattern matching keyword of the special category.
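The three prior-art checks described above can be sketched as follows. This is only an illustration of the baseline that the application criticizes: the white-list entries, the threshold value of 0.8, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are assumptions, not details from the source.

```python
from difflib import SequenceMatcher

# Hypothetical data for one special category ("novel"); not from the source.
WHITELIST = {"tianlong ba bu", "journey to the west"}
PATTERN_KEYWORDS = ("free reading", "free download", "txt download", "online reading")
FUZZY_THRESHOLD = 0.8  # assumed value


def matches_category(query: str) -> bool:
    """Return True if any of the three prior-art checks fires."""
    q = query.strip().lower()
    # 1. Exact white-list hit.
    if q in WHITELIST:
        return True
    # 2. Fuzzy match: similarity to some white-list entry exceeds the threshold.
    if any(SequenceMatcher(None, q, w).ratio() > FUZZY_THRESHOLD for w in WHITELIST):
        return True
    # 3. Pattern match: the query contains a category keyword.
    return any(k in q for k in PATTERN_KEYWORDS)
```

Every list and threshold here has to be maintained by hand, which is the generalization problem the application sets out to address.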
  • However, the search sentences input by users are ever-changing, while the white list, fuzzy query threshold, and pattern matching keywords used by the above intent recognition method are set manually. The search sentences they cover are limited and generalize poorly, so it is difficult to accurately identify the specific category corresponding to each target search sentence.
  • the embodiment of the present application provides a search intent identification method and device, which solves the problem that the search intent recognition method in the prior art has poor generalization and low recognition accuracy.
  • A first aspect of the present application provides a search intent identification method, the method comprising: obtaining a first historical search sentence set within a first preset time, and classifying the historical search sentences in the first historical search sentence set to obtain a special search vocabulary corresponding to each preset special category; establishing a classification model according to the special search vocabulary, obtaining candidate search sentences corresponding to each preset special category by using the classification model, and recording the candidate search sentences into the special search vocabulary of the corresponding category; and determining, according to the historical search sentences and the candidate search sentences in the special search vocabulary, at least one preset special category as the intent category of the target search sentence.
  • The method further includes: acquiring a second historical search sentence set within a second preset time, and training the classification model according to the second historical search sentence set, so as to update the special search vocabulary.
  • Classifying the historical search sentences in the first historical search sentence set to obtain the special search vocabulary corresponding to each preset special category includes: obtaining the clicked webpage combination corresponding to each historical search sentence in the first historical search sentence set within the first preset time; for each historical search sentence, determining the preset special category to which each clicked webpage in its clicked webpage combination belongs, and calculating the click proportion of the clicked webpages corresponding to each preset special category in the clicked webpage combination; taking each preset special category whose click proportion exceeds a preset threshold as an intent category of the corresponding historical search sentence; and recording each historical search sentence in the special search vocabulary corresponding to its intent category.
  • Determining the preset special category to which each clicked webpage in the clicked webpage combination of a historical search sentence belongs includes: obtaining the URL of each clicked webpage in the combination; determining the host name of the clicked webpage according to the URL; and querying the special site list corresponding to each preset special category, determining the special site list that contains the host name, and taking the preset special category corresponding to that list as the preset special category to which the clicked webpage belongs.
  • Establishing the classification model according to the special search vocabulary includes: for each historical search sentence in the special search vocabulary, obtaining the URL and the webpage title of the corresponding clicked webpage; segmenting each historical search sentence in the special search vocabulary and the corresponding webpage title and URL; representing each word obtained by segmenting the historical search sentence and the webpage title, and each character string obtained by segmenting the URL, as a feature vector in the feature space; and establishing a classification model based on the maximum entropy model according to the feature vectors, using the click proportion corresponding to the preset special category of the clicked webpage as the weight of the related feature vector.
  • The method further includes: performing a vertical search on the target search sentence according to the intent category of the target search sentence, to obtain target webpages related to the target search sentence; determining a search intent level of the intent category according to the click proportion corresponding to the intent category; and determining, according to the search intent level, the display order of the target webpages corresponding to the respective intent categories, and generating a search result page in the Aladdin form.
  • A second aspect of the present application provides a search intent identification device, the device comprising: a sample acquisition unit, a model control unit, and an intent recognition unit. The sample acquisition unit is configured to obtain a first historical search sentence set within a first preset time, and to classify the historical search sentences in the first historical search sentence set to obtain a special search vocabulary corresponding to each preset special category. The model control unit is configured to establish a classification model according to the special search vocabulary, obtain candidate search sentences corresponding to each preset special category by using the classification model, and record the candidate search sentences into the special search vocabulary of the corresponding category. The intent recognition unit is configured to determine, according to the historical search sentences and the candidate search sentences in the special search vocabulary, at least one preset special category as the intent category of the target search sentence.
  • The device further includes: an updating unit, configured to acquire a second historical search sentence set within a second preset time, and to train the classification model according to the second historical search sentence set, so as to update the special search vocabulary.
  • The sample acquisition unit includes: a clicked webpage obtaining unit, configured to obtain the clicked webpage combination corresponding to each historical search sentence in the first historical search sentence set within the first preset time; and a clicked webpage analyzing unit, configured to determine, for each historical search sentence, the preset special category to which each clicked webpage in its clicked webpage combination belongs, calculate the click proportion of the clicked webpages corresponding to each preset special category in the clicked webpage combination, take each preset special category whose click proportion exceeds a preset threshold as an intent category of the corresponding historical search sentence, and record each historical search sentence in the special search vocabulary corresponding to its intent category.
  • The clicked webpage analyzing unit includes a clicked webpage classification module, configured to: obtain the URL of each clicked webpage in the clicked webpage combination; determine the host name of the clicked webpage according to the URL; query the special site list corresponding to each preset special category; determine the special site list that contains the host name; and take the preset special category corresponding to that list as the preset special category to which the clicked webpage belongs.
  • The model control unit includes: a sample data acquiring unit, configured to obtain, for each historical search sentence in the special search vocabulary, the URL and the webpage title of the corresponding clicked webpage; a feature vector generating unit, configured to segment each historical search sentence in the special search vocabulary and the corresponding webpage title and URL, and to represent each word obtained by segmenting the historical search sentence and the webpage title, and each character string obtained by segmenting the URL, as a feature vector in the feature space; and a model establishing unit, configured to establish a classification model based on the maximum entropy model according to the feature vectors, using the click proportion corresponding to the preset special category of the clicked webpage as the weight of the related feature vector.
  • The device further includes a search display unit. The search display unit includes: a vertical search unit, configured to perform a vertical search on the target search sentence according to the intent category of the target search sentence, to obtain target webpages related to the target search sentence; a level determining unit, configured to determine a search intent level of the intent category according to the click proportion corresponding to the intent category; and a sorting unit, configured to determine, according to the search intent level, the display order of the target webpages corresponding to each intent category, and to generate a search result page in the Aladdin form.
  • A third aspect of the present application provides a search intent identification method, the method comprising: obtaining a target search sentence; determining at least one preset special category according to the historical search sentences and the candidate search sentences in a special search vocabulary, wherein, in the special search vocabulary, each historical search sentence corresponds to at least one preset special category and the candidate search sentences are obtained by a classification model established according to the special search vocabulary; and determining the intent category of the target search sentence according to the ratio of the click amount of the determined at least one preset special category to the total click amount of the target search sentence.
  • A fourth aspect of the present application provides a search intent identification device, the device comprising: a sentence receiving unit, a category determining unit, and an intent identification unit. The sentence receiving unit is configured to obtain a target search sentence. The category determining unit is configured to determine at least one preset special category according to the historical search sentences and the candidate search sentences in a special search vocabulary, wherein each historical search sentence corresponds to at least one preset special category and the candidate search sentences are obtained by a classification model established according to the special search vocabulary. The intent identification unit is configured to determine the intent category of the target search sentence according to the ratio of the click amount of the determined at least one preset special category to the total click amount of the target search sentence.
  • The embodiments of the present application obtain a large number of historical search sentences from historical search records and classify them, initially obtaining a special search vocabulary for each preset special category; a classification model is then established according to the special search vocabulary.
  • Candidate search sentences related to each historical search sentence are mined by means of the classification model and used to supplement and improve the corresponding special search vocabulary. That is, the special search vocabulary of a preset special category includes both the historical search sentences belonging to that category and the corresponding candidate search sentences.
  • Compared with the manually set white list, fuzzy query threshold, and pattern matching keywords of the prior art, the search sentences in the special search vocabulary are more accurate, more comprehensive, and generalize better.
  • Therefore, by performing search intent identification according to the special search vocabulary, the embodiments can identify the intent category of the target search sentence more accurately, and avoid erroneous recognition caused by manually specified rules being inconsistent with the actual judgment standard of the user.
  • FIG. 1 is a schematic diagram of an application environment of a search intent identification method and apparatus according to an embodiment of the present application
  • FIG. 2 is a structural block diagram of a server provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of a method for identifying a search intention according to an embodiment of the present application
  • FIG. 5 is a flowchart of a process for classifying a historical search sentence in a search intent identification method according to an embodiment of the present application
  • FIG. 6 is a flowchart of a process for establishing a classification model in a search intent identification method according to an embodiment of the present application
  • FIG. 7 is a flowchart of a process for generating a search result page in a search intent identification method according to an embodiment of the present application
  • FIG. 8 is a schematic diagram of a search result page according to an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another search result page according to an embodiment of the present application.
  • FIG. 10 is a schematic structural diagram of a search intention identification apparatus according to an embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of another search intention identification apparatus according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of still another search intention identification apparatus according to an embodiment of the present application.
  • server 100 is in communication with one or more user terminals 200 over network 300 for data communication or interaction.
  • a client is installed in the user terminal 200, and the client may be a third-party application software (such as a search engine) corresponding to the server 100, thereby providing a service (such as information query) for the user.
  • The server 100 may be a plurality of servers, such as a database server, an instant messaging server, a network server, an authentication server, or the like, or may be a single server.
  • the user terminal 200 may be a personal computer (PC), a tablet computer, a smart phone, a personal digital assistant (PDA), an e-book reader, a laptop portable computer, a car computer, or the like.
  • The network 300 can be a wired network or a wireless network; the wireless network includes, but is not limited to, a Wi-Fi (Wireless Fidelity) network, a 2G/3G/4G network, and the like.
  • FIG. 2 shows a block diagram of a structure of a server 100 that can be applied to an embodiment of the present application.
  • The server 100 may include the search intent identification apparatus provided by an embodiment of the present application, a memory 102, a memory controller 103, a processor 104, and a network module 105.
  • the components of the memory 102, the memory controller 103, the processor 104, and the network module 105 are electrically connected directly or indirectly to enable data transmission or interaction.
  • these components can be electrically connected by one or more communication buses or signal buses.
  • the search intent identification method and apparatus include at least one software function module that can be stored in the memory 102 in the form of software or firmware, such as the search intent identification method and the software function module or computer program included in the apparatus.
  • the memory 102 can store various software programs and modules, such as the search intent identification method and the program instruction/module corresponding to the device provided by the embodiment of the present application.
  • By running the software programs and modules stored in the memory 102, the processor 104 performs various function applications and data processing, that is, it implements the search intent identification method in the embodiments of the present application.
  • The memory 102 can include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
  • Processor 104 can be an integrated circuit chip with signal processing capabilities.
  • The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), etc.; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • The processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
  • the network module 105 is for receiving and transmitting network signals.
  • the above network signal may include a wireless signal or a wired signal.
  • The structure shown in FIG. 2 is merely illustrative; the server 100 may include more or fewer components than those shown in FIG. 2, or have a configuration different from that shown in FIG. 2.
  • the components shown in Figure 2 can be implemented in hardware, software, or a combination thereof.
  • the server 100 in this embodiment of the present application may further include multiple servers with different functions.
  • FIG. 3 is a flowchart of a search intent identification method according to the present invention, the method includes the following steps:
  • S1 Obtain a first historical search sentence set in the first preset time period, classify the historical search sentences in the first historical search sentence set, and obtain a special search vocabulary corresponding to each preset special category.
  • The historical search record of each user is stored in the server. Each historical search record includes the search sentence input by the user, the query result returned by the server, and the webpages clicked by the user in the query result.
  • the first preset time may specifically be the latest year, a quarter, a month, etc., and the set of the search sentences recorded in all the historical search records in the first preset time constitutes the first historical search sentence set.
  • the preset special categories to which each historical search sentence belongs in the first historical search sentence set are respectively determined, and the classification operation is completed, that is, each special search term library is obtained.
  • the above-mentioned preset special categories include, but are not limited to, novels, audio, video, weather, exchange rates, ticket inquiries, and the like.
  • S2 Establish a classification model according to the special search vocabulary, and obtain candidate search sentences corresponding to each preset special category by using the classification model, and record the candidate search statements into a special search vocabulary of the corresponding category.
  • The special search vocabulary corresponding to each preset special category is taken as the sample set, and a classification model is established from it.
  • Using the classification model, candidate search sentences related to each historical search sentence in the special search vocabulary are obtained, and these candidate search sentences are also saved into the corresponding special search vocabulary.
  • S3 Obtain a target search sentence, and determine at least one preset special category according to the historical search sentence and the candidate search sentence in the special search vocabulary as the intent category of the target search statement.
  • the intent category of the target search sentence may be determined according to the ratio of the click amount of one or several preset special categories to the total click amount of the target search sentence.
  • For example, the target search sentence "Tianlong Ba Bu" corresponds to preset special categories such as "fiction", "video", and "audio", with click proportions of 0.35, 0.41, and 0.09 respectively. Assuming a preset threshold of 0.3, the click proportions of the webpages corresponding to "fiction" and "video" exceed the preset threshold, so "fiction" and "video" may be used as the intent categories of the target search sentence.
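The threshold rule in this example can be written as a few lines of Python. This is a minimal sketch using the figures from the text; the function name `intent_categories` is hypothetical.

```python
def intent_categories(click_ratios: dict[str, float], threshold: float = 0.3) -> list[str]:
    """Keep every preset special category whose click proportion exceeds the threshold."""
    return [category for category, ratio in click_ratios.items() if ratio > threshold]


# Figures from the example in the text.
ratios = {"fiction": 0.35, "video": 0.41, "audio": 0.09}
print(intent_categories(ratios))  # ['fiction', 'video']
```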
  • The special search vocabulary is equivalent to a white list and is used for matching the intent category corresponding to the target search sentence. Since the special search vocabulary is built from the historical search sentences in massive historical search records, the number of search sentences in the special search vocabulary of each preset special category is much larger than the number in a manually set white list. This overcomes the incomplete coverage and poor generalization of the prior art, so that the intent category corresponding to a target search sentence input by the user in real time can be accurately determined.
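Used as a white list, the special search vocabulary reduces the matching step to a membership test per category. The sketch below assumes an in-memory mapping from category to a set of normalized search sentences; the entries and the lower-casing normalization are illustrative assumptions, not part of the source.

```python
# Hypothetical vocabularies: category -> historical and candidate search sentences.
SPECIAL_VOCAB = {
    "fiction": {"tianlong ba bu", "tianlong ba bu txt"},
    "video": {"tianlong ba bu", "tianlong ba bu episode 1"},
    "weather": {"beijing weather"},
}


def candidate_categories(target: str, vocab: dict[str, set[str]] = SPECIAL_VOCAB) -> list[str]:
    """Return every preset special category whose vocabulary contains the target sentence."""
    query = target.strip().lower()  # assumed normalization
    return [category for category, sentences in vocab.items() if query in sentences]
```

A sentence recorded in several vocabularies yields several categories, which is why the click proportion is then needed to decide among them.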
  • The embodiments of the present application obtain a large number of historical search sentences from the historical search records and classify them, initially obtaining a special search vocabulary for each preset special category; a classification model is then established according to the special search vocabulary.
  • Candidate search sentences related to the historical search sentences are mined by means of the classification model and used to supplement and improve the corresponding special search vocabulary. That is, the special search vocabulary of a given preset special category includes both the historical search sentences belonging to that category and the corresponding candidate search sentences.
  • Compared with the manually set white list, fuzzy query threshold, and pattern matching keywords of the prior art, the search sentences in the special search vocabulary are more accurate, more comprehensive, and generalize better. Therefore, by performing search intent identification according to the special search vocabulary, the embodiments of the present application can identify the intent category of the target search sentence more accurately, and avoid erroneous recognition caused by manually specified rules being inconsistent with the actual judgment standard of the user.
  • FIG. 4 is a flowchart of another method for identifying a search intention according to an embodiment of the present application.
  • the search intent identification method shown in FIG. 4 includes, in addition to the above steps S1 to S3, the following:
  • Step S4 is performed before step S3, wherein step S4 may be performed in parallel with step S2, or may be performed after step S2.
  • the second preset time may be a shorter period of time such as the last 7 days.
  • the historical search sentences in the first historical search sentence set are classified according to the foregoing step S1, and a special search vocabulary corresponding to each preset special category is obtained.
  • Particular embodiments may include the following steps.
  • A clicked webpage is a webpage that the user clicks on in the search result page; relative to the other webpages, a clicked webpage better reflects the user's actual search intent for the corresponding search sentence.
  • The same historical search sentence may be searched multiple times, by the same user or by different users, within the first preset time. The search result pages obtained by these searches may differ, and so may the webpages that the user clicks to open (that is, the clicked webpages).
  • For example, for a historical search sentence C1, the webpages that the user finally clicks to open in one historical search record are A1 and A2, and in another historical search record they are A2 and A3.
  • The clicked webpages (A1, A2, and A3) recorded in all historical search records related to C1 constitute the clicked webpage combination corresponding to C1; in particular, since A2 exists in two different historical search records, it is also recorded twice in the clicked webpage combination.
  • The click proportion is calculated as follows: let n be the number of clicked webpages in the clicked webpage combination that belong to a certain preset special category B1, and let N be the total number of clicked webpages in the combination; the click proportion of the preset special category B1 is then K = n / N.
  • A clicked webpage that appears more than once in the combination is counted repeatedly.
  • For example, two historical search records of the historical search sentence C1 contain the clicked webpage A2, so A2 appears twice in the clicked webpage combination; when counting N and the number n for the preset special category to which A2 belongs, 2 is added for A2.
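The click proportion K = n / N, including the repeated counting of A2, can be sketched as follows. The records for C1 follow the example in the text, and A1 is assigned to "audio" as in the host-name example; the categories assigned to A2 and A3 are hypothetical and chosen only to make the arithmetic visible.

```python
from collections import Counter


def click_ratios(records: list[list[str]], page_category: dict[str, str]) -> dict[str, float]:
    """records holds one list of clicked pages per historical search record of a sentence."""
    # The clicked webpage combination keeps duplicates: a page clicked in two
    # records is counted twice, both in n and in N.
    combination = [page for record in records for page in record]
    total = len(combination)  # N
    if total == 0:
        return {}
    per_category = Counter(page_category[p] for p in combination if p in page_category)
    return {category: n / total for category, n in per_category.items()}  # K = n / N


records_c1 = [["A1", "A2"], ["A2", "A3"]]
categories = {"A1": "audio", "A2": "video", "A3": "fiction"}  # A2, A3: assumed
print(click_ratios(records_c1, categories))
# {'audio': 0.25, 'video': 0.5, 'fiction': 0.25}
```

Here N is 4 rather than 3 because A2 is counted once per record in which it was clicked.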
  • Determining the preset special category to which each clicked webpage belongs, as described in the foregoing step S102, may specifically be done as follows: obtain the URL of each clicked webpage in the clicked webpage combination; determine the host name (host) of the clicked webpage according to the URL; query the special site list corresponding to each preset special category, determine the special site list that contains the host name, and take the corresponding preset special category as the preset special category to which the clicked webpage belongs.
  • For example, the URL of the clicked webpage A1 is "http://music.***.com/", from which the host "music" is extracted. By querying the special site list corresponding to each preset special category, it is determined that "music" is in the special site list of the preset special category "audio", so the preset special category to which A1 belongs is "audio".
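The host-name lookup can be illustrated with the standard library's `urllib.parse.urlparse`. The site lists and the `example.com` hosts below are placeholders; the sketch also keys the lists on the full host name, whereas the example in the text uses only the leading label "music", so that choice is an assumption.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical special site lists: category -> set of host names.
SPECIAL_SITES = {
    "audio": {"music.example.com"},
    "video": {"video.example.com"},
}


def page_category(url: str, sites: dict[str, set[str]] = SPECIAL_SITES) -> Optional[str]:
    """Return the preset special category whose site list contains the URL's host."""
    host = urlparse(url).hostname  # lower-cased host, or None if the URL has none
    for category, hosts in sites.items():
        if host in hosts:
            return category
    return None  # the page belongs to no preset special category
```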
  • The special site list of a preset special category B may be obtained as follows: a number of search sentences whose search intention has already been determined to be the preset special category B are sent to the search engine, and the number of occurrences of each webpage host in the search results is counted.
  • The M hosts with the most occurrences can then constitute the special site list of the preset special category B.
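Counting hosts and keeping the top M maps directly onto `collections.Counter.most_common`. In this sketch the search-engine call is left out: `result_urls` stands for the URLs already returned for the seed search sentences of category B, and the function name is hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def build_site_list(result_urls: list[str], m: int) -> list[str]:
    """Return the m hosts that occur most often among the result URLs."""
    hosts = Counter()
    for url in result_urls:
        host = urlparse(url).hostname
        if host:  # skip URLs without a host part
            hosts[host] += 1
    return [host for host, _ in hosts.most_common(m)]
```

The value of M and the choice of seed sentences are left open by the text, so both remain tuning decisions.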
  • The preset threshold may be set to 0.3. Suppose that, after the above statistics and calculations, the clicked webpages corresponding to the historical search sentence "Dragon" belong to preset special categories such as "fiction", "video", and "audio", with click proportions of 0.35, 0.41, and 0.09 respectively; "fiction" and "video", whose click proportions exceed the preset threshold, are then taken as the intent categories of that historical search sentence.
  • The embodiments of the present application determine the intent category of each historical search sentence by analyzing the clicked webpages of that sentence in the historical search records; that is, in this embodiment the basis for classifying the historical search sentences also comes from the historical search records.
  • This further avoids the influence of manually specified rules on the special search vocabulary, and ensures the accuracy of search intention recognition based on the special search vocabulary.
  • the embodiment of the present application can also analyze the display webpage of each historical search sentence in the historical search record, obtain the proportion of the display webpage corresponding to the preset special category in all the displayed webpages, and compare with the preset threshold. Since the processing steps of displaying the webpage are the same as the processing steps of clicking the webpage, no further description is made here.
  • the display page specifically refers to the web page displayed in the search result page after the target search sentence is input.
  • the classification model is established according to the special search vocabulary described in the foregoing step S2, and the specific implementation manner may include the following steps:
  • the tokens obtained by segmenting the historical search sentence and the webpage title, and the character strings obtained by splitting the URL, are each represented as feature vectors in the feature space.
  • the word segmentation method can be used for segmentation.
  • for example, word segmentation may yield the two words “Beijing” and “weather”, from which the two corresponding feature vectors are obtained.
  • for the URL of the clicked webpage, the host and other special strings can be extracted and each represented as a feature vector.
  • the above feature vectors are all samples for establishing and training a classification model.
  • the feature vector related to the preset special category is a positive sample, and other unrelated feature vectors are negative samples.
  • this embodiment establishes a classification model based on the maximum entropy model, which can better process massive historical data, ensure the generalization of the special search lexicon, and improve the accuracy of intent recognition.
  • the search intent identification method may further include the following steps:
  • S501 Perform a vertical search on the target search sentence according to an intent category of the target search statement, to obtain a target webpage related to the target search sentence.
  • the above steps S501 to S503 implement web page search and display based on the search intention recognition result.
  • the same historical search sentence may belong to a plurality of preset special categories, and is entered into a plurality of special search lexicons; correspondingly, after step S3, there may be multiple intent categories of the target search sentences.
  • for example, when “Tianlong Ba Bu” is used as a target search sentence, its intent categories are found to include "video" and "fiction", so in step S501 vertical searches are performed for "video" and "fiction" respectively, yielding two types of target webpages; it is then necessary to decide how the two types of target webpages are arranged in the search result page.
  • the order of arrangement of the target web pages is determined according to the search intention levels of different intent categories.
  • the search intention level may be determined according to the click ratio calculated in the above step S102. For “Tianlong Ba Bu”, the click ratio corresponding to "fiction" is 0.35 and the click ratio corresponding to "video" is 0.41, so the search intention level of "video" is higher than that of "fiction"; that is, the users' historical search data indicates that a user searching for “Tianlong Ba Bu” is more likely to want "video" search results than "fiction" search results.
  • accordingly, the "video" target webpages are placed before the "fiction" target webpages (as in the search result page for “Tianlong Ba Bu” shown in FIG. 8), which better matches the actual needs of users, lets the user reach the desired target webpage faster, and improves the user experience.
  • the target search sentence “Ordinary World” also has the two intent categories “video” and “fiction”, but “fiction” has a larger click ratio than “video”, so the “fiction” target webpages are placed before the “video” target webpages.
  • the target page sorting method of the embodiment of the present application shown in FIG. 7 can produce different orderings for different target search sentences that share the same intent categories, which meets the user's actual search needs for each target search sentence and enhances the user experience.
  • when the click amount of the historical search sentence exceeds a preset threshold (for example, 10), the search intent level of each intent category may be determined from the ratio of the click amount of the corresponding preset special category to the total click amount of the historical search sentence.
  • when the click amount of the historical search sentence is lower than the preset threshold, the search intent level of each intent category may instead be determined from the ratio of the display amount of the corresponding preset special category in the search result page to the total display amount of the historical search sentence.
  • the target webpage is displayed in the search result page in the Aladdin form: the displayed content is not limited to the text of the traditional display form (mainly including the webpage title, the webpage content including the target search sentence, the webpage), and may also include pictures, sub-links, and the like from the target webpage, such as a TV drama poster, a novel cover, or links to each episode of a TV series. This not only shows the main content of the target webpage to the user more intuitively, but also makes it easier for the user to determine whether the target webpage satisfies his or her search intent, for example by distinguishing a video and a novel of the same name.
  • (for example, clicking the sub-link for the fifth episode shown in FIG. 8 opens the fifth episode directly, without first entering the online viewing page of the TV drama and then clicking the corresponding link to switch to the fifth episode).
  • in another feasible embodiment of the present application, the order determined according to the search intention level is used as an initial order, and the initial order is adjusted by click weights to obtain the final display order of each target webpage in the search result page, producing a search result page that better matches the search needs of most users.
  • the embodiment of the present application further provides a search intention identification device.
  • the device includes: a sample acquisition unit 100, a model control unit 200, and an intention recognition unit 300.
  • the sample obtaining unit 100 is configured to obtain a first historical search sentence set in a first preset time period, and classify the historical search sentences in the first historical search sentence set to obtain a special search vocabulary corresponding to each preset special category.
  • the model control unit 200 is configured to establish a classification model according to the special search vocabulary, obtain candidate search sentences corresponding to each preset special category by using the classification model, and record the candidate search sentences in the special search vocabulary of the corresponding category.
  • the intention identification unit 300 is configured to determine at least one preset special category according to the historical search sentence and the candidate search sentence in the special search vocabulary as the intent category corresponding to the target search sentence input by the user in real time.
  • the intent identification unit 300 may include a sentence receiving subunit 301, a category determining subunit 302, and an intent identifying subunit 303.
  • the sentence receiving subunit 301 is configured to obtain a target search sentence
  • the category determining sub-unit 302 is configured to determine at least one preset special category according to the historical search sentences and the candidate search sentences in the special search vocabulary, wherein each historical search sentence corresponds to at least one preset special category in the special search vocabulary, and the candidate search sentences are obtained by a classification model established according to the special search vocabulary;
  • the intent identification sub-unit 303 is configured to determine an intent category of the target search sentence according to the determined ratio of the click amount of the at least one preset special category to the total click amount of the target search sentence.
  • the intent identification sub-unit 303 is further configured to: calculate the ratio of the click amount of each determined preset special category to the total click amount of the target search sentence; and take each preset special category whose ratio exceeds the preset threshold as an intent category of the target search sentence.
  • the embodiment of the present application obtains a large number of historical search sentences in a historical search record and classifies them, and initially obtains a special search term database of each preset special category; and further establishes a classification model according to the special search term database.
  • candidate search sentences related to each historical search sentence are mined through the classification model and used to supplement and improve the corresponding special search vocabulary; that is, the special search vocabulary of a preset special category includes both the historical search sentences belonging to that category and the corresponding candidate search sentences. Compared with the whitelist, fuzzy query threshold, and pattern matching keywords set manually in the prior art, the search sentences in the special search vocabulary are more accurate, more comprehensive, and more generalizable.
  • the embodiment of the present application performs search intent recognition according to the special search vocabulary, which identifies the intent category of the target search sentence more accurately and avoids the misidentification that results when manually specified rules are inconsistent with the user's actual judgment criteria.
  • FIG. 12 is a schematic structural diagram of another search intention identification device provided by the present application.
  • the apparatus described with respect to FIG. 12 further includes an update unit 400.
  • the updating unit 400 is configured to acquire a second historical search sentence set in a second preset time, and train the classification model according to the second historical search sentence set to update the special search term library.
  • the timeliness of the special search vocabulary can be enhanced, and the search result page can be adjusted dynamically as time passes and current hot topics change, thereby further improving the user experience.
  • the sample obtaining unit 100 may specifically include: a click webpage obtaining unit 101 and a click webpage analyzing unit 102.
  • the click webpage obtaining unit 101 is configured to acquire a click webpage combination corresponding to each historical search sentence in the first historical search sentence set in the first preset time;
  • the click webpage analyzing unit 102 is configured to, for each historical search sentence, determine the preset special category to which each clicked webpage in its clicked webpage combination belongs, calculate the click ratio of the clicked webpages of each preset special category within the clicked webpage combination, take each preset special category whose click ratio is greater than the preset threshold as an intent category of the historical search sentence, and record each historical search sentence in the special search vocabulary corresponding to its intent category.
  • the click webpage analyzing unit 102 may include a click webpage classification module configured to: obtain the URL of each clicked webpage in the clicked webpage combination; determine the host name of the clicked webpage from the URL; query the special site list corresponding to each preset special category; and determine the preset special category whose special site list contains the host name, taking it as the preset special category to which the clicked webpage belongs.
  • the click webpage obtaining unit 101 and the click webpage analyzing unit 102 determine the intent category of each historical search sentence by analyzing the webpages clicked for it in the historical search record; that is, in this embodiment the basis for classifying historical search sentences also comes from the historical search record, which further avoids the influence of manually specified rules on the special search lexicon and ensures the accuracy of search intention recognition based on the special search vocabulary.
  • the model control unit 200 may specifically include: a sample data acquiring unit 201, a feature vector generating unit 202, and a model establishing unit 203.
  • the sample data obtaining unit 201 is configured to respectively obtain a URL corresponding to the clicked webpage and a webpage title for each historical search sentence in the special search vocabulary;
  • the feature vector generating unit 202 is configured to: segment each historical search sentence in the special search vocabulary and the corresponding webpage title and URL separately, and represent the tokens obtained by segmenting the historical search sentence and the webpage title, and the character strings obtained by splitting the URL, as feature vectors in the feature space;
  • the model establishing unit 203 is configured to establish a classification model based on the maximum entropy model according to the feature vectors, using the click ratio corresponding to the preset special category to which the clicked webpage belongs as the weight of the related feature vector.
  • a classification model based on the maximum entropy model is established, which can better process massive historical data, ensure the generalization of the special search vocabulary, and thereby improve the accuracy of the intent recognition.
  • the search intent identifying apparatus may further include a search display unit 500.
  • the search and display unit 500 specifically includes: a vertical search unit 501, a level determining unit 502, and a sorting unit 503.
  • the vertical search unit 501 is configured to perform a vertical search based on an intent category corresponding to the target search sentence, respectively, to obtain a target webpage related to the target search sentence;
  • the level determining unit 502 is configured to determine a search intent level according to a click ratio corresponding to the intent category;
  • the sorting unit 503 is configured to determine a display order of the respective target web pages according to the search intention level, and generate a search result page in the form of Aladdin.
  • the target page sorting method described in the embodiment of the present application can produce different orderings for different target search sentences that share the same intent categories, which meets the user's actual search needs for each target search sentence and improves the user experience.
  • the target webpage is displayed in the search result page in the Aladdin form, which not only shows the main content of the target webpage to the user more intuitively and makes it easier for the user to determine whether the target webpage satisfies his or her search intention, but also lets the user browse the corresponding content by clicking a sub-link, reducing the number of user steps.
  • the technology in the embodiments of the present invention can be implemented by software plus the necessary general-purpose hardware, including general-purpose integrated circuits, general-purpose CPUs, general-purpose memories, general-purpose components, and the like. It can also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like, but in many cases the former is the better implementation. Based on this understanding, the technical solution in the embodiments of the present invention may be embodied, in essence, in the form of a software product stored on a non-volatile computer-readable medium holding processor-executable program code, such as Read-Only Memory (ROM), Random Access Memory (RAM), a magnetic disk, or an optical disk, and including instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to perform the methods of the various embodiments of the present invention or portions of the embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A search intention identification method and device, comprising: acquiring a large amount of historical search terms in historical search records and classifying the same, and establishing a classification model based thereon, and mining via the classification model to obtain alternative search terms related to each historical search term, such that the historical search terms corresponding to the same preset special category and the alternative search terms thereof form a special search term database of the preset special categories. Compared with the prior art in which a white-list, a fuzzy query threshold, mode matching related keywords, etc., are manually set, the search terms in the special search term database are more accurate, more comprehensive, and highly generalizable. Therefore, identification of a search intention on the basis of the special term database can more accurately identify a target search term intention category, avoiding mistaken identifications caused by manually specified rules being inconsistent with actually-determined standards of a user.

Description

Search intention identification method and device
This application claims priority to Chinese Patent Application No. CN201510486646.X, entitled "Search intention identification method and device", filed with the Chinese Patent Office on August 7, 2015, the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of Internet technologies, and in particular to a search intention identification method and device.
Background
To provide more accurate search results, the major general-purpose search engines all offer a vertical search function. Unlike traditional general search, vertical search looks only within a specific special category for webpages related to the search sentence (word, phrase, etc.) entered by the user, for example music search, video search, or novel search. For a search engine to perform vertical search on a target search sentence automatically, it must also have a search intent recognition function, that is, it must identify from the target search sentence the special category the user wants to search. For example, if the target search sentence is “Tianlong Ba Bu” (天龙八部), search intent recognition can determine that the corresponding special category is video or fiction, and the search engine then performs a video search and a novel search respectively.
Existing intent recognition methods are usually based on a whitelist combined with fuzzy matching and pattern matching. Taking search in the special category "fiction" as an example, a whitelist covering as many search sentences (words, phrases, etc.) of the fiction category as possible is set in advance; on this basis, a fuzzy query threshold and pattern matching keywords related to the special category (such as "free reading", "free download", "txt download", "online reading") can also be set. At search time, the user's search intent is determined in at least one of the following ways: checking whether the target search sentence exists in the whitelist of each special category; checking whether the similarity between the target search sentence and some search sentence in the whitelist is greater than the fuzzy query threshold; or checking whether the target search sentence contains a pattern matching keyword of some special category.
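The prior-art approach described above can be sketched roughly as follows. This is only an illustration: the whitelist entries, the keyword list, the 0.8 threshold, the function name `baseline_intent`, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for the example, not details given in the source.

```python
from difflib import SequenceMatcher

# Hypothetical, hand-maintained resources for one special category.
WHITELIST = {"fiction": {"ordinary world", "tianlong ba bu"}}
PATTERN_KEYWORDS = {"fiction": ("free reading", "txt download", "online reading")}
FUZZY_THRESHOLD = 0.8  # assumed value; the source does not give one


def baseline_intent(query):
    """Return the categories matched by whitelist, fuzzy, or keyword rules."""
    q = query.lower().strip()
    matched = set()
    for category, entries in WHITELIST.items():
        # Rule 1: exact whitelist hit.
        if q in entries:
            matched.add(category)
        # Rule 2: similarity to some whitelist entry above the threshold.
        elif any(SequenceMatcher(None, q, e).ratio() > FUZZY_THRESHOLD
                 for e in entries):
            matched.add(category)
    for category, keywords in PATTERN_KEYWORDS.items():
        # Rule 3: the query contains a pattern matching keyword.
        if any(k in q for k in keywords):
            matched.add(category)
    return matched
```

Every entry in these tables has to be written by hand, which is the coverage limitation the following paragraph points out.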
In practice, the search sentences entered by users vary endlessly, while the whitelist, fuzzy query threshold, and pattern matching keywords used by the above intent recognition methods are all set manually. They cover a limited set of search sentences and generalize poorly, so it is difficult to accurately identify the special category corresponding to each target search sentence.
Summary of the invention
The embodiments of the present application provide a search intention identification method and device to solve the problems of poor generalization and low recognition accuracy of the search intent recognition approaches in the prior art.
A first aspect of the present application provides a search intention identification method, the method comprising: obtaining a first historical search sentence set in a first preset time, and classifying the historical search sentences in the first historical search sentence set to obtain a special search vocabulary corresponding to each preset special category; establishing a classification model according to the special search vocabulary, obtaining candidate search sentences corresponding to each preset special category by using the classification model, and recording the candidate search sentences in the special search vocabulary of the corresponding category; and determining at least one preset special category, according to the historical search sentences and the candidate search sentences in the special search vocabulary, as the intent category of a target search sentence.
With reference to the first aspect, in a first feasible implementation of the first aspect, the method further comprises: acquiring a second historical search sentence set in a second preset time, and training the classification model according to the second historical search sentence set to update the special search vocabulary.
With reference to the first aspect, or the first feasible implementation of the first aspect, in a second feasible implementation of the first aspect, classifying the historical search sentences in the first historical search sentence set to obtain a special search vocabulary corresponding to each preset special category comprises: acquiring the clicked webpage combination corresponding to each historical search sentence in the first historical search sentence set in the first preset time; and, for each historical search sentence, determining the preset special category to which each clicked webpage in its clicked webpage combination belongs, calculating the click ratio of the clicked webpages of each preset special category within the clicked webpage combination, taking each preset special category whose click ratio is greater than a preset threshold as an intent category of the historical search sentence, and recording each historical search sentence in the special search vocabulary corresponding to its intent category.
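The click-ratio classification in this implementation can be sketched as below. The function name `intent_categories` and the representation of a clicked webpage combination as a list of category labels (with `None` for pages outside every preset special category) are assumptions for illustration; the default threshold of 0.3 and the 0.35 / 0.41 / 0.09 ratios are the example values given in the embodiment.

```python
from collections import Counter


def intent_categories(clicked_page_categories, threshold=0.3):
    """Map each category whose click ratio exceeds the threshold to that ratio.

    clicked_page_categories holds one label per clicked webpage; None marks a
    page that belongs to no preset special category (an assumed convention).
    """
    total = len(clicked_page_categories)
    if total == 0:
        return {}
    counts = Counter(c for c in clicked_page_categories if c is not None)
    return {cat: n / total for cat, n in counts.items() if n / total > threshold}


# 100 clicks reproducing the example ratios 0.35, 0.41, and 0.09.
clicks = ["fiction"] * 35 + ["video"] * 41 + ["audio"] * 9 + [None] * 15
result = intent_categories(clicks)
```

With these inputs only "fiction" and "video" pass the threshold, so the search sentence would be recorded in both of those special search vocabularies and not in the "audio" one.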
With reference to the second feasible implementation of the first aspect, in a third feasible implementation of the first aspect, determining the preset special category to which each clicked webpage in the clicked webpage combination of a historical search sentence belongs comprises: for each clicked webpage in the clicked webpage combination, obtaining its URL; determining the host name of the clicked webpage from the URL; and querying the special site list corresponding to each preset special category, determining the preset special category whose special site list contains the host name, and taking it as the preset special category to which the clicked webpage belongs.
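A minimal sketch of the host lookup follows, together with the top-M construction of a special site list that the embodiment describes (counting hosts in the search results of sentences known to belong to a category). The site lists, the `example.com` URLs, and the function names are hypothetical; taking the leftmost label of the host name mirrors the "music" example in the embodiment, whose actual domain is elided in the source.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical special site lists keyed by preset special category.
SPECIAL_SITE_LISTS = {
    "audio": {"music"},
    "video": {"video", "tv"},
    "fiction": {"book"},
}


def host_label(url):
    """Return the leftmost label of the URL's host name, e.g. 'music'."""
    hostname = urlsplit(url).hostname or ""
    return hostname.split(".")[0]


def page_category(url, site_lists=SPECIAL_SITE_LISTS):
    """Return the category whose site list contains the page's host, if any."""
    host = host_label(url)
    for category, hosts in site_lists.items():
        if host in hosts:
            return category
    return None


def build_site_list(result_urls, m):
    """Keep the M hosts that occur most often in a category's search results."""
    counts = Counter(host_label(u) for u in result_urls)
    return {host for host, _ in counts.most_common(m)}
```

For instance, `page_category("http://music.example.com/")` returns `"audio"` under the tables above, while a URL whose host appears in no list yields `None`.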
With reference to the second feasible implementation of the first aspect, in a fourth feasible implementation of the first aspect, establishing a classification model according to the special search vocabulary comprises: for each historical search sentence in the special search vocabulary, obtaining the URL and the webpage title of each of its clicked webpages; segmenting each historical search sentence in the special search vocabulary and the corresponding webpage title and URL separately; representing the tokens obtained by segmenting the historical search sentence and the webpage title, and the character strings obtained by splitting the URL, as feature vectors in the feature space; and establishing a classification model based on the maximum entropy model according to the feature vectors, using the click ratio corresponding to the preset special category to which the clicked webpage belongs as the weight of the related feature vector.
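The steps above might be sketched as follows for a single category. A two-class maximum entropy model is equivalent to logistic regression, so this sketch trains one with plain stochastic gradient steps, scaling each sample's update by its weight. Everything concrete here is an assumption: whitespace-style tokenization stands in for real word segmentation, the sample data and URLs are invented, negative samples are given weight 1.0, and the function names are illustrative only.

```python
import math
import re
from urllib.parse import urlsplit


def tokens(query, title, url):
    """Split query and title into words and the URL into host/path strings."""
    words = re.findall(r"\w+", f"{query} {title}".lower())
    parts = urlsplit(url)
    raw = f"{parts.hostname or ''}{parts.path}"
    url_strings = [s for s in re.split(r"[./]", raw) if s]
    # Prefix URL strings so they do not collide with ordinary words.
    return words + ["url:" + s for s in url_strings]


def train_maxent(samples, epochs=200, lr=0.5):
    """samples: (token_list, label in {0, 1}, weight). Returns token weights."""
    w = {}
    for _ in range(epochs):
        for toks, label, weight in samples:
            feats = set(toks)  # binary bag-of-tokens features
            z = sum(w.get(t, 0.0) for t in feats)
            p = 1.0 / (1.0 + math.exp(-z))
            step = lr * weight * (label - p)  # weighted log-likelihood gradient
            for t in feats:
                w[t] = w.get(t, 0.0) + step
    return w


def predict(w, toks):
    """Probability that the token list belongs to the positive category."""
    z = sum(w.get(t, 0.0) for t in set(toks))
    return 1.0 / (1.0 + math.exp(-z))


# Invented samples for the "video" category: positives weighted by click ratio.
samples = [
    (tokens("dragon tv series", "Dragon episode 1 watch online",
            "http://video.example.com/dragon/1"), 1, 0.41),
    (tokens("ordinary world film", "Ordinary World watch online",
            "http://video.example.com/ow"), 1, 0.40),
    (tokens("beijing weather", "Beijing weather forecast",
            "http://weather.example.com/beijing"), 0, 1.0),
    (tokens("train tickets", "Buy train tickets",
            "http://travel.example.com/tickets"), 0, 1.0),
]
model = train_maxent(samples)
```

A search sentence scored above some cutoff by `predict` could then be treated as a candidate search sentence for the category; the choice of cutoff is not specified in the source.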
With reference to the second feasible implementation of the first aspect, in a fifth feasible implementation of the first aspect, after the intent category of the target search sentence is determined, the method further comprises: performing a vertical search on the target search sentence according to its intent category to obtain target webpages related to the target search sentence; determining the search intent level of the intent category according to the click ratio corresponding to the intent category; and determining the display order of the target webpages corresponding to each intent category according to the search intent level, and generating a search result page in the Aladdin form.
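The ordering step can be sketched as below. The function names and data shapes are assumptions for illustration; the fallback from click ratio to display ratio when a sentence has few clicks, and the example cutoff of 10 clicks, follow the embodiment's description of how the search intent level is determined.

```python
def intent_level(cat_clicks, total_clicks, cat_shows, total_shows, min_clicks=10):
    """Use the click ratio when there are enough clicks, else the display ratio."""
    if total_clicks > min_clicks:
        return cat_clicks / total_clicks
    return cat_shows / total_shows if total_shows else 0.0


def order_target_pages(levels, pages_by_category):
    """Flatten target webpages, highest search intent level first."""
    ranked = sorted(levels, key=levels.get, reverse=True)
    return [(cat, page) for cat in ranked
            for page in pages_by_category.get(cat, [])]


# Example levels taken from the embodiment: "video" 0.41, "fiction" 0.35.
levels = {"fiction": 0.35, "video": 0.41}
pages = {"fiction": ["f1"], "video": ["v1", "v2"]}
ordered = order_target_pages(levels, pages)
```

Here the "video" target webpages come before the "fiction" ones; for a search sentence whose "fiction" click ratio is larger, the same function would reverse the order.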
A second aspect of the present application provides a search intention identification device, the device comprising: a sample acquisition unit, a model control unit, and an intention recognition unit. The sample acquisition unit is configured to obtain a first historical search sentence set in a first preset time, and classify the historical search sentences in the first historical search sentence set to obtain a special search vocabulary corresponding to each preset special category. The model control unit is configured to establish a classification model according to the special search vocabulary, obtain candidate search sentences corresponding to each preset special category by using the classification model, and record the candidate search sentences in the special search vocabulary of the corresponding category. The intention recognition unit is configured to determine at least one preset special category, according to the historical search sentences and the candidate search sentences in the special search vocabulary, as the intent category of a target search sentence.
With reference to the second aspect, in a first feasible implementation of the second aspect, the device further comprises: an update unit, configured to acquire a second historical search sentence set in a second preset time, and train the classification model according to the second historical search sentence set to update the special search vocabulary.
With reference to the second aspect, or the first feasible implementation of the second aspect, in a second feasible implementation of the second aspect, the sample acquisition unit comprises: a click webpage obtaining unit, configured to acquire the clicked webpage combination corresponding to each historical search sentence in the first historical search sentence set in the first preset time; and a click webpage analyzing unit, configured to, for each historical search sentence, determine the preset special category to which each clicked webpage in its clicked webpage combination belongs, calculate the click ratio of the clicked webpages of each preset special category within the clicked webpage combination, take each preset special category whose click ratio is greater than a preset threshold as an intent category of the historical search sentence, and record each historical search sentence in the special search vocabulary corresponding to its intent category.
With reference to the second feasible implementation of the second aspect, in a third feasible implementation of the second aspect, the click webpage analyzing unit comprises a click webpage classification module, which is configured to: for each clicked webpage in the clicked webpage combination, obtain its URL; determine the host name of the clicked webpage from the URL; and query the special site list corresponding to each preset special category, determine the preset special category whose special site list contains the host name, and take it as the preset special category to which the clicked webpage belongs.
With reference to the second feasible implementation of the second aspect, in a fourth feasible implementation of the second aspect, the model control unit comprises: a sample data acquiring unit, configured to obtain, for each historical search sentence in the special search vocabulary, the URL and the webpage title of each of its clicked webpages; a feature vector generating unit, configured to segment each historical search sentence in the special search vocabulary and the corresponding webpage title and URL separately, and represent the tokens obtained by segmenting the historical search sentence and the webpage title, and the character strings obtained by splitting the URL, as feature vectors in the feature space; and a model establishing unit, configured to establish a classification model based on the maximum entropy model according to the feature vectors, using the click ratio corresponding to the preset special category to which the clicked webpage belongs as the weight of the related feature vector.
With reference to the second possible implementation of the second aspect, in a fifth possible implementation of the second aspect, the apparatus further includes a search display unit. The search display unit includes: a vertical search unit, configured to perform a vertical search on the target search sentence according to its intent category to obtain target webpages related to the target search sentence; a level determining unit, configured to determine the search intent level of each intent category according to the click ratio corresponding to that intent category; and a sorting unit, configured to determine the display order of the target webpages of each intent category according to the search intent level and to generate a search result page in the Aladdin form.
A third aspect of the present application provides a search intent identification method. The method includes: obtaining a target search sentence; determining at least one preset special category according to the historical search sentences and the candidate search sentences in a special search vocabulary, where in the special search vocabulary each historical search sentence corresponds to at least one preset special category and the candidate search sentences are obtained through a classification model established from the special search vocabulary; and determining the intent category of the target search sentence according to the ratio of the click count of each determined preset special category to the total click count of the target search sentence.
A fourth aspect of the present application provides a search intent identification apparatus. The apparatus includes a sentence receiving unit, a category determining unit, and an intent identification unit. The sentence receiving unit is configured to obtain a target search sentence. The category determining unit is configured to determine at least one preset special category according to the historical search sentences and the candidate search sentences in a special search vocabulary, where in the special search vocabulary each historical search sentence corresponds to at least one preset special category and the candidate search sentences are obtained through a classification model established from the special search vocabulary. The intent identification unit is configured to determine the intent category of the target search sentence according to the ratio of the click count of each determined preset special category to the total click count of the target search sentence.
As can be seen from the above technical solutions, the embodiments of the present application obtain the massive set of historical search sentences in the historical search records and classify them, yielding a preliminary special search vocabulary for each preset special category. A classification model is then established from the special search vocabulary and used to mine candidate search sentences related to the historical search sentences, and these candidate search sentences supplement and complete the corresponding special search vocabulary. The special search vocabulary of a given preset special category therefore contains both the historical search sentences that belong to that category and the corresponding candidate search sentences. Compared with the manually configured whitelists, fuzzy-query thresholds, and pattern-matching keywords of the prior art, the search sentences in the special search vocabulary are more accurate, more comprehensive, and generalize better. Performing search intent identification against this special search vocabulary therefore identifies the intent category of the target search sentence more accurately and avoids the misidentification that arises when manually specified rules differ from the criteria users actually apply.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment of the search intent identification method and apparatus provided by an embodiment of the present application;
FIG. 2 is a structural block diagram of a server provided by an embodiment of the present application;
FIG. 3 is a flowchart of a search intent identification method provided by an embodiment of the present application;
FIG. 4 is a flowchart of another search intent identification method provided by an embodiment of the present application;
FIG. 5 is a flowchart of the historical search sentence classification process in the search intent identification method provided by an embodiment of the present application;
FIG. 6 is a flowchart of the classification model establishment process in the search intent identification method provided by an embodiment of the present application;
FIG. 7 is a flowchart of the search result page generation process in the search intent identification method provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a search result page provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of another search result page provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a search intent identification apparatus provided by an embodiment of the present application;
FIG. 11 is a schematic structural diagram of another search intent identification apparatus provided by an embodiment of the present application;
FIG. 12 is a schematic structural diagram of yet another search intent identification apparatus provided by an embodiment of the present application.
Detailed Description
The following embodiments of the present application can all be applied in the environment shown in FIG. 1. As shown in FIG. 1, a server 100 is communicatively connected to one or more user terminals 200 through a network 300 for data communication or interaction. In the embodiments of the present application, a client is installed in the user terminal 200. The client may be third-party application software (such as a search engine) that corresponds to the server 100 and provides services (such as information queries) to the user. The server 100 may be a group of servers, such as a database server, an instant messaging server, a web server, and an authentication server, or it may be a single server. The user terminal 200 may be a personal computer (PC), a tablet computer, a smartphone, a personal digital assistant (PDA), an e-book reader, a laptop computer, an in-vehicle computer, or the like. The network 300 may be a wireless network or a wired network; the wireless network may be, but is not limited to, a Wi-Fi (Wireless Fidelity) network, a 2G/3G/4G network, or the like.
FIG. 2 is a structural block diagram of a server 100 that can be used in the embodiments of the present application. As shown in FIG. 2, the server 100 may include the search intent identification method and apparatus provided by the embodiments of the present application, a memory 102, a storage controller 103, a processor 104, and a network module 105.
The memory 102, the storage controller 103, the processor 104, and the network module 105 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected through one or more communication buses or signal buses. The search intent identification method and apparatus include at least one software function module that can be stored in the memory 102 in the form of software or firmware, for example the software function modules or computer programs included in the search intent identification method and apparatus.
The memory 102 can store various software programs and modules, such as the program instructions or modules corresponding to the search intent identification method and apparatus provided by the embodiments of the present application. By running the software programs and modules stored in the memory 102, the processor 104 executes various functional applications and data processing, that is, it implements the search intent identification method of the embodiments of the present application. The memory 102 may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), and Electrically Erasable Programmable Read-Only Memory (EEPROM).
The processor 104 may be an integrated circuit chip with signal processing capability. The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The network module 105 is configured to receive and send network signals. The network signals may include wireless signals or wired signals.
It can be understood that the structure shown in FIG. 2 is only illustrative. The server 100 may include more or fewer components than those shown in FIG. 2, or have a configuration different from that shown in FIG. 2. Each component shown in FIG. 2 may be implemented in hardware, software, or a combination thereof. In addition, the server 100 in the embodiments of the present application may include multiple servers with different specific functions.
An embodiment of a search intent identification method of the present invention is described first. FIG. 3 is a flowchart of a search intent identification method of the present invention. The method includes the following steps:
S1: Obtain a first historical search sentence set within a first preset time, and classify the historical search sentences in the first historical search sentence set to obtain the special search vocabulary corresponding to each preset special category.
The server stores each user's historical search records. A historical search record includes data such as the search sentence entered by the user, the query results returned by the server, and the webpages in the query results that the user clicked and browsed. The first preset time may be, for example, the most recent year, quarter, or month. The search sentences recorded in all historical search records within the first preset time constitute the first historical search sentence set. Determining the preset special category to which each historical search sentence in the first historical search sentence set belongs completes the classification and yields the special search vocabularies. The preset special categories include, but are not limited to, novels, audio, video, weather, exchange rates, and flight ticket queries.
S2: Establish a classification model according to the special search vocabulary, obtain the candidate search sentences corresponding to each preset special category through the classification model, and record the candidate search sentences in the special search vocabulary of the corresponding category.
This embodiment uses the special search vocabulary of each preset special category as samples to establish a classification model. Training the classification model yields candidate search sentences related to the historical search sentences in the special search vocabulary, and these candidate search sentences are also stored in the corresponding special search vocabulary.
S3: Obtain a target search sentence, and determine at least one preset special category, according to the historical search sentences and the candidate search sentences in the special search vocabulary, as the intent category of the target search sentence.
A target search sentence entered by the user or sent by another device is obtained, and at least one preset special category is determined according to the historical search sentences and the candidate search sentences in the special search vocabulary. The intent category of the target search sentence is then determined according to the ratio of the click count of each determined preset special category to the total click count of the target search sentence. For example, this ratio may be calculated for each determined preset special category, and each preset special category whose ratio is greater than a preset threshold is taken as an intent category of the target search sentence.
Specifically, the intent category of the target search sentence may be determined by the size of the ratio of the click count of one or several preset special categories to the total click count of the target search sentence. For example, the target search sentence "Tian Long Ba Bu" corresponds to preset special categories such as "novel", "video", and "audio", with ratios of 0.35, 0.41, and 0.09 respectively. Assuming a preset threshold of 0.3, the webpages of "novel" and "video" both account for more than the preset threshold, so "novel" and "video" can be taken as the intent categories of the target search sentence.
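The threshold rule in this example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `select_intent_categories`, the click counts, and the extra "other" bucket (added so that the ratios 0.35, 0.41, and 0.09 from the text have a total to be measured against) are hypothetical and do not come from the application.

```python
def select_intent_categories(category_clicks, threshold=0.3):
    """Return the categories whose share of all clicks exceeds the threshold.

    category_clicks maps a preset special category to its click count for
    one target search sentence. Categories are returned in descending
    order of click ratio.
    """
    total = sum(category_clicks.values())
    if total == 0:
        return []
    ratios = {cat: n / total for cat, n in category_clicks.items()}
    selected = [cat for cat, ratio in ratios.items() if ratio > threshold]
    return sorted(selected, key=lambda cat: ratios[cat], reverse=True)


# Hypothetical click counts chosen to reproduce the ratios in the text
# (novel 0.35, video 0.41, audio 0.09); "other" is an assumed remainder.
clicks = {"novel": 35, "video": 41, "audio": 9, "other": 15}
intent_categories = select_intent_categories(clicks, threshold=0.3)
```

With these counts, only "video" and "novel" pass the 0.3 threshold, which matches the outcome described above.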
In the embodiments of the present application, the special search vocabulary plays the role of a whitelist: the target search sentence is matched against it to determine the corresponding intent category. Because this vocabulary is built from the historical search sentences in massive historical search records, the number of search sentences in the special search vocabulary of each preset special category is far larger than the number in a manually configured whitelist. This overcomes the incomplete coverage and poor generalization of search sentences in the prior art, so the intent category of a target search sentence entered by the user in real time can be determined accurately.
As can be seen from the above technical solution, the embodiments of the present application obtain the massive set of historical search sentences in the historical search records and classify them, yielding a preliminary special search vocabulary for each preset special category. A classification model is then established from the special search vocabulary and used to mine candidate search sentences related to the historical search sentences, and these candidate search sentences supplement and complete each corresponding special search vocabulary. The special search vocabulary of a given preset special category therefore contains both the historical search sentences that belong to that category and the corresponding candidate search sentences. Compared with the manually configured whitelists, fuzzy-query thresholds, and pattern-matching keywords of the prior art, the search sentences in the special search vocabulary are more accurate, more comprehensive, and generalize better. Performing search intent identification against this special search vocabulary therefore identifies the intent category of the target search sentence more accurately and avoids the misidentification that arises when manually specified rules differ from the criteria users actually apply.
FIG. 4 is a flowchart of another search intent identification method provided by an embodiment of the present application. Compared with FIG. 3, the search intent identification method shown in FIG. 4 includes, in addition to steps S1 to S3 above:
S4: Obtain a second historical search sentence set within a second preset time, and train the classification model according to the second historical search sentence set to update the special search vocabulary.
Step S4 is performed before step S3. It may be performed in parallel with step S2 or after step S2.
The second preset time may be a relatively short period, such as the most recent 7 days. The update in step S4 makes it possible to discover new search sentences promptly and to adjust the click ratios of the different intent categories of existing search sentences. This improves the timeliness of the special search vocabulary, lets the search result page adjust dynamically as time passes and current trends change, and further improves the user experience.
Referring to FIG. 5, in one possible embodiment of the present application, the classification in step S1 of the historical search sentences in the first historical search sentence set, which yields the special search vocabulary corresponding to each preset special category, may include the following steps.
S101: Obtain the clicked webpage combination corresponding to each historical search sentence in the first historical search sentence set within the first preset time.
A clicked webpage is a webpage that the user clicks into from the search result page. Compared with other webpages, a clicked webpage reflects the user's actual search intent for the corresponding search sentence. A historical search sentence may be searched multiple times within the first preset time, by the same user or by different users. Because of factors such as the timeliness of webpages, the search result page differs from search to search, and so do the webpages the user clicks open (the clicked webpages). For example, for historical search sentence C1, one related historical search record shows that the user clicked open webpages A1 and A2, and another related historical search record shows that the user clicked open webpages A2 and A3. The clicked webpages recorded in all historical search records related to C1 (A1, A2, and A3) constitute the clicked webpage combination of C1. In particular, because A2 appears in two different historical search records, A2 is recorded twice in the clicked webpage combination.
S102: For each historical search sentence, determine the preset special category to which each clicked webpage in its clicked webpage combination belongs, and calculate the click ratio that the clicked webpages of each preset special category account for in the clicked webpage combination.
To calculate the ratio that the clicked webpages of each preset special category account for in the clicked webpage combination, count the number n of clicked webpages that belong to a preset special category B1 and the total number N of clicked webpages in the clicked webpage combination. The ratio of the clicked webpages of preset special category B1 is then K = n/N. Clicked webpages that appear repeatedly in the clicked webpage combination are counted repeatedly. For example, both historical search records of the historical search sentence C1 above contain clicked webpage A2, so A2 appears at least twice in the clicked webpage combination, and 2 is added both to N and to the count n of the preset special category to which A2 belongs.
Specifically, in one possible embodiment, the preset special category to which each clicked webpage belongs in step S102 may be determined as follows: for each clicked webpage in the clicked webpage combination, obtain its URL; determine the host name (host) of the clicked webpage from the URL; query the special site list corresponding to each preset special category; determine the preset special category whose special site list contains the host name; and take that category as the preset special category to which the clicked webpage belongs. For example, in the clicked webpage combination of C1, the URL of clicked webpage A1 is "http://music.***.com/…", from which the host "music" is extracted. Checking the special site lists of the preset special categories shows that "music" is in the special site list of the preset special category "audio", so the preset special category to which A1 belongs is determined to be "audio".
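The host lookup and the ratio K = n/N described above can be sketched in Python as follows. The site lists, the `example.com` host names, the assignment of A2 and A3 to "video" and "novel", and the function names are all hypothetical placeholders chosen for illustration. Only the counting rule (repeated clicked webpages are counted repeatedly) is taken from the text.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical special site lists: preset special category -> set of hosts.
SITE_LISTS = {
    "audio": {"music.example.com"},
    "video": {"video.example.com"},
    "novel": {"novel.example.com"},
}


def category_of(url, site_lists=SITE_LISTS):
    """Return the category whose site list contains the URL's host, or None."""
    host = urlsplit(url).hostname
    for category, hosts in site_lists.items():
        if host in hosts:
            return category
    return None


def click_ratios(clicked_urls, site_lists=SITE_LISTS):
    """Compute K = n / N per category for one clicked webpage combination.

    clicked_urls is the full combination, so a page clicked in two search
    records appears twice and is counted twice in both n and N.
    """
    total = len(clicked_urls)  # N
    if total == 0:
        return {}
    counts = Counter(category_of(url, site_lists) for url in clicked_urls)
    counts.pop(None, None)  # pages whose host is in no site list
    return {category: n / total for category, n in counts.items()}


# Clicked webpage combination of C1: A1, A2, A2, A3 (A2 recorded twice).
c1_clicks = [
    "http://music.example.com/a1",
    "http://video.example.com/a2",
    "http://video.example.com/a2",
    "http://novel.example.com/a3",
]
c1_ratios = click_ratios(c1_clicks)
```

Under these assumptions N is 4, the category of A2 has n = 2 and a ratio of 0.5, and the categories of A1 and A3 each have a ratio of 0.25.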
The special site list of a preset special category B may be obtained as follows: send to the search engine a number of search sentences whose search intent is known to be preset special category B, count the webpage hosts in the search results, and take the M hosts that appear most often as the special site list of preset special category B.
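A minimal sketch of this top-M selection is shown below. It assumes that the result URLs for the seed search sentences have already been collected; the function name `build_site_list` and the example URLs are hypothetical, and the step of actually querying a search engine is deliberately left out.

```python
from collections import Counter
from urllib.parse import urlsplit


def build_site_list(result_urls, m):
    """Return the m hosts that occur most often among the result URLs."""
    host_counts = Counter(urlsplit(url).hostname for url in result_urls)
    return [host for host, _ in host_counts.most_common(m)]


# Hypothetical result URLs returned for seed sentences of category "novel".
results = [
    "http://novel.example.com/book/1",
    "http://novel.example.com/book/2",
    "http://books.example.org/read/7",
    "http://novel.example.com/book/3",
    "http://books.example.org/read/9",
    "http://misc.example.net/page",
]
novel_site_list = build_site_list(results, m=2)
```

Here `novel.example.com` appears three times and `books.example.org` twice, so with M = 2 those two hosts form the site list.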
S103: Take each preset special category whose click ratio is greater than a preset threshold as an intent category of the corresponding historical search sentence, and record that historical search sentence in the special search vocabulary of each of its intent categories.
In practice, a search sentence may have a single intent category or several. In the classification process shown in FIG. 5, different users may have different search intents for the same historical search sentence, so more than one click ratio may exceed the preset threshold, and the historical search sentence is then recorded in several special search vocabularies. For example, the preset threshold may be set to 0.3. Suppose that, after the above counting and calculation, the clicked webpages of the historical search sentence "Tian Long Ba Bu" fall into preset special categories such as "novel", "video", and "audio", with click ratios of 0.35, 0.41, and 0.09 respectively. The click ratios of "novel" and "video" are both greater than the preset threshold, so both are intent categories of "Tian Long Ba Bu", and the sentence is recorded in the special search vocabularies of both "novel" and "video". When the target search sentence is "Tian Long Ba Bu", target webpages of both the "novel" and "video" categories can then be shown to the user, which matches actual search needs.
As steps S101 to S103 show, the embodiments of the present application determine the intent category of each historical search sentence by analyzing the clicked webpages of that sentence in the historical search records. In other words, the basis on which this embodiment classifies historical search sentences also comes from the historical search records. This further avoids the influence of manually specified rules on the special search vocabulary and ensures the accuracy of search intent identification based on the special search vocabulary.
It can be understood that the embodiments of the present application may also analyze the displayed webpages of each historical search sentence in the historical search records, obtain the proportion that the displayed webpages of a preset special category account for among all displayed webpages, and compare it with a preset threshold. Because the processing steps for displayed webpages are the same as those for clicked webpages, they are not repeated here. A displayed webpage is a webpage shown on the search result page after the target search sentence is entered.
Referring to FIG. 6, in one feasible embodiment of the present application, establishing a classification model according to the special search lexicons in step S2 may include the following steps:
S201. For each historical search sentence in the special search lexicons, obtain the URL and page title of each of its clicked web pages.
S202. Segment each historical search sentence in the special search lexicons, together with the corresponding page titles and URLs.
S203. Represent each token obtained by segmenting the historical search sentences and page titles, and each character string obtained by segmenting the URLs, as a feature vector in the feature space.
The historical search sentences and page titles may be segmented with a word segmentation method. For example, word segmentation splits the historical search sentence “北京天气” (Beijing weather) into the two tokens “北京” (Beijing) and “天气” (weather), which yield two feature vectors. For the URL of a clicked web page, the host and other special character strings may be extracted, and each of them may be represented as a feature vector.
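Steps S202 and S203 can be sketched as below. The text does not name a segmentation algorithm or a vector encoding, so everything here is an assumption made for illustration: forward maximum matching against a small hypothetical dictionary stands in for the word segmenter, URL path segments stand in for the "other special character strings", one-hot encoding stands in for the feature-space representation, and the URL is invented.

```python
from urllib.parse import urlsplit

def segment(text, vocabulary, max_len=4):
    """Forward maximum matching: greedily take the longest known word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when no dictionary word matches.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens

def url_features(url):
    """Extract the host plus the non-empty path segments of a URL."""
    parts = urlsplit(url)
    return [parts.hostname] + [seg for seg in parts.path.split("/") if seg]

def one_hot(feature, index):
    """Encode one feature as a one-hot vector over the feature index."""
    vec = [0] * len(index)
    vec[index[feature]] = 1
    return vec

vocab = {"北京", "天气"}  # hypothetical dictionary
query_tokens = segment("北京天气", vocab)
features = url_features("http://video.example.com/tv/episode5")  # invented URL
index = {f: k for k, f in enumerate(query_tokens + features)}
weather_vec = one_hot("天气", index)
```

A production system would use a real Chinese word segmenter and a sparse encoding; the sketch only shows how tokens and URL strings end up in one shared feature space.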
S204. Using the feature vectors, with the click proportion of the preset special category to which the clicked web page belongs as the weight of the related feature vectors, establish a classification model based on the maximum entropy model.
Note that the feature vectors are the samples used to build and train the classification model. For a given preset special category, the feature vectors related to that category are positive samples and the unrelated feature vectors are negative samples. Because the maximum entropy model performs well at probability estimation, a classification model based on it can better handle massive historical data and ensure the generalization of the special search lexicons, which in turn improves the accuracy of intent recognition.
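A weighted classifier of this kind can be sketched as below. The text gives no training procedure, so this is only an illustration under several assumptions: one binary model is trained per category, the two-class maximum entropy model is written as logistic regression, training uses plain stochastic gradient ascent, and the click proportion scales each sample's gradient step. The function names, host names and sample data are hypothetical.

```python
import math

def train_maxent(samples, epochs=200, lr=0.5):
    """Train a binary maximum-entropy (logistic regression) classifier.

    samples: list of (features, label, weight) where features is a list of
    feature names, label is 1 (positive) or 0 (negative), and weight scales
    the sample's contribution to each gradient step.
    """
    weights = {}
    for _ in range(epochs):
        for features, label, sample_weight in samples:
            score = sum(weights.get(f, 0.0) for f in features)
            prob = 1.0 / (1.0 + math.exp(-score))
            step = lr * sample_weight * (label - prob)
            for f in features:
                weights[f] = weights.get(f, 0.0) + step
    return weights

def predict(weights, features):
    """Probability that the features belong to the category being modelled."""
    score = sum(weights.get(f, 0.0) for f in features)
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical training data for the "video" category.
video_samples = [
    (["天龙八部", "video.example.com"], 1, 0.41),            # positive, weight = click proportion
    (["天龙八部", "novel.example.com"], 0, 0.35),            # negative for "video"
    (["北京", "天气", "weather.example.com"], 0, 1.0),       # unrelated query
]
video_model = train_maxent(video_samples)
```

Candidate search sentences for a category could then be those whose predicted probability is high, although the selection criterion is not specified in the text.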
Referring to FIG. 7, in one feasible embodiment of the present application, after step S3 has been performed, that is, after the intent category of the target search sentence has been determined, the search intent identification method may further include the following steps:
S501. Perform a vertical search for the target search sentence according to its intent category to obtain target web pages related to the target search sentence.
S502. Determine the search intent level of the intent category according to the click proportion corresponding to that intent category.
S503. Determine the display order of the target web pages of each intent category according to the search intent levels, and generate an Aladdin-style search result page.
Steps S501 to S503 implement web search and display based on the result of search intent recognition. As described above, one historical search sentence may belong to several preset special categories and be recorded in several special search lexicons; accordingly, step S3 may yield several intent categories for a target search sentence. Taking “天龙八部” (Tian Long Ba Bu) again as an example, when it is the target search sentence its intent categories are “video” and “fiction”. In step S501 a special search is performed for each of the two categories, yielding two classes of target web pages, and the order in which these two classes appear on the search result page must then be decided.
In this embodiment, the order of the target web pages is determined by the search intent levels of the different intent categories. Specifically, the search intent level may be determined from the click proportion calculated in step S102. For “天龙八部” (Tian Long Ba Bu), the click proportion is 0.35 for “fiction” and 0.41 for “video”, so the search intent level of “video” is higher than that of “fiction”: the historical search data indicate that a user searching for “天龙八部” is more likely to want “video” results than “fiction” results. On the search result page, the “video” target web pages are therefore placed before the “fiction” target web pages (see the search result page for “天龙八部” in FIG. 8). This better matches users' actual needs, lets users reach the page they want sooner, and improves the user experience. As FIG. 9 shows, the target search sentence “平凡的世界” (Ordinary World) also has the two intent categories “video” and “fiction”, but its click proportion for “fiction” is higher than for “video”, so the “fiction” target web pages are placed before the “video” ones. Comparing FIG. 8 with FIG. 9 shows that the target page ordering method of the embodiment shown in FIG. 7 can produce different orderings for different target search sentences that share the same intent categories, matching the actual search need behind each target search sentence and improving the user experience.
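The ordering rule can be sketched in a few lines. This is an illustrative sketch only: the function name `rank_intent_categories` is hypothetical, and the click proportions are the ones given for “天龙八部” in the text.

```python
def rank_intent_categories(click_ratios):
    """Order intent categories from highest to lowest click proportion."""
    return sorted(click_ratios, key=click_ratios.get, reverse=True)

# Click proportions for "天龙八部" from the text.
order = rank_intent_categories({"fiction": 0.35, "video": 0.41})
```

Here `order` places "video" before "fiction". A query whose "fiction" proportion is the larger one would get the opposite order from the same function, which is the behaviour contrasted in FIG. 8 and FIG. 9.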
It should be understood that when the click count of a historical search sentence exceeds a preset threshold (for example, 10), the search intent level of each intent category may be determined from the ratio of the click count of the corresponding preset special category to the total click count of the historical search sentence. When the click count of the historical search sentence is below the preset threshold, the search intent level may instead be determined from the ratio of the number of impressions of each corresponding preset special category on the search result page to the total number of impressions of the historical search sentence.
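The fallback from clicks to impressions can be sketched as follows. This is a sketch under assumptions: the function name `intent_scores`, the dictionary inputs and the sample counts are hypothetical, and the case where the click count exactly equals the threshold, which the text leaves open, is treated here as sparse.

```python
def intent_scores(clicks, impressions, min_clicks=10):
    """Per-category share of clicks, or of impressions when clicks are sparse.

    clicks / impressions: dicts mapping category -> count for one search sentence.
    """
    total_clicks = sum(clicks.values())
    # Assumption: a total equal to the threshold counts as sparse.
    counts = clicks if total_clicks > min_clicks else impressions
    total = sum(counts.values())
    if total == 0:
        return {}
    return {cat: n / total for cat, n in counts.items()}

# Enough clicks: the click shares are used.
dense = intent_scores({"video": 30, "fiction": 10}, {"video": 1, "fiction": 1})
# Too few clicks: the impression shares are used instead.
sparse = intent_scores({"video": 2, "fiction": 1}, {"video": 6, "fiction": 2})
```

The resulting scores can be fed to the same ordering step regardless of which signal produced them.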
In addition, in this embodiment the target web pages are displayed on the search result page in Aladdin form. The displayed content is not limited to the text of the traditional display form (mainly the page title, page content containing the target search sentence, and the web address); pictures and sub-links from the target web page can also be shown, such as the TV series poster, novel cover and per-episode links in FIG. 8 and FIG. 9. This presents the main content of the target web page more intuitively and helps users judge whether the page meets their search intent (for example, by distinguishing videos or novels that share a name but differ in content). It also lets users open the corresponding content directly through a sub-link, which reduces the number of steps (for example, a user can click the sub-link for episode 5 shown in FIG. 8 and watch episode 5 directly, without first opening the series' online viewing page and then switching to episode 5).
It should be added that, unlike the embodiment above, in which the order determined from the search intent levels is the final display order of the target web pages, in another feasible embodiment of the present application the order determined from the search intent levels may serve as an initial order. This initial order is then adjusted by click-based weight adjustment to obtain the display order of the target web pages on the search result page, giving a search result page that better matches the search needs of most users.
Corresponding to the above embodiments of the search intent identification method, an embodiment of the present application further provides a search intent identification device. Referring to FIG. 10, the device includes a sample acquisition unit 100, a model control unit 200 and an intent identification unit 300.
The sample acquisition unit 100 is configured to obtain a first set of historical search sentences within a first preset time and classify the historical search sentences in that set to obtain the special search lexicon corresponding to each preset special category.
The model control unit 200 is configured to establish a classification model according to the special search lexicons, obtain through the classification model the candidate search sentences corresponding to each preset special category, and record the candidate search sentences in the special search lexicon of the corresponding category.
The intent identification unit 300 is configured to determine, according to the historical search sentences and candidate search sentences in the special search lexicons, at least one preset special category as the intent category of the target search sentence entered by the user in real time.
Referring to FIG. 11, the intent identification unit 300 may include a sentence receiving subunit 301, a category determining subunit 302 and an intent identifying subunit 303.
The sentence receiving subunit 301 is configured to obtain the target search sentence;
The category determining subunit 302 is configured to determine at least one preset special category according to the historical search sentences and candidate search sentences in the special search lexicons, where each historical search sentence in the special search lexicons corresponds to at least one preset special category and the candidate search sentences are obtained through the classification model established according to the special search lexicons;
The intent identifying subunit 303 is configured to determine the intent category of the target search sentence according to the ratio of the click count of each determined preset special category to the total click count of the target search sentence.
The intent identifying subunit 303 is further configured to calculate the ratio of the click count of each determined preset special category to the total click count of the target search sentence, and to take each preset special category whose ratio exceeds the preset threshold as an intent category of the target search sentence.
As the above technical solutions show, the embodiments of the present application obtain a massive number of historical search sentences from the historical search records and classify them to obtain a preliminary special search lexicon for each preset special category. A classification model is then established according to the special search lexicons and used to mine candidate search sentences related to the historical search sentences, and the candidate search sentences supplement and complete the corresponding lexicons. The special search lexicon of a preset special category thus contains both the historical search sentences belonging to that category and the corresponding candidate search sentences. Compared with the manually configured whitelists, fuzzy query thresholds and pattern-matching keywords of the prior art, the search sentences in the special search lexicons are more accurate and more comprehensive and generalize better. Performing search intent recognition on the basis of these lexicons can therefore identify the intent category of a target search sentence more accurately and avoid misrecognition caused by manually specified rules that differ from users' actual criteria.
FIG. 12 is a schematic structural diagram of another search intent identification device provided by the present application. The device shown in FIG. 12 further includes an update unit 400.
The update unit 400 is configured to obtain a second set of historical search sentences within a second preset time and train the classification model according to that set to update the special search lexicons.
The update unit improves the timeliness of the special search lexicons, so that the search result page adjusts dynamically as time passes and as currently popular topics change, further improving the user experience.
Referring to FIG. 13, in one feasible implementation of the present application, the sample acquisition unit 100 may specifically include a clicked-page acquisition unit 101 and a clicked-page analysis unit 102.
The clicked-page acquisition unit 101 is configured to obtain the clicked-page set corresponding to each historical search sentence in the first set of historical search sentences within the first preset time;
The clicked-page analysis unit 102 is configured to, for each historical search sentence, determine the preset special category to which each clicked web page in its clicked-page set belongs, calculate the click proportion that the clicked web pages of each preset special category account for in the clicked-page set, take each preset special category whose click proportion exceeds the preset threshold as an intent category of that historical search sentence, and record each historical search sentence in the special search lexicon of each of its intent categories.
More specifically, to determine the preset special category to which each clicked web page in the clicked-page set of a historical search sentence belongs, the clicked-page analysis unit 102 may include a clicked-page classification module. The module is configured to: for each clicked web page in the clicked-page set, obtain its URL; determine the host name of the clicked web page from the URL; query the special site list corresponding to each preset special category; determine the preset special category whose special site list contains the host name; and take that category as the preset special category to which the clicked web page belongs.
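The host-name lookup performed by the clicked-page classification module, together with the click-proportion calculation of unit 102, can be sketched as below. The site lists, host names, URLs and function names are all hypothetical and chosen only for illustration. Pages whose host appears in no special site list are counted in the denominator but assigned no category, which is one reading of "proportion in the clicked-page set".

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical special site lists: category -> set of host names.
SPECIAL_SITES = {
    "video": {"video.example.com", "tv.example.org"},
    "fiction": {"novel.example.com"},
}

def category_of(url, special_sites=SPECIAL_SITES):
    """Return the preset special category whose site list contains the URL's host."""
    host = urlsplit(url).hostname
    for category, hosts in special_sites.items():
        if host in hosts:
            return category
    return None  # host is not on any special site list

def click_ratios(clicked_urls):
    """Share of the clicked-page set that falls into each preset special category."""
    counts = Counter(category_of(u) for u in clicked_urls)
    counts.pop(None, None)  # uncategorized pages get no category of their own
    total = len(clicked_urls)
    return {cat: n / total for cat, n in counts.items()}

clicked = [
    "http://video.example.com/play/1",
    "http://tv.example.org/show/2",
    "http://novel.example.com/book/3",
    "http://unknown.example.net/",
]
ratios_by_category = click_ratios(clicked)
```

The resulting proportions are what would be compared against the preset threshold to decide which special search lexicons receive the search sentence.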
As the clicked-page acquisition unit 101 and clicked-page analysis unit 102 show, this embodiment determines the intent category of each historical search sentence by analyzing the web pages clicked for it in the historical search records. The basis for classifying historical search sentences therefore also comes from the historical search records, which further avoids the influence of manually defined rules on the special search lexicons and safeguards the accuracy of search intent recognition based on those lexicons.
Still referring to FIG. 13, the model control unit 200 may specifically include a sample data acquisition unit 201, a feature vector generation unit 202 and a model establishment unit 203.
The sample data acquisition unit 201 is configured to obtain, for each historical search sentence in the special search lexicons, the URL and page title of each of its clicked web pages; the feature vector generation unit 202 is configured to segment each historical search sentence in the special search lexicons and the corresponding page titles and URLs, and to represent each token obtained from the historical search sentences and page titles, and each character string obtained from the URLs, as a feature vector in the feature space;
The model establishment unit 203 is configured to establish a classification model based on the maximum entropy model from the feature vectors, using the click proportion of the preset special category to which the clicked web page belongs as the weight of the related feature vectors.
By establishing a classification model based on the maximum entropy model, this embodiment can better handle massive historical data and ensure the generalization of the special search lexicons, thereby improving the accuracy of intent recognition.
Still referring to FIG. 13, building on the clicked-page analysis unit 102, the search intent identification device may further include a search display unit 500, which specifically includes a vertical search unit 501, a level determination unit 502 and a sorting unit 503.
The vertical search unit 501 is configured to perform a vertical search for each intent category of the target search sentence to obtain target web pages related to the target search sentence;
The level determination unit 502 is configured to determine the search intent level of each intent category according to its click proportion;
The sorting unit 503 is configured to determine the display order of the target web pages according to the search intent levels and generate an Aladdin-style search result page.
The target page ordering method of this embodiment can produce different orderings for different target search sentences that share the same intent categories, matching the actual search need behind each target search sentence and improving the user experience. In addition, the target web pages are displayed on the search result page in Aladdin form, which presents the main content of each target web page more intuitively, helps users judge whether the page meets their search intent, and lets users open the corresponding content directly through sub-links, reducing the number of steps.
As for the device in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the method embodiments and is not elaborated here.
Those skilled in the art will clearly understand that the techniques in the embodiments of the present invention may be implemented by software plus the necessary general-purpose hardware, including general-purpose integrated circuits, general-purpose CPUs, general-purpose memories and general-purpose components. They may of course also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories and dedicated components, but in many cases the former is the better implementation. On this understanding, the technical solutions of the embodiments of the present invention, in essence or in the part that contributes to the prior art, may be embodied as a software product. The computer software product includes a computer-readable medium carrying non-volatile, processor-executable program code, such as read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments of the present invention or in parts of those embodiments.
The embodiments in this specification are described progressively; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device and system embodiments are substantially similar to the method embodiments, they are described briefly; for the relevant details, see the description of the method embodiments.
The embodiments of the present invention described above do not limit the scope of protection of the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (18)

  1. A search intent identification method, characterized in that it comprises:
    obtaining a first set of historical search sentences within a first preset time, and classifying the historical search sentences in the first set to obtain a special search lexicon corresponding to each preset special category; establishing a classification model according to the special search lexicons, obtaining through the classification model candidate search sentences corresponding to each preset special category, and recording the candidate search sentences in the special search lexicon of the corresponding category;
    obtaining a target search sentence, and determining, according to the historical search sentences and candidate search sentences in the special search lexicons, at least one preset special category as the intent category of the target search sentence.
  2. The method according to claim 1, characterized in that, before determining, according to the historical search sentences and candidate search sentences in the special search lexicons, at least one preset special category as the intent category of the target search sentence, the method further comprises:
    obtaining a second set of historical search sentences within a second preset time, and training the classification model according to the second set to update the special search lexicons.
  3. The method according to claim 1 or 2, characterized in that classifying the historical search sentences in the first set of historical search sentences to obtain the special search lexicon corresponding to each preset special category comprises:
    obtaining the clicked-page set corresponding to each historical search sentence in the first set of historical search sentences within the first preset time;
    for each historical search sentence, determining the preset special category to which each clicked web page in its clicked-page set belongs, calculating the click proportion that the clicked web pages of each preset special category account for in the clicked-page set, taking each preset special category whose click proportion is greater than a preset threshold as an intent category of the historical search sentence, and recording each historical search sentence in the special search lexicon corresponding to each of its intent categories.
  4. The method according to claim 3, characterized in that determining the preset special category to which each clicked web page in the clicked-page set of a historical search sentence belongs comprises:
    for each clicked web page in the clicked-page set, obtaining its URL;
    determining the host name of the corresponding clicked web page according to the URL;
    querying the special site list corresponding to each preset special category, determining the preset special category corresponding to the special site list that contains the host name, and taking it as the preset special category to which the corresponding clicked web page belongs.
  5. The method according to claim 3, characterized in that establishing a classification model according to the special search lexicons comprises:
    for each historical search sentence in the special search lexicons, obtaining the URL and page title of each of its clicked web pages;
    segmenting each historical search sentence in the special search lexicons and the corresponding page titles and URLs;
    representing each token obtained by segmenting the historical search sentences and page titles, and each character string obtained by segmenting the URLs, as a feature vector in the feature space;
    establishing a classification model based on the maximum entropy model according to the feature vectors, with the click proportion of the preset special category to which the clicked web page belongs as the weight of the related feature vectors.
  6. The method according to claim 3, characterized in that, after the intent category of the target search sentence is determined, the method further comprises:
    performing a vertical search for the target search sentence according to its intent category to obtain target web pages related to the target search sentence;
    determining the search intent level of the intent category according to the click proportion corresponding to the intent category;
    determining the display order of the target web pages corresponding to each intent category according to the search intent levels, and generating an Aladdin-style search result page.
  7. A search intent identification device, characterized in that it comprises a sample acquisition unit, a model control unit and an intent identification unit;
    the sample acquisition unit is configured to obtain a first set of historical search sentences within a first preset time, and classify the historical search sentences in the first set to obtain a special search lexicon corresponding to each preset special category;
    the model control unit is configured to establish a classification model according to the special search lexicons, obtain through the classification model candidate search sentences corresponding to each preset special category, and record the candidate search sentences in the special search lexicon of the corresponding category;
    the intent identification unit is configured to obtain a target search sentence, and determine, according to the historical search sentences and candidate search sentences in the special search lexicons, at least one preset special category as the intent category of the target search sentence.
  8. The device according to claim 7, further comprising: an updating unit, configured to obtain a second set of historical search sentences within a second preset time and to train the classification model according to the second set, so as to update the special search lexicons.
  9. The device according to claim 7 or 8, wherein the sample acquisition unit comprises:
    a clicked-webpage acquisition unit, configured to obtain the clicked-webpage combination corresponding to each historical search sentence in the first set of historical search sentences within the first preset time;
    a clicked-webpage analysis unit, configured to, for each historical search sentence: determine the preset special category to which each clicked webpage in its clicked-webpage combination belongs; calculate the click ratio that the clicked webpages of each preset special category account for in the clicked-webpage combination; take each preset special category whose click ratio is greater than a preset threshold as an intent category of that historical search sentence; and record each historical search sentence in the special search lexicon corresponding to its intent category.
  10. The device according to claim 9, wherein the clicked-webpage analysis unit comprises a clicked-webpage classification module;
    the clicked-webpage classification module is configured to: for each clicked webpage in the clicked-webpage combination, obtain its URL, determine from the URL the host name of that clicked webpage, query the special site list corresponding to each preset special category, determine the preset special category whose special site list contains the host name, and take that category as the preset special category to which the clicked webpage belongs.
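The host-name lookup and the click-ratio threshold described in claims 9 and 10 might be sketched as below. The site lists, host names, the `0.3` threshold, and the function names are all hypothetical; the claims only require some special site list per preset category and some preset threshold.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical special-site lists: preset special category -> host names.
SPECIAL_SITES = {
    "novel": {"www.novel-site.example", "read.example"},
    "video": {"v.example"},
}


def category_of(url, special_sites=SPECIAL_SITES):
    """Return the category whose site list contains the URL's host name, or None."""
    host = urlsplit(url).hostname
    for category, hosts in special_sites.items():
        if host in hosts:
            return category
    return None


def intent_categories(clicked_urls, threshold=0.3, special_sites=SPECIAL_SITES):
    """Return category -> click ratio for categories whose ratio exceeds the threshold."""
    if not clicked_urls:
        return {}
    counts = Counter(category_of(u, special_sites) for u in clicked_urls)
    total = len(clicked_urls)
    return {c: n / total for c, n in counts.items()
            if c is not None and n / total > threshold}


def build_lexicons(click_log, threshold=0.3):
    """click_log: search sentence -> clicked URLs. Returns category -> {sentence: ratio}."""
    lexicons = {}
    for query, urls in click_log.items():
        for category, ratio in intent_categories(urls, threshold).items():
            lexicons.setdefault(category, {})[query] = ratio
    return lexicons


log = {"dragon": ["http://read.example/1", "http://read.example/2",
                  "http://v.example/x", "http://other.example/"]}
# Two of four clicks are novel pages (0.5); video has 0.25, below the threshold.
# build_lexicons(log) == {"novel": {"dragon": 0.5}}
```

Storing the ratio alongside each sentence keeps it available later, for example as a sample weight during model training or as the basis of an intent level.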
  11. The device according to claim 9, wherein the model control unit comprises:
    a sample data acquisition unit, configured to obtain, for each historical search sentence in the special search lexicon, the URL and webpage title of its clicked webpages;
    a feature vector generation unit, configured to segment each historical search sentence in the special search lexicon, together with its corresponding webpage title and URL, and to represent each token obtained by segmenting the historical search sentence and the webpage title, and each character string obtained by segmenting the URL, as a feature vector in a feature space;
    a model establishing unit, configured to establish a classification model based on a maximum entropy model from the feature vectors, using the click ratio corresponding to the preset special category to which the clicked webpage belongs as the weight of the associated feature vector.
  12. The device according to claim 9, further comprising a search display unit;
    the search display unit comprises: a vertical search unit, configured to perform a vertical search on the target search sentence according to its intent category, to obtain target webpages related to the target search sentence;
    a level determining unit, configured to determine a search intent level of the intent category according to the click ratio corresponding to the intent category;
    a sorting unit, configured to determine, according to the search intent level, the display order of the target webpages corresponding to each intent category, and to generate a search result page in Aladdin form.
  13. A server, comprising:
    a memory;
    a processor; and
    a search intent identification device, installed in the memory and comprising one or more software function modules executed by the processor, the device comprising:
    a sample acquisition unit, a model control unit, and an intent identification unit;
    the sample acquisition unit is configured to obtain a first set of historical search sentences within a first preset time, and to classify the historical search sentences in the first set to obtain a special search lexicon corresponding to each preset special category;
    the model control unit is configured to establish a classification model according to the special search lexicons, obtain through the classification model candidate search sentences corresponding to each preset special category, and record the candidate search sentences in the special search lexicon of the corresponding category;
    the intent identification unit is configured to obtain a target search sentence and to determine, according to the historical search sentences and candidate search sentences in the special search lexicons, at least one preset special category as the intent category of the target search sentence.
  14. A computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to perform the following method:
    obtaining a first set of historical search sentences within a first preset time, and classifying the historical search sentences in the first set to obtain a special search lexicon corresponding to each preset special category;
    establishing a classification model according to the special search lexicons, obtaining through the classification model candidate search sentences corresponding to each preset special category, and recording the candidate search sentences in the special search lexicon of the corresponding category;
    obtaining a target search sentence, and determining, according to the historical search sentences and candidate search sentences in the special search lexicons, at least one preset special category as the intent category of the target search sentence.
  15. A search intent identification method, comprising:
    obtaining a target search sentence;
    determining at least one preset special category according to historical search sentences and candidate search sentences in a special search lexicon, wherein, in the special search lexicon, each historical search sentence corresponds to at least one preset special category, and the candidate search sentences are obtained through a classification model established according to the special search lexicon;
    determining the intent category of the target search sentence according to the ratio of the click volume of the determined at least one preset special category to the total click volume of the target search sentence.
  16. The method according to claim 15, wherein determining the intent category of the target search sentence according to the ratio of the click volume of the determined at least one preset special category to the total click volume of the target search sentence comprises:
    calculating the ratio of the click volume of each determined preset special category to the total click volume of the target search sentence;
    taking each preset special category whose ratio is greater than a preset threshold as an intent category of the target search sentence.
  17. A search intent identification device, comprising an intent identification unit, wherein the intent identification unit comprises: a sentence receiving subunit, a category determining subunit, and an intent identification subunit;
    the sentence receiving subunit is configured to obtain a target search sentence;
    the category determining subunit is configured to determine at least one preset special category according to historical search sentences and candidate search sentences in a special search lexicon, wherein, in the special search lexicon, each historical search sentence corresponds to at least one preset special category, and the candidate search sentences are obtained through a classification model established according to the special search lexicon;
    the intent identification subunit is configured to determine the intent category of the target search sentence according to the ratio of the click volume of the determined at least one preset special category to the total click volume of the target search sentence.
  18. The device according to claim 17, wherein the intent identification subunit is further configured to:
    calculate the ratio of the click volume of each determined preset special category to the total click volume of the target search sentence;
    take each preset special category whose ratio is greater than a preset threshold as an intent category of the target search sentence.
PCT/CN2016/085338 2015-08-07 2016-06-08 Search intention identification method and device WO2017024884A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510486646.X 2015-08-07
CN201510486646.XA CN105095187A (en) 2015-08-07 2015-08-07 Search intention identification method and device

Publications (1)

Publication Number Publication Date
WO2017024884A1 (en) 2017-02-16

Family

ID=54575659

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/085338 WO2017024884A1 (en) 2015-08-07 2016-06-08 Search intention identification method and device

Country Status (2)

Country Link
CN (1) CN105095187A (en)
WO (1) WO2017024884A1 (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509502A (en) * 2017-02-28 2018-09-07 灯塔人工智能公司 The speech interface of monitoring system for view-based access control model
CN109697282A (en) * 2017-10-20 2019-04-30 阿里巴巴集团控股有限公司 A kind of the user's intension recognizing method and device of sentence
CN110069709A (en) * 2019-04-10 2019-07-30 腾讯科技(深圳)有限公司 Intension recognizing method, device, computer-readable medium and electronic equipment
CN110503143A (en) * 2019-08-14 2019-11-26 平安科技(深圳)有限公司 Research on threshold selection, equipment, storage medium and device based on intention assessment
CN110597988A (en) * 2019-08-28 2019-12-20 腾讯科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN111125523A (en) * 2019-12-20 2020-05-08 华为技术有限公司 Searching method, searching device, terminal equipment and storage medium
CN111368161A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Search intention recognition method and intention recognition model training method and device
CN111563161A (en) * 2020-04-26 2020-08-21 深圳市优必选科技股份有限公司 Sentence recognition method, sentence recognition device and intelligent equipment
CN111581388A (en) * 2020-05-11 2020-08-25 北京金山安全软件有限公司 User intention identification method and device and electronic equipment
CN111666448A (en) * 2020-04-21 2020-09-15 北京奇艺世纪科技有限公司 Search method, search device, electronic equipment and computer-readable storage medium
CN111737606A (en) * 2019-03-25 2020-10-02 阿里巴巴集团控股有限公司 Search result display method, device and equipment and readable storage medium
CN111859100A (en) * 2019-12-26 2020-10-30 北京嘀嘀无限科技发展有限公司 Retrieval intention transfer identification method and device
CN111897994A (en) * 2020-07-15 2020-11-06 腾讯音乐娱乐科技(深圳)有限公司 Search method, search device, server and computer-readable storage medium
CN111966948A (en) * 2020-09-25 2020-11-20 北京百度网讯科技有限公司 Information delivery method, device, equipment and storage medium
CN112380421A (en) * 2020-11-11 2021-02-19 北京希瑞亚斯科技有限公司 Resume searching method and device, electronic equipment and computer storage medium
CN112749328A (en) * 2020-04-21 2021-05-04 腾讯科技(深圳)有限公司 Searching method and device and computer equipment
CN112749313A (en) * 2020-08-04 2021-05-04 腾讯科技(深圳)有限公司 Label labeling method and device, computer equipment and storage medium
CN113111249A (en) * 2021-03-16 2021-07-13 百度在线网络技术(北京)有限公司 Search processing method and device, electronic equipment and storage medium
CN113127602A (en) * 2021-04-30 2021-07-16 竹间智能科技(上海)有限公司 Intention identification method and device

Families Citing this family (34)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device
CN106919588A (en) * 2015-12-24 2017-07-04 北京奇虎科技有限公司 A kind of application program search system and method
CN105956011B (en) * 2016-04-21 2020-01-21 百度在线网络技术(北京)有限公司 Searching method and device
CN106294674A (en) * 2016-08-02 2017-01-04 郑州悉知信息科技股份有限公司 A kind of information detecting method and device
CN106776853A (en) * 2016-11-28 2017-05-31 广州市动景计算机科技有限公司 Searching method, device, client device and graph user interface system
CN106776981B (en) * 2016-12-06 2020-12-15 广州同构科技有限公司 Intelligent retrieval method based on empirical knowledge
CN106599278B (en) * 2016-12-23 2020-06-12 北京奇虎科技有限公司 Application search intention identification method and device
CN106599304B (en) * 2016-12-29 2020-03-24 中南大学 Modular user retrieval intention modeling method for small and medium-sized websites
CN106951503B (en) * 2017-03-16 2020-06-23 百度在线网络技术(北京)有限公司 Information providing method, device, equipment and storage medium
CN107153672A (en) * 2017-03-22 2017-09-12 中国科学院自动化研究所 User mutual intension recognizing method and system based on Speech Act Theory
CN107291864B (en) * 2017-06-12 2020-04-07 北京三快在线科技有限公司 Searching method and device and electronic equipment
CN107256267B (en) 2017-06-19 2020-07-24 北京百度网讯科技有限公司 Query method and device
CN107609094B (en) * 2017-09-08 2020-12-04 北京百度网讯科技有限公司 Data disambiguation method and device and computer equipment
CN110147485A (en) * 2017-09-22 2019-08-20 北京京东尚科信息技术有限公司 A kind of method and apparatus for the attribute identifying search term
CN109660580B (en) * 2017-10-11 2021-06-22 苏州跃盟信息科技有限公司 Information pushing method and device
CN109831472B (en) * 2017-11-23 2021-04-06 苏州跃盟信息科技有限公司 Information pushing and information displaying method and system
CN107832468B (en) * 2017-11-29 2019-05-10 百度在线网络技术(北京)有限公司 Demand recognition methods and device
CN108052613B (en) * 2017-12-14 2021-12-31 北京百度网讯科技有限公司 Method and device for generating page
CN108268617B (en) * 2018-01-05 2021-10-29 创新先进技术有限公司 User intention determining method and device
US10831797B2 (en) * 2018-03-23 2020-11-10 International Business Machines Corporation Query recognition resiliency determination in virtual agent systems
CN110737823B (en) * 2018-07-03 2022-06-24 百度在线网络技术(北京)有限公司 Access intention mining method and device
CN110968686A (en) * 2018-09-28 2020-04-07 百度在线网络技术(北京)有限公司 Intention recognition method, device, equipment and computer readable medium
CN111177521A (en) * 2018-10-24 2020-05-19 北京搜狗科技发展有限公司 Method and device for determining query term classification model
CN109508376A (en) * 2018-11-23 2019-03-22 四川长虹电器股份有限公司 It can online the error correction intension recognizing method and device that update
CN111666416B (en) * 2019-03-08 2023-06-16 百度在线网络技术(北京)有限公司 Method and device for generating semantic matching model
CN109947924B (en) * 2019-03-21 2021-08-31 百度在线网络技术(北京)有限公司 Dialogue system training data construction method and device, electronic equipment and storage medium
CN110245357B (en) * 2019-06-26 2023-05-02 北京百度网讯科技有限公司 Main entity identification method and device
CN112507181B (en) * 2019-09-16 2023-09-29 百度在线网络技术(北京)有限公司 Search request classification method, device, electronic equipment and storage medium
CN115146123A (en) * 2019-09-29 2022-10-04 北京百度网讯科技有限公司 Classification method and device
CN111737571B (en) * 2020-06-11 2024-01-30 北京字节跳动网络技术有限公司 Searching method and device and electronic equipment
CN112231554B (en) * 2020-10-10 2023-10-31 腾讯科技(深圳)有限公司 Search recommended word generation method and device, storage medium and computer equipment
CN112560425B (en) * 2020-12-24 2024-04-09 北京百度网讯科技有限公司 Template generation method and device, electronic equipment and storage medium
CN113343028B (en) * 2021-05-31 2022-09-02 北京达佳互联信息技术有限公司 Method and device for training intention determination model
CN113707300A (en) * 2021-08-30 2021-11-26 康键信息技术(深圳)有限公司 Search intention identification method, device, equipment and medium based on artificial intelligence

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456018A (en) * 2010-10-18 2012-05-16 腾讯科技(深圳)有限公司 Interactive search method and device
CN102955798A (en) * 2011-08-25 2013-03-06 腾讯科技(深圳)有限公司 Search engine based search method and search server
CN103235784A (en) * 2013-03-28 2013-08-07 百度在线网络技术(北京)有限公司 Method and equipment used for obtaining search results
CN103838754A (en) * 2012-11-23 2014-06-04 腾讯科技(深圳)有限公司 Information searching device and method
CN105095187A (en) * 2015-08-07 2015-11-25 广州神马移动信息科技有限公司 Search intention identification method and device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6981040B1 (en) * 1999-12-28 2005-12-27 Utopy, Inc. Automatic, personalized online information and product services
CN101551806B (en) * 2008-04-03 2012-04-18 北京搜狗科技发展有限公司 Personalized website navigation method and system
CN104050240A (en) * 2014-05-26 2014-09-17 北京奇虎科技有限公司 Method and device for determining categorical attribute of search query word

Cited By (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108509502A (en) * 2017-02-28 2018-09-07 灯塔人工智能公司 The speech interface of monitoring system for view-based access control model
CN109697282B (en) * 2017-10-20 2023-06-06 阿里巴巴集团控股有限公司 Sentence user intention recognition method and device
CN109697282A (en) * 2017-10-20 2019-04-30 阿里巴巴集团控股有限公司 A kind of the user's intension recognizing method and device of sentence
CN111368161B (en) * 2018-12-26 2024-01-09 北京搜狗科技发展有限公司 Search intention recognition method, intention recognition model training method and device
CN111368161A (en) * 2018-12-26 2020-07-03 北京搜狗科技发展有限公司 Search intention recognition method and intention recognition model training method and device
CN111737606A (en) * 2019-03-25 2020-10-02 阿里巴巴集团控股有限公司 Search result display method, device and equipment and readable storage medium
CN111737606B (en) * 2019-03-25 2024-05-14 阿里巴巴集团控股有限公司 Method, device and equipment for showing search results and readable storage medium
CN110069709B (en) * 2019-04-10 2023-10-20 腾讯科技(深圳)有限公司 Intention recognition method, device, computer readable medium and electronic equipment
CN110069709A (en) * 2019-04-10 2019-07-30 腾讯科技(深圳)有限公司 Intension recognizing method, device, computer-readable medium and electronic equipment
CN110503143A (en) * 2019-08-14 2019-11-26 平安科技(深圳)有限公司 Research on threshold selection, equipment, storage medium and device based on intention assessment
CN110503143B (en) * 2019-08-14 2024-03-19 平安科技(深圳)有限公司 Threshold selection method, device, storage medium and device based on intention recognition
CN110597988A (en) * 2019-08-28 2019-12-20 腾讯科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN110597988B (en) * 2019-08-28 2024-03-19 腾讯科技(深圳)有限公司 Text classification method, device, equipment and storage medium
CN111125523A (en) * 2019-12-20 2020-05-08 华为技术有限公司 Searching method, searching device, terminal equipment and storage medium
CN111125523B (en) * 2019-12-20 2024-03-01 华为技术有限公司 Searching method, searching device, terminal equipment and storage medium
CN111859100A (en) * 2019-12-26 2020-10-30 北京嘀嘀无限科技发展有限公司 Retrieval intention transfer identification method and device
CN111859100B (en) * 2019-12-26 2023-11-03 北京嘀嘀无限科技发展有限公司 Retrieval intention transferring and identifying method and device
CN112749328A (en) * 2020-04-21 2021-05-04 腾讯科技(深圳)有限公司 Searching method and device and computer equipment
CN111666448B (en) * 2020-04-21 2024-01-26 北京奇艺世纪科技有限公司 Search method, search device, electronic equipment and computer readable storage medium
CN111666448A (en) * 2020-04-21 2020-09-15 北京奇艺世纪科技有限公司 Search method, search device, electronic equipment and computer-readable storage medium
CN112749328B (en) * 2020-04-21 2024-01-05 腾讯科技(深圳)有限公司 Searching method, searching device and computer equipment
CN111563161B (en) * 2020-04-26 2023-05-23 深圳市优必选科技股份有限公司 Statement identification method, statement identification device and intelligent equipment
CN111563161A (en) * 2020-04-26 2020-08-21 深圳市优必选科技股份有限公司 Sentence recognition method, sentence recognition device and intelligent equipment
CN111581388A (en) * 2020-05-11 2020-08-25 北京金山安全软件有限公司 User intention identification method and device and electronic equipment
CN111581388B (en) * 2020-05-11 2023-09-19 北京金山安全软件有限公司 User intention recognition method and device and electronic equipment
CN111897994A (en) * 2020-07-15 2020-11-06 腾讯音乐娱乐科技(深圳)有限公司 Search method, search device, server and computer-readable storage medium
CN112749313A (en) * 2020-08-04 2021-05-04 腾讯科技(深圳)有限公司 Label labeling method and device, computer equipment and storage medium
CN111966948B (en) * 2020-09-25 2023-08-01 北京百度网讯科技有限公司 Information delivery method, device, equipment and storage medium
CN111966948A (en) * 2020-09-25 2020-11-20 北京百度网讯科技有限公司 Information delivery method, device, equipment and storage medium
CN112380421A (en) * 2020-11-11 2021-02-19 北京希瑞亚斯科技有限公司 Resume searching method and device, electronic equipment and computer storage medium
CN113111249A (en) * 2021-03-16 2021-07-13 百度在线网络技术(北京)有限公司 Search processing method and device, electronic equipment and storage medium
CN113127602B (en) * 2021-04-30 2023-05-26 竹间智能科技(上海)有限公司 Intention recognition method and device
CN113127602A (en) * 2021-04-30 2021-07-16 竹间智能科技(上海)有限公司 Intention identification method and device

Also Published As

Publication number Publication date
CN105095187A (en) 2015-11-25

Similar Documents

Publication Publication Date Title
WO2017024884A1 (en) Search intention identification method and device
US10409874B2 (en) Search based on combining user relationship data
US10733197B2 (en) Method and apparatus for providing information based on artificial intelligence
JP6554685B2 (en) Method and apparatus for providing search results
CN110162695B (en) Information pushing method and equipment
WO2018072071A1 (en) Knowledge map building system and method
WO2021098648A1 (en) Text recommendation method, apparatus and device, and medium
US20140019460A1 (en) Targeted search suggestions
CN109918555B (en) Method, apparatus, device and medium for providing search suggestions
US20230147941A1 (en) Method, apparatus and device used to search for content
CN110147494B (en) Information searching method and device, storage medium and electronic equipment
WO2018176913A1 (en) Search method and apparatus, and non-temporary computer-readable storage medium
CN106462644B (en) Identifying preferred result pages from multiple result page identifications
US10127322B2 (en) Efficient retrieval of fresh internet content
JP2017045196A (en) Ambiguity evaluation device, ambiguity evaluation method, and ambiguity evaluation program
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
US8799314B2 (en) System and method for managing information map
WO2014059851A1 (en) Search server and search method
US20210334314A1 (en) Sibling search queries
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN113761084B (en) POI search ranking model training method, ranking device, method and medium
RU2589856C2 (en) Method of processing target message, method of processing new target message and server (versions)
CN116508004A (en) Method for point of interest information management, electronic device, and storage medium
WO2021051587A1 (en) Search result sorting method and apparatus based on semantic recognition, electronic device, and storage medium
CN113761125A (en) Dynamic summary determination method and device, computing equipment and computer storage medium

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16834508; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 16834508; Country of ref document: EP; Kind code of ref document: A1)