WO2021226840A1 - Hot news intention recognition method, apparatus, device, and readable storage medium - Google Patents

Hot news intention recognition method, apparatus, device, and readable storage medium

Info

Publication number
WO2021226840A1
Authority
WO
WIPO (PCT)
Prior art keywords
hot news
query
news
keyword
hot
Prior art date
Application number
PCT/CN2020/089839
Other languages
English (en)
French (fr)
Inventor
刘晓聪
曾冠荣
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司 and Oppo广东移动通信有限公司
Priority to PCT/CN2020/089839 priority Critical patent/WO2021226840A1/zh
Priority to CN202080100566.5A priority patent/CN115516447A/zh
Publication of WO2021226840A1 publication Critical patent/WO2021226840A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines

Definitions

  • the present disclosure relates to the field of information technology, and in particular to a hot news intention recognition method, apparatus, device, and readable storage medium.
  • A search engine is a retrieval technology that uses specific strategies to retrieve specified information from the Internet and feeds it back to users based on user needs and certain algorithms. Every search begins with the input of a query, and how the search engine understands the query entered by the user directly affects the final search results. The effectiveness of intent recognition in the search scenario is therefore a decisive factor in the quality of a search engine.
  • Hot news refers to news that has occurred recently and has high real-time characteristics.
  • the search engine needs to identify the hot news query intention from the query sentence.
  • the present disclosure provides a hot news intention recognition method, device, equipment and readable storage medium.
  • a hot news query intention recognition method, including: obtaining a query sentence sent by a client; performing word segmentation on the query sentence to obtain at least one query word; querying whether any of the at least one query word matches a keyword in a hot news keyword set; when a query word among the at least one query word matches a keyword in the hot news keyword set, obtaining at least one hot news item corresponding to each matched keyword; and identifying whether the query intention of the query sentence is a hot news query intention according to the determined correlation value between the query sentence and each hot news item.
  • a method for determining a hot news keyword set, including: obtaining a plurality of hot news items; performing word segmentation on each hot news item to obtain at least one news word of each hot news item; determining the keywords of each hot news item according to its at least one news word; and determining the hot news keyword set according to each hot news item and the keywords of each hot news item.
  • a hot news query intention recognition device, including: a sentence acquisition module for acquiring a query sentence sent by a client; a word segmentation processing module for performing word segmentation on the query sentence to obtain at least one query word; a keyword matching module for querying whether any of the at least one query word matches a keyword in the hot news keyword set; a news recall module for obtaining, when a query word among the at least one query word matches a keyword in the hot news keyword set, at least one hot news item corresponding to each matched keyword; and a module for identifying, according to the determined correlation value between the query sentence and each hot news item, whether the query intention of the query sentence is a hot news query intention.
  • a device for determining a hot news keyword set, including: a news acquisition module for acquiring a plurality of hot news items; a word segmentation processing module for performing word segmentation on each hot news item to obtain at least one news word of each hot news item; a keyword extraction module for determining the keywords of each hot news item according to its at least one news word; and a set determining module for determining the hot news keyword set according to each hot news item and the keywords of each hot news item.
  • an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the above hot news query intention recognition method by executing the executable instructions, or the processor is configured to perform the above method for determining a hot news keyword set by executing the executable instructions.
  • a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the above hot news query intention recognition method, or the computer program, when executed by a processor, implements the above method for determining a hot news keyword set.
  • the query sentence is first segmented to obtain at least one query word; each query word is then matched against the keywords in a pre-obtained hot news keyword set to determine the pre-recalled hot news; finally, whether the query sentence has a hot news query intention is identified according to the correlation between the query sentence and the pre-recalled hot news.
  • First, compared with related-art methods that recognize query intention by training a text classification model, this method does not need a large number of intent-labeled samples to train a classification model of a certain accuracy, and it can identify in real time whether the query sentence has the intention of searching for hot news; second, pre-recalling hot news greatly reduces the amount of computation needed to correlate the user query sentence with a massive volume of hot news, so the intent classification can be completed quickly.
  • the average delay of the hot news query intention recognition service is less than 3 ms, so service concurrency can be increased and the number of deployed search servers can be reduced;
  • this method can also effectively improve the accuracy and recall rate of hot news query intention recognition;
  • this method also allows effective intervention in online hot news intention recognition by configuring the hot news keyword set in advance.
  • Fig. 1 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the present disclosure.
  • Fig. 2 shows a flowchart of a hot news query intention recognition method in an embodiment of the present disclosure.
  • Fig. 3 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure.
  • Fig. 4 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure.
  • Fig. 5 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure.
  • Fig. 6 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure.
  • Fig. 7 shows a flowchart of a method for determining a set of hot news keywords in an embodiment of the present disclosure.
  • Fig. 8 shows a flowchart of another method for determining a set of hot news keywords in an embodiment of the present disclosure.
  • Fig. 9 shows a flowchart of another method for determining a set of hot news keywords in an embodiment of the present disclosure.
  • Fig. 10 shows a schematic diagram of a hot news query intention recognition device in an embodiment of the present disclosure.
  • Fig. 11 shows a schematic diagram of an apparatus for determining a set of hot news keywords in an embodiment of the present disclosure.
  • Fig. 12 shows a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
  • Fig. 13 shows a schematic diagram of a computer-readable storage medium in an embodiment of the present disclosure.
  • plural means at least two, such as two, three, etc., unless otherwise specifically defined.
  • “And/or” describes the association relationship of the associated objects, indicating that there can be three relationships, for example, A and/or B, which can indicate the existence of A alone, B alone, and both A and B.
  • the symbol “/” generally indicates that the associated objects before and after are in an “or” relationship.
  • the terms “first” and “second” are only used for descriptive purposes, and cannot be understood as indicating or implying relative importance or implicitly indicating the number of indicated technical features. Thus, the features defined with “first” and “second” may explicitly or implicitly include one or more of these features.
  • search engines need to be able to provide information search quickly; they also need to be sensitive to outdated information, so as to avoid recalling news or events that are no longer popular; and because users describe news events differently, hot news query intention recognition also needs to be robust to variations in the query sentence.
  • the recognition of hot news query intention also faces the following technical difficulties: high accuracy and high recall are both required, with accuracy taking priority; high computing performance and fast response are required; and there is heavy demand for short query sentences, especially those with fewer than 8 words (terms) after word segmentation.
  • a common solution for intent recognition in search is to classify text.
  • the offline part uses user query sentences annotated with intent to train a classification model that determines whether a query has a given type of intent;
  • the online part uses the classification model trained in the offline part to make predictions, and uses the prediction results to judge user intentions in real time.
  • This method has high accuracy and reliable generalization ability, and as an end-to-end method it is also relatively controllable.
  • the method can be optimized along dimensions such as samples and features.
  • because it is an abstract model, it is less affected by specific issues, so it achieves excellent results in a large number of intent classification scenarios.
  • the more commonly used text classification methods are as follows:
  • Bag-of-words models: methods such as one-hot encoding and TF-IDF (Term Frequency-Inverse Document Frequency) transform vocabulary into a vector form suitable for calculation, and text classification is then realized with methods such as support vector machines and naive Bayes.
  • Pre-trained word embeddings: a word embedding model that expresses semantics better is used to convert the input vocabulary into vectors, and deep learning components such as fully connected layers and convolutional layers then perform automatic feature extraction and transformation to obtain the prediction result.
  • hot news is highly time-sensitive.
  • the user's search demand for a piece of hot news explodes within a short time and ends within a short time.
  • the same query sentence needs to change from having no hot news intention to having a hot news intention within a short time.
  • Text classification, which relies on lengthy training to achieve good results, is therefore not suitable; under real-time news requirements, user query sentence data is scarce and carries no intent labels, so there is not enough data for model training; and whether a query sentence has a hot news intention has little to do with its semantic expression, so text classification, which is essentially semantic analysis, cannot reflect whether the query sentence has a hot news intention.
  • this disclosure proposes a hot news query intention recognition method.
  • a pre-recall mechanism is used to quickly determine the specific hot news items that match the user's query sentence, and the relevance between the query sentence and each of those hot news items is then calculated to identify whether the query sentence has a hot news query intention.
  • This method makes the search system sensitive to the birth and death of hot news, so that it can respond accurately and quickly when identifying whether the query sentence has a hot news query intention.
  • Artificial Intelligence (AI) uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the solutions provided by the embodiments of the present disclosure mainly involve the natural language processing and machine learning technologies of artificial intelligence.
  • Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question answering robots, knowledge graphs, and other technologies.
  • Machine Learning (ML) is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and other technologies.
  • artificial intelligence technology has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, drones, robotics, intelligent medical care, and intelligent customer service. As the technology develops, artificial intelligence is expected to be applied in more fields and to deliver increasing value.
  • Fig. 1 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the present disclosure.
  • the system includes: a number of terminals 120 and a server cluster 140.
  • the terminal 120 may be a mobile terminal such as a mobile phone, a game console, a tablet computer, an e-book reader, smart glasses, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a smart home device, an AR (Augmented Reality) device, or a VR (Virtual Reality) device; or the terminal 120 may be a personal computer (PC), such as a laptop computer or a desktop computer.
  • a client application program may be installed in the terminal 120, so that the user can search through the client application program, including searching for hot news.
  • the terminal 120 and the server cluster 140 are connected through a communication network.
  • the communication network is a wired network or a wireless network.
  • the server cluster 140 is a single server, or consists of several servers, or is a virtualization platform or a cloud computing service center.
  • the server cluster 140 is used to provide background services for client applications in the terminal 120, such as providing online search engine services.
  • all or part of the servers in the server cluster 140 can also be used to provide offline services.
  • For example, the method for determining a hot news keyword set provided in the present disclosure is executed offline to determine the hot news keyword set used for online search.
  • the client applications installed in different terminals 120 are the same, or the client applications installed on two terminals 120 are the same type of client application on different operating system platforms.
  • the specific form of the client application program may also be different.
  • the client application program may be a mobile phone client, a PC client, or a World Wide Web (Web) client.
  • the number of the aforementioned terminals 120 may be more or less. For example, there may be only one terminal, or there may be dozens or hundreds of terminals, or more. The embodiments of the present disclosure do not limit the number of terminals and device types.
  • the aforementioned wireless network or wired network uses standard communication technologies and/or protocols.
  • the network is usually the Internet, but it can also be any other network, including but not limited to a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, a virtual private network, or any combination thereof.
  • technologies and/or formats including HyperText Markup Language (HTML) and Extensible Markup Language (XML) are used to represent data exchanged over the network.
  • conventional encryption technologies such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) can be used to encrypt all or some links.
  • customized and/or dedicated data communication technologies can also be used to replace or supplement the aforementioned data communication technologies.
  • Fig. 2 shows a flowchart of a hot news query intention recognition method in an embodiment of the present disclosure.
  • the method provided by the embodiments of the present disclosure can be executed by any electronic device with computing and processing capabilities.
  • the server cluster 140 is used as the execution subject for the example description.
  • the hot news query intention recognition method 10 includes:
  • In step S102, the query sentence sent by the client is acquired.
  • the server cluster 140 can provide an online search function: it receives the query sentence sent by the client in the terminal 120, performs a search based on the query sentence, and returns the search result to the client.
  • In step S104, word segmentation is performed on the query sentence to obtain at least one query word (term).
  • Word segmentation algorithms and/or tools can be used to segment the query sentence.
  • For example, the jieba word segmentation tool can be used to segment the query sentence to obtain one or more query words.
  • In step S106, it is queried whether any of the at least one query word matches a keyword in the hot news keyword set.
  • the hot news keyword set is obtained and stored in advance by the server cluster 140. The set may be generated in advance offline; it covers recent hot news together with the keywords extracted from that news.
  • the set is updated in real time, continuously adding recently occurring hot news and removing outdated hot news.
  • the timeliness and hot spots of the collection can be set according to actual needs in specific applications, and the present disclosure is not limited thereto.
  • the method for determining the generation of the hot news keyword set will be described below.
  • the hot news query intention recognition method 10 may further include: acquiring and storing the hot news keyword set.
  • In step S108, when a query word among the at least one query word matches a keyword in the hot news keyword set, at least one hot news item corresponding to each matched keyword is obtained.
  • the hot news items obtained for the matched keywords together constitute a pre-recalled hot news set.
  • a match between a query word and a keyword may mean, for example, that the two are literally identical, or that the two have the same semantics; the present disclosure is not limited thereto.
  • In step S110, according to the determined correlation value between the query sentence and each hot news item, it is identified whether the query intention of the query sentence is a hot news query intention.
  • After the pre-recalled hot news set is obtained, the correlation between the query sentence and each hot news item in the set is determined; the correlation can be represented, for example, by a correlation value. Then, according to the correlation value between the query sentence and each hot news item, it is identified whether the query intention of the query sentence is a hot news query intention.
  • For example, the largest correlation value may be selected; when it is greater than a preset correlation threshold, the corresponding hot news item is taken as the hot news the query sentence is asking about, that is, the query intention of the query sentence is identified as a hot news query intention, and that hot news item can be returned to the client.
  • When the maximum correlation value is less than the preset correlation threshold, the query intention of the query sentence is identified as not being a hot news query intention.
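  • Steps S102 to S110 can be sketched end to end as follows. This is a minimal illustration, not the disclosed implementation: the function name `recognize_hot_news_intent`, the whitespace segmentation, the term-overlap score, and the toy data are all assumptions made for the example; a real system would plug in a proper word segmenter and one of the correlation measures described later.

```python
from typing import Callable, Dict, List, Optional, Tuple


def recognize_hot_news_intent(
    query: str,
    segment: Callable[[str], List[str]],
    keyword_to_news: Dict[str, List[str]],
    relevance: Callable[[str, str], float],
    threshold: float,
) -> Tuple[bool, Optional[str]]:
    """Return (has_hot_news_intent, best_matching_news_id)."""
    # S104: split the query sentence into query words.
    terms = segment(query)

    # S106/S108: pre-recall every hot news item whose keyword matches a query word.
    recalled: List[str] = []
    for term in terms:
        for news_id in keyword_to_news.get(term, []):
            if news_id not in recalled:
                recalled.append(news_id)
    if not recalled:
        return False, None

    # S110: score each pre-recalled item and compare the best score to the threshold.
    best_id = max(recalled, key=lambda news_id: relevance(query, news_id))
    if relevance(query, best_id) > threshold:
        return True, best_id
    return False, None


# Toy data (hypothetical): whitespace segmentation and a term-overlap score.
NEWS = {"n1": "city marathon route announced", "n2": "typhoon landfall warning issued"}
KEYWORDS = {"marathon": ["n1"], "typhoon": ["n2"], "warning": ["n2"]}


def overlap(query: str, news_id: str) -> float:
    """Fraction of query words that also occur in the news text."""
    q, d = set(query.split()), set(NEWS[news_id].split())
    return len(q & d) / len(q)


print(recognize_hot_news_intent("typhoon warning today", str.split, KEYWORDS, overlap, 0.5))
print(recognize_hot_news_intent("cheap flights", str.split, KEYWORDS, overlap, 0.5))
```

  • Only the pre-recalled items are scored, which is what keeps the per-query cost small even when the hot news set is large.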
  • Fig. 3 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure. Different from the hot news query intention recognition method 10 shown in FIG. 2, the hot news query intention recognition method shown in FIG. 3 further provides an implementation manner of step S108 in FIG. 2.
  • the hot news keyword set includes a keyword-hot news inverted dictionary, which contains each keyword in the hot news keyword set and the one or more hot news items corresponding to each keyword.
  • step S108 may further include:
  • In step S1082, according to the keyword-hot news inverted dictionary, a keyword-hot news inverted index is established in the form of a dictionary tree.
  • A trie (also called a prefix tree or dictionary tree) is a tree-shaped data structure specialized for string matching. It solves the problem of quickly finding a string in a set of strings, and it is often used in the recall phase of search engine systems.
  • the string search speed is mainly related to the length of the longest string.
  • the inverted index can be implemented as a specific storage form of the "keyword-hot news ID matrix".
  • the inverted index is keyed by keyword, so the list of hot news items containing a keyword can be obtained quickly from that keyword.
  • In step S1084, the keyword-hot news inverted index is searched to determine whether any of the at least one query word matches a keyword in it.
  • a match between a query word and a keyword may mean, for example, that the two are literally identical, or that they have the same semantics.
  • with the index, query words that match a keyword can be found quickly, so that the pre-recalled hot news set can be quickly associated.
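  • A keyword-hot news inverted index stored as a trie might look like the sketch below. The class names `TrieNode` and `KeywordTrie`, the news IDs, and the restriction to literal (exact) matching are assumptions for illustration; semantic matching, which the text also allows, is not covered here.

```python
from typing import Dict, List


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.news_ids: List[str] = []  # non-empty only where a keyword ends


class KeywordTrie:
    """Keyword -> hot news IDs inverted index stored as a prefix tree."""

    def __init__(self, inverted: Dict[str, List[str]]) -> None:
        self.root = TrieNode()
        for keyword, news_ids in inverted.items():
            node = self.root
            for ch in keyword:
                node = node.children.setdefault(ch, TrieNode())
            node.news_ids.extend(news_ids)

    def lookup(self, term: str) -> List[str]:
        """Return the news IDs for an exact keyword match, else an empty list."""
        node = self.root
        for ch in term:
            node = node.children.get(ch)
            if node is None:
                return []
        return list(node.news_ids)


# Hypothetical inverted dictionary.
trie = KeywordTrie({"typhoon": ["n2"], "type": ["n7"], "marathon": ["n1", "n5"]})
print(trie.lookup("marathon"))  # ['n1', 'n5']
print(trie.lookup("typ"))       # [] (a shared prefix, not a stored keyword)
```

  • A lookup walks one node per character of the query word, so its cost depends on the length of the word rather than on the number of keywords stored.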
  • Fig. 4 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure. Different from the hot news query intention recognition method 10 shown in Fig. 2, the hot news query intention recognition method shown in Fig. 4 further provides an implementation manner of step S110.
  • step S110 includes:
  • In step S1102, for each hot news item, the sum of the TF-IDF values of the query words it contains is calculated.
  • TF-IDF is a common keyword extraction method. Its core advantage is that it is unsupervised: given a dictionary, it produces reliable results, so it can be developed quickly and deployed online.
  • TF-IDF is essentially a method of calculating the weight of each word in a sentence.
  • the main variables involved are TF (Term Frequency, the frequency with which a word appears in a sentence) and IDF (Inverse Document Frequency, which is derived from the proportion of all documents that contain the word), hence the name.
  • n_{i,j} represents the frequency with which query word i appears in hot news item j;
  • D is the pre-recalled hot news set described above;
  • tfidf_i is the TF-IDF value of the i-th query word contained in the recalled hot news item;
  • sum_tfidf is the sum of the TF-IDF values of all query words contained in the recalled hot news item.
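  • The formulas these symbols belong to did not survive in this text. Under the standard TF-IDF definition, and using the symbols defined above, they would plausibly read as follows; this is a reconstruction rather than the filed formulas. The symbols t_i (query word i) and d_j (hot news item j), the treatment of n_{i,j} as an occurrence count, and the natural-log form of the IDF are assumptions.

```latex
\mathrm{tfidf}_{i,j} \;=\; \frac{n_{i,j}}{\sum_{k} n_{k,j}} \cdot
  \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|}
\qquad
\mathrm{sum\_tfidf}_{j} \;=\; \sum_{i} \mathrm{tfidf}_{i,j}
```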
  • In step S1104, the hot news items are sorted in descending order of their TF-IDF sums.
  • In step S1106, the correlation value between the query sentence and the hot news item is determined.
  • In step S1108, it is determined whether the correlation value is greater than a preset correlation threshold. If yes, go to step S1110; otherwise, go to step S1112.
  • In step S1110, the query intention of the query sentence is identified as a hot news query intention, and a query result containing the hot news item is returned to the client, with the hot news item at the top of the query result.
  • In step S1112, it is determined whether the hot news item is the last one. If yes, go to step S1114; otherwise, go back to step S1106 and process the next hot news item in descending order.
  • In step S1114, the query intention of the query sentence is identified as not being a hot news query intention.
  • For example, the query sentence may have only an ordinary news query intention, or some other query intention.
  • in the process of identifying whether the query intention of the query sentence is a hot news query intention based on the determined correlation values, the hot news items are first sorted in descending order of the TF-IDF sum of the query words each item contains, which can improve the accuracy of identification; each hot news item is then processed in that order to identify whether the query intention is a hot news query intention, which speeds up recognition.
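  • The sort-then-scan flow of Fig. 4 can be sketched as below: rank the pre-recalled items by their TF-IDF sum, then compute the (more expensive) correlation value item by item and stop at the first one that exceeds the threshold. The function names `sum_tfidf` and `first_hit`, the tokenized toy documents, and the stand-in correlation function are assumptions for illustration.

```python
import math
from typing import Callable, Dict, List, Optional


def sum_tfidf(query_terms: List[str], doc: List[str], docs: List[List[str]]) -> float:
    """Sum of the TF-IDF values of the query terms that occur in `doc`."""
    total = 0.0
    for term in set(query_terms):
        tf = doc.count(term) / len(doc)
        df = sum(1 for d in docs if term in d)
        if tf > 0 and df > 0:
            total += tf * math.log(len(docs) / df)
    return total


def first_hit(
    query_terms: List[str],
    recalled: Dict[str, List[str]],
    score: Callable[[List[str], List[str]], float],
    threshold: float,
) -> Optional[str]:
    """Scan pre-recalled news in descending sum_tfidf order; stop at the first hit."""
    docs = list(recalled.values())
    ranked = sorted(
        recalled,
        key=lambda nid: sum_tfidf(query_terms, recalled[nid], docs),
        reverse=True,
    )
    for news_id in ranked:
        if score(query_terms, recalled[news_id]) > threshold:
            return news_id  # hot news intention identified; later items are skipped
    return None  # no item exceeded the threshold: not a hot news intention


# Hypothetical pre-recalled set: news ID -> segmented text.
RECALLED = {
    "n1": ["typhoon", "warning", "issued"],
    "n2": ["typhoon", "season", "outlook", "report"],
    "n3": ["marathon", "route", "warning"],
}


def coverage(q: List[str], d: List[str]) -> float:
    """Stand-in correlation value: fraction of query terms present in the item."""
    return sum(t in d for t in q) / len(q)


print(first_hit(["typhoon", "warning"], RECALLED, coverage, 0.5))
```

  • Because the most promising items are scored first, a matching query usually needs only one or two correlation computations.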
  • Fig. 5 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure. Different from the hot news query intention recognition method shown in FIG. 4, the hot news query intention recognition method shown in FIG. 5 further provides an implementation manner of step S1106.
  • the hot news keyword set includes a hot news-keyword forward dictionary, which contains each hot news item in the hot news keyword set, the keywords and related words corresponding to each hot news item, and the word frequency of each keyword and each related word.
  • step S1106 includes:
  • In step S61, according to the hot news-keyword forward dictionary, the correlation value between the query sentence and the hot news item is calculated based on the BM25 text similarity algorithm.
  • BM25 is an algorithm used to evaluate the correlation between query sentences and documents. It is an algorithm based on a probabilistic retrieval model. In the present disclosure, the BM25 algorithm can be used to calculate the correlation between query sentences and various hot news. The calculation formula of the BM25 value of the correlation value between the query sentence and each hot news is shown in formula (3):
  • RSV_d is the BM25 correlation value between the query sentence and hot news item d;
  • t ∈ q means that query word t belongs to the query word sequence q formed by the query words;
  • tf_td is the term frequency of the query word in hot news item d, and tf_tq is the term frequency of the query word in the query sentence;
  • L_d and L_ave are, respectively, the length of hot news item d and the average length of the hot news items in the entire hot news set.
  • the above correlation threshold corresponds to threshold_bm25, which can be set according to classification-effect experiments; the present disclosure is not limited thereto.
  • the relevance value is further determined based on the BM25 text similarity algorithm.
  • the BM25 text similarity algorithm adds several adjustable parameters on the basis of the traditional TF-IDF algorithm, making it more flexible and powerful in application, and has higher practicability.
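
As a rough illustration of how such a score can be computed, the sketch below uses the common textbook form of Okapi BM25 with a query-term-frequency factor, which uses the same symbols as the definitions above (tf_td, tf_tq, L_d, L_ave). It is an assumption that formula (3) has exactly this shape; the parameter names `k1`, `b`, `k3`, the `log(N / df)` weighting, and the argument layout are illustrative and do not come from the text.

```python
import math
from typing import Dict, List


def bm25_score(query_words: List[str],
               doc_tf: Dict[str, int],
               doc_len: int,
               avg_doc_len: float,
               doc_freq: Dict[str, int],
               num_docs: int,
               k1: float = 1.2,
               b: float = 0.75,
               k3: float = 8.0) -> float:
    """Textbook BM25 score of one hot news item d for a segmented query q."""
    score = 0.0
    for term in set(query_words):
        tf_td = doc_tf.get(term, 0)        # term frequency in the hot news d
        if tf_td == 0:
            continue                       # term absent from d contributes nothing
        tf_tq = query_words.count(term)    # term frequency in the query sentence
        idf = math.log(num_docs / doc_freq[term])
        # Document-side saturation with length normalisation (L_d / L_ave).
        doc_part = ((k1 + 1) * tf_td) / (
            k1 * ((1 - b) + b * doc_len / avg_doc_len) + tf_td)
        # Query-side saturation.
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score
```

Here `doc_tf` plays the role of one entry of the forward dictionary (word to word frequency for a single hot news item), and the returned value would be compared with threshold_bm25.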
  • Fig. 6 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure. Different from the hot news query intention recognition method shown in FIG. 4, the hot news query intention recognition method shown in FIG. 6 further provides an implementation manner of step S1106.
  • step S1106 may further include:
  • in step S62, the correlation value between the query sentence and the hot news is predicted based on the deep structure semantic model DSSM.
  • the deep structure semantic model DSSM is obtained by training based on historical query sentences in the search engine and corresponding hot news click data.
  • DNN Deep Neural Networks, deep neural network
  • when searching online, the DNN network is used to represent query sentences and hot news as low-dimensional semantic vectors, and the trained DSSM model is then used to predict the magnitude of the correlation between the query sentence and the hot news.
  • the hot news query intention recognition method introduces the word embedding vector feature when determining the correlation value between the query sentence and each hot news, and the correlation between the query sentence and the hot news can be calculated on the semantic level. It further improves the accuracy of correlation determination.
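
DSSM-style models commonly score a query-document pair by the cosine similarity of the two semantic vectors. The text here does not state which similarity function is used, so the sketch below is an assumption; it also takes the vectors as already produced by the network.

```python
import math
from typing import Sequence


def cosine_similarity(query_vec: Sequence[float], news_vec: Sequence[float]) -> float:
    """Cosine similarity between a query vector and a hot news vector."""
    dot = sum(q * n for q, n in zip(query_vec, news_vec))
    norm_q = math.sqrt(sum(q * q for q in query_vec))
    norm_n = math.sqrt(sum(n * n for n in news_vec))
    if norm_q == 0.0 or norm_n == 0.0:
        return 0.0  # treat an all-zero vector as unrelated
    return dot / (norm_q * norm_n)
```

The result lies in [-1, 1] and could serve as the correlation value that is compared against the relevance threshold.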
  • a "pre-training + fine-tuning" language model in the form of BERT can also be used.
  • BERT (Bidirectional Encoder Representations from Transformers)
  • the BERT pre-training model combined with the semantic similarity model is used to determine the correlation between the query sentence and each hot news, which can satisfy more accurate correlation calculations at the semantic level.
  • the present disclosure further provides a method for determining a set of hot news keywords.
  • the method can be implemented offline.
  • it can also be implemented by the server cluster 140 shown in FIG.
  • Other servers obtain the hot news keyword set for online search and the above-mentioned hot news query intention recognition.
  • Fig. 7 shows a flowchart of a method for determining a set of hot news keywords in an embodiment of the present disclosure.
  • the method 20 for determining a set of hot news keywords includes:
  • step S202 multiple hot news items are acquired.
  • hot news will be obtained in real time to continuously update the hot news keyword collection.
  • the timeliness and hot spots of the collection can be set according to actual needs in specific applications, and the present disclosure is not limited thereto.
  • step S204 word segmentation processing is performed on each hot news item to obtain at least one news vocabulary of each hot news item.
  • the jieba tokenizer consistent with the above-mentioned hot news query intention recognition method can be used to separately perform word segmentation processing on each obtained hot news.
  • N-GRAM (N-gram language model)
  • the content of the hot news is divided into terms according to bytes, and a sliding-window operation with window size N and step size 1 is applied to form a sequence of term fragments of length N.
  • for example, N can be set to 1, 2, and 3 in turn, and the final news vocabulary is retained according to the following rules: when N is 1, no filtering is performed and all terms obtained by word segmentation are included; when N is 2, only combinations in which at least one of the two terms is a word, or which contain an array, are kept; when N is 3, only combinations in which all three terms are words are kept. After processing with N = 1, 2, and 3, the final at least one news vocabulary is obtained.
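
The sliding-window step can be sketched as below. The function names are hypothetical, and the retention rules for N = 2 and N = 3 are deliberately left out, since they depend on part-of-speech information that is not specified here.

```python
from typing import List, Tuple


def ngrams(terms: List[str], n: int) -> List[Tuple[str, ...]]:
    """Slide a window of size n with step 1 over the segmented terms."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]


def candidate_vocabulary(terms: List[str], max_n: int = 3) -> List[str]:
    """Concatenate every 1..max_n gram into a candidate news vocabulary entry (unfiltered)."""
    vocab: List[str] = []
    for n in range(1, max_n + 1):
        vocab.extend("".join(gram) for gram in ngrams(terms, n))
    return vocab
```

For a segmented item `["a", "b", "c"]`, the 2-grams are `("a", "b")` and `("b", "c")`; an item shorter than N simply yields no N-grams.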
  • step S206 the keywords of each hot news item are respectively determined according to at least one news vocabulary of each hot news item.
  • the keywords of each hot news are extracted separately.
  • step S208 a set of hot information keywords is determined according to each hot news and the keywords of each hot news.
  • a set of hot information keywords is constructed.
  • the above-mentioned TF-IDF method may be used to extract the keywords of each hot news separately. For example, calculate the TF-IDF of at least one news vocabulary of each hot news separately, compare the calculated TF-IDF with a preset keyword threshold, and if it is greater than the keyword threshold, determine the corresponding news vocabulary as One of the keywords of this hot news.
  • the calculation formula of TF-IDF can be specifically referred to the above formula (1), which will not be repeated here.
  • the hot news keyword set includes a keyword-hot news inverted dictionary; the keyword-hot news inverted dictionary includes each keyword in the hot news keyword set and the one or more hot news items corresponding to each keyword.
  • the method for determining the set of hot news keywords disclosed in the embodiments of the present disclosure organizes hot news information offline to prepare for online search in advance, which can improve the recognition speed of the online end.
  • because the hot news in the collection is constantly updated, the timeliness of the hot news can be guaranteed, which avoids recalling outdated hot news.
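
A minimal sketch of the keyword-hot news inverted dictionary and of the pre-recall lookup it enables is shown below. The dictionary layout (keyword mapped to a set of news ids) and both function names are assumptions chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_dictionary(news_keywords: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each keyword to the ids of the hot news items it was extracted from."""
    inverted: Dict[str, Set[str]] = defaultdict(set)
    for news_id, keywords in news_keywords.items():
        for keyword in keywords:
            inverted[keyword].add(news_id)
    return dict(inverted)


def pre_recall(query_words: List[str], inverted: Dict[str, Set[str]]) -> Set[str]:
    """Collect every hot news item whose keyword matches at least one query word."""
    recalled: Set[str] = set()
    for word in query_words:
        recalled |= inverted.get(word, set())
    return recalled
```

Rebuilding the dictionary whenever the offline hot news list is refreshed keeps stale items out of the recall set.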
  • Fig. 8 shows a flowchart of another method for determining a set of hot news keywords in an embodiment of the present disclosure.
  • the method 30 for determining a set of hot news keywords may further include:
  • step S302 according to at least one news vocabulary of each hot news item, relevant words of each hot news item are respectively determined.
  • the method of calculating the TF-IDF of each news vocabulary can also be used to determine the related words of each hot news. For example, after calculating the TF-IDF of each news vocabulary, by comparing with the preset related word threshold, if it is greater than the related word threshold and less than the above keyword threshold, it is determined to be a related word.
  • keywords are vocabulary that can, to a certain extent, represent the core of the hot news, such as time, location, task, etc.
  • keywords are used on the one hand for the above-mentioned pre-recall of hot news and on the other hand for matching degree calculation; related words are vocabulary that can, to a certain extent, reflect the internal information of the hot news, such as key adjectives, hot tags, etc., and are used to calculate the matching degree; irrelevant words are vocabulary that appears in the corresponding hot news but does not clearly reflect the meaning of the hot news, such as auxiliary words, stop words, etc.
  • the above keyword threshold and related word threshold are set to jointly determine whether a news vocabulary belongs to a keyword, a related word, or an irrelevant word. If the TF-IDF is greater than the keyword threshold, it is determined as a keyword; if it is less than the keyword threshold but greater than the related word threshold, it is determined as a related word; otherwise, it is determined as an irrelevant word.
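
The two-threshold rule can be written as a small function. The function name and the label strings are hypothetical, and treating a value exactly equal to the keyword threshold as a related word is an assumption, since the text does not cover that case.

```python
def classify_vocabulary(tfidf: float,
                        keyword_threshold: float,
                        related_threshold: float) -> str:
    """Classify one news vocabulary entry by its TF-IDF value."""
    if related_threshold > keyword_threshold:
        raise ValueError("related_threshold must not exceed keyword_threshold")
    if tfidf > keyword_threshold:
        return "keyword"
    if tfidf > related_threshold:
        # Includes the boundary case tfidf == keyword_threshold (assumption).
        return "related"
    return "irrelevant"
```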
  • the keywords can be further filtered or downgraded to related words, for example by deleting 2-GRAM combinations with quantifiers (such as 1 year), and likewise vocabulary with a high IDF but no clear hot news information (such as men, women, etc.).
  • after the keywords, related words, and corresponding TF-IDF of each hot news item are determined, they can be stored, for example in a database, so as to generate a hot news keyword set.
  • the hot news keyword set may further include a hot news-keyword forward dictionary;
  • the hot news-keyword forward dictionary includes each hot news item in the hot news keyword set, the keywords and related words corresponding to each hot news item, and the word frequency of each keyword and each related word.
  • the hot news-keyword forward dictionary can be used to calculate the correlation between pre-recalled hot news and query sentences.
  • a set of hot information keywords is determined according to each hot news and the keywords and related words of each hot news.
  • FIG. 9 shows a flowchart of another method for determining a set of hot news keywords in an embodiment of the present disclosure.
  • the method for determining a set of hot news keywords shown in FIG. 9 further illustrates another embodiment of determining the keywords and related words of each hot news item according to at least one news vocabulary of each hot news item.
  • the method 40 for determining a set of hot news keywords includes:
  • step S202 multiple hot news items are acquired.
  • step S204 word segmentation processing is performed on each hot news item to obtain at least one news vocabulary of each hot news item.
  • step S402 at least one news vocabulary of each hot news is converted into a word vector based on the word-to-vector word2vector algorithm, and the part-of-speech feature and location feature of the at least one news vocabulary of each hot news are respectively determined.
  • step S404 the TF-IDF of at least one news vocabulary of each hot news is calculated respectively.
  • in step S406, the word vector, part-of-speech feature, location feature, and TF-IDF of at least one news vocabulary of each hot news item are input into the trained BILSTM-CRF (bidirectional long short-term memory network-conditional random field) model to determine the keywords and related words of each hot news item.
  • the BILSTM-CRF model can further classify the above-mentioned irrelevant words.
  • step S408 a set of hot information keywords is determined according to each hot news and the keywords and related words of each hot news.
  • in the method for determining the set of hot news keywords provided by the embodiments of the present disclosure, semantic features are introduced when keywords and related words are extracted, and a pre-trained word vector plus deep learning approach is used that combines features such as part of speech, location, and TF-IDF value, which can further improve the accuracy and recall rate of keyword extraction.
  • FIG. 10 shows a schematic diagram of a hot news query intention recognition device in an embodiment of the present disclosure.
  • the hot news query intention recognition device 50 includes: a sentence acquisition module 502, a word segmentation processing module 504, a keyword matching module 506, a news recall module 508, and an intention recognition module 510.
  • the sentence acquisition module 502 is used to acquire the query sentence sent by the client;
  • the word segmentation processing module 504 is configured to perform word segmentation processing on the query sentence to obtain at least one query word;
  • the keyword matching module 506 is used for querying whether any query word in at least one query word matches a keyword in the hot news keyword set;
  • the news recall module 508 is configured to obtain at least one hot news item corresponding to the keyword matched by each query word when a query word among the at least one query word matches a keyword in the hot news keyword set;
  • the intention recognition module 510 is configured to identify whether the query intention of the query sentence is the hot news query intention according to the determined correlation value between the query sentence and each hot news item.
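
The way the five modules above hand data to one another can be sketched as a single function. This is only an outline under stated assumptions: `segment` stands in for the word segmenter, `inverted` for the keyword-hot news inverted dictionary, and `relevance` for whichever correlation measure is chosen; none of these names appear in the disclosure, and candidates are visited in plain sorted-id order here rather than by TF-IDF sum.

```python
from typing import Callable, Dict, List, Optional, Set


def recognize_hot_news_intent(query: str,
                              segment: Callable[[str], List[str]],
                              inverted: Dict[str, Set[str]],
                              relevance: Callable[[List[str], str], float],
                              threshold: float) -> Optional[str]:
    """Return the id of the matching hot news item, or None if there is no hot news intent."""
    query_words = segment(query)                          # word segmentation (504)
    matched = [w for w in query_words if w in inverted]   # keyword matching (506)
    if not matched:
        return None                                       # nothing to recall
    recalled: Set[str] = set()
    for word in matched:
        recalled |= inverted[word]                        # news recall (508)
    for news_id in sorted(recalled):                      # intention recognition (510)
        if relevance(query_words, news_id) > threshold:
            return news_id
    return None
```

A whitespace splitter such as `str.split` is enough to exercise the sketch; a real deployment would pass a Chinese word segmenter instead.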
  • the query sentence is first segmented to obtain at least one query word; each query word is then matched against the keywords in a pre-obtained hot news keyword set to determine the pre-recalled hot news; finally, the correlation value between the query sentence and the pre-recalled hot news is used to identify whether the query sentence has the hot news query intention.
  • first, compared with the related-art method of recognizing query intentions by training a text classification model, this method does not need to train on a large number of intention-labeled samples to obtain a classification model of a certain accuracy, and it can identify online and in real time whether the query sentence has the intention of searching hot news; second, the pre-recall of hot news greatly reduces the amount of computation for the correlation between the user query sentence and massive hot news, so the intent classification can be completed quickly.
  • in tests on search-scenario services, the average delay of the hot news query intention recognition service is less than 3 ms, so service concurrency can also be increased, which can reduce the number of deployed search servers;
  • third, this method can also effectively improve the accuracy and recall rate of hot news query intention recognition;
  • fourth, by configuring a set of hot news keywords in advance, this method also allows effective intervention in online hot news search intention recognition.
  • the intention recognition module 510 includes: a TF-IDF calculation unit, a TF-IDF sorting unit, and an intention recognition unit.
  • the TF-IDF calculation unit is used to calculate, for each hot news item, the sum of the term frequency-inverse document frequency (TF-IDF) of the query words it contains;
  • the TF-IDF sorting unit is used to sort the hot news items in descending order by the sum of TF-IDF calculated for each item;
  • the intention recognition unit is used to process each hot news item in turn in the descending order: determine the correlation value between the query sentence and the hot news item; when the correlation value is greater than the preset relevance threshold, recognize the query intention of the query sentence as the hot news query intention and return to the client a query result containing the hot news item, with the hot news item at the top of the query result; when the correlation value is not greater than the relevance threshold, process the next hot news item.
  • the hot news keyword set includes a hot news-keyword forward dictionary; the hot news-keyword forward dictionary includes each hot news item in the hot news keyword set, the keywords and related words corresponding to each hot news item, and the word frequency of each keyword and each related word; the intention recognition unit is used to calculate the correlation between the query sentence and the hot news from the hot news-keyword forward dictionary using the BM25 text similarity algorithm.
  • the intention recognition unit is used to predict the correlation value between the query sentence and the hot news based on the deep structure semantic model DSSM; the deep structure semantic model DSSM is obtained by training on historical query sentences in the search engine and the corresponding hot news click data.
  • the hot news keyword set includes a keyword-hot news inverted dictionary; the keyword-hot news inverted dictionary includes each keyword in the hot news keyword set and the one or more hot news items corresponding to each keyword; the keyword matching module 506 includes an index establishment unit and a keyword search unit.
  • the index establishment unit is used to build a keyword-hot news inverted index, in the form of a dictionary tree (trie), from the keyword-hot news inverted dictionary; the keyword search unit is used to search whether any of the at least one query word matches a keyword in the keyword-hot news inverted index.
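
A dictionary-tree (trie) index over the inverted dictionary might look like the sketch below. The class names and the exact-match lookup are assumptions for illustration; the text only states that the index is organised as a dictionary tree.

```python
from typing import Dict, Iterable, Optional, Set


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Set only on nodes where a complete keyword ends.
        self.news_ids: Optional[Set[str]] = None


class KeywordTrie:
    """Character-level trie mapping complete keywords to hot news ids."""

    def __init__(self, inverted: Dict[str, Iterable[str]]) -> None:
        self.root = TrieNode()
        for keyword, news_ids in inverted.items():
            node = self.root
            for ch in keyword:
                node = node.children.setdefault(ch, TrieNode())
            if node.news_ids is None:
                node.news_ids = set()
            node.news_ids.update(news_ids)

    def lookup(self, word: str) -> Set[str]:
        """Return the news ids for an exact keyword match, or an empty set."""
        node = self.root
        for ch in word:
            nxt = node.children.get(ch)
            if nxt is None:
                return set()
            node = nxt
        return set(node.news_ids) if node.news_ids is not None else set()
```

Lookup cost grows with the length of the query word rather than with the number of keywords, and a prefix that is not itself a keyword returns nothing.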
  • the hot news query intention recognition device 50 further includes: a collection acquisition module for acquiring and storing a collection of hot news keywords.
  • FIG. 11 shows a schematic diagram of an apparatus for determining a set of hot news keywords in an embodiment of the present disclosure.
  • the device 60 for determining a set of hot news keywords includes: a news acquisition module 602, a word segmentation processing module 604, a keyword extraction module 606, and a set determination module 608.
  • the news acquisition module 602 is used to acquire multiple hot news
  • the word segmentation processing module 604 is used to segment each hot news item separately to obtain at least one news vocabulary of each hot news item;
  • the keyword extraction module 606 is used to determine the keywords of each hot news according to at least one news vocabulary of each hot news;
  • the set determining module 608 is used for determining a set of hot information keywords according to each hot news and the keywords of each hot news.
  • the device for determining a set of hot news keywords disclosed in the embodiment of the present disclosure organizes hot news information offline to prepare for online search in advance, which can improve the recognition speed of the online end.
  • because the hot news in the collection is constantly updated, the timeliness of the hot news can be guaranteed, which avoids recalling outdated hot news.
  • the hot news keyword set includes a keyword-hot news inverted dictionary; the keyword-hot news inverted dictionary includes each keyword in the hot news keyword set and the one or more hot news items corresponding to each keyword.
  • the keyword extraction module 606 is further configured to determine the related words of each hot news item according to at least one news vocabulary of each hot news item.
  • the hot news keyword set includes a hot news-keyword forward dictionary; the hot news-keyword forward dictionary includes each hot news item in the hot news keyword set, the keywords and related words corresponding to each hot news item, and the word frequency of each keyword and each related word.
  • the keyword extraction module 606 includes: a TF-IDF calculation unit and a vocabulary extraction unit.
  • the TF-IDF calculation unit is used to calculate the TF-IDF of at least one news vocabulary of each hot news;
  • the vocabulary extraction unit is used to determine the keywords and related words of each hot news item according to the TF-IDF of at least one news vocabulary of each hot news item and the preset TF-IDF thresholds.
  • the keyword extraction module 606 includes: a feature extraction unit, a TF-IDF calculation unit, and a vocabulary extraction unit.
  • the feature extraction unit is used to convert at least one news vocabulary of each hot news item into a word vector based on the word-to-vector word2vector algorithm, and to determine the part-of-speech feature and location feature of at least one news vocabulary of each hot news item;
  • the TF-IDF calculation unit is used to calculate the TF-IDF of at least one news vocabulary of each hot news item;
  • the vocabulary extraction unit is used to input the word vector, part-of-speech feature, location feature, and TF-IDF of at least one news vocabulary of each hot news item into the trained BILSTM-CRF model, so that the keywords and related words of each hot news item are determined.
  • the word segmentation processing module 604 includes: a word segmentation unit and a word segmentation processing unit.
  • the word segmentation unit is used to segment each hot news item to obtain at least one word segment of each hot news item; the word segmentation processing unit is used to perform multiple rounds of N-GRAM processing on the at least one word segment of each hot news item to obtain at least one news vocabulary of each hot news item.
  • the electronic device 800 according to this embodiment of the present disclosure will be described below with reference to FIG. 12.
  • the electronic device 800 shown in FIG. 12 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present disclosure.
  • the electronic device 800 is represented in the form of a general-purpose computing device.
  • the components of the electronic device 800 may include, but are not limited to: the aforementioned at least one processing unit 810, the aforementioned at least one storage unit 820, and a bus 830 connecting different system components (including the storage unit 820 and the processing unit 810).
  • the storage unit stores program code, and the program code can be executed by the processing unit 810, so that the processing unit 810 executes the steps of the various exemplary embodiments described in the "Exemplary Method" section of this specification.
  • the processing unit 810 may execute step S102 as shown in FIG.
  • step S104: perform word segmentation processing on the query sentence to obtain at least one query word;
  • step S106: query whether any of the at least one query word matches a keyword in the hot news keyword set;
  • step S108: when a query word among the at least one query word matches a keyword in the hot news keyword set, obtain at least one hot news item corresponding to the keyword matched by each query word;
  • step S110: according to the determined correlation value between the query sentence and each hot news item, identify whether the query intention of the query sentence is the hot news query intention.
  • the storage unit 820 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 8201 and/or a cache storage unit 8202, and may further include a read-only storage unit (ROM) 8203.
  • the storage unit 820 may also include a program/utility tool 8204 having a set of (at least one) program module 8205.
  • the program module 8205 includes but is not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • the bus 830 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of multiple bus structures.
  • the electronic device 800 can also communicate with one or more external devices 700 (such as keyboards, pointing devices, Bluetooth devices, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any device (such as a router, modem, etc.) that enables the electronic device 800 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 850.
  • the electronic device 800 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 860.
  • the network adapter 860 communicates with other modules of the electronic device 800 through the bus 830. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the example embodiments described here can be implemented by software, or by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network, and which includes several instructions to make a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) execute the method according to the embodiments of the present disclosure.
  • a computer-readable storage medium on which is stored a program product capable of implementing the above-mentioned method of this specification.
  • various aspects of the present disclosure may also be implemented in the form of a program product, which includes program code.
  • when the program product runs on a terminal device, the program code is used to enable the terminal device to execute the steps according to various exemplary embodiments of the present disclosure described in the above-mentioned "Exemplary Method" section of this specification.
  • a program product 900 for implementing the above method according to an embodiment of the present disclosure is described. It can adopt a portable compact disk read-only memory (CD-ROM), include program code, and run on a terminal device, for example a personal computer.
  • the program product of the present disclosure is not limited thereto.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of the present disclosure can be written in any combination of one or more programming languages.
  • the programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device can be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, via the Internet using an Internet service provider).
  • although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

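A hot news intention recognition method, apparatus, device, and readable storage medium. The hot news query intention recognition method includes: obtaining a query sentence sent by a client (S102); performing word segmentation on the query sentence to obtain at least one query word (S104); querying whether any of the at least one query word matches a keyword in a hot news keyword set (S106); when a query word among the at least one query word matches a keyword in the hot news keyword set, obtaining at least one hot news item corresponding to the keyword matched by each query word (S108); and identifying, according to the determined correlation value between the query sentence and each hot news item, whether the query intention of the query sentence is a hot news query intention (S110).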

Description

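Method, apparatus, device, and readable storage medium for identifying hot news intentions
Technical Field
The present disclosure relates to the field of information technology, and in particular to a method, apparatus, device, and readable storage medium for identifying hot news intentions.
Background
A search engine is a retrieval technology that, based on user needs and certain algorithms, uses specific strategies to retrieve specified information from the Internet and feed it back to users. Every search is initiated by the input of a query sentence (query), and how the search engine understands the query sentence entered by the user directly affects the search results that are finally returned. The effect of intent recognition in the search scenario is therefore a decisive factor in measuring the quality of a search engine.
Hot news refers to recent news that is attracting a high level of attention, and it is highly real-time. When a user searches for hot news, the search engine needs to recognize the hot news query intention from the query sentence.
The above information disclosed in this Background section is only intended to enhance understanding of the background of the present disclosure, and it may therefore include information that does not constitute prior art known to those of ordinary skill in the art.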
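Summary
The present disclosure provides a method, apparatus, device, and readable storage medium for identifying hot news intentions.
Other features and advantages of the present disclosure will become apparent from the following detailed description, or will be learned in part through practice of the present disclosure.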
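According to one aspect of the present disclosure, a hot news query intention recognition method is provided, including: obtaining a query sentence sent by a client; performing word segmentation on the query sentence to obtain at least one query word; querying whether any of the at least one query word matches a keyword in a hot news keyword set; when a query word among the at least one query word matches a keyword in the hot news keyword set, obtaining at least one hot news item corresponding to the keyword matched by each query word; and identifying, according to the determined correlation value between the query sentence and each hot news item, whether the query intention of the query sentence is a hot news query intention.
According to another aspect of the present disclosure, a method for determining a hot news keyword set is provided, including: acquiring multiple hot news items; performing word segmentation on each hot news item to obtain at least one news vocabulary of each hot news item; determining the keywords of each hot news item according to its at least one news vocabulary; and determining the hot information keyword set according to each hot news item and its keywords.
According to yet another aspect of the present disclosure, a hot news query intention recognition apparatus is provided, including: a sentence acquisition module, configured to obtain a query sentence sent by a client; a word segmentation processing module, configured to perform word segmentation on the query sentence to obtain at least one query word; a keyword matching module, configured to query whether any of the at least one query word matches a keyword in a hot news keyword set; a news recall module, configured to obtain, when a query word among the at least one query word matches a keyword in the hot news keyword set, at least one hot news item corresponding to the keyword matched by each query word; and an intention recognition module, configured to identify, according to the determined correlation value between the query sentence and each hot news item, whether the query intention of the query sentence is a hot news query intention.
According to yet another aspect of the present disclosure, an apparatus for determining a hot news keyword set is provided, including: a news acquisition module, configured to acquire multiple hot news items; a word segmentation processing module, configured to perform word segmentation on each hot news item to obtain at least one news vocabulary of each hot news item; a keyword extraction module, configured to determine the keywords of each hot news item according to its at least one news vocabulary; and a set determination module, configured to determine the hot information keyword set according to each hot news item and its keywords.
According to yet another aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the above hot news query intention recognition method by executing the executable instructions, or the processor is configured to perform the above method for determining a hot news keyword set by executing the executable instructions.
According to yet another aspect of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored; when executed by a processor, the computer program implements the above hot news query intention recognition method, or, when executed by a processor, the computer program implements the above method for determining a hot news keyword set.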
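In the hot news query intention recognition method provided by the embodiments of the present disclosure, the query sentence is first segmented to obtain at least one query word; each query word is then matched against the keywords in a pre-obtained hot news keyword set to determine the pre-recalled hot news; finally, the correlation value between the query sentence and the pre-recalled hot news is used to identify whether the query sentence has a hot news query intention. This method can produce the following beneficial effects. First, compared with the related-art method of recognizing query intentions by training a text classification model, this method does not need to train on a large number of intention-labeled samples to obtain a classification model of a certain accuracy, and it can identify online and in real time whether a query sentence has the intention of searching hot news. Second, the pre-recall of hot news greatly reduces the amount of computation for the correlation between the user query sentence and massive hot news, so intent classification can be completed quickly; in tests on search-scenario services, the average latency of the hot news query intention recognition service is less than 3 ms, so service concurrency is also improved, which can reduce the number of deployed search servers. Third, the method can effectively improve the accuracy and recall rate of hot news query intention recognition. Fourth, by configuring the hot news keyword set in advance, the method allows effective intervention in online hot news search intention recognition.
It should be understood that the above general description and the following detailed description are only exemplary and do not limit the present disclosure.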
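Brief Description of the Drawings
The above and other objectives, features, and advantages of the present disclosure will become more apparent from the detailed description of example embodiments with reference to the accompanying drawings.
Fig. 1 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the present disclosure.
Fig. 2 shows a flowchart of a hot news query intention recognition method in an embodiment of the present disclosure.
Fig. 3 shows a flowchart of another hot news query intention recognition method in an embodiment of the present disclosure.
Fig. 4 shows a flowchart of yet another hot news query intention recognition method in an embodiment of the present disclosure.
Fig. 5 shows a flowchart of yet another hot news query intention recognition method in an embodiment of the present disclosure.
Fig. 6 shows a flowchart of yet another hot news query intention recognition method in an embodiment of the present disclosure.
Fig. 7 shows a flowchart of a method for determining a hot news keyword set in an embodiment of the present disclosure.
Fig. 8 shows a flowchart of another method for determining a hot news keyword set in an embodiment of the present disclosure.
Fig. 9 shows a flowchart of yet another method for determining a hot news keyword set in an embodiment of the present disclosure.
Fig. 10 shows a schematic diagram of a hot news query intention recognition apparatus in an embodiment of the present disclosure.
Fig. 11 shows a schematic diagram of an apparatus for determining a hot news keyword set in an embodiment of the present disclosure.
Fig. 12 shows a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
Fig. 13 shows a schematic diagram of a computer-readable storage medium in an embodiment of the present disclosure.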
具体实施方式
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些实施方式使得本公开将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施方式中。
此外,附图仅为本公开的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
此外,在本公开的描述中,“多个”的含义是至少两个,例如两个,三个等,除非另有明确具体的限定。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如A和/或B,可以表示单独存在A、单独存在B及同时存在A和B三种情况。符号“/”一般表示前后关联对象是一种“或”的关系。术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征。
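Because of the nature of hot news, identifying hot news query intent faces more technical difficulties than identifying other query intents. Hot news is highly time-sensitive, so the search engine must be able to provide search results quickly after an event occurs. The search engine must also react promptly to outdated information, so that news or events that are no longer popular are not recalled. And because users describe news events in different ways, hot news query intent recognition must be robust to variation in query statements.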
此外,作为众多查询意图识别中的一种,尤其是在移动终端搜索场景下,热点新闻查询意图的识别还具有如下需克服的技术难点:需要高准确率和高召回率,其中以高准确率优先;需要高计算性能和快速的响应速度;对于较短的查询语句,尤其是分词后的词汇(term)数量在8个以内的需求量大。
在相关技术中,搜索中意图识别的常见解决方案是对文本分类。通过自然语言处理方法,采用FastText(快速文本)、TextCNN(文本卷积神经网络)等文本分类模型,离线部分应用带标注意图的用户查询语句对分类模型进行训练,判断是否具有该类型的意图;在线部分利用离线部分训练得到的分类模型进行预测,并利用预测结果实时判断用户意图。该方法同时具有高准确率及可靠的泛化能力,并且作为端到端的方法也具有比较高的可控性,可以借助样本、特征等维度实现方法优化。此外,由于是抽象模型,受具体问题影响度小,因此在大量的意图分类场景下,均有非常优秀的结果。较为常用的文本分类方法如下:
-词袋模型与浅层机器学习模型:通过词袋模型,如one-hot(独热码),TF-IDF(Term Frequency-Inverse Document Frequency,词频-逆文本频率)等方法将词汇转化为可供计算的向量形式,然后通过支持向量机、朴素贝叶斯等方式实现文本分类。
-预训练词嵌入与深度学习模型:预训练词嵌入是指通过特定方式转化,使用能更好地表达语义的词嵌入模型将输入词汇转化为向量,然后利用深度学习模型,如全连接层、卷积层等,进行自动化特征提取和转化,最终得到预测结果。
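Although text classification can achieve high precision and recall and is a common, reliable approach, it runs into several problems when applied to current-affairs hot news intent recognition, owing to the characteristics of the problem itself. For example, hot news is highly time-sensitive: user demand for a news item surges and subsides within a short period, and the same query statement must switch from having no hot news intent to having one within that period, so text classification, which depends on lengthy training to work well, is a poor fit. Under real-time news demand, user query data is scarce and carries no intent labels, so there is not enough data to train a model. Whether a query statement carries hot news intent has little to do with its semantic expression, so text classification, which is essentially semantic analysis, cannot reflect whether the query statement has hot news intent.
Therefore, in view of the problems in current research and the practical requirements of current-affairs hot news, the present disclosure proposes a hot news query intent recognition method: a pre-recall mechanism first quickly determines the specific hot news items that match the user's query statement, and the relevance between the query statement and each of those hot news items is then computed to identify whether the query statement has a hot news query intent. The method makes the search system sensitive to the emergence and fading of hot news, so that it can respond accurately and quickly and identify whether a query statement has a hot news query intent.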
本公开实施例提供的方案,涉及人工智能技术领域。为了便于理解,下面首先对本公开涉及到的几个名词进行解释。
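Artificial Intelligence (AI) is the theory, methods, technology, and application systems that use digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing, operating/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.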
其中,本公开实施例提供的方案主要涉及人工智能的自然语言处理技术、机器学习技术。
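Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing technologies typically include text processing, semantic understanding, machine translation, question answering, and knowledge graphs.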
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习等技术。
随着人工智能技术研究和进步,人工智能技术在多个领域展开研究和应用,例如常见的智能家居、智能穿戴设备、虚拟助理、智能音箱、智能营销、无人驾驶、自动驾驶、无人机、机器人、智能医疗、智能客服等,相信随着技术的发展,人工智能技术将在更多的领域得到应用,并发挥越来越重要的价值。
本公开实施例提供的方案,具体通过如下实施例进行说明:
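FIG. 1 is a schematic structural diagram of a computer system provided by an exemplary embodiment of the present disclosure. The system includes several terminals 120 and a server cluster 140.
The terminal 120 may be a mobile terminal such as a mobile phone, a game console, a tablet computer, an e-book reader, smart glasses, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a smart home device, an AR (Augmented Reality) device, or a VR (Virtual Reality) device; alternatively, the terminal 120 may be a personal computer (PC), such as a laptop or a desktop computer.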
其中,终端120中可以安装有客户端应用程序,使得用户可以通过该客户端应用程序进行搜索,包括对热点新闻的搜索。
终端120与服务器集群140之间通过通信网络相连。可选的,通信网络是有线网络或无线网络。
服务器集群140是一台服务器,或者由若干台服务器组成,或者是一个虚拟化平台,或者是一个云计算服务中心。服务器集群140用于为终端120中的客户端应用程序提供后台服务,如提供线上搜索引擎服务。
此外,服务器集群140中的全部或部分服务器还可以用于提供离线服务,例如执行本公开提供的热点新闻关键词集合确定方法,来离线地确定出用于线上搜索的热点新闻关键词集合。
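Optionally, the client applications installed on different terminals 120 are the same, or the clients installed on two terminals 120 are the same type of client application for different control system platforms. Depending on the terminal platform, the specific form of the client application may also differ; for example, it may be a mobile client, a PC client, or a World Wide Web (Web) client.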
本领域技术人员可以知晓,上述终端120的数量可以更多或更少。比如上述终端可以仅为一个,或者上述终端为几十个或几百个,或者更多数量。本公开实施例对终端的数量和设备类型不加以限定。
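Optionally, the wireless or wired network described above uses standard communication technologies and/or protocols. The network is usually the Internet, but may be any network, including but not limited to any combination of a Local Area Network (LAN), a Metropolitan Area Network (MAN), a Wide Area Network (WAN), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, technologies and/or formats including Hyper Text Mark-up Language (HTML) and Extensible Markup Language (XML) are used to represent data exchanged over the network. In addition, conventional encryption technologies such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), and Internet Protocol Security (IPsec) may be used to encrypt all or some of the links. In other embodiments, customized and/or dedicated data communication technologies may be used in place of, or in addition to, the data communication technologies above.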
下面,将结合附图及实施例对本公开示例实施例中的热点新闻查询意图识别方法及热点新闻关键词集合确定方法的各个步骤进行更详细的说明。
图2示出本公开实施例中一种热点新闻查询意图识别方法流程图。本公开实施例提供的方法可以由任意具备计算处理能力的电子设备执行,在下面的举例说明中,以服务器集群140为执行主体进行示例说明。
如图2所示,热点新闻查询意图识别方法10包括:
在步骤S102中,获取客户端发送的查询语句。
如上述,服务器集群140可以提供线上搜索功能,接收终端120中客户端发送的查询语句,基于查询语句进行搜索后,向客户端返回搜索结果。
在步骤S104中,对查询语句进行分词处理,得到至少一个查询词(term)。
可以利用分词算法和/或工具对查询语句进行分词处理。例如,可以采用jieba(结巴)分词工具对查询语句进行分词,得到一个或多个查询词。
在步骤S106中,查询至少一个查询词中是否有查询词与热点新闻关键词集合中的关键词匹配。
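The hot news keyword set is, for example, obtained and stored in advance by the server cluster 140. The set may be generated beforehand offline; it contains recent hot news items together with the keywords extracted from them. To keep the hot news timely, the set is updated in real time: recent hot news is continually added and outdated hot news is removed.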
关于该集合的时效性及热点程度等指标,在具体应用时,可以根据实际需求来设定,本公开不以此为限。此外,关于该热点新闻关键词集合的生成确定方法将在下文中说明。
在一些实施例中,热点新闻查询意图识别方法10还可以包括:获取并存储该热点新闻关键词集合。
在步骤S108中,当至少一个查询词中有查询词与热点新闻关键词集合中的关键词匹配时,分别获得与各查询词匹配的关键词对应的至少一条热点新闻。
将上述至少一个查询词分别与热点新闻关键词集合中的关键词进行匹配,如果有查询词可以与热点新闻关键词集合中的关键词匹配上,则获得该被匹配的关键词对应的一条或多条热点新闻。
分别获得各被匹配的关键词对应的至少一条热点新闻,可以构成预召回的热点新闻集合。
本领域技术人员可以理解的是,上述查询词与关键词的匹配例如可以包括两者字面完全相同,或者也可以包括两者语义相同等,本公开不以此为限。
在步骤S110中,根据确定的查询语句与各条热点新闻之间的相关性量值,识别查询语句的查询意图是否为热点新闻查询意图。
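After the pre-recalled hot news set is obtained, the relevance between the query statement and each hot news item in the set is determined; this relevance may, for example, be expressed as a relevance measure. Whether the query intent of the query statement is a hot news query intent is then identified according to the relevance measures between the query statement and the hot news items.
For example, in some embodiments, after the relevance measures between the query statement and the hot news items have been determined, the hot news item with the largest relevance measure is selected, provided that this largest measure is greater than a preset relevance threshold, as the hot news query intent of the query statement; that is, the query intent is identified as a hot news query intent, and the hot news item may be returned to the client. If the largest relevance measure is less than the preset relevance threshold, the query intent of the query statement is identified as not being a hot news query intent. Selecting the hot news item whose relevance measure is the largest and exceeds the preset threshold is simple and fast.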
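The flow of steps S102 to S110 can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the disclosure: the function name `recognize_hot_news_intent`, the `tokenize` and `relevance` callables, and the dictionary layout of `keyword_to_news` are all hypothetical, and exact string equality is assumed for keyword matching. A measure exactly equal to the threshold is treated here as "not hot news", which the text leaves open.

```python
from typing import Callable, Dict, List, Optional, Set


def recognize_hot_news_intent(
    query: str,
    tokenize: Callable[[str], List[str]],
    keyword_to_news: Dict[str, Set[str]],
    relevance: Callable[[str, str], float],
    threshold: float,
) -> Optional[str]:
    """Return the ID of the best-matching hot news item, or None when the
    query is judged not to carry a hot news query intent."""
    terms = tokenize(query)                      # S104: word segmentation
    candidates: Set[str] = set()
    for term in terms:                           # S106/S108: keyword match and pre-recall
        candidates |= keyword_to_news.get(term, set())
    if not candidates:
        return None                              # no keyword matched, nothing recalled
    # S110: score every pre-recalled item once and keep the most relevant one.
    scored = {news_id: relevance(query, news_id) for news_id in sorted(candidates)}
    best = max(scored, key=lambda news_id: scored[news_id])
    return best if scored[best] > threshold else None
```

Any tokenizer can be plugged in as `tokenize`; the text names jieba as one option, and a whitespace split is enough for experimenting with the control flow.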
本公开实施例提供的热点新闻查询意图识别方法,首先将查询语句进行分词得到至少一条查询词;再将各条查询词分别与预先获得的热点新闻关键词集合中的关键词进行匹配,确定出预召回的热点新闻;之后再根据查询语句与预召回的热点新闻之间的相关性量值,来识别该查询语句是否具有热点新闻查询意图。通过该方法可以产生如下几点有益效果:第一,相比于相关技术中通过训练文本分类模型来识别查询意图的方法,该方法无需对大量具有意图标签的样本进行训练以获得具有一定精度的分类模型,该方法可以做到在线实时识别查询语句是否具有搜索热点新闻的意图;第二,通过热点新闻的预召回方式,大大减少了用户查询语句和海量热点新闻之间相关性的计算量,从而可以快速地完成意图分类,经测试,在搜索场景业务中,对热点新闻查询意图识别服务的平均时延小于3ms,因此服务并发量也得以提升,从而可以减少部 署的搜索服务器的数量;第三,该方法还可以有效提升热点新闻查询意图识别的准确率和召回率;第四,该方法还可以通过事先配置热点新闻关键词集合,实现对在线搜索热点新闻意图识别的良好干预。
图3示出本公开实施例中另一种热点新闻查询意图识别方法流程图。与图2所示的热点新闻查询意图识别方法10不同的是,图3所示的热点新闻查询意图识别方法进一步提供了图2中步骤S108的一种实施方式。
在图3所示的实施例中,热点新闻关键词集合包括:关键词-热点新闻倒排词典;关键词-热点新闻倒排词典包括:热点新闻关键词集合中各关键词及分别与各关键词对应的一条或多条热点新闻。
如图3所示,步骤S108可以进一步包括:
在步骤S1082中,根据关键词-热点新闻倒排词典,采用字典树的形式建立关键词-热点新闻倒排序索引。
基于关键词-热点新闻倒排词典中关键词与热点新闻之间的倒排结构,采用Trie树(前缀树/字典树)建立关键词倒排序索引。
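A Trie is a tree-shaped data structure designed for string matching. It solves the problem of quickly finding a string within a set of strings and is often used in the recall stage of search engine systems. Its lookup time depends mainly on the length of the longest string.
The inverted index can be implemented as a concrete storage form of a "keyword-hot news ID matrix": indexing by keyword makes it possible to quickly obtain the list of hot news items that contain a given keyword.
In step S1084, it is checked whether any of the at least one query term matches a keyword in the keyword-hot news inverted index.
Based on the keyword inverted index built above, it is quickly determined whether any query term matches a keyword in the keyword-hot news inverted index.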
如上述,查询词与关键词的匹配例如可以包括两者字面完全相同,或者也可以包括两者语义相同等。
在本公开实施例中,通过关键词倒排索引,可以快速地查找出与关键词匹配的查询词,从而可以快速地关联出预召回的热点新闻集合。
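A trie-backed inverted index of the kind described in steps S1082 and S1084 might look like the sketch below. The class names `TrieNode` and `KeywordTrie` and the news IDs are hypothetical, and only exact (literal) matching is shown; semantic matching, which the text also allows, is outside this sketch.

```python
from typing import Dict, List, Optional


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        # Posting list of hot news IDs; set only on nodes that end a keyword.
        self.news_ids: Optional[List[str]] = None


class KeywordTrie:
    """Keyword -> hot news inverted index stored as a character trie."""

    def __init__(self, inverted: Dict[str, List[str]]) -> None:
        self.root = TrieNode()
        for keyword, news_ids in inverted.items():
            self.insert(keyword, news_ids)

    def insert(self, keyword: str, news_ids: List[str]) -> None:
        node = self.root
        for ch in keyword:
            node = node.children.setdefault(ch, TrieNode())
        node.news_ids = list(news_ids)

    def lookup(self, term: str) -> List[str]:
        """Return the hot news IDs for an exactly matching keyword, else []."""
        node: Optional[TrieNode] = self.root
        for ch in term:
            node = node.children.get(ch) if node else None
            if node is None:
                return []
        return list(node.news_ids) if node and node.news_ids else []
```

Lookup cost grows with the length of the query term rather than with the number of keywords, which is the property the text relies on for fast pre-recall.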
图4示出本公开实施例中再一种热点新闻查询意图识别方法流程图。与图2所示的热点新闻查询意图识别方法10不同的是,图4所示的热点新闻查询意图识别方法进一步提供了步骤S110的一种实施方式。
如图4所示,步骤S110包括:
在步骤S1102中,分别计算各条热点新闻中包含的各查询词的TF-IDF之和。
TF-IDF为常见的提取关键词的方法,其核心优点是无监督,该特性使其能在给定词典的情况下就能获得可靠的结果,因此可以快速开发部署上线。
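TF-IDF is essentially a method for computing the weight of each word in a sentence. It is named after its two main variables: TF (Term Frequency, how often a term appears in the sentence) and IDF (Inverse Document Frequency, the inverse of the proportion of all documents that contain the term). For a query term i, the TF-IDF value is computed as shown in formula (1):
Figure PCTCN2020089839-appb-000001
where n_{i,j} denotes the number of times query term i appears in hot news item j, and D is the pre-recalled hot news set described above.
Then sum_tfidf is computed for each hot news item, as shown in formula (2):
Figure PCTCN2020089839-appb-000002
where tfidf_i is the TF-IDF value of the i-th query term contained in the recalled hot news item, and sum_tfidf is the sum of the TF-IDF values of all query terms contained in that item.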
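The images for formulas (1) and (2) are not reproduced in this text. Assuming the standard TF-IDF definition, which is consistent with the symbols n_{i,j} and D defined here, the two formulas could be written as below. The normalization by the total term count of item j, the unsmoothed logarithm, and the symbols t_i and d_j are assumptions; the original figures may differ, for example by adding a smoothing constant.

```latex
\mathrm{tfidf}_{i,j}
  = \underbrace{\frac{n_{i,j}}{\sum_{k} n_{k,j}}}_{\mathrm{TF}}
    \times
    \underbrace{\log\frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|}}_{\mathrm{IDF}}
  \tag{1}

\mathrm{sum\_tfidf} = \sum_{i} \mathrm{tfidf}_{i} \tag{2}
```

In (2) the sum runs over the query terms that actually occur in the recalled hot news item, so an item that contains more of the query's distinctive terms receives a larger sum.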
在步骤S1104中,基于各条热点新闻的TF-IDF之和,按照降序对各条热点新闻进行排序。
根据计算出的各条热点新闻的TF-IDF之和对各条热点新闻进行降序排序,并基于该排序,依次执行如下步骤。
在步骤S1106中,确定查询语句与该条热点新闻之间的相关性量值。
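In step S1108, it is determined whether the relevance measure is greater than the preset relevance threshold. If so, proceed to step S1110; otherwise, proceed to step S1112.
In step S1110, the query intent of the query statement is identified as a hot news query intent, and a query result containing this hot news item is returned to the client, with the hot news item placed at the top of the query result.
In step S1112, it is determined whether this hot news item is the last one. If so, proceed to step S1114; otherwise, return to step S1106 and process the next hot news item in descending order.
In step S1114, the query intent of the query statement is identified as not being a hot news query intent.
For example, the query intent of the query statement may be only an ordinary news query intent, or some other query intent.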
在本公开实施例中,在根据确定的查询语句与各条热点新闻之间的相关性量值,识别查询语句的查询意图是否为热点新闻查询意图的过程中,首先基于各条热点新闻中包含的各查询词的TF-IDF之和对各条热点新闻进行降序排序,可以提升识别的精准度;再按照该降序依次对各条热点新闻进行处理,以识别查询语句的查询意图是否为热点新闻查询意图的过程中,可以达到加速识别速度的目的。
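The ranked scan of FIG. 4 can be sketched as follows: candidates are sorted by their TF-IDF sum in descending order, and the (potentially more expensive) relevance measure is computed only until the first item exceeds the threshold. The function name `identify_by_ranked_scan` and the `relevance` callable are hypothetical placeholders for whichever relevance measure is used.

```python
from typing import Callable, Dict, Optional


def identify_by_ranked_scan(
    sum_tfidf: Dict[str, float],
    relevance: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Scan pre-recalled items in descending sum_tfidf order and stop early."""
    # S1104: sort candidates by their TF-IDF sum, highest first.
    ranked = sorted(sum_tfidf, key=lambda news_id: sum_tfidf[news_id], reverse=True)
    for news_id in ranked:
        # S1106/S1108: compute the relevance measure and compare with the threshold.
        if relevance(news_id) > threshold:
            return news_id          # S1110: hot news intent; this item ranks first
    return None                     # every candidate was rejected: not a hot news intent
```

The early exit is what yields the speed-up described in the text: when the top-ranked candidate already passes, the relevance measure is computed once rather than for every recalled item.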
图5示出本公开实施例中再一种热点新闻查询意图识别方法流程图。与图4所示的热点新闻查询意图识别方法不同的是,图5所示的热点新闻查询意图识别方法进一步提供了步骤S1106的一种实施方式。
在图5所示的实施例中,热点新闻关键词集合包括:热点新闻-关键词正排词典;热点新闻-关键词正排词典包括:热点新闻关键词集合中各条热点新闻及分别与各条热点新闻对应的关键词、相关词及各关键词和各相关词的词频。
如图5所示,步骤S1106包括:
在步骤S61中,根据热点新闻-关键词正排词典,基于BM25文本相似度算法,计算查询语句与该条热点新闻的相关性量值。
BM25是一种用来评价查询语句和文档之间相关性的算法,它是一种基于概率检索模型提出的算法。本公开中可以利用BM25算法计算查询语句和各条热点新闻之间的相关性。查询语句和各条热点新闻之间的相关性量值BM25值的计算公式如公式(3)所示:
Figure PCTCN2020089839-appb-000003
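Here RSV_d is the BM25 relevance value between the query statement and hot news item d; t ∈ q means that query term t belongs to the query term sequence q formed from the query terms. tf_td is the frequency of the query term in hot news item d, and tf_tq is its frequency in the query statement. L_d and L_ave are, respectively, the length of hot news item d and the average length of the hot news items in the whole hot news set. k_1 and k_3 are tuning parameters. k_1 controls the scaling of the term frequency (TF) in the hot news set: k_1 = 0 amounts to ignoring term frequency, while a large k_1 corresponds to using the raw term frequency. b is another tuning parameter (0 ≤ b ≤ 1) that determines how strongly hot news length is scaled: b = 1 scales term frequency fully by hot news length, and b = 0 applies no length normalization.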
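The image for formula (3) is not reproduced in this text. The symbols described here (RSV_d, tf_td, tf_tq, L_d, L_ave, k_1, k_3, b) match the widely used Okapi BM25 form with both a document-side and a query-side term-frequency factor, so the formula is presumably of the following shape. N (the number of hot news items in the set) and df_t (the number of items containing term t) are not named in the surrounding text and are assumptions, as is the exact form of the IDF factor.

```latex
RSV_d = \sum_{t \in q}
  \log\frac{N}{df_t}
  \cdot
  \frac{(k_1 + 1)\, tf_{td}}
       {k_1\left((1 - b) + b\,\dfrac{L_d}{L_{ave}}\right) + tf_{td}}
  \cdot
  \frac{(k_3 + 1)\, tf_{tq}}{k_3 + tf_{tq}}
  \tag{3}
```

As a check against the parameter descriptions: with k_1 = 0 the middle factor reduces to 1, so term frequency in the hot news item is ignored; with b = 0 the ratio L_d / L_ave drops out, so length plays no role.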
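Note that when the BM25 value above is used as the relevance measure, the corresponding relevance threshold is threshold_bm25. This threshold can be set experimentally according to classification performance; the present disclosure is not limited in this respect.
In this embodiment of the present disclosure, the relevance measure between the query statement and the hot news item is determined using the BM25 text similarity algorithm. BM25 adds several tunable parameters to the traditional TF-IDF algorithm, which makes it more flexible and powerful in application and highly practical.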
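A BM25 scorer over the forward dictionary could be sketched as follows. It assumes the common Okapi BM25 form with a log(N/df) IDF factor; the function name `bm25_score`, the argument layout, and the default values of `k1`, `k3`, and `b` are illustrative choices rather than values given in the disclosure, and `avg_len` is assumed to be positive.

```python
import math
from collections import Counter
from typing import Dict, List


def bm25_score(
    query_terms: List[str],
    doc_tf: Dict[str, int],      # term -> frequency in the hot news item (forward dictionary)
    doc_len: int,                # L_d
    avg_len: float,              # L_ave, assumed > 0
    n_docs: int,                 # N: number of hot news items (assumed symbol)
    doc_freq: Dict[str, int],    # df_t: number of items containing each term (assumed symbol)
    k1: float = 1.2,
    k3: float = 8.0,
    b: float = 0.75,
) -> float:
    """BM25 relevance between a segmented query and one hot news item."""
    score = 0.0
    for term, tf_tq in Counter(query_terms).items():
        tf_td = doc_tf.get(term, 0)
        df_t = doc_freq.get(term, 0)
        if tf_td == 0 or df_t == 0:
            continue                                    # term absent: contributes nothing
        idf = math.log(n_docs / df_t)
        length_norm = (1 - b) + b * doc_len / avg_len   # document length normalization
        doc_part = ((k1 + 1) * tf_td) / (k1 * length_norm + tf_td)
        query_part = ((k3 + 1) * tf_tq) / (k3 + tf_tq)
        score += idf * doc_part * query_part
    return score
```

The returned value would then be compared against threshold_bm25, which the text says is chosen experimentally.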
图6示出本公开实施例中再一种热点新闻查询意图识别方法流程图。与图4所示的热点新闻查询意图识别方法不同的是,图6所示的热点新闻查询意图识别方法进一步提供了步骤S1106的一种实施方式。
如图6所示,步骤S1106可以进一步包括:
在步骤S62中,基于深层结构语义模型DSSM预测查询语句与热点新闻之间的相关性量值。
其中,深层结构语义模型DSSM是基于搜索引擎中历史查询语句与对应的热点新闻点击数据进行训练得到的。
例如,可以首先利用DNN(Deep Neural Networks,深度神经网络)模型将历史查询语句和对应的热点新闻点击数据表达为低维语义向量,再通过余弦相似度距离来计算两个语义向量之间的距离,来训练DSSM语义相似度模型。
在线搜索时,利用DNN网络将查询语句和热点新闻表达为低维语义向量,再利用经训练的DSSM模型来预测查询语句与热点新闻之间的相关性量值。
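The similarity step of a DSSM-style model reduces to the cosine between two low-dimensional semantic vectors. The sketch below shows only that step; the encoders that produce the vectors (the DNN towers) are not shown, and the function name `cosine_similarity` is an illustrative choice.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two semantic vectors; 0.0 if either is all zeros."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)
```

With query and hot news embeddings produced by the trained model, this value can serve directly as the relevance measure compared against the preset threshold.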
本公开实施例提供的热点新闻查询意图识别方法,在确定查询语句与各条热点新闻的相关性量值时,引入词嵌入向量特征,可以在语义层面上计算查询语句与热点新闻之间的相关性,进一步提升了相关性确定的准确性。
此外,进一步地,还可以使用BERT(Bidirectional Encoder Representations from Transformers,来自转换器的双向编码器表示)形式的“预训练+微调”语言模型。例如,在线搜索时,通过BERT预训练模型联合语义相似度模型一起进行查询语句与各条热点新闻之间相关性的确定,能够满足语义层面更加精准的相关性计算。
本公开还进一步提供了一种热点新闻关键词集合确定方法,该方法可以离线实施,例如也可以由图1中所示的服务器集群140实施,或者也可以由其他服务器实施,服务器集群140从该其他服务器获取该热点新闻关键词集合,以用于线上搜索及上述的热点新闻查询意图识别。
图7示出本公开实施例中一种热点新闻关键词集合确定方法流程图。
参考图7,热点新闻关键词集合确定方法20包括:
在步骤S202中,获取多条热点新闻。
如上述,为了保证热点新闻的时效性,会实时获取热点新闻,以不断更新热点新闻关键词集合。
关于该集合的时效性及热点程度等指标,在具体应用时,可以根据实际需求来设定,本公开不以此为限。
在步骤S204中,分别对各条热点新闻进行分词处理,获得各条热点新闻的至少一个新闻词汇。
例如,可以采用与上述热点新闻查询意图识别方法中一致的jieba分词器来分别对每条获取的热点新闻进行分词处理。
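In some embodiments, the terms produced by the tokenizer may additionally undergo N-GRAM (N-gram model) processing: a window of size N is slid over the content of the hot news item with a step of 1, taking each segmented term as the unit, to form a sequence of fragments of N terms. For example, the operation may be performed with N set to 1, 2, and 3, and the final news terms retained according to the following rules: when N is 1, no filtering is applied and all segmented terms are included; when N is 2, only combinations in which at least one of the two terms is a single character, or which contain a digit, are retained; when N is 3, only combinations in which all three terms are single characters are retained. The at least one final news term is obtained after processing with N = 1, 2, and 3.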
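The N = 1/2/3 retention rules can be sketched as below. The function name `ngram_vocabulary` is hypothetical, fragments are joined without a separator (which suits Chinese text), and the "contains a digit" test for bigrams reflects one reading of the rule and should be treated as an assumption.

```python
from typing import List


def ngram_vocabulary(terms: List[str]) -> List[str]:
    """Build the news vocabulary from segmented terms using N = 1, 2, 3 windows."""
    vocab: List[str] = list(terms)                  # N = 1: keep every segmented term
    for i in range(len(terms) - 1):                 # N = 2: sliding window, step 1
        pair = terms[i:i + 2]
        has_single_char = any(len(t) == 1 for t in pair)
        has_digit = any(ch.isdigit() for t in pair for ch in t)
        if has_single_char or has_digit:
            vocab.append("".join(pair))
    for i in range(len(terms) - 2):                 # N = 3: sliding window, step 1
        triple = terms[i:i + 3]
        if all(len(t) == 1 for t in triple):        # keep only all-single-character runs
            vocab.append("".join(triple))
    return vocab
```

The filters target the case where the tokenizer has split a new name or phrase into single characters; recombining adjacent single characters gives the keyword extractor a chance to recover it.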
在步骤S206中,根据各条热点新闻的至少一个新闻词汇,分别确定各条热点新闻的关键词。
通过关键词提取方法,分别提取出各条热点新闻的关键词。
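In step S208, the hot news keyword set is determined according to the hot news items and the keywords of each hot news item.
For example, the hot news keyword set is built by aggregating the keywords of the hot news items.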
在一些实施例中,例如可以采用上述的TF-IDF方法来分别提取各条热点新闻的关键词。例如,分别计算各条热点新闻的至少一个新闻词汇的TF-IDF,将计算出的TF-IDF与预设的关键词阈值进行比较,如果大于该关键词阈值,则将对应的新闻词汇确定为该条热点新闻的关键词之一。其中,TF-IDF的计算公式可具体参见上述的公式(1),在此不再赘述。
在一些实施例中,热点新闻关键词集合包括:关键词-热点新闻倒排词典;关键词-热点新闻倒排词典包括:热点新闻关键词集合中各关键词及分别与各关键词对应的一条或多条热点新闻。
本公开实施例公开的热点新闻关键词集合确定方法,通过离线整理热点新闻信息,为在线端的搜索提前做好准备,可以提升在线端的识别速度,此外由于不断更新该集合中的热点新闻,可以保证热点新闻的时效性,避免召回过时的热点新闻。
图8示出本公开实施例中另一种热点新闻关键词集合确定方法流程图。
参考图8,热点新闻关键词集合确定方法30进一步还可以包括:
在步骤S302中,根据各条热点新闻的至少一个新闻词汇,分别确定各条热点新闻的相关词。
例如,也可以采用计算各新闻词汇的TF-IDF的方法来确定各条热点新闻的相关词。如在分别计算出各新闻词汇的TF-IDF后,通过与预设的相关词阈值比较,如果大于该相关词阈值且小于上述的关键词阈值,则确定其为相关词。
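In addition to the keywords and related words of each hot news item, irrelevant words may also be determined. Keywords are terms that to some extent represent the core of the hot news item, such as time, place, and people; they are used both for the pre-recall of hot news described above and for computing the degree of match. Related words are terms that to some extent reflect the internal information of the hot news item, such as key adjectives and trending tags; they are used for computing the degree of match. Irrelevant words are terms that may be connected with the hot news item but do not clearly convey its meaning, such as particles and stop words.
Taking one hot news item as an example, after the TF-IDF of each of its news terms is computed, the keyword threshold and related-word threshold above are used together to decide whether a news term is a keyword, a related word, or an irrelevant word. If its TF-IDF is greater than the keyword threshold, it is classified as a keyword; if it is less than the keyword threshold but greater than the related-word threshold, it is classified as a related word; otherwise, it is classified as an irrelevant word.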
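The two-threshold rule can be sketched as follows. The function name `classify_terms` and the label strings are hypothetical; a score exactly equal to the keyword threshold is treated here as a related word, a boundary case the text does not settle.

```python
from typing import Dict


def classify_terms(
    tfidf: Dict[str, float],
    keyword_threshold: float,
    related_threshold: float,
) -> Dict[str, str]:
    """Label each news term as 'keyword', 'related', or 'irrelevant' by its TF-IDF."""
    if related_threshold > keyword_threshold:
        raise ValueError("related_threshold must not exceed keyword_threshold")
    labels: Dict[str, str] = {}
    for term, score in tfidf.items():
        if score > keyword_threshold:
            labels[term] = "keyword"
        elif score > related_threshold:
            labels[term] = "related"
        else:
            labels[term] = "irrelevant"
    return labels
```

The manual demotion rules described next (for example, bigrams containing measure words) would run as a post-processing pass over these labels.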
此外,在通过上述TF-IDF方法区分出关键词、相关词及无关词后,还可以进一步将关键词降级到相关词,例如删除带有量词的2-GRAM组合,如1年等;再例如,具有高IDF但是无明确热点新闻信息的词汇,如男人、女人等。
在确定出各条热点新闻的关键词、相关词及其对应的TF-IDF后,可以将其进行存储,例如存入数据库中,以用于生成热点新闻关键词集合。
热点新闻关键词集合除了包含上述的关键词-热点新闻倒排词典外,还可以进一步包括:热点新闻-关键词正排词典,热点新闻-关键词正排词典包括:热点新闻关键词集合中各条热点新闻及分别与各条热点新闻对应的关键词、相关词及各关键词和各相关词的词频。热点新闻-关键词正排词典可用于上述对预召回热点新闻与查询语句之间的相关性计算。
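The two dictionaries can be built in one pass over the extracted terms, as in the sketch below. The function name `build_dictionaries` and the input layout are hypothetical, and for brevity the forward dictionary stores keywords and related words in a single term-to-frequency map rather than as separate fields.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Set, Tuple


def build_dictionaries(
    news_terms: Dict[str, List[str]],     # news ID -> all segmented news terms
    keywords: Dict[str, Set[str]],        # news ID -> extracted keywords
    related: Dict[str, Set[str]],         # news ID -> extracted related words
) -> Tuple[Dict[str, List[str]], Dict[str, Dict[str, int]]]:
    """Return (keyword -> news IDs inverted dictionary, news ID -> term frequencies)."""
    inverted: Dict[str, List[str]] = defaultdict(list)
    forward: Dict[str, Dict[str, int]] = {}
    for news_id, terms in news_terms.items():
        counts = Counter(terms)
        kws = keywords.get(news_id, set())
        kept = kws | related.get(news_id, set())
        # Forward dictionary: frequencies of keywords and related words, used for scoring.
        forward[news_id] = {term: counts[term] for term in sorted(kept)}
        # Inverted dictionary: only keywords are used for pre-recall.
        for kw in sorted(kws):
            inverted[kw].append(news_id)
    return dict(inverted), forward
```

Only keywords enter the inverted dictionary because, per the text, related words contribute to the match score but do not trigger pre-recall.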
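Correspondingly, in step S208', the hot news keyword set is determined according to the hot news items and the keywords and related words of each hot news item.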
图9示出本公开实施例中再一种热点新闻关键词集合确定方法流程图。图9所示的热点新闻关键词集合确定方法进一步示出另一种根据各条所述热点新闻的至少一个新闻词汇,分别确定各条所述热点新闻的所述关键词及所述相关词的实施例。
参考图9,热点新闻关键词集合确定方法40包括:
在步骤S202中,获取多条热点新闻。
在步骤S204中,分别对各条热点新闻进行分词处理,获得各条热点新闻的至少一个新闻词汇。
上述步骤与热点新闻关键词集合确定方法20中的步骤相同,在此不再赘述。
在步骤S402中,分别将各条热点新闻的至少一个新闻词汇,基于字到向量word2vector算法转换为词向量,并分别确定各条热点新闻的至少一个新闻词汇的词性特征及位置特征。
在步骤S404中,分别计算各条热点新闻的至少一个新闻词汇的TF-IDF。
新闻词汇的TF-IDF的计算如上述,在此不再赘述。
在步骤S406中,分别将各条热点新闻的至少一个新闻词汇的词向量、词性特征、位置特征及TF-IDF输入至经训练的BILSTM-CRF(双向长短期记忆网络-条件随机场)模型,确定出各条热点新闻的关键词及相关词。
在一些实施例中,BILSTM-CRF模型还可以进一步分类出上述的无关词。
In step S408, the hot news keyword set is determined according to the hot news items and the keywords and related words of each hot news item.
根据本公开实施例提供的热点新闻关键词集合确定方法,在进行关键词、相关词提取时,引入了语义特征,进一步使用预训练词向量加深度学习的模式,结合词性、位置、TF-IDF值等特征综合进行关键词提取,可以进一步提升关键词提取的准确率和召回率。
需要注意的是,上述附图仅是根据本公开示例性实施例的方法所包括的处理的示意性说明,而不是限制目的。易于理解,上述附图所示的处理并不表明或限制这些处理的时间顺序。另外,也易于理解,这些处理可以是例如在多个模块中同步或异步执行的。
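The following are device embodiments of the present disclosure, which can be used to carry out the method embodiments. For details not disclosed in the device embodiments, refer to the method embodiments of the present disclosure.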
图10示出本公开实施例中一种热点新闻查询意图识别装置示意图。
如图10所示,热点新闻查询意图识别装置50包括:语句获取模块502、分词处理模块504、关键词匹配模块506、新闻召回模块508及意图识别模块510。
其中,语句获取模块502用于获取客户端发送的查询语句;
分词处理模块504用于对查询语句进行分词处理,得到至少一个查询词;
关键词匹配模块506用于查询至少一个查询词中是否有查询词与热点新闻关键词集合中的关键词匹配;
新闻召回模块508用于当至少一个查询词中有查询词与热点新闻关键词集合中的关键词匹配时,分别获得与各查询词匹配的关键词对应的至少一条热点新闻;
意图识别模块510用于根据确定的查询语句与各条热点新闻之间的相关性量值,识别查询语句的查询意图是否为热点新闻查询意图。
本公开实施例提供的热点新闻查询意图识别装置,首先将查询语句进行分词得到至少一条查询词;再将各条查询词分别与预先获得的热点新闻关键词集合中的关键词进行匹配,确定出预召回的热点新闻;之后再根据查询语句与预召回的热点新闻之间的相关性量值,来识别该查询语句是否具有热点新闻查询意图。通过该方法可以产生如下几点有益效果:第一,相比于相关技术中通过训练文本分类模型来识别查询意图的方法,该方法无需对大量具有意图标签的样本进行训练以获得具有一定精度的分类模型,该方法可以做到在线实时识别查询语句是否具有搜索热点新闻的意图;第二,通过热点新闻的预召回方式,大大减少了用户查询语句和海量热点新闻之间相关性的计算量,从而可以快速地完成意图分类,经测试,在搜索场景业务中,对热点新闻查询意图识别服务的平均时延小于3ms,因此服务并发量也得以提升,从而可以减少部署的搜索服务器的数量;第三,该方法还可以有效提升热点新闻查询意图识别的准确率和召回率;第四,该方法还可以通过事先配置热点新闻关键词集合,实现对在线搜索热点新闻意图识别的良好干预。
在一些实施例中,意图识别模块510包括:TF-IDF计算单元、TF-IDF排序单元及意图识别单元。TF-IDF计算单元用于分别计算各条热点新闻中包含的各查询词的词频-逆文本频率TF-IDF之和;TF-IDF排序单元用于基于各条热点新闻的TF-IDF之和,按照降序对各条热点新闻进行排序;意图识别单元用于基于降序,依次对各条热点新闻进行如下处理:确定查询语句与热点新闻之间的相关性量值;当相关性量值大于预设的相关性阈值时,识别查询语句的查询意图为热点新闻查询意图,向客户端返回包含热点新闻的查询结果,且热点新闻位于查询结果的最前面;当相关性量值不大于相关性阈值时,处理下一条热点新闻。
在一些实施例中,热点新闻关键词集合包括:热点新闻-关键词正排词典;热点新闻-关键词正排词典包括:热点新闻关键词集合中各条热点新闻及分别与各条热点新闻对应的关键词、相关词及各关键词和各相关词的词频;意图识别单元用于根据热点新闻-关键词正排词典,基于BM25文本相似度算法,计算查询语句与热点新闻的相关性量值。
在一些实施例中,意图识别单元用于基于深层结构语义模型DSSM预测查询语句与热点新闻之间的相关性量值;其中,深层结构语义模型DSSM是基于搜索引擎中历史查询语句与对应的热点新闻点击数据进行训练得到的。
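In some embodiments, the hot news keyword set includes a keyword-hot news inverted dictionary, which contains each keyword in the hot news keyword set and the one or more hot news items corresponding to each keyword. The keyword matching module 506 includes an index building unit and a keyword lookup unit. The index building unit is configured to build a keyword-hot news inverted index in the form of a trie according to the keyword-hot news inverted dictionary; the keyword lookup unit is configured to check whether any of the at least one query term matches a keyword in the keyword-hot news inverted index.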
在一些实施例中,热点新闻查询意图识别装置50还包括:集合获取模块,用于获取并存储热点新闻关键词集合。
图11示出本公开实施例中一种热点新闻关键词集合确定装置示意图。
如图11所示,热点新闻关键词集合确定装置60包括:新闻获取模块602、分词处理模块604、关键词提取模块606及集合确定模块608。
其中,新闻获取模块602用于获取多条热点新闻;
分词处理模块604用于分别对各条热点新闻进行分词处理,获得各条热点新闻的至少一个新闻词汇;
关键词提取模块606用于根据各条热点新闻的至少一个新闻词汇,分别确定各条热点新闻的关键词;
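The set determination module 608 is configured to determine the hot news keyword set according to the hot news items and the keywords of each hot news item.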
本公开实施例公开的热点新闻关键词集合确定装置,通过离线整理热点新闻信息,为在线端的搜索提前做好准备,可以提升在线端的识别速度,此外由于不断更新该集合中的热点新闻,可以保证热点新闻的时效性,避免召回过时的热点新闻。
在一些实施例中,热点新闻关键词集合包括:关键词-热点新闻倒排词典;关键词-热点新闻倒排词典包括:热点新闻关键词集合中各关键词及分别与各关键词对应的一条或多条热点新闻。
在一些实施例中,关键词提取模块606还用于根据各条热点新闻的至少一个新闻词汇,分别确定各条热点新闻的相关词。热点新闻关键词集合包括:热点新闻-关键词正排词典;热点新闻-关键词正排词典包括:热点新闻关键词集合中各条热点新闻及分别与各条热点新闻对应的关键词、相关词及各关键词和各相关词的词频。
在一些实施例中,关键词提取模块606包括:TF-IDF计算单元及词汇提取单元。TF-IDF计算单元用于分别计算各条热点新闻的至少一个新闻词汇的TF-IDF;词汇提取单元用于基于各条热点新闻的至少一个新闻词汇的TF-IDF及预设的TF-IDF阈值,分别确定各条热点新闻的关键词及相关词。
在一些实施例中,关键词提取模块606包括:特征提取单元、TF-IDF计算单元及词汇提取单元。特征提取单元用于分别将各条热点新闻的至少一个新闻词汇,基于字到向量word2vector算法转换为词向量,并分别确定各条热点新闻的至少一个新闻词汇的词性特征及位置特征;TF-IDF计算单元用于分别计算各条热点新闻的至少一个新闻词汇的TF-IDF;词汇提取单元用于分别将各条热点新闻的至少一个新闻词汇的词向量、词性特征、位置特征及TF-IDF输入至经训练的双向长短期记忆网络-条件随机场BILSTM-CRF模型,确定出各条热点新闻的关键词及相关词。
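In some embodiments, the word segmentation module 604 includes a word-cutting unit and a segment processing unit. The word-cutting unit is configured to segment each hot news item to obtain at least one segment for each item; the segment processing unit is configured to apply N-GRAM processing multiple times to the at least one segment of each hot news item to obtain at least one news term for each item.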
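Those skilled in the art will understand that aspects of the present disclosure may be implemented as a system, a method, or a program product. Accordingly, aspects of the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, and the like), or an embodiment combining hardware and software, which may collectively be referred to herein as a "circuit", "module", or "system".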
下面参照图12来描述根据本公开的这种实施方式的电子设备800。图12显示的电子设备800仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
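As shown in FIG. 12, the electronic device 800 takes the form of a general-purpose computing device. Its components may include, but are not limited to: the at least one processing unit 810, the at least one storage unit 820, and a bus 830 connecting the different system components (including the storage unit 820 and the processing unit 810).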
其中,所述存储单元存储有程序代码,所述程序代码可以被所述处理单元810执行,使得所述处理单元810执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。例如,所述处理单元810可以执行如图2中所示的步骤S102,获取客户端发送的查询语句;步骤S104,对查询语句进行分词处理,得到至少一个查询词;步骤S106,查询至少一个查询词中是否有查询词与热点新闻关键词集合中的关键词匹配;步骤S108,当至少一个查询词中有查询词与热点新闻关键词集合中的关键词匹配时,分别获得与各查询词匹配的关键词对应的至少一条热点新闻;步骤S110,根据确定的查询语句与各条热点新闻之间的相关性量值,识别查询语句的查询意图是否为热点新闻查询意图。
存储单元820可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)8201和/或高速缓存存储单元8202,还可以进一步包括只读存储单元(ROM)8203。
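The storage unit 820 may also include a program/utility 8204 having a set of (at least one) program modules 8205. Such program modules 8205 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The bus 830 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a system bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
The electronic device 800 may also communicate with one or more external devices 700 (such as a keyboard, a pointing device, or a Bluetooth device), with one or more devices that enable a user to interact with the electronic device 800, and/or with any device (such as a router or a modem) that enables the electronic device 800 to communicate with one or more other computing devices. Such communication may take place through an input/output (I/O) interface 850. The electronic device 800 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 860. As shown in the figure, the network adapter 860 communicates with the other modules of the electronic device 800 through the bus 830. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.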
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本公开实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本公开实施方式的方法。
在本公开的示例性实施例中,还提供了一种计算机可读存储介质,其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本公开的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当所述程序产品在终端设备上运行时,所述程序代码用于使所述终端设备执行本说明书上述“示例性方法”部分中描述的根据本公开各种示例性实施方式的步骤。
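Referring to FIG. 13, a program product 900 for implementing the above method according to an embodiment of the present disclosure is described. It may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited to this. In this document, a readable storage medium may be any tangible medium that contains or stores a program, where the program can be used by, or in combination with, an instruction execution system, apparatus, or device.
The program product may use any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium that can send, propagate, or transmit a program for use by, or in combination with, an instruction execution system, apparatus, or device.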
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言的任意组合来编写用于执行本公开操作的程序代码,所述程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本公开的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。
本领域技术人员在考虑说明书及实践这里公开的发明后,将容易想到本公开的其它实施方案。本公开旨在涵盖本公开的任何变型、用途或者适应性变化,这些变型、用途或者适应性变化遵循本公开的一般性原理并包括本公开未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的,本公开的真正范围和精神由所附的权利要求指出。

Claims (16)

  1. 一种热点新闻查询意图识别方法,其特征在于,包括:
    获取客户端发送的查询语句;
    对所述查询语句进行分词处理,得到至少一个查询词;
    查询所述至少一个查询词中是否有查询词与热点新闻关键词集合中的关键词匹配;
    当所述至少一个查询词中有查询词与所述热点新闻关键词集合中的关键词匹配时,分别获得与各所述查询词匹配的关键词对应的至少一条热点新闻;以及
    根据确定的所述查询语句与各条所述热点新闻之间的相关性量值,识别所述查询语句的查询意图是否为热点新闻查询意图。
  2. 根据权利要求1所述的方法,其特征在于,根据确定的所述查询语句与各条所述热点新闻之间的相关性量值,识别所述查询语句的查询意图是否为热点新闻查询意图,包括:
    分别计算各条所述热点新闻中包含的各所述查询词的词频-逆文本频率TF-IDF之和;
    基于各条所述热点新闻的所述TF-IDF之和,按照降序对各条所述热点新闻进行排序;以及
    基于所述降序,依次对各条所述热点新闻进行如下处理:确定所述查询语句与所述热点新闻之间的相关性量值;当所述相关性量值大于预设的相关性阈值时,识别所述查询语句的查询意图为热点新闻查询意图。
  3. 根据权利要求2所述的方法,其特征在于,所述热点新闻关键词集合包括:热点新闻-关键词正排词典;所述热点新闻-关键词正排词典包括:所述热点新闻关键词集合中各条热点新闻及分别与各条热点新闻对应的关键词、相关词及各所述关键词和各所述相关词的词频;
    确定所述查询语句与所述热点新闻之间的相关性量值,包括:根据所述热点新闻-关键词正排词典,基于BM25文本相似度算法,计算所述查询语句与所述热点新闻的相关性量值。
  4. 根据权利要求2所述的方法,其特征在于,确定所述查询语句与所述热点新闻之间的相关性量值,包括:基于深层结构语义模型DSSM预测所述查询语句与所述热点新闻之间的相关性量值;其中,所述深层结构语义模型DSSM是基于搜索引擎中历史查询语句与对应的热点新闻点击数据进行训练得到的。
  5. 根据权利要求1-4任一项所述的方法,其特征在于,所述热点新闻关键词集合包括:关键词-热点新闻倒排词典;所述关键词-热点新闻倒排词典包括:所述热点新闻关键词集合中各关键词及分别与各关键词对应的一条或多条热点新闻;
    查询所述至少一个查询词中是否有查询词与热点新闻关键词集合中的关键词匹配,包括:根据所述关键词-热点新闻倒排词典,采用字典树的形式建立关键词-热点新闻倒排序索引;及查找所述至少一个查询词中是否有查询词与所述关键词-热点新闻倒排序索引中的关键词匹配。
  6. 根据权利要求1-4任一项所述的方法,其特征在于,还包括:获取并存储所述热点新闻关键词集合。
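  7. A method for determining a hot news keyword set, characterized by comprising:
    acquiring multiple hot news items;
    performing word segmentation on each of the hot news items to obtain at least one news term of each of the hot news items;
    determining keywords of each of the hot news items according to the at least one news term of each of the hot news items; and
    determining the hot news keyword set according to the hot news items and the keywords of each of the hot news items.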
  8. 根据权利要求7所述的方法,其特征在于,所述热点新闻关键词集合包括:关键词-热点新闻倒排词典;所述关键词-热点新闻倒排词典包括:所述热点新闻关键词集合中各关键词及分别与各关键词对应的一条或多条热点新闻。
  9. 根据权利要求7或8所述的方法,其特征在于,还包括:根据各条所述热点新闻的至少一个新闻词汇,分别确定各条所述热点新闻的相关词;所述热点新闻关键词集合包括:热点新闻-关键词正排词典;所述热点新闻-关键词正排词典包括:所述热点新闻关键词集合中各条热点新闻及分别与各条热点新闻对应的关键词、相关词及各所述关键词和各所述相关词的词频。
  10. 根据权利要求9所述的方法,其特征在于,根据各条所述热点新闻的至少一个新闻词汇,分别确定各条所述热点新闻的所述关键词及所述相关词,包括:
    分别计算各条所述热点新闻的至少一个新闻词汇的TF-IDF;以及
    基于各条所述热点新闻的至少一个新闻词汇的TF-IDF及预设的TF-IDF阈值,分别确定各条所述热点新闻的所述关键词及所述相关词。
  11. 根据权利要求9所述的方法,其特征在于,根据各条所述热点新闻的至少一个新闻词汇,分别确定各条所述热点新闻的所述关键词及所述相关词,包括:
    分别将各条所述热点新闻的至少一个新闻词汇,基于字到向量word2vector算法转换为词向量,并分别确定各条所述热点新闻的至少一个新闻词汇的词性特征及位置特征;
    分别计算各条所述热点新闻的至少一个新闻词汇的TF-IDF;以及
    分别将各条所述热点新闻的至少一个新闻词汇的词向量、词性特征、位置特征及TF-IDF输入至经训练的双向长短期记忆网络-条件随机场BILSTM-CRF模型,确定出各条所述热点新闻的所述关键词及所述相关词。
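  12. The method according to claim 7 or 8, characterized in that performing word segmentation on each of the hot news items to obtain at least one news term of each of the hot news items comprises: performing word segmentation on each of the hot news items to obtain at least one segment of each of the hot news items; and applying N-GRAM processing multiple times to the at least one segment of each of the hot news items to obtain the at least one news term of each of the hot news items.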
  13. 一种热点新闻查询意图识别装置,其特征在于,包括:
    语句获取模块,用于获取客户端发送的查询语句;
    分词处理模块,用于对所述查询语句进行分词处理,得到至少一个查询词;
    关键词匹配模块,用于查询所述至少一个查询词中是否有查询词与热点新闻关键词集合中的关键词匹配;
    新闻召回模块,用于当所述至少一个查询词中有查询词与所述热点新闻关键词集合中的关键词匹配时,分别获得与各所述查询词匹配的关键词对应的至少一条热点新闻;以及
    意图识别模块,用于根据确定的所述查询语句与各条所述热点新闻之间的相关性量值,识别所述查询语句的查询意图是否为热点新闻查询意图。
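  14. A device for determining a hot news keyword set, characterized by comprising:
    a news acquisition module, configured to acquire multiple hot news items;
    a word segmentation module, configured to perform word segmentation on each of the hot news items to obtain at least one news term of each of the hot news items;
    a keyword extraction module, configured to determine keywords of each of the hot news items according to the at least one news term of each of the hot news items; and
    a set determination module, configured to determine the hot news keyword set according to the hot news items and the keywords of each of the hot news items.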
  15. 一种电子设备,其特征在于,包括:
    处理器;以及
    存储器,用于存储所述处理器的可执行指令;
    其中,所述处理器配置为经由执行所述可执行指令来执行权利要求1-6任一项或权利要求7-12任一项所述的方法。
  16. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现权利要求1-6任一项或权利要求7-12任一项所述的方法。
PCT/CN2020/089839 2020-05-12 2020-05-12 热点新闻意图识别方法、装置、设备及可读存储介质 WO2021226840A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/089839 WO2021226840A1 (zh) 2020-05-12 2020-05-12 热点新闻意图识别方法、装置、设备及可读存储介质
CN202080100566.5A CN115516447A (zh) 2020-05-12 2020-05-12 热点新闻意图识别方法、装置、设备及可读存储介质

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/089839 WO2021226840A1 (zh) 2020-05-12 2020-05-12 热点新闻意图识别方法、装置、设备及可读存储介质

Publications (1)

Publication Number Publication Date
WO2021226840A1 true WO2021226840A1 (zh) 2021-11-18

Family

ID=78526131

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/089839 WO2021226840A1 (zh) 2020-05-12 2020-05-12 热点新闻意图识别方法、装置、设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN115516447A (zh)
WO (1) WO2021226840A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880588A (zh) * 2022-06-13 2022-08-09 四川封面传媒科技有限责任公司 基于知识图谱的新闻热度预测方法
CN115033594A (zh) * 2022-08-10 2022-09-09 之江实验室 一种给出置信度的垂直领域检索方法与装置
CN115221954A (zh) * 2022-07-12 2022-10-21 中国电信股份有限公司 用户画像方法、装置、电子设备以及存储介质
CN116340626A (zh) * 2023-03-20 2023-06-27 长沙松柏之志传媒有限公司 内容推荐方法、推荐系统及相关设备
CN116861915A (zh) * 2023-06-02 2023-10-10 国网江苏省电力有限公司南京供电分公司 一种基于nlp与热点词元分析的用电诉求辨析方法和系统

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117235242B (zh) * 2023-11-15 2024-02-06 浙江力石科技股份有限公司 一种基于智能问答数据库的热点信息筛选方法及系统

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030163302A1 (en) * 2002-02-27 2003-08-28 Hongfeng Yin Method and system of knowledge based search engine using text mining
CN102254039A (zh) * 2011-08-11 2011-11-23 武汉安问科技发展有限责任公司 一种基于搜索引擎的网络搜索方法
CN102955798A (zh) * 2011-08-25 2013-03-06 腾讯科技(深圳)有限公司 一种基于搜索引擎的搜索方法及搜索服务器
CN102402619A (zh) * 2011-12-23 2012-04-04 广东威创视讯科技股份有限公司 一种搜索方法和装置

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114880588A (zh) * 2022-06-13 2022-08-09 四川封面传媒科技有限责任公司 基于知识图谱的新闻热度预测方法
CN114880588B (zh) * 2022-06-13 2024-04-26 四川封面传媒科技有限责任公司 基于知识图谱的新闻热度预测方法
CN115221954A (zh) * 2022-07-12 2022-10-21 中国电信股份有限公司 用户画像方法、装置、电子设备以及存储介质
CN115221954B (zh) * 2022-07-12 2023-10-31 中国电信股份有限公司 用户画像方法、装置、电子设备以及存储介质
CN115033594A (zh) * 2022-08-10 2022-09-09 之江实验室 一种给出置信度的垂直领域检索方法与装置
CN115033594B (zh) * 2022-08-10 2022-11-18 之江实验室 一种给出置信度的垂直领域检索方法与装置
CN116340626A (zh) * 2023-03-20 2023-06-27 长沙松柏之志传媒有限公司 内容推荐方法、推荐系统及相关设备
CN116861915A (zh) * 2023-06-02 2023-10-10 国网江苏省电力有限公司南京供电分公司 一种基于nlp与热点词元分析的用电诉求辨析方法和系统

Also Published As

Publication number Publication date
CN115516447A (zh) 2022-12-23

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20935555

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 18.04.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20935555

Country of ref document: EP

Kind code of ref document: A1