CN111274366A - Search recommendation method and device, equipment and storage medium - Google Patents

Search recommendation method and device, equipment and storage medium

Info

Publication number
CN111274366A
CN111274366A
Authority
CN
China
Prior art keywords
candidate text
character
searched
information
word vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010220548.2A
Other languages
Chinese (zh)
Inventor
沈强
谭松波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN202010220548.2A priority Critical patent/CN111274366A/en
Publication of CN111274366A publication Critical patent/CN111274366A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a search recommendation method, a search recommendation device, equipment and a storage medium, wherein the method comprises the following steps: determining the word vector correlation degree between the information to be searched and each candidate text in the candidate text set; determining the character relevancy between the information to be searched and each candidate text; fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text; and recommending candidate texts corresponding to the evaluation values meeting the specific conditions.

Description

Search recommendation method and device, equipment and storage medium
Technical Field
The embodiment of the application relates to the internet technology, and relates to but is not limited to a search recommendation method, a search recommendation device, search recommendation equipment and a storage medium.
Background
Users often search for needed information in the mass of information on the internet, and search engines have become indispensable tools in users' life and work. A search engine is a retrieval technology that retrieves relevant texts from the internet by using a specific strategy according to user requirements and a certain algorithm, and then feeds the texts back to users. A key step is determining the relevance between the information to be searched input by the user and the candidate texts, and then recommending the candidate texts that are highly relevant to the information to be searched to the user.
Therefore, how to make the candidate texts recommended to the user more accurate has certain significance for better meeting the user requirements.
Disclosure of Invention
In view of this, embodiments of the present application provide a search recommendation method and apparatus, a device, and a storage medium.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a search recommendation method, where the method includes: determining the word vector correlation degree between the information to be searched and each candidate text in the candidate text set; determining the character relevancy between the information to be searched and each candidate text; fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text; and recommending candidate texts corresponding to the evaluation values meeting the specific conditions.
In a second aspect, an embodiment of the present application provides a search recommendation apparatus, including: the first determination module is used for determining the word vector relevancy between the information to be searched and each candidate text in the candidate text set; the second determination module is used for determining the character relevancy between the information to be searched and each candidate text; the fusion module is used for fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text; and the recommending module is used for recommending the candidate texts corresponding to the evaluation values meeting the specific conditions.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory stores a computer program that is executable on the processor, and the processor executes the computer program to implement steps in any search recommendation method according to the embodiment of the present application.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the steps in any one of the search recommendation methods in the embodiment of the present application.
In the method, the word vector relevance and the character relevance between the information to be searched and each candidate text are determined; then the word vector relevance and the character relevance corresponding to each candidate text are fused to obtain an evaluation value corresponding to the candidate text; in this way, the similarity between the information to be searched and the candidate text can be determined more accurately, so that the candidate texts recommended to the user are more accurate and better meet the user's requirements.
Drawings
FIG. 1 is a schematic diagram illustrating an implementation flow of a search recommendation method according to an embodiment of the present application;
FIG. 2 is a general flow chart of a model for determining an evaluation value of a candidate text answer according to an embodiment of the present application;
FIG. 3 is a diagram illustrating search recommendation results returned by a BERT model only in an embodiment of the present application;
FIG. 4 is a diagram illustrating search recommendation results returned by fusing a BERT model and an N-GRAM according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a search recommendation apparatus according to an embodiment of the present application;
fig. 6 is a schematic diagram of a hardware entity of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, specific technical solutions of the present application will be described in further detail below with reference to the accompanying drawings in the embodiments of the present application. The following examples are intended to illustrate the present application but are not intended to limit the scope of the present application.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is understood that "some embodiments" may be the same subset or different subsets of all possible embodiments, and may be combined with each other without conflict.
It should be noted that the terms "first/second/third" in the embodiments of the present application merely distinguish similar or different objects and do not represent a specific ordering of the objects; it should be understood that "first/second/third" may be interchanged in a particular order or sequence where permissible, so that the embodiments of the present application described herein can be implemented in orders other than those illustrated or described herein.
The search recommendation method provided by the embodiment of the application can be applied to electronic equipment, and the electronic equipment can be various types of equipment with a search function in the implementation process, for example, the electronic equipment can include an intelligent mobile terminal (e.g., a mobile phone), a tablet computer, an electronic book, a notebook computer, a desktop computer, and the like. The functions implemented by the method can be implemented by calling program code by a processor in the electronic device, and the program code can be stored in a computer storage medium.
An embodiment of the present application provides a search recommendation method, and fig. 1 is a schematic view of an implementation flow of the search recommendation method according to the embodiment of the present application, and as shown in fig. 1, the method may include the following steps 101 to 104:
step 101, determining word vector correlation between information to be searched and each candidate text in the candidate text set.
It can be understood that the information to be searched is a keyword or sentence, also called a query statement (query), that the user inputs into the search engine by voice or text. The candidate text set may be an existing corpus or another database. A candidate text may be a web page title, the web page content corresponding to a web page title, or both the web page title and its corresponding content. The larger the word vector relevance, the closer the semantics between the information to be searched and the candidate text are.
The electronic device may determine the word vector correlation degree between the information to be searched and the candidate text through steps 201 to 203 of the following embodiments.
Step 102, determining the character relevancy between the information to be searched and each candidate text.
It is known that a word vector is a semantic representation of text, and characters refer to words, etc. contained in the text. Therefore, the character relevance and the word vector relevance are measures representing the closeness degree between two texts from different angles. Word vector relatedness characterizes how close the semantics are, while character relatedness characterizes how similar the words are.
The electronic device may determine the character correlation between the information to be searched and the candidate text through steps 204 to 206 of the following embodiments.
And 103, fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text.
It can be understood that the evaluation value of the candidate text is actually a comprehensive evaluation result of the corresponding word vector relevance and the character relevance, the degree of similarity between the candidate text and the information to be searched is evaluated, and the larger the evaluation value is, the closer the semantic meaning and the word meaning between the candidate text and the information to be searched are represented. The electronic device may determine the evaluation value of the corresponding candidate text through step 207 and step 208 of the following embodiments.
And 104, recommending candidate texts corresponding to the evaluation values meeting the specific conditions.
The specific condition may take various forms. For example, if the specific condition is that the evaluation value is greater than a specific threshold, the candidate texts whose evaluation values are greater than the specific threshold are recommended accordingly. The specific condition may also be the K largest evaluation values, where K is an integer greater than 0; the candidate texts corresponding to the K largest evaluation values are then recommended accordingly.
When the candidate texts are recommended for the user, the candidate texts can be ranked according to the order of the evaluation values from large to small, so that the candidate texts with higher evaluation values are preferentially displayed, and the user can conveniently select the candidate texts.
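As a minimal sketch (all names are illustrative, not from the patent), ranking by evaluation value and keeping the top K amounts to a descending sort:

```python
def recommend_top_k(candidates, scores, k):
    # sort candidate texts by evaluation value, largest first, and keep the top K
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked[:k]]
```

For example, with evaluation values [0.1, 0.9, 0.5, 0.7] over four candidates and K=2, the second and fourth candidates are recommended, in that order.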
In the embodiment of the application, a search recommendation method is provided, in the method, word vector relevancy and character relevancy between information to be searched and each candidate text are determined; then, fusing the word vector correlation degree and the character correlation degree corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text; therefore, the similarity between the information to be searched and the candidate text can be more accurately determined, so that the candidate text recommended to the user is more accurate, and the user requirements are better met.
An embodiment of the present application further provides a search recommendation method, where the method may include the following steps 201 to 209:
step 201, respectively performing feature extraction on the information to be searched and each candidate text to obtain a first word vector of the information to be searched and a second word vector of the corresponding candidate text.
In some embodiments, the feature extraction performed on the information to be searched to obtain the first word vector may be implemented by the following steps 2011 and 2012: step 2011, determining a text vector, a position vector and an initial word vector of the information to be searched; step 2012, processing the text vector, the position vector and the initial word vector by using a BERT (Bidirectional Encoder Representations from Transformers) model to obtain the first word vector.
It should be noted that the primary input of the BERT model is the original word vector of each character or word in the text, i.e., the initial word vector in step 2011, which may be initialized randomly or pre-trained with an algorithm such as Word2Vec to serve as an initial value; the output is a vector representation of each character or word in the text after full-text semantic information has been fused in. That is, the first word vector is a vector representation that fuses the full-text semantic information of the information to be searched, and likewise for the second word vector.
The value of the text vector is learned automatically during training of the BERT model; it depicts the global semantic information of the text and is fused with the semantic information of each single character or word. As for the position vector: because the semantic information carried by characters or words appearing at different positions of the text differs (for example, "I love you" versus "you love me"), the BERT model adds a different vector to the characters or words at different positions to distinguish them. Finally, the BERT model takes the sum of the text vector, the position vector, and the initial word vector as the model input to obtain the first word vector.
Of course, the determination method of the second word vector is similar to the determination method of the first word vector, except that the processed text content is different, the former processed text is the candidate text, and the latter processed text is the information to be searched.
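The three-way sum described above can be sketched as a simple element-wise addition. This is a toy illustration with made-up low-dimensional integer vectors; real BERT embeddings are learned and typically 768-dimensional:

```python
def bert_input_embedding(token_vec, text_vec, position_vec):
    # BERT's per-token input is the element-wise sum of the initial word
    # (token) vector, the text/segment vector, and the position vector
    return [t + s + p for t, s, p in zip(token_vec, text_vec, position_vec)]

# toy 4-dimensional vectors, values illustrative only
combined = bert_input_embedding([1, 2, 0, 1], [3, 0, 1, 1], [5, 1, 1, 0])
```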
Step 202, processing the first word vector and a second word vector of the ith candidate text by using a classifier model obtained by training to obtain the probability that the information to be searched and the ith candidate text belong to similar categories; wherein i is a positive integer less than or equal to the total number of candidate texts in the candidate set;
step 203, determining the probability as the word vector correlation degree between the information to be searched and the ith candidate text.
In some embodiments, an initial classifier of the classifier model may be trained using a set of sample data to derive the classifier model. Each sample data of the sample data set comprises a word vector (the word vector is a result after feature extraction) corresponding to each of the two texts and a label indicating whether the two texts are similar, for example, if the two texts are similar, the label is 1, otherwise, the label is 0; thus, the classifier model with the binary classification function is obtained through training. Further, the classifier model can be used to determine the word vector correlation degree between the information to be searched and the candidate text.
It should be noted that, for each candidate text, the electronic device may determine the word vector correlation degree with the information to be searched through the above steps 202 and 203.
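A minimal sketch of such a binary classifier follows. In the patent the classifier's weights would be learned from the labeled similar/dissimilar sample pairs; the hand-crafted pair features (cosine similarity and L1 distance) and the fixed logistic weights here are illustrative assumptions only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def pair_features(u, v):
    # simple features for a pair of word vectors: cosine similarity and L1 distance
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    cos = dot / (nu * nv) if nu and nv else 0.0
    l1 = sum(abs(a - b) for a, b in zip(u, v))
    return [cos, l1]

def similarity_probability(u, v, weights=(4.0, -0.5), bias=-1.0):
    # logistic "classifier": probability that the two texts belong to the
    # similar category; weights/bias are illustrative, not learned values
    cos, l1 = pair_features(u, v)
    return sigmoid(weights[0] * cos + weights[1] * l1 + bias)
```

The probability returned for a pair of identical vectors is close to 1, and for orthogonal vectors it is close to 0, matching the 1/0 labels used in training.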
Step 204, according to M segmentation grammars, respectively performing character string segmentation on the information to be searched to obtain M first character string sets; wherein M is an integer greater than 0;
step 205, according to the M segmentation grammars, respectively performing character string segmentation on the ith candidate text to obtain M second character string sets.
In the embodiment of the present application, the value of M is not limited; that is, the electronic device may perform character string segmentation on the text according to one or more segmentation grammars. In some embodiments, the segmentation grammar may be an N-GRAM. For example, the N-GRAM may be a 1-GRAM, a 2-GRAM, or a 3-GRAM. The M segmentation grammars may comprise any one N-GRAM or a variety of different N-GRAMs. For example, the M segmentation grammars may include a 1-GRAM and a 2-GRAM; or a 1-GRAM and a 3-GRAM; or a 2-GRAM and a 3-GRAM; or a 1-GRAM, a 2-GRAM, and a 3-GRAM.
For example, if the information to be searched is "腾讯云" (Tencent Cloud), segmenting it with a 1-GRAM yields the first character string set {腾/讯/云}, where "/" is a separator added for readability. Segmenting the same information with a 2-GRAM yields the first character string set {腾讯/讯云}.
It should be noted that a character is a broad description, i.e. a character may include words, letters, numbers, etc. One character string may include only one character or may include a plurality of characters. The character string segmentation method of each candidate text is the same.
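Character string segmentation under an N-GRAM can be sketched in a few lines (a hypothetical helper, not from the patent):

```python
def ngrams(text, n):
    # split text into the set of overlapping character strings of length n
    return {text[i:i + n] for i in range(len(text) - n + 1)}
```

For example, `ngrams("腾讯云", 1)` yields {"腾", "讯", "云"} and `ngrams("腾讯云", 2)` yields {"腾讯", "讯云"}, matching the example above.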
Step 206, determining the character correlation degree between the information to be searched and the ith candidate text according to the M first character string sets and the M second character string sets.
For each candidate text, the electronic device determines the character relevance with the information to be searched in the same way. For example, the electronic device may determine the character relevance between each candidate text and the information to be searched through step 304 and step 305 of the following embodiments.
It should be noted that, in the embodiment of the present application, the determination order of the word vector relevancy and the character relevancy is not limited. The electronic equipment can determine the word vector relevancy first and then determine the character relevancy, and also can determine the character relevancy first and then determine the word vector relevancy; the electronic device may also determine character relevancy and word vector relevancy between the information to be searched and the candidate text in parallel.
Step 207, obtaining a penalty coefficient, wherein the penalty coefficient is used for characterizing the accuracy of the determination method of the character relevancy or the determination method of the word vector relevancy.
In some embodiments, the penalty factor is a confidence level of a determination method of character relevance or a confidence level of a determination method of word vector relevance.
And 208, determining an evaluation value of the corresponding candidate text according to the penalty coefficient, the character relevance corresponding to each candidate text and the word vector relevance.
The penalty factor may be preconfigured. In some embodiments, where the penalty factor is used to characterize a confidence level of a determination method of character relevance, the electronic device may determine a first product between the penalty factor and the character relevance; and determining the evaluation value of the corresponding candidate text according to the word vector correlation degree corresponding to each candidate text and the first product.
For example, the evaluation value Score1 of the candidate text may be determined according to the following formula (1):
Score1 = ScoreBert1 + α1 * F1    (1);
in formula (1), ScoreBert1 represents the word vector relevance between the candidate text and the information to be searched, α1 denotes the penalty coefficient, and F1 represents the character relevance between the candidate text and the information to be searched.
In other embodiments, when the penalty factor is used to characterize the confidence of the determination method of the word vector relevance, the electronic device may further determine a second product between the penalty factor and the word vector relevance; and determining the evaluation value of the corresponding candidate text according to the character relevancy corresponding to each candidate text and the second product.
For example, the electronic device may also determine the evaluation value Score2 of the candidate text according to the following formula (2):
Score2 = α2 * ScoreBert1 + F1    (2);
Understandably, the penalty coefficient α1 characterizes the accuracy of the determination method of the character relevance, and α2 characterizes the accuracy of the determination method of the word vector relevance. Taking formula (1) as an example, when the value of α1 is greater than 1, the determination method of the character relevance is more accurate than that of the word vector relevance, so the character relevance accounts for a larger proportion of the evaluation value; this can reduce the evaluation value of a candidate text that has a large word vector relevance but is actually far from the content of the information to be searched, so that the candidate texts recommended to the user better match the user's actual requirements. When the value of α1 is less than 1, the determination method of the character relevance is less accurate than that of the word vector relevance. α1 and α2 may be the same or different.
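Formulas (1) and (2) reduce to one-liners; the alpha values passed in below are illustrative, not values given in the patent:

```python
def score_formula_1(bert_rel, char_rel, alpha1):
    # formula (1): word vector relevance plus penalty-weighted character relevance
    return bert_rel + alpha1 * char_rel

def score_formula_2(bert_rel, char_rel, alpha2):
    # formula (2): penalty-weighted word vector relevance plus character relevance
    return alpha2 * bert_rel + char_rel
```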
In step 209, candidate texts corresponding to evaluation values satisfying specific conditions are recommended.
An embodiment of the present application further provides a search recommendation method, where the method may include the following steps 301 to 307:
step 301, determining word vector correlation between information to be searched and each candidate text in a candidate text set;
step 302, according to M segmentation grammars, respectively performing character string segmentation on the information to be searched to obtain M first character string sets; wherein M is an integer greater than 0;
step 303, according to the M segmentation grammars, performing character string segmentation on the ith candidate text to obtain M second character string sets; wherein i is a positive integer less than or equal to the total number of candidate texts in the candidate set;
and step 304, determining the distance between the first character string set and the second character string set obtained by adopting the same segmentation grammar.
In some embodiments, the distance may be the number of character strings shared by the two sets. For example, the M segmentation grammars are a 1-GRAM and a 2-GRAM, the information to be searched is "腾讯云" (Tencent Cloud), and the candidate text is "腾讯是游戏公司" (Tencent is a gaming company). Segmenting the information to be searched with the 1-GRAM yields the first character string set {腾/讯/云}, where "/" is a separator added for readability. Segmenting the candidate text with the 1-GRAM yields the second character string set {腾/讯/是/游/戏/公/司}; the character strings shared by the two sets are {腾/讯}, so the distance between the two sets is easily obtained as 2.
Segmenting the information to be searched with the 2-GRAM yields the first character string set {腾讯/讯云}, and segmenting the candidate text yields the second character string set {腾讯/讯是/是游/游戏/戏公/公司}; the only character string shared by the two sets is {腾讯}, so the distance is 1.
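The set-overlap distance in this example can be computed directly (a sketch built on a hypothetical n-gram splitter):

```python
def ngram_set(text, n):
    # overlapping character strings of length n
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def set_distance(query, candidate, n):
    # the "distance" here is the number of character strings the two sets share
    return len(ngram_set(query, n) & ngram_set(candidate, n))
```

For the example above, `set_distance("腾讯云", "腾讯是游戏公司", 1)` gives 2 and `set_distance("腾讯云", "腾讯是游戏公司", 2)` gives 1.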
Step 305, determining the character correlation degree between the information to be searched and the ith candidate text according to each distance.
It should be noted that, for each candidate text, the determination method of the character relevance is the same, i.e., it can be realized through the above steps 302 to 305. Taking the ith candidate text as an example, in some embodiments the electronic device may determine a match score according to the jth distance, the total number of character strings included in the jth first character string set, and the total number of character strings included in the jth second character string set of the ith candidate text, where j is an integer greater than 0 and less than or equal to M; each match score is then weighted to obtain the character relevance between the information to be searched and the ith candidate text.
That is, each distance corresponds to a match score, and the determination method for each match score is the same. Continuing the above example, the M segmentation grammars are a 1-GRAM and a 2-GRAM, the information to be searched is "腾讯云", and the candidate text is "腾讯是游戏公司". Segmenting the two texts with the 1-GRAM yields the sets {腾/讯/云} and {腾/讯/是/游/戏/公/司}; the first character string set contains 3 character strings in total, the second character string set contains 7, and the distance between the two sets is 2. Based on this distance, the determined precision value is Precision = 2/3 and the determined recall value is Recall = 2/7. Substituting these two values into the following formula (3) yields the matching score Match_Score corresponding to the distance:
Match_Score = 2 * Precision * Recall / (Precision + Recall)    (3);
in formula (3), Precision represents the precision value and Recall represents the recall value.
Similarly, based on the first and second character string sets determined with the 2-GRAM, the corresponding matching score can be obtained according to formula (3).
The weight for each match score is related to the granularity of the corresponding segmentation grammar. For example, if the M segmentation grammars are a 1-GRAM and a 2-GRAM, the weight of the 2-GRAM-based match score may be configured to be greater than that of the 1-GRAM-based match score.
Of course, in some embodiments, the electronic device may directly add each of the matching scores corresponding to the ith candidate text, and determine the added result as the character relevancy corresponding to the candidate text.
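Assuming formula (3) is the harmonic mean of precision and recall (as the worked example suggests), the per-grammar match score and its weighted fusion can be sketched as follows; the weight values are illustrative only:

```python
def match_score(query, candidate, n):
    q = {query[i:i + n] for i in range(len(query) - n + 1)}
    c = {candidate[i:i + n] for i in range(len(candidate) - n + 1)}
    shared = len(q & c)          # the "distance" between the two sets
    if shared == 0:
        return 0.0
    precision = shared / len(q)  # shared strings over query-set size
    recall = shared / len(c)     # shared strings over candidate-set size
    return 2 * precision * recall / (precision + recall)   # formula (3)

def character_relevance(query, candidate, weights):
    # weighted sum of per-grammar match scores; e.g. weights = {1: 0.4, 2: 0.6}
    # gives the 2-GRAM score more weight than the 1-GRAM score, as suggested above
    return sum(w * match_score(query, candidate, n) for n, w in weights.items())
```

For the running example, `match_score("腾讯云", "腾讯是游戏公司", 1)` evaluates to 2 * (2/3) * (2/7) / ((2/3) + (2/7)) = 0.4.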
And step 306, fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text.
In some embodiments, the electronic device determines the evaluation value of the corresponding candidate text according to a penalty coefficient, the character relevance corresponding to each candidate text, and the word vector relevance. For example, the electronic device may determine the evaluation value Score1 of the candidate text according to the following formula (4):
Score1 = ScoreBert1 + α1 * F1    (4);
in formula (4), ScoreBert1 represents the word vector relevance between the candidate text and the information to be searched, α1 denotes the penalty coefficient, and F1 represents the character relevance between the candidate text and the information to be searched.
Step 307 is to recommend a candidate text corresponding to an evaluation value satisfying a specific condition.
The internet has many search engines, such as Baidu and Sogou; a search engine is a retrieval technology that retrieves relevant information from the internet by using a specific strategy according to user requirements and a certain algorithm, and then feeds it back to the user. One key technology in a search engine is matching the relevance between the search query statement and the result titles, and then returning the results that are highly relevant to the query statement.
For this key technology, the related solutions are mainly based on statistics, deep learning, language models, etc. Statistical methods, such as Term Frequency-Inverse Document Frequency (TF-IDF) and Best Match 25 (BM25), need to calculate a large amount of statistical information, such as the number of documents and term frequencies, and cannot capture the semantic information of the user's query. Deep-learning-based methods, such as Deep Reconstruction Classification Networks (DRCN), ARC-I, and ARC-II, require a large number of labeled relevance corpora, and the quality of the corpora greatly affects the model effect. Language-model methods include Word to Vector (Word2Vec), Doc2Vec, and related models; the word vectors they produce contain semantics but have certain limitations and are not accurate enough.
Based on this, an exemplary application of the embodiment of the present application in a practical application scenario will be described below.
In the embodiment of the application, the BERT model is selected to obtain word vectors; compared with related methods, the word vectors obtained by this model contain richer semantic features, but the BERT model also introduces many erroneous terms. Based on this, in the embodiment of the application, a character string distance calculation model based on the N-GRAM is added: the more identical characters two sentences share, the higher the sentence score, which can suppress the deviation caused by BERT to a certain extent.
In the embodiment of the application, the core points of the solution are as follows:
firstly, word vectors are obtained through a BERT model; these word vectors are better than those of Word2Vec and Doc2Vec and carry more semantic information than TF-IDF and BM25;
secondly, the BERT model may produce some results with great deviation, so in the embodiment of the present application an N-GRAM-based string distance calculation model is also designed and added. This model automatically analyzes the string matching situation and returns a score (i.e., the character relevancy described in the above embodiments): the higher the matching degree, the higher the score. This score is then fused with the BERT score.
Fig. 2 is a general flow chart of the model for determining the evaluation value of a candidate text (answer). The model takes a query sentence and an answer as input and returns a score through model calculation; the higher the score, the higher the relevancy between the query sentence and the answer.
It should be noted that:
1, the candidate text set is mainly crawled from the internet (e.g., microblogs).
2, the model shown in fig. 2 is mainly divided into two parts, a BERT model and an N-GRAM model; the word vector relevancy and the character relevancy of the two input texts can be obtained through these two models respectively. For example, as shown in fig. 2, the query sentence "james" and an answer statement about "james" are input into the BERT model and the N-GRAM model; the obtained word vector relevancy is 0.73415, and the obtained character relevancy is 0.5751.
3, the BERT model may be a published pre-trained model that is then fine-tuned with a similar data set.
4, the N-GRAM-based string distance model first counts all possible 1-GRAM and 2-GRAM sets of the query statement and the answer statement, and then calculates the corresponding precision (Precision) and recall (Recall) values. Finally, the character relevancy F1 value is calculated according to the following formula (5); a higher value indicates a higher degree of matching.
F1 = [2 × Precision1 × Recall1/(Precision1 + Recall1) + 2 × Precision2 × Recall2/(Precision2 + Recall2)]/2 (5)
In the formula, Precision1 and Recall1 are the precision and recall values determined based on the 1-GRAM, and Precision2 and Recall2 are the precision and recall values determined based on the 2-GRAM.
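The computation in note 4 can be sketched in Python. Because the image of formula (5) did not survive extraction, averaging the two per-grammar F1 values below is one plausible reading, not necessarily the patent's exact definition:

```python
def ngram_set(text, n):
    """All contiguous character n-grams of `text` (the N-GRAM string sets)."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def precision_recall(query, answer, n):
    """Precision: share of the answer's n-grams also found in the query.
    Recall: share of the query's n-grams found in the answer."""
    q, a = ngram_set(query, n), ngram_set(answer, n)
    if not q or not a:
        return 0.0, 0.0
    common = len(q & a)
    return common / len(a), common / len(q)

def f1_character_relevancy(query, answer):
    """Average of the per-grammar (1-GRAM and 2-GRAM) F1 scores."""
    f1s = []
    for n in (1, 2):
        p, r = precision_recall(query, answer, n)
        f1s.append(0.0 if p + r == 0 else 2 * p * r / (p + r))
    return sum(f1s) / len(f1s)
```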
The score ScoreBert of the BERT model is combined with F1 to obtain the final score Score, as shown in the following formula (6), where α is a penalty coefficient:
Score = ScoreBert + α × F1 (6).
In evaluating the model effect, the test data is a set of titles selected from different fields on the network. When searching for "james sports", the results returned using the BERT model alone are shown in fig. 3. It can be seen that the BERT model learns some shared semantics: terms such as "exercisers" and "high speed" all relate to sports. However, the model also produces many bad results, such as "stock market".
When the N-GRAM model was added to the BERT model and "james sports" was searched again, the returned results are shown in fig. 4. Compared with fig. 3, the results in fig. 4 show a qualitative change: the first 10 results are basically related to the query "james sports", and the model additionally surfaces information on the team of "james", his nickname, and "sports".
Based on the foregoing embodiments, the present application provides a search recommendation apparatus, where the apparatus includes modules and units included in the modules, and may be implemented by a processor in an electronic device; of course, the implementation can also be realized through a specific logic circuit; in implementation, the processor may be a Central Processing Unit (CPU), a Microprocessor (MPU), a Digital Signal Processor (DSP), a Field Programmable Gate Array (FPGA), or the like.
Fig. 5 is a schematic structural diagram of a search recommendation apparatus according to an embodiment of the present application. As shown in fig. 5, the apparatus 500 includes a first determining module 501, a second determining module 502, a fusing module 503, and a recommending module 504, wherein:
a first determining module 501, configured to determine a word vector relevancy between information to be searched and each candidate text in the candidate text set;
a second determining module 502, configured to determine a character relevancy between the information to be searched and each of the candidate texts;
a fusion module 503, configured to fuse the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value of the corresponding candidate text;
a recommending module 504, configured to recommend a candidate text corresponding to an evaluation value that satisfies a specific condition.
In some embodiments, the second determining module 502 is configured to: according to the M segmentation grammars, respectively carrying out character string segmentation on the information to be searched to obtain M first character string sets; wherein M is an integer greater than 0; according to the M segmentation grammars, respectively carrying out character string segmentation on the ith candidate text to obtain M second character string sets; wherein i is a positive integer less than or equal to the total number of candidate texts in the candidate set; and determining the character correlation degree between the information to be searched and the ith candidate text according to the M first character string sets and the M second character string sets.
In some embodiments, the second determining module 502 is configured to: determining the distance between a first character string set and a second character string set obtained by adopting the same segmentation grammar; and determining the character relevancy between the information to be searched and the ith candidate text according to each distance.
In some embodiments, M is an integer greater than 1, and the second determining module 502 is configured to: determining a matching score between the first character string set and the ith second character string set of the candidate text according to the jth distance, the total number of character strings included in the first character string set and the total number of character strings included in the ith second character string set of the candidate text; wherein j is an integer greater than 0 and less than or equal to M; and weighting each matching score to obtain the character relevancy between the information to be searched and the ith candidate text.
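The exact matching-score formula (referred to above as formula (3)) is not reproduced in this excerpt. The sketch below uses a Dice-style measure as one plausible instantiation, treating the "distance" as the number of character strings the two sets share; both choices are assumptions:

```python
def matching_score(common, total_first, total_second):
    """One plausible matching score from the j-th distance (assumed here to be
    the number of shared character strings) and the two set sizes.

    common: number of strings present in both the first and second string set.
    total_first: total number of strings in the first string set.
    total_second: total number of strings in the second string set.
    """
    if total_first + total_second == 0:
        return 0.0
    # Dice-style normalization: 1.0 when the sets are identical, 0.0 when disjoint.
    return 2 * common / (total_first + total_second)
```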
In some embodiments, the fusion module 503 is configured to: obtaining a penalty coefficient, wherein the penalty coefficient is used for representing the accuracy of a determination method of character relevancy or a determination method of word vector relevancy; and determining an evaluation value of the corresponding candidate text according to the penalty coefficient, the character relevance corresponding to each candidate text and the word vector relevance.
In some embodiments, the first determining module 501 is configured to: respectively extracting the characteristics of the information to be searched and each candidate text to obtain a first word vector of the information to be searched and a second word vector of the corresponding candidate text; processing the first word vector and a second word vector of the ith candidate text by using a classifier model obtained by training to obtain the probability that the information to be searched and the ith candidate text belong to similar categories; determining the probability as the word vector correlation degree between the information to be searched and the ith candidate text; wherein i is a positive integer less than or equal to the total number of candidate texts of the candidate set.
In some embodiments, the first determining module 501 is configured to: determining a text vector, a position vector and an initial word vector of the information to be searched; and processing the text vector, the position vector and the initial word vector by using a BERT model to obtain the first word vector.
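The step above mirrors how BERT builds its input representation: the text (segment) vector, position vector, and initial word vector are summed element-wise before entering the encoder. The toy NumPy sketch below illustrates only that summation with made-up sizes and random embedding tables; a real BERT model uses learned embeddings and a transformer encoder on top:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, max_len, n_seg, dim = 100, 16, 2, 8   # toy sizes, not BERT's real ones

tok_emb = rng.normal(size=(vocab, dim))      # initial word vectors
pos_emb = rng.normal(size=(max_len, dim))    # position vectors
seg_emb = rng.normal(size=(n_seg, dim))      # text (segment) vectors

def bert_input_vectors(token_ids, segment_ids):
    """Element-wise sum of the three embeddings for each token position."""
    positions = np.arange(len(token_ids))
    return tok_emb[token_ids] + pos_emb[positions] + seg_emb[segment_ids]

x = bert_input_vectors([5, 7, 9], [0, 0, 0])   # shape (3, dim)
```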
The above description of the apparatus embodiments, similar to the above description of the method embodiments, has similar beneficial effects as the method embodiments. For technical details not disclosed in the embodiments of the apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be noted that, in the embodiment of the present application, if the search recommendation method is implemented in the form of a software functional module and sold or used as a standalone product, the search recommendation method may also be stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be an intelligent mobile terminal (e.g., a mobile phone), a tablet computer, an electronic book, a notebook computer, a desktop computer, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read Only Memory (ROM), a magnetic disk, or an optical disk. Thus, embodiments of the present application are not limited to any specific combination of hardware and software.
Correspondingly, an embodiment of the present application provides an electronic device. Fig. 6 is a schematic diagram of a hardware entity of the electronic device according to the embodiment of the present application. As shown in fig. 6, the hardware entity of the electronic device 600 includes a memory 601 and a processor 602, the memory 601 storing a computer program operable on the processor 602, and the processor 602 implementing the steps in the search recommendation method provided in the above embodiments when executing the program.
The memory 601 is configured to store instructions and applications executable by the processor 602, and may also buffer data (e.g., image data, audio data, voice communication data, and video communication data) to be processed or already processed by the processor 602 and modules in the electronic device 600, and may be implemented by a FLASH memory (FLASH) or a Random Access Memory (RAM).
Correspondingly, the embodiment of the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the steps in the search recommendation method provided in the above embodiment.
Here, it should be noted that: the above description of the storage medium and device embodiments is similar to the description of the method embodiments above, with similar advantageous effects as the method embodiments. For technical details not disclosed in the embodiments of the storage medium and apparatus of the present application, reference is made to the description of the embodiments of the method of the present application for understanding.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application. The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a/an" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of the unit is only a logical functional division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units; can be located in one place or distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
Those of ordinary skill in the art will understand that: all or part of the steps for realizing the method embodiments can be completed by hardware related to program instructions, the program can be stored in a computer readable storage medium, and the program executes the steps comprising the method embodiments when executed; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Read Only Memory (ROM), a magnetic disk, or an optical disk.
Alternatively, the integrated units described above in the present application may be stored in a computer-readable storage medium if they are implemented in the form of software functional modules and sold or used as independent products. Based on such understanding, the technical solutions of the embodiments of the present application may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for enabling an electronic device (which may be an intelligent mobile terminal (e.g., a mobile phone), a tablet computer, an electronic book, a notebook computer, a desktop computer, etc.) to execute all or part of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a removable storage device, a ROM, a magnetic or optical disk, or other various media that can store program code.
The methods disclosed in the several method embodiments provided in the present application may be combined arbitrarily without conflict to obtain new method embodiments.
Features disclosed in several of the product embodiments provided in the present application may be combined in any combination to yield new product embodiments without conflict.
The features disclosed in the several method or apparatus embodiments provided in the present application may be combined arbitrarily, without conflict, to arrive at new method embodiments or apparatus embodiments.
The above description is only for the embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A search recommendation method, the method comprising:
determining the word vector correlation degree between the information to be searched and each candidate text in the candidate text set;
determining the character relevancy between the information to be searched and each candidate text;
fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text;
and recommending candidate texts corresponding to the evaluation values meeting the specific conditions.
2. The method of claim 1, wherein the determining the character relevance between the information to be searched and each candidate text comprises:
according to the M segmentation grammars, respectively carrying out character string segmentation on the information to be searched to obtain M first character string sets; wherein M is an integer greater than 0;
according to the M segmentation grammars, respectively carrying out character string segmentation on the ith candidate text to obtain M second character string sets; wherein i is a positive integer less than or equal to the total number of candidate texts in the candidate set;
and determining the character correlation degree between the information to be searched and the ith candidate text according to the M first character string sets and the M second character string sets.
3. The method according to claim 2, wherein the determining the character correlation between the information to be searched and the i-th candidate text according to the M first character string sets and the M second character string sets comprises:
determining the distance between a first character string set and a second character string set obtained by adopting the same segmentation grammar;
and determining the character relevancy between the information to be searched and the ith candidate text according to each distance.
4. The method according to claim 3, wherein M is an integer greater than 1, and said determining the character correlation between the information to be searched and the i-th candidate text according to each of the distances comprises:
determining a matching score between the first character string set and the ith second character string set of the candidate text according to the jth distance, the total number of character strings included in the first character string set and the total number of character strings included in the ith second character string set of the candidate text; wherein j is an integer greater than 0 and less than or equal to M;
and weighting each matching score to obtain the character relevancy between the information to be searched and the ith candidate text.
5. The method according to claim 1, wherein the fusing the word vector relevancy and the character relevancy corresponding to each of the candidate texts to obtain the evaluation value of the corresponding candidate text comprises:
obtaining a penalty coefficient, wherein the penalty coefficient is used for representing the accuracy of a determination method of character relevancy or a determination method of word vector relevancy;
and determining an evaluation value of the corresponding candidate text according to the penalty coefficient, the character relevance corresponding to each candidate text and the word vector relevance.
6. The method of claim 1, wherein determining a word vector relevance between the information to be searched and each candidate text in the candidate text set comprises:
respectively extracting the characteristics of the information to be searched and each candidate text to obtain a first word vector of the information to be searched and a second word vector of the corresponding candidate text;
processing the first word vector and a second word vector of the ith candidate text by using a classifier model obtained by training to obtain the probability that the information to be searched and the ith candidate text belong to similar categories;
determining the probability as the word vector correlation degree between the information to be searched and the ith candidate text; wherein i is a positive integer less than or equal to the total number of candidate texts of the candidate set.
7. The method of claim 6, wherein the extracting the features of the information to be searched to obtain a first word vector comprises:
determining a text vector, a position vector and an initial word vector of the information to be searched;
and processing the text vector, the position vector and the initial word vector by using a BERT model to obtain the first word vector.
8. A search recommendation apparatus, comprising:
the first determination module is used for determining the word vector relevancy between the information to be searched and each candidate text in the candidate text set;
the second determination module is used for determining the character relevancy between the information to be searched and each candidate text;
the fusion module is used for fusing the word vector relevancy and the character relevancy corresponding to each candidate text to obtain an evaluation value corresponding to the candidate text;
and the recommending module is used for recommending the candidate texts corresponding to the evaluation values meeting the specific conditions.
9. An electronic device comprising a memory and a processor, the memory storing a computer program operable on the processor, wherein the processor implements the steps of the search recommendation method of any one of claims 1 to 7 when executing the program.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the search recommendation method according to any one of claims 1 to 7.
CN202010220548.2A 2020-03-25 2020-03-25 Search recommendation method and device, equipment and storage medium Pending CN111274366A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010220548.2A CN111274366A (en) 2020-03-25 2020-03-25 Search recommendation method and device, equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111274366A true CN111274366A (en) 2020-06-12

Family

ID=71002548

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010220548.2A Pending CN111274366A (en) 2020-03-25 2020-03-25 Search recommendation method and device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111274366A (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080126335A1 (en) * 2006-11-29 2008-05-29 Oracle International Corporation Efficient computation of document similarity
WO2019136993A1 (en) * 2018-01-12 2019-07-18 深圳壹账通智能科技有限公司 Text similarity calculation method and device, computer apparatus, and storage medium
WO2019210820A1 (en) * 2018-05-03 2019-11-07 华为技术有限公司 Information output method and apparatus
CN110427463A (en) * 2019-08-08 2019-11-08 腾讯科技(深圳)有限公司 Search statement response method, device and server and storage medium


Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821587A (en) * 2021-06-02 2021-12-21 腾讯科技(深圳)有限公司 Text relevance determination method, model training method, device and storage medium
CN113821587B (en) * 2021-06-02 2024-05-17 腾讯科技(深圳)有限公司 Text relevance determining method, model training method, device and storage medium
CN113312523A (en) * 2021-07-30 2021-08-27 北京达佳互联信息技术有限公司 Dictionary generation and search keyword recommendation method and device and server
CN113312523B (en) * 2021-07-30 2021-12-14 北京达佳互联信息技术有限公司 Dictionary generation and search keyword recommendation method and device and server
CN114117046A (en) * 2021-11-26 2022-03-01 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium
CN114117046B (en) * 2021-11-26 2023-08-11 北京百度网讯科技有限公司 Data processing method, device, electronic equipment and medium

Similar Documents

Publication Publication Date Title
US10997370B2 (en) Hybrid classifier for assigning natural language processing (NLP) inputs to domains in real-time
CN108647205B (en) Fine-grained emotion analysis model construction method and device and readable storage medium
CN106709040B (en) Application search method and server
JP6007088B2 (en) Question answering program, server and method using a large amount of comment text
CN110019658B (en) Method and related device for generating search term
CA2774278C (en) Methods and systems for extracting keyphrases from natural text for search engine indexing
US20130060769A1 (en) System and method for identifying social media interactions
Kanwal et al. A review of text-based recommendation systems
CN111753167B (en) Search processing method, device, computer equipment and medium
JP5710581B2 (en) Question answering apparatus, method, and program
CN111274366A (en) Search recommendation method and device, equipment and storage medium
CN105653562A (en) Calculation method and apparatus for correlation between text content and query request
Gu et al. Service package recommendation for mashup creation via mashup textual description mining
WO2021112984A1 (en) Feature and context based search result generation
CN111400584A (en) Association word recommendation method and device, computer equipment and storage medium
CN111325018A (en) Domain dictionary construction method based on web retrieval and new word discovery
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
Figueroa et al. Contextual language models for ranking answers to natural language definition questions
CN111460177B (en) Video expression search method and device, storage medium and computer equipment
JP6173958B2 (en) Program, apparatus and method for searching using a plurality of hash tables
Lu et al. Entity identification on microblogs by CRF model with adaptive dependency
Gupta et al. Document summarisation based on sentence ranking using vector space model
CN112214511A (en) API recommendation method based on WTP-WCD algorithm
JP2010282403A (en) Document retrieval method
CN107423298B (en) Searching method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination