CN111400607B - Search content output method and device, computer equipment and readable storage medium - Google Patents

Search content output method and device, computer equipment and readable storage medium

Info

Publication number
CN111400607B
CN111400607B (granted publication of application CN202010497756.7A)
Authority
CN
China
Prior art keywords
information
vector
entity
sample
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010497756.7A
Other languages
Chinese (zh)
Other versions
CN111400607A (en)
Inventor
苑爱泉
王磊
王晓峰
芦亚飞
王宇昊
何旺贵
桑梓森
孙靓
徐花
李向阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Koubei Network Technology Co Ltd
Original Assignee
Zhejiang Koubei Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Koubei Network Technology Co Ltd filed Critical Zhejiang Koubei Network Technology Co Ltd
Priority to CN202010497756.7A priority Critical patent/CN111400607B/en
Publication of CN111400607A publication Critical patent/CN111400607A/en
Application granted granted Critical
Publication of CN111400607B publication Critical patent/CN111400607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a search content output method, a search content output apparatus, a computer device, and a readable storage medium, and relates to the technical field of the internet. The aim is to obtain the information word segmentation, entity identification, and semantic vector of the information to be searched; then acquire first candidate content that includes the information word segmentation, second candidate content that corresponds to the entity identification and/or whose similarity to it is greater than a first similarity threshold, and third candidate content whose similarity to the semantic vector is greater than a second similarity threshold; and so output search content. The method comprises the following steps: analyzing the information to be searched to obtain its information word segmentation, entity identification, and semantic vector; acquiring first candidate content that includes the information word segmentation; acquiring second candidate content based on the entity identification; acquiring third candidate content output by the vector model; and generating and outputting the search content.

Description

Search content output method and device, computer equipment and readable storage medium
Technical Field
The present invention relates to the field of internet technologies, and in particular, to a search content output method, apparatus, computer device, and readable storage medium.
Background
In recent years, with the rapid development of internet technology, internet applications have spread into every field, big data has grown explosively, and massive amounts of data and information are dispersed across the network. When a user needs to acquire such information and data, an information search can be performed through a search platform, which then outputs the related search content. As an important link between users and information, a search platform generally provides multiple search modes, such as text search, voice search, picture search, and video search, which together form a multi-modal mode of user interaction and meet users' diversified requirements.
In the related art, a large number of materials for searching are preset in the search platform. When a user's request for a content search is received, materials related to the requested search are looked up among the preset materials, and the related materials are output to the user as search content for reference.
In the process of implementing the invention, the inventors found that the related art has at least the following problem:
the materials preset in the search platform for searching take many forms, including text materials such as names, categories, addresses, and comments; picture materials such as environment images, article images, and address images; and video materials such as video albums, arrival videos, and live videos, yet conventional search methods exploit only a fraction of these forms.
Disclosure of Invention
In view of the above, the present invention provides a search content output method, apparatus, computer device, and readable storage medium, mainly aiming to solve the problems that current search methods are highly limited and waste a large amount of search material in other forms, resulting in a low search success rate and accuracy.
According to a first aspect of the present invention, there is provided a search content output method including:
analyzing information to be searched to obtain information word segmentation, entity identification and semantic vector of the information to be searched;
acquiring first candidate content comprising the information word segmentation;
acquiring second candidate content based on the entity identification, wherein the second candidate content is at least a sample entity that corresponds to the entity identification and/or whose similarity to the entity identification is greater than a first similarity threshold;
inputting the semantic vector into a vector model, and acquiring third candidate content output by the vector model, wherein the vector model is established by adopting a text material, a voice material and a video material, and the similarity between the third candidate content and the semantic vector is greater than a second similarity threshold;
and generating search content according to the first candidate content, the second candidate content and the third candidate content, and outputting the search content.
In another embodiment, before the analyzing the information to be searched to obtain the information word segmentation, the entity identifier, and the semantic vector of the information to be searched, the method further includes:
acquiring sample information, analyzing materials in the sample information to obtain a sample characteristic vector, wherein the sample information is at least historical operation information and/or preset information;
learning the sample feature vectors by adopting a sorting algorithm to generate a plurality of sample vector groups, wherein each sample vector group is at least a triple consisting of a search word vector, a first name vector, and a second name vector in the sample feature vectors, the first name vector being a name vector in the sample feature vectors that matches the search word vector, and the second name vector being a name vector in the sample feature vectors that does not match the search word vector;
respectively inputting the plurality of sample vector groups into a semantic matching model, and acquiring an output vector of the last layer in a hidden layer of the semantic matching model to obtain a plurality of output vectors of the plurality of sample vector groups;
using the plurality of output vectors as the vector model.
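The sorting objective over the sample vector groups described above can be written as a margin loss over each (search-word, matched-name, unmatched-name) triple. The following is a minimal sketch only, assuming cosine distance and a hypothetical `margin` hyperparameter; the patent does not fix the exact form of the sorting algorithm:

```python
import math

def _cos(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def triplet_margin_loss(search_vec, matched_vec, unmatched_vec, margin=0.2):
    """Loss over one sample vector group (a triple): push the matched
    name vector closer to the search word vector than the unmatched
    name vector, by at least `margin` in cosine distance."""
    d_pos = 1.0 - _cos(search_vec, matched_vec)
    d_neg = 1.0 - _cos(search_vec, unmatched_vec)
    return max(0.0, d_pos - d_neg + margin)
```

A well-separated triple contributes zero loss; a triple in which the unmatched name is closer than the matched one contributes positive loss, which is what drives the semantic matching model's hidden-layer vectors apart during training.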
In another embodiment, the analyzing the material in the sample information to obtain a sample feature vector includes:
extracting the voice material from the material of the sample information, and calling a voice recognition algorithm to recognize the voice material to obtain a prepared text material;
extracting the video material from the material of the sample information, and extracting the video material by adopting a video key frame extraction algorithm to obtain a prepared picture material;
extracting original text materials from the materials of the sample information, and training the original text materials and the prepared text materials by adopting a semantic training algorithm to obtain text feature vectors;
extracting original picture materials from the materials of the sample information, operating a picture feature extractor, taking the entity categories to which the original picture materials and the prepared picture materials belong as a first extraction target of the picture feature extractor, learning the original picture materials and the prepared picture materials according to the first extraction target, and taking the feature vector of the last layer in the picture feature extractor as a picture feature vector;
and taking the text feature vector and the picture feature vector as the sample feature vector.
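The patent names a "video key frame extraction algorithm" without fixing it. As an illustration only, a uniform-sampling stand-in (production systems typically use shot-boundary detection instead) can be sketched as:

```python
def uniform_keyframes(n_frames, n_keys):
    """Pick `n_keys` frame indices spread evenly over a clip of
    `n_frames` frames: the middle frame of each equal segment.
    A minimal stand-in for the key-frame extraction step."""
    if n_keys >= n_frames:
        return list(range(n_frames))
    step = n_frames / n_keys
    return [int(step * i + step / 2) for i in range(n_keys)]
```

The selected frames would then be handed to the picture feature extractor as the prepared picture material.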
In another embodiment, the analyzing the information to be searched to obtain the information word segmentation, the entity identifier and the semantic vector of the information to be searched includes:
performing word segmentation on the information to be searched according to a word segmentation template to obtain information word segmentation;
establishing an entity identification task, and identifying the information to be searched by taking the search type of the information to be searched as the identification direction of the entity identification task to obtain the entity identification;
and determining the information type of the information to be searched, and identifying the information to be searched according to the information type to obtain the semantic vector.
In another embodiment, the identifying the information to be searched according to the information type to obtain the semantic vector includes:
if the information type is a text type, identifying the information to be searched by adopting a semantic training algorithm, and taking a feature vector obtained by identification as the semantic vector;
if the information type is a voice type, calling a voice recognition algorithm to recognize the information to be searched to obtain the information to be searched of a text type, recognizing the information to be searched of the text type by adopting the semantic training algorithm, and taking a feature vector obtained by recognition as the semantic vector;
if the information type is a picture type, operating a picture feature extractor, taking an entity category to which the information to be searched belongs as a second extraction target of the picture feature extractor, learning the information to be searched according to the second extraction target, and taking a feature vector of the last layer in the picture feature extractor as the semantic vector;
if the information type is a video type, processing the information to be searched by adopting a video key frame extraction algorithm to obtain the information to be searched of the picture type, operating the picture feature extractor, setting the second extraction target for the picture feature extractor, learning the information to be searched of the picture type according to the second extraction target, and taking the feature vector of the last layer in the picture feature extractor as the semantic vector.
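The four branches above reduce to a modality dispatch: voice is first converted to text and video to key-frame pictures, after which the shared text or picture encoder produces the semantic vector. A sketch with hypothetical encoder callables (none of these names come from the patent):

```python
def semantic_vector(info, info_type, *, asr, keyframes, text_encoder, image_encoder):
    """Route the information to be searched to the right encoder by
    modality. `asr`, `keyframes`, `text_encoder`, and `image_encoder`
    are hypothetical callables supplied by the caller."""
    if info_type == "voice":
        info, info_type = asr(info), "text"            # speech -> text first
    if info_type == "video":
        info, info_type = keyframes(info), "picture"   # video -> key frames first
    if info_type == "text":
        return text_encoder(info)
    if info_type == "picture":
        return image_encoder(info)
    raise ValueError(f"unknown information type: {info_type}")
```

This mirrors the reuse in the patent: the voice and video branches do not need encoders of their own, only a conversion step in front of the text and picture paths.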
In another embodiment, the obtaining the second candidate content based on the entity identification includes:
querying a preset information index corresponding to the entity identification, and taking a first sample entity indicated by the preset information index as the second candidate content; and/or,
and acquiring an entity knowledge graph, mapping the entity identification to the entity knowledge graph for wandering, and acquiring a second sample entity which is output by the entity knowledge graph and has similarity with the entity identification greater than the first similarity threshold value as the second candidate content, wherein the entity knowledge graph describes similarity and similarity relation among a plurality of sample entities.
In another embodiment, the mapping the entity identifier to the entity knowledge graph for wandering, and acquiring, as the second candidate content, a second sample entity output by the entity knowledge graph and having a similarity with the entity identifier greater than the first similarity threshold, includes:
determining a designated sample entity corresponding to the entity identification in the entity knowledge graph, acquiring second sample entities that have a similarity relation with the designated sample entity and whose similarity is greater than the first similarity threshold, and continuing to acquire sample entities that have a similarity relation with those second sample entities and whose similarity is greater than the first similarity threshold, until no entity having both a similarity relation and a similarity greater than the first similarity threshold can be acquired, and taking all the acquired second sample entities as the second candidate content; or, alternatively,
determining a target vector corresponding to the entity identifier in the semantic vector, mapping the target vector to the entity knowledge graph, calculating a first cosine value between the target vector and an entity vector of a sample entity in the entity knowledge graph, and extracting a second sample entity with the first cosine value larger than the first similarity threshold as the second candidate content.
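The second branch above (the vector form of the knowledge-graph walk) amounts to a cosine screen over the sample-entity vectors. A minimal sketch, assuming a hypothetical dict layout mapping each entity name to its vector in the graph:

```python
import math

def entities_above_threshold(target_vec, entity_vecs, first_sim_threshold):
    """Keep every sample entity whose vector's cosine with the query's
    target vector exceeds the first similarity threshold, most similar
    first. `entity_vecs` maps entity name -> entity vector."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    hits = [(name, cos(target_vec, v)) for name, v in entity_vecs.items()]
    hits = [(n, c) for n, c in hits if c > first_sim_threshold]
    return sorted(hits, key=lambda nc: -nc[1])
```

The first cosine value computed here is what the patent compares against the first similarity threshold to select second sample entities.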
In another embodiment, the inputting the semantic vector into a vector model and obtaining a third candidate content output by the vector model comprises:
inputting the semantic vector into the vector model, and comparing the semantic vector with a plurality of output vectors included in the vector model;
extracting at least one candidate output vector from the plurality of output vectors, wherein the number of numerical values at which the candidate output vector and the semantic vector coincide at the same numerical position is greater than a number threshold;
and calculating a second cosine value of the semantic vector and the at least one candidate output vector, and extracting a target output vector of which the second cosine value is larger than the second similarity threshold value as the third candidate content.
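The two steps above form a coarse-then-fine match: a cheap position-wise overlap count prunes the vector model's output vectors, and an exact cosine against the second similarity threshold selects the targets. A sketch, under the assumption that "coinciding numerical values" means equal components at the same position:

```python
import math

def third_candidates(semantic_vec, output_vecs, overlap_min, second_sim_threshold):
    """Prune output vectors by per-position overlap count, then keep
    those whose cosine with the semantic vector exceeds the second
    similarity threshold."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    targets = []
    for ov in output_vecs:
        overlap = sum(1 for x, y in zip(semantic_vec, ov) if x == y)
        if overlap <= overlap_min:
            continue                     # pruned before the costly cosine
        if cos(semantic_vec, ov) > second_sim_threshold:
            targets.append(ov)
    return targets
```

The overlap prefilter keeps the expensive cosine computation off most of the model's output vectors, which matters when the model holds one vector per sample vector group.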
In another embodiment, the generating search content based on the first candidate content, the second candidate content, and the third candidate content, and outputting the search content includes:
counting the content intersection of the first candidate content, the second candidate content and the third candidate content;
respectively determining the similarity between each sub-content in the content intersection and the information to be searched, and dividing the content intersection into a plurality of content groups according to a preset similar grade interval, wherein the preset similar grade interval specifies the corresponding relation between the similar grade and the similarity;
and extracting a target content group with the similarity level higher than a level threshold value from the plurality of content groups as the search content, and outputting the search content.
In another embodiment, after the determining the similarity between each sub-content in the content intersection and the information to be searched respectively and dividing the content intersection into a plurality of content groups according to a preset similarity level interval, the method further includes:
for each content group in the plurality of content groups, acquiring content scores corresponding to all sub-contents in the content group, wherein the content scores are at least set according to factors such as the distance and the quality of the sub-contents;
and sorting all sub-contents included in the content group in the order of the content scores from high to low.
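Putting the grouping and scoring steps together: intersection members are banded by similarity level, bands above the level threshold are kept, and each kept band is ordered by content score. A sketch, assuming a hypothetical `(name, similarity, content_score)` tuple layout and ascending level lower bounds standing in for the preset similar grade interval:

```python
def rank_search_content(sub_contents, level_lower_bounds, level_threshold):
    """Group (name, similarity, content_score) tuples into similarity
    levels, drop levels at or below `level_threshold`, and sort each
    kept level by content score, high to low."""
    groups = {}
    for name, sim, score in sub_contents:
        level = sum(1 for lo in level_lower_bounds if sim >= lo)  # 0..len
        groups.setdefault(level, []).append((name, score))
    ranked = []
    for level in sorted(groups, reverse=True):        # best levels first
        if level <= level_threshold:
            continue
        ranked += [n for n, _ in sorted(groups[level], key=lambda ns: -ns[1])]
    return ranked
```

Similarity decides which bands survive; within a band, the content score (set from factors such as distance and quality) decides the display order.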
According to a second aspect of the present invention, there is provided a search content output apparatus including:
the first analysis module is used for analyzing the information to be searched to obtain information word segmentation, entity identification and semantic vectors of the information to be searched;
the first acquisition module is used for acquiring first candidate content comprising the information word segmentation;
a second obtaining module, configured to obtain second candidate content based on the entity identifier, where the second candidate content is at least a sample entity corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold;
a third obtaining module, configured to input the semantic vector into a vector model, and obtain a third candidate content output by the vector model, where the vector model is established using a text material, a voice material, and a video material, and a similarity between the third candidate content and the semantic vector is greater than a second similarity threshold;
and the output module is used for generating search content according to the first candidate content, the second candidate content and the third candidate content and outputting the search content.
In another embodiment, the apparatus further comprises:
the second analysis module is used for acquiring sample information, analyzing materials in the sample information to obtain a sample characteristic vector, wherein the sample information is at least historical operation information and/or preset information;
a generating module, configured to learn the sample feature vectors by using a sorting algorithm, and generate a plurality of sample vector groups, where each sample vector group is at least a triple composed of a search word vector, a first name vector, and a second name vector in the sample feature vectors, the first name vector being a name vector in the sample feature vectors that matches the search word vector, and the second name vector being a name vector in the sample feature vectors that does not match the search word vector;
the input module is used for respectively inputting the plurality of sample vector groups to a semantic matching model, acquiring the output vector of the last layer in the hidden layer of the semantic matching model, and obtaining a plurality of output vectors of the plurality of sample vector groups;
a determination module for using the plurality of output vectors as the vector model.
In another embodiment, the second parsing module includes:
the recognition unit is used for extracting the voice material from the material of the sample information, calling a voice recognition algorithm to recognize the voice material and obtaining a prepared text material;
the extraction unit is used for extracting the video material from the material of the sample information and extracting the video material by adopting a video key frame extraction algorithm to obtain a prepared picture material;
the training unit is used for extracting original text materials from the materials of the sample information, and training the original text materials and the prepared text materials by adopting a semantic training algorithm to obtain text feature vectors;
the learning unit is used for extracting original picture materials from the materials of the sample information, operating a picture feature extractor, taking the entity categories to which the original picture materials and the prepared picture materials belong as first extraction targets of the picture feature extractor, learning the original picture materials and the prepared picture materials according to the first extraction targets, and taking the feature vector of the last layer in the picture feature extractor as a picture feature vector;
a generating unit, configured to use the text feature vector and the picture feature vector as the sample feature vector.
In another embodiment, the first parsing module includes:
the segmentation unit is used for carrying out word segmentation on the information to be searched according to the word segmentation template to obtain the information word segmentation;
the establishment unit is used for establishing an entity identification task, and identifying the information to be searched by taking the search type of the information to be searched as the identification direction of the entity identification task to obtain the entity identification;
and the determining unit is used for determining the information type of the information to be searched, and identifying the information to be searched according to the information type to obtain the semantic vector.
In another embodiment, the determining unit is configured to identify the information to be searched by using a semantic training algorithm if the information type is a text type, and use a feature vector obtained by the identification as the semantic vector; if the information type is a voice type, calling a voice recognition algorithm to recognize the information to be searched to obtain the information to be searched of a text type, recognizing the information to be searched of the text type by adopting the semantic training algorithm, and taking a feature vector obtained by recognition as the semantic vector; if the information type is a picture type, operating a picture feature extractor, taking an entity category to which the information to be searched belongs as a second extraction target of the picture feature extractor, learning the information to be searched according to the second extraction target, and taking a feature vector of the last layer in the picture feature extractor as the semantic vector; if the information type is a video type, processing the information to be searched by adopting a video key frame extraction algorithm to obtain the information to be searched of the picture type, operating the picture feature extractor, setting the second extraction target for the picture feature extractor, learning the information to be searched of the picture type according to the second extraction target, and taking the feature vector of the last layer in the picture feature extractor as the semantic vector.
In another embodiment, the second obtaining module is configured to query a preset information index corresponding to the entity identifier, and use a first sample entity indicated by the preset information index as the second candidate content; and/or acquiring an entity knowledge graph, mapping the entity identification to the entity knowledge graph for wandering, and acquiring a second sample entity which is output by the entity knowledge graph and has similarity with the entity identification greater than the first similarity threshold value as the second candidate content, wherein the entity knowledge graph describes similarity and similarity relation among a plurality of sample entities.
In another embodiment, the second obtaining module is configured to determine a designated sample entity corresponding to the entity identifier in the entity knowledge graph, obtain second sample entities that have a similarity relation with the designated sample entity and whose similarity is greater than the first similarity threshold, continue to obtain sample entities that have a similarity relation with those second sample entities and whose similarity is greater than the first similarity threshold until no entity having both a similarity relation and a similarity greater than the first similarity threshold can be obtained, and take all the obtained second sample entities as the second candidate content; or, determine a target vector corresponding to the entity identifier in the semantic vector, map the target vector to the entity knowledge graph, calculate a first cosine value between the target vector and the entity vector of each sample entity in the entity knowledge graph, and extract second sample entities whose first cosine value is greater than the first similarity threshold as the second candidate content.
In another embodiment, the third obtaining module includes:
the comparison unit is used for inputting the semantic vector into the vector model and comparing the semantic vector with a plurality of output vectors included in the vector model;
an extracting unit, configured to extract at least one candidate output vector from the plurality of output vectors, where a number of numerical values of the candidate output vector and the semantic vector that coincide on a same numerical position is greater than a number threshold;
and the calculating unit is used for calculating a second cosine value of the semantic vector and the at least one candidate output vector, and extracting a target output vector of which the second cosine value is larger than the second similarity threshold value as the third candidate content.
In another embodiment, the output module includes:
a counting unit, configured to count content intersections of the first candidate content, the second candidate content, and the third candidate content;
the dividing unit is used for respectively determining the similarity between each sub-content in the content intersection and the information to be searched, and dividing the content intersection into a plurality of content groups according to a preset similar grade interval, wherein the preset similar grade interval specifies the corresponding relation between the similar grade and the similarity;
and an output unit configured to extract a target content group having a similarity level higher than a level threshold value among the plurality of content groups as the search content and output the search content.
In another embodiment, the output module further includes:
the scoring unit is used for acquiring content scores corresponding to all sub-contents in the content groups for each content group in the plurality of content groups, and the content scores are at least set according to factors such as the distance and the quality of the sub-contents;
and the sorting unit is used for sorting all the sub-contents in the content group according to the sequence of the content scores from high to low.
According to a third aspect of the present invention, there is provided a computer device comprising a memory storing a computer program and a processor implementing the steps of the method of the first aspect when the processor executes the computer program.
According to a fourth aspect of the present invention, there is provided a readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the method of the first aspect as set forth above.
By means of the above technical scheme, the search content output method, apparatus, computer device, and readable storage medium provided by the invention parse the information to be searched to obtain its information word segmentation, entity identification, and semantic vector. First candidate content that includes the information word segmentation is then acquired; second candidate content that corresponds to the entity identification and/or whose similarity to it exceeds a first similarity threshold is acquired based on the entity identification; and the semantic vector is input into the vector model to acquire third candidate content whose similarity to the semantic vector exceeds a second similarity threshold, so that search content is generated and output. The information to be searched is thus searched through several different methods and search links, realizing multi-modal search, breaking the limitations of existing search methods, avoiding the waste of search materials, and improving the success rate and accuracy of searching.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
fig. 1 is a schematic flowchart illustrating a search content output method according to an embodiment of the present invention;
fig. 2A is a schematic flowchart illustrating a search content output method according to an embodiment of the present invention;
fig. 2B is a schematic flowchart illustrating a search content output method according to an embodiment of the present invention;
fig. 3A is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 3B is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 3C is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 3D is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 3E is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 3F is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 3G is a schematic structural diagram illustrating a search content output apparatus according to an embodiment of the present invention;
fig. 4 shows a schematic device structure diagram of a computer apparatus according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
An embodiment of the present invention provides a search content output method, as shown in fig. 1, the method includes:
101. Analyze the information to be searched to obtain the information word segmentation, entity identifier and semantic vector of the information to be searched.
102. Obtain first candidate content including the information word segmentation.
103. Obtain second candidate content based on the entity identifier, wherein the second candidate content is at least a sample entity corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold.
104. Input the semantic vector into a vector model and obtain third candidate content output by the vector model, wherein the vector model is established using text materials, voice materials and video materials, and the similarity between the third candidate content and the semantic vector is greater than a second similarity threshold.
105. Generate search content according to the first candidate content, the second candidate content and the third candidate content, and output the search content.
According to the method provided by the embodiment of the invention, the information to be searched is analyzed to obtain its information word segmentation, entity identifier and semantic vector. First candidate content including the information word segmentation is obtained; second candidate content corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold is obtained based on the entity identifier; and the semantic vector is input into the vector model to obtain third candidate content output by the vector model whose similarity with the semantic vector is greater than a second similarity threshold, so that the search content is generated and output. The information to be searched is thus searched by a plurality of different methods over a plurality of search links, realizing multi-modal search of information, breaking the limitation of a single search method, avoiding the waste of search materials, and improving the success rate and accuracy of search.
An embodiment of the present invention provides a search content output method, as shown in fig. 2A, the method includes:
201. Analyze the information to be searched to obtain the information word segmentation, entity identifier and semantic vector of the information to be searched.
The inventor realizes that a search platform provides users with the largest entrance for information search and is an important link connecting users and information; as the most common search tool, the search platform has become an essential part of people's lives. Generally, a search platform provides various search modes, such as common text search, voice search in which the query is spoken directly, and picture search by taking or uploading a picture, which objectively forms a multi-modal interaction mode between the search platform and the user. In addition, some search platforms are characterized by scene search. For example, for some search platforms of the local life category, the search targets are shops in each industry, so the materials these platforms provide for searching are various: text materials (such as shop names, categories, addresses, commodities, evaluations, etc.), picture materials (such as storefront pictures, environment pictures, dish pictures, etc.) and video materials (such as video albums, shop visits, visitor-recommended videos, etc.), all of which are heterogeneous. However, most current common searches use only the text materials and do not use the other search links, so a large amount of material such as pictures and videos is not used directly in the search links, which not only wastes materials but also limits the recall ability of search. Therefore, the invention provides a search content output method that realizes the full utilization of various different types of search materials by thoroughly analyzing the information to be searched, ensuring to the greatest extent that the search service provided to users is comprehensive and complete.
In the present invention, the above-mentioned search platform of local life categories is taken as an example for explanation, and in the process of practical application, the method of the present invention can also be applied to search platforms in other various scenarios.
In order to implement the searching method in the whole invention, firstly, the information to be searched needs to be analyzed comprehensively, so that the information to be searched is searched in a plurality of searching links. In the present invention, it is proposed to use 3 search links to search for information, so 3 parsing methods are required to obtain 3 data used for searching in the search links, for details, see steps one to three below.
Step one, carrying out word segmentation on information to be searched according to a word segmentation template to obtain information word segmentation.
The parsing process shown in step one is essentially the word segmentation of the information to be searched used in the most common text search method. Generally, a word segmentation template including a series of words may be set, and the information to be searched is segmented according to the words specified in the template to obtain the information word segmentation. Alternatively, in practical application, a word recognition tool may be adopted to perform word recognition on the information to be searched according to the logic of words, so as to obtain the information word segmentation of the information to be searched. It should be noted that the number of words included in the information word segmentation is not limited to one and may be plural; the present invention is not specifically limited in this respect.
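As an illustrative sketch only (the patent does not fix a concrete segmentation algorithm), the word-segmentation-template matching of step one can be approximated by greedy longest-match against a set of template words; the function name `segment` and the greedy strategy are assumptions for illustration:

```python
# Hypothetical sketch of step one: segment the information to be searched
# against a word segmentation template (a set of known words), preferring
# the longest template word starting at each position.
def segment(info, template):
    """Return the list of template words found in `info` (greedy longest match)."""
    words, i = [], 0
    while i < len(info):
        for length in range(len(info) - i, 0, -1):  # try longest substring first
            piece = info[i:i + length]
            if piece in template:
                words.append(piece)
                i += length
                break
        else:
            i += 1  # no template word starts here; skip one character
    return words
```

For example, with the template { "mocha", "coffee" }, the input "mocha coffee" yields the information word segmentation [ "mocha", "coffee" ].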
In the process of practical application, the information to be searched may be of various types, such as a text type, a voice type and a picture type, and if the information to be searched is of the text type, the information word segmentation of the information to be searched is directly obtained according to the process in the step one. If the information to be searched is in the voice type, a voice recognition algorithm can be called to convert the information to be searched in the voice type into the information to be searched in the character type, and then the process in the step one is executed to obtain the information word segmentation of the information to be searched. If the information to be searched is of the picture type, words can be extracted from the picture, then information word segmentation is carried out on the words according to the process in the step one, and if the words do not exist in the picture, the step one can be omitted.
And step two, establishing an entity identification task, and identifying the information to be searched by taking the search type of the information to be searched as the identification direction of the entity identification task to obtain an entity identifier.
Considering that some words in the information to be searched have actual physical meanings, splitting such a word character by character destroys its analytical value. For example, the word "coffee" as a whole refers to the beverage "coffee"; split into its individual characters, the fragments can no longer represent the beverage, and the physical meaning disappears. Therefore, entity recognition is performed on such words as a whole. The entity recognition task may be an NLP (Natural Language Processing) NER (Named Entity Recognition) task, and may be implemented by using an LSTM (Long Short-Term Memory) network as a feature extractor and a CRF (Conditional Random Field) as the output layer.
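A full LSTM+CRF NER model is beyond a short sketch, so the following hypothetical stand-in only illustrates the input/output contract of the entity recognition task — query text in, entity identifier out — using a dictionary lookup in place of the trained model; the dictionary contents and function name are assumptions (the identifier value reuses the [ coffee ]/[ 1234sdf2 ] example given later in the text):

```python
# Minimal stand-in for the entity recognition task of step two. The patent
# uses an LSTM feature extractor with a CRF output layer; a dictionary
# lookup plays that role here so the contract is visible.
ENTITY_DICT = {"coffee": "1234sdf2"}  # sample entity -> identifier (illustrative)

def recognize_entity(info, entity_dict=ENTITY_DICT):
    """Return the identifier of the first known entity found in `info`, else None."""
    for entity, ident in entity_dict.items():
        if entity in info:
            return ident
    return None
```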
And step three, determining the information type of the information to be searched, and identifying the information to be searched according to the information type to obtain a semantic vector.
The inventor realizes that if information search is carried out only by means of the information word segmentation or the entity identifier, the relatively obscure content of some materials cannot be extracted, and to a certain extent a search blank exists. The semantic vector is a vector in mathematical form set for the information to be searched according to its semantics, so that the corresponding materials are obtained through accurate calculation in the candidate process, making the whole search process more accurate and rigorous than other common search processes.
Therefore, in step three, the semantic vector of the information to be searched needs to be determined. Considering that the information to be searched may be of a text type, a voice type, a picture type or a video type, different ways need to be adopted to identify the semantic vectors of the different types of information to be searched. Specifically, if the information type is the text type, a semantic training algorithm can be directly adopted to identify the information to be searched, and the feature vector obtained by identification is used as the semantic vector. The semantic training algorithm may be a Bert (a language representation model) pre-training algorithm. If the information type is the voice type, a voice recognition algorithm first needs to be called to recognize the information to be searched and obtain information to be searched of the text type; the text-type information to be searched is then identified by the semantic training algorithm, and the feature vector obtained by identification is used as the semantic vector. If the information type is the picture type, the picture feature extractor needs to be run first. The picture feature extractor may be implemented based on a CNN (Convolutional Neural Network) model, or based on a VGG NET (Visual Geometry Group network, a deep convolutional network) model, which is not specifically limited in the present invention. Then, the entity category to which the information to be searched belongs is taken as the second extraction target of the picture feature extractor, the information to be searched is learned according to the second extraction target, and the feature vector of the last layer in the picture feature extractor is taken as the semantic vector.
That is, it is necessary to set which category the information to be searched belongs to. For example, assuming the information to be searched is "coffee", the corresponding entity category needs to be set as drink; only information about "coffee" searched within the drink range is valid, and information searched in other ranges has no relationship with the current search and belongs to meaningless search. In practical application, a 3-layer fully-connected network can be appended to the CNN model for supervised learning on the information to be searched; the first layers of the CNN model (specifically, the convolutional layers and pooling layers) are retained, and the output of the last of these layers is used as the final feature of the information to be searched, namely the semantic vector. If the information type is the video type, the information to be searched is first processed with a video key-frame extraction algorithm, converting the video-type information to be searched into the picture type. Then the picture feature extractor is run, a second extraction target is set for it in the same way as when extracting the semantic vector of picture-type information to be searched, the picture-type information is learned according to the second extraction target, and the feature vector of the last layer in the picture feature extractor is taken as the semantic vector.
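The type-dependent branching of step three can be sketched as a dispatcher; the concrete models (Bert, the voice recognition algorithm, the CNN feature extractor, the key-frame extractor) are passed in as callables, and all parameter names below are placeholders rather than real APIs:

```python
# Sketch of the type dispatch in step three: each information type is routed
# through the conversion chain described in the text before the semantic
# vector is extracted. The callables stand in for the actual models.
def semantic_vector(info, info_type, *, bert, asr, cnn, keyframe):
    if info_type == "text":
        return bert(info)
    if info_type == "voice":
        return bert(asr(info))      # voice -> text, then Bert
    if info_type == "picture":
        return cnn(info)            # last-layer feature vector of the extractor
    if info_type == "video":
        return cnn(keyframe(info))  # video -> key-frame picture, then CNN
    raise ValueError("unknown information type: " + info_type)
```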
By executing steps one to three, the complete analysis of the information to be searched is realized, and the information word segmentation, entity identifier and semantic vector of the information to be searched are obtained. It should be noted that, in practical application, the three steps have no fixed order: they may be executed simultaneously, in the sequence above, or in any other order that the actual scene requires; the present invention is not specifically limited in this respect.
202. Obtain first candidate content including the information word segmentation.
In the embodiment of the invention, after the information word segmentation of the information to be searched is obtained, a round of search can be performed on the basis of the information word segmentation to obtain a group of candidate contents. When information is searched based on the information word segmentation, content including the information word segmentation is essentially searched for as the first candidate content. For example, assuming the information word segmentation is "coffee", the acquired first candidate content may be "mocha coffee", "American coffee", and the like; any content including the information word segmentation may serve as first candidate content.
203. Obtain second candidate content based on the entity identifier.
In the embodiment of the invention, after the entity identifier of the information to be searched is acquired, a second round of search can be performed based on the entity identifier. Before searching for information based on the entity identifier, a series of sample entities need to be set, so that sample entities having a certain association with the entity identifier can be selected from the sample entities as second candidate content.
When setting the sample entities, two ways may be adopted. One is to establish an entity recognition task such as the NLP NER task shown in step 201 above, and identify the materials preset in the search platform based on the entity recognition task, thereby obtaining a plurality of sample entities. The other is to label entities manually according to materials such as texts and pictures, thereby obtaining a plurality of sample entities. It should be noted that the number of sample entities is generally kept above 100,000 so as to ensure that the accuracy of search can exceed 95%. In practical application, the set sample entities are extracted from the materials of the search platform itself and are generally entities of local life shops; therefore, a sample entity may also be referred to as a [ shop entity ]. The entity identifier extracted in step 201 is obtained from the information to be searched provided by the user; therefore, it may also be referred to as a [ user entity ], distinguishing the entities by source and role.
The sample entities identified in the above process all have certain identifiers; for example, the identifier of the sample entity [ coffee ] is [ 1234sdf2 ], which corresponds one to one with the identifier recognized in step 201. That is, if the information to be searched includes the entity [ coffee ], the obtained entity identifier is [ 1234sdf2 ]. Therefore, when setting the sample entities, in order to subsequently query the relevant sample entity directly based on the entity identifier, the identifier of a sample entity may be used as the preset information index of that sample entity, so that the corresponding sample entity can be obtained by searching the preset information index.
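Since the sample entity's identifier doubles as its preset information index, the first query way reduces to a direct index lookup; the following sketch reuses the [ coffee ]/[ 1234sdf2 ] example from the text, with the variable and function names assumed for illustration:

```python
# Sketch of the preset information index: the sample entity's identifier is
# the index key, so the first sample entity for an entity identifier is a
# direct lookup. The index contents are illustrative.
SAMPLE_INDEX = {"1234sdf2": "coffee"}  # preset information index -> sample entity

def first_sample_entity(entity_id, index=SAMPLE_INDEX):
    """Return the sample entity indicated by the preset information index, or None."""
    return index.get(entity_id)
```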
After the sample entities are determined, considering that search directly based on the sample entities is only fuzzy matching and is, to a certain extent, not very accurate, the associated sample entities of each sample entity can be determined by learning a knowledge graph over all sample entities, and the similarity between each sample entity and each of its associated sample entities can be calculated. The association relationships and similarities between sample entities and their associated sample entities are expressed in the form of a tree graph, generating an entity knowledge graph describing the association relationships and similarities among the plurality of sample entities; some candidate contents are then determined based on the entity knowledge graph as the basis of the subsequently output search content.
Through the above process, the sample entities and their entity knowledge graph are determined, so the search process for the entity identifier can continue: the sample entity corresponding to the entity identifier and/or having a similarity with the entity identifier greater than the first similarity threshold is obtained as the second candidate content. With reference to the sample entities above, two ways may be used to obtain the second candidate content. One is to directly query the preset information index corresponding to the entity identifier and use the first sample entity indicated by the preset information index as the second candidate content. The other is to obtain the entity knowledge graph, map the entity identifier onto the entity knowledge graph for wandering, and obtain second sample entities output by the entity knowledge graph whose similarity with the entity identifier is greater than the first similarity threshold as the second candidate content. It should be noted that, when obtaining the second candidate content based on the entity knowledge graph, the designated sample entity corresponding to the entity identifier may be determined in the entity knowledge graph, second sample entities that have a similarity relationship with the designated sample entity and a similarity greater than the first similarity threshold are obtained, and sample entities that have a similarity relationship with those second sample entities and a similarity greater than the first similarity threshold continue to be obtained, until no entity with such a similarity relationship or with a similarity greater than the first similarity threshold remains; all the obtained second sample entities are used as the second candidate content.
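The wandering over the entity knowledge graph described above can be sketched as a traversal that follows only similarity edges whose Score exceeds the first similarity threshold and stops when no such edge remains; the adjacency-list representation and all scores below are illustrative assumptions, not data from the patent:

```python
# Sketch of "wandering" the entity knowledge graph: starting from the
# designated sample entity, follow edges whose similarity score exceeds the
# first similarity threshold, collecting every sample entity reached.
def wander(graph, start, threshold):
    """graph: {entity: [(neighbor, score), ...]}. Returns the reached entities."""
    reached, frontier = set(), [start]
    while frontier:
        node = frontier.pop()
        for neighbor, score in graph.get(node, []):
            if score > threshold and neighbor not in reached:
                reached.add(neighbor)
                frontier.append(neighbor)
    return reached
```

All entities in the returned set would then serve as the second candidate content.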
That is, on the line segment between every two sample entities of the entity knowledge graph there is a Score representing their similarity; the first similarity threshold is set against this Score, and when the Score is smaller than the first similarity threshold, the wandering stops and all sample entities reached so far are used as the second candidate content. Alternatively, the target vector corresponding to the entity identifier in the semantic vector is determined, the target vector is mapped onto the entity knowledge graph, a first cosine value between the target vector and the entity vector of each sample entity in the entity knowledge graph is calculated, and second sample entities whose first cosine value is greater than the first similarity threshold are extracted as the second candidate content. That is, using the semantic vector extracted in step 201, the first cosine value between the target vector corresponding to the entity identifier and the entity vector of a sample entity in the entity knowledge graph is calculated as the similarity of the two vectors; any sample entity whose similarity is greater than the first similarity threshold can be used as second candidate content.
It should be noted that, in order to ensure that the obtained sample entity is highly matched with the scene where the search platform is located, in the training process, a variety of materials with scene characteristics may be applied, where the scene characteristics may specifically be time, time period, holiday, weekend, city, location, POI (Point of interest), POI type (office building, residential area, etc.), crowd portraits (gender, age, purchasing power, etc.), which are characteristics of the search platform of local life category, and the recognition efficiency of the model may be greatly improved.
204. Input the semantic vector into the vector model and obtain third candidate content output by the vector model.
In the embodiment of the invention, after the semantic vector of the information to be searched is acquired, a third round of search can be performed based on the semantic vector. Before searching for information based on the semantic vector, a vector model needs to be established using the text materials, voice materials and video materials preset in the search platform, and the output vector most similar to the semantic vector is determined through the vector model as the third candidate content. The second similarity threshold may be set to constrain the similarity between the third candidate content and the semantic vector, that is, the similarity between the third candidate content and the semantic vector is greater than the second similarity threshold.
When training the vector model, sample information first needs to be obtained, and the materials in the sample information are analyzed to obtain sample feature vectors. Considering that a vector model trained only on the information preset in the search platform is likely to be too one-sided and to differ somewhat from the actual situation, in practical application the sample information adopted for training the vector model is at least the historical operation information of users together with the preset information of the search platform. The materials included in the sample information are of different types, generally voice materials, video materials, text materials and picture materials, which need to be processed in different ways to obtain the sample feature vectors. For voice materials, the voice material is extracted from the materials of the sample information, and a voice recognition algorithm is called to recognize it, obtaining a prepared text material; the specific process is the same as converting voice-type information to be searched into text-type information to be searched in step 201 and is not repeated here. For video materials, the video material is extracted from the materials of the sample information and processed with the video key-frame extraction algorithm, obtaining a prepared picture material; the specific process is the same as converting video-type information to be searched into picture-type information to be searched in step 201 and is not repeated here.
After the type conversion is completed, original text materials are extracted from the materials of the sample information, and the original text materials and the prepared text materials are trained by adopting a semantic training algorithm to obtain text feature vectors. The training process is consistent with the process of extracting semantic vectors from the information to be searched shown in step 201, and may also be implemented by using a semantic training algorithm such as a Bert pre-training algorithm, which is not described herein again. And then extracting original picture materials from the materials of the sample information, operating a picture feature extractor, taking the entity categories to which the original picture materials and the prepared picture materials belong as first extraction targets of the picture feature extractor, learning the original picture materials and the prepared picture materials according to the first extraction targets, and taking the feature vector of the last layer in the picture feature extractor as a picture feature vector. The process of extracting the picture feature vector is the same as the process of extracting the semantic vector of the information to be searched of the picture type in step 201, and the used picture feature extractor may also be the CNN model or the VGG NET model shown in step 201, which is not described herein again. After the text feature vector and the picture feature vector are obtained, the text feature vector and the picture feature vector can be used as sample feature vectors so as to be used in a subsequent model training process.
Subsequently, the sample feature vectors are learned using a sorting algorithm to generate a plurality of sample vector groups. The sorting algorithm may be a Pairwise training algorithm: sample feature vectors are combined in the form of triplets, and the generated triplets are used as the sample vector groups. Specifically, a sample vector group is at least a triplet composed of a search word vector, a first name vector and a second name vector from the sample feature vectors, where the first name vector is a name vector that matches the search word vector and the second name vector is a name vector that does not match the search word vector. That is, a generated sample vector group may take the form < search word vector, first name vector, second name vector >.
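The triplet construction for Pairwise training can be sketched as pairing the search word vector with each matched (first) name vector and each unmatched (second) name vector; the function name and the list-based vector representation are assumptions for illustration:

```python
# Sketch of assembling Pairwise triplets <search word vector, first name
# vector, second name vector>: every matched name vector is paired with
# every unmatched name vector as a positive/negative contrast.
def build_triplets(query_vec, matched_names, unmatched_names):
    """Return all <query, positive, negative> sample vector groups."""
    return [(query_vec, pos, neg)
            for pos in matched_names
            for neg in unmatched_names]
```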
And finally, respectively inputting the plurality of sample vector groups into the semantic matching model, acquiring the output vector of the last layer in the hidden layer of the semantic matching model, obtaining a plurality of output vectors of the plurality of sample vector groups, and taking the plurality of output vectors as the vector model. Since the degree of correlation between the vectors cannot be sufficiently reflected when the similarity is calculated by directly using the obtained sample feature vectors, the sample feature vectors are trained again by using the semantic matching model, thereby increasing the accuracy of the vector model. Specifically, the Semantic matching model used may be a DSSM (Deep Structured Semantic model) model, such that a plurality of sample vector groups are input into the DSSM, an output vector of the last layer of the hidden layer of the DSSM is taken as an output vector of the present stage, and thus a plurality of output vectors of the plurality of sample vector groups are taken as a vector model.
The above process is the generation process of the vector model; afterwards, the process of obtaining the third candidate content based on the semantic vector can continue. The specific process is as follows: first, the semantic vector is input into the vector model and compared with the plurality of output vectors included in the vector model. Subsequently, at least one candidate output vector is extracted from the plurality of output vectors, where the number of values at which a candidate output vector and the semantic vector coincide at the same positions is greater than a number threshold. For example, if the number threshold is 2, the semantic vector is [ 0.2343, 0.21, 0.84, 0.86 ], output vector A is [ 0.2343, 1.35, 0.9234, 0.21 ] and output vector B is [ 0.2343, 0.21, 0.84, -3.2 ], then output vector A coincides with the semantic vector at the same position only in the value 0.2343, that is, at 1 position, which does not satisfy the number threshold; output vector B coincides with the semantic vector in the values 0.2343, 0.21 and 0.84, that is, at 3 positions, which satisfies the number threshold, so output vector B can be used as a candidate output vector. It should be noted that the process of extracting candidate output vectors can be implemented directly with the KD tree (a data structure partitioning a K-dimensional data space) scheme of the ANN (Approximate Nearest Neighbor) method. Finally, a second cosine value between the semantic vector and each candidate output vector is calculated, and target output vectors whose second cosine value is greater than the second similarity threshold are extracted as the third candidate content.
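The two-stage extraction described above — first the coincident-value count filter (which a KD-tree ANN lookup would replace in practice), then the second cosine value against the second similarity threshold — can be sketched in plain Python; this is a didactic stand-in, not a KD-tree implementation:

```python
# Stage 1: count positions where two vectors hold the same value (the text's
# candidate filter). Stage 2: cosine similarity as the "second cosine value".
def coincident_count(u, v):
    return sum(1 for a, b in zip(u, v) if a == b)

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sum(a * a for a in u) ** 0.5 * sum(b * b for b in v) ** 0.5)

def third_candidates(semantic, outputs, count_threshold, sim_threshold):
    """Keep output vectors passing the count filter, then the cosine filter."""
    candidates = [v for v in outputs if coincident_count(semantic, v) > count_threshold]
    return [v for v in candidates if cosine(semantic, v) > sim_threshold]
```

Applied to the example in the text, output vector A coincides with the semantic vector at 1 position and output vector B at 3 positions, so with number threshold 2 only B survives the first stage.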
205. Count the content intersection of the first candidate content, the second candidate content and the third candidate content, determine the similarity between each sub-content in the content intersection and the information to be searched, and divide the content intersection into a plurality of content groups according to preset similarity level intervals.
After the first candidate content, the second candidate content and the third candidate content are obtained, in order to combine them, their content intersection may be counted, and the output of the search content may be realized based on this content intersection. Considering that the search platform may also weigh other factors such as distance, quality score and user evaluation when outputting search content, and that these factors might cause some candidate content highly similar to the information to be searched to be ranked late and not pushed to the user, the content intersection may, after being determined, be divided into a plurality of content groups according to preset similarity level intervals, with the content in each group ranked according to the other factors, thereby avoiding that highly similar content cannot be shown. Specifically, preset similarity level intervals specifying the correspondence between similarity level and degree of similarity need to be set first. For example, the similarity interval corresponding to level A is not more than 1 and more than 0.8, the interval corresponding to level B is not more than 0.8 and more than 0.3, and the interval corresponding to level C is not more than 0.3 and more than 0; the content intersection can then be divided into a plurality of content groups according to the similarity calculated for each content in the above process.
Then, for each content group in the plurality of content groups, obtaining content scores corresponding to all sub-contents included in the content group, wherein the content scores are set at least according to factors such as the distance and the quality of the sub-contents. And sorting all sub-contents included in the content group in order of the content scores from high to low.
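The grouping of step 205 and the per-group sorting above can be sketched with the example level intervals from the text (A: more than 0.8, B: more than 0.3, C: more than 0); the single combined content score standing in for the distance, quality and evaluation factors is an assumption for illustration:

```python
# Bucket the content intersection into similarity level groups, then sort
# each group by content score, highest first. Interval bounds follow the
# A/B/C example in the text; scores are illustrative.
LEVELS = [("A", 0.8), ("B", 0.3), ("C", 0.0)]  # (level, exclusive lower bound)

def group_and_sort(items):
    """items: [(content, similarity, content_score)] -> {level: [content, ...]}."""
    groups = {level: [] for level, _ in LEVELS}
    for content, sim, score in items:
        for level, lower in LEVELS:
            if sim > lower:
                groups[level].append((score, content))
                break
    return {level: [c for _, c in sorted(members, reverse=True)]
            for level, members in groups.items()}
```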
206. Extracting a target content group with a similarity level higher than a level threshold from the plurality of content groups as search content, and outputting the search content.
In the embodiment of the present invention, after the division of the content groups and the sorting of the contents within the content groups are completed, a target content group having a similarity level higher than a level threshold may be extracted from the plurality of content groups as the search content, and the search content is output. Alternatively, if the presentation conditions of the search platform allow, all content groups can be presented to the user and labeled with their content levels for the user's reference. Alternatively, a specified number of top-ranked contents may be selected from each content group and output as the search content. The rule for outputting the search content may be changed freely according to different scenarios, which is not specifically limited in the present invention.
In summary, the above-mentioned output process of the whole search content can be summarized as follows:
referring to fig. 2B, the information to be searched input by the user side is obtained; the information to be searched may be a picture, text, or voice, and preset materials such as pictures, texts, and videos provided by the shop side are prepared. Then, for the preset materials, the text materials are pre-trained with BERT and the picture materials are trained with a CNN extractor, and entity extraction and semantic extraction are respectively performed on the training results to obtain the sample entities and the vector model. On the other hand, for the information to be searched, the text-type information is pre-trained with BERT and the picture-type information is trained with the CNN extractor, and entity extraction and semantic extraction are respectively performed on the training results to obtain the entity identifier and the semantic vector. Finally, the obtained sample entities and the entity identifier are matched and recalled, the vector model and the semantic vector are matched and recalled, the final search content is determined according to the similarity, and the search content is output.
According to the method provided by the embodiment of the invention, the information to be searched is parsed to obtain its information word segmentation, entity identifier, and semantic vector. First candidate content including the information word segmentation is obtained; second candidate content corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold is obtained based on the entity identifier; and the semantic vector is input into the vector model to obtain third candidate content, output by the vector model, whose similarity with the semantic vector is greater than a second similarity threshold, so that the search content is generated and output. The information to be searched is thus searched through a plurality of search links using a plurality of different methods, realizing multi-modal search of information, breaking the limitation of a single search method, avoiding the waste of search materials, and improving the success rate and accuracy of the search.
Further, as a specific implementation of the method shown in fig. 1, an embodiment of the present invention provides a search content output apparatus, as shown in fig. 3A, where the apparatus includes: a first parsing module 301, a first obtaining module 302, a second obtaining module 303, a third obtaining module 304 and an output module 305.
The first parsing module 301 is configured to parse information to be searched to obtain information word segmentation, entity identifiers, and semantic vectors of the information to be searched;
the first obtaining module 302 is configured to obtain first candidate content including the information word segmentation;
the second obtaining module 303 is configured to obtain second candidate content based on the entity identifier, where the second candidate content is at least a sample entity corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold;
the third obtaining module 304 is configured to input the semantic vector into a vector model, and obtain a third candidate content output by the vector model, where the vector model is established by using a text material, a voice material, and a video material, and a similarity between the third candidate content and the semantic vector is greater than a second similarity threshold;
the output module 305 is configured to generate search content according to the first candidate content, the second candidate content, and the third candidate content, and output the search content.
In a specific application scenario, as shown in fig. 3B, the apparatus further includes: a second parsing module 306, a generating module 307, an input module 308 and a determining module 309.
The second analyzing module 306 is configured to obtain sample information, analyze a material in the sample information to obtain a sample feature vector, where the sample information is at least history operation information and/or preset information;
the generating module 307 is configured to learn the sample feature vectors by using a sorting algorithm, and generate a plurality of sample vector groups, where the sample vector group is at least a triplet composed of a search word vector, a first name vector, and a second name vector in the sample feature vectors, the first name vector is a name vector in the sample feature vectors that matches the search word vector, and the second name vector is a name vector in the sample feature vectors that does not match the search word vector;
the input module 308 is configured to input the plurality of sample vector groups to a semantic matching model, and obtain an output vector of a last layer in a hidden layer of the semantic matching model to obtain a plurality of output vectors of the plurality of sample vector groups;
the determining module 309 is configured to use the plurality of output vectors as the vector model.
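The triplet construction performed by the generating module can be sketched as follows. This is a minimal sketch under assumptions: the sample record layout, the `matched` flag, and random negative sampling are illustrative choices, not details disclosed by the embodiment.

```python
# Sketch: assemble training triplets (search-word vector, matching name
# vector, non-matching name vector) from sample feature vectors, as a
# generating module might. Data layout and field names are illustrative.

import random

def build_triplets(samples, seed=0):
    """samples: list of (query_vec, name_vec, matched) records, where
    `matched` marks whether the name matched the search word."""
    rng = random.Random(seed)
    positives = [(q, n) for q, n, m in samples if m]
    negatives = [n for _, n, m in samples if not m]
    triplets = []
    for query_vec, pos_name_vec in positives:
        if negatives:
            neg_name_vec = rng.choice(negatives)  # sample a non-matching name
            triplets.append((query_vec, pos_name_vec, neg_name_vec))
    return triplets
```

Each triplet would then be fed to the semantic matching model, whose last hidden-layer outputs form the vector model as described above.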
In a specific application scenario, as shown in fig. 3C, the second parsing module 306 includes: the recognition unit 3061, the extraction unit 3062, the training unit 3063, the learning unit 3064, and the generation unit 3065.
The recognition unit 3061 is configured to extract the speech material from the material of the sample information, and call a speech recognition algorithm to recognize the speech material to obtain a prepared text material;
the extracting unit 3062 is configured to extract the video material from the material of the sample information, and extract the video material by using a video key frame extraction algorithm to obtain a prepared picture material;
the training unit 3063 is configured to extract an original text material from the material of the sample information, train the original text material and the prepared text material by using a semantic training algorithm, and obtain a text feature vector;
the learning unit 3064 is configured to extract an original picture material from the material of the sample information, operate a picture feature extractor, use an entity category to which the original picture material and the prepared picture material belong as a first extraction target of the picture feature extractor, learn the original picture material and the prepared picture material according to the first extraction target, and use a feature vector of a last layer in the picture feature extractor as a picture feature vector;
the generating unit 3065 is configured to use the text feature vector and the picture feature vector as the sample feature vector.
In a specific application scenario, as shown in fig. 3D, the first parsing module 301 includes: a segmentation unit 3011, a creation unit 3012 and a determination unit 3013.
The segmentation unit 3011 is configured to perform word segmentation on the information to be searched according to a word segmentation template to obtain the information word segmentation;
the establishing unit 3012 is configured to establish an entity identification task, and identify the information to be searched by using the search type of the information to be searched as an identification direction of the entity identification task to obtain the entity identifier;
the determining unit 3013 is configured to determine an information type of the information to be searched, and identify the information to be searched according to the information type to obtain the semantic vector.
In a specific application scenario, the determining unit 3013 is configured to, if the information type is a text type, identify the information to be searched by using a semantic training algorithm, and use a feature vector obtained by identification as the semantic vector; if the information type is a voice type, calling a voice recognition algorithm to recognize the information to be searched to obtain the information to be searched of a text type, recognizing the information to be searched of the text type by adopting the semantic training algorithm, and taking a feature vector obtained by recognition as the semantic vector; if the information type is a picture type, operating a picture feature extractor, taking an entity category to which the information to be searched belongs as a second extraction target of the picture feature extractor, learning the information to be searched according to the second extraction target, and taking a feature vector of the last layer in the picture feature extractor as the semantic vector; if the information type is a video type, processing the information to be searched by adopting a video key frame extraction algorithm to obtain the information to be searched of the picture type, operating the picture feature extractor, setting the second extraction target for the picture feature extractor, learning the information to be searched of the picture type according to the second extraction target, and taking the feature vector of the last layer in the picture feature extractor as the semantic vector.
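The four-way dispatch performed by the determining unit 3013 can be sketched as below. The four extractor callables are hypothetical stand-ins (for BERT-style semantic training, a speech recognition algorithm, the picture feature extractor, and the video key-frame extraction algorithm); only the branching logic mirrors the text.

```python
# Sketch: dispatch semantic-vector extraction on the information type,
# mirroring the four branches above. The four extractors passed in are
# hypothetical stand-ins for BERT encoding, speech recognition, a CNN
# picture feature extractor, and video key-frame extraction.

def semantic_vector(info, info_type, text_encode, speech_to_text,
                    picture_features, extract_key_frames):
    if info_type == "text":
        return text_encode(info)
    if info_type == "voice":
        return text_encode(speech_to_text(info))   # voice -> text -> vector
    if info_type == "picture":
        return picture_features(info)              # last-layer CNN features
    if info_type == "video":
        frames = extract_key_frames(info)          # video -> key frames
        return picture_features(frames)
    raise ValueError("unsupported information type: %r" % info_type)
```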
In a specific application scenario, the second obtaining module 303 is configured to query a preset information index corresponding to the entity identifier, and use a first sample entity indicated by the preset information index as the second candidate content; and/or acquiring an entity knowledge graph, mapping the entity identification to the entity knowledge graph for wandering, and acquiring a second sample entity which is output by the entity knowledge graph and has similarity with the entity identification greater than the first similarity threshold value as the second candidate content, wherein the entity knowledge graph describes similarity and similarity relation among a plurality of sample entities.
In a specific application scenario, the second obtaining module 303 is configured to determine that the entity identifier is a designated sample entity corresponding to the entity knowledge graph, obtain a second sample entity that has a similarity relationship with the designated sample entity and whose similarity is greater than the first similarity threshold, continue to obtain the sample entity that has a similarity relationship with the second sample entity and whose similarity is greater than the first similarity threshold, and take all the obtained second sample entities as the second candidate content until an entity that has a similarity relationship or whose similarity is greater than the first similarity threshold cannot be obtained; or, determining a target vector corresponding to the entity identifier in the semantic vector, mapping the target vector to the entity knowledge graph, calculating a first cosine value between the target vector and an entity vector of a sample entity in the entity knowledge graph, and extracting a second sample entity of which the first cosine value is greater than the first similarity threshold as the second candidate content.
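The first branch above, iteratively wandering over the entity knowledge graph until no further entity exceeds the similarity threshold, can be sketched as a breadth-first expansion. The adjacency-dict representation of the graph is an illustrative assumption, not the disclosed storage format.

```python
# Sketch: "wander" over an entity knowledge graph starting from the
# designated sample entity, repeatedly collecting neighbours whose
# similarity exceeds the first similarity threshold, as the second
# obtaining module does. Graph representation is illustrative.

from collections import deque

def wander(graph, start, threshold):
    """graph: {entity: [(neighbour, similarity), ...]}; returns all
    entities reachable through edges whose similarity > threshold."""
    found, queue = set(), deque([start])
    while queue:
        entity = queue.popleft()
        for neighbour, similarity in graph.get(entity, []):
            if similarity > threshold and neighbour not in found and neighbour != start:
                found.add(neighbour)
                queue.append(neighbour)  # keep expanding from the new entity
    return found
```

The expansion stops exactly when no remaining entity has a similarity relationship above the threshold, matching the termination condition described in the text.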
In a specific application scenario, as shown in fig. 3E, the third obtaining module 304 includes: a comparing unit 3041, an extracting unit 3042 and a calculating unit 3043.
The comparing unit 3041 is configured to input the semantic vector into the vector model, and compare the semantic vector with a plurality of output vectors included in the vector model;
the extracting unit 3042 is configured to extract at least one candidate output vector from the plurality of output vectors, where the number of numerical values of the candidate output vector and the semantic vector that coincide on the same numerical value is greater than a number threshold;
the calculating unit 3043 is configured to calculate a second cosine value of the semantic vector and the at least one candidate output vector, and extract a target output vector with the second cosine value being greater than the second similarity threshold as the third candidate content.
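The two-stage filtering carried out by units 3041 to 3043 can be sketched as below: first keep output vectors that coincide with the semantic vector in enough positions, then keep those whose cosine similarity exceeds the second similarity threshold. The thresholds and data layout are illustrative parameters, not disclosed values.

```python
# Sketch: filter the vector model's output vectors against a semantic
# vector, first by counting positions where the values coincide, then
# by cosine similarity, mirroring units 3041-3043.

import math

def cosine(u, v):
    """Standard cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def third_candidates(semantic_vec, output_vectors, count_threshold, sim_threshold):
    # comparing/extracting: keep vectors sharing enough coinciding values
    candidates = [
        v for v in output_vectors
        if sum(a == b for a, b in zip(semantic_vec, v)) > count_threshold
    ]
    # calculating: keep candidates whose cosine similarity is high enough
    return [v for v in candidates if cosine(semantic_vec, v) > sim_threshold]
```

The coincidence count serves as a cheap pre-filter so that the cosine computation only runs on a reduced candidate set.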
In a specific application scenario, as shown in fig. 3F, the output module 305 includes: the device comprises a counting unit 3051, a dividing unit 3052 and an output unit 3053.
The statistics unit 3051 is configured to count content intersections of the first candidate content, the second candidate content, and the third candidate content;
the dividing unit 3052 is configured to determine similarity between each sub-content in the content intersection and the information to be searched, and divide the content intersection into a plurality of content groups according to a preset similar level interval, where the preset similar level interval specifies a correspondence between a similar level and the similarity;
the output unit 3053 is configured to extract, as the search content, a target content group having a similarity level higher than a level threshold from among the plurality of content groups, and output the search content.
In a specific application scenario, as shown in fig. 3G, the output module 305 further includes: a scoring unit 3054 and a sorting unit 3055.
The scoring unit 3054 is configured to obtain, for each content group of the plurality of content groups, the content scores corresponding to all sub-contents in the content group, where the content scores are set at least according to factors such as the distance and quality of the sub-contents;
the sorting unit 3055 is configured to sort all the sub-contents in the content group in descending order of content score.
The device provided by the embodiment of the invention parses the information to be searched to obtain its information word segmentation, entity identifier, and semantic vector. It then obtains first candidate content including the information word segmentation, obtains, based on the entity identifier, second candidate content corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold, and inputs the semantic vector into the vector model to obtain third candidate content, output by the vector model, whose similarity with the semantic vector is greater than a second similarity threshold, so as to generate and output the search content. The information to be searched is thus searched through a plurality of search links using a plurality of different methods, realizing multi-modal search of information, breaking the limitation of a single search method, avoiding the waste of search materials, and improving the success rate and accuracy of the search.
It should be noted that other corresponding descriptions of the functional units related to the search content output apparatus provided in the embodiment of the present invention may refer to the corresponding descriptions in fig. 1 and fig. 2A to fig. 2B, and are not repeated herein.
In an exemplary embodiment, referring to fig. 4, there is further provided a device, where the device 400 includes a communication bus, a processor, a memory, and a communication interface, and may further include an input/output interface and a display device, where the functional units may communicate with each other through the bus. The memory stores a computer program, and the processor executes the program stored in the memory to perform the search content output method in the above embodiment.
A computer-readable storage medium has a computer program stored thereon which, when executed by a processor, implements the steps of the search content output method. Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by hardware, or by software plus a necessary general hardware platform. Based on such understanding, the technical solution of the present application may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute the methods of the implementation scenarios of the present application.
Those skilled in the art will appreciate that the figures are merely schematic representations of one preferred implementation scenario and that the blocks or flow diagrams in the figures are not necessarily required to practice the present application. Those skilled in the art will appreciate that the modules in the devices in the implementation scenario may be distributed in the devices in the implementation scenario according to the description of the implementation scenario, or may be located in one or more devices different from the present implementation scenario with corresponding changes. The modules of the implementation scenario may be combined into one module, or may be further split into a plurality of sub-modules.
The above application serial numbers are for description purposes only and do not represent the superiority or inferiority of the implementation scenarios.
The above disclosure is only a few specific implementation scenarios of the present application, but the present application is not limited thereto, and any variations that can be made by those skilled in the art are intended to fall within the scope of the present application.

Claims (9)

1. A search content output method, comprising:
analyzing information to be searched to obtain information word segmentation, entity identification and semantic vector of the information to be searched, wherein the entity identification is obtained by identifying the information to be searched by taking the search type of the information to be searched as an identification direction, and the semantic vector is obtained by identifying the information to be searched according to the information type of the information to be searched;
acquiring first candidate content comprising the information word segmentation;
based on the entity identification, second candidate content is obtained, wherein the second candidate content is at least a sample entity corresponding to the entity identification and/or having a similarity with the entity identification larger than a first similarity threshold;
inputting the semantic vector into a vector model, and acquiring third candidate content output by the vector model, wherein the vector model is established by adopting a text material, a voice material and a video material, and the similarity between the third candidate content and the semantic vector is greater than a second similarity threshold;
and generating search content according to the first candidate content, the second candidate content and the third candidate content, and outputting the search content.
2. The method according to claim 1, wherein before the parsing the information to be searched to obtain the information word segmentation, the entity identifier and the semantic vector of the information to be searched, the method further comprises:
acquiring sample information, analyzing materials in the sample information to obtain a sample characteristic vector, wherein the sample information is at least historical operation information and/or preset information;
learning the sample feature vectors by adopting a sorting algorithm to generate a plurality of sample vector groups, wherein the sample vector groups are at least triples consisting of search word vectors, first name vectors and second name vectors in the sample feature vectors, the first name vectors are name vectors matched with the search word vectors in the sample feature vectors, and the second name vectors are name vectors not matched with the search word vectors in the sample feature vectors;
respectively inputting the plurality of sample vector groups into a semantic matching model, and acquiring an output vector of the last layer in a hidden layer of the semantic matching model to obtain a plurality of output vectors of the plurality of sample vector groups;
using the plurality of output vectors as the vector model.
3. The method of claim 2, wherein the parsing the material in the sample information to obtain a sample feature vector comprises:
extracting the voice material from the material of the sample information, and calling a voice recognition algorithm to recognize the voice material to obtain a prepared text material;
extracting the video material from the material of the sample information, and extracting the video material by adopting a video key frame extraction algorithm to obtain a prepared picture material;
extracting original text materials from the materials of the sample information, and training the original text materials and the prepared text materials by adopting a semantic training algorithm to obtain text feature vectors;
extracting original picture materials from the materials of the sample information, operating a picture feature extractor, taking the entity categories to which the original picture materials and the prepared picture materials belong as a first extraction target of the picture feature extractor, learning the original picture materials and the prepared picture materials according to the first extraction target, and taking the feature vector of the last layer in the picture feature extractor as a picture feature vector;
and taking the text feature vector and the picture feature vector as the sample feature vector.
4. The method according to claim 1, wherein the parsing the information to be searched comprises:
if the information type is a text type, identifying the information to be searched by adopting a semantic training algorithm, and taking a feature vector obtained by identification as the semantic vector;
if the information type is a voice type, calling a voice recognition algorithm to recognize the information to be searched to obtain the information to be searched of a text type, recognizing the information to be searched of the text type by adopting the semantic training algorithm, and taking a feature vector obtained by recognition as the semantic vector;
if the information type is a picture type, operating a picture feature extractor, taking an entity category to which the information to be searched belongs as a second extraction target of the picture feature extractor, learning the information to be searched according to the second extraction target, and taking a feature vector of the last layer in the picture feature extractor as the semantic vector;
if the information type is a video type, processing the information to be searched by adopting a video key frame extraction algorithm to obtain the information to be searched of the picture type, operating the picture feature extractor, setting the second extraction target for the picture feature extractor, learning the information to be searched of the picture type according to the second extraction target, and taking the feature vector of the last layer in the picture feature extractor as the semantic vector.
5. The method of claim 1, wherein obtaining the second candidate content based on the entity identity comprises:
querying a preset information index corresponding to the entity identifier, and taking a first sample entity indicated by the preset information index as the second candidate content; and/or,
and acquiring an entity knowledge graph, mapping the entity identification to the entity knowledge graph for wandering, and acquiring a second sample entity which is output by the entity knowledge graph and has similarity with the entity identification greater than the first similarity threshold value as the second candidate content, wherein the entity knowledge graph describes similarity and similarity relation among a plurality of sample entities.
6. The method of claim 5, wherein the mapping the entity identifier to the entity knowledge graph for wandering, and obtaining a second sample entity output by the entity knowledge graph with a similarity greater than the first similarity threshold with the entity identifier as the second candidate content comprises:
determining an appointed sample entity of the entity identification corresponding to the entity knowledge graph, acquiring a second sample entity which has a similarity relation with the appointed sample entity and the similarity is greater than the first similarity threshold, continuously acquiring the sample entity which has the similarity relation with the second sample entity and the similarity is greater than the first similarity threshold until the entity which has the similarity relation or the similarity is greater than the first similarity threshold cannot be acquired, and taking all the acquired second sample entities as the second candidate content; or,
determining a target vector corresponding to the entity identifier in the semantic vector, mapping the target vector to the entity knowledge graph, calculating a first cosine value between the target vector and an entity vector of a sample entity in the entity knowledge graph, and extracting a second sample entity with the first cosine value larger than the first similarity threshold as the second candidate content.
7. The method of claim 1, wherein inputting the semantic vector into a vector model and obtaining a third candidate content output by the vector model comprises:
inputting the semantic vector into the vector model, and comparing the semantic vector with a plurality of output vectors included in the vector model;
extracting at least one candidate output vector from the plurality of output vectors, wherein the number of numerical values of the candidate output vector and the semantic vector which coincide on the same numerical value is greater than a number threshold value;
and calculating a second cosine value of the semantic vector and the at least one candidate output vector, and extracting a target output vector of which the second cosine value is larger than the second similarity threshold value as the third candidate content.
8. The method of claim 1, wherein generating search content based on the first candidate content, the second candidate content, and the third candidate content, and outputting the search content comprises:
counting the content intersection of the first candidate content, the second candidate content and the third candidate content;
respectively determining the similarity between each sub-content in the content intersection and the information to be searched, and dividing the content intersection into a plurality of content groups according to a preset similar grade interval, wherein the preset similar grade interval specifies the corresponding relation between the similar grade and the similarity;
and extracting a target content group with the similarity level higher than a level threshold value from the plurality of content groups as the search content, and outputting the search content.
9. A search content output apparatus characterized by comprising:
the system comprises a first analysis module, a second analysis module and a third analysis module, wherein the first analysis module is used for analyzing information to be searched to obtain information word segmentation, entity identification and semantic vector of the information to be searched, the entity identification is obtained by identifying the information to be searched by taking the search type of the information to be searched as an identification direction, and the semantic vector is obtained by identifying the information to be searched according to the information type of the information to be searched;
the first acquisition module is used for acquiring first candidate content comprising the information word segmentation;
a second obtaining module, configured to obtain second candidate content based on the entity identifier, where the second candidate content is at least a sample entity corresponding to the entity identifier and/or having a similarity with the entity identifier greater than a first similarity threshold;
a third obtaining module, configured to input the semantic vector into a vector model, and obtain a third candidate content output by the vector model, where the vector model is established using a text material, a voice material, and a video material, and a similarity between the third candidate content and the semantic vector is greater than a second similarity threshold;
and the output module is used for generating search content according to the first candidate content, the second candidate content and the third candidate content and outputting the search content.
CN202010497756.7A 2020-06-04 2020-06-04 Search content output method and device, computer equipment and readable storage medium Active CN111400607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010497756.7A CN111400607B (en) 2020-06-04 2020-06-04 Search content output method and device, computer equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111400607A CN111400607A (en) 2020-07-10
CN111400607B true CN111400607B (en) 2020-11-10

Family

ID=71430029

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010497756.7A Active CN111400607B (en) 2020-06-04 2020-06-04 Search content output method and device, computer equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111400607B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112015949B (en) * 2020-08-26 2023-08-29 Tencent Technology (Shanghai) Co., Ltd. Video generation method and device, storage medium and electronic equipment
TWI774117B (en) * 2020-11-09 2022-08-11 Institute for Information Industry Knowledge graph establishment system and knowledge graph establishment method
CN112732883A (en) * 2020-12-31 2021-04-30 Ping An Technology (Shenzhen) Co., Ltd. Fuzzy matching method and device based on knowledge graph, and computer equipment
CN113392648B (en) * 2021-06-02 2022-10-18 Beijing Sankuai Online Technology Co., Ltd. Entity relationship acquisition method and device
CN113204669B (en) * 2021-06-08 2022-12-06 Yitexinfang (Shenzhen) Technology Co., Ltd. Short video search recommendation method, system and storage medium based on voice recognition
CN113641857A (en) * 2021-08-13 2021-11-12 Samsung Electronics (China) R&D Center Visual media personalized search method and device
CN113505262B (en) * 2021-08-17 2022-03-29 Shenzhen Huasheng Medical Technology Co., Ltd. Ultrasonic image searching method and device, ultrasonic equipment and storage medium
CN113806487B (en) * 2021-09-23 2023-09-05 Ping An Technology (Shenzhen) Co., Ltd. Semantic search method, device, equipment and storage medium based on neural network
CN114547474A (en) * 2022-04-21 2022-05-27 Beijing Teddy Bear Mobile Technology Co., Ltd. Data searching method, system, electronic equipment and storage medium

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN102929925A (en) * 2012-09-20 2013-02-13 Baidu Online Network Technology (Beijing) Co., Ltd. Search method and device based on browsing content
CN103678418B (en) * 2012-09-25 2017-06-06 Fujitsu Limited Information processing method and information processing device
CN104933081B (en) * 2014-03-21 2018-06-29 Alibaba Group Holding Limited Search suggestion providing method and device
CN106649786B (en) * 2016-12-28 2020-04-07 Beijing Baidu Netcom Science and Technology Co., Ltd. Answer retrieval method and device based on deep question answering
CN108052659B (en) * 2017-12-28 2022-03-11 Beijing Baidu Netcom Science and Technology Co., Ltd. Search method and device based on artificial intelligence, and electronic equipment
CN109063221B (en) * 2018-11-02 2021-04-09 Beijing Baidu Netcom Science and Technology Co., Ltd. Query intention identification method and device based on mixed strategy

Also Published As

Publication number Publication date
CN111400607A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111400607B (en) Search content output method and device, computer equipment and readable storage medium
CN110147726B (en) Service quality inspection method and device, storage medium and electronic device
CN108304439B (en) Semantic model optimization method and device, intelligent device and storage medium
CN110209897B (en) Intelligent dialogue method, device, storage medium and equipment
US20170109615A1 (en) Systems and Methods for Automatically Classifying Businesses from Images
KR20180122926A (en) Method for providing learning service and apparatus thereof
CN107766873A (en) The sample classification method of multi-tag zero based on sequence study
TW201504829A (en) Method and system for searching images
CN111783903B (en) Text processing method, text model processing method and device and computer equipment
CN112699645B (en) Corpus labeling method, apparatus and device
CN114049493B (en) Image recognition method and system based on intelligent agent atlas and readable storage medium
CN113094478B (en) Expression reply method, device, equipment and storage medium
CN113704507B (en) Data processing method, computer device and readable storage medium
CN111125457A (en) Deep cross-modal Hash retrieval method and device
CN113806588A (en) Method and device for searching video
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN115712780A (en) Information pushing method and device based on cloud computing and big data
CN111368066B (en) Method, apparatus and computer readable storage medium for obtaining dialogue abstract
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
CN111859925B (en) Emotion analysis system and method based on probability emotion dictionary
CN113537206B (en) Push data detection method, push data detection device, computer equipment and storage medium
CN116303951A (en) Dialogue processing method, device, electronic equipment and storage medium
CN115983873A (en) Big data based user data analysis management system and method
CN116955707A (en) Content tag determination method, device, equipment, medium and program product
CN109446330B (en) Network service platform emotional tendency identification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant