CN113010771A - Training method and device for personalized semantic vector model in search engine


Info

Publication number
CN113010771A
Authority
CN
China
Prior art keywords
query
vector
document
feature
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110191195.2A
Other languages
Chinese (zh)
Other versions
CN113010771B (en)
Inventor
陈咨尧
陈强
梁龙军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110191195.2A
Publication of CN113010771A
Application granted
Publication of CN113010771B
Active legal status
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application provide a training method and apparatus for a personalized semantic vector model in a search engine, relating to the technical field of blockchain application services. The training method includes: acquiring a first query feature and M document features, where M > 0; converting the first query feature into a first query vector, and converting the M document features into M document vectors respectively; and training the personalized semantic vector model based on the first query vector and the M document vectors, with a preset similarity difference as the training target. Because the personalized semantic vector model takes the semantic relevance of the words and sentences input by the user into account, the recommendation accuracy of the search engine and the user experience can be improved.

Description

Training method and device for personalized semantic vector model in search engine
Technical Field
The embodiments of the present application relate to the technical field of blockchain application services, and in particular to a training method and apparatus for a personalized semantic vector model in a search engine.
Background
At present, a search engine generally segments the words and sentences input by a user and then, through an inverted index, takes the documents with the highest similarity scores in a document library as the recommended documents, so that the user can quickly find the documents they need. However, the recommended documents obtained through word segmentation may diverge from the documents the user is actually searching for. For example, suppose the words and sentences input by the user are "road motor vehicle traffic rules". In the above manner, "road motor vehicle traffic rules" is first split into "roads", "motor vehicles" and "traffic rules", and documents hitting these terms are then selected from the document library through the inverted index; for example, the documents "what traffic rules are violated by motor vehicles speeding on non-motor-vehicle roads", "motor vehicles and non-motor vehicles in traffic rules" and "accidents between motor vehicles and electric bicycles" are preferentially recommended. In general, however, what the user wants recommended is a document about the specific traffic rules themselves, such as the document "what are the road traffic driving rules". Therefore, although the actually recommended documents are related to the words and sentences input by the user, they diverge to some extent from the documents actually required, which degrades the user experience.
Disclosure of Invention
The embodiments of the present application provide a training method and apparatus for a personalized semantic vector model in a search engine. Through the personalized semantic vector model, the semantic relevance of the words and sentences input by the user can be taken into account, which improves the recommendation accuracy of the search engine and the user experience. For example, if the words and sentences input by the user are "road motor vehicle traffic rules", the personalized semantic vector model semantically associates the segmented features, so that the finally recommended document may be "what are the road traffic driving rules" or the like.
In one aspect, an embodiment of the present application provides a method for training a personalized semantic vector model in a search engine, including:
acquiring a first query feature and M document features, where M > 0;
converting the first query feature into a first query vector, and converting the M document features into M document vectors respectively;
training a personalized semantic vector model based on the first query vector and the M document vectors, with a preset similarity difference as the training target;
where the similarity difference is the difference between the similarity score of a positive example vector with the query vector and the similarity score of a negative example vector with the query vector; the positive example vector is converted from a document feature that forms a positive example with the query feature, and the negative example vector is converted from a document feature that forms a negative example with the query feature.
In another aspect, an embodiment of the present application provides a training apparatus for a personalized semantic vector model in a search engine, including:
an acquisition unit, configured to acquire a first query feature and M document features, where M > 0;
a conversion unit, configured to convert the first query feature into a first query vector and to convert the M document features into M document vectors respectively;
a training unit, configured to train a personalized semantic vector model based on the first query vector and the M document vectors, with a preset similarity difference as the training target;
where the similarity difference is the difference between the similarity score of a positive example vector with the query vector and the similarity score of a negative example vector with the query vector; the positive example vector is converted from a document feature that forms a positive example with the query feature, and the negative example vector is converted from a document feature that forms a negative example with the query feature.
In some implementations, the personalized semantic vector model includes a query-side encoder, a positive-example-side encoder, a negative-example-side encoder, and a scoring module. The conversion unit is specifically configured to: convert the first query feature into the first query vector using the query-side encoder; convert the document feature that forms a positive example with the first query feature, among the M document features, into a first positive example vector among the M document vectors using the positive-example-side encoder; and convert the document feature that forms a negative example with the first query feature into a first negative example vector among the M document vectors using the negative-example-side encoder. The training unit is specifically configured to: calculate a first difference using the scoring module, the first difference being the difference between the similarity score of the first positive example vector with the first query vector and the similarity score of the first negative example vector with the first query vector; and train the personalized semantic vector model based on the first difference, with the similarity difference as the training target.
In some implementations, the negative-example-side encoder includes a negative-example-side text conversion module, a negative-example-side non-text conversion module, and a negative-example-side vector fusion module; the negative-example-side encoder and the positive-example-side encoder share parameters.
In some implementations, the query-side encoder and the positive-example-side encoder do not share parameters, and the query-side encoder and the negative-example-side encoder do not share parameters.
In some implementations, the query-side encoder includes a query-side text conversion module, a query-side non-text conversion module, and a query-side vector fusion module, and the first query feature includes a first text feature and a first non-text feature. The conversion unit is specifically configured to: convert the first text feature into a first text vector using the query-side text conversion module; convert the first non-text feature into a first non-text vector using the query-side non-text conversion module; and fuse the first text vector and the first non-text vector using the query-side vector fusion module to obtain the first query vector.
In some implementations, the first non-text feature includes at least one of the following features of the user: age, gender, profile, educational background, language used, and mobile phone operating system used.
In some implementations, the positive-example-side encoder includes a positive-example-side text conversion module, a positive-example-side non-text conversion module, and a positive-example-side vector fusion module, and the document feature that forms a positive example with the first query feature among the M document features includes a second text feature and a second non-text feature. The conversion unit is specifically configured to: convert the second text feature into a second text vector using the positive-example-side text conversion module; convert the second non-text feature into a second non-text vector using the positive-example-side non-text conversion module; and fuse the second text vector and the second non-text vector using the positive-example-side vector fusion module to obtain the first positive example vector.
In some implementations, the second non-text feature includes at least one of: the authority of the document content, the authority of the document author, and the number of followers of the official account associated with the document.
In some implementations, the obtaining unit is specifically configured to:
acquire X non-text features in the training data, where X > 0; randomly sample from the X non-text features N times to obtain N groups of non-text features, each group containing Y non-text features, where N > 1 and X > Y > 0; and use the N groups of non-text features as N training sets of the personalized semantic vector model, where the Y non-text features in one of the N groups serve, in one of the N training sets, as the non-text features of the first query feature and as the document non-text features of the M document features.
In some implementations, the apparatus may further include a display unit configured to:
acquire a query request through a query window, the query request including a second query feature;
convert the second query feature into a second query vector using the personalized semantic vector model;
select, from a document library, the K document features with the highest similarity scores with respect to the second query vector, where K > 0;
and display the K document features below the query window in descending order of similarity score.
In another aspect, the present application provides an electronic device, including:
a processor adapted to execute computer instructions; and
a computer-readable storage medium storing computer instructions adapted to be loaded by the processor to perform the above training method for a personalized semantic vector model in a search engine.
In another aspect, an embodiment of the present application provides a computer-readable storage medium storing computer instructions which, when read and executed by a processor of a computer device, cause the computer device to perform the above training method for a personalized semantic vector model in a search engine.
In another aspect, embodiments of the present application provide a computer program product or computer program including computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above training method for a personalized semantic vector model in a search engine.
In the embodiments of the present application, a personalized semantic vector model is trained based on a first query vector and M document vectors, with a preset similarity difference as the training target. Because the semantic relevance of the words and sentences input by the user is taken into account while training the personalized semantic vector model, the recommendation accuracy of the search engine and the user experience can be improved during recommendation. For example, if the words and sentences input by the user are "road motor vehicle traffic rules", the personalized semantic vector model semantically associates the segmented features, so that the finally recommended document may be "what are the road traffic driving rules" or the like.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from them without creative effort.
Fig. 1 is a schematic structural diagram of a two-tower model provided in an embodiment of the present application.
Fig. 2 is an example of a display interface provided in an embodiment of the present application.
Fig. 3 is a schematic block diagram of a system framework of a search engine provided by an embodiment of the present application.
Fig. 4 is a schematic flowchart of a training method of a personalized semantic vector model in a search engine according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a personalized semantic vector model provided in an embodiment of the present application.
Fig. 6 is a schematic structural diagram of a training apparatus for a personalized semantic vector model in a search engine according to an embodiment of the present application.
Fig. 7 is a schematic block diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The scheme provided by the present application may relate to the technical field of blockchains.
A blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. Essentially a decentralized database, a blockchain is a chain of data blocks associated by cryptographic methods, each containing the information of a batch of network transactions, used to verify the validity of the information (anti-counterfeiting) and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product services layer, and an application services layer.
The blockchain underlying platform may include processing modules such as user management, basic services, smart contracts, and operation monitoring. The user management module is responsible for the identity information management of all blockchain participants, including the generation and maintenance of public and private keys (account management), key management, and the maintenance of the correspondence between users' real identities and blockchain addresses (authority management); with authorization, it can supervise and audit the transactions of certain real identities and provide rule configuration for risk control (risk-control auditing). The basic services module is deployed on all blockchain node devices to verify the validity of service requests and, after consensus is reached on a valid request, record it to storage; for a new service request, the basic services module first performs interface adaptation, parsing, and authentication (interface adaptation), then encrypts the service information via a consensus algorithm (consensus management), transmits it completely and consistently to the shared ledger (network communication) after encryption, and records it in storage. The smart contract module is responsible for contract registration and issuance, contract triggering, and contract execution; developers can define contract logic through a programming language and publish it to the blockchain (contract registration), and execution is triggered by key calls or other events according to the logic of the contract clauses, completing the contract logic; the module also provides contract upgrading and cancellation. The operation monitoring module is mainly responsible for deployment, configuration modification, contract setting, and cloud adaptation during product release, as well as the visual output of real-time status during product operation, such as alarms, monitoring network conditions, and monitoring the health status of node devices.
The platform product services layer provides the basic capabilities and implementation framework of typical applications; based on these basic capabilities, developers can superimpose the characteristics of their business to complete the blockchain implementation of the business logic. The application services layer provides blockchain-based application services for business participants to use.
More specifically, the scheme provided by the embodiments of the present application is applicable to the technical field of blockchain application services.
To facilitate understanding of the present application, terms related to the present application are described below.
Query (Query): refers to the question input by a user in a search.
Document (Document): refers to a result returned for the question input by the user, typically in the form of a document title or web page.
Click positive example: refers to a query-document pair formed by a query input by a user and a document that the user clicked in the search.
Strong negative example: refers to a query-document pair consisting of a query input by a user and a document exposed but not clicked in the search, or a query-document pair in which the query and the document are similar in wording but essentially different in meaning.
Weak negative example: refers to a query-document pair consisting of a query input by a user and a randomly selected document.
Batch (Batch): the amount of data fed to the model at each step when training the model.
Transformation (Transformer) model: see, for example, the model provided in the article "Attention Is All You Need".
Bidirectional Encoder Representations from Transformers (BERT) model: a language representation model. BERT is designed to pre-train deep bidirectional representations by jointly conditioning on left and right context in all layers. As a result, the pre-trained BERT can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks (such as question answering and language inference) without extensive task-specific modification of the model structure. See, for example, the model in the article "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding".
Fig. 1 is a schematic structural diagram of a two-tower model 100 provided in an embodiment of the present application.
As shown in Fig. 1, the two-tower model 100 may include a query-side parsing module 111, a query-side encoder 112, a document-side parsing module 121, a document-side encoder 122, and a similarity determination unit 130. The query-side parsing module 111 processes the input question into a query feature, and the document-side parsing module 121 processes a document into a document feature; the query-side encoder 112 converts the query feature into a query vector, the document-side encoder 122 converts the document feature into a document vector, and the similarity determination unit 130 computes a similarity score between the received query vector and document vector. In the training phase, the two-tower model 100 may be trained with positive examples, strong negative examples, and weak negative examples. In the prediction phase, the two-tower model 100 computes the distance between the query vector and the document vector and then determines, based on that distance, whether the query and the document are similar.
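To make the two-tower structure concrete, the following is a minimal sketch in Python, assuming simple bag-of-words embedding encoders; the class and parameter names are illustrative assumptions, not the reference implementation of Fig. 1:

```python
import torch
import torch.nn.functional as F
from torch import nn

class TwoTowerModel(nn.Module):
    """Minimal two-tower sketch: separate query/document encoders, cosine score."""

    def __init__(self, vocab_size: int = 30000, dim: int = 128):
        super().__init__()
        # The two towers hold separate parameters, matching the
        # query-side / document-side split of Fig. 1 (an assumption here).
        self.query_encoder = nn.EmbeddingBag(vocab_size, dim)  # stands in for 112
        self.doc_encoder = nn.EmbeddingBag(vocab_size, dim)    # stands in for 122

    def forward(self, query_ids: torch.Tensor, doc_ids: torch.Tensor) -> torch.Tensor:
        q = F.normalize(self.query_encoder(query_ids), dim=-1)  # (B, dim)
        d = F.normalize(self.doc_encoder(doc_ids), dim=-1)      # (B, dim)
        return (q * d).sum(dim=-1)  # cosine similarity score, shape (B,)
```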
It should be noted that, in the embodiments of the present application, a feature, whether a query feature or a document feature, may include a text (Text) feature and a non-text feature, which is not specifically limited herein. For example, the non-text features may include social (Social) features and address (Location) features; non-text features may also be referred to as searcher and context features. For the non-text part of a query feature, features related to the user may be selected; for example, the social feature may be attribute information of the user, and the address feature may be the user's address or the location of the electronic device. For the non-text part of a document feature, features related to the document may be selected; for example, the social feature may be social information of the document, such as the authority of the document or of its author, and the address feature may be the source of the document, such as the official account on which the document is published.
In a specific implementation, after the words and sentences input by the user are segmented, the documents with the highest similarity scores in the document library are taken as the recommended documents through an inverted index, so that the user can quickly find the documents they need. However, the recommended documents obtained through word segmentation may diverge from the documents the user is actually searching for. For example, suppose the words and sentences input by the user are "road motor vehicle traffic rules". In the above manner, "road motor vehicle traffic rules" is first split into "roads", "motor vehicles" and "traffic rules", and documents hitting these terms are then selected from the document library through the inverted index; for example, the documents "what traffic rules are violated by motor vehicles speeding on non-motor-vehicle roads", "motor vehicles and non-motor vehicles in traffic rules" and "accidents between motor vehicles and electric bicycles" are preferentially recommended. In general, however, what the user wants recommended is a document about the specific traffic rules themselves, such as the document "what are the road traffic driving rules". Therefore, although the actually recommended documents are related to the words and sentences input by the user, they diverge to some extent from the documents actually required, which degrades the user experience.
In addition, in the training phase of the two-tower model 100, positive examples, strong negative examples, weak negative examples, and non-text features (including the user's portrait features and the documents' non-text features) may be used to train the two-tower model 100, with different non-text features corresponding to different scenes. However, due to the limitations of the training data, the two-tower model 100 cannot cover all actual scenes. Consequently, the documents actually recommended by the two-tower model 100, although related to the words and sentences input by the user, diverge to some extent from the documents the user actually requires, which degrades the user experience.
The embodiments of the present application provide a training method for a personalized semantic vector model in a search engine. Through the personalized semantic vector model, the semantic relevance of the words and sentences input by the user can be taken into account, which improves the recommendation accuracy of the search engine and the user experience. For example, if the words and sentences input by the user are "road motor vehicle traffic rules", the personalized semantic vector model semantically associates the segmented features, so that the finally recommended document may be "what are the road traffic driving rules" or the like.
Specifically, the personalized semantic vector model maps text features to the corresponding vector space with a Transformer model, and maps the query non-text features and document non-text features to the corresponding vector space with respective fully connected layers; the document features of the positive and negative examples are placed into one batch for parallel computation, and the personalized semantic vector model obtained through such training improves the performance of the vectors it generates in large-scale recall.
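The in-batch parallel computation described above can be sketched as follows; this is a hypothetical illustration, since the patent does not specify tensor shapes or function names:

```python
import torch

def encode_docs_in_one_batch(doc_encoder, pos_docs: torch.Tensor,
                             neg_docs: torch.Tensor):
    """Run positive and negative document features through the shared document
    encoder in a single batch, then split the vectors back apart.
    Assumes pos_docs and neg_docs carry the same per-example shape."""
    batch = torch.cat([pos_docs, neg_docs], dim=0)   # one parallel forward pass
    vectors = doc_encoder(batch)                     # (2B, dim)
    pos_vecs, neg_vecs = vectors.chunk(2, dim=0)     # back to (B, dim) each
    return pos_vecs, neg_vecs
```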
In addition, in terms of data construction, various actual scenes are simulated by sampling, which improves the robustness of the personalized semantic vector model in actual use.
It should be noted that the personalized semantic vector model may be applied to, for example, an applet search engine or a web-based search engine, which is not specifically limited in the embodiments of the present application. Fig. 2 is an example of a display interface provided in an embodiment of the present application. As shown in Fig. 2, in the display interface 200, a query keyword or query sentence (i.e., the query or query feature in the present application) is input through a query window; using the personalized semantic vector model, a thumbnail of recommended document 1 may be displayed in the display area of recommended document 1, a thumbnail of recommended document 2 in the display area of recommended document 2, a thumbnail of recommended document 3 in the display area of recommended document 3, and a thumbnail of recommended document 4 in the display area of recommended document 4. The user can then directly click a display area to enter a page showing the complete recommended document, which helps the user quickly find the target document.
Fig. 3 is a schematic block diagram of a system framework 300 of a search engine provided by an embodiment of the present application.
As shown in Fig. 3, the system framework 300 of the search engine includes a data source 310 generated by a web crawler (Crawler), an offline vectorization task model 320 for vectorizing the data in the data source, an indexer (Indexer) 330 for the generated index library, a vector recall module 340 for searching for the documents with the highest similarity, and an online vectorization task model 350 for vectorizing query features. In implementation, web pages are first collected from the network, and the raw data is processed to form the data source 310; after passing through the offline vectorization task model 320 and the indexer 330, the data in the data source 310 forms the index part and awaits the input of query keywords; finally, according to the query features, the vector recall module 340 computes the query results sorted in descending order of the similarity between the documents and the query features, and provides the sorted results to the user.
Web crawler: specifically, starting from a pre-established Uniform Resource Locator (URL) list, a breadth-first or depth-first strategy may be adopted to traverse the network and download documents. For example, after accessing a web page, the web crawler analyzes it to extract new URLs and adds them to the access list, which is a hyperlink queue or stack maintained in the system; it then recursively repeats the access until the queue or stack is empty. Whether the web crawler is reasonably designed directly affects the efficiency with which it accesses the network and thus the quality of the search database. In addition, the influence of the web crawler on the network and on the visited sites must be considered in its design. The web crawler is the main means of acquiring resources in the whole search engine, but its efficiency does not depend entirely on the designer: the quality of the network also affects the crawling effect to some extent. The result of the web crawler's work is to store the crawled web pages into the data source.
The index library is a huge database: the web pages captured by the web crawler are indexed by the indexer and placed into the index library. A search engine may, of course, build the index in different ways. The index should include a forward list from pages to index terms, as well as an inverted list built for the necessary data items, such as from keywords to web pages. Some search engines do not index all the words of an entire web page document; some simply analyze the title or the first few paragraphs and index only those, because the words in the title or opening paragraphs already characterize the document well. In addition, words at different positions in a document are given different weights when the index is built: a word in the title is more important than a word in an ordinary paragraph and represents the document's semantics well, so its weight in the mathematical model is increased so that it can be clearly matched against the query terms when the vector recall module 340 performs a search. These tasks are performed by the indexer 330, and their end result is the index library.
The vector recall module 340 is responsible for responding to users' retrieval requests and tracking their retrieval behavior. When a user submits a request, the vector recall module 340 obtains the data of the related documents and index terms from the index and then computes, with the corresponding algorithm, the relevance between the web pages and the query terms in a broad sense. Finally, the results are sorted by relevance and output to the user. The user receives the results page and responds to it (e.g., visits a web page within the results page), and this information is tracked and recorded by the vector recall module 340. In addition, processing the keywords before searching is also important; it may specifically include query expansion and query reformulation. The former mainly expands the keywords with synonyms or near-synonyms; for example, when the keyword searched by the user is "computer", the system can also return documents related to synonyms such as "PC" and "microcomputer", improving search accuracy. The latter means modifying the keywords appropriately using the user's feedback information. Finally, it should be noted that the document ranking obtained by the search engine is not the result of a simple mathematical model but a final ranking that combines many other relevant factors. One of the most important factors that may be considered here is the importance of the web page, and the algorithm for determining page importance is also incorporated in the vector recall module 340 in the broad sense.
The indexer 330 can periodically run offline vectorization tasks using the personalized semantic vector model and periodically build indexes on the vector recall module. Specifically, indexes can be built over various periods; for example, an index over the data of the last week may be updated every day, and an index over the data of the last three months or more may be updated every week. Optionally, in the present application, an index built over the data of the last week may be called a hot index, and an index built over the data of the last three months or more may be called a cold index.
The offline vectorization task model 320 and the online vectorization task model 350 may be implemented by the personalized semantic vector model of the present application. On the offline service side, the documents in the index library are converted into low-dimensional document vectors by the personalized semantic vector model, forming the unique identifiers of the document vectors and an index library of document vectors; on the online service side, the query feature is inferred into a low-dimensional query vector by the personalized semantic vector model in real time, and the K documents closest to this vector are recalled from the index library by a vector retrieval tool as the input of the ranking stage of the search system. By way of example, the vector retrieval tool may be Faiss, a clustering and similarity search library open-sourced by the Facebook AI team, which provides efficient similarity search and clustering of dense vectors and supports billion-scale vector search.
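As an illustration of this offline/online split around Faiss, the following sketch builds an inner-product index over normalized document vectors offline and recalls the K nearest documents for a query vector online; the dimensionality and the random stand-in vectors are assumptions:

```python
import faiss
import numpy as np

dim = 128  # assumed dimensionality of the low-dimensional vectors

# Offline side: index the document vectors produced by the model
# (random stand-ins here) under their unique identifiers.
doc_vecs = np.random.rand(100000, dim).astype("float32")
doc_ids = np.arange(100000, dtype="int64")
faiss.normalize_L2(doc_vecs)                       # L2-normalize so that
index = faiss.IndexIDMap(faiss.IndexFlatIP(dim))   # inner product equals cosine
index.add_with_ids(doc_vecs, doc_ids)

# Online side: embed the query in real time, then recall the K closest
# documents as input to the ranking stage.
query_vec = np.random.rand(1, dim).astype("float32")  # stand-in for model output
faiss.normalize_L2(query_vec)
scores, ids = index.search(query_vec, 10)             # top K=10 ids and scores
```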
Fig. 4 is a schematic flowchart of a training method 400 for a personalized semantic vector model in a search engine according to an embodiment of the present application. It should be noted that the solutions provided in the embodiments of the present application can be implemented by any electronic device having data processing capability. For example, the electronic device may be implemented as a server. The server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, big data, and artificial intelligence platforms; the servers may be connected directly or indirectly through wired or wireless communication, which is not limited herein.
As shown in fig. 4, the method 400 may include:
S410, acquiring a first query feature and M document features, where M > 0. In a specific implementation, the personalized semantic vector model can be trained using first query features and document features extracted from users' search behavior logs.
S420, converting the first query feature into a first query vector, and respectively converting the M document features into M document vectors;
S430, training a personalized semantic vector model based on the first query vector and the M document vectors, with a preset similarity difference as the training target;
where the similarity difference is the difference between the similarity score of the positive example vector with the query vector and the similarity score of the negative example vector with the query vector; the positive example vector is converted from a document feature that forms a positive example with the query feature, and the negative example vector is converted from a document feature that forms a negative example with the query feature. For example, the cosine of the angle between the positive example vector and the query vector may be used as the similarity score between them, and likewise the cosine of the angle between the negative example vector and the query vector may be used as their similarity score.
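One plausible realization of this training target (a sketch, not the patent's stated loss) is a hinge on the cosine-similarity difference, penalizing triples whose difference falls below the preset value:

```python
import torch
import torch.nn.functional as F

def similarity_difference_loss(q: torch.Tensor, pos: torch.Tensor,
                               neg: torch.Tensor, target_diff: float = 0.3):
    """Hinge-style loss pushing cos(q, pos) - cos(q, neg) up to a preset
    similarity difference; target_diff is an assumed hyperparameter."""
    pos_score = F.cosine_similarity(q, pos, dim=-1)
    neg_score = F.cosine_similarity(q, neg, dim=-1)
    return F.relu(target_diff - (pos_score - neg_score)).mean()
```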
In the embodiments of the present application, a personalized semantic vector model is trained based on a first query vector and M document vectors, with a preset similarity difference as the training target. Because the semantic relevance of the words and sentences input by the user is taken into account while training the personalized semantic vector model, the recommendation accuracy of the search engine and the user experience can be improved during recommendation. For example, if the words and sentences input by the user are "road motor vehicle traffic rules", the personalized semantic vector model semantically associates the segmented features, so that the finally recommended document may be "what are the road traffic driving rules" or the like.
Fig. 5 is a schematic structural diagram of a personalized semantic vector model 500 provided in an embodiment of the present application.
As shown in Fig. 5, the personalized semantic vector model 500 includes a query-side encoder (Encode) 510, a positive-example-side encoder (Pos Doc Encode) 520, a negative-example-side encoder (Neg Doc Encode) 530, and a scoring (Score) module 540. The positive-example-side encoder 520 processes the document features that form positive examples with the query feature, the negative-example-side encoder 530 processes the document features that form negative examples with the query feature, and the scoring module 540, which may also be called a pairwise score function (Pairwise Score Function) module, computes the difference between two similarity scores. On this basis, the query-side encoder 510 converts the first query feature into the first query vector; the positive-example-side encoder 520 converts the document feature that forms a positive example with the first query feature, among the M document features, into a first positive example vector among the M document vectors; the negative-example-side encoder 530 converts the document feature that forms a negative example with the query feature into a first negative example vector among the M document vectors; the scoring module 540 calculates a first difference, which is the difference between the similarity score of the first positive example vector with the first query vector and the similarity score of the first negative example vector with the first query vector; and, based on the first difference, the personalized semantic vector model 500 is trained with the similarity difference as the training target.
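A compact sketch of this three-encoder layout follows; the class and argument names are assumptions, and cosine similarity stands in for whatever similarity score the scoring module 540 uses:

```python
import torch.nn.functional as F
from torch import nn

class PersonalizedSemanticVectorModel(nn.Module):
    """Sketch of the Fig. 5 layout; names are illustrative assumptions."""

    def __init__(self, query_encoder: nn.Module, doc_encoder: nn.Module):
        super().__init__()
        self.query_encoder = query_encoder   # query-side encoder 510
        self.pos_doc_encoder = doc_encoder   # positive-example-side encoder 520
        self.neg_doc_encoder = doc_encoder   # encoder 530, sharing 520's parameters

    def forward(self, query_feat, pos_doc_feat, neg_doc_feat):
        q = self.query_encoder(query_feat)
        p = self.pos_doc_encoder(pos_doc_feat)
        n = self.neg_doc_encoder(neg_doc_feat)
        # Scoring module 540: the difference of the two similarity scores.
        return F.cosine_similarity(q, p, dim=-1) - F.cosine_similarity(q, n, dim=-1)
```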
In some embodiments, as shown in Fig. 5, the query-side encoder 510 includes a query-side text conversion module 511, a query-side non-text conversion module 512, and a query-side vector fusion module 513, and the first query feature includes a first text feature and a first non-text feature. On this basis, the query-side text conversion module 511 converts the first text feature into a first text vector; the query-side non-text conversion module 512 converts the first non-text feature into a first non-text vector; and the query-side vector fusion module 513 fuses the first text vector and the first non-text vector to obtain the first query vector. Optionally, the first non-text feature includes at least one of the following features of the user: age, gender, profile, educational background, language used, and mobile phone operating system used. Of course, the specific features listed for the first non-text feature are merely examples and should not be construed as limiting the application.
In some embodiments, the positive-example-side encoder 520 includes a positive-example-side text conversion module 521, a positive-example-side non-text conversion module 522, and a positive-example-side vector fusion module 523, and the document feature that forms a positive example with the first query feature among the M document features includes a second text feature and a second non-text feature. On this basis, the positive-example-side text conversion module 521 converts the second text feature into a second text vector; the positive-example-side non-text conversion module 522 converts the second non-text feature into a second non-text vector; and the positive-example-side vector fusion module 523 fuses the second text vector and the second non-text vector to obtain the first positive example vector. Optionally, the second non-text feature includes at least one of: the authority of the document content, the authority of the document author, and the number of followers of the official account associated with the document. Of course, the specific features listed for the second non-text feature are merely examples and should not be construed as limiting the application. In one implementation, the negative-example-side encoder 530 includes a negative-example-side text conversion module, a negative-example-side non-text conversion module, and a negative-example-side vector fusion module, and the negative-example-side encoder 530 and the positive-example-side encoder 520 share parameters. In one implementation, the query-side encoder 510 and the positive-example-side encoder 520 do not share parameters, and the query-side encoder 510 and the negative-example-side encoder 530 do not share parameters. In the embodiments of the present application, configuring the negative-example-side encoder 530 and the positive-example-side encoder 520 to share parameters, while the query-side encoder 510 shares parameters with neither, ensures the ability of the personalized semantic vector model 500 to learn the semantic distinction between positive and negative examples, and thus its semantic discrimination performance.
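The internal structure of one side encoder, and the parameter sharing just described, might look as follows; this is a sketch assuming fusion by concatenation and mean pooling, neither of which the patent specifies:

```python
import torch
from torch import nn

class SideEncoder(nn.Module):
    """One side encoder: a Transformer for the text feature, a fully connected
    layer for the non-text feature, and a fusion step. Dimensions here are
    illustrative assumptions."""

    def __init__(self, vocab_size=30000, text_dim=128, nontext_in=32, out_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, text_dim)
        layer = nn.TransformerEncoderLayer(d_model=text_dim, nhead=4, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(layer, num_layers=2)  # text conversion module
        self.nontext_fc = nn.Linear(nontext_in, text_dim)               # non-text conversion module
        self.fusion = nn.Linear(2 * text_dim, out_dim)                  # vector fusion module

    def forward(self, token_ids: torch.Tensor, nontext_feats: torch.Tensor):
        text_vec = self.text_encoder(self.embed(token_ids)).mean(dim=1)  # pool token states
        nontext_vec = self.nontext_fc(nontext_feats)
        return self.fusion(torch.cat([text_vec, nontext_vec], dim=-1))
```

Passing one shared SideEncoder instance as both the positive-example-side and the negative-example-side encoder of the model sketch above realizes the described parameter sharing, while a separately constructed instance for the query side leaves its parameters unshared.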
For the personalized semantic vector model 500, clicked document features are used to form positive examples; document features that were exposed but not clicked, and document features recalled in the traditional manner, are used to form strong negative examples; and randomly sampled document features are used to form weak negative examples. Training the personalized semantic vector model 500 with these positive, strong negative, and weak negative examples gives it semantic understanding capability.
In some embodiments, S410 may include:
acquiring X non-text features in the training data, where X > 0; randomly sampling from the X non-text features N times to obtain N groups of non-text features, each group containing Y non-text features, where N > 1 and X > Y > 0; and using the N groups of non-text features as N training sets of the personalized semantic vector model, where the Y non-text features in one of the N groups serve, in one of the N training sets, as the non-text features of the first query feature and as the document non-text features of the M document features.
For example, the text conversion modules involved in the present application may employ a Transformer module, and the non-text conversion modules may employ a fully connected layer module. The sampling of non-text features simulates the non-text features present in an actual scene, or missing from an actual scene, so that the personalized semantic vector model 500 can handle scenes containing various non-text features or various combinations of them. For example, suppose one piece of training data has dozens of non-text query features and document features; 25% of the features are randomly selected to represent a scene lacking certain non-text features and used as one training set, and this operation is repeated several times on the same piece of data to obtain multiple training sets, so that the personalized semantic vector model 500 covers as many different scenes lacking certain non-text features as possible.
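The repeated sampling can be sketched as follows; the 25% ratio follows the example above, while the number of groups and the seed are assumptions:

```python
import random

def sample_feature_subsets(nontext_features, n_sets=4, keep_ratio=0.25, seed=None):
    """Simulate scenes with missing non-text features: from the X features of
    one piece of training data, randomly keep Y (about 25%), repeated N times."""
    rng = random.Random(seed)
    x = len(nontext_features)
    y = max(1, int(x * keep_ratio))  # ensures X > Y > 0
    return [rng.sample(nontext_features, y) for _ in range(n_sets)]  # N groups

# One record with 12 non-text features yields 4 training subsets of 3 features.
subsets = sample_feature_subsets([f"feat_{i}" for i in range(12)], n_sets=4, seed=0)
```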
In some embodiments, the method 400 may further include:
acquiring a query request through a query window, the query request including a second query feature; converting the second query feature into a second query vector using the personalized semantic vector model; selecting, from a document library, the K document features with the highest similarity scores with respect to the second query vector, where K > 0; and displaying the K document features below the query window in descending order of similarity score.
For example, the vector recall module 340 shown in Fig. 3 may select, based on the second query vector, the K document features from the document library whose similarity scores with the second query vector are the highest; for the specific implementation, refer to the description of Fig. 3, which is not repeated here. In addition, it should be noted that the value of K may be preset or may be information carried in the query request, which is not limited in the embodiments of the present application.
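Putting the online pieces together, a hypothetical end-to-end recall step might look like this; encode_query, index, and id_to_doc are assumed artifacts of the offline stage, not names from the patent:

```python
def recall_top_k(model, index, id_to_doc, query_features, k=10):
    """End-to-end sketch of the online flow: infer the second query vector,
    recall K documents, and return them in descending similarity order."""
    query_vec = model.encode_query(query_features)         # (1, dim) float32
    scores, ids = index.search(query_vec, k)               # vector recall
    ranked = sorted(zip(scores[0], ids[0]), reverse=True)  # descending scores
    return [(id_to_doc[int(i)], float(s)) for s, i in ranked if i != -1]
```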
It should be understood that the personalized semantic vector model 500 shown in Fig. 5 is only an example of the present application and should not be construed as a limitation. For example, in other alternative embodiments, the positive-example-side encoder 520 and the negative-example-side encoder 530 may be combined into one encoder, or the positive-example-side encoder 520 and the negative-example-side encoder 530 may each correspond to its own scoring module, which is not specifically limited in the embodiments of the present application.
The preferred embodiments of the present application have been described in detail above with reference to the accompanying drawings. However, the present application is not limited to the details of the above embodiments; various simple modifications can be made to the technical solution of the present application within its technical idea, and these simple modifications all fall within its protection scope. For example, the specific features described in the foregoing detailed description may be combined in any suitable manner without contradiction; to avoid unnecessary repetition, the possible combinations are not separately described. Likewise, the various embodiments of the present application may be combined arbitrarily, and such combinations should equally be regarded as disclosed herein as long as they do not depart from the idea of the present application.
It should also be understood that, in the various method embodiments of the present application, the sequence numbers of the above processes do not imply an execution order; the execution order of the processes should be determined by their functions and internal logic and should not constitute any limitation on the implementation of the embodiments of the present application.
The method provided by the embodiment of the present application is explained above, and the device provided by the embodiment of the present application is explained below.
Fig. 6 is a schematic block diagram of a training apparatus 600 for a personalized semantic vector model in a search engine according to an embodiment of the present application.
As shown in fig. 6, the apparatus 600 may include:
an obtaining unit 610, configured to obtain a first query feature and M document features, where M > 0;
a converting unit 620, configured to convert the first query feature into a first query vector, and convert the M document features into M document vectors respectively;
a training unit 630, configured to train a personalized semantic vector model based on the first query vector and the M document vectors, with a preset similarity difference as a training target;
where the similarity difference is the difference between the similarity score of a positive example vector with the query vector and the similarity score of a negative example vector with the query vector; the positive example vector is converted from a document feature that forms a positive example with the query feature, and the negative example vector is converted from a document feature that forms a negative example with the query feature.
In some embodiments, the personalized semantic vector model includes a query-side encoder, a positive-example-side encoder, a negative-example-side encoder, and a scoring module. The converting unit 620 is specifically configured to: convert the first query feature into the first query vector using the query-side encoder; convert the document feature that forms a positive example with the first query feature, among the M document features, into a first positive example vector among the M document vectors using the positive-example-side encoder; and convert the document feature that forms a negative example with the first query feature into a first negative example vector among the M document vectors using the negative-example-side encoder. The training unit 630 is specifically configured to: calculate a first difference using the scoring module, the first difference being the difference between the similarity score of the first positive example vector with the first query vector and the similarity score of the first negative example vector with the first query vector; and train the personalized semantic vector model based on the first difference, with the similarity difference as the training target.
In some embodiments, the negative-case-side encoder includes a negative-case-side text conversion module, a negative-case-side non-text conversion module, and a negative-case-side vector fusion module; the negative side encoder and the positive side encoder share parameters.
In some embodiments, the query-side encoder and the positive-case-side encoder do not share parameters, and the query-side encoder and the negative-case-side encoder do not share parameters.
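As a minimal sketch of this parameter-sharing arrangement, assuming a PyTorch implementation and a hypothetical DocSideEncoder class, sharing can be expressed simply by reusing one module instance for both document-side roles:

    import torch.nn as nn

    class DocSideEncoder(nn.Module):
        # Placeholder document-side encoder; a real encoder would contain the
        # text conversion, non-text conversion, and vector fusion modules.
        def __init__(self, dim=128):
            super().__init__()
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):
            return self.proj(x)

    pos_enc = DocSideEncoder()
    neg_enc = pos_enc              # negative-case side shares parameters with the positive-case side
    query_enc = DocSideEncoder()   # a separate instance: no sharing with the query side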
In some embodiments, the query-side encoder includes a query-side text conversion module, a query-side non-text conversion module, and a query-side vector fusion module, and the first query feature includes a first text feature and a first non-text feature. The converting unit 620 is specifically configured to: convert the first text feature into a first text vector using the query-side text conversion module; convert the first non-text feature into a first non-text vector using the query-side non-text conversion module; and fuse the first text vector and the first non-text vector using the query-side vector fusion module to obtain the first query vector.
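A hedged sketch of an encoder with this shape follows; the patent does not fix the network structure, so the mean-pooled token embeddings, the per-feature categorical embeddings, the dimensions, and the concatenate-then-project fusion rule are all illustrative assumptions:

    import torch
    import torch.nn as nn

    class QuerySideEncoder(nn.Module):
        def __init__(self, vocab_size=30000, text_dim=128,
                     n_nontext=6, nontext_vocab=100, nontext_dim=16, out_dim=128):
            super().__init__()
            # Query-side text conversion module: token embeddings, mean-pooled.
            self.text_emb = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
            # Query-side non-text conversion module: one embedding table per
            # categorical feature (age bucket, gender, and so on).
            self.nontext_emb = nn.ModuleList(
                [nn.Embedding(nontext_vocab, nontext_dim) for _ in range(n_nontext)])
            # Query-side vector fusion module: concatenate, then project.
            self.fuse = nn.Linear(text_dim + n_nontext * nontext_dim, out_dim)

        def forward(self, token_ids, nontext_ids):
            text_vec = self.text_emb(token_ids)            # the first text vector
            nontext_vec = torch.cat(
                [emb(nontext_ids[:, i]) for i, emb in enumerate(self.nontext_emb)],
                dim=-1)                                    # the first non-text vector
            return self.fuse(torch.cat([text_vec, nontext_vec], dim=-1))

Called with a batch of token id matrices and a batch of categorical non-text ids, this returns the first query vector; the positive-case-side encoder described below has the same three-module shape, applied to document features instead.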
In some embodiments, the first non-text feature includes at least one of the following features of the user: age, gender, user profile, educational background, language used, and mobile phone operating system used.
In some embodiments, the positive-case-side encoder includes a positive-case-side text conversion module, a positive-case-side non-text conversion module, and a positive-case-side vector fusion module, and the document feature that forms a positive example with the first query feature among the M document features includes a second text feature and a second non-text feature. The converting unit 620 is specifically configured to: convert the second text feature into a second text vector using the positive-case-side text conversion module; convert the second non-text feature into a second non-text vector using the positive-case-side non-text conversion module; and fuse the second text vector and the second non-text vector using the positive-case-side vector fusion module to obtain the first positive example vector.
In some embodiments, the second non-text feature includes at least one of: the authoritativeness of the document content, the authoritativeness of the document author, and the follower count of the official account associated with the document.
In some embodiments, the obtaining unit 610 is specifically configured to:
obtain X non-text features in the training data, where X > 0; select from the X non-text features N times by random selection to obtain N groups of non-text features, each of the N groups containing Y non-text features, where N > 1 and X > Y > 0; the N groups of non-text features serve as N training sets of the personalized semantic vector model, and within the corresponding one of the N training sets, the Y non-text features of a group serve both as the non-text features of the first query feature and as the document non-text features of the M document features.
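As a hedged sketch of this random selection, assuming Python's standard library random module and hypothetical feature names, the grouping could be produced as follows:

    import random

    def build_training_groups(nontext_features, Y, N, seed=0):
        # Select N times from the X non-text features by random selection,
        # yielding N groups of Y non-text features each (N > 1, X > Y > 0).
        rng = random.Random(seed)
        return [rng.sample(nontext_features, Y) for _ in range(N)]

    # Example: X = 6 candidate non-text features, N = 3 groups of Y = 4 each;
    # each group then parameterizes one of the N training sets.
    features = ["age", "gender", "profile", "education", "language", "phone_os"]
    groups = build_training_groups(features, Y=4, N=3)

A plausible reading of this design is that varying the subset of non-text features across training sets discourages the model from over-relying on any single personalization signal.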
In some embodiments, the apparatus 600 may further include a display unit configured to:
acquire a query request through a query window, the query request including a second query feature;
convert the second query feature into a second query vector using the personalized semantic vector model;
select, from a document library and based on the second query vector, the K document features whose similarity scores with the second query vector are highest, where K > 0;
and display the K document features below the query window in descending order of similarity score.
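A minimal sketch of this retrieval step, assuming NumPy, cosine similarity as the similarity score, and a pre-encoded in-memory document library (in practice an approximate nearest-neighbor index would likely replace the brute-force scan):

    import numpy as np

    def top_k_documents(query_vec, doc_vecs, k):
        # Similarity score of the second query vector against every document
        # vector in the library; return the indices of the K highest scores,
        # in descending order.
        q = query_vec / np.linalg.norm(query_vec)
        d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
        scores = d @ q
        order = np.argsort(-scores)[:k]
        return order, scores[order]

    # Example: a library of 1000 document vectors of dimension 128,
    # displaying the top K = 10 below the query window.
    library = np.random.randn(1000, 128)
    idx, scores = top_k_documents(np.random.randn(128), library, k=10)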
It is to be understood that the apparatus embodiments and the method embodiments correspond to one another, and similar descriptions apply; reference may be made to the method embodiments, and the details are not repeated here. Specifically, the apparatus 600 may correspond to the entity that executes the method 400 of the embodiments of the present application, and each unit of the apparatus 600 implements the corresponding flow of the method 400; for brevity, these flows are not described again.
It should also be understood that the units of the training apparatus related to the embodiments of the present application may be combined, in whole or in part, into one or several other units, or one (or more) of them may be split into functionally smaller units; this can achieve the same operation without affecting the technical effects of the embodiments of the present application. The above units are divided on the basis of logical functions; in practice, the function of one unit may be realized by several units, or the functions of several units may be realized by one unit. In other embodiments of the present application, the training apparatus may also include other units, and in practice these functions may be realized with the assistance of other units and through the cooperation of several units. According to another embodiment of the present application, the training apparatus according to the embodiments of the present application may be constructed, and the training method according to the embodiments of the present application may be implemented, by running a computer program (including program code) capable of executing the steps of the corresponding method on a general-purpose computing device, such as a computer, that includes a Central Processing Unit (CPU), a random access memory (RAM), a read-only memory (ROM), and other processing and storage elements. The computer program may, for example, be recorded on a computer-readable storage medium, loaded into an electronic device through that storage medium, and executed there to implement the methods of the embodiments of the present application.
In other words, the above units may be implemented in hardware, by software instructions, or by a combination of the two. Specifically, the steps of the method embodiments of the present application may be completed by integrated logic circuits of hardware in a processor and/or by instructions in the form of software; the steps of the methods disclosed in connection with the embodiments of the present application may be executed directly by a hardware decoding processor, or by a combination of hardware and software within a decoding processor. Optionally, the software may reside in a storage medium well established in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in a memory, and the processor reads the information in the memory and completes the steps of the above method embodiments in combination with its hardware.
Fig. 7 is a schematic structural diagram of an electronic device 700 provided in an embodiment of the present application.
As shown in fig. 7, the electronic device 700 includes at least a processor 710 and a computer-readable storage medium 720, which may be connected by a bus or in other ways. The computer-readable storage medium 720 is configured to store a computer program 721 comprising computer instructions, and the processor 710 is configured to execute the computer instructions stored by the computer-readable storage medium 720. The processor 710 is the computing core and control core of the electronic device 700; it is adapted to implement one or more computer instructions, and in particular to load and execute one or more computer instructions so as to realize the corresponding method flow or function.
By way of example, the processor 710 may also be referred to as a Central Processing Unit (CPU). The processor 710 may include, but is not limited to: a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and the like.
By way of example, the computer-readable storage medium 720 may be a high-speed RAM memory or a non-volatile memory, such as at least one disk memory; optionally, it may be at least one computer-readable storage medium located remotely from the processor 710. In particular, the computer-readable storage medium 720 includes, but is not limited to, volatile memory and/or non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
As shown in fig. 7, the electronic device 700 may also include a transceiver 730. The processor 710 may control the transceiver 730 to communicate with other devices, and in particular, may transmit information or data to the other devices or receive information or data transmitted by the other devices. The transceiver 730 may include a transmitter and a receiver. The transceiver 730 may further include an antenna, and the number of antennas may be one or more.
It should be noted that the components of the electronic device 700 are connected by a bus system, which may include, in addition to a data bus, a power bus, a control bus, a status signal bus, and the like; this is not limited in the embodiments of the present application.
In one implementation, the electronic device 700 may be the training apparatus 600 for the personalized semantic vector model in a search engine shown in fig. 6. The computer-readable storage medium 720 stores computer instructions, and these computer instructions are loaded and executed by the processor 710 to implement the corresponding steps of the method embodiments described above; in a specific implementation, the computer instructions in the computer-readable storage medium 720 are loaded by the processor 710 to execute the corresponding steps, which are not repeated here to avoid redundancy.
According to another aspect of the present application, a computer-readable storage medium (memory) is provided, which is a memory device in the electronic device 700 used for storing programs and data, for example the computer-readable storage medium 720. It is understood that the computer-readable storage medium 720 here may include both a built-in storage medium of the electronic device 700 and, of course, an extended storage medium supported by the electronic device 700. The computer-readable storage medium provides a storage space that stores the operating system of the electronic device 700, and also stores one or more computer instructions, which may be one or more computer programs 721 (including program code), suitable for being loaded and executed by the processor 710.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium, for example the computer program 721. In this case, the electronic device 700 may be a computer: the processor 710 reads the computer instructions from the computer-readable storage medium 720 and executes them, causing the computer to perform the training method for the personalized semantic vector model provided in the various optional manners described above.
In other words, when implemented in software, the above may be realized, in whole or in part, in the form of a computer program product comprising one or more computer instructions. When these computer program instructions are loaded and executed on a computer, the processes of the embodiments of the present application are executed, or their functions are realized, in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored on a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, radio, microwave).
Those of ordinary skill in the art will appreciate that the various illustrative elements and process steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Finally, it should be noted that the above mentioned embodiments are only specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present application, and all such changes or substitutions should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (13)

1. A training method of a personalized semantic vector model in a search engine is characterized by comprising the following steps:
acquiring a first query feature and M document features, wherein M is greater than 0;
converting the first query feature into a first query vector, and converting the M document features into M document vectors respectively;
based on the first query vector and the M document vectors, taking a preset similarity difference value as a training target, and training a personalized semantic vector model;
wherein the similarity difference is the difference between the similarity score of the positive example vector and the query vector and the similarity score of the negative example vector and the query vector; the positive example vector is converted from a document feature that forms a positive example with the query feature, and the negative example vector is converted from a document feature that forms a negative example with the query feature.
2. The method of claim 1, wherein the personalized semantic vector model comprises a query-side encoder, a positive-case-side encoder, a negative-case-side encoder, and a scoring module;
wherein the converting the first query feature into a first query vector and converting the M document features into M document vectors respectively comprises:
converting, with the query-side encoder, the first query feature into the first query vector; converting, with the positive-case-side encoder, the document feature that forms a positive example with the first query feature among the M document features into a first positive example vector among the M document vectors; and converting, with the negative-case-side encoder, the document features that form negative examples with the first query feature among the M document features into a first negative example vector among the M document vectors;
wherein the training the personalized semantic vector model based on the first query vector and the M document vectors with a preset similarity difference as a training target comprises:
calculating a first difference using the scoring module, the first difference being the difference between the similarity score of the first positive example vector and the first query vector and the similarity score of the first negative example vector and the first query vector;
and training the personalized semantic vector model with the preset similarity difference as the training target, based on the first difference.
3. The method of claim 2, wherein the negative-case-side encoder comprises a negative-case-side text conversion module, a negative-case-side non-text conversion module, and a negative-case-side vector fusion module; and the negative-case-side encoder and the positive-case-side encoder share parameters.
4. The method of claim 2, wherein the query-side encoder and the positive-case-side encoder do not share parameters, and the query-side encoder and the negative-case-side encoder do not share parameters.
5. The method of claim 2, wherein the query-side encoder comprises a query-side text conversion module, a query-side non-text conversion module, and a query-side vector fusion module; the first query feature comprises a first textual feature and a first non-textual feature;
wherein the converting the first query feature into a first query vector comprises:
converting the first text feature into a first text vector by using the query side text conversion module; converting the first non-text feature into a first non-text vector by using the query side non-text conversion module; and fusing the first text vector and the first non-text vector by using the query side vector fusion module to obtain the first query vector.
6. The method of claim 5, wherein the first non-text feature comprises at least one of the following features of a user: age, gender, user profile, educational background, language used, and mobile phone operating system used.
7. The method of claim 2, wherein the positive-case-side encoder comprises a positive-case-side text conversion module, a positive-case-side non-text conversion module, and a positive-case-side vector fusion module; and the document feature that forms a positive example with the first query feature among the M document features comprises a second text feature and a second non-text feature;
wherein the converting the M document features into M document vectors respectively comprises:
converting the second text feature into a second text vector using the positive-case-side text conversion module; converting the second non-text feature into a second non-text vector using the positive-case-side non-text conversion module; and fusing the second text vector and the second non-text vector using the positive-case-side vector fusion module to obtain the first positive example vector.
8. The method of claim 7, wherein the second non-text feature comprises at least one of: the authoritativeness of the document content, the authoritativeness of the document author, and the follower count of the official account associated with the document.
9. The method of claim 1, wherein obtaining the first query feature and the M document features comprises:
acquiring X non-text features in training data, wherein X is greater than 0;
selecting from the X non-text features N times by random selection to obtain N groups of non-text features, each of the N groups comprising Y non-text features, wherein N > 1 and X > Y > 0;
wherein the N groups of non-text features serve as N training sets of the personalized semantic vector model, and the Y non-text features included in one of the N groups serve, within the corresponding one of the N training sets, both as the non-text features of the first query feature and as the document non-text features of the M document features.
10. The method of claim 1, further comprising:
acquiring a query request through a query window, wherein the query request comprises a second query characteristic;
converting the second query feature into a second query vector using the personalized semantic vector model;
selecting, from a document library and based on the second query vector, the K document features whose similarity scores with the second query vector are highest, wherein K > 0;
and displaying the K document features below the query window in descending order of similarity score.
11. An apparatus for training a personalized semantic vector model in a search engine, comprising:
an obtaining unit, configured to obtain a first query feature and M document features, wherein M > 0;
a converting unit, configured to convert the first query feature into a first query vector and to convert the M document features into M document vectors respectively; and
a training unit, configured to train a personalized semantic vector model based on the first query vector and the M document vectors, with a preset similarity difference as a training target;
wherein the similarity difference is the difference between the similarity score of the positive example vector and the query vector and the similarity score of the negative example vector and the query vector; the positive example vector is converted from a document feature that forms a positive example with the query feature, and the negative example vector is converted from a document feature that forms a negative example with the query feature.
12. An electronic device, comprising:
a processor adapted to execute a computer program;
a computer-readable storage medium storing a computer program which, when executed by the processor, implements the training method of a personalized semantic vector model in a search engine according to any one of claims 1 to 10.
13. A computer-readable storage medium storing a computer program, the computer program causing a computer to execute the training method of a personalized semantic vector model in a search engine according to any one of claims 1 to 10.
CN202110191195.2A 2021-02-19 2021-02-19 Training method and device for personalized semantic vector model in search engine Active CN113010771B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110191195.2A CN113010771B (en) 2021-02-19 2021-02-19 Training method and device for personalized semantic vector model in search engine

Publications (2)

Publication Number Publication Date
CN113010771A true CN113010771A (en) 2021-06-22
CN113010771B CN113010771B (en) 2023-08-22

Family

ID=76403769

Country Status (1)

Country Link
CN (1) CN113010771B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113821731A (en) * 2021-11-23 2021-12-21 湖北亿咖通科技有限公司 Information push method, device and medium
CN117312513A (en) * 2023-09-27 2023-12-29 数字广东网络建设有限公司 Document search model training method, document search method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150186495A1 (en) * 2013-12-31 2015-07-02 Quixey, Inc. Latent semantic indexing in application classification
JP2017010249A (en) * 2015-06-22 2017-01-12 日本電信電話株式会社 Parameter learning device, sentence similarity calculation device, method, and program
CN106407311A (en) * 2016-08-30 2017-02-15 北京百度网讯科技有限公司 Method and device for obtaining search result
US20170329842A1 (en) * 2016-05-13 2017-11-16 General Electric Company System and method for entity recognition and linking
CN111159343A (en) * 2019-12-26 2020-05-15 上海科技发展有限公司 Text similarity searching method, device, equipment and medium based on text embedding
CN111339421A (en) * 2020-02-28 2020-06-26 腾讯科技(深圳)有限公司 Information search method, device, equipment and storage medium based on cloud technology
CN111950269A (en) * 2020-08-21 2020-11-17 清华大学 Text statement processing method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113010771B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
US9449271B2 (en) Classifying resources using a deep network
US7680858B2 (en) Techniques for clustering structurally similar web pages
CN108280114B (en) Deep learning-based user literature reading interest analysis method
CN117114001A (en) Determining a paraphrasing interrelationship across documents based on resolution and identification of named entities
CN104915413A (en) Health monitoring method and health monitoring system
CN112580352B (en) Keyword extraction method, device and equipment and computer storage medium
Yang Developing an ontology-supported information integration and recommendation system for scholars
Wu et al. Extracting topics based on Word2Vec and improved Jaccard similarity coefficient
CN113010771B (en) Training method and device for personalized semantic vector model in search engine
CN111813905A (en) Corpus generation method and device, computer equipment and storage medium
CN110362663A (en) Adaptive more perception similarity detections and parsing
WO2023122051A1 (en) Contextual clarification and disambiguation for question answering processes
CN113515589A (en) Data recommendation method, device, equipment and medium
US20170235835A1 (en) Information identification and extraction
CN112765966B (en) Method and device for removing duplicate of associated word, computer readable storage medium and electronic equipment
Bai et al. A rumor detection model incorporating propagation path contextual semantics and user information
Babbar et al. Real-time traffic, accident, and potholes detection by deep learning techniques: a modern approach for traffic management
CN116680481B (en) Search ranking method, apparatus, device, storage medium and computer program product
Azzam et al. A question routing technique using deep neural network for communities of question answering
Lytvyn et al. Content Formation Method in the Web Systems.
Maia et al. A tag-based transformer community question answering learning-to-rank model in the home improvement domain
Sun et al. Research on question retrieval method for community question answering
Li et al. Research on hot news discovery model based on user interest and topic discovery
JP2010282403A (en) Document retrieval method
CN117609479B (en) Model processing method, device, equipment, medium and product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code; ref country code: HK; ref legal event code: DE; ref document number: 40046493
GR01 Patent grant