CN117272937B - Text coding model training method, device, equipment and storage medium - Google Patents

Text coding model training method, device, equipment and storage medium

Info

Publication number
CN117272937B
Authority
CN
China
Prior art keywords
question
answer
text
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311457583.6A
Other languages
Chinese (zh)
Other versions
CN117272937A
Inventor
颜泽龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311457583.6A priority Critical patent/CN117272937B/en
Publication of CN117272937A publication Critical patent/CN117272937A/en
Application granted granted Critical
Publication of CN117272937B publication Critical patent/CN117272937B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/10 Text processing
    • G06F 40/12 Use of codes for handling textual entities
    • G06F 40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/02 Knowledge representation; Symbolic representation
    • G06N 5/022 Knowledge engineering; Knowledge acquisition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a text coding model training method, device, equipment and storage medium, relating to the field of artificial intelligence and in particular to the field of large models. The model training method comprises the following steps: acquiring a first training data sample, wherein the first training data sample comprises a first question and a positive sample answer corresponding to the first question; determining a negative sample answer from at least one answer according to the similarity between the first question and at least one second question and the similarity between the positive sample answer and the at least one answer corresponding to the at least one second question; determining a second training data sample comprising the first question, the positive sample answer and the negative sample answer; and training the text coding model according to the second training data sample. By training the text encoder on the training data sample set constructed in this way, the text encoder can output more accurate text vector representations, which helps improve the accuracy of sentence vector retrieval.

Description

Text coding model training method, device, equipment and storage medium
Technical Field
The embodiment of the application relates to the technical field of artificial intelligence, in particular to a text coding model training method, a device, equipment and a storage medium.
Background
Deep learning can abstract the various kinds of unstructured data generated by people, objects and scenes in the physical world (such as speech, pictures, video, natural language text and behaviors) into multidimensional representation vectors, and relationships in the physical world can be recovered through mathematical distance operations on these vectors. A deep model represents an object as a dense vector that reflects the object's core characteristics, and semantically similar texts can then be found by comparing vector distances. Searching a very large collection of representation vectors for the vector closest to an input is called vector retrieval.
Sentence vectors represent sentences of indefinite length as fixed-length vector representations. In the related art, sentence vector retrieval can be applied in knowledge question-answering scenarios. The data involved in sentence vector retrieval includes questions (queries) and documents related to the questions. A deep encoder encodes the query and the document into corresponding vector representations, and the similarity of the two representations, computed with the vector inner product or cosine similarity, is used as the basis for judging whether the query and the document are related. However, current sentence vector retrieval schemes have unsatisfactory retrieval performance and often produce replies that do not match expectations.
Disclosure of Invention
The application provides a text coding model training method, device, equipment and storage medium, which can train a text encoder based on a high-quality training data sample set, helping the text encoder output more accurate text vector representations and improving the accuracy of sentence vector retrieval.
In a first aspect, an embodiment of the present application provides a training method for a text coding model, including:
acquiring a first training data sample, wherein the first training data sample comprises a first question and a positive sample answer corresponding to the first question;
determining a negative sample answer in at least one answer according to the similarity of the first question and at least one second question and the similarity of the positive sample answer and at least one answer corresponding to the at least one second question;
determining a second training data sample comprising the first question, the positive sample answer, and the negative sample answer;
and training the text coding model according to the second training data sample.
In a second aspect, an embodiment of the present application provides a sentence vector retrieval method, including:
acquiring a question text;
inputting the question text into a text coding model to obtain a first vector representation corresponding to the question text, wherein the text coding model is obtained according to the training method as described in the first aspect;
obtaining a corpus, wherein the corpus comprises at least one document;
inputting the at least one document into the text coding model to obtain at least one second vector representation corresponding to the at least one document;
determining a target document corresponding to the question text from the at least one document according to the similarity between the first vector representation and the at least one second vector representation;
and inputting the question text and the target document into a pre-trained large language model to obtain a reply to the question text.
In a third aspect, an embodiment of the present application provides a training device for a text coding model, including:
an acquisition unit, configured to acquire a first training data sample, wherein the first training data sample comprises a first question and a positive sample answer corresponding to the first question;
a determining unit, configured to determine a negative sample answer from at least one answer according to a similarity between the first question and at least one second question and a similarity between the positive sample answer and at least one answer corresponding to the at least one second question;
The determining unit is further configured to determine a second training data sample, where the second training data sample includes the first question, the positive sample answer, and the negative sample answer;
and the training unit is used for training the text coding model according to the second training data sample.
In a fourth aspect, an embodiment of the present application provides a sentence vector retrieving apparatus, including:
an acquisition unit for acquiring a question text;
a text coding model, configured to receive the question text as input and obtain a first vector representation corresponding to the question text, wherein the text coding model is obtained according to the training method as described in the first aspect;
the obtaining unit is further configured to obtain a corpus, wherein the corpus comprises at least one document;
the text coding model is further configured to receive the at least one document as input and obtain at least one second vector representation corresponding to the at least one document;
a determining unit, configured to determine a target document corresponding to the question text from the at least one document according to the similarity between the first vector representation and the at least one second vector representation;
and a large language model, configured to receive the question text and the target document as input and obtain a reply to the question text.
In a fifth aspect, embodiments of the present application provide an electronic device, including: a processor and a memory for storing a computer program, the processor being for invoking and running the computer program stored in the memory for performing the method as in the first or second aspect.
In a sixth aspect, embodiments of the present application provide a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform a method as in the first or second aspect.
In a seventh aspect, embodiments of the present application provide a computer program product comprising computer program instructions for causing a computer to perform the method as in the first or second aspect.
In an eighth aspect, embodiments of the present application provide a computer program that causes a computer to perform the method as in the first or second aspect.
According to the embodiments of the present application, the negative sample answer is determined from the at least one answer according to the similarity between the first question and at least one second question and the similarity between the positive sample answer corresponding to the first question and the at least one answer corresponding to the at least one second question. In this way, a high-quality negative sample answer is found among all candidate documents for each question: the negative sample answer has a certain relationship with the first question yet is not actually relevant to it, and is therefore easily mistaken for a relevant sample. A high-quality training data sample set is thus constructed automatically. Training the text encoder on this high-quality training data sample set helps the text encoder output more accurate text vector representations, which improves the accuracy of sentence vector retrieval.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings described below are only some embodiments of the present application, and other drawings can be obtained from them by a person skilled in the art without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of the solution of the embodiment of the present application;
FIG. 2 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a model training method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of another model training method according to an embodiment of the present application;
FIG. 5 is a schematic flow chart diagram of another model training method according to an embodiment of the present application;
FIG. 6 is a flow diagram of model training according to an embodiment of the present application;
FIG. 7 is a schematic flow chart diagram of a sentence vector retrieval method in accordance with an embodiment of the present application;
FIG. 8 is a flow chart of sentence vector retrieval according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a model training apparatus according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of a sentence vector retrieving apparatus in accordance with an embodiment of the present application;
fig. 11 is a schematic block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
It should be understood that in the embodiments of the present application, "B corresponding to A" means that B is associated with A. In one implementation, B may be determined from A. It should also be understood that determining B from A does not mean determining B from A alone; B may also be determined from A and/or other information.
In the description of the present application, unless otherwise indicated, "at least one" means one or more, and "a plurality" means two or more. In addition, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: A alone, both A and B, and B alone, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single items or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be single or plural.
It should be further understood that the description of the first, second, etc. in the embodiments of the present application is for purposes of illustration and distinction only, and does not represent a specific limitation on the number of devices in the embodiments of the present application, and should not constitute any limitation on the embodiments of the present application.
It should also be appreciated that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiment of the application is applied to the technical field of artificial intelligence.
Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, involving both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include, for example, sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-trained model technologies, operation/interaction systems, and mechatronics. Pre-trained models, also called large models or foundation models, can, after fine-tuning, be widely applied to downstream tasks in all major directions of artificial intelligence. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning. With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, unmanned aerial vehicles, robots, smart healthcare, and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in more fields and become increasingly valuable.
Embodiments of the present application may involve natural language processing (Natural Language Processing, NLP) in artificial intelligence technology. NLP is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e. the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Embodiments of the present application may also involve machine learning (ML) in artificial intelligence technology. ML is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It studies how computers simulate or implement human learning behavior to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
Embodiments of the present application may also involve pre-trained models (PTMs) in artificial intelligence technology. A pre-trained model, also called a foundation model or large model, refers to a deep neural network (DNN) with a large number of parameters that is trained on massive amounts of unlabeled data. The function-approximation capability of the large-parameter DNN enables the PTM to extract common features from the data, and the model is adapted to downstream tasks through techniques such as fine-tuning, parameter-efficient fine-tuning (PEFT), and prompt-tuning. Therefore, a pre-trained model can achieve good results in few-shot or zero-shot scenarios. PTMs can be classified according to the data modality they process into language models (ELMO, BERT, GPT), visual models (Swin Transformer, ViT, V-MoE), speech models (VALL-E), multi-modal models (ViLBERT, CLIP, Flamingo, Gato), and so on, where a multi-modal model refers to a model that builds representations of the features of two or more data modalities. Pre-trained models are an important tool for producing artificial intelligence generated content (AIGC) and can also serve as a general interface connecting multiple specific task models.
Currently, sentence vector retrieval schemes can be applied in knowledge question-answering scenarios. The data involved in sentence vector retrieval includes questions (queries) and documents related to the questions. A deep encoder encodes the query and the document into corresponding vector representations, and the similarity of the two vector representations, computed with the vector inner product or cosine similarity, is used as the basis for judging whether the query and the document are related.
In training the deep encoder, a contrastive learning approach is adopted: each batch contains K (query, document) pairs. For query_i, the corresponding document_i is taken as the positive sample, and the other documents document_j (j ≠ i) in the same batch are taken as the negative samples of query_i. During training it is desirable that each query be sufficiently close to its corresponding positive sample and sufficiently far from its negative samples. Illustratively, the loss function employed in training may be expressed as follows:

$$\mathcal{L}=-\log\frac{\exp\left(\mathrm{sim}(v_{q_i},v_{d_i})/\tau\right)}{\sum_{j=1}^{K}\exp\left(\mathrm{sim}(v_{q_i},v_{d_j})/\tau\right)}$$

where v_{q_i} denotes the vector representation of query_i obtained via the encoder, v_{d_j} denotes the vector representation of document_j obtained via the encoder, sim denotes the similarity score between two vector representations, and τ is a temperature hyperparameter. The training process aims at minimizing this loss function, thereby optimizing the parameters of the query encoder and the document encoder simultaneously.
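The loss above is the standard in-batch contrastive objective. The following is a minimal PyTorch sketch of it, assuming L2-normalized vectors and cosine similarity; the function and parameter names are illustrative assumptions, not part of the patent.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_vecs: torch.Tensor,
                              doc_vecs: torch.Tensor,
                              tau: float = 0.05) -> torch.Tensor:
    """query_vecs, doc_vecs: (K, dim) encoder outputs for the K (query, document)
    pairs of one batch; document_i is the positive of query_i, and the other
    documents in the batch are its negatives."""
    q = F.normalize(query_vecs, dim=-1)
    d = F.normalize(doc_vecs, dim=-1)
    sim = q @ d.t() / tau                                  # (K, K) similarity scores
    labels = torch.arange(q.size(0), device=sim.device)    # positives on the diagonal
    return F.cross_entropy(sim, labels)                    # mean of the -log softmax terms
```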
During model inference, the trained document encoder is used to encode the documents in the knowledge base into vector representations, which are stored in a vector library (such as a faiss vector library). When a user asks a question online, the trained query encoder is used to encode the user's query into a vector representation, the closest document vector representations are retrieved from the vector library, and the corresponding documents are returned.
However, current sentence vector retrieval schemes have unsatisfactory retrieval performance and often produce replies that do not match expectations.
In view of this, the embodiments of the present application provide a model training method, apparatus, device, and storage medium, which can train a text encoder based on a high-quality training data sample set, helping the text encoder output more accurate text vector representations and improving the accuracy of sentence vector retrieval.
Specifically, a first training data sample may be obtained, where the first training data sample includes a first question and a positive sample answer corresponding to the first question; determining a negative sample answer in at least one answer according to the similarity between the first question and at least one second question and the similarity between the positive sample answer and at least one answer corresponding to the at least one second question; determining a second training data sample comprising a first question, a positive sample answer, and a negative sample answer; and training the text coding model according to the second training data sample.
According to the similarity between the first question and at least one second question and the similarity between the positive sample answer corresponding to the first question and the at least one answer corresponding to the at least one second question, the negative sample answer is determined from the at least one answer. In this way, a high-quality negative sample answer is found among all candidate documents for each question: the negative sample answer has a certain relationship with the first question yet is not actually relevant to it, and is therefore easily mistaken for a relevant sample. A high-quality training data sample set is thus constructed automatically. Training the text encoder on this high-quality training data sample set helps the text encoder output more accurate text vector representations, which improves the accuracy of sentence vector retrieval.
The embodiments of the present application can be applied to knowledge question-answering scenarios, such as knowledge question answering with a non-player character (NPC) in a game. In a game scenario, question (query) data and domain knowledge in the game field are used to find a high-quality negative sample answer among all candidate documents for each query, a high-quality training data sample set is constructed, and the text encoder is then trained on this set so that it outputs more accurate text vector representations. Sentence vector retrieval can then be performed more accurately based on the accurate text vector representation model, improving the quality of the NPC's replies when it converses with the user. The knowledge question answering can be presented in various forms, such as a question-and-answer dialog box configured on the game interface as shown in (a) of Fig. 1, or a question-and-answer dialog box configured around an NPC character as shown in (b) of Fig. 1.
A system architecture suitable for the present application is described below in connection with the accompanying drawings.
Fig. 2 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 2, the system architecture may include a user device 101, a data acquisition device 102, a training device 103, an execution device 104, a database 105, and a content library 106.
The data acquisition device 102 is configured to read training data from the content library 106, and store the read training data in the database 105. The training data related to the embodiment of the application includes, without limitation, question-answer data pairs in a general corpus, or question-answer data pairs in each specific field (such as a game field, a literature field, a financial field, or other fields).
Training device 103 trains the text encoding model based on training data maintained in database 105. In this embodiment of the present application, the training device 103 may obtain a high-quality negative sample answer based on the original training data, so as to obtain high-quality training data. Specifically, the training device 103 may determine a negative sample answer corresponding to the first question in the at least one answer based on the similarity between the first question and the at least one second question and the similarity between the positive sample answer corresponding to the first question and the at least one answer corresponding to the at least one second question, and further train the text encoding model according to the first question, the positive sample answer and the negative sample answer of the first question.
The text encoding model obtained by training device 103 may output a vector representation of the text. The text encoding model obtained by training device 103 may be applied to different systems or devices. The text encoding model may encode all documents in the content library, converting all documents into vector representations that exist in the vector library.
In addition, referring to fig. 2, the execution device 104 is configured with an I/O interface 107 for data interaction with external devices, such as receiving a question sent by the user device 101 via the I/O interface. The computing module 109 in the execution device 104 encodes the input question using the trained text encoding model, outputs a vector representation of the question, and then retrieves from the vector library the document vector representation that is most similar to the vector representation corresponding to the question. The computing module 109 may return the corresponding document to the user device 101 via the I/O interface.
The user device 101 may include a mobile phone, a tablet computer, a notebook computer, a palm computer, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, a mobile internet device (mobile internet device, MID), or other terminal devices.
The execution device 104 may be a server. The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms. The server may also serve as a node of a blockchain. There may be one or more servers. Where there are multiple servers, at least two servers may provide different services and/or at least two servers may provide the same service, for example in a load-balancing manner, which is not limited by the embodiments of the present application.
In this embodiment, the execution device 104 is connected to the user device 101 through a network. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, GSM), wideband code division multiple access (Wideband Code Division Multiple Access, WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, a telephony network, etc.
It should be noted that fig. 2 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the positional relationship between the devices, the modules, and the like shown in the drawings does not constitute any limitation. In some embodiments, the data acquisition device 102 may be the same device as the user device 101, the training device 103, and the execution device 104. The database 105 may be distributed over one server or over a plurality of servers, and the content library 106 may be distributed over one server or over a plurality of servers.
The following describes the technical solutions of the embodiments of the present application in detail through some embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.
Fig. 3 is a schematic flow chart of a training method 300 of a text encoding model according to an embodiment of the present application, where the method 300 may be performed by any electronic device having data processing capabilities, e.g., the electronic device may be implemented as a server or a terminal device, e.g., the training device 103 of fig. 2, which is not limited in this application. As shown in fig. 3, method 300 includes steps 310 through 340.
At 310, a first training data sample is obtained, the first training data sample including a first question and a positive sample answer corresponding to the first question.
Specifically, the first training data sample may be obtained from a generic corpus, or may be obtained from business data in a specific domain (such as a game domain, a literature domain, a financial domain, or other domains, etc.), which is not limited in this application. For example, the first question and a positive sample answer corresponding to the first question may be obtained from service data in the game field, and the first training data sample may be obtained. When the first training data sample is acquired by utilizing the business data in the specific field, the model can learn the knowledge in the specific field, and the capability of the model in the specific field is enhanced.
Illustratively, the first question may be one question in the business data of the specific domain, and the positive sample answer is a document related to the first question in all candidate documents of the business data. As an example, a plurality of training data samples may be obtained from the traffic data. As a specific example, the first training data sample may be represented as (query, pos) pair data, where query represents a first question, and pos represents a document related to the query, i.e., a positive sample answer corresponding to the first question.
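For illustration only, one possible in-memory layout of such (query, pos) pairs is sketched below; the field names and the game-domain example texts are assumptions, not data from the patent.

```python
# Hypothetical (query, pos) pairs, i.e. first training data samples; the texts
# are invented placeholders for a game-domain corpus.
first_training_samples = [
    {"query": "How do I upgrade my character's weapon?",
     "pos":   "Weapons can be upgraded at the blacksmith using ore and gold coins."},
    {"query": "Where is the entrance to the desert map?",
     "pos":   "The desert map is unlocked after completing the third chapter quest."},
]
```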
320 determining a negative sample answer from the at least one answer based on the similarity of the first question to the at least one second question and the similarity of the positive sample answer to the at least one answer corresponding to the at least one second question.
Illustratively, the at least one second problem is another problem in the business data that is different from the first problem. Accordingly, the document related to each second question may be determined as an answer corresponding to each second question among all candidate documents of the business data. It should be understood that, the answer corresponding to each second question is a positive sample answer corresponding to each second question, and each second question and the answer corresponding to the second question may be used as a training data sample.
In some embodiments, referring to fig. 4, a negative sample answer to the first question may be determined according to the following steps 321 to 323.
321, determining a first similarity of the first question to each of the at least one second question, and a second similarity of the positive sample answer to the answer corresponding to each second question.
In some embodiments, the first question and at least one second question may be input into a language model, respectively, resulting in a vector representation of the first question and a vector representation of each second question; and respectively inputting the positive sample answer and at least one answer corresponding to at least one second question into the language model to obtain a vector representation of the positive sample answer and a vector representation of the at least one answer.
For example, let the first question be the i-th data query_i and the second question be the j-th data query_j. The vector representation of the first question output by the language model may be denoted v_{q_i}, and the vector representation of the second question may be denoted v_{q_j}; the vector representation of the positive sample answer of the first question output by the language model may be denoted v_{a_i}, and the vector representation of the answer of the second question may be denoted v_{a_j}.
In particular, the language model may be an encoder for encoding the input text into a vector representation of a specific length, such as converting natural language into a dense vector representation. By way of example, the language model may be a pre-trained model. As a specific example, the language model may be an M3E (Moka Massive Mixed Embedding) model. The M3E model, trained by MokaAI on a training set of more than 22 million sentence pairs covering Chinese encyclopedia, finance, medical, legal, news, academic and other domains, supports computing the similarity of homogeneous texts in both Chinese and English.
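As a hedged sketch, the vector representations could be produced with a pre-trained sentence encoder loaded through the sentence-transformers library; the model identifier "moka-ai/m3e-base" is an assumption for illustration, and any comparable encoder would serve the same purpose.

```python
from sentence_transformers import SentenceTransformer

# Assumed model identifier; replace with whichever encoder is actually used.
encoder = SentenceTransformer("moka-ai/m3e-base")

question_vecs = encoder.encode(["first question", "second question"])   # shape (2, dim)
answer_vecs   = encoder.encode(["positive sample answer", "another answer"])
```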
Then, according to the vector representation of the first problem and the vector representation of each second problem, obtaining a first similarity between the first problem and each second problem; and obtaining a second similarity of the positive sample answer and the answer of each second question according to the vector representation of the positive sample answer of the first question and the vector representation of the at least one answer of the at least one second question.
Wherein the first similarity of the first question to each of the second questions, i.e. the vector similarity of the first question to each of the second questions, may refer to a distance measure between the vector representation of the first question and the vector representation of each of the second questions. The second similarity between the positive sample answer of the first question and the answer of each second question, i.e. the vector similarity of the positive sample answer of the first question and the answer of each second question, may refer to a distance measure between the vector representation of the positive sample answer of the first question and the vector representation of the answer of each second question.
The vector similarity, which may be denoted sim, is a distance measure computed between two vectors; common methods for computing vector similarity are the vector inner product and the vector cosine distance. For example, the similarity between two vector representations may be obtained by computing their vector inner product or cosine distance, which is not limited in this application. By way of example only and not limitation, sim(v_{q_i}, v_{q_j}) may denote the cosine similarity between the vector representation v_{q_i} of data query_i and the vector representation v_{q_j} of data query_j, i.e. one example of the first similarity; sim(v_{a_i}, v_{a_j}) may denote the cosine similarity between the vector representation v_{a_i} of the answer of data query_i and the vector representation v_{a_j} of the answer of data query_j, i.e. one example of the second similarity.
322 determining a score for each of the second questions based on the first similarity and the second similarity.
For example, a score for each second question may be determined based on the difference between the first similarity and the second similarity. As a specific example, let the first question be the i-th data query_i and the second question be the j-th data query_j; the score of data query_j, denoted score_j, is given by the following formula (1):

$$\mathrm{score}_j=\mathrm{sim}(v_{q_i},v_{q_j})-\mathrm{sim}(v_{a_i},v_{a_j})\qquad(1)$$

where v_{q_i} denotes the vector representation of data query_i, v_{q_j} denotes the vector representation of data query_j, sim(v_{q_i}, v_{q_j}) denotes the similarity (e.g. cosine similarity) between the vector representation of data query_i and the vector representation of data query_j, v_{a_i} denotes the vector representation of the answer of data query_i, v_{a_j} denotes the vector representation of the answer of data query_j, and sim(v_{a_i}, v_{a_j}) denotes the similarity between the vector representation of the answer of data query_i and the vector representation of the answer of data query_j.
323, determining the negative sample answer in at least one answer according to the score of each second question.
Specifically, the score of the second question may represent a relationship between the similarity of the first question and the second question, and the similarity of the answer of the first question and the answer of the second question, so that a negative sample answer corresponding to the first question may be determined from at least one answer corresponding to the second question based on the score of the second question.
In some embodiments, the answer corresponding to the highest scoring second question may be determined to be a negative sample answer. Specifically, the second question with the highest score may be determined according to the score of each second question, and then the answer corresponding to the second question with the highest score is determined as the negative sample answer.
Illustratively, in formula (1) above, the answer of the j-th data query_j with the highest score among the at least one second question may be selected as the negative sample answer of query_i. Specifically, when data query_j is very close to data query_i, the cosine distance between their vector representations is close to 0, i.e. sim(v_{q_i}, v_{q_j}) is high; when the answer of data query_j is very different from the answer of data query_i, sim(v_{a_i}, v_{a_j}) is approximately -1; in this case score_j is highest. Thus, the second question corresponding to the highest score is close to the first question, while the answer of that second question is far from the answer of the first question. The answer corresponding to that second question therefore has a certain relationship with the first question but is not actually relevant to it. Meanwhile, since the answer of the second question corresponds to the second question and the second question is similar to the first question, that answer is easily mistaken for a relevant sample answer of the first question. Therefore, the embodiments of the present application can obtain a high-quality negative sample answer for the first question.
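A minimal sketch of steps 321 to 323 is given below, assuming the question and answer vectors have already been produced by the language model described above; the function names are illustrative, and sim is taken as cosine similarity, matching formula (1).

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def mine_negative(q_i: np.ndarray, a_i: np.ndarray,
                  other_query_vecs: list, other_answer_vecs: list,
                  other_answer_texts: list) -> str:
    """Return the answer of the highest-scoring second question as the negative
    sample answer of the first question (formula (1))."""
    scores = [cosine(q_i, q_j) - cosine(a_i, a_j)                 # score_j
              for q_j, a_j in zip(other_query_vecs, other_answer_vecs)]
    return other_answer_texts[int(np.argmax(scores))]
```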
A second training data sample is determined 330, the second training data sample comprising a first question, a positive sample answer, and a negative sample answer.
Specifically, the second training data sample includes the first training data sample in step 310 and the negative sample answer determined in step 320. For example, the second training data sample may be expressed as (query, pos, neg), where the query represents the first question; pos represents the document associated with the query, i.e., the positive sample answer to the first question; neg represents a document that is not related to the query, i.e., a negative sample answer to the first question. Neg is a high-quality negative sample automatically constructed in the embodiment of the application, and the second training data sample is a high-quality training data sample automatically constructed in the embodiment of the application.
340 training the text encoding model based on the second training data sample.
In some embodiments, the training data sample set may be obtained by determining at least one second training data sample through steps 310 through 330 described above to train the text encoding model. Specifically, at least one second training data sample can be used as one or more batch input text coding models, and model parameters are optimized and adjusted according to model output to obtain a trained text coding model.
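A hedged sketch of step 340 follows, assuming the batches of second training data samples are lists of (queries, positive answers, negative answers) texts, and that the caller supplies an encode_batch callable (the shared text coding model) and a contrastive_loss callable such as the one defined for formulas (2) to (4) below; names and hyperparameters are assumptions.

```python
import torch

def train_text_encoder(encode_batch, parameters, batches, contrastive_loss,
                       epochs: int = 3, lr: float = 2e-5) -> None:
    """encode_batch: maps a list of texts to a (B, dim) tensor with gradients;
    parameters: trainable parameters of the text coding model;
    batches: iterable of (queries, pos_answers, neg_answers) text lists."""
    optimizer = torch.optim.AdamW(parameters, lr=lr)
    for _ in range(epochs):
        for queries, pos_answers, neg_answers in batches:
            q   = encode_batch(queries)
            pos = encode_batch(pos_answers)     # same (shared) encoder for answers
            neg = encode_batch(neg_answers)
            loss = contrastive_loss(q, pos, neg)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```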
Therefore, according to the similarity between the first question and the at least one second question and the similarity between the positive sample answer corresponding to the first question and the at least one answer corresponding to the at least one second question, the negative sample answer is determined from the at least one answer. In this way, a high-quality negative sample answer is found among all candidate documents for each question: the negative sample answer has a certain relationship with the first question yet is not actually relevant to it, and is therefore easily mistaken for a relevant sample. A high-quality training data sample set is thus constructed automatically. Training the text encoder on this high-quality training data sample set helps the text encoder output more accurate text vector representations, which improves the accuracy of sentence vector retrieval.
In some embodiments, referring to fig. 5, the text encoding model may be trained by the following steps 341 through 344. Illustratively, the text encoding model is described below as including a first text encoding model and a second text encoding model.
At 341, inputting at least one first question in the at least one second training data sample into the first text encoding model to obtain a vector representation of the at least one first question.
For example, at least one second training data sample may be used as one batch to train the model. Taking the first question query_i as an example, pos_i may be taken as the positive sample of query_i, while query_i and all negative sample answers neg within the same batch are mutual negatives. At the same time, query_i may also be taken as the positive sample of pos_i, while pos_i and all question queries within the same batch are mutual negatives.
Specifically, each first question in at least one second training data sample may be input into the first text encoding model to obtain a vector representation of each first question. As shown in FIG. 6, questions in each training data sample within the batch may be input into a first text encoding model 610, resulting in a first vector characterization.
The first text encoding model may encode the input text (e.g., questions) to obtain a vector representation of a particular length of the input text, such as converting natural language into a dense vector representation. For example, the first text encoding model may include a pre-trained language model. As a specific example, the first text encoding model may be an M3E model, which is not limited in this application.
342, inputting at least one positive sample answer and at least one negative sample answer in the at least one second training data sample into the second text encoding model to obtain a vector representation of the at least one positive sample answer and the at least one negative sample answer.
Specifically, the positive sample answer corresponding to each first question in the at least one second training data sample can be input into the second text encoding model to obtain a vector representation of a specific length for the positive sample answer of each first question; and the negative sample answer corresponding to each first question in the at least one second training data sample can be input into the second text encoding model to obtain a vector representation of a specific length for the negative sample answer of each first question. For example, with continued reference to FIG. 6, the positive and negative sample answers in each training data sample within the batch may be respectively input into the second text encoding model 620, resulting in a second vector representation. The second vector representation comprises the vector representation of the positive sample answer and the vector representation of the negative sample answer.
The second text encoding model may encode the input text (e.g., questions) to obtain a vector representation of a particular length of the input text, such as converting natural language into a dense vector representation. Illustratively, the second text encoding model may include a pre-trained language model. As a specific example, the second text encoding model may be an M3E model, which is not limited in this application.
In some embodiments, the first text encoding model and the second text encoding model have the same network structure and share weight parameters. In other words, the first text encoding model and the second text encoding model may be one and the same model.
That is, the embodiments of the present application do not distinguish between a text encoding model for encoding a question and a text encoding model for encoding a document, but use the same encoding model to encode text in different formats (e.g., questions and answers). By using the same coding model to code texts in different formats, on one hand, the text coding model can support texts in different forms, such as texts simultaneously supporting two forms of questions and answers, for example, a query can be used for retrieving the query, or a query can be used for retrieving documents. On the other hand, the number of the models and the model parameters can be reduced, so that the storage cost of the models is saved, and the model training efficiency is improved.
343 determining a loss function based on the vector characterization of the at least one first question, the at least one positive sample answer and the at least one negative sample answer.
Specifically, a contrastive learning approach may be adopted to determine the loss function according to the vector representations of the at least one first question, the at least one positive sample answer and the at least one negative sample answer in the at least one second training data sample of each batch. For example, with continued reference to fig. 6, the first characterization vector and the second characterization vector may be input to an operator 630, which outputs the loss function.
In some embodiments, the first contrast loss may be determined from a vector characterization of each first question and a positive sample answer corresponding to each first question, and a vector characterization of each first question and at least one negative sample answer. The loss function may then be determined from the first contrast loss. That is, the loss function includes a first contrast loss.
Specifically, there are K training data samples (query, pos, neg) in each batch. First, for the i-th training data sample, pos_i is taken as the positive sample of query_i, and query_i and all negative sample answers neg in the same batch are mutual negatives; the first contrast loss can then be obtained according to the distance between query_i and the positive sample pos_i and the distances between query_i and the negative samples neg_j. As a specific example, the first contrast loss L_1 can be represented by the following formula (2):

$$\mathcal{L}_1=-\log\frac{\exp\left(\mathrm{sim}(v_{q_i},v_{p_i})/\tau\right)}{\exp\left(\mathrm{sim}(v_{q_i},v_{p_i})/\tau\right)+\sum_{j=1}^{K}\exp\left(\mathrm{sim}(v_{q_i},v_{n_j})/\tau\right)}\qquad(2)$$

where v_{q_i} denotes the vector representation of the query obtained via the text coding model, v_{p_i} denotes the vector representation of pos obtained via the text coding model, v_{n_j} denotes the vector representation of neg obtained via the text coding model, sim denotes the similarity score between vector representations, and τ is a temperature hyperparameter. The training process aims at minimizing this loss function so that the query is sufficiently close to the positive sample pos and sufficiently far from the negative samples neg.
In some embodiments, the second contrast loss may be determined from each positive sample answer and the vector representation of the first question corresponding to that positive sample answer, and from each positive sample answer and the vector representations of at least one negative sample question; the negative sample questions comprise the first questions other than the first question corresponding to that positive sample answer.
Specifically, for the i-th training data sample, query_i is taken as the positive sample of pos_i, and all queries query_j (j ≠ i) in the same batch other than query_i are negative samples of pos_i; the second contrast loss can then be obtained according to the distance between pos_i and the positive sample query_i and the distances between pos_i and the negative samples query_j. As a specific example, the second contrast loss L_2 can be represented by the following formula (3):

$$\mathcal{L}_2=-\log\frac{\exp\left(\mathrm{sim}(v_{p_i},v_{q_i})/\tau\right)}{\sum_{j=1}^{K}\exp\left(\mathrm{sim}(v_{p_i},v_{q_j})/\tau\right)}\qquad(3)$$

where v_{q_j} denotes the vector representation of the query obtained via the text coding model, v_{p_i} denotes the vector representation of pos obtained via the text coding model, sim denotes the similarity score between vector representations, and τ is a temperature hyperparameter. The training process aims at minimizing this loss function so that pos_i stays sufficiently close to its positive sample query_i and sufficiently far from its negative samples query_j (j ≠ i).
The loss function may then be determined from the first contrast loss and the second contrast loss. That is, the loss function may include the first contrast loss and the second contrast loss. Illustratively, the loss function L can be expressed as the following formula (4):

$$\mathcal{L}=\mathcal{L}_1+\mathcal{L}_2\qquad(4)$$
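A minimal PyTorch sketch of formulas (2) to (4) follows, assuming the query, positive answer and negative answer of the K samples in a batch have been encoded into (K, dim) tensors by the shared text coding model; function names and the temperature value are assumptions.

```python
import torch
import torch.nn.functional as F

def bidirectional_contrastive_loss(q: torch.Tensor, pos: torch.Tensor,
                                   neg: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    q, pos, neg = (F.normalize(x, dim=-1) for x in (q, pos, neg))
    # L1 (formula (2)): query_i vs. its positive answer and all neg answers in the batch
    logits_1 = torch.cat([(q * pos).sum(-1, keepdim=True),      # sim(q_i, p_i)
                          q @ neg.t()], dim=1) / tau            # sim(q_i, n_j)
    labels_1 = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    loss_1 = F.cross_entropy(logits_1, labels_1)
    # L2 (formula (3)): pos_i vs. its query and all other queries in the batch
    logits_2 = (pos @ q.t()) / tau                               # sim(p_i, q_j)
    labels_2 = torch.arange(q.size(0), device=q.device)
    loss_2 = F.cross_entropy(logits_2, labels_2)
    return loss_1 + loss_2                                       # formula (4)
```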
thus, embodiments of the present application can be implemented to, based on the constructed higher quality negative-sample dataPositive sample, ->Negative samples are all mutually negative with all negative sample answers neg in the same batch, and +.>For positive sample, ++>All problem queries within the same batch are negative samples from each other to determine the contrast loss function.
And 344, adjusting parameters of the first text coding model and the second text coding model according to the loss function to obtain a trained text coding model. The text coding model comprises a first text coding model and a second text coding model.
Specifically, minimization of the loss function can be targeted during the training process, so that query_i stays sufficiently close to the positive sample pos_i and sufficiently far from the negative samples neg. Alternatively, when the loss function includes both the first contrast loss and the second contrast loss, the parameters of the text encoding model can be optimized so that query_i stays sufficiently close to the positive sample pos_i and sufficiently far from the negative samples neg, and at the same time pos_i stays sufficiently close to the positive sample query_i and sufficiently far from the negative samples query_j (j ≠ i).
Illustratively, with continued reference to FIG. 6, parameters of the first text encoding model 610 and the second text encoding model 620 may be updated according to a loss function to obtain a trained text encoding model. Wherein when the first text encoding model 610 and the second text encoding model 620 share weight parameters, the first text encoding model 610 and the second text encoding model 620 are the same model.
Therefore, by using a contrastive learning training mode in which query_i and pos_i serve as positive samples for each other, the embodiments of the present application further help the text encoding model support texts in different forms, such as supporting both questions and answers at the same time, and further help improve the quality of the sentence vectors generated by the model.
Fig. 7 is a schematic flow chart of a sentence vector retrieving method 700 according to an embodiment of the present application, where the method 700 may be performed by any electronic device having data processing capabilities, for example, the electronic device may be implemented as a server or a terminal device, for example, the executing device 104 in fig. 2, which is not limited in this application. As shown in fig. 7, method 700 includes steps 710 through 760.
And 710, acquiring the problem text.
For example, when the text coding model is trained on a training sample set obtained from business data in a specific domain, the text coding model has knowledge of that specific domain and can reply to questions in that domain. In this case, the question text (query) acquired in step 710 may be question text of the specific domain.
720, inputting the question text into a text coding model to obtain a first vector representation corresponding to the question text; the text coding model is obtained according to the training method 300 for the text coding model. For details of the text coding model, reference may be made to the related description above, which is not repeated here.
In particular, the text encoding model may encode the input question text and output a vector representation of a specific length for the question text, i.e. the first vector representation. Since the text encoding model is trained using a high-quality training sample set, it can output a more accurate first vector representation corresponding to the question text.
730, obtaining a corpus, the corpus comprising at least one document.
Illustratively, the corpus is the corpus corresponding to the question text in step 710, that is, the corpus includes documents related to the question text in step 710. For example, when the question text is a question related to a business in a specific domain, the corpus contains documents related to business data in that domain, and based on the corpus, the document corresponding to the question text can be obtained.
740, inputting the at least one document into the text encoding model to obtain at least one second vector representation corresponding to the at least one document. Specifically, the text coding model is the same as the text coding model in step 720 and is obtained according to the training method 200 of the text coding model. For details, reference may be made to the related description above, which is not repeated here.
In particular, the text encoding model may encode an input document and output a fixed-length vector representation of the document, i.e., the at least one second vector representation. For example, the at least one second vector representation may be stored in a faiss vector library. Because the text encoding model is trained using a high-quality training sample set, it can output a more accurate second vector representation corresponding to the document.
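Continuing the sketch above (reusing the encoder, and treating documents as the list of corpus documents; both are illustrative assumptions), the second vector representations could be stored in a faiss vector library as follows:

    import faiss
    import numpy as np

    documents = ["game FAQ document 1 ...", "game FAQ document 2 ..."]   # hypothetical corpus
    doc_vectors = np.asarray(
        encoder.encode(documents, normalize_embeddings=True), dtype="float32"
    )  # one second vector representation per document

    index = faiss.IndexFlatIP(doc_vectors.shape[1])  # inner product equals cosine similarity on normalized vectors
    index.add(doc_vectors)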
And 750, determining a target document corresponding to the question text from the at least one document according to the similarity of the first vector representation and the at least one second vector representation.
For example, a document related to the question text may be retrieved from the at least one document as the target document based on the vector similarity between the first vector representation and the at least one second vector representation. As a specific example, the TopN documents with the highest vector similarity may be determined as the target documents related to the question text in step 710.
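Continuing the same sketch, the TopN documents with the highest similarity to the first vector representation could be retrieved from the faiss index as follows; k=4 merely mirrors the Top4 example given later and is not a required value.

    query_vec = np.asarray([first_vector], dtype="float32")   # shape (1, dim)
    scores, ids = index.search(query_vec, k=4)                # Top4 documents by vector similarity
    target_documents = [documents[i] for i in ids[0]]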
760, the question text and the target document are entered into a pre-trained large language model, resulting in a reply to the question text.
Illustratively, the original question text in step 710 may be input into a large language model (Large Language Model, LLM) together with the target document obtained in step 750, allowing the large language model to generate a more appropriate and accurate answer. A large language model refers to a deep learning model trained using large amounts of text data, which can generate natural language text or understand the meaning of natural language text. By way of example, the large language model may be a pre-trained model, which is not limited in this application.
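How the question text and the target documents might be combined into a prompt for the large language model is sketched below; the prompt wording and the llm.generate call are hypothetical placeholders, since this application does not limit the specific large language model or its interface.

    context = "\n\n".join(target_documents)
    prompt = (
        "Answer the question using only the reference documents below.\n\n"
        f"Reference documents:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    reply = llm.generate(prompt)  # `llm` and its `generate` method are hypothetical placeholders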
Fig. 8 is a schematic flow chart of sentence vector retrieval according to an embodiment of the present application. As shown in fig. 8, a corpus in a particular domain may first be organized, including a plurality of documents 810. The documents 810 in the corpus are then encoded using a trained text encoder 820, and the encoded results are stored in a database 830. By way of example, the database may include a faiss vector library, which is not limited in this application. In particular, the text encoder 820 may be a model trained using the model training method 200 described above. Upon receiving a question 840, the question 840 is encoded using the text encoder 820 to obtain a vector representation of the question 840. Thereafter, at least one relevant document related to the question 840 is retrieved from the database 830 based on the vector similarity between the vector representation of the question 840 and the vector representations of the documents stored in the database 830. The retrieved relevant documents are then input into LLM 860 together with the question 840 to provide necessary external knowledge as guidance for the model, allowing LLM 860 to generate a more appropriate reply.
As a specific example, in a retrieval scenario in the game field, the Top4 recall accuracy of the open-source M3E model is 75%, while the text coding model trained according to the embodiment of the present application can reach a recall accuracy of 83%.
Therefore, according to the embodiment of the application, the text encoder is trained on the high-quality training data sample set so that it outputs more accurate text vector representations. The input question text and the at least one document in the corpus are then encoded into corresponding vector representations by the text encoder, and the documents relevant to the input question text are obtained based on the similarity between these vector representations. The question text and the relevant documents are input into a large language model together, providing necessary external knowledge as guidance for the model, so that a more accurate and appropriate reply to the question text is obtained and the accuracy of sentence vector retrieval is improved.
The specific embodiments of the present application have been described in detail above with reference to the accompanying drawings, but the present application is not limited to the specific details of the above embodiments, and various simple modifications may be made to the technical solutions of the present application within the scope of the technical concept of the present application, and all the simple modifications belong to the protection scope of the present application. For example, the specific features described in the above embodiments may be combined in any suitable manner, and in order to avoid unnecessary repetition, various possible combinations are not described in detail. As another example, any combination of the various embodiments of the present application may be made without departing from the spirit of the present application, which should also be considered as disclosed herein.
It should be further understood that, in the various method embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic of the processes, and should not constitute any limitation on the implementation process of the embodiments of the present application. It is to be understood that the numbers may be interchanged where appropriate such that the described embodiments of the application may be implemented in other sequences than those illustrated or described.
Method embodiments of the present application are described above in detail, and apparatus embodiments of the present application are described below in detail in conjunction with fig. 9 to 11.
Fig. 9 is a schematic block diagram of a model training apparatus 10 according to an embodiment of the present application. As shown in fig. 9, the model training apparatus 10 may include an acquisition unit 11, a determination unit 12, and a training unit 13.
An acquiring unit 11, configured to acquire a first training data sample, where the first training data sample includes a first question and a positive sample answer corresponding to the first question;
a determining unit 12, configured to determine a negative sample answer from at least one answer corresponding to at least one second question according to the similarity between the first question and the at least one second question and the similarity between the positive sample answer and the at least one answer;
The determining unit 12 is further configured to determine a second training data sample, where the second training data sample includes the first question, the positive sample answer, and the negative sample answer;
and the training unit 13 is used for training the text coding model according to the second training data sample.
In some embodiments, the determining unit 12 is specifically configured to:
determining a first similarity of the first question to each of the at least one second question, and a second similarity of the positive sample answer to the answer corresponding to each of the second questions;
determining a score for each of the second questions based on the first and second similarities;
and determining the negative sample answer in the at least one answer according to the score of each second question.
In some embodiments, the determining unit 12 is specifically configured to:
and determining an answer corresponding to the second question with the highest score as the negative sample answer.
In some embodiments, the determining unit 12 is specifically configured to:
inputting the first problem and the at least one second problem into a language model respectively to obtain vector representation of the first problem and vector representation of each second problem;
Obtaining the first similarity according to the vector representation of the first problem and the vector representation of each second problem;
respectively inputting the positive sample answer and the at least one answer into the language model to obtain a vector representation of the positive sample answer and a vector representation of the at least one answer;
and obtaining the second similarity according to the vector representation of the positive sample answer and the vector representation of the at least one answer.
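As a specific example, a minimal sketch of this negative-sample selection is given below; cosine similarity and a score that rewards a similar second question with a dissimilar answer (the difference of the two similarities) are illustrative assumptions, and the specific scoring function is not limited thereto.

    import numpy as np

    def mine_negative(first_q_vec, pos_ans_vec, second_q_vecs, ans_vecs):
        # first_q_vec:   (D,)   vector representation of the first question
        # pos_ans_vec:   (D,)   vector representation of the positive sample answer
        # second_q_vecs: (K, D) vector representations of the candidate second questions
        # ans_vecs:      (K, D) vector representations of the answers to those questions
        def cos(v, mat):
            return (mat @ v) / (np.linalg.norm(mat, axis=-1) * np.linalg.norm(v))

        first_sim = cos(first_q_vec, second_q_vecs)   # first similarity: question vs. question
        second_sim = cos(pos_ans_vec, ans_vecs)       # second similarity: answer vs. answer
        score = first_sim - second_sim                # similar question, dissimilar answer
        return int(np.argmax(score))                  # index of the highest-scoring second question

The answer corresponding to the returned index is then used as the negative sample answer in the second training data sample.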
In some embodiments, the training unit 13 is specifically configured to:
inputting at least one first question in at least one second training data sample into a first text coding model to obtain a vector representation of at least one first question;
inputting at least one positive sample answer and at least one negative sample answer in at least one second training data sample into a second text coding model to obtain vector characterization of at least one positive sample answer and at least one negative sample answer;
determining a loss function based on vector characterizations of at least one of the first question, at least one of the positive sample answers, and at least one of the negative sample answers;
According to the loss function, parameters of the first text coding model and the second text coding model are adjusted, and the trained text coding model is obtained; wherein the text encoding model includes the first text encoding model and the second text encoding model.
In some embodiments, the first text encoding model and the second text encoding model have the same network structure and share weight parameters.
In some embodiments, the training unit 13 is specifically configured to:
determining a first contrast loss based on the vector characterization of each of the first questions and the positive sample answer corresponding to each of the first questions, and the vector characterization of each of the first questions and at least one of the negative sample answers;
and determining the loss function according to the first contrast loss.
In some embodiments, the training unit 13 is specifically configured to:
determining a second contrast loss based on each of the positive sample answers and the vector representation of the first question corresponding to each of the positive sample answers, and each of the positive sample answers and the vector representation of at least one negative sample question; wherein the negative sample questions comprise at least one question of the first questions except for the first question corresponding to each positive sample answer;
Determining the loss function based on the first contrast loss and the second contrast loss.
In some embodiments, the text encoding model includes an M3E model.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the model training apparatus 10 shown in fig. 9 may perform the above-described method embodiment, and the foregoing and other operations and/or functions of each module in the model training apparatus 10 are respectively for implementing the corresponding flow in the above-described method 200, which is not repeated herein for brevity.
Fig. 10 is a schematic block diagram of a sentence vector retrieving apparatus 20 of the embodiment of the present application. As shown in fig. 10, the sentence vector retrieving apparatus 20 may include an acquisition unit 21, a text encoding model 22, a determination unit 23, and a large language model 24.
An acquisition unit 21 for acquiring a question text;
a text encoding model 22, configured to input the question text, and obtain a first vector representation corresponding to the question text; wherein the text encoding model is obtained according to the training method of any one of claims 1-8;
The obtaining unit 21 is further configured to obtain a corpus, where the corpus includes at least one document;
the text encoding model 22 is further configured to input the at least one document, and obtain at least one second vector representation corresponding to the at least one document;
a determining unit 23, configured to determine a target document corresponding to the question text from the at least one document according to the similarity of the first vector representation and the at least one second vector representation;
and the large language model 24 is used for inputting the problem text and the target document and obtaining a reply of the problem text.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here. Specifically, the sentence vector retrieving apparatus 20 shown in fig. 10 may perform the above-mentioned method embodiment, and the foregoing and other operations and/or functions of each module in the sentence vector retrieving apparatus 20 are respectively for implementing the corresponding flow in the above-mentioned method 700, and are not repeated herein for brevity.
The apparatus of the embodiments of the present application are described above in terms of functional modules in conjunction with the accompanying drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiments in the embodiments of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in software form, and the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
Fig. 11 is a schematic block diagram of an electronic device 30 provided in an embodiment of the present application.
As shown in fig. 11, the electronic device 30 may include:
a memory 31 and a processor 32, the memory 31 being for storing a computer program and for transmitting the program code to the processor 32. In other words, the processor 32 may call and run a computer program from the memory 31 to implement the methods in the embodiments of the present application.
For example, the processor 32 may be configured to perform the above-described method embodiments according to instructions in the computer program.
In some embodiments of the present application, the processor 32 may include, but is not limited to:
a general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
In some embodiments of the present application, the memory 31 includes, but is not limited to:
volatile memory and/or nonvolatile memory. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be Random Access Memory (RAM), which acts as an external cache. By way of example, and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
In some embodiments of the present application, the computer program may be partitioned into one or more modules that are stored in the memory 31 and executed by the processor 32 to perform the methods provided herein. The one or more modules may be a series of computer program instruction segments capable of performing the specified functions, which are used to describe the execution of the computer program in the electronic device.
As shown in fig. 11, the electronic device 30 may further include:
a transceiver 33, the transceiver 33 being connectable to the processor 32 or the memory 31.
The processor 32 may control the transceiver 33 to communicate with other devices, and in particular, may send information or data to other devices or receive information or data sent by other devices. The transceiver 33 may include a transmitter and a receiver. The transceiver 33 may further include antennas, the number of which may be one or more.
It will be appreciated that the various components in the electronic device are connected by a bus system that includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments. Alternatively, embodiments of the present application also provide a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of the method embodiments described above.
When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, produces, in whole or in part, a flow or function consistent with embodiments of the present application. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be accessed by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a Solid State Disk (SSD)), or the like.
It will be appreciated that in the specific implementation of the present application, when the above embodiments of the present application are applied to specific products or technologies and relate to data related to user information and the like, user permission or consent needs to be obtained, and the collection, use and processing of the related data needs to comply with the relevant laws and regulations and standards.
Those of ordinary skill in the art will appreciate that the various illustrative modules and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The foregoing is merely a specific embodiment of the present application, but the protection scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes or substitutions are covered in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (14)

1. A method for training a text encoding model, comprising:
acquiring a first training data sample, wherein the first training data sample comprises a first question and a positive sample answer corresponding to the first question;
Determining a negative sample answer in at least one answer according to the similarity of the first question and at least one second question and the similarity of the positive sample answer and at least one answer corresponding to the at least one second question; the similarity between the first question and the at least one second question is larger than a preset value, and the similarity between the positive sample answer and at least one answer corresponding to the at least one second question is smaller than the preset value;
determining a second training data sample comprising the first question, the positive sample answer, and the negative sample answer;
training the text coding model according to the second training data sample;
wherein determining the negative sample answer from the at least one answer according to the similarity between the first question and the at least one second question and the similarity between the positive sample answer and the at least one answer corresponding to the at least one second question comprises:
determining a first similarity of the first question to each of the at least one second question, and a second similarity of the positive sample answer to the answer corresponding to each of the second questions;
Determining a score for each of the second questions based on the first and second similarities;
and determining the negative sample answer in the at least one answer according to the score of each second question.
2. The method of claim 1, wherein said determining the negative sample answer from the at least one answer based on the score of each second question comprises:
and determining an answer corresponding to the second question with the highest score as the negative sample answer.
3. The method of claim 1, wherein the determining a first similarity of the first question to each of the at least one second question and a second similarity of the positive sample answer to the answer corresponding to each of the second questions comprises:
inputting the first problem and the at least one second problem into a language model respectively to obtain vector representation of the first problem and vector representation of each second problem;
obtaining the first similarity according to the vector representation of the first problem and the vector representation of each second problem;
respectively inputting the positive sample answer and the at least one answer into the language model to obtain a vector representation of the positive sample answer and a vector representation of the at least one answer;
And obtaining the second similarity according to the vector representation of the positive sample answer and the vector representation of the at least one answer.
4. The method of claim 1, wherein training the text encoding model based on the second training data samples comprises:
inputting at least one first question in at least one second training data sample into a first text coding model to obtain a vector representation of at least one first question;
inputting at least one positive sample answer and at least one negative sample answer in at least one second training data sample into a second text coding model to obtain vector characterization of at least one positive sample answer and at least one negative sample answer;
determining a loss function based on vector characterizations of at least one of the first question, at least one of the positive sample answers, and at least one of the negative sample answers;
according to the loss function, parameters of the first text coding model and the second text coding model are adjusted, and the trained text coding model is obtained; wherein the text encoding model includes the first text encoding model and the second text encoding model.
5. The method of claim 4, wherein the first text encoding model and the second text encoding model have the same network structure and share weight parameters.
6. The method of claim 4, wherein said determining a loss function based on vector characterizations of at least one of said first question, at least one of said positive sample answers, and at least one of said negative sample answers comprises:
determining a first contrast loss based on the vector characterization of each of the first questions and the positive sample answer corresponding to each of the first questions, and the vector characterization of each of the first questions and at least one of the negative sample answers;
and determining the loss function according to the first contrast loss.
7. The method of claim 6, wherein the method further comprises:
determining a second contrast loss based on each of the positive sample answers and the vector representation of the first question corresponding to each of the positive sample answers, and each of the positive sample answers and the vector representation of at least one negative sample question; wherein the negative sample questions comprise at least one question of the first questions except for the first question corresponding to each positive sample answer;
Wherein said determining said loss function from said first contrast loss comprises:
determining the loss function based on the first contrast loss and the second contrast loss.
8. The method of any of claims 1-7, wherein the text encoding model comprises an M3E model.
9. A sentence vector retrieval method, comprising:
acquiring a question text;
inputting the problem text into a text coding model to obtain a first vector representation corresponding to the problem text; wherein the text encoding model is obtained according to the training method of any one of claims 1-8;
obtaining a corpus, wherein the corpus comprises at least one document;
inputting the at least one document into the text coding model to obtain at least one second vector representation corresponding to the at least one document;
determining a target document corresponding to the problem text from the at least one document according to the similarity of the first vector representation and the at least one second vector representation;
and inputting the problem text and the target document into a pre-trained large language model to obtain a reply of the problem text.
10. A training device for a text encoding model, comprising:
an acquisition unit, configured to acquire a first training data sample, wherein the first training data sample comprises a first question and a positive sample answer corresponding to the first question;
a determining unit, configured to determine a negative sample answer from at least one answer according to a similarity between the first question and at least one second question and a similarity between the positive sample answer and at least one answer corresponding to the at least one second question; the similarity between the first question and the at least one second question is larger than a preset value, and the similarity between the positive sample answer and at least one answer corresponding to the at least one second question is smaller than the preset value;
the determining unit is further configured to determine a second training data sample, where the second training data sample includes the first question, the positive sample answer, and the negative sample answer;
the training unit is used for training the text coding model according to the second training data sample;
the determining unit is specifically configured to:
determining a first similarity of the first question to each of the at least one second question, and a second similarity of the positive sample answer to the answer corresponding to each of the second questions;
Determining a score for each of the second questions based on the first and second similarities;
and determining the negative sample answer in the at least one answer according to the score of each second question.
11. A sentence vector retrieving apparatus, comprising:
an acquisition unit for acquiring a question text;
the text coding model is used for inputting the problem text and obtaining a first vector representation corresponding to the problem text; wherein the text encoding model is obtained according to the training method of any one of claims 1-8;
the obtaining unit is further used for obtaining a corpus, and the corpus comprises at least one document;
the text coding model is also used for inputting the at least one document to obtain at least one second vector representation corresponding to the at least one document;
a determining unit, configured to determine a target document corresponding to the question text from the at least one document according to a similarity between the first vector representation and the at least one second vector representation;
and the large language model is used for inputting the problem text and the target document to obtain a reply of the problem text.
12. An electronic device comprising a processor and a memory, the memory having instructions stored therein that when executed by the processor cause the processor to perform the method of any of claims 1-9.
13. A computer storage medium for storing a computer program, the computer program comprising instructions for performing the method of any one of claims 1-9.
14. A computer program product comprising computer program code which, when run by an electronic device, causes the electronic device to perform the method of any one of claims 1-9.
CN202311457583.6A 2023-11-03 2023-11-03 Text coding model training method, device, equipment and storage medium Active CN117272937B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311457583.6A CN117272937B (en) 2023-11-03 2023-11-03 Text coding model training method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311457583.6A CN117272937B (en) 2023-11-03 2023-11-03 Text coding model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN117272937A CN117272937A (en) 2023-12-22
CN117272937B true CN117272937B (en) 2024-02-23

Family

ID=89210679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311457583.6A Active CN117272937B (en) 2023-11-03 2023-11-03 Text coding model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117272937B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460396A (en) * 2017-09-20 2018-08-28 腾讯科技(深圳)有限公司 The negative method of sampling and device
CN111881264A (en) * 2020-09-28 2020-11-03 北京智源人工智能研究院 Method and electronic equipment for searching long text in question-answering task in open field
EP3832519A1 (en) * 2019-12-05 2021-06-09 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for evaluating translation quality
CN114547267A (en) * 2022-02-22 2022-05-27 武汉纺织大学 Intelligent question-answering model generation method and device, computing equipment and storage medium
CN115147849A (en) * 2022-06-17 2022-10-04 支付宝(杭州)信息技术有限公司 Training method of character coding model, character matching method and device
WO2023274187A1 (en) * 2021-07-01 2023-01-05 北京有竹居网络技术有限公司 Information processing method and apparatus based on natural language inference, and electronic device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111177328B (en) * 2018-11-12 2023-04-28 阿里巴巴集团控股有限公司 Question-answer matching system and method, question-answer processing device and medium

Also Published As

Publication number Publication date
CN117272937A (en) 2023-12-22

Legal Events

PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant