CN113282711A - Internet of vehicles text matching method and device, electronic equipment and storage medium


Info

Publication number
CN113282711A
Authority
CN
China
Prior art keywords
text
model
vector
matched
matching
Prior art date
Legal status
Granted
Application number
CN202110622070.0A
Other languages
Chinese (zh)
Other versions
CN113282711B (en)
Inventor
邹博松
王卉捷
宋娟
郭盈
Current Assignee
China Software Evaluation Center
Original Assignee
China Software Evaluation Center
Priority date
Filing date
Publication date
Application filed by China Software Evaluation Center filed Critical China Software Evaluation Center
Priority to CN202110622070.0A
Publication of CN113282711A
Application granted
Publication of CN113282711B
Status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides an Internet of Vehicles text matching method and apparatus, an electronic device, and a storage medium, for solving the problem that a good text matching effect is difficult to obtain at the level of semantic representation. The method comprises the following steps: acquiring a text to be matched, and extracting the abstract content and the dependency-syntax core components of the text to be matched; performing word segmentation and vectorization on the abstract content, the dependency-syntax core components, and the text to be matched to obtain an embedded vector matrix comprising sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors; fusing the sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors to obtain an input representation vector; and matching and ranking a plurality of search texts according to the input representation vector using a text matching model, to obtain a plurality of ranked search texts.

Description

Internet of vehicles text matching method and device, electronic equipment and storage medium
Technical Field
The application relates to the technical fields of the Internet of Vehicles and natural language processing, and in particular to an Internet of Vehicles text matching method and apparatus, an electronic device, and a storage medium.
Background
The Internet of Vehicles (IoV) refers to vehicle-mounted devices that use wireless communication technology to make effective use of dynamic vehicle information on an information network platform and to provide different functional services while the vehicle is operating. IoV communication is also referred to as V2X (vehicle-to-everything): devices on board a vehicle can exchange information through several types of communication, including vehicle-to-infrastructure (V2I), vehicle-to-network (V2N), vehicle-to-vehicle (V2V), vehicle-to-pedestrian (V2P), and vehicle-to-device (V2D).
Text matching (Text Match) is an important problem in Natural Language Processing (NLP), and many NLP tasks can be abstracted as text matching problems. For example, web search can be abstracted as the search engine matching a set of relevant web pages to a user's query text; likewise, automatic question answering can be abstracted as matching candidate answers to a question by satisfaction, and text deduplication can be abstracted as matching a query text to candidate texts by similarity.
In the current Internet of Vehicles industry, information search and V2X information exchange between automobiles and the network are mostly realized through text matching, and most existing Internet of Vehicles text matching methods are based on machine learning. For example, after a user enters keywords, a search engine may rank web page documents by relevance using the Term Frequency-Inverse Document Frequency (TF-IDF) algorithm or a Vector Space Model (VSM) algorithm. In practice, such machine-learning-based matching relies only on factors such as term frequency, inverse document frequency, and document length, and therefore has many problems; for example, it fails on compositional questions in automatic question answering (for the question "high-speed railway from Beijing to Shanghai" it may return a candidate answer about the "high-speed railway from Shanghai to Beijing") and on questions involving ambiguity and synonymy. In other words, ranking search results by lexical-level similarity alone makes it difficult to achieve a good text matching effect at the level of semantic representation.
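To make the limitation concrete, the following minimal sketch (ours, not from the application) scores the two bullet-train questions above with TF-IDF: because TF-IDF treats text as a bag of words, the two questions receive identical vectors even though their meanings differ.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {term: tf-idf weight} dict per whitespace-tokenized document."""
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    df = Counter(term for toks in tokenized for term in set(toks))  # document frequency
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (c / len(toks)) * math.log(n / df[t]) for t, c in tf.items()})
    return vectors

docs = ["high-speed railway from Beijing to Shanghai",
        "high-speed railway from Shanghai to Beijing"]
v1, v2 = tfidf_vectors(docs)
print(v1 == v2)  # True -- bag-of-words weighting discards word order entirely
```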
Disclosure of Invention
An object of the embodiments of the application is to provide an Internet of Vehicles text matching method and apparatus, an electronic device, and a storage medium, for solving the problem that a good text matching effect is difficult to obtain at the level of semantic representation.
The embodiment of the application provides an Internet of Vehicles text matching method, which comprises the following steps: acquiring a text to be matched, and extracting the abstract content and the dependency-syntax core components of the text to be matched; performing word segmentation and vectorization on the abstract content, the dependency-syntax core components, and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors; fusing the sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors to obtain an input representation vector; and matching and ranking a plurality of search texts according to the input representation vector using a text matching model obtained by multi-task joint training, to obtain a plurality of ranked search texts. In this implementation, the input representation vector is obtained by segmenting, vectorizing, and fusing the abstract content, the dependency-syntax core components, and the text to be matched, so it represents the text content better at the semantic level; and because sentence component vectors that distinguish the abstract content from the dependency-syntax core components are added to the input representation vector, the text matching model can more easily distinguish the core abstract content and the dependency-syntax core components of the text and rank matches according to the components that best represent the semantics, effectively improving the text matching effect at the level of semantic representation.
Optionally, in this embodiment of the application, extracting the abstract content of the text to be matched and the dependency-syntax core components of the text to be matched includes: using a pre-trained generative pre-training model as the abstract extraction model to summarize the text to be matched, obtaining its abstract content; and using a dependency analysis tool to extract the subject-verb relation components, verb-object relation components, indirect-object relation components, adverbial structure components, and/or head relation components of the text to be matched, and determining these as its dependency-syntax core components. In this implementation, extracting the abstract content and the dependency-syntax core components makes it easier for the text matching model to distinguish the core components of the text and to rank matches according to the abstract content and core components that best represent the semantics, effectively improving the text matching effect at the level of semantic representation.
Optionally, in this embodiment of the application, before the pre-trained generative pre-training model is used as the abstract extraction model to summarize the text to be matched, the method further includes: acquiring a text data set and an abstract data set, where the abstract texts in the abstract data set are obtained by summarizing the sample texts in the text data set; and training a generative pre-training network with the text data set and the abstract data set to obtain the generative pre-training model. In this implementation, the generative pre-training model is trained separately on the text data set and the abstract data set rather than taken off the shelf from the Internet, which avoids the problem of an existing model being inapplicable and gives the separately trained model better generalization ability and higher accuracy.
Optionally, in an embodiment of the application, the text matching model includes a feature extraction model and a deep network model, and matching and ranking the plurality of search texts according to the input representation vector using the text matching model includes: extracting a feature vector of the input representation vector with the feature extraction model; and matching and ranking the text vectors corresponding to the plurality of search texts according to that feature vector with the deep network model. In this implementation, matching and ranking are performed on feature vectors that are expressive at the semantic level, effectively improving the text matching effect in terms of semantics.
Optionally, in this embodiment of the application, before the text matching model is used to match and rank the plurality of search texts according to the input representation vector, the method further includes: acquiring a text data set, an abstract data set, and a dependency data set; performing multi-task joint training of the feature extraction model with the abstract data set and the dependency data set; and training the deep network model with the text data set to obtain the text matching model. In this implementation, the multi-task joint training of the feature extraction model on the abstract and dependency data sets further improves the resulting text matching model's ability to capture the core abstract content and dependency-syntax core components of a text, improving its matching accuracy.
Optionally, in an embodiment of the application, the text data set includes a query content sample, a positive sample text, and a plurality of negative sample texts, and training the deep network model with the text data set includes: predicting the predicted matching text corresponding to the query content sample with the deep network model in the text matching model; calculating the PairWise loss values between the predicted matching text and the positive sample text and between the predicted matching text and the negative sample texts; calculating the ListWise loss value among the query content sample, the positive sample text, and the plurality of negative sample texts; and training the deep network model in the text matching model according to the PairWise loss value and the ListWise loss value. In this implementation, by combining the PairWise and ListWise ideas and training on both loss values, the model is not trained on only one of a PointWise, PairWise, or ListWise loss: it gains the advantage of PairWise, namely a larger margin between the loss values of similar and dissimilar sentences, as well as the advantage of ListWise, namely considering a plurality of texts jointly, which improves the generalization ability of the text matching model.
Optionally, in this embodiment of the application, the feature extraction model adopts the RoBERTa model or the BERT model. Using the RoBERTa model as the feature extraction model to extract the feature vector of the input representation vector can further improve the accuracy of feature extraction compared with other models.
During training of the text matching model in this Internet of Vehicles text matching method, the input representation vector, which captures the semantic representation, is fed to a pre-trained language model such as RoBERTa or BERT; the text matching model is then jointly trained on the text matching task, the part-of-speech prediction task, and the dependency relation task, giving it better semantic representation ability. Performing text matching or text retrieval with the jointly trained model thus effectively improves the matching or retrieval effect at the level of semantic representation.
The embodiment of the application further provides an Internet of Vehicles text matching apparatus, including: a text content extraction module, configured to acquire a text to be matched and extract its abstract content and dependency-syntax core components; a vector matrix obtaining module, configured to perform word segmentation and vectorization on the abstract content, the dependency-syntax core components, and the text to be matched to obtain an embedded vector matrix comprising sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors; a representation vector obtaining module, configured to fuse the sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors to obtain an input representation vector; and a text matching and ranking module, configured to match and rank a plurality of search texts according to the input representation vector using a text matching model obtained by multi-task joint training, to obtain a plurality of ranked search texts.
Optionally, in an embodiment of the application, the text content extraction module includes: an abstract content extraction module, configured to summarize the text to be matched using a pre-trained generative pre-training model as the abstract extraction model, obtaining the abstract content of the text to be matched; and a dependency relation analysis module, configured to extract the subject-verb relation components, verb-object relation components, indirect-object relation components, adverbial structure components, and/or head relation components of the text to be matched using a dependency analysis tool, and to determine these as the dependency-syntax core components of the text to be matched.
Optionally, in an embodiment of the application, the text matching model includes a feature extraction model and a deep network model, and the text matching and ranking module includes: a feature vector extraction module, configured to extract a feature vector of the input representation vector using the feature extraction model; and a vector matching and ranking module, configured to match and rank the text vectors corresponding to the plurality of search texts according to the feature vector using the deep network model.
Optionally, in this embodiment of the application, the internet of vehicles text matching apparatus further includes: the training data acquisition module is used for acquiring a text data set, an abstract data set and a dependency data set; and the matching model obtaining module is used for performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and training the deep network model by using the text data set to obtain a text matching model.
Optionally, in an embodiment of the application, the text data set includes a query content sample, a positive sample text, and a plurality of negative sample texts, and the matching model obtaining module includes: a matching text prediction module, configured to predict the predicted matching text corresponding to the query content sample using the deep network model in the text matching model; a first loss calculation module, configured to calculate the PairWise loss values between the predicted matching text and the positive and negative sample texts; a second loss calculation module, configured to calculate the ListWise loss value among the query content sample, the positive sample text, and the negative sample texts; and a network model training module, configured to train the deep network model in the text matching model according to the PairWise loss value and the ListWise loss value.
Optionally, in this embodiment of the application, the feature extraction model adopts the RoBERTa model or the BERT model.
An embodiment of the present application further provides an electronic device, including: a processor and a memory, the memory storing processor-executable machine-readable instructions, the machine-readable instructions when executed by the processor performing the method as described above.
Embodiments of the present application also provide a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the method as described above.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flow chart of an Internet of Vehicles text matching method provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of a text matching model provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a training text matching model provided in an embodiment of the present application;
fig. 4 is a schematic structural diagram of an Internet of Vehicles text matching apparatus provided in an embodiment of the present application;
fig. 5 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
The technical solution in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
Before introducing the text matching method for the internet of vehicles provided by the embodiment of the present application, some concepts related in the embodiment of the present application are introduced:
Natural Language Processing (NLP) refers to the study of problems associated with natural language cognition, since understanding natural language requires extensive knowledge about the world and the ability to manipulate that knowledge.
Dependency Parsing reveals the syntactic structure of a sentence by analyzing the dependency relationships among its components, on the premise that the core verb of the sentence is the central component governing all others. A dependency is a binary, asymmetric relation between a head word and its dependents; the head of a sentence is usually a verb, and every other word either depends on the head directly or is connected to it through a dependency path. Dependency grammar parses a sentence into a dependency syntax tree that describes the dependency relationships among all the words, i.e., the syntactic collocations between words, and these collocations are associated with semantics.
Vectorization refers to representing character sequences as vectors, that is, converting a character sequence into vector form. In a specific implementation, the character sequence itself may be vectorized, or the character sequences may first be segmented (Tokenization) into words: representing the words as vectors yields Word Vectors, and representing sentences as vectors yields Sentence Vectors.
Joint training, also known as Joint Learning, refers to training multiple neural networks in a model with a multi-task learning framework, for example: training the feature extraction model and the deep network model sequentially or simultaneously under a multi-task learning framework.
KL divergence (KLD), also called relative entropy in information theory, information gain in statistical model inference, and information divergence, is an asymmetric measure of the difference between two probability distributions P and Q: it measures the number of extra bits required, on average, to encode samples from P using a code optimized for Q. Typically, P represents the true distribution of the data, and Q represents a theoretical distribution, an estimated model distribution, or an approximation of P.
It should be noted that the Internet of Vehicles text matching method provided by the embodiment of the present application may be executed by an electronic device, where the electronic device refers to a device terminal or a server capable of executing a computer program. Device terminals include, for example: smart phones, personal computers (PCs), tablet computers, and mobile Internet devices (MIDs). Servers include x86 servers and non-x86 servers; non-x86 servers include mainframes, minicomputers, and UNIX servers.
Before introducing the Internet of Vehicles text matching method provided by the embodiment of the present application, its applicable application scenarios are introduced. These include, but are not limited to: Internet of Vehicles information retrieval, collaborative filtering, commodity recommendation, advertisement push, and Internet of Vehicles Question Answering (QA) systems. In Internet of Vehicles information retrieval, for example, the user provides text content to a search engine, and the search engine matches a set of relevant web pages according to the relevance of that text, or searches the local file system for files that semantically match a target file's content. In an Internet of Vehicles question answering system, for example, the user asks "how to go from Beijing to New York", and the system provides the candidate answer that best meets the user's need.
Please refer to the schematic flow chart of the Internet of Vehicles text matching method shown in fig. 1. The main idea of the method is as follows: the input representation vector, obtained by segmenting, vectorizing, and fusing the abstract content, the dependency-syntax core components, and the text to be matched, represents the text content better at the level of semantic representation; and because the abstract content and the dependency-syntax core components are extracted, the text matching model can more easily distinguish the core components of the text and rank matches according to the abstract content and core components that best represent the semantics, effectively improving the text matching effect at the level of semantic representation. The method may comprise the following steps:
step S110: and acquiring a text to be matched, and extracting abstract content of the text to be matched and a dependency syntax core component of the text to be matched.
The dependency-syntax core components are the core parts obtained by dependency relation analysis of the text to be matched; that is, they represent the semantically central content of the text, and specifically include the main core parts of a sentence standing in relations such as subject-verb, verb-object, and indirect object. The text to be matched is the text content that needs to be matched, for example: the content of a question posed in an automatic question answering system.
The above embodiment of obtaining the text to be matched, the digest content, and the dependent syntax core component in step S110 may include:
step S111: and acquiring a text to be matched.
There are many ways to acquire the text to be matched in step S111, including but not limited to: the first acquisition mode is that texts to be matched sent by other terminal equipment are received, and the texts to be matched are stored in a file system, a database or mobile storage equipment; the second obtaining method obtains a pre-stored text to be matched, specifically for example: acquiring a text to be matched from a file system, or acquiring the text to be matched from a database, or acquiring the text to be matched from a mobile storage device; and in the third acquisition mode, the text to be matched on the Internet is acquired by using software such as a browser and the like, or other application programs are used for accessing the Internet to acquire the text to be matched.
Step S112: and using the pre-trained generative pre-training model as a summary extraction model to extract the summary of the text to be matched, so as to obtain the summary content of the text to be matched.
A Generative Pre-Training (GPT) model, also referred to as the GPT or GPT-2 model for short, is a large-scale Transformer-based language model published by OpenAI; it adopts a pre-training plus fine-tuning training scheme and can be used for tasks such as classification, inference, question answering, and similarity.
It can be understood that before the GPT or GPT-2 generative pre-training model is used, it needs to be trained separately; training one generative pre-training model on the text data set and the abstract data set, rather than directly adopting an existing model from the Internet, avoids the problem of an existing model being inapplicable. The separate training process is, for example: acquiring a text data set and an abstract data set, where the abstract texts in the abstract data set are obtained by summarizing the sample texts in the text data set. The acquired data sets may be manually collected and manually summarized annotation data; when enough manually annotated data is available, a neural network model trained on it can replace manual summarization and produce a machine-annotated data set. The generative pre-training network is then trained on the text data set and the abstract data set by Supervised Learning to obtain the generative pre-training model.
The embodiment of step S112 described above is, for example: using the pre-trained generative pre-training model (i.e., the GPT or GPT-2 model above) as the abstract extraction network model or abstract generation network model to summarize the text to be matched, obtaining its abstract content. Of course, in a specific implementation, other models may be used for abstract extraction, including the MatchSum and BertSum models, or for abstract generation, including the Masked Sequence to Sequence (MASS) model and the Unified Language Model (UniLM).
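As an illustration of this step only (not the application's own model), a publicly available summarization checkpoint from the Hugging Face hub can stand in for the abstract extraction network model; the model name below is an arbitrary public example, whereas the application trains its own GPT-style model.

```python
from transformers import pipeline

# Public checkpoint used purely for illustration.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

def extract_summary(text_to_match: str) -> str:
    """Return the abstract content of the text to be matched (cf. step S112)."""
    result = summarizer(text_to_match, max_length=60, min_length=5, do_sample=False)
    return result[0]["summary_text"]
```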
Step S113: and performing dependency relationship analysis on the text to be matched by using a dependency analysis tool to obtain a dependency syntax core component of the text to be matched.
The embodiment of step S113 described above is, for example: performing dependency relation analysis of the text to be matched with a dependency analysis tool, thereby extracting the subject-verb (SBV) components, verb-object (VOB) components, indirect-object (IOB) components, adverbial (ADV) components, and/or head (HED) components from the text to be matched, and determining these components as the dependency-syntax core components of the text to be matched. Dependency analysis tools that may be used here include, but are not limited to: the PyTorch-based LTP4 tool, the HanLP tool, and the DDParser tool.
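A sketch (ours) of this extraction step: given word-level dependency arcs as produced by a tool such as LTP, HanLP, or DDParser, keep the words whose relation label is one of the core relations named above. The arcs below are hand-written for illustration; the real output format varies by tool.

```python
# Core relations from step S113: subject-verb, verb-object, indirect-object,
# adverbial, head.
CORE_RELATIONS = {"SBV", "VOB", "IOB", "ADV", "HED"}

def dependency_core(words, arcs):
    """arcs[i] = (head_index, relation) for words[i], in LTP-style form."""
    return [(w, rel) for w, (head, rel) in zip(words, arcs) if rel in CORE_RELATIONS]

words = ["我", "喜欢", "猫"]                  # "I like cats"
arcs = [(2, "SBV"), (0, "HED"), (2, "VOB")]   # hand-written example parse
print(dependency_core(words, arcs))           # [('我', 'SBV'), ('喜欢', 'HED'), ('猫', 'VOB')]
```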
After step S110, step S120 is performed: and performing word segmentation and vectorization on the abstract content, the dependency syntax core components and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse-order position embedded vectors.
Since there are many permutation combinations of the sentence component vector, the token embedding vector, the position embedding vector and the reverse order position embedding vector, there are many embodiments of the step S120, and in a specific practical process, a partial vector may be selected according to specific situations to perform a fusion process, for example, only a sentence component vector, a token embedding vector and a position embedding vector are selected. Of course, all vectors may be selected for the fusion process, and this embodiment includes:
step S121: and performing word segmentation and vectorization processing on the abstract content, the dependency syntax core component and the text to be matched to obtain a token embedding vector corresponding to the abstract content, a token embedding vector corresponding to the dependency syntax core component and a token embedding vector corresponding to the text to be matched.
Token Embedding, similar to Word Embedding, refers to mapping each word piece (Word Piece), complete word, or other special character into a continuous multidimensional vector space; a Token Embedding Vector is the resulting vector.
The embodiment of step S121 described above is, for example: segmenting (Tokenization) the abstract content, the dependency-syntax core components, and the text to be matched with a grammar-and-rule-based method, a mechanical (dictionary-based) method, or a statistical method, obtaining a plurality of words. Mechanical segmentation methods include the dictionary-based forward maximum matching, reverse maximum matching, and least segmentation methods; statistical methods include the Hidden Markov Model (HMM) method, the N-gram method, and the conditional random field method. A pre-trained language model (PLM) is then used to vectorize the characters in each of the words, obtaining the token embedding vectors corresponding to the abstract content, the dependency-syntax core components, and the text to be matched. PLMs, also referred to simply as pre-trained models, are neural network models obtained by semi-supervised machine learning over a large text corpus, and they encode the textual structure relations of a language model.
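A minimal sketch (ours) of the token embedding in step S121: after tokenization, each token id selects a row of a learned embedding table. The vocabulary and dimension below are toy values.

```python
import torch
import torch.nn as nn

vocab = {"[CLS]": 0, "[SEP]": 1, "I": 2, "like": 3, "cat": 4}   # toy vocabulary
token_embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=768)

tokens = ["[CLS]", "I", "like", "cat", "[SEP]"]
ids = torch.tensor([vocab[t] for t in tokens])
token_vectors = token_embedding(ids)   # shape (5, 768): one embedding row per token
```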
Step S122: and performing word segmentation and vectorization processing on the abstract content, the dependency syntax core component and the text to be matched to obtain a sentence component vector corresponding to the abstract content, a sentence component vector corresponding to the dependency syntax core component and a sentence component vector corresponding to the text to be matched.
The sentence component vector indicates which of the abstract content, the dependency-syntax core components, or the text to be matched a token belongs to; for example, 1 may mark a token of the abstract content, 2 a token of the dependency-syntax core components, and 3 a token of the text to be matched.
Specifically, for example, in step S122: after the abstract content, the dependency-syntax core components, and the text to be matched are tokenized, the number of tokens in each can be counted, and the sentence component vectors corresponding to the abstract content, the dependency-syntax core components, and the text to be matched can be obtained by vectorizing the sentence component ids according to those token counts.
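A sketch (ours) of the sentence component vector: assign each token the id of its segment, following the 1/2/3 convention above, and embed the ids just like tokens.

```python
import torch
import torch.nn as nn

# Token counts per segment (toy values): abstract content, dependency core,
# text to be matched.
n_summary, n_core, n_text = 5, 4, 7
segment_ids = torch.tensor([1] * n_summary + [2] * n_core + [3] * n_text)

segment_embedding = nn.Embedding(num_embeddings=4, embedding_dim=768)  # ids 0-3
sentence_component_vectors = segment_embedding(segment_ids)  # shape (16, 768)
```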
Step S123: and vectorizing the abstract content, the dependency syntax core component and each word position in the text to be matched to obtain a position embedded vector corresponding to the abstract content book, a position embedded vector corresponding to the dependency syntax core component book and a position embedded vector corresponding to the text to be matched.
Position Embedding, similar to the token embedding above, differs in that the position embedding vector vectorizes the position of the token rather than the token itself; a position embedding vector is the vector obtained by vectorizing a token's position.
The embodiment of step S123 described above is, for example: vectorizing each word position in the abstract content, the dependency-syntax core components, and the text to be matched with a model such as GloVe, word2vec, or FastText, obtaining the position embedding vectors corresponding to the abstract content, the dependency-syntax core components, and the text to be matched.
It can be understood that the position embedding vector has two forms. In the first form, positions are counted over the abstract content, the dependency-syntax core components, and the text to be matched combined, so each word embedding (Word Embedding) has a single unique position embedding vector representing its position across all three. For example, suppose the abstract content is "[CLS] I like cat [SEP]", the dependency-syntax core component is "[CLS] I like cat [SEP]", and the text to be matched is "[CLS] I just like cat [SEP]": "I" appears in both the abstract content and the text to be matched, but at different positions, e.g., position index 1 in the abstract content and position index 8 in the text to be matched; counting them together yields position embedding vectors that effectively represent the positions of the same word in different sentence components (abstract content, dependency-syntax core components, or text to be matched). In the second form, positions are counted separately for the abstract content, the dependency-syntax core components, and the text to be matched, so the same position index or position embedding vector may occur in more than one of them. For example, suppose the abstract content is "[CLS] I like cat [SEP]", the dependency-syntax core component is "[CLS] like cat [SEP]", and the text to be matched is "[CLS] I just like cat [SEP]": the relative index position of "I" is 1 in both the abstract content and the text to be matched, and the relative index position of "[CLS]" is 0 in all three.
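The two numbering schemes, as a sketch (ours):

```python
def global_positions(segments):
    """First form: one running position index across all segments."""
    out, pos = [], 0
    for seg in segments:
        out.append(list(range(pos, pos + len(seg))))
        pos += len(seg)
    return out

def per_segment_positions(segments):
    """Second form: positions restart at 0 inside each segment."""
    return [list(range(len(seg))) for seg in segments]

segments = [["[CLS]", "I", "like", "cat", "[SEP]"],              # abstract content
            ["[CLS]", "like", "cat", "[SEP]"],                   # dependency core
            ["[CLS]", "I", "just", "like", "cat", "[SEP]"]]      # text to be matched
print(global_positions(segments))       # "I" gets 1 and 10: indices never repeat
print(per_segment_positions(segments))  # "I" gets 1 in both segments it appears in
```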
Step S124: and vectorizing the abstract content, the dependency syntax core components and the reverse order position of each word in the text to be matched to obtain a reverse order position embedded vector.
Reverse-Order Position Embedding is similar to the position embedding above, except that position embedding vectorizes a token's forward-order position while reverse-order position embedding vectorizes its reverse-order position; the reverse-order position embedding vector is the vector so obtained. Likewise, the reverse-order position embedding vector has the same two forms: counting positions over the abstract content, the dependency-syntax core components, and the text to be matched combined, or counting positions for each of them separately; see the description of the two forms of the position embedding vector above.
The embodiment of step S124 described above is, for example: vectorizing the reverse-order position of each word in the abstract content, the dependency-syntax core components, and the text to be matched with a model such as GloVe, word2vec, or FastText, obtaining the reverse-order position embedding vectors corresponding to the abstract content, the dependency-syntax core components, and the text to be matched.
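A sketch (ours) of reverse-order positions: index tokens from the end of the sequence, then embed those indices exactly as forward positions are embedded.

```python
def reverse_positions(tokens):
    """Reverse-order position index for each token: last token gets 0."""
    n = len(tokens)
    return [n - 1 - i for i in range(n)]

tokens = ["[CLS]", "I", "like", "cat", "[SEP]"]
print(reverse_positions(tokens))  # [4, 3, 2, 1, 0]
```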
After step S120, step S130 is performed: and respectively carrying out fusion processing on the sentence component vector, the token embedding vector, the position embedding vector and/or the reverse order position embedding vector to obtain an input expression vector.
The embodiments of step S130 include the following. In the first embodiment, the embedding vectors are summed (sum), for example: assuming the vectors are all 2-dimensional, with the sentence component vector [0.11, 0.07], token embedding vector [0.02, 0.13], position embedding vector [0.04, 0.03], and reverse-order position embedding vector [0.07, 0.04], the input representation vector obtained by summation is [0.24, 0.27]. In the second embodiment, the embedding vectors are concatenated (concat), for example: with the same vectors as above, the input representation vector obtained by concatenation is [0.11, 0.07, 0.02, 0.13, 0.04, 0.03, 0.07, 0.04].
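Both fusion options, reproduced with the same numbers as above (sketch, ours):

```python
import torch

sentence_component = torch.tensor([0.11, 0.07])
token_embedding    = torch.tensor([0.02, 0.13])
position           = torch.tensor([0.04, 0.03])
reverse_position   = torch.tensor([0.07, 0.04])

# First embodiment: element-wise sum keeps the original dimension.
summed = sentence_component + token_embedding + position + reverse_position
print(summed)        # tensor([0.2400, 0.2700])

# Second embodiment: concatenation multiplies the dimension by the vector count.
concatenated = torch.cat([sentence_component, token_embedding, position, reverse_position])
print(concatenated)  # tensor([0.1100, 0.0700, 0.0200, 0.1300, 0.0400, 0.0300, 0.0700, 0.0400])
```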
In this implementation, the sentence component vector, token embedding vector, position embedding vector, and reverse-order position embedding vector are fused into the input representation vector, so the input representation vector represents the text content better at the semantic level, effectively improving the text matching effect in terms of semantic representation.
After step S130, step S140 is performed: and matching and sequencing the plurality of search texts according to the input expression vector by using a text matching model to obtain a plurality of sequenced search texts, wherein the text matching model is obtained by multi-task combined training.
Please refer to fig. 2, a schematic structural diagram of the text matching model provided in the embodiment of the present application. The Text Matching Model is a neural network model for text matching and comprises a feature extraction model and a deep network model; the deep network model includes a plurality of Transformer blocks, shown in the figure as 12, each of which comprises a Self-Attention layer, a first regularization layer, a fully connected layer, and a second regularization layer connected in sequence. Because the deep network model contains multiple self-attention layers, the text matching model can better attend to the core components of PairWise data (i.e., positive and negative sample texts) and ListWise data (i.e., a query content sample and a plurality of sample texts); the matching pair (Pair) of positive and negative sample texts is expanded into a matching list (List) of the query content sample and a plurality of sample texts, which greatly strengthens the generalization ability of the text matching model.
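A sketch (ours) of one such Transformer block in PyTorch; the residual connections and the dimensions are our assumptions, since the text only names the four layers.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, dim=768, heads=12, ff_dim=3072):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ff = nn.Sequential(nn.Linear(dim, ff_dim), nn.GELU(), nn.Linear(ff_dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention layer
        x = self.norm1(x + attn_out)       # first regularization layer
        x = self.norm2(x + self.ff(x))     # fully connected + second regularization
        return x

blocks = nn.Sequential(*[TransformerBlock() for _ in range(12)])  # 12 blocks, as in fig. 2
hidden = blocks(torch.randn(1, 16, 768))   # (batch, tokens, dim)
```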
The implementation of step S140 may include:
step S141: feature vectors representing vectors of the input representation are extracted using a feature extraction model.
The embodiment of step S141 described above is, for example: feature vectors of the input representation vector are extracted using the RoBERTa model or a Bidirectional Encoder Representations from Transformers (BERT) model as the feature extraction model.
Step S142: and matching and sequencing text vectors corresponding to the plurality of search texts according to the feature vectors by using a deep network model to obtain a plurality of sequenced search texts.
The embodiment of step S142 described above is, for example: the 12 Transformer blocks are used in sequence to match and rank the text vectors corresponding to the plurality of search texts, obtaining a plurality of ranked search texts; the text matching model is obtained through multi-task joint training, and each block comprises a self-attention layer, a first regularization layer, a fully connected layer, and a second regularization layer connected in sequence.
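A sketch (ours) of the final matching-and-ranking step; scoring by cosine similarity is our assumption, as the text does not fix the similarity function.

```python
import torch
import torch.nn.functional as F

def rank_search_texts(query_vec, text_vecs, texts):
    """Score each search-text vector against the query feature vector, sort descending."""
    scores = F.cosine_similarity(query_vec.unsqueeze(0), text_vecs, dim=1)
    order = torch.argsort(scores, descending=True)
    return [(texts[i], float(scores[i])) for i in order]

query_vec = torch.randn(768)       # feature vector of the input representation
text_vecs = torch.randn(3, 768)    # text vectors of the search texts
print(rank_search_texts(query_vec, text_vecs, ["doc A", "doc B", "doc C"]))
```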
In this implementation, the abstract content and the dependency-syntax core components of the text to be matched are first extracted; the abstract content, the dependency-syntax core components, and the text to be matched are then segmented, vectorized, and fused into an input representation vector that incorporates the sentence component vectors; and finally a text matching model matches and ranks a plurality of search texts according to the input representation vector. That is, the input representation vector represents the text content better at the level of the semantics of sentence components, and because the abstract content and the dependency-syntax core components are extracted, the text matching model can more easily distinguish them and rank matches according to the content that best represents the semantics, effectively improving the text matching effect at the level of semantic representation.
Please refer to fig. 3, which is a schematic flowchart illustrating a process of training a text matching model according to an embodiment of the present application; it is understood that before the text matching model of the above step S140 is used, the text matching model also needs to be trained, and the training of the text matching model may include:
step S210: a text data set, a summary data set, and a dependency data set are obtained.
The implementation of step S210 may include:
step S211: the method comprises the steps of obtaining a pre-trained abstract extraction network model, a dependency analysis tool and a text data set, wherein the text data set comprises a plurality of sample texts and part-of-speech tag values corresponding to the sample texts.
The abstract extraction network model, the dependency analysis tool, and the text data set in step S211 are obtained, for example: the first acquisition mode is that the abstract extraction network model, the dependency analysis tool and/or the text data set sent by other terminal equipment are received, and the abstract extraction network model, the dependency analysis tool and/or the text data set are stored in a file system, a database or mobile storage equipment; the second obtaining method obtains a pre-stored abstract extraction network model, a dependency analysis tool and/or a text data set, specifically for example: acquiring the data set from a file system, a database and/or a mobile storage device; in the third acquisition mode, software such as a browser or other application programs are used for accessing the Internet to acquire the abstract extraction network model, the dependency analysis tool and/or the text data set.
Step S212: the method comprises the steps of using a pre-trained abstract extraction network model or an abstract generation network model to extract an abstract of a sample text to obtain an abstract text of the sample text, conducting part-of-speech prediction on the abstract text of the sample text to obtain a part-of-speech tag value of the abstract text, and then adding the abstract text of the sample text and the part-of-speech tag value of the abstract text into an abstract data set.
The implementation principle and implementation manner of step S212 are similar to those of step S112, and therefore, the implementation principle and implementation manner will not be described here, and if it is not clear, reference may be made to the description of step S112.
Step S213: a dependency analysis tool is used to extract the dependency syntax core components of the sample text and add the dependency syntax core components of the sample text as dependency tag values to the dependency data set.
The implementation principle and implementation manner of step S213 are similar to those of step S113, and therefore, the implementation principle and implementation manner will not be described here, and if it is not clear, reference may be made to the description of step S113.
After step S210, step S220 is performed: and performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and training the deep network model by using the text data set to obtain a text matching model.
Wherein the text data set includes: querying a content sample, a positive sample text and a plurality of negative sample texts; the query content sample refers to a text content sample that needs to be matched, and specifically includes: and automatically asking and answering questions in the task, or text to be searched submitted by a user in the search engine task, and the like. The positive sample text refers to a matching content text with similarity, correlation or satisfaction exceeding a preset threshold, and specifically includes: the most appropriate answer content text of the question in the automatic question and answer task, or the matching of the content text which the user most wants to search in the search engine task, etc. Similarly, the negative sample text refers to a matching content text of which the similarity, the correlation or the satisfaction does not exceed a preset threshold; specific examples thereof include: the presence of user-unwanted content text in search engine tasks, etc.
The implementation of the multi-task joint training of the feature extraction model with the abstract data set and the dependency data set in step S220 is, for example: performing multi-task Joint Training of the feature extraction model according to the abstract data set and the dependency data set under a multi-task learning framework. The multiple tasks include, but are not limited to, per-token part-of-speech prediction and dependency analysis, and usable multi-task learning frameworks include, but are not limited to, the Multi-gate Mixture-of-Experts (MMoE) framework.
In a specific practice process, a loss function can be used for each task in the multi-task joint training to calculate a loss value of each task; specific examples thereof include: calculating a first loss value between the part-of-speech predicted value and the part-of-speech tag value corresponding to each Token (Token) by using a multi-classification cross entropy loss function, wherein the first loss value can be represented by L1; and/or, calculating a second loss value between the dependency prediction value and the dependency tag value in the dependency analysis task using a multi-class cross entropy loss function, which may be represented using L2.
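A sketch (ours) of the two task losses L1 and L2 named above; combining them by a plain sum is our assumption, since the weighting is not specified here.

```python
import torch
import torch.nn.functional as F

num_tokens, num_pos_tags, num_dep_rels = 16, 30, 15
pos_logits = torch.randn(num_tokens, num_pos_tags)   # per-token part-of-speech head
dep_logits = torch.randn(num_tokens, num_dep_rels)   # per-token dependency head
pos_labels = torch.randint(0, num_pos_tags, (num_tokens,))
dep_labels = torch.randint(0, num_dep_rels, (num_tokens,))

L1 = F.cross_entropy(pos_logits, pos_labels)   # part-of-speech prediction task loss
L2 = F.cross_entropy(dep_logits, dep_labels)   # dependency analysis task loss
joint_loss = L1 + L2                           # weighting scheme is an assumption
```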
The above-mentioned embodiment of training the deep web model by using the text data set in step S220 may include:
step S221: and predicting the predicted matching text corresponding to the query content sample by using a deep network model in the text matching model.
The embodiment of step S221 described above is, for example: performing word segmentation and vectorization on abstract contents in the abstract data set, dependency syntax core components in the dependency data set and sample texts in the text data set to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse-order position embedded vectors; respectively fusing the sentence component vector, the token embedding vector, the position embedding vector and/or the reverse order position embedding vector to obtain an input expression vector; and then, extracting a feature vector of the input expression vector by using a feature extraction model in a text matching model, finally, performing matching sorting on text vectors corresponding to a plurality of search texts by using a depth network model according to the feature vector to obtain a plurality of sorted search texts, and determining the plurality of sorted search texts as prediction matching texts.
It is understood that the way the sorted retrieval texts are determined to be predicted matching texts differs according to which loss value is being calculated. Specifically: when calculating the PairWise loss value, the predicted matching text comprises one positive sample text and one negative sample text, so the retrieval text ranked most similar (i.e., the best match) among the sorted retrieval texts is taken as the positive sample text, and the retrieval text ranked least similar (i.e., the worst match) is taken as the negative sample text. When calculating the ListWise loss value, the predicted matching text comprises one positive sample text and a plurality of negative sample texts, so the most similar retrieval text among the sorted retrieval texts is taken as the positive sample text, and all retrieval texts other than the positive sample text are taken as the plurality of negative sample texts.
Step S222: and calculating PairWise loss values between the predicted matching text and the positive sample text and the negative sample text.
The embodiment of step S222 described above includes, for example: predicting the predicted matching text corresponding to the query content sample using the deep network model in the text matching model, and calculating the PairWise loss value between the predicted matching text and the positive and negative sample texts using a fourth loss function, which can be written as

L_PairWise = max(0, m − h_θ(q_i, c_i⁺) + h_θ(q_i, c_i⁻))

where L_PairWise denotes the PairWise loss value between the predicted matching text and the positive and negative sample texts; m denotes a preset margin (boundary) threshold; q_i denotes the i-th predicted matching text; c_i⁺ denotes the i-th positive sample text; c_i⁻ denotes the i-th negative sample text; h_θ(q_i, c_i⁺) denotes the absolute relevance value calculated between the i-th predicted matching text and the i-th positive sample text; and h_θ(q_i, c_i⁻) denotes the absolute relevance value calculated between the i-th predicted matching text and the i-th negative sample text.
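A compact sketch of this margin-based PairWise loss, assuming the relevance scores h_θ have already been computed for a batch (the margin value is an illustrative placeholder):

```python
import torch

def pairwise_loss(pos_scores, neg_scores, margin=0.1):
    """L = max(0, m - h(q, c+) + h(q, c-)), averaged over the batch."""
    return torch.clamp(margin - pos_scores + neg_scores, min=0).mean()
```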
Step S223: a ListWise loss value is computed between the query content sample, the positive sample text, and the plurality of negative sample texts.
The embodiment of step S223 described above includes, for example: first, according to the formula Score_j = h_θ(q_i, c_ij), calculating an absolute relevance score between the query content sample and each sample text in the plurality of sample texts to obtain a plurality of relevance scores, where Score_j denotes the j-th relevance score in the text data set, q_i denotes the i-th predicted matching text, c_ij denotes the j-th sample text in the i-th set of the plurality of sample texts, and h_θ(q_i, c_ij) denotes the absolute relevance value calculated between the i-th predicted matching text and that j-th sample text; the plurality of sample texts comprises: one positive sample text and a plurality of negative sample texts. Then, the plurality of relevance scores is normalized using the formula S = softmax([Score_1, Score_2, …, Score_m]) to obtain a normalized relevance score set, where S denotes the normalized relevance score set and Score_1, Score_2, …, Score_m denote the plurality of relevance scores. Finally, the relevance labels corresponding to the query content sample are normalized using the formula

Y = y′ / Σ y′

to obtain a normalized relevance label, where Y denotes the normalized relevance label, y′ denotes the relevance label corresponding to the query content sample, and Σ y′ denotes the sum of all relevance labels. The KL divergence between the normalized relevance score set and the normalized relevance label is then calculated, and its value is taken as the ListWise loss value, which can be denoted L_ListWise.
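A sketch of the ListWise computation under the same assumptions (scores from h_θ, non-negative relevance labels):

```python
import torch
import torch.nn.functional as F

def listwise_loss(scores, labels):
    """KL divergence between S = softmax(scores) and Y = labels / sum(labels).
    scores, labels: (batch, n_candidates)."""
    log_s = F.log_softmax(scores, dim=-1)            # log of normalized scores
    y = labels / labels.sum(dim=-1, keepdim=True)    # Y = y' / sum(y')
    return F.kl_div(log_s, y, reduction="batchmean") # KL(Y || S)
```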
Step S224: and training a deep network model in the text matching model according to the PairWise loss value and the ListWise loss value.
The embodiment of step S224 described above is, for example: constructing a third loss function of the deep network model from the PairWise loss value and the ListWise loss value, and training the deep network model in the text matching model with the third loss function to obtain a trained deep network model. The calculation using the third loss function can be expressed as L3 = m × L_PairWise + n × L_ListWise, where L3 denotes the third loss value; m and n are two different hyperparameters (this weight m is distinct from the margin threshold in the PairWise loss); L_PairWise denotes the PairWise loss value; and L_ListWise denotes the ListWise loss value.
In a specific implementation process, the feature extraction model and the deep network model are usually trained alternately and jointly using a multi-task learning framework, so the loss function of the feature extraction model and the loss function of the deep network model may be merged into a total loss function, which may be expressed as Loss = a × L1 + b × L2 + c × L3, where L1 denotes the first loss value, L2 denotes the second loss value, L3 denotes the third loss value, and a, b, and c are different hyperparameters.
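Putting the pieces together, a hedged sketch of the merged objective (the weight values here are placeholders, not the patent's):

```python
def total_loss(l1, l2, l3, a=1.0, b=1.0, c=1.0):
    """Loss = a*L1 + b*L2 + c*L3, merging the feature-extraction task
    losses with the deep network's ranking loss."""
    return a * l1 + b * l2 + c * l3

# l3 would itself be formed as l3 = m * pairwise_loss(...) + n * listwise_loss(...),
# with the weights m and n chosen as hyperparameters.
```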
In the training process of the text matching model in this Internet of Vehicles text matching method, an input representation vector that captures semantic expression is fed into a pre-trained language model such as RoBERTa or BERT, and the text matching model is jointly trained on the text matching task, the part-of-speech prediction task, and the dependency relation task. The model thereby gains stronger semantic representation ability, so performing text matching or text retrieval with the jointly trained text matching model effectively improves the matching or retrieval effect at the level of semantic expression.
Please refer to fig. 4, which is a schematic structural diagram of an Internet of Vehicles text matching apparatus provided by an embodiment of the present application. The embodiment of the present application provides an Internet of Vehicles text matching apparatus 300, comprising:
The text content extraction module 310 is configured to acquire a text to be matched, and to extract the abstract content of the text to be matched and the dependency syntax core components of the text to be matched.
The vector matrix obtaining module 320 is configured to perform word segmentation and vectorization on the abstract content, the dependency syntax core components, and the text to be matched to obtain an embedded vector matrix, where the embedded vector matrix includes sentence component vectors, token embedding vectors, position embedding vectors, and/or reverse-order position embedding vectors.
The representation vector obtaining module 330 is configured to fuse the sentence component vector, the token embedding vector, the position embedding vector, and/or the reverse-order position embedding vector to obtain an input representation vector.
The text matching and sorting module 340 is configured to match and sort a plurality of search texts according to the input representation vector using a text matching model, obtaining a plurality of sorted search texts, where the text matching model is obtained through multi-task joint training.
Optionally, in an embodiment of the present application, the text content extracting module includes:
and the abstract content extraction module is used for extracting the abstract of the text to be matched by using the pre-trained generative pre-training model as an abstract extraction model to obtain the abstract content of the text to be matched.
And the dependency relationship analysis module is used for extracting the major-minor relationship component, the moving-guest relationship component, the inter-guest relationship component, the structure-in-shape component and/or the core relationship component in the text to be matched by using a dependency analysis tool, and determining the major-minor relationship component, the moving-guest relationship component, the inter-guest relationship component, the structure-in-shape component and/or the core relationship component as the dependency syntax core component of the text to be matched.
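The embodiment does not name a specific dependency analysis tool. As one hedged illustration, a spaCy pipeline can surface comparable relations; the model name and the mapping from spaCy's labels to the subject-predicate/verb-object/indirect-object/adverbial/core components are assumptions (Chinese toolkits such as LTP use labels like SBV, VOB, and HED instead):

```python
import spacy

nlp = spacy.load("zh_core_web_sm")  # hypothetical model choice

# Assumed mapping to the core relations named above.
CORE_DEPS = {"nsubj", "dobj", "iobj", "advmod", "ROOT"}

def dependency_core_components(text):
    """Return (token, relation, head) triples for the core relations."""
    doc = nlp(text)
    return [(tok.text, tok.dep_, tok.head.text)
            for tok in doc if tok.dep_ in CORE_DEPS]
```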
Optionally, in an embodiment of the present application, the text matching model includes: a feature extraction model and a deep network model; the text matching and sorting module includes:
and the characteristic vector extraction module is used for extracting the characteristic vector of the input representation vector by using the characteristic extraction model.
And the vector matching and sorting module is used for performing matching and sorting on the text vectors corresponding to the plurality of search texts according to the feature vectors by using the deep network model.
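As a minimal sketch of this matching-and-sorting step, with cosine similarity standing in for the learned scorer h_θ (an assumption, not the patent's scoring function):

```python
import torch
import torch.nn.functional as F

def match_and_sort(query_vec, candidate_vecs, texts):
    """Score each search-text vector against the query feature vector
    and return the texts sorted by descending relevance."""
    scores = F.cosine_similarity(query_vec.unsqueeze(0), candidate_vecs, dim=-1)
    order = torch.argsort(scores, descending=True)
    return [texts[i] for i in order.tolist()]
```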
Optionally, in this embodiment of the application, the internet of vehicles text matching apparatus further includes:
and the training data acquisition module is used for acquiring the text data set, the abstract data set and the dependency data set.
And the matching model obtaining module is used for performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and training the deep network model by using the text data set to obtain a text matching model.
Optionally, in an embodiment of the present application, the text data set includes: a query content sample, a positive sample text, and a plurality of negative sample texts; the matching model obtaining module includes:
and the matched text prediction module is used for predicting the predicted matched text corresponding to the query content sample by using the depth network model in the text matching model.
And the first loss calculation module is used for calculating PairWise loss values between the prediction matching text and the positive sample text and between the prediction matching text and the negative sample text.
And the second loss calculation module is used for calculating a Listwise loss value among the query content sample, the positive sample text and the negative sample texts.
And the network model training module is used for training the deep network model in the text matching model according to the PairWise loss value and the ListWise loss value.
It should be understood that the apparatus corresponds to the above embodiment of the Internet of Vehicles text matching method and can perform the steps involved in that method embodiment; for its specific functions, refer to the description above, and detailed description is appropriately omitted here to avoid redundancy. The apparatus comprises at least one software functional module that can be stored in memory in the form of software or firmware, or solidified in the operating system (OS) of the device.
Please refer to fig. 5, which illustrates a schematic structural diagram of an electronic device according to an embodiment of the present application. An electronic device 400 provided in an embodiment of the present application includes: a processor 410 and a memory 420, the memory 420 storing machine-readable instructions executable by the processor 410, the machine-readable instructions when executed by the processor 410 performing the method as above.
Embodiments of the present application also provide a computer-readable storage medium 430, where the computer-readable storage medium 430 stores a computer program, and the computer program is executed by the processor 410 to perform the above method.
The computer-readable storage medium 430 may be implemented by any type of volatile or nonvolatile storage device or combination thereof, such as a Static Random Access Memory (SRAM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), an Erasable Programmable Read-Only Memory (EPROM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), a magnetic Memory, a flash Memory, a magnetic disk, or an optical disk.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The apparatus embodiments described above are merely illustrative, and for example, the flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
In addition, functional modules of the embodiments in the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an alternative embodiment of the embodiments of the present application, but the scope of the embodiments of the present application is not limited thereto, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the embodiments of the present application, and all the changes or substitutions should be covered by the scope of the embodiments of the present application.

Claims (10)

1. A method for matching texts in the Internet of vehicles is characterized by comprising the following steps:
acquiring a text to be matched, and extracting abstract content of the text to be matched and a dependency syntax core component of the text to be matched;
performing word segmentation and vectorization on the abstract content, the dependency syntax core components and the text to be matched to obtain an embedded vector matrix, wherein the embedded vector matrix comprises sentence component vectors, token embedded vectors, position embedded vectors and/or reverse-order position embedded vectors;
performing fusion processing on the sentence component vector, the token embedding vector, the position embedding vector and/or the reverse order position embedding vector to obtain an input expression vector;
and matching and sorting a plurality of search texts according to the input representation vector by using a text matching model to obtain a plurality of sorted search texts, wherein the text matching model is obtained through multi-task joint training.
2. The method according to claim 1, wherein the extracting the abstract content of the text to be matched and the dependency syntax core component of the text to be matched comprises:
using a pre-trained generative pre-training model as a summary extraction model to extract a summary of the text to be matched, and obtaining the summary content of the text to be matched;
extracting a subject-predicate relation component, a verb-object relation component, an indirect-object relation component, an adverbial-head structure component and/or a core relation component in the text to be matched by using a dependency analysis tool, and determining the subject-predicate relation component, the verb-object relation component, the indirect-object relation component, the adverbial-head structure component and/or the core relation component as the dependency syntax core component of the text to be matched.
3. The method according to claim 2, wherein before the extracting a summary of the text to be matched by using the pre-trained generative pre-training model as a summary extraction model, the method further comprises:
acquiring a text data set and an abstract data set, wherein abstract texts in the abstract data set are obtained by abstracting sample texts in the text data set;
and training the generative pre-training network by using the text data set and the abstract data set to obtain the generative pre-training model.
4. The method of claim 1, wherein the text matching model comprises: a feature extraction model and a deep network model; the using a text matching model to match and sort the plurality of search texts according to the input representation vector comprises:
extracting a feature vector of the input representation vector using the feature extraction model;
and performing matching sorting on text vectors corresponding to a plurality of retrieval texts according to the feature vectors by using the deep network model.
5. The method of claim 4, further comprising, prior to said match sorting a plurality of search texts according to the input representation vector using a text matching model:
acquiring a text data set, an abstract data set and a dependency data set;
and performing multi-task joint training on the feature extraction model by using the abstract data set and the dependency data set, and training the deep network model by using the text data set to obtain the text matching model.
6. The method of claim 5, wherein the text data set comprises: a query content sample, a positive sample text and a plurality of negative sample texts; the training of the deep network model using the text data set includes:
predicting a prediction matching text corresponding to the query content sample by using a deep network model in the text matching model;
calculating PairWise loss values between the predicted matching text and the positive sample text and the negative sample text;
calculating a Listwise loss value between the query content sample, the positive sample text, and the plurality of negative sample texts;
and training a deep network model in the text matching model according to the PairWise loss value and the ListWise loss value.
7. The method of any one of claims 4-6, wherein the feature extraction model employs a RoBERTa model.
8. A car networking text matching device, characterized by comprising:
the text content extraction module is used for acquiring a text to be matched and extracting abstract content of the text to be matched and a dependency syntax core component of the text to be matched;
a vector matrix obtaining module, configured to perform word segmentation and vectorization on the abstract content, the dependency syntax core component, and the text to be matched to obtain an embedded vector matrix, where the embedded vector matrix includes a sentence component vector, a token embedded vector, a position embedded vector, and/or a reverse-order position embedded vector;
a representation vector obtaining module, configured to perform fusion processing on the sentence component vector, the token embedding vector, the position embedding vector, and/or the reverse order position embedding vector to obtain an input representation vector;
and the text matching and sorting module is used for matching and sorting a plurality of search texts according to the input representation vector by using a text matching model to obtain a plurality of sorted search texts, wherein the text matching model is obtained through multi-task joint training.
9. An electronic device, comprising: a processor and a memory, the memory storing machine-readable instructions executable by the processor, the machine-readable instructions, when executed by the processor, performing the method of any of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, performs the method of any one of claims 1 to 7.
CN202110622070.0A 2021-06-03 2021-06-03 Internet of vehicles text matching method and device, electronic equipment and storage medium Active CN113282711B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622070.0A CN113282711B (en) 2021-06-03 2021-06-03 Internet of vehicles text matching method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110622070.0A CN113282711B (en) 2021-06-03 2021-06-03 Internet of vehicles text matching method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113282711A true CN113282711A (en) 2021-08-20
CN113282711B CN113282711B (en) 2023-09-22

Family

ID=77283433

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622070.0A Active CN113282711B (en) 2021-06-03 2021-06-03 Internet of vehicles text matching method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113282711B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434642A (en) * 2021-08-27 2021-09-24 广州云趣信息科技有限公司 Text abstract generation method and device and electronic equipment
CN113918702A (en) * 2021-10-25 2022-01-11 北京航空航天大学 Semantic matching-based online legal automatic question-answering method and system
CN116610776A (en) * 2022-12-30 2023-08-18 摩斯智联科技有限公司 Intelligent question-answering system of Internet of vehicles
CN116628129A (en) * 2023-07-21 2023-08-22 南京爱福路汽车科技有限公司 Auto part searching method and system

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844413A (en) * 2016-11-11 2017-06-13 南京缘长信息科技有限公司 The method and device of entity relation extraction
CN108319627A (en) * 2017-02-06 2018-07-24 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
CN108681574A (en) * 2018-05-07 2018-10-19 中国科学院合肥物质科学研究院 A kind of non-true class quiz answers selection method and system based on text snippet
US20190325864A1 (en) * 2018-04-16 2019-10-24 Google Llc Automated assistants that accommodate multiple age groups and/or vocabulary levels
CN110851604A (en) * 2019-11-12 2020-02-28 中科鼎富(北京)科技发展有限公司 Text classification method and device, electronic equipment and storage medium
CN111753496A (en) * 2020-06-22 2020-10-09 平安付科技服务有限公司 Industry category identification method and device, computer equipment and readable storage medium
CN111966832A (en) * 2020-08-21 2020-11-20 网易(杭州)网络有限公司 Evaluation object extraction method and device and electronic equipment
WO2021051518A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Text data classification method and apparatus based on neural network model, and storage medium
CN112562669A (en) * 2020-12-01 2021-03-26 浙江方正印务有限公司 Intelligent digital newspaper automatic summarization and voice interaction news chat method and system

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106844413A (en) * 2016-11-11 2017-06-13 南京缘长信息科技有限公司 The method and device of entity relation extraction
CN108319627A (en) * 2017-02-06 2018-07-24 腾讯科技(深圳)有限公司 Keyword extracting method and keyword extracting device
US20190325864A1 (en) * 2018-04-16 2019-10-24 Google Llc Automated assistants that accommodate multiple age groups and/or vocabulary levels
CN108681574A (en) * 2018-05-07 2018-10-19 中国科学院合肥物质科学研究院 A kind of non-true class quiz answers selection method and system based on text snippet
WO2021051518A1 (en) * 2019-09-17 2021-03-25 平安科技(深圳)有限公司 Text data classification method and apparatus based on neural network model, and storage medium
CN110851604A (en) * 2019-11-12 2020-02-28 中科鼎富(北京)科技发展有限公司 Text classification method and device, electronic equipment and storage medium
CN111753496A (en) * 2020-06-22 2020-10-09 平安付科技服务有限公司 Industry category identification method and device, computer equipment and readable storage medium
CN111966832A (en) * 2020-08-21 2020-11-20 网易(杭州)网络有限公司 Evaluation object extraction method and device and electronic equipment
CN112562669A (en) * 2020-12-01 2021-03-26 浙江方正印务有限公司 Intelligent digital newspaper automatic summarization and voice interaction news chat method and system

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
MAIA LYER et al.: "Incorporating syntactic dependencies into semantic word vector model for medical text processing", 2018 IEEE International Conference on Bioinformatics and Biomedicine, pages 659-654 *
王序文; 李姣; 吴英杰; 李军莲: "Open concept relation extraction for Chinese biomedical text based on BiLSTM-CRF", Chinese Journal of Medical Library and Information Science, vol. 27, no. 11, pages 33-39 *
顾迎捷; 桂小林; 李德福; 沈毅; 廖东: "Survey on machine reading comprehension based on neural networks", Journal of Software, vol. 31, no. 07, pages 2095-2126 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434642A (en) * 2021-08-27 2021-09-24 广州云趣信息科技有限公司 Text abstract generation method and device and electronic equipment
CN113918702A (en) * 2021-10-25 2022-01-11 北京航空航天大学 Semantic matching-based online legal automatic question-answering method and system
CN116610776A (en) * 2022-12-30 2023-08-18 摩斯智联科技有限公司 Intelligent question-answering system of Internet of vehicles
CN116628129A (en) * 2023-07-21 2023-08-22 南京爱福路汽车科技有限公司 Auto part searching method and system
CN116628129B (en) * 2023-07-21 2024-02-27 南京爱福路汽车科技有限公司 Auto part searching method and system

Also Published As

Publication number Publication date
CN113282711B (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN109885672B (en) Question-answering type intelligent retrieval system and method for online education
CN111639171B (en) Knowledge graph question-answering method and device
CN113282711B (en) Internet of vehicles text matching method and device, electronic equipment and storage medium
CN112800170A (en) Question matching method and device and question reply method and device
CN112214593A (en) Question and answer processing method and device, electronic equipment and storage medium
CN110334186B (en) Data query method and device, computer equipment and computer readable storage medium
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN112464656A (en) Keyword extraction method and device, electronic equipment and storage medium
CN111078837A (en) Intelligent question and answer information processing method, electronic equipment and computer readable storage medium
CN112463944B (en) Search type intelligent question-answering method and device based on multi-model fusion
CN113191148A (en) Rail transit entity identification method based on semi-supervised learning and clustering
CN112328800A (en) System and method for automatically generating programming specification question answers
CN116992007B (en) Limiting question-answering system based on question intention understanding
CN114818717A (en) Chinese named entity recognition method and system fusing vocabulary and syntax information
CN114428850A (en) Text retrieval matching method and system
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114330483A (en) Data processing method, model training method, device, equipment and storage medium
CN113934835A (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN114372454A (en) Text information extraction method, model training method, device and storage medium
Arbaaeen et al. Natural language processing based question answering techniques: A survey
CN113741759B (en) Comment information display method and device, computer equipment and storage medium
CN116976341A (en) Entity identification method, entity identification device, electronic equipment, storage medium and program product
CN114626463A (en) Language model training method, text matching method and related device
CN112487154B (en) Intelligent search method based on natural language
CN113569124A (en) Medical title matching method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant