CN112925912B - Text processing method, synonymous text recall method and apparatus


Info

Publication number
CN112925912B
Authority
CN
China
Prior art keywords
text
synonymous
synonym
initial
search
Prior art date
Legal status
Active
Application number
CN202110220258.2A
Other languages
Chinese (zh)
Other versions
CN112925912A (en)
Inventor
冯朝兵
连义江
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110220258.2A priority Critical patent/CN112925912B/en
Publication of CN112925912A publication Critical patent/CN112925912A/en
Application granted granted Critical
Publication of CN112925912B publication Critical patent/CN112925912B/en


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G06F16/31: Indexing; Data structures therefor; Storage structures
    • G06F16/316: Indexing structures
    • G06F16/36: Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374: Thesaurus
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/237: Lexical tools
    • G06F40/247: Thesauruses; Synonyms

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a text processing method for a text database, relating to computer technologies such as search, natural language processing, and deep learning. The text processing method comprises: acquiring feature vectors of all texts in the text database; clustering all texts according to their feature vectors to obtain a plurality of synonymous text clusters, each synonymous text cluster comprising a plurality of texts having a synonymous relation; determining, for each synonymous text cluster, one text from the cluster as the representative text corresponding to that cluster; and creating a target query index of the text database according to the feature vectors of all the representative texts. The disclosure also provides a synonymous text recall method, a synonymous text recall apparatus, an electronic device, and a computer readable medium.

Description

Text processing method, synonymous text recall method and apparatus
Technical Field
The present disclosure relates to the field of computer technologies such as search, natural language processing, and deep learning, and in particular to a text processing method for a text database, a synonymous text recall method and apparatus, an electronic device, and a computer readable medium.
Background
In one application scenario in the search field, a search engine can provide advertisers with three key-text matching services to meet different advertisement promotion requirements: exact matching, phrase matching, and broad matching. Exact matching means that the user's search requirement (query) is identical to the key text (also called keywords or auction words) or to one of its synonymous variants; owing to its precise traffic targeting, it has remained an extremely important matching mode in search engines to date.
In the advertisement mechanism of a search engine, the sheer volume of key texts in the trigger system poses a great challenge to recall and matching: the system's recall efficiency is negatively correlated with the number of key texts to be retrieved. When the trigger system imposes preset recall-efficiency and storage constraints (intended to reduce system latency and the cost of storage and computing resources), the limited key-text capacity reduces the keyword coverage of the trigger system, which in turn leads to a decline in revenue.
Currently, to implement search recall in search engines, the full set of key texts in the trigger system is generally retrieved directly with the search requirement (query), thereby implementing synonymous text recall.
Disclosure of Invention
The disclosure provides a text processing method, a synonymous text recall method and device of a text database, electronic equipment, a computer readable medium and a computer program product.
According to a first aspect of the present disclosure, there is provided a text processing method of a text database, including: acquiring feature vectors of all texts in the text database; clustering all texts according to their feature vectors to obtain a plurality of synonymous text clusters, each synonymous text cluster comprising a plurality of texts having a synonymous relation; determining, for each synonymous text cluster, one text from the cluster as the representative text corresponding to that cluster; and creating a target query index of the text database according to the feature vectors of all the representative texts.
According to a second aspect of the present disclosure, the present disclosure provides a synonymous text recall method implemented based on a target query index of a text database, the target query index being created using the text processing method described above, the recall method comprising: acquiring a search request, wherein the search request comprises a search text; acquiring a feature vector corresponding to the search text; inputting the feature vector of the search text into the target query index to query out a representative text matched with the search text; and carrying out search recall by taking the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text.
According to a third aspect of the present disclosure, there is provided a text processing apparatus comprising: a first vector acquisition module configured to acquire feature vectors of all texts in the text database; a text classification module configured to cluster all texts according to their feature vectors to obtain a plurality of synonymous text clusters, each synonymous text cluster comprising a plurality of texts having a synonymous relation; a screening module configured to determine, for each synonymous text cluster, one text from the cluster as the representative text corresponding to that cluster; and a construction module configured to create a target query index of the text database according to the feature vectors of all the representative texts.
According to a fourth aspect of the present disclosure, there is provided a synonymous text recall apparatus, comprising: a request acquisition module configured to acquire a search request, the search request including a search text; a second vector acquisition module configured to acquire the feature vector corresponding to the search text; a query module configured to input the feature vector of the search text into a target query index corresponding to the text database so as to query out a representative text matching the search text, the target query index being created with the text processing method described above; and a text recall module configured to perform search recall by taking the representative text, and all texts in the synonymous text cluster corresponding to the representative text, as synonymous texts of the search text.
According to a fifth aspect of the present disclosure, there is provided an electronic device comprising: at least one processor, and a memory communicatively coupled to the at least one processor; wherein the memory stores one or more computer programs executable by the at least one processor to enable the at least one processor to perform the method provided in any one of the above aspects.
According to a sixth aspect of the present disclosure, there is provided a computer readable medium having stored thereon a computer program, wherein the computer program when executed implements the method as provided in any of the above aspects.
According to a seventh aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method provided in any of the above aspects.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The accompanying drawings are included to provide a further understanding of the disclosure, and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure, without limitation to the disclosure. The above and other features and advantages will become more readily apparent to those skilled in the art by describing in detail exemplary embodiments with reference to the attached drawings, in which:
fig. 1 is a flowchart of a text processing method of a text database according to an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of a metric representation model;
FIG. 3 is a flow chart of one specific implementation of step S2 in FIG. 1;
FIG. 4 is a schematic diagram of a neighbor search technique;
FIG. 5 is a schematic diagram of a synonymous discriminant model;
FIG. 6 is a flowchart of one embodiment of step S24 in FIG. 3;
FIG. 7 is a schematic diagram of a connected subgraph;
FIG. 8 is a flow chart of one embodiment of step S3 of FIG. 1;
FIG. 9 is a flowchart of another text processing method provided by an embodiment of the present disclosure;
FIG. 10 is a flow chart of a synonymous text recall method provided by an embodiment of the present disclosure;
FIG. 11 is a flow chart of another synonymous text recall method provided by an embodiment of the present disclosure;
FIG. 12 is a block diagram of a text processing device according to an embodiment of the present disclosure;
FIG. 13 is a block diagram of a synonymous text recall device according to an embodiment of the present disclosure;
fig. 14 is a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
For a better understanding of the technical solutions of the present disclosure, exemplary embodiments of the present disclosure will be described below with reference to the accompanying drawings, in which various details of the embodiments of the present disclosure are included to facilitate understanding, and they should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Embodiments of the disclosure and features of embodiments may be combined with each other without conflict.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. The terms "connected" or "coupled," and the like, are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and the present disclosure, and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
At present, to realize search recall in a search engine, the full set of key texts in the trigger system is usually retrieved directly with the search requirement (query), thereby realizing synonymous text recall. However, this full-retrieval approach is hugely time-consuming, its recall efficiency is low, and its recall capability is limited; moreover, because the trigger system must construct a query index over the full set of key texts, it consumes a large amount of storage resources.
For this reason, the embodiments of the present disclosure provide a text processing method and a synonymous text recall method, apparatus, electronic device, computer readable medium, and computer program product for a text database, so as to effectively reduce the text space of a query index of the text database, save storage resources, and improve retrieval and recall efficiency and recall capability.
In the search field, "recall" refers to acquiring a matching text or document related to a search text or document input by a user.
Fig. 1 is a flowchart of a text processing method of a text database according to an embodiment of the present disclosure.
Referring to fig. 1, an embodiment of the present disclosure provides a text processing method of a text database, which may be performed by a text processing apparatus, which may be implemented in software and/or hardware, and which may be integrated in an electronic device such as a server, the text processing method including:
s1, obtaining feature vectors of all texts in a text database.
In the embodiment of the disclosure, the text database may be a query database constructed for realizing accurate search in a search engine system, or may be a database of a trigger system in the advertisement mechanism of a search engine, where each text in the trigger system's database is a key text (also referred to as a keyword or an auction word).
S2, cluster all texts according to their feature vectors to obtain a plurality of synonymous text clusters, each synonymous text cluster comprising a plurality of texts having a synonymous relation.
S3, for each synonymous text cluster, determine one text from the cluster as the representative text corresponding to that cluster.
S4, create a target query index of the text database according to the feature vectors of all the representative texts.
In the embodiment of the present disclosure, the query index of the text database may be created in any suitable manner; the embodiment is not limited to a particular way of creating the index. For example, the Hierarchical Navigable Small World (HNSW) algorithm may be employed to create the target query index of the text database based on the feature vectors of all the representative texts; HNSW is a vector indexing algorithm with which a query index over the representative-text feature vectors can be created.
According to the text processing method of a text database provided by the embodiments of the present disclosure, the texts in the text database are clustered according to their synonym relations, and the representative text of each cluster is selected to construct the query index of the text database. On one hand, this effectively reduces the text space of the query index and saves storage resources. On the other hand, when the text database is used for searching, the full set of texts need not be retrieved; only the representative texts of the clusters need to be searched through the target query index, which effectively improves retrieval and recall efficiency, enhances search recall capability, and effectively avoids missed recalls.
In some embodiments, in step S1, a feature vector of each text in the text database is obtained using a preset metric representation model. The metric representation model can be a language model trained by adopting a deep learning algorithm, the input of the metric representation model is text, and the output is a vector representation of the text, namely, a feature vector of the text.
In some embodiments, the metric representation model is a model built on a multi-layer Transformer structure; for example, it is implemented with a BERT (Bidirectional Encoder Representations from Transformers) model, which converts an input text into an output feature vector. In some embodiments, the metric representation model is a BERT pre-training model, i.e. the base BERT model pre-trained on a massive text corpus.
FIG. 2 is a schematic diagram of the metric representation model. As shown in FIG. 2, "Single Sentence" is the input single-sentence text; Tok_i denotes the i-th token of the input text, i = 1, 2, 3, …, N, where N is a positive integer; E denotes an embedding vector, and E_i the embedding vector of the i-th token; the output C is the sentence-level feature vector, and T_i is the feature vector obtained by applying BERT to the i-th token.
It should be noted that, the embodiment of the present disclosure is not particularly limited to a specific implementation of the metric representation model, as long as the feature vector of the text can be obtained.
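The patent gives no code for the metric representation model. Purely as an illustration of step S1, the sketch below stands in for a BERT-style encoder with a toy character-bigram hashing embedding; the function `embed` and its dimension are invented for this example, and a real system would obtain the vector from the trained model:

```python
import hashlib
import math

def embed(text, dim=64):
    """Toy stand-in for the metric representation model: map a text to a
    fixed-size feature vector by hashing its character bigrams. A real
    system would obtain the vector from the trained BERT-style encoder."""
    vec = [0.0] * dim
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        h = int(hashlib.md5(bigram.encode("utf-8")).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]  # L2-normalised feature vector
```

Because the vectors are L2-normalised, a plain dot product between two of them behaves like a cosine similarity, which matches the metric functions discussed later.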
FIG. 3 is a flowchart of one specific implementation of step S2 in FIG. 1. As shown in FIG. 3, in some embodiments, step S2 may further include steps S21-S24.
Step S21, an initial query index of the text database is created according to the feature vectors of all texts in the text database.
In the embodiment of the present disclosure, the query index of the text database may be created by any suitable index creation manner, and the embodiment of the present disclosure is not limited in particular to the creation manner of the index. The initial query index may be created, for example, using the HNSW algorithm.
Step S22, for each text in the text database, query, through the initial query index, the text(s) matching that text, and generate initial synonym relation information, where each piece of initial synonym relation information comprises the text and the text(s) matching it.
It will be appreciated that the initial query index is created based on feature vectors of text, the input of which is a feature vector of text, and the output of which is a feature vector of text that matches the input feature vector.
Specifically, in step S22, in the initial query index, a text matching the text is queried out using a neighbor search technique to generate initial synonym relationship information.
In some embodiments, the neighbor retrieval technique employs the HNSW algorithm. FIG. 4 is a schematic diagram of the neighbor retrieval technique: "I" denotes the feature vector of the input text, "M" denotes the feature vectors of texts stored on each layer of the initial query index, and "O" denotes the feature vector of the text found by the neighbor retrieval. As shown in FIG. 4, in some embodiments, in step S22 the search starts from the uppermost layer of the initial query index (Layer 2 in FIG. 4) and finds the node in that layer closest to the input feature vector; the search then descends to the next layer, starting from the closest node found in the layer above, and this repeats until the query result (the "O" in FIG. 4) is reached.
In some embodiments, in step S22, a text matching a given text is a text whose feature vector lies within a predetermined distance of the given text's feature vector, or the text whose feature vector is closest among all texts. The matching text may be one text or several texts, depending on the actual retrieval results.
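The layer-by-layer descent described above can be sketched in a few lines. Note this is a deliberate simplification, not HNSW itself: real HNSW restricts each step to the graph neighbours of the current entry node, whereas this sketch (with invented names `l2` and `layered_search`) simply scans each layer:

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def layered_search(layers, query):
    """Simplified sketch of the descent of Fig. 4: start at the top
    (sparsest) layer, find the node closest to the query, and use it as
    the entry point for the layer below, down to the bottom layer. Each
    layer is a dict {node_id: vector}; lower layers are supersets of
    higher ones."""
    entry = None
    for layer in layers:  # layers[0] is the uppermost layer
        # a faithful HNSW would greedily walk graph edges from `entry`
        entry = min(layer, key=lambda n: l2(layer[n], query))
    return entry  # id of the closest node found on the bottom layer
```

With three layers where only the bottom layer contains every node, the search still ends at the globally closest node, mirroring how "O" is reached in Fig. 4.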
And S23, carrying out synonym judgment on the text in each piece of initial synonym relation information by using a preset synonym judgment model, and removing the initial synonym relation information which does not meet the synonym relation in all pieces of initial synonym relation information.
In the embodiments of the present disclosure, a minimum feature-vector distance, or a distance smaller than the predetermined distance, only preliminarily suggests a synonym relation between two texts; it does not prove that the relation truly holds. To further improve the accuracy of synonym-relation identification, in some embodiments, in step S23, the texts in each piece of initial synonym relation information are checked with a preset synonym discrimination model: if the texts do not actually satisfy the synonym relation, that piece of initial synonym relation information is removed, and the pieces that do satisfy the relation are retained.
Specifically, in step S23, for each piece of initial synonym relation information, the feature vectors of any two texts in that information are input into a preset synonym discrimination model, which computes the similarity (degree of synonymy) of the two feature vectors and determines whether the two texts satisfy the synonym relation, for example by checking whether the similarity reaches a preset similarity threshold: if the similarity is greater than or equal to the threshold, the synonym relation is judged to be satisfied; otherwise, it is judged not to be satisfied. If two texts satisfy the synonym relation, both are retained; otherwise, texts that satisfy the synonym relation with no other text are removed.
In some embodiments, the synonym discrimination model may be a language model trained by a deep learning algorithm, where the input of the synonym discrimination model is two texts to be discriminated, and the output is a synonym relationship discrimination result representation of the two texts.
In some embodiments, the synonym discrimination model is a classification model built on a multi-layer Transformer structure. In some embodiments, it uses a BERT classification model obtained by fine-tuning the base BERT model (i.e. the BERT pre-training model described above) to predict whether two texts satisfy a synonymous-variant relation.
FIG. 5 is a schematic diagram of the synonym discrimination model. As shown in FIG. 5, "Sentence1" is the first input text and "Sentence2" is the second; Tok_i denotes the i-th token of the input text, i = 1, 2, 3, …, N (or M), where N and M are positive integers; E_[CLS] denotes the embedding vector marking the input Sentence1, E_[SEP] the embedding vector marking the input Sentence2, and E_i the embedding vector of the i-th token; C is the feature vector of Sentence1, T_[SEP] the feature vector of Sentence2, and T_i the feature vector obtained by applying BERT to the i-th token; Class Label denotes the final output, with value 0 or 1, where 1 indicates that the two input texts satisfy the synonym relation and 0 indicates that they do not.
For example, for the input text pair "how much does double-eyelid surgery cost" and "price of double-eyelid surgery", the synonym discrimination model predicts that the pair satisfies the synonym relation; for the input pair "how much does double-eyelid surgery cost" and "how much does eyebrow tattooing cost", it predicts that the synonym relation is not satisfied.
During training of the synonym discrimination model, the similarity of two input texts can be computed from their feature vectors through a preset metric function, so as to determine whether the two texts satisfy the synonym relation. The metric function may be, for example, a cosine (COS) function or a dot-product function.
In the embodiments of the present disclosure, it is understood that "synonymous" means that the meaning of text is the same or substantially the same.
It should be noted that, the embodiment of the present disclosure is not particularly limited to a specific implementation manner of the synonym determination model, as long as it can identify whether there is a synonym relationship between text pairs.
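The patent leaves the similarity threshold unspecified, and the real decision is made by the trained BERT classifier. As a minimal stand-in for step S23, the sketch below applies the cosine metric mentioned above; the threshold 0.9 and the function names are illustrative assumptions:

```python
import math

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def satisfies_synonym_relation(vec_a, vec_b, threshold=0.9):
    """Thresholded metric check standing in for the classifier's decision:
    similarity >= threshold means the synonym relation is satisfied."""
    return cosine(vec_a, vec_b) >= threshold
```

In practice a dot-product metric could be substituted for `cosine`, as the passage above notes; the threshold would be tuned on labelled synonym pairs.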
Step S24, among the remaining initial synonym relation information, group together the pieces of initial synonym relation information that intersect, each resulting group serving as one synonymous text cluster.
For example, where text A has a synonym relation with text B, text B with text C, and text B with text D, texts A, B, C, and D can be grouped into one class and placed in one synonymous text cluster.
It is understood that a synonym text cluster refers to a collection of texts having synonym relationships, and that there is no intersection between different synonym text clusters.
Fig. 6 is a flowchart of a specific implementation of step S24 in fig. 3, and as shown in fig. 6, in some embodiments, step S24 may further include steps S241 to S243.
Step S241, regarding the rest initial synonymous relation information, each text in the initial synonymous relation information is taken as a node, and the matching relation between the texts in the initial synonymous relation information is taken as a side, so as to construct the Euler diagram.
Specifically, regarding the rest initial synonymous relation information, each text in the initial synonymous relation information is used as a node, and the node corresponding to each text is connected with the node corresponding to the text matched with the text to form an edge, so that an Euler diagram is constructed.
Step S242, determining all connected subgraphs in the Euler diagram by using a preset connected subgraph discovery algorithm.
In some embodiments, the preset connected-subgraph discovery algorithm is a union-find (disjoint-set) algorithm, through which all connected subgraphs in the Euler diagram can be found. During connected-subgraph discovery, the size of a single connected subgraph is limited: if the number of connected nodes exceeds a limiting threshold, further connected expansion stops, avoiding the risk of error accumulation caused by overly long path depths between nodes.
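A union-find structure with the size limit just described can be sketched as follows; the class name and the cap value are illustrative, not from the patent:

```python
class SizeCappedUnionFind:
    """Union-find (disjoint-set) sketch of the connected-subgraph discovery
    step, with the size limit described above: a union that would grow a
    component past `cap` nodes is refused, stopping runaway expansion."""

    def __init__(self, cap=1000):
        self.parent = {}
        self.size = {}
        self.cap = cap

    def find(self, x):
        self.parent.setdefault(x, x)
        self.size.setdefault(x, 1)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True
        if self.size[ra] + self.size[rb] > self.cap:
            return False  # stop expanding this connected subgraph
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # union by size
        self.size[ra] += self.size[rb]
        return True
```

Feeding each retained matching pair (an edge of the Euler diagram) to `union` leaves each root representing one connected subgraph, i.e. one synonymous text cluster.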
And step S243, determining initial synonymous relation information with intersection according to the connected subgraph to generate synonymous text clusters.
It can be understood that each connected subgraph mined in step S242 corresponds to a synonymous text cluster.
FIG. 7 is a schematic diagram of connected subgraphs. As shown in FIG. 7, after the processing of step S23, the remaining initial synonym relation information in step S241 indicates that text A matches texts B and C, so A is connected to B and C, and that text B matches texts D and E, so B is connected to D and E. In steps S242 and S243 it is then determined that texts A, B, C, D, E are mutually connected and form one connected subgraph; by analogy, texts F, G, H form one connected subgraph, and texts J, K, L, P form another.
As an example, through the above steps S21 to S24, the plurality of synonymous text clusters are finally determined, for example: [ how much money the double eyelid operation requires, how much money the double eyelid operation requires ], [ what the eyebrow tattooing needs to pay attention to, what the care of tattooing the eyebrow ], [ how much money the eyebrow tattooing takes once, the price of the eyebrow tattooing operation ].
Fig. 8 is a flowchart of a specific implementation of step S3 in fig. 1, and as shown in fig. 8, in some embodiments, step S3 may further include steps S31 to S32.
Step S31, determining, for each text in a synonymous text cluster, the number of synonymous relations corresponding to that text within the cluster.
It is understood that the number of synonymous relations corresponding to a text in a synonymous text cluster is the number of texts in that cluster that have a synonymous relation with the text.
In some embodiments, a preset synonym discrimination model may be used to identify the texts in the synonymous text cluster that have a synonymous relation with a given text, so as to determine the number of synonymous relations corresponding to that text. For the preset synonym discrimination model, refer to the earlier description of the synonym discrimination model, which is not repeated here.
Step S32, taking any text with the largest number of synonymous relations in the synonymous text cluster as the representative text corresponding to that cluster.
In some embodiments, if several texts tie for the largest number of synonymous relations in the synonymous text cluster, then for each of those texts: first, the synonym degree (similarity) of each synonymous relation of the text is calculated using the synonym discrimination model; next, the average synonym degree of the text is calculated over all of its synonymous relations; finally, the average synonym degrees of the tied texts are compared, and the text with the highest average is selected as the representative text of the synonymous text cluster.
In some embodiments, instead of determining the representative text through the above steps S31 and S32, one text may simply be selected at random from each synonymous text cluster as the representative text of that cluster.
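The selection and tie-breaking logic of steps S31 and S32 can be sketched as below. The word-overlap similarity and the 0.5 decision threshold merely stand in for the synonym discrimination model and its synonym degree; they, and the sample cluster, are assumptions for illustration only.

```python
def pick_representative(cluster, similarity, threshold=0.5):
    """Steps S31-S32 in miniature: prefer the text with the most synonym
    relations in the cluster; break ties by the highest average synonym
    degree. `similarity` stands in for the synonym discrimination model."""
    best_key, best_text = None, None
    for text in cluster:
        sims = [similarity(text, other) for other in cluster if other != text]
        links = sum(1 for s in sims if s >= threshold)   # synonym-relation count
        avg = sum(sims) / len(sims) if sims else 0.0     # tie-break value
        if best_key is None or (links, avg) > best_key:
            best_key, best_text = (links, avg), text
    return best_text

def word_overlap(a, b):
    """Toy stand-in for the discrimination model's synonym degree."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

cluster = ["double eyelid surgery price",
           "double eyelid surgery cost",
           "how much does double eyelid surgery cost"]
rep = pick_representative(cluster, word_overlap)
# -> "double eyelid surgery cost": it is the only text related to both others.
```

Tuples compare lexicographically in Python, so `(links, avg)` implements "most relations first, highest average second" without extra branching.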
Fig. 9 is a flowchart of another text processing method according to an embodiment of the present disclosure. As shown in Fig. 9, in some embodiments, to ensure that the representative text has a synonymous relation with every text in its synonymous text cluster, the text processing method may further include steps S33 to S34 before step S4, reducing usage errors of the synonymous text clusters in actual search scenes.
Step S33, identifying, for each text in each synonymous text cluster, whether the text has a synonymous relation with the representative text corresponding to that cluster, using a preset synonym discrimination model.
Specifically, the synonym discrimination model calculates the similarity between the feature vector of the text and the feature vector of the representative text, and the synonymous relation is decided from this similarity: if the similarity is greater than or equal to a preset similarity threshold, the two texts are judged to have a synonymous relation; if it is smaller than the threshold, they are judged not to. For the preset synonym discrimination model, refer to the earlier description of the synonym discrimination model, which is not repeated here.
Step S34, removing the text from the synonymous text cluster when the text is identified as having no synonymous relation with the representative text corresponding to that cluster.
That is, a text is retained when it is identified as having a synonymous relation with the representative text of its cluster, and removed from the cluster when no such relation is identified. This ensures the synonymous relation between the representative text and every text in the synonymous text cluster and reduces usage errors of the clusters in actual retrieval scenes.
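Steps S33 and S34 amount to a threshold filter over the cluster, as the sketch below shows. Cosine similarity over feature vectors stands in for the synonym discrimination model, and the 0.8 threshold and toy two-dimensional vectors are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def prune_cluster(rep_vec, members, threshold=0.8):
    """Steps S33-S34: keep a text only if its feature vector is similar
    enough to the representative text's vector; texts below the threshold
    are removed from the cluster. The 0.8 threshold is illustrative."""
    return {t: v for t, v in members.items() if cosine(rep_vec, v) >= threshold}

rep_vec = (1.0, 0.0)
members = {"on-topic a": (1.0, 0.0),
           "on-topic b": (0.9, 0.1),
           "off-topic":  (0.0, 1.0)}
kept = prune_cluster(rep_vec, members)   # "off-topic" is removed
```

In the disclosure the similarity would come from the synonym discrimination model rather than a raw cosine, but the keep-or-remove decision structure is the same.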
In the embodiment of the present disclosure, the target query index of the text database is created based on the representative texts, so that a search need not cover the full text of the database but only the representative texts. On the one hand, this reduces the text space of the index and saves storage resources; on the other hand, it greatly improves search and recall efficiency.
FIG. 10 is a flowchart of a synonymous text recall method provided by an embodiment of the present disclosure.
The embodiment of the present disclosure provides a synonymous text recall method, which can be executed by a synonymous text recall apparatus; the apparatus can be implemented in software and/or hardware and can be integrated into an electronic device such as a server.
Referring to fig. 10, the synonym text recall method may be implemented based on a target query index of a text database, the target query index being created using the text processing method described above, the synonym text recall method comprising:
step S5, obtaining a search request, wherein the search request comprises a search text.
In some embodiments, a search request (query) of a user is received in real-time through an online environment.
In some embodiments, in step S5, a search request (query) entered by a user on the interactive system is obtained. The interactive system may be an intelligent interactive system such as an intelligent terminal, a platform, an application, a client, etc. capable of providing intelligent interactive services for users, for example, an intelligent sound box, an intelligent video sound box, an intelligent story machine, an intelligent interactive platform, an intelligent interactive application, a search engine, and a question-answering system. The embodiments of the present disclosure are not particularly limited as to the implementation of the interactive system, as long as the interactive system is capable of interacting with a user.
In the embodiment of the present disclosure, the foregoing "interaction" may include voice interaction and text interaction. Voice interaction is implemented based on technologies such as voice recognition, voice synthesis and natural language understanding; across practical application scenarios it gives the interactive system an intelligent human-computer experience of "listening, speaking and understanding you", and it applies to many scenarios, including intelligent question answering, intelligent playback and intelligent search. Text interaction is implemented based on technologies such as text recognition, extraction and natural language understanding, and can likewise be applied in many scenarios.
In some embodiments, in step S5, the user may input the search request by voice interaction; after the voice information input by the user is obtained, operations such as voice recognition and speech-to-text conversion may be performed on it to obtain the corresponding search text.
In some embodiments, in step S5, the user may also input the search request by text interaction; when the user inputs text information, the text information can be obtained directly, and this text information is the search text. Text information here refers to natural-language characters.
Step S6, obtaining the feature vector corresponding to the search text.
In the embodiment of the present disclosure, after the search text of the user is obtained, in step S6, a feature vector corresponding to the search text may be obtained using a preset metric representation model. For a specific description of the metric representation model, reference may be made to the description of the metric representation model described above, which is not repeated here.
Step S7, inputting the feature vector of the search text into the target query index to query the representative text matching the search text.
In the embodiment of the present disclosure, after the feature vector of the search text is acquired, in step S7, the feature vector of the search text is input into a pre-created target query index, and a representative text matching the search text is queried in the target query index using a neighbor search technique. For a specific description of the neighbor search technique, reference is made to the above description of the neighbor search technique, and the description is omitted here.
Step S8, taking the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text for search recall.
As an example, assume the text database contains a synonymous text cluster whose corresponding representative text is "double eyelid surgery price", and that the search text is "how much does double eyelid surgery cost". When step S7 retrieves "double eyelid surgery price" as the representative text matching the search text, then "double eyelid surgery price" and the texts in its corresponding synonymous text cluster, such as "how much money is needed for double eyelid surgery", are taken as synonymous texts of the search text.
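Steps S5 through S8, together with the optional check of step S71 below, compose as in the following sketch. A brute-force nearest-neighbor scan stands in for the HNSW-based target query index, and the vectors, texts and 0.8 threshold are illustrative assumptions, not the disclosure's data.

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(x * x for x in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_synonyms(query_vec, rep_vectors, clusters, threshold=0.8):
    """Find the representative text nearest to the query vector (step S7;
    brute force here, where an HNSW index would be used at scale), apply
    the synonym check of step S71, then recall the representative together
    with its whole synonym cluster (step S8)."""
    best_rep, best_sim = None, -1.0
    for rep, vec in rep_vectors.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_rep, best_sim = rep, sim
    if best_sim < threshold:      # step S71: nothing synonymous in the database
        return []
    return [best_rep] + clusters[best_rep]

rep_vectors = {"double eyelid surgery price": (1.0, 0.0),
               "eyebrow tattoo care": (0.0, 1.0)}
clusters = {"double eyelid surgery price":
                ["how much does double eyelid surgery cost"],
            "eyebrow tattoo care":
                ["what to pay attention to after eyebrow tattooing"]}

# A query embedding close to the first representative recalls its cluster.
recalled = recall_synonyms((0.95, 0.05), rep_vectors, clusters)
```

The key point is that only the representative vectors are scanned, never the full database; each representative then "pulls in" its entire cluster at recall time.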
FIG. 11 is a flowchart of another method for synonymous text recall according to an embodiment of the present disclosure, as shown in FIG. 11, and in some embodiments, the method for synonymous text recall may further include step S71 before step S8.
Step S71, identifying whether the search text and the representative text matching the search text have a synonymous relation using a preset synonym discrimination model; if so, executing step S8; otherwise, performing no further processing.
The detailed description of the synonym discrimination model can be referred to the description of the synonym discrimination model, and will not be repeated here.
In step S71, when the search text is identified as having a synonymous relation with the representative text matching it, step S8 is executed; when no synonymous relation is identified, the text database contains no text synonymous with the search text, so no synonymous text is recalled and no further processing is performed.
In some embodiments, the text processing method described above may be performed in an offline environment, while the synonymous text recall method described above may be performed in real-time in an online environment.
In the synonymous text recall method of the embodiment of the present disclosure, only the representative text matching the query's search text needs to be retrieved, rather than the full text of the database, which effectively improves search and recall efficiency; performing synonym discrimination between the search text and the retrieved representative text with the synonym discrimination model effectively improves recall quality. Once the representative text is retrieved, the representative text and all texts in its synonymous text cluster can be recalled, effectively avoiding missed recalls.
Fig. 12 is a block diagram of a text processing apparatus according to an embodiment of the present disclosure.
Referring to fig. 12, an embodiment of the present disclosure provides a text processing apparatus 300, the text processing apparatus 300 including: a first vector acquisition module 301, a text classification module 302, a screening module 303, and a construction module 304.
Wherein the first vector obtaining module 301 is configured to obtain feature vectors of all texts in the text database; the text classification module 302 is configured to classify and cluster all the texts according to the feature vectors of all the texts to obtain a plurality of synonymous text clusters, wherein the synonymous text clusters comprise a plurality of texts with synonymous relations; the filtering module 303 is configured to determine, for each synonymous text cluster, one text from the synonymous text cluster, so as to serve as a representative text corresponding to the synonymous text cluster; the building module 304 is configured to create a target query index for the text database from all feature vectors representing text.
In some embodiments, the apparatus 300 further comprises an intra-cluster text filtering module (not shown in the figures) configured to: identifying whether the text has a synonym relation with a representative text corresponding to each synonym text cluster by using a preset synonym discrimination model according to each text in each synonym text cluster; and eliminating the text from the synonymous text cluster under the condition that the text is identified to have no synonymous relation with the representative text corresponding to the synonymous text cluster.
It should be noted that, the text processing device provided in the embodiments of the present disclosure is used to implement the text processing method provided in any of the embodiments, and the detailed description of the text processing device may be referred to the description in the embodiments, which is not repeated herein.
Fig. 13 is a block diagram of a synonymous text recall device according to an embodiment of the present disclosure.
Referring to fig. 13, an embodiment of the present disclosure provides a synonymous text recall apparatus 400, the synonymous text recall apparatus 400 including: a request acquisition module 401, a second vector acquisition module 402, a query module 403, and a text recall module 404.
Wherein the request acquisition module 401 is configured to acquire a search request, the search request comprising a search text; the second vector obtaining module 402 is configured to obtain a feature vector corresponding to the search text; the query module 403 is configured to input the feature vector of the search text into a target query index corresponding to the text database to query out a representative text matching the search text; the target query index is created by adopting the text processing method; the text recall module 404 is configured to recall the representative text and all text in the synonym text cluster corresponding to the representative text as synonym text for the search text.
In some embodiments, the apparatus 400 further includes a synonym identification module (not shown), configured to identify, after the query module 403 queries the representative text matching the search text, whether the search text and that representative text have a synonymous relation, using a preset synonym discrimination model; when a synonymous relation is identified, the module triggers the text recall module 404 to perform the step of search recall with the representative text and all texts in its synonymous text cluster as synonymous texts of the search text.
It should be noted that, the synonymous text recall device provided in the embodiments of the present disclosure is used to implement the synonymous text recall method provided in any of the embodiments described above, and the detailed description of the synonymous text recall device may be referred to the description in the embodiments described above, and will not be repeated here.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a computer-readable medium, and a computer program product.
Fig. 14 is a block diagram of an electronic device according to an embodiment of the disclosure.
Fig. 14 shows a schematic block diagram of an electronic device 800 that may be used to implement embodiments of the present disclosure. The electronic device 800 is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
Referring to fig. 14, the electronic device includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in the electronic device 800 are connected to the I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as text processing methods and/or synonymous text recall methods. For example, in some embodiments, the text processing methods and/or synonymous text recall methods described above may be implemented as computer software programs or instructions tangibly embodied in a machine (computer) readable medium, such as the storage unit 808. In some embodiments, some or all of the computer programs or instructions may be loaded and/or installed onto device 800 via ROM 802 and/or communication unit 809. When computer programs or instructions are loaded into RAM 803 and executed by computing unit 801, one or more steps of the text processing method and/or synonymous text recall method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the text processing method and/or synonymous text recall method described above in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs or instructions that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor able to receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine (computer) readable medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, implements the above-described text processing method and/or the above-described synonymous text recall method.
According to the technical scheme of the embodiments of the present disclosure, the texts in the text database are classified and clustered by synonymous relation, and the representative text of each cluster is selected to construct the query index of the text database. On the one hand, this effectively reduces the text space of the query index and saves storage resources; on the other hand, a search against the text database no longer scans its full text but only the representative texts via the target query index, which effectively improves search and recall efficiency. Moreover, when a representative text is retrieved, the representative text and all texts in its cluster can be recalled, which improves search recall capability and effectively avoids missed recalls.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present application may be performed in parallel or sequentially or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
It is to be understood that the above-described embodiments are merely illustrative of the principles of the present disclosure and are not in limitation of the scope of the disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (18)

1. A text processing method of a text database, comprising:
acquiring feature vectors of all texts in a text database;
classifying and clustering all texts according to the feature vectors of all texts to obtain a plurality of synonymous text clusters, wherein the synonymous text clusters comprise a plurality of texts with synonymous relations;
for each synonymous text cluster, determining one text from the synonymous text cluster to serve as a representative text corresponding to the synonymous text cluster;
creating a target query index of the text database according to all the feature vectors representing the text;
classifying and clustering all texts according to the feature vectors of all texts to obtain a plurality of synonymous text clusters, wherein the classifying and clustering comprises the following steps:
creating an initial query index of the text database according to the feature vectors of all texts in the text database;
for each text in the text database, querying a text matching the text through the initial query index, and generating initial synonymous relation information, wherein the initial synonymous relation information comprises the text and the text matching it;
performing synonym discrimination on the texts in each piece of initial synonymous relation information using a preset synonym discrimination model, and removing, from all the initial synonymous relation information, the initial synonymous relation information that does not satisfy the synonymous relation;
And dividing the initial synonymous relation information with the intersection in the rest initial synonymous relation information into one type of initial synonymous relation information, wherein each type of initial synonymous relation information is used as one synonymous text cluster.
2. The text processing method of claim 1, wherein the obtaining feature vectors of all texts in a text database comprises:
and obtaining the characteristic vector of each text in the text database by using a preset measurement representation model.
3. The text processing method of claim 2, wherein the metric representation model is a BERT pre-training model.
4. The text processing method of claim 1, wherein the querying, through the initial query index, of a text matching the text to generate the initial synonymous relation information comprises:
and querying a text matched with the text in the initial query index by utilizing a neighbor retrieval technology to generate the initial synonymous relation information.
5. The text processing method of claim 4, wherein the neighbor search technique employs HNSW algorithm.
6. The text processing method of claim 1, wherein the synonymous discriminant model is a BERT classification model.
7. The text processing method according to claim 1, wherein the dividing of the initial synonymous relation information having an intersection, among the remaining initial synonymous relation information, into one class of initial synonymous relation information, each class of initial synonymous relation information serving as one of the synonymous text clusters, comprises:
constructing, for the remaining initial synonymous relation information, an Euler graph with each text in the initial synonymous relation information as a node and the matching relations between texts in the initial synonymous relation information as edges;
determining all connected subgraphs in the Euler diagram by using a preset connected subgraph discovery algorithm;
and determining initial synonymous relation information with intersection according to the connected subgraph so as to generate the synonymous text cluster.
8. The text processing method of claim 7, wherein the connected subgraph discovery algorithm comprises a union algorithm.
9. The text processing method according to claim 1, wherein said determining a text from the synonym text cluster as a representative text corresponding to the synonym text cluster includes:
determining, for each text in the synonymous text cluster, the number of synonymous relations corresponding to the text in the synonymous text cluster;
And taking any text with the largest quantity of synonymous relations in the synonymous text cluster as a representative text corresponding to the synonymous text cluster.
10. The text processing method of claim 1, wherein before creating the target query index of the text database from all feature vectors representing text, further comprising:
identifying whether the text has a synonym relation with the representative text corresponding to each synonym text cluster by using a preset synonym judging model according to each text in each synonym text cluster;
and eliminating the text from the synonymous text cluster under the condition that the text is identified to have no synonymous relation with the representative text corresponding to the synonymous text cluster.
11. A synonymous text recall method implemented based on a target query index of a text database created using the text processing method of any one of the preceding claims 1-10, the recall method comprising:
acquiring a search request, wherein the search request comprises a search text;
acquiring a feature vector corresponding to the search text;
inputting the feature vector of the search text into the target query index to query out a representative text matched with the search text;
And carrying out search recall by taking the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text.
12. The synonymous text recall method according to claim 11, wherein the obtaining the feature vector corresponding to the search text comprises:
and obtaining the feature vector corresponding to the search text by using a preset metric representation model.
13. The synonymous text recall method of claim 11, wherein said inputting the feature vector of the search text into the target query index to query out the representative text matching the search text comprises:
and inputting the feature vector of the search text into the target query index, and querying out the representative text matched with the search text in the target query index by using a nearest neighbor retrieval technique.
14. The synonymous text recall method of claim 11, wherein before the step of carrying out search recall by taking the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text, the method further comprises: identifying, by using a preset synonym discrimination model, whether the search text and the representative text matched with the search text have a synonymous relation;
and executing the step of carrying out search recall by taking the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text, in a case where the search text and the representative text matched with the search text are identified as having a synonymous relation.
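The recall flow of claims 11-14 can be sketched as follows (all texts and vectors are hypothetical; brute-force cosine similarity stands in for the nearest-neighbor index the patent describes): embed the search text, find the best-matching representative in the index, then recall that representative's entire synonymous text cluster.

```python
import math

# Hypothetical sketch of claims 11-14: match the search text's feature
# vector against an index of representative-text vectors, then recall the
# matched representative together with its whole synonymous text cluster.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def recall(query_vec, index, clusters):
    # index: {representative text: feature vector}
    # clusters: {representative text: other member texts of its cluster}
    best = max(index, key=lambda rep: cosine(query_vec, index[rep]))
    return [best] + sorted(clusters[best])

index = {"cheap flights": [0.9, 0.1], "weather today": [0.1, 0.9]}
clusters = {"cheap flights": {"low-cost flights", "discount airfare"},
            "weather today": {"today's forecast"}}
print(recall([0.8, 0.2], index, clusters))
# ['cheap flights', 'discount airfare', 'low-cost flights']
```

A production system would replace the linear scan with an approximate nearest-neighbor index, and (per claim 14) run the discrimination model on the query/representative pair before expanding the cluster.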
15. A text processing apparatus, comprising:
the first vector acquisition module is configured to acquire feature vectors of all texts in the text database;
the text classification module is configured to classify and cluster all texts according to the feature vectors of all texts to obtain a plurality of synonymous text clusters, wherein the synonymous text clusters comprise a plurality of texts with synonymous relations;
the screening module is configured to determine a text from each synonymous text cluster to be used as a representative text corresponding to the synonymous text cluster;
the construction module is configured to establish a target query index of the text database according to the feature vectors of all the representative texts;
the text classification module is configured to: create an initial query index of the text database according to the feature vectors of all texts in the text database;
for each text in the text database, query texts matched with the text through the initial query index, and generate initial synonymous relation information, wherein the initial synonymous relation information comprises the text and the texts matched with the text;
carry out synonym discrimination on the texts in each piece of initial synonymous relation information by using a preset synonym discrimination model, and remove, from all pieces of initial synonymous relation information, the initial synonymous relation information which does not satisfy the synonymous relation;
and divide the pieces of initial synonymous relation information having an intersection among the remaining initial synonymous relation information into one class, wherein each class of initial synonymous relation information serves as one synonymous text cluster.
16. A synonymous text recall device, comprising:
a request acquisition module configured to acquire a search request, the search request including a search text;
the second vector acquisition module is configured to acquire the feature vector corresponding to the search text;
the query module is configured to input the feature vector of the search text into a target query index corresponding to the text database so as to query out a representative text matched with the search text; the target query index is created using the text processing method of any of the preceding claims 1-10;
and the text recall module is configured to recall the representative text and all texts in the synonymous text cluster corresponding to the representative text as synonymous texts of the search text.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores one or more computer programs executable by the at least one processor, the one or more computer programs, when executed by the at least one processor, enabling the at least one processor to perform the method of any one of claims 1-14.
18. A computer readable medium having stored thereon a computer program, wherein the computer program when executed implements the method of any of claims 1-14.
CN202110220258.2A 2021-02-26 2021-02-26 Text processing method, synonymous text recall method and apparatus Active CN112925912B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110220258.2A CN112925912B (en) 2021-02-26 2021-02-26 Text processing method, synonymous text recall method and apparatus

Publications (2)

Publication Number Publication Date
CN112925912A CN112925912A (en) 2021-06-08
CN112925912B true CN112925912B (en) 2024-01-12

Family

ID=76172428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110220258.2A Active CN112925912B (en) 2021-02-26 2021-02-26 Text processing method, synonymous text recall method and apparatus

Country Status (1)

Country Link
CN (1) CN112925912B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343047B (en) * 2021-06-18 2024-05-31 Beijing Baidu Netcom Science and Technology Co., Ltd. Data processing method, data retrieval method and device
CN113407700A (en) * 2021-07-06 2021-09-17 Industrial and Commercial Bank of China Ltd. Data query method, device and equipment

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103235784A (en) * 2013-03-28 2013-08-07 Baidu Online Network Technology (Beijing) Co., Ltd. Method and equipment used for obtaining search results
CN103246697A (en) * 2013-03-28 2013-08-14 Baidu Online Network Technology (Beijing) Co., Ltd. Method and equipment for determining near-synonymy sequence clusters
CN106156272A (en) * 2016-06-21 2016-11-23 Beijing University of Technology A kind of information retrieval method based on multi-source semantic analysis
CN107329964A (en) * 2017-04-19 2017-11-07 Alibaba Group Holding Ltd. A kind of text handling method and device
CN110413749A (en) * 2019-07-03 2019-11-05 Alibaba Group Holding Ltd. Determine the method and device of typical problem
CN110442718A (en) * 2019-08-08 2019-11-12 Tencent Technology (Shenzhen) Co., Ltd. Sentence processing method, device and server and storage medium
EP3575988A1 (en) * 2018-05-31 2019-12-04 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and device for retelling text, server, and storage medium
CN110543555A (en) * 2019-08-15 2019-12-06 Alibaba Group Holding Ltd. Method and device for question recall in intelligent customer service
CN110674243A (en) * 2019-07-02 2020-01-10 Xiamen Naite Yuanma Information Technology Co., Ltd. Corpus index construction method based on dynamic K-means algorithm
CN111460808A (en) * 2020-03-23 2020-07-28 Tencent Technology (Shenzhen) Co., Ltd. Synonymous text recognition and content recommendation method and device and electronic equipment
CN111881255A (en) * 2020-06-24 2020-11-03 Baidu Online Network Technology (Beijing) Co., Ltd. Synonymy text acquisition method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107491518B (en) * 2017-08-15 2020-08-04 北京百度网讯科技有限公司 Search recall method and device, server and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Text clustering algorithm based on concept vector; BAI Qiuchan et al.; Computer Engineering and Applications; full text *
Automatic extraction method of synonymous relations based on supervised learning; SUN Xia, DONG Lehong; Journal of Northwest University (Natural Science Edition) (No. 01); full text *
Chinese text clustering algorithm based on semantic lists; MA Suqin, SHI Huaji, LI Xingyi; Application Research of Computers (No. 05); full text *

Similar Documents

Publication Publication Date Title
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN109635273B (en) Text keyword extraction method, device, equipment and storage medium
CN110188168B (en) Semantic relation recognition method and device
CN112749344B (en) Information recommendation method, device, electronic equipment, storage medium and program product
CN110619051B (en) Question sentence classification method, device, electronic equipment and storage medium
CN111985228B (en) Text keyword extraction method, text keyword extraction device, computer equipment and storage medium
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN112925912B (en) Text processing method, synonymous text recall method and apparatus
CN111832290A (en) Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN112818686B (en) Domain phrase mining method and device and electronic equipment
CN112115232A (en) Data error correction method and device and server
CN113660541B (en) Method and device for generating abstract of news video
CN113051368B (en) Double-tower model training method, retrieval device and electronic equipment
JP2011128773A (en) Image retrieval device, image retrieval method, and program
US20230094730A1 (en) Model training method and method for human-machine interaction
Liu et al. Open intent discovery through unsupervised semantic clustering and dependency parsing
CN114782719B (en) Training method of feature extraction model, object retrieval method and device
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN116401345A (en) Intelligent question-answering method, device, storage medium and equipment
CN115248839A (en) Knowledge system-based long text retrieval method and device
CN114995903A (en) Class label identification method and device based on pre-training language model
CN112948573B (en) Text label extraction method, device, equipment and computer storage medium
CN113033194B (en) Training method, device, equipment and storage medium for semantic representation graph model
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
WO2023246849A1 (en) Feedback data graph generation method and refrigerator

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant