CN114417838B

CN114417838B - Method for extracting synonym block pairs based on transformer model

Info

Publication number: CN114417838B
Application number: CN202210336467.8A
Authority: CN
Inventors: 殷晓君; 殷晓东; 王诚文; 王鸿滨
Original assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Current assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date: 2022-04-01
Filing date: 2022-04-01
Publication date: 2022-06-21
Anticipated expiration: 2042-04-01
Also published as: CN114417838A

Abstract

The invention relates to the technical field of synonym block pair extraction, in particular to a method for extracting synonym block pairs based on a transformer model, which comprises the following steps: obtaining a statement pair to be extracted, inputting the statement pair into a transformer model, and obtaining ec_att_matrix and edc_att_matrix inside the transformer model; in ec_att_matrix, determining a minimum internal matrix meeting a first condition, recording the corresponding language block and label, and determining the language block as a Query language block; for each Query language block, determining in edc_att_matrix a minimum matrix meeting a second condition, and determining the Title language block corresponding to the Query language block; and determining a synonym block pair according to the Query language block and the corresponding Title language block. The method and the device address retrieval mismatches between spoken-language and written-language expressions and improve extraction efficiency and accuracy.

Description

Method for extracting synonym block pair based on transformer model

Technical Field

The invention relates to the technical field of synonym block pair extraction, in particular to a method and a device for extracting synonym block pairs based on a transformer model.

Background

Synonym chunk pairs refer to chunk pairs that can form synonyms in conjunction with certain context information and are not simple synonyms. For example, "how far the electric vehicle runs" and "electric vehicle range" are synonym block pairs, but neither "run-range" nor "how far-range" can be used alone as a context-free synonym.

Synonym blocks are mainly used to handle cases where the same meaning is expressed in different ways, in particular where spoken and written expressions differ, which is one of the main problems faced by a search engine. A user tends to enter a spoken-style Query such as "how far the electric vehicle runs", whereas the articles indexed by the search engine are mostly written in a formal style and tend to say "electric vehicle endurance mileage". If a corresponding synonym block pair is available, the matching result can be retrieved correctly. Therefore, an efficient and fast method for extracting synonym block pairs is needed.

Disclosure of Invention

The embodiment of the invention provides a method and a device for extracting synonym block pairs based on a transformer model. The technical scheme is as follows:

in one aspect, a method for extracting synonym block pairs based on a transformer model is provided, where the method is implemented by a blockchain management node, and the method includes:

obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into a transformer model for extracting a synonym block pair, wherein the statement pair to be extracted comprises a Query statement and a Title statement;

acquiring a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relationship between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relationship between the Query statement and the Title statement;

determining a minimum internal matrix meeting a first condition in ec_att_matrix, recording the language block and label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement;

for each Query language block, determining a minimum matrix meeting a second condition in edc_att_matrix, and determining a Title language block corresponding to the Query language block according to the minimum matrix;

and determining a synonym block pair according to the Query block and the corresponding Title block.

Optionally, the determining, in ec_att_matrix, a minimum internal matrix that meets a first condition, and recording a language block and a label corresponding to the minimum internal matrix includes:

in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting the first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i+k, i+k);

wherein satisfying the first condition comprises:

if, for an internal matrix, each row q with i <= q <= i+k satisfies the following two conditions:

1) the sum of ec_att_matrix[q][p] is greater than a first threshold T1, where i <= p <= i+k;

2) the sum of ec_att_matrix[q][h] is greater than a second threshold T2, where i <= h <= i+k and h != q;

then determining the internal matrix as the minimum internal matrix, and putting the language block corresponding to the subscripts i to i+k into a Query language block set.
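The two row-wise tests of the first condition can be sketched as follows. This is only an illustration under stated assumptions: the function name, the 0-based indexing and the toy matrix are hypothetical and not part of the disclosure, and the thresholds 0.85 and 0.15 are the preferred values mentioned in the detailed description.

```python
def satisfies_first_condition(ec_att_matrix, i, k, t1, t2):
    """Check the block covering rows and columns i..i+k (inclusive, 0-based)."""
    cols = range(i, i + k + 1)
    for q in range(i, i + k + 1):
        row = ec_att_matrix[q]
        # Test 1: total attention that row q places inside the block.
        inside = sum(row[p] for p in cols)
        # Test 2: the same sum without the diagonal entry (h != q).
        inside_off_diagonal = sum(row[h] for h in cols if h != q)
        if not (inside > t1 and inside_off_diagonal > t2):
            return False
    return True


# Toy self-attention matrix (hypothetical values, each row sums to 1).
toy = [
    [0.90, 0.05, 0.05, 0.00],
    [0.05, 0.55, 0.35, 0.05],
    [0.05, 0.40, 0.50, 0.05],
    [0.00, 0.05, 0.05, 0.90],
]

print(satisfies_first_condition(toy, 1, 1, 0.85, 0.15))  # rows 1..2 attend to each other
print(satisfies_first_condition(toy, 0, 1, 0.85, 0.15))  # row 0 attends mostly to itself
```

Read this way, the first test asks that the words of a block attend almost exclusively to the block, and the second test excludes blocks whose words attend only to themselves.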

Optionally, for each Query language block, determining a minimum matrix satisfying a second condition in edc_att_matrix includes:

marking the subscripts of the first and last words of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting the second condition, and marking the first and last subscripts of the minimum matrix as T_begin and T_end,

wherein satisfying the second condition comprises:

for each row i, Q_begin <= i <= Q_end:

the sum of edc_att_matrix[i][j] is greater than a third threshold T3, where T_begin <= j <= T_end.
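A minimal sketch of the second condition is given below. It assumes, as the formula above does, that rows of edc_att_matrix are indexed by Query words and columns by Title words; the function name, the 0-based indexing, the toy values and the threshold 0.8 are hypothetical, since no value for T3 is given.

```python
def satisfies_second_condition(edc_att_matrix, q_begin, q_end, t_begin, t_end, t3):
    """Every Query row q_begin..q_end must place more than t3 of its
    attention on the Title columns t_begin..t_end (inclusive, 0-based)."""
    for i in range(q_begin, q_end + 1):
        mass = sum(edc_att_matrix[i][j] for j in range(t_begin, t_end + 1))
        if not mass > t3:
            return False
    return True


# Toy encoder-decoder attention matrix (hypothetical values).
toy_edc = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.60, 0.30, 0.05],
    [0.05, 0.30, 0.60, 0.05],
    [0.05, 0.45, 0.45, 0.05],
]

print(satisfies_second_condition(toy_edc, 1, 3, 1, 2, 0.8))  # columns 1..2 cover the block
print(satisfies_second_condition(toy_edc, 1, 3, 1, 1, 0.8))  # column 1 alone does not
```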

Optionally, the determining, according to the minimum matrix, a Title language block corresponding to the Query language block includes:

and determining a corresponding language block according to the T _ begin and the T _ end, and determining the language block as a Title language block corresponding to the Query language block.

Optionally, the determining a synonym block pair according to the Query block and the corresponding Title block includes:

and forming a synonym block pair by the Query block and the corresponding Title block, and correspondingly storing.

In another aspect, an apparatus for extracting a synonym block pair based on a transformer model is provided, and the apparatus is applied to a method for extracting a synonym block pair based on a transformer model, and the apparatus includes:

a first obtaining module, configured to obtain a statement pair to be extracted and input the statement pair to be extracted into a transformer model for extracting a synonym block pair, wherein the statement pair to be extracted comprises a Query statement and a Title statement;

a second obtaining module, configured to obtain a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relationship between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relationship between the Query statement and the Title statement;

a first determining module, configured to determine, in ec_att_matrix, a minimum internal matrix meeting a first condition, record a language block and a label corresponding to the minimum internal matrix, and determine the language block as a Query language block corresponding to the Query statement;

a second determining module, configured to determine, for each Query language block, a minimum matrix meeting a second condition in edc_att_matrix, and determine, according to the minimum matrix, a Title language block corresponding to the Query language block;

and the third determining module is used for determining the synonym block pair according to the Query block and the corresponding Title block.

Optionally, the first determining module is further configured to:

wherein satisfying the first condition comprises:

Optionally, the second determining module is further configured to:

wherein satisfying the second condition comprises:

for each row i, Q_begin <= i <= Q_end:

Optionally, the second determining module is further configured to:

Optionally, the third determining module is configured to:

In another aspect, an electronic device is provided, and the electronic device includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above method for extracting synonym block pairs based on a transformer model.

In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the above method for extracting synonym block pairs based on a transformer model.

The technical scheme provided by the embodiment of the invention has the beneficial effects that at least:

the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine, together with the self-attention matrix and the encoder-decoder attention matrix inside the transformer model, a large number of synonym block pairs can be extracted automatically, retrieval mismatches between spoken-language and written-language expressions can be resolved, and both extraction efficiency and accuracy are high.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a method for extracting synonym block pairs based on a transformer model according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for extracting synonym block pairs based on a transformer model according to an embodiment of the present invention;

FIG. 3 is a block diagram of an apparatus for extracting synonym block pairs based on a transformer model according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

The embodiment of the invention provides a method for extracting synonym block pairs based on a transformer model, which can be realized by a block chain management node, wherein the block chain management node can be a terminal or a server. As shown in fig. 1, a flow chart of a method for extracting synonym block pairs based on a transformer model, a processing flow of the method may include the following steps:

S101, obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into a transformer model for extracting a synonym block pair, wherein the statement pair to be extracted comprises a Query statement and a Title statement.

S102, obtaining a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relationship between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relationship between the Query statement and the Title statement.

S103, in ec_att_matrix, determining a minimum internal matrix meeting a first condition, recording a language block and a label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement.

S104, for each Query language block, determining a minimum matrix meeting a second condition in edc_att_matrix, and determining a Title language block corresponding to the Query language block according to the minimum matrix.

S105, determining a synonym block pair according to the Query language block and the corresponding Title language block.

Optionally, in ec_att_matrix, determining a minimum internal matrix satisfying the first condition, and recording a language block and a label corresponding to the minimum internal matrix, includes:

wherein satisfying the first condition comprises:

2) the sum of ec_att_matrix[q][h] is greater than a second threshold T2, where i <= h <= i+k and h != q.

Determining the internal matrix as the minimum internal matrix, and putting the language blocks corresponding to the subscripts i to i + k into the Query language block set.

Optionally, for each Query language block, determining a minimum matrix satisfying a second condition in edc_att_matrix includes:

recording the subscripts of the first and last words of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting the second condition, and recording the first and last subscripts of the minimum matrix as T_begin and T_end,

wherein satisfying the second condition comprises:

for each row i, Q_begin <= i <= Q_end:

Optionally, determining a Title language block corresponding to the Query language block according to the minimum matrix includes:

Optionally, determining a synonym chunk pair according to the Query chunk and the corresponding Title chunk includes:

In the embodiment of the invention, the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine, together with the self-attention matrix and the encoder-decoder attention matrix inside the transformer model, a large number of synonym block pairs can be extracted automatically, retrieval mismatches between spoken-language and written-language expressions can be resolved, and both extraction efficiency and accuracy are high.

The embodiment of the invention provides a method for extracting synonym block pairs based on a transformer model, which can be realized by a block chain management node, wherein the block chain management node can be a terminal or a server. As shown in fig. 2, a flow chart of a method for extracting synonym block pairs based on a transformer model, a processing flow of the method may include the following steps:

S201, constructing corresponding Query-Title pairs as sample synonymous statement pairs from the Query-Doc click behavior in a search engine.
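How the click behavior is turned into sample pairs is not detailed here. One possible sketch, assuming a hypothetical click log of (query, clicked title) records and a hypothetical minimum click count used to filter out accidental clicks, is:

```python
from collections import Counter


def build_sample_pairs(click_log, min_clicks=2):
    """Keep the Query-Title pairs clicked at least min_clicks times.

    click_log is a hypothetical list of (query, clicked_title) tuples;
    the min_clicks filter is an assumption, not part of the description.
    """
    counts = Counter(click_log)
    return sorted(pair for pair, n in counts.items() if n >= min_clicks)


# Toy log: the last record is an invented one-off click.
log = [
    ("how far the electric vehicle runs", "electric vehicle endurance mileage"),
    ("how far the electric vehicle runs", "electric vehicle endurance mileage"),
    ("how far the electric vehicle runs", "electric vehicle price list"),
]
pairs = build_sample_pairs(log)
print(pairs)
```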

S202, training a transformer model based on the sample synonymous statement pairs.

It should be noted that the transformer model may be trained with a training method commonly used in the prior art, which is not described in detail in this embodiment.

S203, obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into the trained transformer model.

The statement pair to be extracted includes a Query statement and a Title statement. It may be obtained from Query-Doc click behavior in a search engine or in other ways, such as manual input by a user, which is not limited in the present invention.

S204, obtaining a self-attention matrix and an encoder-decoder attention matrix inside the transformer model.

In one possible embodiment, after the statement pair is input into the transformer model, the model first divides each statement into language blocks; for example, the Query statement "how far the latest electric vehicle runs" is divided into the language blocks "latest/electric vehicle/how far". The model then generates a self-attention matrix through the encoder layer. The self-attention matrix is denoted ec_att_matrix and represents the relationship between the Query statement and itself; for example, assuming that the Query statement is "how far the latest electric vehicle runs", ec_att_matrix may be as shown in table 1 below:

The encoder-decoder attention matrix is denoted edc_att_matrix and represents the relationship between the Query statement and the Title statement. For example, assuming that the Query statement is "how far the latest electric vehicle runs" and the Title statement is "the latest electric vehicle endurance mileage", edc_att_matrix may be as shown in table 2 below:
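The values of tables 1 and 2 are not available in this text. As a general reminder of what such matrices contain, the sketch below computes attention weights in the standard transformer way, as a row-wise softmax over scaled dot products, so that every row sums to 1. It is an illustration only: the toy vectors are hypothetical, learned projections and multiple heads are omitted, and the description does not state which layer or head the matrices are taken from.

```python
import math


def attention_weights(queries, keys):
    """Row-wise softmax of scaled dot products.

    weights[i][j] is the share of attention that position i of `queries`
    gives to position j of `keys`.
    """
    dim = len(keys[0])
    weights = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, key)) / math.sqrt(dim) for key in keys]
        top = max(scores)  # subtract the maximum for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Hypothetical 2-dimensional word vectors for a three-word sentence.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.2]]
ec_like = attention_weights(vectors, vectors)  # same sequence on both sides
for row in ec_like:
    print([round(w, 2) for w in row])
```

Because each row sums to 1, thresholds such as T1 can be read as the fraction of a word's attention that must fall inside a candidate block. Passing two different sequences to the same function gives a matrix of the encoder-decoder kind.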

S205, in ec_att_matrix, determining a minimum internal matrix meeting a first condition, recording a language block and a label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement.

Optionally, the minimum internal matrix satisfying the first condition may be determined as follows:

wherein satisfying the first condition comprises:

if, for an internal matrix, each row q satisfies the following two conditions:

Preferably, the first threshold T1 may be set to 0.85, and the second threshold T2 may be set to 0.15.

S206, recording the subscripts of the first and last words of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting a second condition, and recording the first and last subscripts of the minimum matrix as T_begin and T_end.

Wherein satisfying the second condition comprises: for each row i, Q_begin <= i <= Q_end:

In one possible embodiment, it is determined whether edc_att_matrix contains a square matrix in which the sum of the elements in each column is greater than or equal to the third threshold T3, and if so, the square matrix is determined as the minimum matrix satisfying the second condition.

Specifically, for each Query language block extracted as described above, the head word index and the tail word index of the Query language block are denoted S_begin and S_end respectively; taking table 2 of step S204 as an example, S_begin = 3 and S_end = 5.

A for loop may be used to find whether there is a square matrix in edc_att_matrix in which the sum of each column of elements is greater than or equal to the third threshold T3:

for i = S_begin; i <= S_end; i++, with S_begin <= j <= S_end:

the sum of edc_att_matrix[i][j] is greater than or equal to the third threshold T3.

If the minimum matrix exists, the corresponding language block is determined according to T_begin and T_end and is taken as the Title language block corresponding to the Query language block. In the example of table 2 of step S204 above, D_begin = 3 and D_end = 5, so the corresponding Title language block can be determined to be "electric vehicle endurance mileage".
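The search for the Title language block can be sketched as below. The loop above only shows the check for one candidate; scanning every contiguous column window from narrowest to widest is an assumption made here to make "minimum" concrete. The function name, the 0-based indices, the toy values and the threshold 0.8 are likewise hypothetical. This passage states the test as greater than or equal to T3, so the sketch uses >=.

```python
def find_title_span(edc_att_matrix, q_begin, q_end, t3):
    """Return (t_begin, t_end) of the narrowest column window in which every
    Query row q_begin..q_end places at least t3 of its attention, or None."""
    n_cols = len(edc_att_matrix[0])
    for width in range(1, n_cols + 1):  # narrowest windows first
        for t_begin in range(0, n_cols - width + 1):
            t_end = t_begin + width - 1
            if all(
                sum(edc_att_matrix[i][t_begin:t_end + 1]) >= t3
                for i in range(q_begin, q_end + 1)
            ):
                return t_begin, t_end
    return None


# Toy encoder-decoder attention matrix (hypothetical values).
toy_edc = [
    [0.85, 0.05, 0.05, 0.05],
    [0.05, 0.60, 0.30, 0.05],
    [0.05, 0.30, 0.60, 0.05],
    [0.05, 0.45, 0.45, 0.05],
]

print(find_title_span(toy_edc, 1, 3, 0.8))
```

For the Query block on rows 1 to 3 no single column reaches the threshold, so the first window that passes is columns 1 to 2.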

It should be noted that the process of screening the minimum matrix meeting the first condition in ec_att_matrix and the process of screening the minimum matrix meeting the second condition in edc_att_matrix may adopt the same screening method, which includes the following steps S2061 to S2067. For convenience of description, in steps S2061 to S2067 ec_att_matrix and edc_att_matrix are collectively referred to as the matrix to be screened, and the first condition and the second condition are collectively referred to as the screening condition:

S2061, acquiring the initial value of x, the initial value of k and the row number N of the matrix to be screened, wherein the initial value of x is 1 and the initial value of k is 1;

S2062, judging whether x is greater than or equal to N; if x is less than N, executing S2063; if x is greater than or equal to N, going to S2067;

S2063, in the matrix to be screened, determining a preselection matrix whose rows and columns both run from x to x+k, and judging whether the preselection matrix meets the screening condition;

S2064, if the preselection matrix meets the screening condition, determining the preselection matrix as a minimum matrix, setting x = x+k+1 and k = 1, and executing S2062; if the preselection matrix does not meet the screening condition, going to S2065;

S2065, judging whether k is equal to N-x; if k is not equal to N-x, setting k = k+1 and executing S2063; if k is equal to N-x, going to S2066;

S2066, setting x = x+1 and k = 1, and executing S2062;

S2067, ending the loop.

S207, determining a synonym block pair according to the Query language block and the corresponding Title language block.

In one possible implementation, a Query chunk and a corresponding Title chunk form a synonym chunk pair, and the synonym chunk pair is correspondingly stored.
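The storage format is not specified. One simple possibility, shown purely as an assumption, is a mapping from each Query language block to the set of Title language blocks extracted for it, which a retrieval system could consult when rewriting a spoken-style query:

```python
from collections import defaultdict


def store_pairs(pairs):
    """Group Title blocks under their Query block (hypothetical storage layout)."""
    table = defaultdict(set)
    for query_block, title_block in pairs:
        table[query_block].add(title_block)
    return table


synonym_table = store_pairs([
    ("how far the electric vehicle runs", "electric vehicle endurance mileage"),
])
print(sorted(synonym_table["how far the electric vehicle runs"]))
```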

In the embodiment of the invention, the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine, together with the self-attention matrix and the encoder-decoder attention matrix inside the transformer model, a large number of synonym block pairs can be extracted automatically, retrieval mismatches between spoken-language and written-language expressions can be resolved, and both extraction efficiency and accuracy are high.

FIG. 3 is a block diagram illustrating an apparatus for extracting synonym block pairs based on a transformer model, according to an exemplary embodiment. Referring to FIG. 3, the apparatus includes a first obtaining module 310, a second obtaining module 320, a first determining module 330, a second determining module 340, and a third determining module 350, wherein:

a first obtaining module 310, configured to obtain a statement pair to be extracted, and input the statement pair to be extracted into a transformer model for extracting a synonym block pair, where the statement pair to be extracted includes a Query statement and a Title statement;

a second obtaining module 320, configured to obtain a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relationship between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relationship between the Query statement and the Title statement;

a first determining module 330, configured to determine, in ec_att_matrix, a minimum internal matrix that meets a first condition, record a language block and a label that correspond to the minimum internal matrix, and determine the language block as a Query language block corresponding to the Query statement;

a second determining module 340, configured to determine, for each Query language block, a minimum matrix meeting a second condition in edc_att_matrix, and determine, according to the minimum matrix, a Title language block corresponding to the Query language block;

and a third determining module 350, configured to determine a synonym block pair according to the Query language block and the corresponding Title language block.

Optionally, the first determining module 330 is further configured to:

wherein satisfying the first condition comprises:

Optionally, the second determining module 340 is further configured to:

wherein satisfying the second condition comprises:

for each row i, Q_begin <= i <= Q_end:

Optionally, the second determining module 340 is further configured to:

Optionally, the third determining module 350 is configured to:

FIG. 4 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably in configuration or performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to implement the steps of the method for extracting synonym block pairs based on the transformer model.

In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions executable by a processor in a terminal to perform the above method for extracting synonym block pairs based on a transformer model. For example, the computer-readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims

1. A method for extracting synonym block pairs based on a transformer model, the method comprising:

acquiring a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relationship between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relationship between the Query statement and the Title statement;

2. The method according to claim 1, wherein the determining, in ec_att_matrix, a minimum internal matrix satisfying a first condition, and recording a language block and a label corresponding to the minimum internal matrix comprises:

wherein satisfying the first condition comprises:

3. The method according to claim 1, wherein said determining, for each Query language block, a minimum matrix satisfying a second condition in edc_att_matrix comprises:

wherein satisfying the second condition comprises:

for each row i, Q_begin <= i <= Q_end:

4. The method according to claim 3, wherein the determining, according to the minimum matrix, a Title language block corresponding to the Query language block comprises:

5. The method of claim 1, wherein determining synonym chunk pairs from the Query chunks and corresponding Title chunks comprises:

6. An apparatus for extracting synonym block pairs based on a transformer model, the apparatus comprising:

7. The apparatus of claim 6, wherein the first determining module is further configured to:

in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting a first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i+k, i+k);

wherein satisfying the first condition comprises:

8. The apparatus of claim 6, wherein the second determining module is further configured to:

wherein satisfying the second condition comprises:

for each row i, Q_begin <= i <= Q_end:

9. The apparatus of claim 8, wherein the second determining module is further configured to:

10. The apparatus of claim 6, wherein the third determining module is configured to: