CN114417838B - Method for extracting synonym block pairs based on transformer model - Google Patents

Method for extracting synonym block pairs based on transformer model


Publication number
CN114417838B
Authority
CN
China
Prior art keywords
matrix
block
att
query
determining
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210336467.8A
Other languages
Chinese (zh)
Other versions
CN114417838A (en)
Inventor
殷晓君
殷晓东
王诚文
王鸿滨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Original Assignee
BEIJING LANGUAGE AND CULTURE UNIVERSITY
Application filed by BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority to CN202210336467.8A
Publication of CN114417838A
Application granted
Publication of CN114417838B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of synonym block pair extraction, and in particular to a method for extracting synonym block pairs based on a transformer model. The method comprises: obtaining a statement pair to be extracted, inputting it into a transformer model, and obtaining the ec_att_matrix and the edc_att_matrix inside the transformer model; in the ec_att_matrix, determining a minimum internal matrix meeting a first condition, recording the corresponding language block and label, and determining the language block as a Query language block; for each Query language block, determining, in the edc_att_matrix, a minimum matrix meeting a second condition, and determining the Title language block corresponding to the Query language block; and determining a synonym block pair from the Query language block and the corresponding Title language block. The method addresses retrieval mismatches between spoken and written expressions and improves extraction efficiency and accuracy.

Description

Method for extracting synonym block pair based on transformer model
Technical Field
The invention relates to the technical field of synonym block pair extraction, in particular to a method and a device for extracting synonym block pairs based on a transformer model.
Background
Synonym block pairs are pairs of language blocks (chunks) that are synonymous only in combination with certain context information; they are not simple synonyms. For example, "how far the electric vehicle runs" and "electric vehicle range" form a synonym block pair, but neither "run / range" nor "how far / range" can be used on its own as a context-free synonym pair.
Synonym blocks mainly address cases where the same meaning is expressed differently, in particular where spoken and written expressions are inconsistent, which is one of the main problems faced by search engines. A user habitually enters a spoken-style Query such as "how far the electric vehicle runs", whereas the articles indexed by the search engine are mostly in written language, which habitually expresses the same idea as "electric vehicle endurance mileage". If a corresponding synonym block pair is available, the matching result can be retrieved correctly. Therefore, an efficient and fast method for extracting synonym block pairs is needed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for extracting synonym block pairs based on a transformer model. The technical scheme is as follows:
In one aspect, a method for extracting synonym block pairs based on a transformer model is provided, where the method is implemented by a blockchain management node, and the method includes:
obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into a transformer model for extracting a synonym block pair, wherein the statement pair to be extracted comprises a Query statement and a Title statement;
acquiring a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relation between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relation between the Query statement and the Title statement;
determining, in the ec_att_matrix, a minimum internal matrix meeting a first condition, recording the language block and label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement;
for each Query language block, determining, in the edc_att_matrix, a minimum matrix meeting a second condition, and determining a Title language block corresponding to the Query language block according to the minimum matrix;
and determining a synonym block pair according to the Query block and the corresponding Title block.
Optionally, the determining, in the ec_att_matrix, a minimum internal matrix that meets a first condition, and recording a language block and a label corresponding to the minimum internal matrix includes:
in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting the first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i + k, i + k);
wherein satisfying the first condition comprises:
an internal matrix satisfies the first condition if, for every row q with i <= q <= i + k, both of the following hold:
1) the sum of ec_att_matrix[q][p] over i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over i <= h <= i + k with h != q is greater than a second threshold T2;
determining the internal matrix as the minimum internal matrix, and putting the language blocks corresponding to the subscripts i to i + k into a Query language block set.
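The first condition can be read as a per-row check on a square sub-matrix. The following Python sketch is one possible reading, not the patented implementation: the function name `satisfies_first_condition`, the 0-based indexing, the toy matrix, and the threshold values passed in the example are all assumptions made for illustration.

```python
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def satisfies_first_condition(m: Matrix, i: int, k: int, t1: float, t2: float) -> bool:
    """Check the square sub-matrix with corners (i, i) and (i + k, i + k).

    Indices are 0-based here (an assumption; the text does not fix a base).
    """
    lo, hi = i, i + k
    for q in range(lo, hi + 1):
        # Condition 1: attention mass that row q places inside the block.
        inside = sum(m[q][p] for p in range(lo, hi + 1))
        # Condition 2: the same sum, excluding the diagonal entry (h != q).
        others = sum(m[q][h] for h in range(lo, hi + 1) if h != q)
        if not (inside > t1 and others > t2):
            return False
    return True


# Hypothetical 3 x 3 matrix: token 0 attends mostly to itself,
# tokens 1 and 2 attend mostly to each other.
toy = [
    [0.90, 0.05, 0.05],
    [0.05, 0.55, 0.40],
    [0.05, 0.45, 0.50],
]

block_1_2 = satisfies_first_condition(toy, 1, 1, t1=0.85, t2=0.15)
block_0_1 = satisfies_first_condition(toy, 0, 1, t1=0.85, t2=0.15)
```

Condition 2 is what rejects a token that attends almost only to itself: in the toy matrix, the candidate block starting at token 0 fails because row 0 places too little weight on the other token in the block.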
Optionally, for each Query language block, determining, in the edc_att_matrix, a minimum matrix satisfying a second condition includes:
denoting the head-word and tail-word subscripts of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting the second condition, and denoting the head-word and tail-word subscripts of the minimum matrix as T_begin and T_end,
wherein satisfying the second condition comprises:
for each row i, Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over T_begin <= j <= T_end is greater than a third threshold T3.
Optionally, the determining, according to the minimum matrix, a Title language block corresponding to the Query language block includes:
determining a corresponding language block according to T_begin and T_end, and determining the language block as the Title language block corresponding to the Query language block.
Optionally, the determining a synonym block pair according to the Query block and the corresponding Title block includes:
and forming a synonym block pair by the Query block and the corresponding Title block, and correspondingly storing.
In another aspect, an apparatus for extracting synonym block pairs based on a transformer model is provided. The apparatus is applied to the method for extracting synonym block pairs based on a transformer model, and includes:
a first acquisition module, configured to obtain a statement pair to be extracted and input the statement pair to be extracted into a transformer model for extracting a synonym block pair, wherein the statement pair to be extracted comprises a Query statement and a Title statement;
a second acquisition module, configured to acquire a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relation between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relation between the Query statement and the Title statement;
a first determining module, configured to determine, in the ec_att_matrix, a minimum internal matrix meeting a first condition, record the language block and label corresponding to the minimum internal matrix, and determine the language block as a Query language block corresponding to the Query statement;
a second determining module, configured to determine, for each Query language block, a minimum matrix meeting a second condition in the edc_att_matrix, and determine, according to the minimum matrix, a Title language block corresponding to the Query language block;
and the third determining module is used for determining the synonym block pair according to the Query block and the corresponding Title block.
Optionally, the first determining module is further configured to:
in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting the first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i + k, i + k);
wherein satisfying the first condition comprises:
an internal matrix satisfies the first condition if, for every row q with i <= q <= i + k, both of the following hold:
1) the sum of ec_att_matrix[q][p] over i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over i <= h <= i + k with h != q is greater than a second threshold T2;
determining the internal matrix as the minimum internal matrix, and putting the language blocks corresponding to the subscripts i to i + k into a Query language block set.
Optionally, the second determining module is further configured to:
denoting the head-word and tail-word subscripts of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting the second condition, and denoting the head-word and tail-word subscripts of the minimum matrix as T_begin and T_end,
wherein satisfying the second condition comprises:
for each row i, Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over T_begin <= j <= T_end is greater than a third threshold T3.
Optionally, the second determining module is further configured to:
determine a corresponding language block according to T_begin and T_end, and determine the language block as the Title language block corresponding to the Query language block.
Optionally, the third determining module is configured to:
and forming a synonym block pair by the Query block and the corresponding Title block, and correspondingly storing.
In another aspect, an electronic device is provided. The electronic device includes a processor and a memory; the memory stores at least one instruction, which is loaded and executed by the processor to implement the above method for extracting synonym block pairs based on a transformer model.
In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the above method for extracting synonym block pairs based on a transformer model.
The technical solutions provided by the embodiments of the invention have at least the following beneficial effects:
the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine and on the self-attention matrix and encoder-decoder attention matrix inside a transformer model, a large number of synonym block pairs can be extracted automatically. This addresses retrieval mismatches between spoken and written expressions, with high extraction efficiency and high accuracy.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for extracting synonym block pairs based on a transformer model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for extracting synonym block pairs based on a transformer model according to an embodiment of the present invention;
FIG. 3 is a block diagram of an apparatus for extracting synonym block pairs based on a transformer model according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.
The embodiment of the invention provides a method for extracting synonym block pairs based on a transformer model, which can be implemented by a blockchain management node; the blockchain management node can be a terminal or a server. FIG. 1 is a flowchart of the method, whose processing flow may include the following steps:
S101, obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into a transformer model for extracting a synonym block pair, wherein the statement pair to be extracted comprises a Query statement and a Title statement.
S102, obtaining a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relation between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relation between the Query statement and the Title statement.
S103, in the ec_att_matrix, determining a minimum internal matrix meeting a first condition, recording the language block and label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement.
S104, for each Query language block, determining a minimum matrix meeting a second condition in the edc_att_matrix, and determining a Title language block corresponding to the Query language block according to the minimum matrix.
S105, determining a synonym block pair according to the Query language block and the corresponding Title language block.
Optionally, in the ec _ att _ matrix, determining a minimum internal matrix satisfying the first condition, and recording a language block and a label corresponding to the minimum internal matrix, including:
in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting the first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i + k, i + k);
wherein satisfying the first condition comprises:
an internal matrix satisfies the first condition if, for every row q with i <= q <= i + k, both of the following hold:
1) the sum of ec_att_matrix[q][p] over i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over i <= h <= i + k with h != q is greater than a second threshold T2;
determining the internal matrix as the minimum internal matrix, and putting the language blocks corresponding to the subscripts i to i + k into the Query language block set.
Optionally, for each Query language block, determining, in edc_att_matrix, a minimum matrix satisfying a second condition includes:
denoting the head-word and tail-word subscripts of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting the second condition, and denoting the head-word and tail-word subscripts of the minimum matrix as T_begin and T_end,
wherein satisfying the second condition comprises:
for each row i, Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over T_begin <= j <= T_end is greater than a third threshold T3.
Optionally, determining a Title language block corresponding to the Query language block according to the minimum matrix includes:
determining a corresponding language block according to T_begin and T_end, and determining the language block as the Title language block corresponding to the Query language block.
Optionally, determining a synonym chunk pair according to the Query chunk and the corresponding Title chunk includes:
and forming a synonym block pair by the Query block and the corresponding Title block, and correspondingly storing.
In the embodiment of the invention, the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine and on the self-attention matrix and encoder-decoder attention matrix inside a transformer model, a large number of synonym block pairs can be extracted automatically. This addresses retrieval mismatches between spoken and written expressions, with high extraction efficiency and high accuracy.
The embodiment of the invention provides a method for extracting synonym block pairs based on a transformer model, which can be implemented by a blockchain management node; the blockchain management node can be a terminal or a server. FIG. 2 is a flowchart of the method, whose processing flow may include the following steps:
S201, constructing corresponding Query-Title pairs as sample synonymous statement pairs from the Query-Doc click behavior in a search engine.
S202, training a transformer model based on the sample synonymous statement pairs.
It should be noted that the transformer model may be trained using a training method commonly used in the prior art, which is not described in detail in the embodiment of the present invention.
S203, obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into the trained transformer model.
The statement pair to be extracted includes a Query statement and a Title statement, and the statement pair to be extracted may be obtained by a click action of a Query-Doc in a search engine or may be obtained by other manners, such as manual input by a user, which is not limited in the present invention.
S204, obtaining the self-attention matrix and the encoder-decoder attention matrix inside the transformer model.
In one possible embodiment, after the statement pair is input into the transformer model, the model first divides each statement into language blocks; for example, the Query statement "how far the latest electric vehicle runs" is divided into the language blocks "latest/electric vehicle/how far". The model then generates the self-attention matrix through its encoder layer; this matrix is denoted ec_att_matrix and represents the relation between the Query statement and itself. For example, for the Query statement "how far the latest electric vehicle runs", ec_att_matrix may be as shown in Table 1 below:
[Table 1: example ec_att_matrix for the Query statement; image not reproduced]
The encoder-decoder attention matrix is denoted edc_att_matrix and represents the relation between the Query statement and the Title statement. For example, for the Query statement "how far the latest electric vehicle runs" and the Title statement "the latest electric vehicle endurance mileage", edc_att_matrix may be as shown in Table 2 below:
[Table 2: example edc_att_matrix for the Query and Title statements; image not reproduced]
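As background for the thresholds used in the following steps: in standard scaled dot-product attention, each row of an attention matrix is a softmax output, so its entries are non-negative and sum to 1, and a row-wise partial sum can be read as a share of attention mass. The Python sketch below illustrates this under simplifying assumptions that are not taken from the patent: a single head, no learned projections, and hypothetical toy vectors. How the patent selects the layer or head from which the matrices are read is not stated in the text.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax over one row of scores."""
    peak = max(xs)
    exps = [math.exp(x - peak) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries: List[Vector], keys: List[Vector]) -> List[Vector]:
    """Row q holds the attention that position q pays to every key position."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]


# Hypothetical token vectors standing in for model-internal states.
query_vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
title_vecs = [[1.0, 0.0], [0.5, 0.5]]

# Self-attention: Query tokens against Query tokens (square, 3 x 3).
ec_att_matrix = attention_weights(query_vecs, query_vecs)
# Cross-attention: Query tokens (rows) against Title tokens (columns), 3 x 2.
edc_att_matrix = attention_weights(query_vecs, title_vecs)

row_sums = [sum(row) for row in ec_att_matrix + edc_att_matrix]
```

Because every row sums to 1, a threshold such as 0.85 on the sum over a block of columns means that at least 85% of a token's attention falls inside that block.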
S205, in ec_att_matrix, determining a minimum internal matrix meeting a first condition, recording the language block and label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement.
Optionally, the minimum internal matrix satisfying the first condition may be determined as follows:
in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting the first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i + k, i + k);
wherein satisfying the first condition comprises:
an internal matrix satisfies the first condition if, for every row q with i <= q <= i + k, both of the following hold:
1) the sum of ec_att_matrix[q][p] over i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over i <= h <= i + k with h != q is greater than a second threshold T2;
determining the internal matrix as the minimum internal matrix, and putting the language blocks corresponding to the subscripts i to i + k into the Query language block set.
Preferably, the first threshold T1 may be set to 0.85, and the second threshold T2 may be set to 0.15.
S206, denoting the head-word and tail-word subscripts of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting a second condition, and denoting the head-word and tail-word subscripts of the minimum matrix as T_begin and T_end.
Satisfying the second condition comprises: for each row i, Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over T_begin <= j <= T_end is greater than a third threshold T3.
In one possible embodiment, it is determined whether edc_att_matrix contains a square matrix in which the sum of the elements in each column is greater than or equal to the third threshold T3; if so, that square matrix is determined as the minimum matrix satisfying the second condition.
Specifically, for each Query language block extracted as described above, its head-word and tail-word subscripts (Q_begin and Q_end above) are denoted S_begin and S_end; taking Table 2 of step S204 as an example, S_begin = 3 and S_end = 5.
A for loop may be used to check whether edc_att_matrix contains a square matrix in which the sum of each column of elements is greater than or equal to the third threshold T3:
for i = S_begin; i <= S_end; i++, with S_begin <= j <= S_end:
the sum of edc_att_matrix[i][j] is greater than or equal to the third threshold T3.
If such a minimum matrix exists, the corresponding language block is determined from T_begin and T_end and is taken as the Title language block corresponding to the Query language block. In the example of Table 2 of step S204, the Title subscripts (written D_begin and D_end here) are D_begin = 3 and D_end = 5, so the corresponding Title language block is "electric vehicle endurance mileage".
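The search for the Title language block can be sketched in Python as follows. This is an illustrative reading of the second condition rather than the patent's own code. It rests on these assumptions: rows of the matrix index Query tokens and columns index Title tokens (following the `[i][j]` indexing above), "minimum" means the narrowest column window with the leftmost window winning ties, indices are 0-based, and the value 0.8 for T3 and the toy matrix are hypothetical because the text gives no value for T3.

```python
from typing import Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def find_title_span(m: Matrix, q_begin: int, q_end: int, t3: float) -> Optional[Tuple[int, int]]:
    """Return the narrowest (t_begin, t_end) such that every Query row in
    q_begin..q_end puts more than t3 of its attention on columns t_begin..t_end.
    """
    n_cols = len(m[0])
    for width in range(1, n_cols + 1):              # try narrow windows first
        for t_begin in range(0, n_cols - width + 1):
            t_end = t_begin + width - 1
            if all(sum(m[i][t_begin:t_end + 1]) > t3
                   for i in range(q_begin, q_end + 1)):
                return t_begin, t_end
    return None                                      # no window qualifies


# Hypothetical 5 x 4 cross-attention matrix (5 Query tokens, 4 Title tokens).
toy_edc = [
    [0.90, 0.05, 0.03, 0.02],
    [0.05, 0.90, 0.03, 0.02],
    [0.02, 0.03, 0.60, 0.35],
    [0.02, 0.03, 0.35, 0.60],
    [0.02, 0.03, 0.30, 0.65],
]

span_for_chunk = find_title_span(toy_edc, q_begin=2, q_end=4, t3=0.8)
span_for_first_token = find_title_span(toy_edc, q_begin=0, q_end=0, t3=0.8)
```

In the toy matrix, Query tokens 2 to 4 place their attention on Title columns 2 and 3, so the returned window is `(2, 3)`. The square-matrix variant described above corresponds to restricting the window to the same width as the Query language block.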
It should be noted that the process of screening the minimum matrix meeting the first condition in the ec_att_matrix and the process of screening the minimum matrix meeting the second condition in the edc_att_matrix may use the same screening method, comprising the following steps S2061 to S2067. For convenience of description, the ec_att_matrix and the edc_att_matrix are collectively referred to below as the matrix to be screened, and the first condition and the second condition are collectively referred to as the screening condition:
S2061, acquiring an initial value of x, an initial value of k, and the row number N of the matrix to be screened, wherein the initial value of x is 1 and the initial value of k is 1;
S2062, judging whether x is greater than or equal to N; if x is less than N, executing S2063; if x is greater than or equal to N, going to S2067;
S2063, in the matrix to be screened, determining a preselected matrix whose rows and columns run from x to x + k, and judging whether the preselected matrix meets the screening condition;
S2064, if the preselected matrix meets the screening condition, determining the preselected matrix as a minimum matrix, setting x = x + k + 1 and k = 1, and executing S2062; if the preselected matrix does not meet the screening condition, going to S2065;
S2065, judging whether k is equal to N - x; if k is not equal to N - x, setting k = k + 1 and executing S2063; if k is equal to N - x, going to S2066;
S2066, setting x = x + 1 and k = 1, and executing S2062;
S2067, ending the loop.
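Steps S2061 to S2067 describe a greedy left-to-right scan that grows a candidate block until it qualifies or reaches the end of the matrix. A minimal Python sketch of that control flow follows, keeping the 1-based x and k of the text. The function names, the way the screening condition is passed in as a callback, and the toy matrix are assumptions for illustration; the thresholds 0.85 and 0.15 are the preferred values of T1 and T2 given in the description.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]


def scan_minimum_blocks(n: int, meets: Callable[[int, int], bool]) -> List[Tuple[int, int]]:
    """Return 1-based (begin, end) index pairs of the blocks found."""
    blocks = []
    x, k = 1, 1                          # S2061: initial values
    while x < n:                         # S2062: stop once x >= N
        if meets(x, k):                  # S2063: test rows/columns x..x+k
            blocks.append((x, x + k))    # S2064: record, jump past the block
            x, k = x + k + 1, 1
        elif k != n - x:                 # S2065: grow the candidate block
            k += 1
        else:                            # S2066: give up on this start index
            x, k = x + 1, 1
    return blocks                        # S2067: loop finished


def first_condition(m: Matrix, t1: float, t2: float) -> Callable[[int, int], bool]:
    """Build a screening callback implementing the first condition."""
    def meets(x: int, k: int) -> bool:
        rows = range(x - 1, x + k)       # 1-based x..x+k as 0-based indices
        for q in rows:
            inside = sum(m[q][p] for p in rows)
            others = sum(m[q][h] for h in rows if h != q)
            if not (inside > t1 and others > t2):
                return False
        return True
    return meets


# Hypothetical 5 x 5 self-attention matrix; tokens 3..5 attend to one another.
toy_ec = [
    [0.90, 0.04, 0.03, 0.02, 0.01],
    [0.05, 0.88, 0.03, 0.02, 0.02],
    [0.02, 0.03, 0.50, 0.25, 0.20],
    [0.02, 0.03, 0.25, 0.45, 0.25],
    [0.02, 0.03, 0.20, 0.30, 0.45],
]

query_blocks = scan_minimum_blocks(len(toy_ec), first_condition(toy_ec, 0.85, 0.15))
```

On this toy input the scan returns the single block `(3, 5)`. Because k starts at 1, the procedure as written never proposes a single-token block, and because x jumps to x + k + 1 after a hit, the blocks it returns do not overlap.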
S207, determining a synonym block pair according to the Query language block and the corresponding Title language block.
In one possible implementation, a Query chunk and a corresponding Title chunk form a synonym chunk pair, and the synonym chunk pair is correspondingly stored.
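One simple way to pair and store the blocks is sketched below in Python. The storage layout (a dictionary from Query block text to a set of Title block texts), the helper names, and the English stand-in tokens are assumptions for illustration; the text only states that the pair is formed and stored correspondingly.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Span = Tuple[int, int]  # 1-based, inclusive (begin, end) subscripts


def span_text(tokens: List[str], span: Span, sep: str = " ") -> str:
    """Join the tokens covered by a 1-based inclusive span."""
    begin, end = span
    return sep.join(tokens[begin - 1:end])


def store_pair(store: DefaultDict[str, Set[str]],
               query_tokens: List[str], q_span: Span,
               title_tokens: List[str], t_span: Span) -> Tuple[str, str]:
    """Form one synonym block pair and record it under its Query block."""
    query_block = span_text(query_tokens, q_span)
    title_block = span_text(title_tokens, t_span)
    store[query_block].add(title_block)
    return query_block, title_block


# Hypothetical English stand-ins for the segmented Query and Title statements.
query_tokens = ["latest", "of", "electric-vehicle", "runs", "how-far"]
title_tokens = ["latest", "of", "electric-vehicle", "endurance", "mileage"]

pairs: DefaultDict[str, Set[str]] = defaultdict(set)
pair = store_pair(pairs, query_tokens, (3, 5), title_tokens, (3, 5))
```

Keying the store by the Query block lets one spoken-style block accumulate several written-style equivalents as more statement pairs are processed.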
In the embodiment of the invention, the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine and on the self-attention matrix and encoder-decoder attention matrix inside a transformer model, a large number of synonym block pairs can be extracted automatically. This addresses retrieval mismatches between spoken and written expressions, with high extraction efficiency and high accuracy.
FIG. 3 is a block diagram illustrating an apparatus for extracting synonym block pairs based on a transformer model, according to an example embodiment. Referring to FIG. 3, the apparatus includes a first obtaining module 310, a second obtaining module 320, a first determining module 330, a second determining module 340, and a third determining module 350, wherein:
a first obtaining module 310, configured to obtain a statement pair to be extracted, and input the statement pair to be extracted into a transformer model for extracting a synonym block pair, where the statement pair to be extracted includes a Query statement and a Title statement;
a second obtaining module 320, configured to obtain a self-attention matrix and an encoder-decoder attention matrix inside the transformer model; the self-attention matrix is denoted ec_att_matrix and represents the relation between the Query statement and itself, and the encoder-decoder attention matrix is denoted edc_att_matrix and represents the relation between the Query statement and the Title statement;
a first determining module 330, configured to determine, in the ec_att_matrix, a minimum internal matrix that meets a first condition, record the language block and label that correspond to the minimum internal matrix, and determine the language block as a Query language block corresponding to the Query statement;
a second determining module 340, configured to determine, for each Query language block, a minimum matrix meeting a second condition in the edc_att_matrix, and determine, according to the minimum matrix, a Title language block corresponding to the Query language block;
and a third determining module 350, configured to determine a synonym block pair according to the Query block and the corresponding Title block.
Optionally, the first determining module 330 is further configured to:
in ec_att_matrix, for a current index i, i <= N, finding a minimum internal matrix meeting the first condition, and marking the upper-left index of the minimum internal matrix as (i, i) and the lower-right index as (i + k, i + k);
wherein satisfying the first condition comprises:
an internal matrix satisfies the first condition if, for every row q with i <= q <= i + k, both of the following hold:
1) the sum of ec_att_matrix[q][p] over i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over i <= h <= i + k with h != q is greater than a second threshold T2;
determining the internal matrix as the minimum internal matrix, and putting the language blocks corresponding to the subscripts i to i + k into a Query language block set.
Optionally, the second determining module 340 is further configured to:
denoting the head-word and tail-word subscripts of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting the second condition, and denoting the head-word and tail-word subscripts of the minimum matrix as T_begin and T_end,
wherein satisfying the second condition comprises:
for each row i, Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over T_begin <= j <= T_end is greater than a third threshold T3.
Optionally, the second determining module 340 is further configured to:
determine a corresponding language block according to T_begin and T_end, and determine the language block as the Title language block corresponding to the Query language block.
Optionally, the third determining module 350 is configured to:
and forming a synonym block pair by the Query block and the corresponding Title block, and correspondingly storing.
In the embodiment of the invention, the scheme does not depend on expert knowledge; based on massive Query-Doc click behavior in a search engine and on the self-attention matrix and encoder-decoder attention matrix inside a transformer model, a large number of synonym block pairs can be extracted automatically. This addresses retrieval mismatches between spoken and written expressions, with high extraction efficiency and high accuracy.
FIG. 4 is a schematic structural diagram of an electronic device 400 according to an embodiment of the present invention. The electronic device 400 may vary considerably with configuration or performance, and may include one or more processors (central processing units, CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402 and is loaded and executed by the processor 401 to implement the steps of the above method for extracting synonym block pairs based on a transformer model.
In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including instructions that can be executed by a processor in a terminal to perform the above method for extracting synonym block pairs based on a transformer model. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, which is intended to cover any modifications, equivalents, improvements, etc. within the spirit and scope of the present invention.

Claims (10)

1. A method for extracting synonym block pairs based on a transformer model, the method comprising:
obtaining a statement pair to be extracted, and inputting the statement pair to be extracted into a transformer model for extracting synonym block pairs, wherein the statement pair to be extracted comprises a Query statement and a Title statement;
acquiring a self-attention matrix and an encoder-decoder attention matrix in the transformer model; the self-attention matrix is recorded as ec_att_matrix, the ec_att_matrix is used for representing the relationship between the Query statement and itself, the encoder-decoder attention matrix is recorded as edc_att_matrix, and the edc_att_matrix is used for representing the relationship between the Query statement and the Title statement;
determining a minimum internal matrix meeting a first condition in the ec_att_matrix, recording a language block and a label corresponding to the minimum internal matrix, and determining the language block as a Query language block corresponding to the Query statement;
for each Query language block, determining a minimum matrix meeting a second condition in the edc_att_matrix, and determining a Title language block corresponding to the Query language block according to the minimum matrix;
and determining a synonym block pair according to the Query language block and the corresponding Title language block.
2. The method according to claim 1, wherein the determining, in the ec_att_matrix, a minimum internal matrix satisfying a first condition, and recording a language block and a label corresponding to the minimum internal matrix comprises:
in the ec_att_matrix, for a current subscript i with i <= N, searching for a minimum internal matrix meeting a first condition, and recording the upper-left subscript of the minimum internal matrix as (i, i) and the lower-right subscript as (i + k, i + k);
wherein satisfying the first condition comprises:
if there exists an internal matrix such that, for each row q with i <= q <= i + k, the following two conditions are both satisfied:
1) the sum of ec_att_matrix[q][p] over all p with i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over all h with i <= h <= i + k and h != q is greater than a second threshold T2;
then determining the internal matrix as the minimum internal matrix, and putting the language block corresponding to subscripts i to i + k into a Query language block set.
3. The method according to claim 1, wherein said determining, for each Query language block, a minimum matrix satisfying a second condition in said edc_att_matrix comprises:
recording the subscripts of the first and last words of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting a second condition, and recording the subscripts of the first and last words of the minimum matrix as T_begin and T_end,
wherein satisfying the second condition comprises:
for each row i with Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over all j with T_begin <= j <= T_end is greater than a third threshold T3.
4. The method according to claim 3, wherein the determining, according to the minimum matrix, a Title language block corresponding to the Query language block comprises:
determining the corresponding language block according to T_begin and T_end, and determining that language block as the Title language block corresponding to the Query language block.
5. The method of claim 1, wherein determining a synonym block pair according to the Query language block and the corresponding Title language block comprises:
forming a synonym block pair from the Query language block and the corresponding Title language block, and storing the pair.
6. An apparatus for extracting synonym block pairs based on a transformer model, the apparatus comprising:
a first acquisition module, configured to obtain a statement pair to be extracted and input the statement pair to be extracted into a transformer model for extracting synonym block pairs, wherein the statement pair to be extracted comprises a Query statement and a Title statement;
a second acquisition module, configured to acquire a self-attention matrix and an encoder-decoder attention matrix in the transformer model; the self-attention matrix is recorded as ec_att_matrix, the ec_att_matrix is used for representing the relationship between the Query statement and itself, the encoder-decoder attention matrix is recorded as edc_att_matrix, and the edc_att_matrix is used for representing the relationship between the Query statement and the Title statement;
a first determining module, configured to determine, in the ec_att_matrix, a minimum internal matrix meeting a first condition, record a language block and a label corresponding to the minimum internal matrix, and determine the language block as a Query language block corresponding to the Query statement;
a second determining module, configured to determine, for each Query language block, a minimum matrix meeting a second condition in the edc_att_matrix, and determine, according to the minimum matrix, a Title language block corresponding to the Query language block;
and a third determining module, configured to determine a synonym block pair according to the Query language block and the corresponding Title language block.
7. The apparatus of claim 6, wherein the first determining module is further configured to:
in the ec_att_matrix, for a current subscript i with i <= N, searching for a minimum internal matrix meeting a first condition, and recording the upper-left subscript of the minimum internal matrix as (i, i) and the lower-right subscript as (i + k, i + k);
wherein satisfying the first condition comprises:
if there exists an internal matrix such that, for each row q with i <= q <= i + k, the following two conditions are both satisfied:
1) the sum of ec_att_matrix[q][p] over all p with i <= p <= i + k is greater than a first threshold T1;
2) the sum of ec_att_matrix[q][h] over all h with i <= h <= i + k and h != q is greater than a second threshold T2;
then determining the internal matrix as the minimum internal matrix, and putting the language block corresponding to subscripts i to i + k into a Query language block set.
8. The apparatus of claim 6, wherein the second determining module is further configured to:
recording the subscripts of the first and last words of the current Query language block as Q_begin and Q_end, searching for a minimum matrix meeting a second condition, and recording the subscripts of the first and last words of the minimum matrix as T_begin and T_end,
wherein satisfying the second condition comprises:
for each row i with Q_begin <= i <= Q_end:
the sum of edc_att_matrix[i][j] over all j with T_begin <= j <= T_end is greater than a third threshold T3.
9. The apparatus of claim 8, wherein the second determining module is further configured to:
determining the corresponding language block according to T_begin and T_end, and determining that language block as the Title language block corresponding to the Query language block.
10. The apparatus of claim 6, wherein the third determining module is configured to:
forming a synonym block pair from the Query language block and the corresponding Title language block, and storing the pair.
CN202210336467.8A 2022-04-01 2022-04-01 Method for extracting synonym block pairs based on transformer model Active CN114417838B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210336467.8A CN114417838B (en) 2022-04-01 2022-04-01 Method for extracting synonym block pairs based on transformer model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210336467.8A CN114417838B (en) 2022-04-01 2022-04-01 Method for extracting synonym block pairs based on transformer model

Publications (2)

Publication Number Publication Date
CN114417838A CN114417838A (en) 2022-04-29
CN114417838B CN114417838B (en) 2022-06-21

Family

ID=81264286

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210336467.8A Active CN114417838B (en) 2022-04-01 2022-04-01 Method for extracting synonym block pairs based on transformer model

Country Status (1)

Country Link
CN (1) CN114417838B (en)

Also Published As

Publication number Publication date
CN114417838A (en) 2022-04-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant