CN114547251A - Two-stage folk story retrieval method based on BERT - Google Patents
Two-stage folk story retrieval method based on BERT
- Publication number
- CN114547251A (application number CN202210188618.XA)
- Authority
- CN
- China
- Prior art keywords
- folk
- story
- bert
- vector
- folk story
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 64
- 239000013598 vector Substances 0.000 claims abstract description 52
- 238000012549 training Methods 0.000 claims abstract description 40
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 238000012216 screening Methods 0.000 claims abstract description 5
- 230000006870 function Effects 0.000 claims description 22
- 238000010276 construction Methods 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 9
- 238000004364 calculation method Methods 0.000 claims description 6
- 238000012545 processing Methods 0.000 claims description 6
- 238000012360 testing method Methods 0.000 claims description 4
- 238000004458 analytical method Methods 0.000 claims description 3
- 230000009193 crawling Effects 0.000 claims description 3
- 238000013507 mapping Methods 0.000 claims description 3
- 230000011218 segmentation Effects 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 230000001186 cumulative effect Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000000052 comparative effect Effects 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3341—Query execution using boolean model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A two-stage folk story retrieval method based on the BERT model comprises the steps of collecting folk stories, preprocessing the folk story data, constructing a folk story data set, building a one-stage vector search engine, screening a candidate folk story set, training the BERT model, determining relevance in the second stage, and displaying the retrieval results. Experimental results show that, compared with traditional retrieval methods, the method better understands the contextual information of folk stories, combines the query request with the folk stories more effectively, improves retrieval accuracy, and accelerates retrieval speed. The invention is characterized by accurate retrieval results and high retrieval speed, and can accurately find, among massive collections of folk stories, the stories a user wants to know.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a computer information retrieval system.
Background
The 21st century is an information age. The development of the Internet has made the collection of folk stories simple, greatly increasing the number of folk stories available to people. At the same time, the threshold and difficulty of text information processing keep rising, as do the demands on the quality and efficiency of text retrieval technology. Finding a folk story that meets a requirement among a large number of folk stories usually takes a lot of time, and the retrieval results often fall short of expectations. There are many traditional retrieval methods, such as methods based on text similarity calculation, ontology-based retrieval, and clustering-based retrieval. First, if the data set to be retrieved is large, traditional retrieval methods are extremely time-consuming and their retrieval accuracy is low. Second, folk stories often contain rich textual content, and relying solely on shallow text features is far from sufficient. Therefore, finding a new retrieval method is very important. Folk stories contain rich historical knowledge and deep national feeling, and they are diverse in kind and large in number. How to query the relevant folk stories from such varied and massive collections has become the core difficulty of folk story retrieval.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a two-stage folk story retrieval method based on a BERT model, which has high retrieval accuracy and high retrieval speed.
The technical scheme for solving the technical problems comprises the following steps:
(1) gathering folk stories
Part of the folk stories are found in a folk culture resource management system, and a web crawler is used to crawl down their text data, obtaining the folk stories.
(2) Folk story data preprocessing
Garbled characters, empty content, and parts inconsistent with the content are deleted, and synonyms in the folk story contents are randomly replaced.
(3) Construction of folk story data set
Folk stories are processed into title:content pairs to produce the folk story data set Y, Y = {t1:c1, t2:c2, …, tn:cn}, where tn represents the title of the nth folk story and cn its content. Here n = 10000 folk stories are selected and divided into a training set and a test set in a 9:1 ratio.
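As a minimal sketch of this step, the pairing and 9:1 split can be expressed as follows; the `stories` list and the field layout are illustrative assumptions, not the patent's actual data source.

```python
# Sketch of step (3): build title:content pairs and a 9:1 train/test split.
import random

def build_dataset(stories, seed=42):
    """stories: list of (title, content) tuples produced by steps (1)-(2)."""
    pairs = [f"{t}:{c}" for t, c in stories]      # Y = {t1:c1, ..., tn:cn}
    random.Random(seed).shuffle(pairs)            # deterministic shuffle
    cut = int(len(pairs) * 0.9)                   # 9:1 ratio
    return pairs[:cut], pairs[cut:]

# 10000 synthetic stories stand in for the crawled corpus
train, test = build_dataset([(f"title{i}", f"content{i}") for i in range(10000)])
```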
(4) One-stage construction of vector search engine
The folk story data set Y is converted into word vectors J using the BERT-whitening model, and a database vector D is built from J with the Faiss retrieval library, D = {d1, d2, …, dn}. The database vector D is divided into N spaces using an inverted file index, where N is a finite positive integer, and the vector search engine is thereby constructed.
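The inverted-file (IVF) partitioning idea behind this step can be sketched with plain NumPy; note that Faiss's `IndexIVFFlat` trains its centroids with k-means, so the random centroid choice below is a stand-in assumption used only to show the partition structure.

```python
import numpy as np

def build_ivf(db_vectors, n_cells=4, seed=0):
    """Partition database vectors into N inverted lists by nearest centroid.
    In Faiss, a trained coarse quantizer plays the role of these centroids."""
    rng = np.random.default_rng(seed)
    centroids = db_vectors[rng.choice(len(db_vectors), n_cells, replace=False)]
    # assign each vector to its closest centroid by squared L2 distance
    d2 = ((db_vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    inv_lists = {c: np.where(assign == c)[0] for c in range(n_cells)}
    return centroids, inv_lists

D = np.random.default_rng(1).normal(size=(100, 8)).astype("float32")
centroids, inv_lists = build_ivf(D, n_cells=4)
```

At query time only the lists nearest the query need to be scanned, which is what makes the one-stage search fast.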
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50.
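A minimal NumPy sketch of this screening step, assuming the query and database vectors have already been produced by the BERT-whitening model:

```python
import numpy as np

def screen_candidates(q_vec, db, k=40):
    """Return indices of the top-k stories by cosine similarity cos(theta),
    plus the full similarity array."""
    sims = db @ q_vec / (np.linalg.norm(db, axis=1) * np.linalg.norm(q_vec))
    return np.argsort(-sims)[:k], sims   # descending order of similarity

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 16))   # 1000 story vectors, dim 16 (illustrative)
q = rng.normal(size=16)
top, sims = screen_candidates(q, db, k=40)
```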
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0 or 1, and a is the predicted value, a ∈ (0, 1). The learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges.
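A one-function sketch of the loss above, written in its conventional negated (minimized) form; the patent prints the unsigned product, so the sign convention here is an assumption:

```python
import math

def bce(y, a):
    """Binary cross-entropy for one example, negated so that it is minimized.
    y in {0, 1} is the true label; a in (0, 1) is the predicted probability."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))
```

For example, `bce(1, 0.5)` equals ln 2 ≈ 0.693, and the loss shrinks as the prediction approaches the label.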
(7) Two-stage determination of correlation
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer, are used to determine the relevance F1 between the query request q and the candidate folk story set G according to the following formulas:

E = Es + Ep + Et

X1 = E

Q = Xl−1 × WQ

K = Xl−1 × WK

V = Xl−1 × WV

Aj = s((Q × Kᵀ) / √dk) × V

H12 = C(A1, A2, …, A12)

F1 = s(H12)
where H12 represents the output of the multi-head attention calculation, Es represents the sentence (token) embedding, Ep the position embedding, and Et the segment embedding; C denotes the operation of concatenating the attention matrices, Aj denotes the j-th attention matrix, s(H12) denotes the softmax function, Xl−1 is the output of layer l−1 of the BERT model, dk is the dimension of the input vector, j indexes the attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V are the parameter matrices learned during training.
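The per-head attention computation defined by these formulas can be sketched in NumPy; the 5×8 input and the weight-matrix shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, the s(.) in the formulas above."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One attention head A_j = s(Q K^T / sqrt(d_k)) V,
    with Q = X Wq, K = X Wk, V = X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, embedding dim 8 (illustrative)
A = attention_head(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
```

In the multi-head case the outputs A1…A12 are concatenated, which is the C(·) operation.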
The correlation F is determined as follows:
F = 0.5 × F1 + 0.5 × F2

F2 = Σi wi × ri

wi = s(ri)
where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th sub-segment, wi is the weight of each sub-segment's relevance, and s(ri) denotes the softmax function.
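A small sketch of the score fusion, assuming `f1` is the stage-two BERT relevance score and `seg_sims` holds the per-sub-segment similarities r_i:

```python
import numpy as np

def fuse_relevance(f1, seg_sims):
    """F = 0.5*F1 + 0.5*F2, where F2 weights each sub-segment similarity r_i
    by its softmax weight w_i = s(r)_i, as in step (7)."""
    r = np.asarray(seg_sims, dtype=float)
    w = np.exp(r - r.max())
    w /= w.sum()                         # softmax weights w_i
    f2 = float((w * r).sum())            # F2 = sum_i w_i * r_i
    return 0.5 * f1 + 0.5 * f2

score = fuse_relevance(0.8, [0.9, 0.4, 0.1, 0.3, 0.7])
```

With equal sub-segment similarities the softmax weights are uniform, so F2 reduces to the plain similarity value.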
(8) Displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest similarity are shown to the user as the final retrieval result.
In step (3) of the present invention, the title-content folk story pairs are formed by word-segmenting the title and content of each folk story, and the title T is split into:
T={t1,t2,…,tu}
where u is the length of the title;
the content S is split into:
S={s1,s2,…,sz}
where z is the length of the content.
In steps (5) and (7) of the present invention, the query request q is analyzed and word-segmented, then converted into a word vector V through the BERT-whitening model.
In the step (6) of the invention, the folk story data set Y is input into a BERT model for training, and a cross entropy loss function L (Y, a) is determined according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
in the step (6), the folk story data set Y is input into a BERT model for training, and a cross entropy loss function L (Y, a) is determined according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value (0 or 1) and a is the predicted value; the optimal value of a is 0.5, the optimal learning rate of the model is r = 10⁻⁴, the optimal dropout rate is 0.08, the optimal number of training epochs is 12, the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges.
In step (7) of the present invention, each story in the candidate folk story set G is split into 5 sub-segments, each of length 128, with 10% content overlap between the end of one sub-segment and the beginning of the next.
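The sub-segment splitting can be sketched as a sliding window; treating 10% of the 128-unit segment length as a 12-unit overlap is an assumption about how the fraction is rounded:

```python
def split_segments(text, seg_len=128, overlap_ratio=0.10):
    """Split a candidate story into fixed-length sub-segments where the end of
    one segment repeats at the start of the next (10% overlap), per step (7)."""
    step = seg_len - int(seg_len * overlap_ratio)   # 128 - 12 = 116
    return [text[i:i + seg_len]
            for i in range(0, max(len(text) - seg_len, 0) + 1, step)]

# a 592-character story yields exactly 5 overlapping segments of length 128
story = "".join(str(i % 10) for i in range(592))
segs = split_segments(story)
```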
In the following formula of step (7) of the present invention:
X1=E
where Xl represents the output of the l-th encoding layer of the BERT model, and l takes a value in [1, 12].
Compared with the prior art, the invention has the following advantages:
because the invention adopts a two-stage retrieval method and combines a Faiss retrieval algorithm with the BERT pre-training language model, the model can better understand the context information of folk stories, the query request is firstly screened out a candidate folk story set and then the BERT model is utilized to carry out relevancy calculation, the traditional retrieval speed is improved, the retrieval accuracy is increased, and the retrieval result is accurate. The invention can help the user to quickly and accurately find the folk stories which the user wants to know in a large number of folk stories, reduces the waiting time of the user, stimulates the interest of the user in the folk stories and is beneficial to the propagation of traditional culture.
Drawings
FIG. 1 is a flowchart of example 1 of the present invention
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples, but the present invention is not limited to the embodiments described below.
Example 1
In fig. 1, the BERT-based two-stage folk story retrieval method of the present embodiment is composed of the following steps:
(1) gathering folk stories
Part of the folk stories are found in a folk culture resource management system, and a web crawler is used to crawl down their text data, obtaining the folk stories.
(2) Folk story data preprocessing
Garbled characters, empty content, and parts inconsistent with the content are deleted, and synonyms in the folk story contents are randomly replaced.
(3) Construction of folk story data set
Folk stories are processed into title:content pairs to produce the folk story data set Y, Y = {t1:c1, t2:c2, …, tn:cn}, where tn represents the title of the nth folk story and cn its content. Here n = 10000 folk stories are selected and divided into a training set and a test set in a 9:1 ratio.
The title-content folk story pairs are formed by word-segmenting the title and content of each folk story, and the title is split into:
T={t1,t2,…,tu}
where u is the length of the title.
The content is split into:
S={s1,s2,…,sz}
where z is the length of the content.
(4) One-stage construction of vector search engine
The folk story data set Y is converted into word vectors J using the BERT-whitening model, and a database vector D is built from J with the Faiss retrieval library, D = {d1, d2, …, dn}. The database vector D is divided into N spaces using an inverted file index, where N is a finite positive integer, and the vector search engine is thereby constructed.
(5) Screening candidate folk story set

The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50; in this embodiment k = 40.
The query request processing of this embodiment is: analysis and word segmentation, followed by conversion into a word vector V through the BERT-whitening model.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.5, the learning rate of the model is r = 10⁻⁴, the dropout rate is 0.08, and the number of training epochs is 12.
(7) Two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer (l = 6 in this embodiment), are used to determine the relevance F1 between the query request q and the candidate folk story set G according to the following formulas:

E = Es + Ep + Et

X1 = E

Q = Xl−1 × WQ

K = Xl−1 × WK

V = Xl−1 × WV

Aj = s((Q × Kᵀ) / √dk) × V

H12 = C(A1, A2, …, A12)

F1 = s(H12)
where H12 represents the output of the multi-head attention calculation, Es represents the sentence (token) embedding, Ep the position embedding, and Et the segment embedding; C denotes the operation of concatenating the attention matrices, Aj denotes the j-th attention matrix, s(H12) denotes the softmax function, Xl−1 is the output of layer l−1 of the BERT model, dk is the dimension of the input vector, j indexes the attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V are the parameter matrices learned during training.
The correlation F is determined as follows:
F = 0.5 × F1 + 0.5 × F2

F2 = Σi wi × ri

wi = s(ri)
where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th sub-segment, wi is the weight of each sub-segment's relevance, and s(ri) denotes the softmax function.
(8) Displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest similarity are shown to the user as the final retrieval result.
This completes the BERT-based two-stage folk story retrieval method.
Example 2
The BERT-based two-stage folk story retrieval method of the embodiment comprises the following steps of:
(1) gathering folk stories
This procedure is the same as in example 1.
(2) Folk story data preprocessing
This procedure is the same as in example 1.
(3) Construction of folk story data set
This procedure is the same as in example 1.
(4) One-stage construction of vector search engine
This procedure is the same as in example 1.
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50; in this embodiment k = 20.
The other steps of this step are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.01, the learning rate of the model is r = 10⁻⁵, the dropout rate is 0.05, and the number of training epochs is 10.
The other steps of this step are the same as in example 1.
(7) Two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer (l = 6 in this embodiment), are used to determine the relevance F1 of the candidate folk story set G; the procedure for determining the relevance F is the same as in Example 1.
The other steps were the same as in example 1.
This completes the BERT-based two-stage folk story retrieval method.
Example 3
The BERT-based two-stage folk story retrieval method of the embodiment comprises the following steps of:
(1) gathering folk stories
This procedure is the same as in example 1.
(2) Folk story data preprocessing
This procedure is the same as in example 1.
(3) Construction of folk story data set
This procedure is the same as in example 1.
(4) One-stage construction of vector search engine
This procedure is the same as in example 1.
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50; in this embodiment k = 50.
The other steps of this step are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.09, the learning rate of the model is r = 10⁻³, the dropout rate is 0.1, and the number of training epochs is 15.
The other steps of this step are the same as in example 1.
(7) Two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer (l = 12 in this embodiment), are used to determine the relevance F1 of the candidate folk story set G; the procedure for determining the relevance F is the same as in Example 1.
The other steps were the same as in example 1.
This completes the BERT-based two-stage folk story retrieval method.
Example 4
Based on Embodiments 1 to 3 above, the BERT-based two-stage folk story retrieval method of this embodiment comprises the following steps:
(1) steps (1) to (5) are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
wherein y is a true value, y takes a value of 1, and other parameters are the same as those of the corresponding embodiment.
The other steps were the same as in example 1.
This completes the BERT-based two-stage folk story retrieval method.
In order to verify the beneficial effects of the invention, the inventors carried out comparative simulation experiments using the BERT-based two-stage folk story retrieval method of Embodiment 1 and two traditional retrieval methods: the Term Frequency-Inverse Document Frequency method (hereinafter TF-IDF) and the Best Match 25 method (hereinafter BM25).
Evaluation indexes: the mean reciprocal rank M (hereinafter M), the normalized discounted cumulative gain N (hereinafter N), and the retrieval time T (hereinafter T). M and N evaluate retrieval accuracy; larger M and N indicate better retrieval accuracy.
The mean reciprocal rank M is calculated as follows:

M = (1/B) × Σi (1/bi)

where B = 1000 is the number of query requests in the test set, and for each query qi, bi is the rank position of its first relevant result.
The normalized discounted cumulative gain N is calculated as follows:

N = DCG@10 / IDCG@10, with DCG@10 = Σi=1..10 ei / log2(i + 1)

where the results are sorted by relevance from large to small and the set formed by the top 10 results is taken; ei represents the relevance value of the i-th result, and IDCG@10 is the DCG of the ideal (relevance-sorted) ordering.
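Both evaluation indexes can be sketched directly from their definitions; the logarithmic DCG form used for N below is one common variant and an assumption, since the patent's own formula is not shown:

```python
import math

def mrr(first_hit_ranks):
    """M = (1/B) * sum_i 1/b_i over the B queries' first-relevant ranks."""
    return sum(1.0 / b for b in first_hit_ranks) / len(first_hit_ranks)

def ndcg_at_10(relevances):
    """N = DCG@10 / IDCG@10 with DCG = sum_i e_i / log2(i + 1) -- one common
    DCG variant, assumed here since the source does not show its formula."""
    def dcg(rels):
        return sum(e / math.log2(i + 2) for i, e in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```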
The results of the experiments and calculations are shown in table 1.
Table 1 simulation experiment result table of the present invention and the comparative search method
As can be seen from Table 1, the mean reciprocal rank M of the Example 1 method is 0.344, higher than the TF-IDF and BM25 methods; the normalized discounted cumulative gain N of the Example 1 method is 0.389, higher than the TF-IDF and BM25 methods; and the retrieval time of the Example 1 method is 4813 seconds, less than those of the TF-IDF and BM25 methods. The experiments show that the method of the invention has better retrieval accuracy and higher retrieval speed.
Claims (6)
1. A two-stage folk story retrieval method based on a BERT model is characterized by comprising the following steps:
(1) collecting folk stories
Part of the folk stories are found in a folk culture resource management system, and a web crawler is used to crawl down their text data, obtaining the folk stories;
(2) folk story data preprocessing
Garbled characters, empty content, and parts inconsistent with the content are deleted, and synonyms in the folk story contents are randomly replaced;
(3) construction of folk story data set
Folk stories are processed into title:content pairs to produce the folk story data set Y, Y = {t1:c1, t2:c2, …, tn:cn}, where tn represents the title of the nth folk story and cn its content; here n = 10000 folk stories are selected and divided into a training set and a test set in a 9:1 ratio;
(4) one-stage construction of vector search engine
The folk story data set Y is converted into word vectors J using the BERT-whitening model, and a database vector D is built from J with the Faiss retrieval library, D = {d1, d2, …, dn}; the database vector D is divided into N spaces using an inverted file index, where N is a finite positive integer, and the vector search engine is thereby constructed;
(5) screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus; the top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50;
(6) training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0 or 1, and a is the predicted value, a ∈ (0, 1); the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges;
(7) two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer, are used to determine the relevance F1 between the query request q and the candidate folk story set G according to the following formulas:

E = Es + Ep + Et

X1 = E

Q = Xl−1 × WQ

K = Xl−1 × WK

V = Xl−1 × WV

Aj = s((Q × Kᵀ) / √dk) × V

H12 = C(A1, A2, …, A12)

F1 = s(H12)
where H12 represents the output of the multi-head attention calculation, Es represents the sentence (token) embedding, Ep the position embedding, and Et the segment embedding; C denotes the operation of concatenating the attention matrices, Aj denotes the j-th attention matrix, s(H12) denotes the softmax function, Xl−1 is the output of layer l−1 of the BERT model, dk is the dimension of the input vector, j indexes the attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V are the parameter matrices learned during training;
the correlation F is determined as follows:
F = 0.5 × F1 + 0.5 × F2

F2 = Σi wi × ri

wi = s(ri)
where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th sub-segment, wi is the weight of each sub-segment's relevance, and s(ri) denotes the softmax function;
(8) displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest similarity are shown to the user as the final retrieval result.
2. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (3), the title-content folk story pairs are formed by word-segmenting the title and content of each folk story, and the title is split into:
T={t1,t2,…,tu}
where u is the length of the title;
the content is split into:
S={s1,s2,…,sz}
where z is the length of the content.
3. The BERT-based two-stage folk story retrieval method of claim 1, wherein in steps (5) and (7), the query request q is analyzed and word-segmented, then converted into a word vector V through the BERT-whitening model.
4. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (6), the folk story data set Y is input into the BERT model for training, and the cross-entropy loss function L(y, a) is determined as follows:
L(y, a) = -[y × ln a + (1 - y) × ln(1 - a)]
wherein y is the true label (y = 0 or 1), a is the predicted probability (with 0.5 as the classification threshold), the learning rate r of the model is 10^-4, the dropout rate is 0.08, the number of training epochs is 12, the batch size of each epoch is 8, and the Adam optimizer is selected, iterating until the cross-entropy loss function L(y, a) converges.
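A small sketch of the loss and the hyperparameters listed in claim 4, using the standard binary cross-entropy -[y·ln a + (1-y)·ln(1-a)]; the `eps` clipping and the `config` dict are illustrative implementation details, not from the patent:

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    # L(y, a) = -[y*ln(a) + (1-y)*ln(1-a)], y in {0, 1}, a in (0, 1)
    a = np.clip(a, eps, 1 - eps)  # avoid log(0)
    return float(-(y * np.log(a) + (1 - y) * np.log(1 - a)))

# hyperparameters as stated in the claim
config = dict(learning_rate=1e-4, dropout=0.08, epochs=12,
              batch_size=8, optimizer="Adam")

loss = cross_entropy(1, 0.9)  # loss for a positive pair predicted at 0.9
```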
5. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (7), the candidate folk story set G comprises 5 sub-segments, each sub-segment has a length of 128, and the last 10% of each sub-segment overlaps with the beginning of the next sub-segment.
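A hedged sketch of the sub-segment construction in claim 5, assuming the 10% overlap means a stride of length - int(0.1 × length) = 116 tokens (the exact rounding is not specified in the claim):

```python
def split_into_subsegments(tokens, length=128, overlap_ratio=0.1, max_segments=5):
    # the last `overlap` tokens of one sub-segment are repeated
    # at the start of the next sub-segment
    overlap = int(length * overlap_ratio)  # 12 tokens for length 128
    stride = length - overlap              # 116 tokens
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + length])
        if len(segments) == max_segments or start + length >= len(tokens):
            break
    return segments

segs = split_into_subsegments(list(range(300)))  # toy 300-token story
```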
6. The BERT-based two-stage folk story retrieval method of claim 1, wherein the formula in step (7) is:
X_1 = E
wherein X_l represents the output of the l-th coding layer of the BERT model, l ∈ [1, 12].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210188618.XA CN114547251B (en) | 2022-02-28 | 2022-02-28 | BERT-based two-stage folk story retrieval method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114547251A (en) | 2022-05-27 |
CN114547251B CN114547251B (en) | 2024-03-01 |
Family
ID=81679714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210188618.XA Active CN114547251B (en) | 2022-02-28 | 2022-02-28 | BERT-based two-stage folk story retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114547251B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023246337A1 (en) * | 2022-06-24 | 2023-12-28 | 京东方科技集团股份有限公司 | Unsupervised semantic retrieval method and apparatus, and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012173794A (en) * | 2011-02-17 | 2012-09-10 | Nippon Telegr & Teleph Corp <Ntt> | Document retrieval device having ranking model selection function, document retrieval method having ranking model selection function, and document retrieval program having ranking model selection function |
CN112256860A (en) * | 2020-11-25 | 2021-01-22 | 携程计算机技术(上海)有限公司 | Semantic retrieval method, system, equipment and storage medium for customer service conversation content |
WO2021211207A1 (en) * | 2020-04-17 | 2021-10-21 | Microsoft Technology Licensing, Llc | Adversarial pretraining of machine learning models |
CN113553824A (en) * | 2021-07-07 | 2021-10-26 | 临沂中科好孕智能技术有限公司 | Sentence vector model training method |
CN113962228A (en) * | 2021-10-26 | 2022-01-21 | 北京理工大学 | Long document retrieval method based on semantic fusion of memory network |
Non-Patent Citations (2)
Title |
---|
赵亚慧: "A Large-Capacity Text Retrieval Algorithm", Journal of Yanbian University (Natural Science Edition), no. 01, 20 March 2009 (2009-03-20) * |
顾迎捷; 桂小林; 李德福; 沈毅; 廖东: "A Survey of Neural-Network-Based Machine Reading Comprehension", Journal of Software, no. 07, 15 July 2020 (2020-07-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN114547251B (en) | 2024-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111353030B (en) | Knowledge question and answer retrieval method and device based on knowledge graph in travel field | |
WO2021093755A1 (en) | Matching method and apparatus for questions, and reply method and apparatus for questions | |
CN112035730B (en) | Semantic retrieval method and device and electronic equipment | |
CN108846029B (en) | Information correlation analysis method based on knowledge graph | |
CN111291188B (en) | Intelligent information extraction method and system | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
US20180341686A1 (en) | System and method for data search based on top-to-bottom similarity analysis | |
CN106708929B (en) | Video program searching method and device | |
CN112307182B (en) | Question-answering system-based pseudo-correlation feedback extended query method | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
CN110851584A (en) | Accurate recommendation system and method for legal provision | |
CN116306504B (en) | Candidate entity generation method and device, storage medium and electronic equipment | |
CN106570196B (en) | Video program searching method and device | |
CN113806510B (en) | Legal provision retrieval method, terminal equipment and computer storage medium | |
CN114547251B (en) | BERT-based two-stage folk story retrieval method | |
CN116401345A (en) | Intelligent question-answering method, device, storage medium and equipment | |
CN113342950B (en) | Answer selection method and system based on semantic association | |
CN112487274B (en) | Search result recommendation method and system based on text click rate | |
CN114297351A (en) | Statement question and answer method, device, equipment, storage medium and computer program product | |
CN117634615A (en) | Multi-task code retrieval method based on mode irrelevant comparison learning | |
CN117633148A (en) | Medical term standardization method based on fusion multi-strategy comparison learning | |
CN116204622A (en) | Query expression enhancement method in cross-language dense retrieval | |
CN113627722B (en) | Simple answer scoring method based on keyword segmentation, terminal and readable storage medium | |
CN112199461B (en) | Document retrieval method, device, medium and equipment based on block index structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||