CN114547251A - Two-stage folk story retrieval method based on BERT - Google Patents
Two-stage folk story retrieval method based on BERT
- Publication number
- CN114547251A (application number CN202210188618.XA)
- Authority
- CN
- China
- Prior art keywords
- folk
- story
- bert
- vector
- folk story
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
- 238000000034 method Methods 0.000 title claims abstract description 64
- 239000013598 vector Substances 0.000 claims abstract description 52
- 238000012549 training Methods 0.000 claims abstract description 40
- 238000007781 pre-processing Methods 0.000 claims abstract description 6
- 238000012216 screening Methods 0.000 claims abstract description 5
- 230000006870 function Effects 0.000 claims description 22
- 238000010276 construction Methods 0.000 claims description 10
- 239000011159 matrix material Substances 0.000 claims description 9
- 238000004364 calculation method Methods 0.000 claims description 6
- 238000012545 processing Methods 0.000 claims description 6
- 238000012360 testing method Methods 0.000 claims description 4
- 238000004458 analytical method Methods 0.000 claims description 3
- 230000009193 crawling Effects 0.000 claims description 3
- 238000013507 mapping Methods 0.000 claims description 3
- 230000011218 segmentation Effects 0.000 claims description 3
- 238000012163 sequencing technique Methods 0.000 claims description 3
- 230000001186 cumulative effect Effects 0.000 description 3
- 230000009286 beneficial effect Effects 0.000 description 2
- 230000000052 comparative effect Effects 0.000 description 2
- 238000001514 detection method Methods 0.000 description 2
- 238000011156 evaluation Methods 0.000 description 2
- 238000002474 experimental method Methods 0.000 description 2
- 238000004088 simulation Methods 0.000 description 2
- 238000004422 calculation algorithm Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 230000008451 emotion Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000010365 information processing Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3341—Query execution using boolean model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Evolutionary Computation (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A two-stage folk story retrieval method based on the BERT model comprises the steps of collecting folk stories, preprocessing the folk story data, constructing a folk story data set, building a one-stage vector search engine, screening a candidate folk story set, training the BERT model, determining relevance in the second stage, and displaying the retrieval results. Experimental results show that, compared with traditional retrieval methods, the method better understands the contextual information of folk stories, combines the query request with the folk stories more effectively, improves retrieval accuracy, and accelerates retrieval speed. The invention is characterized by accurate retrieval results and high retrieval speed, and can accurately find, among massive collections of folk stories, the stories a user wants to know.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a computer information retrieval system.
Background
The 21st century is an information age. The development of the Internet has made the collection of folk stories simple, greatly increasing the number of folk stories available to people. At the same time, the threshold and difficulty of text information processing keep rising, as do the demands on the quality and efficiency of text retrieval technology. Finding a folk story that meets a requirement among a large number of folk stories usually takes a lot of time, and the retrieval results often fall short of expectations. There are many traditional retrieval methods, such as methods based on text similarity calculation, ontology-based retrieval, and clustering-based retrieval. First, if the data set to be retrieved is large, traditional retrieval methods are extremely time-consuming and their retrieval accuracy is low. Second, folk stories often contain rich textual content, and relying solely on shallow text features is far from sufficient. Therefore, finding a new retrieval method is very important. Folk stories contain rich historical knowledge and deep national feeling, and they are diverse in kind and large in number. How to query the relevant folk stories from such varied and massive collections has become the core difficulty of folk story retrieval.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a two-stage folk story retrieval method based on a BERT model, which has high retrieval accuracy and high retrieval speed.
The technical scheme for solving the technical problems comprises the following steps:
(1) gathering folk stories
Part of the folk stories are found in a folk culture resource management system, and a web crawler is used to crawl down their text data, obtaining the folk stories.
(2) Folk story data preprocessing
Garbled characters, empty content, and parts inconsistent with the content are deleted, and synonyms in the folk story contents are randomly replaced.
(3) Construction of folk story data set
Folk stories are processed into title:content pairs to produce the folk story data set Y, Y = {t1:c1, t2:c2, …, tn:cn}, where tn represents the title of the nth folk story and cn its content. Here n = 10000 folk stories are selected and divided into a training set and a test set in a 9:1 ratio.
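As a minimal sketch of this step, the pairing and 9:1 split can be expressed as follows; the `stories` list and the field layout are illustrative assumptions, not the patent's actual data source.

```python
# Sketch of step (3): build title:content pairs and a 9:1 train/test split.
import random

def build_dataset(stories, seed=42):
    """stories: list of (title, content) tuples produced by steps (1)-(2)."""
    pairs = [f"{t}:{c}" for t, c in stories]      # Y = {t1:c1, ..., tn:cn}
    random.Random(seed).shuffle(pairs)            # deterministic shuffle
    cut = int(len(pairs) * 0.9)                   # 9:1 ratio
    return pairs[:cut], pairs[cut:]

# 10000 synthetic stories stand in for the crawled corpus
train, test = build_dataset([(f"title{i}", f"content{i}") for i in range(10000)])
```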
(4) One-stage construction of vector search engine
The folk story data set Y is converted into word vectors J using the BERT-whitening model, and a database vector D is built from J with the Faiss retrieval library, D = {d1, d2, …, dn}. The database vector D is divided into N spaces using an inverted file index, where N is a finite positive integer, and the vector search engine is thereby constructed.
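The inverted-file (IVF) partitioning idea behind this step can be sketched with plain NumPy; note that Faiss's `IndexIVFFlat` trains its centroids with k-means, so the random centroid choice below is a stand-in assumption used only to show the partition structure.

```python
import numpy as np

def build_ivf(db_vectors, n_cells=4, seed=0):
    """Partition database vectors into N inverted lists by nearest centroid.
    In Faiss, a trained coarse quantizer plays the role of these centroids."""
    rng = np.random.default_rng(seed)
    centroids = db_vectors[rng.choice(len(db_vectors), n_cells, replace=False)]
    # assign each vector to its closest centroid by squared L2 distance
    d2 = ((db_vectors[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    assign = d2.argmin(axis=1)
    inv_lists = {c: np.where(assign == c)[0] for c in range(n_cells)}
    return centroids, inv_lists

D = np.random.default_rng(1).normal(size=(100, 8)).astype("float32")
centroids, inv_lists = build_ivf(D, n_cells=4)
```

At query time only the lists nearest the query need to be scanned, which is what makes the one-stage search fast.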
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50.
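A minimal NumPy sketch of this screening step, assuming the query and database vectors have already been produced by the BERT-whitening model:

```python
import numpy as np

def screen_candidates(q_vec, db, k=40):
    """Return indices of the top-k stories by cosine similarity cos(theta),
    plus the full similarity array."""
    sims = db @ q_vec / (np.linalg.norm(db, axis=1) * np.linalg.norm(q_vec))
    return np.argsort(-sims)[:k], sims   # descending order of similarity

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 16))   # 1000 story vectors, dim 16 (illustrative)
q = rng.normal(size=16)
top, sims = screen_candidates(q, db, k=40)
```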
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0 or 1, and a is the predicted value, a ∈ (0, 1). The learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges.
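A one-function sketch of the loss above, written in its conventional negated (minimized) form; the patent prints the unsigned product, so the sign convention here is an assumption:

```python
import math

def bce(y, a):
    """Binary cross-entropy for one example, negated so that it is minimized.
    y in {0, 1} is the true label; a in (0, 1) is the predicted probability."""
    return -(y * math.log(a) + (1 - y) * math.log(1 - a))
```

For example, `bce(1, 0.5)` equals ln 2 ≈ 0.693, and the loss shrinks as the prediction approaches the label.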
(7) Two-stage determination of correlation
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer, are used to determine the relevance F1 between the query request q and the candidate folk story set G according to the following formulas:

E = Es + Ep + Et

X1 = E

Q = Xl−1 × WQ

K = Xl−1 × WK

V = Xl−1 × WV

Aj = s((Q × Kᵀ) / √dk) × V

H12 = C(A1, A2, …, A12)

F1 = s(H12)
where H12 represents the output of the multi-head attention calculation, Es represents the sentence (token) embedding, Ep the position embedding, and Et the segment embedding; C denotes the operation of concatenating the attention matrices, Aj denotes the j-th attention matrix, s(H12) denotes the softmax function, Xl−1 is the output of layer l−1 of the BERT model, dk is the dimension of the input vector, j indexes the attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V are the parameter matrices learned during training.
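The per-head attention computation defined by these formulas can be sketched in NumPy; the 5×8 input and the weight-matrix shapes are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax, the s(.) in the formulas above."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, Wq, Wk, Wv):
    """One attention head A_j = s(Q K^T / sqrt(d_k)) V,
    with Q = X Wq, K = X Wk, V = X Wv."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d_k)) @ V

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))   # 5 tokens, embedding dim 8 (illustrative)
A = attention_head(X, *(rng.normal(size=(8, 8)) for _ in range(3)))
```

In the multi-head case the outputs A1…A12 are concatenated, which is the C(·) operation.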
The correlation F is determined as follows:
F = 0.5 × F1 + 0.5 × F2

F2 = Σi wi × ri

wi = s(ri)
where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th sub-segment, wi is the weight of each sub-segment's relevance, and s(ri) denotes the softmax function.
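A small sketch of the score fusion, assuming `f1` is the stage-two BERT relevance score and `seg_sims` holds the per-sub-segment similarities r_i:

```python
import numpy as np

def fuse_relevance(f1, seg_sims):
    """F = 0.5*F1 + 0.5*F2, where F2 weights each sub-segment similarity r_i
    by its softmax weight w_i = s(r)_i, as in step (7)."""
    r = np.asarray(seg_sims, dtype=float)
    w = np.exp(r - r.max())
    w /= w.sum()                         # softmax weights w_i
    f2 = float((w * r).sum())            # F2 = sum_i w_i * r_i
    return 0.5 * f1 + 0.5 * f2

score = fuse_relevance(0.8, [0.9, 0.4, 0.1, 0.3, 0.7])
```

With equal sub-segment similarities the softmax weights are uniform, so F2 reduces to the plain similarity value.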
(8) Displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest similarity are shown to the user as the final retrieval result.
In step (3) of the present invention, the title-content folk story pairs are formed by word-segmenting the title and content of each folk story, and the title T is split into:
T={t1,t2,…,tu}
where u is the length of the title;
the content S is split into:
S={s1,s2,…,sz}
where z is the length of the content.
In steps (5) and (7) of the present invention, the query request q is analyzed and word-segmented, then converted into a word vector V through the BERT-whitening model.
In the step (6) of the invention, the folk story data set Y is input into a BERT model for training, and a cross entropy loss function L (Y, a) is determined according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
in the step (6), the folk story data set Y is input into a BERT model for training, and a cross entropy loss function L (Y, a) is determined according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value (0 or 1) and a is the predicted value; the optimal value of a is 0.5, the optimal learning rate of the model is r = 10⁻⁴, the optimal dropout rate is 0.08, the optimal number of training epochs is 12, the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges.
In step (7) of the present invention, each story in the candidate folk story set G is split into 5 sub-segments, each of length 128, with 10% content overlap between the end of one sub-segment and the beginning of the next.
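The sub-segment splitting can be sketched as a sliding window; treating 10% of the 128-unit segment length as a 12-unit overlap is an assumption about how the fraction is rounded:

```python
def split_segments(text, seg_len=128, overlap_ratio=0.10):
    """Split a candidate story into fixed-length sub-segments where the end of
    one segment repeats at the start of the next (10% overlap), per step (7)."""
    step = seg_len - int(seg_len * overlap_ratio)   # 128 - 12 = 116
    return [text[i:i + seg_len]
            for i in range(0, max(len(text) - seg_len, 0) + 1, step)]

# a 592-character story yields exactly 5 overlapping segments of length 128
story = "".join(str(i % 10) for i in range(592))
segs = split_segments(story)
```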
In the following formula of step (7) of the present invention:
X1=E
where Xl represents the output of the l-th encoding layer of the BERT model, and l takes a value in [1, 12].
Compared with the prior art, the invention has the following advantages:
because the invention adopts a two-stage retrieval method and combines a Faiss retrieval algorithm with the BERT pre-training language model, the model can better understand the context information of folk stories, the query request is firstly screened out a candidate folk story set and then the BERT model is utilized to carry out relevancy calculation, the traditional retrieval speed is improved, the retrieval accuracy is increased, and the retrieval result is accurate. The invention can help the user to quickly and accurately find the folk stories which the user wants to know in a large number of folk stories, reduces the waiting time of the user, stimulates the interest of the user in the folk stories and is beneficial to the propagation of traditional culture.
Drawings
FIG. 1 is a flowchart of example 1 of the present invention
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples, but the present invention is not limited to the embodiments described below.
Example 1
In fig. 1, the BERT-based two-stage folk story retrieval method of the present embodiment is composed of the following steps:
(1) gathering folk stories
Part of the folk stories are found in a folk culture resource management system, and a web crawler is used to crawl down their text data, obtaining the folk stories.
(2) Folk story data preprocessing
Garbled characters, empty content, and parts inconsistent with the content are deleted, and synonyms in the folk story contents are randomly replaced.
(3) Construction of folk story data set
Folk stories are processed into title:content pairs to produce the folk story data set Y, Y = {t1:c1, t2:c2, …, tn:cn}, where tn represents the title of the nth folk story and cn its content. Here n = 10000 folk stories are selected and divided into a training set and a test set in a 9:1 ratio.
The title-content folk story pairs are formed by word-segmenting the title and content of each folk story, and the title is split into:
T={t1,t2,…,tu}
where u is the length of the title.
The content is split into:
S={s1,s2,…,sz}
where z is the length of the content.
(4) One-stage construction of vector search engine
The folk story data set Y is converted into word vectors J using the BERT-whitening model, and a database vector D is built from J with the Faiss retrieval library, D = {d1, d2, …, dn}. The database vector D is divided into N spaces using an inverted file index, where N is a finite positive integer, and the vector search engine is thereby constructed.
(5) Screening candidate folk story set

The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50; in this embodiment k = 40.
The query request processing of this embodiment is: analysis and word segmentation, followed by conversion into a word vector V through the BERT-whitening model.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.5, the learning rate of the model is r = 10⁻⁴, the dropout rate is 0.08, and the number of training epochs is 12.
(7) Two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer (l = 6 in this embodiment), are used to determine the relevance F1 between the query request q and the candidate folk story set G according to the following formulas:

E = Es + Ep + Et

X1 = E

Q = Xl−1 × WQ

K = Xl−1 × WK

V = Xl−1 × WV

Aj = s((Q × Kᵀ) / √dk) × V

H12 = C(A1, A2, …, A12)

F1 = s(H12)
where H12 represents the output of the multi-head attention calculation, Es represents the sentence (token) embedding, Ep the position embedding, and Et the segment embedding; C denotes the operation of concatenating the attention matrices, Aj denotes the j-th attention matrix, s(H12) denotes the softmax function, Xl−1 is the output of layer l−1 of the BERT model, dk is the dimension of the input vector, j indexes the attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V are the parameter matrices learned during training.
The correlation F is determined as follows:
F = 0.5 × F1 + 0.5 × F2

F2 = Σi wi × ri

wi = s(ri)
where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th sub-segment, wi is the weight of each sub-segment's relevance, and s(ri) denotes the softmax function.
(8) Displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest similarity are shown to the user as the final retrieval result.
This completes the BERT-based two-stage folk story retrieval method.
Example 2
The BERT-based two-stage folk story retrieval method of the embodiment comprises the following steps of:
(1) gathering folk stories
This procedure is the same as in example 1.
(2) Folk story data preprocessing
This procedure is the same as in example 1.
(3) Construction of folk story data set
This procedure is the same as in example 1.
(4) One-stage construction of vector search engine
This procedure is the same as in example 1.
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50; in this embodiment k = 20.
The other steps of this step are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.01, the learning rate of the model is r = 10⁻⁵, the dropout rate is 0.05, and the number of training epochs is 10.
The other steps of this step are the same as in example 1.
(7) Two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer (l = 6 in this embodiment), are used to determine the relevance F1 of the candidate folk story set G; the procedure for determining the relevance F is the same as in Example 1.
The other steps were the same as in example 1.
This completes the BERT-based two-stage folk story retrieval method.
Example 3
The BERT-based two-stage folk story retrieval method of the embodiment comprises the following steps of:
(1) gathering folk stories
This procedure is the same as in example 1.
(2) Folk story data preprocessing
This procedure is the same as in example 1.
(3) Construction of folk story data set
This procedure is the same as in example 1.
(4) One-stage construction of vector search engine
This procedure is the same as in example 1.
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus. The top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50; in this embodiment k = 50.
The other steps of this step are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.09, the learning rate of the model is r = 10⁻³, the dropout rate is 0.1, and the number of training epochs is 15.
The other steps of this step are the same as in example 1.
(7) Two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer (l = 12 in this embodiment), are used to determine the relevance F1 of the candidate folk story set G; the procedure for determining the relevance F is the same as in Example 1.
The other steps were the same as in example 1.
This completes the BERT-based two-stage folk story retrieval method.
Example 4
Based on Embodiments 1 to 3 above, the BERT-based two-stage folk story retrieval method of this embodiment comprises the following steps:
(1) steps (1) to (5) are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
wherein y is a true value, y takes a value of 1, and other parameters are the same as those of the corresponding embodiment.
The other steps were the same as in example 1.
This completes the BERT-based two-stage folk story retrieval method.
In order to verify the beneficial effects of the invention, the inventors carried out comparative simulation experiments using the BERT-based two-stage folk story retrieval method of Embodiment 1 and two traditional retrieval methods: the Term Frequency-Inverse Document Frequency method (hereinafter TF-IDF) and the Best Match 25 method (hereinafter BM25).
Evaluation indexes: the mean reciprocal rank M (hereinafter M), the normalized discounted cumulative gain N (hereinafter N), and the retrieval time T (hereinafter T). M and N evaluate retrieval accuracy; larger M and N indicate better retrieval accuracy.
The mean reciprocal rank M is calculated as follows:

M = (1/B) × Σi (1/bi)

where B = 1000 is the number of query requests in the test set, and for each query qi, bi is the rank position of its first relevant result.
The normalized discounted cumulative gain N is calculated as follows:

N = DCG@10 / IDCG@10, with DCG@10 = Σi=1..10 ei / log2(i + 1)

where the results are sorted by relevance from large to small and the set formed by the top 10 results is taken; ei represents the relevance value of the i-th result, and IDCG@10 is the DCG of the ideal (relevance-sorted) ordering.
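Both evaluation indexes can be sketched directly from their definitions; the logarithmic DCG form used for N below is one common variant and an assumption, since the patent's own formula is not shown:

```python
import math

def mrr(first_hit_ranks):
    """M = (1/B) * sum_i 1/b_i over the B queries' first-relevant ranks."""
    return sum(1.0 / b for b in first_hit_ranks) / len(first_hit_ranks)

def ndcg_at_10(relevances):
    """N = DCG@10 / IDCG@10 with DCG = sum_i e_i / log2(i + 1) -- one common
    DCG variant, assumed here since the source does not show its formula."""
    def dcg(rels):
        return sum(e / math.log2(i + 2) for i, e in enumerate(rels[:10]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0
```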
The results of the experiments and calculations are shown in table 1.
Table 1 simulation experiment result table of the present invention and the comparative search method
As can be seen from Table 1, the mean reciprocal rank M of the Example 1 method is 0.344, higher than the TF-IDF and BM25 methods; the normalized discounted cumulative gain N of the Example 1 method is 0.389, higher than the TF-IDF and BM25 methods; and the retrieval time of the Example 1 method is 4813 seconds, less than those of the TF-IDF and BM25 methods. The experiments show that the method of the invention has better retrieval accuracy and higher retrieval speed.
Claims (6)
1. A two-stage folk story retrieval method based on a BERT model is characterized by comprising the following steps:
(1) collecting folk stories
Part of the folk stories are found in a folk culture resource management system, and a web crawler is used to crawl down their text data, obtaining the folk stories;
(2) folk story data preprocessing
Garbled characters, empty content, and parts inconsistent with the content are deleted, and synonyms in the folk story contents are randomly replaced;
(3) construction of folk story data set
Folk stories are processed into title:content pairs to produce the folk story data set Y, Y = {t1:c1, t2:c2, …, tn:cn}, where tn represents the title of the nth folk story and cn its content; here n = 10000 folk stories are selected and divided into a training set and a test set in a 9:1 ratio;
(4) one-stage construction of vector search engine
The folk story data set Y is converted into word vectors J using the BERT-whitening model, and a database vector D is built from J with the Faiss retrieval library, D = {d1, d2, …, dn}; the database vector D is divided into N spaces using an inverted file index, where N is a finite positive integer, and the vector search engine is thereby constructed;
(5) screening candidate folk story set
The user's query request q is converted into a query vector qV through the BERT-whitening model, and the cosine similarity cos θ between qV and the database vector D is determined as follows:

cos θ = (qV · d) / (‖qV‖ × ‖d‖)

where · denotes the dot product, d denotes a vector in the database vector D, and ‖ ‖ denotes the modulus; the top k candidates are returned as the candidate folk story set G, G = {g1, g2, …, gk}, with k between 20 and 50;
(6) training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, taking the value 0 or 1, and a is the predicted value, a ∈ (0, 1); the learning rate of the model is r ∈ [10⁻⁵, 10⁻³], the dropout rate is in [0.05, 0.1], the number of training epochs is in [10, 15], the batch size of each training round is 8, and the Adam optimizer is used, iterating until the cross-entropy loss function L(y, a) converges;
(7) two-stage determination of relevance
The word embedding E output by the trained BERT model and the output Xl of the l-th encoding layer, where l is a finite positive integer, are used to determine the relevance F1 between the query request q and the candidate folk story set G according to the following formulas:

E = Es + Ep + Et

X1 = E

Q = Xl−1 × WQ

K = Xl−1 × WK

V = Xl−1 × WV

Aj = s((Q × Kᵀ) / √dk) × V

H12 = C(A1, A2, …, A12)

F1 = s(H12)
where H12 represents the output of the multi-head attention calculation, Es represents the sentence (token) embedding, Ep the position embedding, and Et the segment embedding; C denotes the operation of concatenating the attention matrices, Aj denotes the j-th attention matrix, s(H12) denotes the softmax function, Xl−1 is the output of layer l−1 of the BERT model, dk is the dimension of the input vector, j indexes the attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V are the parameter matrices learned during training;
the correlation F is determined as follows:
F = 0.5 × F1 + 0.5 × F2

F2 = Σi wi × ri

wi = s(ri)
where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th sub-segment, wi is the weight of each sub-segment's relevance, and s(ri) denotes the softmax function;
(8) displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest similarity are shown to the user as the final retrieval result.
2. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (3), the title-content folk story pairs are formed by word-segmenting the title and content of each folk story, and the title is split into:
T={t1,t2,…,tu}
where u is the length of the title;
the content is split into:
S={s1,s2,…,sz}
where z is the length of the content.
3. The BERT-based two-stage folk story retrieval method of claim 1, wherein in steps (5) and (7), the query request q is analyzed and word-segmented, then converted into a word vector V through the BERT-whitening model.
4. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (6), the folk story data set Y is input into the BERT model for training, and the cross-entropy loss function L(y, a) is determined as follows:
L(y, a) = -[y × ln a + (1 - y) × ln(1 - a)]
wherein y is the true label (y = 0 or 1), a is the predicted probability (with 0.5 as the classification threshold), the learning rate r of the model is 10^-4, the dropout rate is 0.08, the number of training epochs is 12, the batch size of each epoch is 8, and the Adam optimizer is selected, iterating until the cross-entropy loss function L(y, a) converges.
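A small sketch of the loss and the hyperparameters listed in claim 4, using the standard binary cross-entropy -[y·ln a + (1-y)·ln(1-a)]; the `eps` clipping and the `config` dict are illustrative implementation details, not from the patent:

```python
import numpy as np

def cross_entropy(y, a, eps=1e-12):
    # L(y, a) = -[y*ln(a) + (1-y)*ln(1-a)], y in {0, 1}, a in (0, 1)
    a = np.clip(a, eps, 1 - eps)  # avoid log(0)
    return float(-(y * np.log(a) + (1 - y) * np.log(1 - a)))

# hyperparameters as stated in the claim
config = dict(learning_rate=1e-4, dropout=0.08, epochs=12,
              batch_size=8, optimizer="Adam")

loss = cross_entropy(1, 0.9)  # loss for a positive pair predicted at 0.9
```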
5. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (7), the candidate folk story set G comprises 5 sub-segments, each sub-segment has a length of 128, and the last 10% of each sub-segment overlaps with the beginning of the next sub-segment.
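A hedged sketch of the sub-segment construction in claim 5, assuming the 10% overlap means a stride of length - int(0.1 × length) = 116 tokens (the exact rounding is not specified in the claim):

```python
def split_into_subsegments(tokens, length=128, overlap_ratio=0.1, max_segments=5):
    # the last `overlap` tokens of one sub-segment are repeated
    # at the start of the next sub-segment
    overlap = int(length * overlap_ratio)  # 12 tokens for length 128
    stride = length - overlap              # 116 tokens
    segments = []
    for start in range(0, len(tokens), stride):
        segments.append(tokens[start:start + length])
        if len(segments) == max_segments or start + length >= len(tokens):
            break
    return segments

segs = split_into_subsegments(list(range(300)))  # toy 300-token story
```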
6. The BERT-based two-stage folk story retrieval method of claim 1, wherein the formula in step (7) is:
X_1 = E
wherein X_l represents the output of the l-th coding layer of the BERT model, l ∈ [1, 12].
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210188618.XA CN114547251B (en) | 2022-02-28 | 2022-02-28 | BERT-based two-stage folk story retrieval method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114547251A (en) | 2022-05-27 |
CN114547251B CN114547251B (en) | 2024-03-01 |
Family
ID=81679714
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210188618.XA Active CN114547251B (en) | 2022-02-28 | 2022-02-28 | BERT-based two-stage folk story retrieval method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114547251B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023246337A1 (en) * | 2022-06-24 | 2023-12-28 | 京东方科技集团股份有限公司 | Unsupervised semantic retrieval method and apparatus, and computer-readable storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2012173794A (en) * | 2011-02-17 | 2012-09-10 | Nippon Telegr & Teleph Corp <Ntt> | Document retrieval device having ranking model selection function, document retrieval method having ranking model selection function, and document retrieval program having ranking model selection function |
CN112256860A (en) * | 2020-11-25 | 2021-01-22 | 携程计算机技术(上海)有限公司 | Semantic retrieval method, system, equipment and storage medium for customer service conversation content |
WO2021211207A1 (en) * | 2020-04-17 | 2021-10-21 | Microsoft Technology Licensing, Llc | Adversarial pretraining of machine learning models |
CN113553824A (en) * | 2021-07-07 | 2021-10-26 | 临沂中科好孕智能技术有限公司 | Sentence vector model training method |
CN113962228A (en) * | 2021-10-26 | 2022-01-21 | 北京理工大学 | Long document retrieval method based on semantic fusion of memory network |
Non-Patent Citations (2)
Title |
---|
赵亚慧: "A Large-Capacity Text Retrieval Algorithm", Journal of Yanbian University (Natural Science Edition), no. 01, 20 March 2009 (2009-03-20) * |
顾迎捷; 桂小林; 李德福; 沈毅; 廖东: "A Survey of Neural-Network-Based Machine Reading Comprehension", Journal of Software, no. 07, 15 July 2020 (2020-07-15) * |
Also Published As
Publication number | Publication date |
---|---|
CN114547251B (en) | 2024-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111353030B (en) | Knowledge question and answer retrieval method and device based on knowledge graph in travel field | |
WO2021093755A1 (en) | Matching method and apparatus for questions, and reply method and apparatus for questions | |
CN112035730B (en) | Semantic retrieval method and device and electronic equipment | |
CN108846029B (en) | Information correlation analysis method based on knowledge graph | |
CN111291188B (en) | Intelligent information extraction method and system | |
CN108132927B (en) | Keyword extraction method for combining graph structure and node association | |
US20180341686A1 (en) | System and method for data search based on top-to-bottom similarity analysis | |
CN106708929B (en) | Video program searching method and device | |
CN112307182B (en) | Question-answering system-based pseudo-correlation feedback extended query method | |
CN115599902B (en) | Oil-gas encyclopedia question-answering method and system based on knowledge graph | |
CN113761890A (en) | BERT context sensing-based multi-level semantic information retrieval method | |
CN110851584A (en) | Accurate recommendation system and method for legal provision | |
CN116306504B (en) | Candidate entity generation method and device, storage medium and electronic equipment | |
CN106570196B (en) | Video program searching method and device | |
CN113806510B (en) | Legal provision retrieval method, terminal equipment and computer storage medium | |
CN114547251B (en) | BERT-based two-stage folk story retrieval method | |
CN116401345A (en) | Intelligent question-answering method, device, storage medium and equipment | |
CN113342950B (en) | Answer selection method and system based on semantic association | |
CN112487274B (en) | Search result recommendation method and system based on text click rate | |
CN114297351A (en) | Statement question and answer method, device, equipment, storage medium and computer program product | |
CN117634615A (en) | Multi-task code retrieval method based on mode irrelevant comparison learning | |
CN117633148A (en) | Medical term standardization method based on fusion multi-strategy comparison learning | |
CN116204622A (en) | Query expression enhancement method in cross-language dense retrieval | |
CN113627722B (en) | Simple answer scoring method based on keyword segmentation, terminal and readable storage medium | |
CN112199461B (en) | Document retrieval method, device, medium and equipment based on block index structure |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||