CN114547251A - Two-stage folk story retrieval method based on BERT - Google Patents

Two-stage folk story retrieval method based on BERT

Info

Publication number
CN114547251A
CN114547251A
Authority
CN
China
Prior art keywords: folk, story, bert, vector, folk story
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210188618.XA
Other languages
Chinese (zh)
Other versions
CN114547251B (en)
Inventor
吴晓军
刘隆涛
张玉梅
路纲
赵力
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shaanxi Normal University
Original Assignee
Shaanxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shaanxi Normal University
Priority to CN202210188618.XA
Publication of CN114547251A
Application granted
Publication of CN114547251B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/319 Inverted lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3341 Query execution using boolean model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A two-stage folk story retrieval method based on a BERT model comprises the steps of collecting folk stories, preprocessing the folk story data, constructing a folk story data set, constructing a vector search engine in the first stage, screening a candidate folk story set, training the BERT model, determining relevance in the second stage, and displaying the retrieval results. Experimental results show that, compared with traditional retrieval methods, the method better understands the contextual information of folk stories, better matches the query request with the folk stories, improves retrieval accuracy, and speeds up retrieval. The invention is characterized by accurate retrieval results and high retrieval speed, and can accurately find the folk stories a user wants to learn about from a massive collection of folk stories.

Description

Two-stage folk story retrieval method based on BERT
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a computer information retrieval system.
Background
The 21st century is an information age. The development of the Internet has made the collection of folk stories simple, so the number of folk stories available to people has grown enormously. At the same time, the threshold and difficulty of text information processing keep increasing, and so do the requirements on the quality and efficiency of text retrieval technology. Finding a folk story that meets a given requirement among a large number of folk stories usually takes a lot of time, and the retrieval results often fall short of expectations. There are many traditional retrieval methods, such as methods based on text similarity calculation, ontology-based retrieval methods, and clustering-based retrieval methods. First, if the data set to be retrieved is large, traditional retrieval methods are extremely time-consuming and their retrieval accuracy is low. Second, folk stories often contain rich textual content, and relying solely on shallow text features is far from sufficient. It is therefore very important to find a new retrieval method. Folk stories contain rich historical knowledge and deep national sentiment, and they are both varied and numerous. How to query the relevant folk stories from such a varied and huge collection has become the difficulty of folk story retrieval.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a two-stage folk story retrieval method based on a BERT model, which has high retrieval accuracy and high retrieval speed.
The technical scheme for solving the technical problems comprises the following steps:
(1) Collecting folk stories
A part of the folk stories is located in a folk culture resource management system, and their text data is crawled with a web crawler to obtain the folk stories.
(2) Folk story data preprocessing
Garbled characters, empty entries, and passages inconsistent with the content are deleted from the folk stories, and synonyms in the content are randomly replaced.
(3) Construction of folk story data set
The folk stories are processed into title-content folk story pairs to make the folk story data set Y, Y ∈ {t1:c1, t2:c2, …, tn:cn}, where tn denotes the title of the nth folk story and cn denotes the content of the nth folk story; n = 10000 folk stories are selected and divided into a training set and a test set at a ratio of 9:1.
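For illustration only, a minimal Python sketch of this dataset construction step could look as follows (the input file name and the field names "title" and "content" are assumptions about the crawler output):

```python
import json
import random

def build_dataset(stories, train_ratio=0.9, seed=42):
    """Build title:content pairs Y and split them into training and test sets at 9:1."""
    pairs = [{"title": s["title"], "content": s["content"]} for s in stories]
    random.Random(seed).shuffle(pairs)
    split = int(len(pairs) * train_ratio)   # 9:1 split
    return pairs[:split], pairs[split:]

if __name__ == "__main__":
    # "folk_stories.json" and its field names are hypothetical crawler output.
    with open("folk_stories.json", encoding="utf-8") as f:
        stories = json.load(f)
    train_set, test_set = build_dataset(stories)
    print(len(train_set), len(test_set))    # e.g. 9000 / 1000 for n = 10000 stories
```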
(4) One-stage construction of vector search engine
The folk story data set Y is converted into word vectors J with the BERT-whitening model, a database vector set D ∈ {d1, d2, …, dn} is built from the word vectors J with the Faiss retrieval method, and D is partitioned into N spaces, N being a finite positive integer, with an inverted fast-index method, thereby constructing the vector search engine.
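For illustration only, a minimal Python sketch of this one-stage vector search engine could look as follows, using the Faiss library's inverted-file (IVF) index as one concrete realisation of the partition into N spaces (the vector dimension, the number of cells, and the file names are assumptions):

```python
import faiss
import numpy as np

# J: vectors of the folk stories produced by BERT-whitening, assumed to be an
# (n, dim) float32 array saved by an earlier step; dim = 256 is an assumption
# (BERT-whitening is commonly reduced to 256 dimensions).
dim, n_cells = 256, 100                  # n_cells plays the role of the N spaces
J = np.load("story_vectors.npy").astype("float32")
faiss.normalize_L2(J)                    # normalise so inner product equals cosine similarity

quantizer = faiss.IndexFlatIP(dim)       # coarse quantizer for the inverted file
index = faiss.IndexIVFFlat(quantizer, dim, n_cells, faiss.METRIC_INNER_PRODUCT)
index.train(J)                           # learn the partition of D into N cells
index.add(J)                             # D = {d1, d2, ..., dn}
faiss.write_index(index, "folk_story.index")
```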
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV with the BERT-whitening model, and the cosine similarity cos θ between the query vector qV and the database vectors D is determined as follows:

cos θ=(qV·d)/(||qV||×||d||)

where · denotes the dot product operation, d denotes a vector in the database vector set, and || || denotes the modulus operation. The top k candidates are returned as the candidate folk story set G, G ∈ {g1, g2, …, gk}, where k is 20 to 50.
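For illustration only, a minimal Python sketch of the candidate screening step could look as follows; with L2-normalised vectors the inner product returned by Faiss equals the cosine similarity cos θ defined above (the nprobe setting and the value k = 40 are assumptions within the stated range):

```python
import faiss
import numpy as np

def screen_candidates(query_vec, index, k=40):
    """Return ids and cosine similarities of the top-k candidate folk stories.

    query_vec is the BERT-whitening vector qV of the query request q; because the
    database vectors were L2-normalised, the inner-product scores equal cos θ.
    """
    q = np.asarray(query_vec, dtype="float32").reshape(1, -1)
    faiss.normalize_L2(q)
    index.nprobe = 10                    # number of index cells to visit (assumption)
    scores, ids = index.search(q, k)     # top-k candidates G = {g1, ..., gk}
    return ids[0], scores[0]

# index = faiss.read_index("folk_story.index")
# candidate_ids, cos_sims = screen_candidates(qV, index, k=40)   # k in [20, 50]
```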
(6) Training BERT model
The folk story data set Y is input into the BERT model for training, and the cross-entropy loss function L(y, a) is determined as follows:

L(y,a)=y×lna+(1-y)×ln(1-a)

where y is the true value, y takes the value 0 or 1, a is the predicted value, and a ∈ (0, 1); the learning rate of the model r ∈ [10^-5, 10^-3], the dropout rate is [0.05, 0.1], the number of training epochs is [10, 15], the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges.
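For illustration only, a minimal PyTorch/transformers fine-tuning sketch with the hyper-parameter ranges given above could look as follows; the checkpoint name "bert-base-chinese" and the pairing of title/content inputs with labels y ∈ {0, 1} are assumptions, and the cross-entropy loss is the one applied internally by the classification head:

```python
import torch
from torch.optim import Adam
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-chinese", num_labels=2, hidden_dropout_prob=0.08)  # dropout in [0.05, 0.1]
optimizer = Adam(model.parameters(), lr=1e-4)                     # r in [1e-5, 1e-3]

def train(pairs, labels, epochs=12, batch_size=8):
    """Fine-tune on (title, content) pairs with labels y in {0, 1}."""
    model.train()
    for _ in range(epochs):                                       # epochs in [10, 15]
        for i in range(0, len(pairs), batch_size):
            titles, contents = zip(*pairs[i:i + batch_size])
            batch = tokenizer(list(titles), list(contents), padding=True,
                              truncation=True, max_length=512, return_tensors="pt")
            y = torch.tensor(labels[i:i + batch_size])
            loss = model(**batch, labels=y).loss   # cross-entropy applied by the model head
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```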
(7) Two-stage determination of relevance
Let E be the word embedding output by the trained BERT model and Xl the output of the l-th encoding layer, where l is a finite positive integer. For the query request q and the candidate folk story set G, the relevance F1 of the candidate folk story set G is determined as follows:

E=Es+Ep+Et

X1=E

Xl=C(A1,A2,…,Aj)

Aj=s(Q×K^T/√dk)×V

Q=Xl-1×WQ

K=Xl-1×WK

V=Xl-1×WV

F1=s(H12)

where Xl=C(A1,…,Aj) represents the output of the multi-head attention computation, Es denotes the sentence (segment) embedding, Ep denotes the position embedding, Et denotes the token embedding, C denotes the operation of concatenating the attention matrices, Aj denotes the attention matrix of the j-th head, s(·) denotes the softmax function and H12 the output of the 12th encoding layer, Xl-1 is the layer l-1 output of the BERT model, dk is the dimension of the input vectors, j represents the number of attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V represent the parameter matrices learned during training.
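For illustration only, a minimal PyTorch sketch of the multi-head attention computation expressed by the formulas above could look as follows (the representation of WQ, WK, WV as lists of per-head matrices is an assumption):

```python
import math
import torch

def multi_head_layer(X_prev, W_Q, W_K, W_V):
    """One layer's multi-head attention following the formulas above.

    X_prev is X_{l-1}; W_Q, W_K, W_V are lists of per-head linear mapping matrices.
    The head outputs A_j are concatenated by C to give the layer output X_l.
    """
    d_k = W_K[0].shape[1]                                   # dimension of the projected vectors
    heads = []
    for Wq, Wk, Wv in zip(W_Q, W_K, W_V):
        Q, K, V = X_prev @ Wq, X_prev @ Wk, X_prev @ Wv     # Q, K, V = X_{l-1} x W
        A = torch.softmax(Q @ K.T / math.sqrt(d_k), dim=-1) @ V   # A_j = s(QK^T/sqrt(dk))V
        heads.append(A)
    return torch.cat(heads, dim=-1)                         # X_l = C(A_1, ..., A_j)
```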
The correlation F is determined as follows:
F=0.5×F1+0.5×F2
F2=Σi wi×ri

wi=s(ri)

where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th candidate folk story sub-segment, wi represents the weight of each sub-segment's relevance, and s(ri) represents the softmax function.
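For illustration only, a minimal Python sketch of this score combination could look as follows (the example score values are made up):

```python
import torch
import torch.nn.functional as F

def combined_relevance(f1, segment_sims):
    """Combine the whole-story score F1 with the sub-segment score F2.

    f1 is the BERT relevance score of the candidate story; segment_sims are the
    similarities r_i between the query and its sub-segments.
    """
    r = torch.tensor(segment_sims, dtype=torch.float32)
    w = F.softmax(r, dim=0)             # w_i = s(r_i)
    f2 = torch.sum(w * r).item()        # F2 = sum_i w_i * r_i
    return 0.5 * f1 + 0.5 * f2          # F = 0.5*F1 + 0.5*F2

# Example with made-up scores:
# combined_relevance(0.81, [0.7, 0.4, 0.6, 0.3, 0.5])
```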
(8) Displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest relevance are presented to the user as the final retrieval result.
In step (3) of the present invention, the title-content folk story pair is formed as follows: the title and the content of the folk story are segmented into words, and the title T is divided into:
T={t1,t2,…,tu}
where u is the length of the title;
the content S is split into:
S={s1,s2,…,sz}
where z is the length of the content.
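For illustration only, a minimal Python sketch of this word segmentation step could look as follows (the use of the jieba segmenter is an assumption; the patent does not name a specific word segmentation tool):

```python
import jieba  # assumption: jieba stands in here as a representative Chinese word segmenter

def segment_pair(title, content):
    """Segment a folk story title and content into word lists T and S."""
    T = list(jieba.cut(title))    # T = {t1, t2, ..., tu}, u = len(T)
    S = list(jieba.cut(content))  # S = {s1, s2, ..., sz}, z = len(S)
    return T, S
```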
In steps (5) and (7) of the present invention, the query request q is parsed and word-segmented, and then converted into a word vector V through the BERT-whitening model.
In step (6) of the invention, the folk story data set Y is input into the BERT model for training, and the cross-entropy loss function L(y, a) is determined as follows:

L(y,a)=y×lna+(1-y)×ln(1-a)

where y is the true value, y takes the value 0 or 1, a is the predicted value, the optimal value of a is 0.5, the optimal learning rate r of the model is 10^-4, the optimal dropout rate is 0.08, the optimal number of training epochs is 12, the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges.
In step (7) of the present invention, each candidate folk story in the candidate folk story set G comprises 5 sub-segments, each sub-segment has a length of 128, and there is a 10% content overlap between the end of the previous sub-segment and the beginning of the next sub-segment.
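For illustration only, a minimal Python sketch of this sub-segment splitting could look as follows (how a story shorter than the required span is padded or truncated is an assumption not fixed by the text):

```python
def split_into_subsegments(tokens, n_segments=5, seg_len=128, overlap_ratio=0.1):
    """Split one candidate folk story into 5 sub-segments of length 128 with ~10% overlap."""
    stride = int(seg_len * (1 - overlap_ratio))   # 10% of each segment repeats in the next
    segments = []
    for i in range(n_segments):
        start = i * stride
        segments.append(tokens[start:start + seg_len])
    return segments
```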
In the following formula of step (7) of the present invention:

X1=E

where Xl represents the output of the l-th encoding layer of the BERT model, and l takes values in [1, 12].
Compared with the prior art, the invention has the following advantages:
because the invention adopts a two-stage retrieval method and combines a Faiss retrieval algorithm with the BERT pre-training language model, the model can better understand the context information of folk stories, the query request is firstly screened out a candidate folk story set and then the BERT model is utilized to carry out relevancy calculation, the traditional retrieval speed is improved, the retrieval accuracy is increased, and the retrieval result is accurate. The invention can help the user to quickly and accurately find the folk stories which the user wants to know in a large number of folk stories, reduces the waiting time of the user, stimulates the interest of the user in the folk stories and is beneficial to the propagation of traditional culture.
Drawings
FIG. 1 is a flowchart of Example 1 of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and examples, but the present invention is not limited to the embodiments described below.
Example 1
In fig. 1, the BERT-based two-stage folk story retrieval method of the present embodiment is composed of the following steps:
(1) Collecting folk stories
A part of the folk stories is located in a folk culture resource management system, and their text data is crawled with a web crawler to obtain the folk stories.
(2) Folk story data preprocessing
Garbled characters, empty entries, and passages inconsistent with the content are deleted from the folk stories, and synonyms in the content are randomly replaced.
(3) Construction of folk story data set
The folk stories are processed into title-content folk story pairs to make the folk story data set Y, Y ∈ {t1:c1, t2:c2, …, tn:cn}, where tn denotes the title of the nth folk story and cn denotes the content of the nth folk story; n = 10000 folk stories are selected and divided into a training set and a test set at a ratio of 9:1.
The title-content folk story pair is formed as follows: the title and the content of the folk story are segmented into words, and the title is divided into:
T={t1,t2,…,tu}
where u is the length of the title.
The content is split into:
S={s1,s2,…,sz}
where z is the length of the content.
(4) One-stage construction of vector search engine
The folk story data set Y is converted into word vectors J with the BERT-whitening model, a database vector set D ∈ {d1, d2, …, dn} is built from the word vectors J with the Faiss retrieval method, and D is partitioned into N spaces, N being a finite positive integer, with an inverted fast-index method, thereby constructing the vector search engine.
(5) Screening candidate folk story set

The user's query request q is converted into a query vector qV with the BERT-whitening model, and the cosine similarity cos θ between the query vector qV and the database vectors D is determined as follows:

cos θ=(qV·d)/(||qV||×||d||)

where · denotes the dot product operation, d denotes a vector in the database vector set, and || || denotes the modulus operation. The top k candidates are returned as the candidate folk story set G, G ∈ {g1, g2, …, gk}, where k is 20 to 50; in this embodiment k is 40.
In this embodiment, the query request is parsed and word-segmented, and then converted into a word vector V through the BERT-whitening model.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, y takes the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model r ∈ [10^-5, 10^-3], the dropout rate is [0.05, 0.1], and the number of training epochs is [10, 15]; the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.5, the learning rate r of the model takes the value 10^-4, the dropout rate is 0.08, and the number of training epochs is 12.
(7) Two-stage determination of relevance
Let E be the word embedding output by the trained BERT model and Xl the output of the l-th encoding layer, where l is a finite positive integer; in this embodiment l takes the value 6. For the query request q and the candidate folk story set G, the relevance F1 of the candidate folk story set G is determined as follows:

E=Es+Ep+Et

X1=E

Xl=C(A1,A2,…,Aj)

Aj=s(Q×K^T/√dk)×V

Q=Xl-1×WQ

K=Xl-1×WK

V=Xl-1×WV

F1=s(H12)

where Xl=C(A1,…,Aj) represents the output of the multi-head attention computation, Es denotes the sentence (segment) embedding, Ep denotes the position embedding, Et denotes the token embedding, C denotes the operation of concatenating the attention matrices, Aj denotes the attention matrix of the j-th head, s(·) denotes the softmax function and H12 the output of the 12th encoding layer, Xl-1 is the layer l-1 output of the BERT model, dk is the dimension of the input vectors, j represents the number of attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V represent the parameter matrices learned during training.
The correlation F is determined as follows:
F=0.5×F1+0.5×F2
F2=Σi wi×ri

wi=s(ri)

where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th candidate folk story sub-segment, wi represents the weight of each sub-segment's relevance, and s(ri) represents the softmax function.
(8) Displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest relevance are presented to the user as the final retrieval result.
And completing the two-stage folk story retrieval method based on BERT.
Example 2
The BERT-based two-stage folk story retrieval method of the embodiment comprises the following steps of:
(1) gathering folk stories
This procedure is the same as in example 1.
(2) Folk story data preprocessing
This procedure is the same as in example 1.
(3) Construction of folk story data set
This procedure is the same as in example 1.
(4) One-stage construction of vector search engine
This procedure is the same as in example 1.
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV with the BERT-whitening model, and the cosine similarity cos θ between the query vector qV and the database vectors D is determined as follows:

cos θ=(qV·d)/(||qV||×||d||)

where · denotes the dot product operation, d denotes a vector in the database vector set, and || || denotes the modulus operation. The top k candidates are returned as the candidate folk story set G, G ∈ {g1, g2, …, gk}, where k is 20 to 50; in this embodiment k is 20.
The other steps of this step are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, y takes the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model r ∈ [10^-5, 10^-3], the dropout rate is [0.05, 0.1], and the number of training epochs is [10, 15]; the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.01, the learning rate r of the model takes the value 10^-5, the dropout rate is 0.05, and the number of training epochs is 10.
The other steps of this step are the same as in example 1.
(7) Two-stage determination of relevance
Let E be the word embedding output by the trained BERT model and Xl the output of the l-th encoding layer, where l is a finite positive integer; in this embodiment l takes the value 6. The relevance F1 of the candidate folk story set G and the relevance F are determined in the same way as in example 1.
The other steps were the same as in example 1.
And completing the two-stage folk story retrieval method based on BERT.
Example 3
The BERT-based two-stage folk story retrieval method of the embodiment comprises the following steps of:
(1) gathering folk stories
This procedure is the same as in example 1.
(2) Folk story data preprocessing
This procedure is the same as in example 1.
(3) Construction of folk story data set
This procedure is the same as in example 1.
(4) One-stage construction of vector search engine
This procedure is the same as in example 1.
(5) Screening candidate folk story set
The user's query request q is converted into a query vector qV with the BERT-whitening model, and the cosine similarity cos θ between the query vector qV and the database vectors D is determined as follows:

cos θ=(qV·d)/(||qV||×||d||)

where · denotes the dot product operation, d denotes a vector in the database vector set, and || || denotes the modulus operation. The top k candidates are returned as the candidate folk story set G, G ∈ {g1, g2, …, gk}, where k is 20 to 50; in this embodiment k is 50.
The other steps of this step are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, y takes the value 0, a is the predicted value, a ∈ (0, 1), the learning rate of the model r ∈ [10^-5, 10^-3], the dropout rate is [0.05, 0.1], and the number of training epochs is [10, 15]; the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges. In this embodiment, a takes the value 0.09, the learning rate r of the model takes the value 10^-3, the dropout rate is 0.1, and the number of training epochs is 15.
The other steps of this step are the same as in example 1.
(7) Two-stage determination of relevance
Let E be the word embedding output by the trained BERT model and Xl the output of the l-th encoding layer, where l is a finite positive integer; in this embodiment l takes the value 12. The relevance F1 of the candidate folk story set G and the relevance F are determined in the same way as in example 1.
The other steps were the same as in example 1.
And completing the two-stage folk story retrieval method based on BERT.
Example 4
On the basis of embodiments 1 to 3 above, the BERT-based two-stage folk story retrieval method of this embodiment comprises the following steps:
(1) steps (1) to (5) are the same as in example 1.
(6) Training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value and y takes the value 1; the other parameters are the same as in the corresponding embodiment.
The other steps were the same as in example 1.
And completing the two-stage folk story retrieval method based on BERT.
In order to verify the beneficial effects of the invention, the inventors carried out comparative simulation experiments using the BERT-based two-stage folk story retrieval method of embodiment 1 of the invention and two traditional retrieval methods, the Term Frequency-Inverse Document Frequency method (hereinafter abbreviated as TF-IDF) and the Best Match 25 method (hereinafter abbreviated as BM25).
The evaluation indexes are as follows: the mean reciprocal rank M (hereinafter referred to as M), the normalized discounted cumulative gain N (hereinafter referred to as N), and the retrieval time T (hereinafter referred to as T). M and N are indexes for evaluating retrieval accuracy; larger M and N indicate better retrieval accuracy.
The mean reciprocal rank M is calculated as follows:

M=(1/B)×Σ(1/bi), i=1,…,B

where B = 1000 is the number of query requests in the test set and, for each query qi, bi denotes the rank position of its first relevant result.
The normalized discounted cumulative gain N is calculated as follows:

N=DCG10/IDCG10, DCG10=Σ(2^ei-1)/log2(i+1), i=1,…,10

where the results are sorted by relevance from high to low and the set formed by the top 10 results is taken, ei represents the relevance value of the i-th result, and IDCG10 is the DCG10 of the ideal ranking.
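For illustration only, minimal Python sketches of the two accuracy indexes could look as follows; the MRR follows the formula above, while the DCG form used inside the NDCG is a standard one assumed here because the patent gives the formula only as an image:

```python
import math

def mean_reciprocal_rank(first_relevant_ranks):
    """M = (1/B) * sum(1 / b_i) over the B test queries."""
    return sum(1.0 / b for b in first_relevant_ranks) / len(first_relevant_ranks)

def ndcg_at_10(relevances):
    """NDCG over the top-10 results; relevances are the e_i values in ranked order."""
    top = relevances[:10]
    dcg = sum((2 ** e - 1) / math.log2(i + 2) for i, e in enumerate(top))
    ideal = sorted(relevances, reverse=True)[:10]
    idcg = sum((2 ** e - 1) / math.log2(i + 2) for i, e in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```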
The results of the experiments and calculations are shown in table 1.
Table 1. Simulation experiment results of the method of the invention and the comparative retrieval methods
As can be seen from Table 1, the mean reciprocal rank M of the method of example 1 is 0.344, higher than that of the TF-IDF and BM25 methods; the normalized discounted cumulative gain N of the method of example 1 is 0.389, higher than that of the TF-IDF and BM25 methods; and the retrieval time of the method of example 1 is 4813 seconds, less than that of the TF-IDF and BM25 methods. The experiments show that the method of the invention achieves better retrieval accuracy and faster retrieval.

Claims (6)

1. A two-stage folk story retrieval method based on a BERT model is characterized by comprising the following steps:
(1) collecting folk stories
A part of the folk stories is located in a folk culture resource management system, and their text data is crawled with a web crawler to obtain the folk stories;
(2) folk story data preprocessing
Garbled characters, empty entries, and passages inconsistent with the content are deleted from the folk stories, and synonyms in the content are randomly replaced;
(3) construction of folk story data set
The folk stories are processed into title-content folk story pairs to make the folk story data set Y, Y ∈ {t1:c1, t2:c2, …, tn:cn}, where tn denotes the title of the nth folk story and cn denotes the content of the nth folk story; n = 10000 folk stories are selected and divided into a training set and a test set at a ratio of 9:1;
(4) one-stage construction of vector search engine
The folk story data set Y is converted into word vectors J with the BERT-whitening model, a database vector set D ∈ {d1, d2, …, dn} is built from the word vectors J with the Faiss retrieval method, and D is partitioned into N spaces, N being a finite positive integer, with an inverted fast-index method, thereby constructing the vector search engine;
(5) screening candidate folk story set
The user's query request q is converted into a query vector qV with the BERT-whitening model, and the cosine similarity cos θ between the query vector qV and the database vectors D is determined as follows:

cos θ=(qV·d)/(||qV||×||d||)

where · denotes the dot product operation, d denotes a vector in the database vector set, and || || denotes the modulus operation; the top k candidates are returned as the candidate folk story set G, G ∈ {g1, g2, …, gk}, where k is 20 to 50;
(6) training BERT model
Inputting the folk story data set Y into a BERT model for training, and determining a cross entropy loss function L (Y, a) according to the following formula:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, y takes the value 0 or 1, a is the predicted value, and a ∈ (0, 1); the learning rate of the model r ∈ [10^-5, 10^-3], the dropout rate is [0.05, 0.1], the number of training epochs is [10, 15], the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges;
(7) two-stage determination of relevance
Let E be the word embedding output by the trained BERT model and Xl the output of the l-th encoding layer, where l is a finite positive integer; for the query request q and the candidate folk story set G, the relevance F1 of the candidate folk story set G is determined as follows:

E=Es+Ep+Et

X1=E

Xl=C(A1,A2,…,Aj)

Aj=s(Q×K^T/√dk)×V

Q=Xl-1×WQ

K=Xl-1×WK

V=Xl-1×WV

F1=s(H12)

where Xl=C(A1,…,Aj) represents the output of the multi-head attention computation, Es denotes the sentence (segment) embedding, Ep denotes the position embedding, Et denotes the token embedding, C denotes the operation of concatenating the attention matrices, Aj denotes the attention matrix of the j-th head, s(·) denotes the softmax function and H12 the output of the 12th encoding layer, Xl-1 is the layer l-1 output of the BERT model, dk is the dimension of the input vectors, j represents the number of attention heads, WQ, WK, WV are the linear mapping matrices, and Q, K, V represent the parameter matrices learned during training;
the correlation F is determined as follows:
F=0.5×F1+0.5×F2
F2=Σi wi×ri

wi=s(ri)

where F2 represents the weighted sum of the similarities between the query request and the candidate folk story sub-segments, ri represents the similarity between the query request and the i-th candidate folk story sub-segment, wi represents the weight of each sub-segment's relevance, and s(ri) represents the softmax function;
(8) displaying the search results
The relevance scores F are sorted from high to low, and the folk stories with the highest relevance are presented to the user as the final retrieval result.
2. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (3), the title-content folk story pair is formed as follows: the title and the content of the folk story are segmented into words, and the title is divided into:
T={t1,t2,…,tu}
where u is the length of the title;
the content is split into:
S={s1,s2,…,sz}
where z is the length of the content.
3. The BERT-based two-stage folk story retrieval method of claim 1, wherein in steps (5) and (7), the query request q is parsed and word-segmented, and then converted into a word vector V through the BERT-whitening model.
4. The BERT-based two-stage folk story retrieval method of claim 1, wherein in the step (6), the folk story data set Y is inputted to the BERT model for training, and the cross entropy loss function L (Y, a) is determined as follows:
L(y,a)=y×lna+(1-y)×ln(1-a)
where y is the true value, y takes the value 0 or 1, a is the predicted value, a takes the value 0.5, the learning rate r of the model takes the value 10^-4, the dropout rate is 0.08, the number of training epochs is 12, the batch size of each training epoch is 8, the optimizer is Adam, and training iterates until the cross-entropy loss function L(y, a) converges.
5. The BERT-based two-stage folk story retrieval method of claim 1, wherein in step (7), the candidate folk story set G comprises 5 sub-segments, each sub-segment has a length of 128, and there is a 10% content overlap between the end of the previous sub-segment and the beginning of the next sub-segment.
6. The BERT-based two-stage folk story retrieval method of claim 1, wherein in the following formula in (7):
X1=E
where Xl represents the output of the l-th encoding layer of the BERT model, and l takes values in [1, 12].
CN202210188618.XA 2022-02-28 2022-02-28 BERT-based two-stage folk story retrieval method Active CN114547251B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210188618.XA CN114547251B (en) 2022-02-28 2022-02-28 BERT-based two-stage folk story retrieval method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210188618.XA CN114547251B (en) 2022-02-28 2022-02-28 BERT-based two-stage folk story retrieval method

Publications (2)

Publication Number Publication Date
CN114547251A true CN114547251A (en) 2022-05-27
CN114547251B CN114547251B (en) 2024-03-01

Family

ID=81679714

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210188618.XA Active CN114547251B (en) 2022-02-28 2022-02-28 BERT-based two-stage folk story retrieval method

Country Status (1)

Country Link
CN (1) CN114547251B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2012173794A (en) * 2011-02-17 2012-09-10 Nippon Telegr & Teleph Corp <Ntt> Document retrieval device having ranking model selection function, document retrieval method having ranking model selection function, and document retrieval program having ranking model selection function
WO2021211207A1 (en) * 2020-04-17 2021-10-21 Microsoft Technology Licensing, Llc Adversarial pretraining of machine learning models
CN112256860A (en) * 2020-11-25 2021-01-22 携程计算机技术(上海)有限公司 Semantic retrieval method, system, equipment and storage medium for customer service conversation content
CN113553824A (en) * 2021-07-07 2021-10-26 临沂中科好孕智能技术有限公司 Sentence vector model training method
CN113962228A (en) * 2021-10-26 2022-01-21 北京理工大学 Long document retrieval method based on semantic fusion of memory network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
赵亚慧: "大容量文本检索算法" [Large-capacity text retrieval algorithms], 延边大学学报(自然科学版) [Journal of Yanbian University (Natural Science Edition)], no. 01, 20 March 2009 (2009-03-20) *
顾迎捷; 桂小林; 李德福; 沈毅; 廖东: "基于神经网络的机器阅读理解综述" [Survey of machine reading comprehension based on neural networks], 软件学报 [Journal of Software], no. 07, 15 July 2020 (2020-07-15) *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023246337A1 (en) * 2022-06-24 2023-12-28 京东方科技集团股份有限公司 Unsupervised semantic retrieval method and apparatus, and computer-readable storage medium

Also Published As

Publication number Publication date
CN114547251B (en) 2024-03-01

Similar Documents

Publication Publication Date Title
CN111353030B (en) Knowledge question and answer retrieval method and device based on knowledge graph in travel field
WO2021093755A1 (en) Matching method and apparatus for questions, and reply method and apparatus for questions
CN112035730B (en) Semantic retrieval method and device and electronic equipment
CN108846029B (en) Information correlation analysis method based on knowledge graph
CN111291188B (en) Intelligent information extraction method and system
CN108132927B (en) Keyword extraction method for combining graph structure and node association
US20180341686A1 (en) System and method for data search based on top-to-bottom similarity analysis
CN106708929B (en) Video program searching method and device
CN112307182B (en) Question-answering system-based pseudo-correlation feedback extended query method
CN115599902B (en) Oil-gas encyclopedia question-answering method and system based on knowledge graph
CN113761890A (en) BERT context sensing-based multi-level semantic information retrieval method
CN110851584A (en) Accurate recommendation system and method for legal provision
CN116306504B (en) Candidate entity generation method and device, storage medium and electronic equipment
CN106570196B (en) Video program searching method and device
CN113806510B (en) Legal provision retrieval method, terminal equipment and computer storage medium
CN114547251B (en) BERT-based two-stage folk story retrieval method
CN116401345A (en) Intelligent question-answering method, device, storage medium and equipment
CN113342950B (en) Answer selection method and system based on semantic association
CN112487274B (en) Search result recommendation method and system based on text click rate
CN114297351A (en) Statement question and answer method, device, equipment, storage medium and computer program product
CN117634615A (en) Multi-task code retrieval method based on mode irrelevant comparison learning
CN117633148A (en) Medical term standardization method based on fusion multi-strategy comparison learning
CN116204622A (en) Query expression enhancement method in cross-language dense retrieval
CN113627722B (en) Simple answer scoring method based on keyword segmentation, terminal and readable storage medium
CN112199461B (en) Document retrieval method, device, medium and equipment based on block index structure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant