WO2017114110A1 - Document retrieval matching method - Google Patents

Document retrieval matching method

Info

Publication number
WO2017114110A1
Authority
WO
WIPO (PCT)
Prior art keywords
document
formula
index
matched
matching method
Prior art date
Application number
PCT/CN2016/108775
Other languages
English (en)
French (fr)
Inventor
杜南山
Original Assignee
语联网(武汉)信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 语联网(武汉)信息技术有限公司
Publication of WO2017114110A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/93 Document management systems

Definitions

  • The technical field of the present invention is natural language processing and information retrieval, in particular a document retrieval matching method.
  • t represents a word
  • Q represents a query document
  • D represents a document to be matched.
  • k1 generally takes a value of 1.0 to 2.0
  • b generally takes a value of 0.75
  • k3 generally takes a value of 0 to 1000; k1, b, and k3 are all constants.
  • The technical problem to be solved by the present invention is to provide a document retrieval matching method that improves the execution efficiency of the program implementing the algorithm and reduces the time required for full-text retrieval.
  • The present invention provides a document retrieval matching method that includes a step of calculating the relevance between a query document and a document to be matched, wherein the relevance is calculated according to Formula 1, which is:
  • Q is the query document
  • D is the document to be matched
  • t is the index word
  • tf is the number of times the index word appears in the document to be matched
  • dl is the document length of the document to be matched
  • The steps of the data preprocessing are:
  • The data preprocessing step precedes the step of calculating the relevance between the query document and the document to be matched.
  • Formula 1 is derived from Formula 11, which is:
  • Formula 2 is obtained by combining steps (1), (2), (3), and (4) of the conversion from Formula 11 to Formula 1.
  • idf is the inverse document frequency of the index word
  • The inverse document frequency of the index word is obtained from the total number of documents N and the number of documents df in which the index word appears.
  • The total number of documents N and the total document length adl are recorded in a document library.
  • The document library includes an inverted index table and a document information table
  • The document information table records each document, the unique ID of the document, and the document length dl
  • The inverted index table records the index words and the list information of each index word.
  • The list information of an index word includes the number of documents df in which the index word appears and the number of times tf the index word appears in the corresponding document.
  • The index words in Formula 1 are the index words obtained from both the query document and the document to be matched.
  • The calculation factors tf, dl, and ipp of Formula 1 are obtained by step (a) of data preprocessing;
  • The calculation factor pk1b of Formula 1 is obtained by step (b) of data preprocessing;
  • The calculation factor pbavdl of Formula 1 is obtained by step (c) of data preprocessing.
  • The beneficial effect of the invention is that the execution efficiency of the software program implementing the algorithm is improved and the time required for full-text retrieval is reduced.
  • Figure 1 is a schematic illustration of the invention.
  • The present invention provides a document retrieval matching method that optimizes the classical algorithm.
  • The optimization is achieved mainly through data preprocessing and by changing the order of the calculation terms in the formula, in three steps:
  • Step 1: data preprocessing, which computes the three calculation terms ipp, pk1b, and pbavdl.
  • The steps of the data preprocessing are:
  • The total number of documents N and the total document length adl are recorded in a document library. The document library includes an inverted index table and a document information table; the document information table records each document, the unique ID of the document, and the document length dl.
  • The inverted index table records the index words and the list information of each index word; the list information includes the number of documents df in which the index word appears and the number of times tf the index word appears in the corresponding document.
  • The modified inverted index table is shown in Table 3.
  • In Table 2, each index word corresponds to a list of information about that word in the documents; each entry is a document number and the number of times tf the index word appears in that document.
  • In Table 3, the information for each index word additionally includes the number of documents df in which the word appears, that is, the document frequency.
  • The calculation terms are prepared in the data preprocessing stage.
  • Step 2: formula conversion, converting Formula 11 into Formula 1:
  • t represents a word
  • Q represents a query document
  • D represents a document to be matched.
  • k1 generally takes a value of 1.0 to 2.0
  • b generally takes a value of 0.75
  • k3 generally takes a value of 0 to 1000; k1, b, and k3 are all constants.
  • The present invention converts the classical formula to facilitate optimization; the result is Formula 1, which is:
  • Q represents a query document
  • D represents a document to be matched
  • t represents an index word obtained from the query document
  • tf is the number of times the index word appears in the corresponding document
  • dl is the length of the document
  • The calculation factors tf, dl, and ipp of Formula 1 are obtained by step (a) of data preprocessing;
  • The calculation factor pk1b of Formula 1 is obtained by step (b) of data preprocessing;
  • The calculation factor pbavdl of Formula 1 is obtained by step (c) of data preprocessing.
  • ipp is calculated by Formula 2
  • pk1b is calculated by Formula 3
  • pbavdl is calculated by Formula 4;
  • idf is the inverse document frequency of the index word
  • The inverse document frequency of the index word is obtained from the total number of documents N and the number of documents df in which the index word appears, and can be calculated in step (2) of data preprocessing.
  • Step 3: calculate the relevance between the query document and each document to be matched, one by one, according to the converted formula, that is, Formula 1:
  • For each index word that appears in a document, Table 4 compares the operation count of a straightforward implementation of the formula with that of the optimized implementation: the number of additions and subtractions is reduced by 7/9, the number of multiplications and divisions is reduced by 7/10, and the number of logarithm evaluations drops from 1 to 0.
  • The optimized implementation also relies on preprocessing the index words in the query document, that is, computing the values of ipp, pk1b, and pbavdl; the corresponding operation counts are shown in Table 5, where the complete formula for ipp is:
  • The operation counts of the straightforward and optimized implementations are shown in Table 6.
  • M-1 is the operation count of the summation.
  • In general N >> M >> 1, so the cost of computing document relevance can be reduced to about 3/10 of the original.
  • The operation count of the implementation is reduced by about 7/10, and the computation time can be shortened by about 7/10 accordingly.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

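A document retrieval matching method, comprising a step of calculating the relevance between a query document and a document to be matched according to Formula (I), where Q denotes the query document, D denotes the document to be matched, t denotes an index word, tf is the number of times the index word appears in the document to be matched, and dl is the length of the document to be matched. The method further comprises data preprocessing performed before the relevance is calculated, which computes ipp, pk1b, and pbavdl. The method can improve the execution efficiency of the program implementing the algorithm and reduce the time required for full-text retrieval.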

Description

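Document retrieval matching method
Technical field
The present invention belongs to the technical field of natural language processing and information retrieval, and in particular relates to a document retrieval matching method.
Background
Many algorithms exist for computing document relevance in information retrieval. Algorithms based on the TF-IDF framework are an important class, and Okapi BM25 is a classic implementation within that class. The document retrieval matching method described here is mainly an optimized implementation of that classic algorithm; the method and related techniques can also be extended to other algorithms. The symbols used in the algorithm and their meanings are shown in Table 1.
Table 1: Symbols used in algorithms based on the TF-IDF framework and their meanings
Figure PCTCN2016108775-appb-000001
The Okapi BM25 algorithm is computed as follows (Formula 11):
Figure PCTCN2016108775-appb-000002
where t denotes a word, Q denotes the query document, and D denotes the document to be matched. k1 generally takes a value of 1.0 to 2.0, b generally takes 0.75, and k3 generally takes a value of 0 to 1000; all three are constants.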
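Formula 11 survives in this text only as an image placeholder. As a reading aid, the widely used form of Okapi BM25 that matches the symbols named here (N, df, tf, qtf, dl, avdl, k1, b, k3) can be written as below. The function name score and the exact idf variant are assumptions; the text itself states only that idf is obtained from N and df.

```latex
\mathrm{score}(Q, D) = \sum_{t \in Q}
  \log\frac{N - df + 0.5}{df + 0.5}
  \cdot \frac{(k_1 + 1)\, tf}{k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf}
  \cdot \frac{(k_3 + 1)\, qtf}{k_3 + qtf}
```

This form needs 9 additions or subtractions, 10 multiplications or divisions, and 1 logarithm per index word, which agrees with the per-term counts the description quotes for the straightforward implementation.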
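Summary of the invention
The technical problem to be solved by the present invention is to provide a document retrieval matching method that improves the execution efficiency of the program implementing the algorithm and reduces the time required for full-text retrieval.
To solve the above technical problem, the present invention provides a document retrieval matching method comprising a step of calculating the relevance between a query document and a document to be matched, characterized in that the relevance is calculated according to Formula 1, which is:
Figure PCTCN2016108775-appb-000003
where Q denotes the query document, D denotes the document to be matched, t denotes an index word, tf is the number of times the index word appears in the document to be matched, and dl is the document length of the document to be matched;
the calculation factors tf, dl, ipp, pk1b, and pbavdl of Formula 1 are obtained by the data preprocessing steps;
the data preprocessing steps are:
(a) record the total number of documents N and the total document length adl in the document library, and according to the formula
Figure PCTCN2016108775-appb-000004
calculate the average document length avdl;
set b to 0.75 and calculate pbavdl according to Formula 4, which is:
Figure PCTCN2016108775-appb-000005
record the length dl of the document to be matched, and record the number of times tf the index word appears in the document to be matched;
(b) obtain the query term frequency qtf of the index word, record the number of documents df in which the index word appears, set k1 to a value of 1.0 to 2.0 and k3 to a value of 0 to 1000, and calculate ipp according to Formula 2, which is:
Figure PCTCN2016108775-appb-000006
(c) calculate pk1b according to Formula 3, which is: pk1b = k1(1-b);
the data preprocessing steps precede the step of calculating the relevance between the query document and the document to be matched.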
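Preprocessing steps (a) to (c) can be sketched in Python as follows. This is a minimal illustration, not the applicant's implementation: the function names are hypothetical, the default values for k1 and k3 are arbitrary picks from the stated ranges, and the forms avdl = adl/N, pbavdl = k1·b/avdl, and the idf expression are assumptions, because the corresponding formulas appear here only as images.

```python
import math


def collection_terms(total_docs, total_len, k1=1.2, b=0.75):
    """Steps (a) and (c): terms that depend only on the document library."""
    avdl = total_len / total_docs   # average document length (assumed: adl / N)
    pbavdl = k1 * b / avdl          # Formula 4 (assumed form)
    pk1b = k1 * (1.0 - b)           # Formula 3, as stated in the text
    return pk1b, pbavdl


def query_term_ipp(n_docs, df, qtf, k1=1.2, k3=8.0):
    """Step (b): per-index-word term ipp = idf * pk1 * pqtf (Formula 2)."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))  # assumed idf variant
    pk1 = k1 + 1.0
    pqtf = (k3 + 1.0) * qtf / (k3 + qtf)
    return idf * pk1 * pqtf


if __name__ == "__main__":
    print(collection_terms(total_docs=4, total_len=40))
    print(query_term_ipp(n_docs=4, df=1, qtf=1))
```

pk1b and pbavdl depend only on the library and the constants, so they can be computed once; ipp is computed once per index word of the query and reused for every candidate document.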
Preferably, Formula 1 is derived from Formula 11, which is:
Figure PCTCN2016108775-appb-000007
The steps for converting Formula 11 into Formula 1 are:
(1) let
Figure PCTCN2016108775-appb-000008
(2) let k1+1 = pk1;
(3) let
Figure PCTCN2016108775-appb-000009
(4) let idf·pk1·pqtf = ipp;
(5) let k1(1-b) = pk1b;
(6) let
Figure PCTCN2016108775-appb-000010
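Steps (1), (3), and (6) are image-only in this text. Assuming Formula 11 has the standard Okapi BM25 form, and reading the three missing substitutions as idf, pqtf, and pbavdl respectively, the conversion can be traced as follows. The name score is an assumption, and the lines marked "assumed form" are reconstructions rather than text from the source.

```latex
\begin{aligned}
idf &= \log\frac{N - df + 0.5}{df + 0.5} && \text{(1), assumed form}\\
pk_1 &= k_1 + 1 && \text{(2)}\\
pqtf &= \frac{(k_3 + 1)\, qtf}{k_3 + qtf} && \text{(3), assumed form}\\
ipp &= idf \cdot pk_1 \cdot pqtf && \text{(4)}\\
pk_1b &= k_1 (1 - b) && \text{(5)}\\
pbavdl &= \frac{k_1\, b}{avdl} && \text{(6), assumed form}\\[4pt]
k_1\left((1 - b) + b\,\frac{dl}{avdl}\right) + tf &= pk_1b + pbavdl \cdot dl + tf\\[4pt]
\mathrm{score}(Q, D) &= \sum_{t \in Q} \frac{ipp \cdot tf}{pk_1b + pbavdl \cdot dl + tf}
\end{aligned}
```

The last line uses only 2 additions and 3 multiplications or divisions per index word, which matches the reductions of 7/9 and 7/10 stated in the description.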
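Preferably, Formula 2 is obtained by combining steps (1), (2), (3), and (4) of the conversion from Formula 11 to Formula 1.
Preferably, idf is the inverse document frequency of the index word, obtained from the total number of documents N and the number of documents df in which the index word appears.
Preferably, the total number of documents N and the total document length adl are recorded in the document library.
Preferably, the document library includes an inverted index table and a document information table; the document information table records each document, its unique ID, and its document length dl, and the inverted index table records the index words and the list information of each index word.
Preferably, the list information of an index word includes the number of documents df in which the index word appears and the number of times tf the index word appears in the corresponding document.
Preferably, the index words in Formula 1 are the index words obtained from both the query document and the document to be matched.
Preferably, the calculation factors tf, dl, and ipp of Formula 1 are obtained by step (a) of data preprocessing;
the calculation factor pk1b of Formula 1 is obtained by step (b) of data preprocessing;
the calculation factor pbavdl of Formula 1 is obtained by step (c) of data preprocessing.
The beneficial effect of the present invention is that it improves the execution efficiency of the software program implementing the algorithm and reduces the time required for full-text retrieval.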
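Brief description of the drawings
The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Figure 1 is a schematic diagram of the present invention.
Detailed description
The technical solution of the present invention is described in further detail below with reference to the drawings and specific embodiments.
To solve the above technical problem, the present invention provides a document retrieval matching method that optimizes the classical algorithm. The optimization is achieved mainly through data preprocessing and by changing the order of the calculation terms in the formula, in three steps:
As shown in Figure 1, Step 1 is data preprocessing, which computes the three calculation terms ipp, pk1b, and pbavdl.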
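The data preprocessing steps are:
(a) record the total number of documents N and the total document length adl in the document library, and according to the formula
Figure PCTCN2016108775-appb-000011
calculate the average document length avdl;
set b to 0.75 and calculate pbavdl according to Formula 4, which is:
Figure PCTCN2016108775-appb-000012
record the length dl of the document to be matched, and record the number of times tf the index word appears in the document to be matched;
(b) obtain the query term frequency qtf of the index word, record the number of documents df in which the index word appears, set k1 to a value of 1.0 to 2.0 and k3 to a value of 0 to 1000, and calculate ipp according to Formula 2, which is:
Figure PCTCN2016108775-appb-000013
(c) set b to 0.75 and k1 to a value of 1.0 to 2.0, and calculate pk1b according to Formula 3, which is: pk1b = k1(1-b);
The total number of documents N and the total document length adl are recorded in the document library. The document library includes an inverted index table and a document information table; the document information table records each document, its unique ID, and its document length dl, and the inverted index table records the index words and the list information of each index word. The list information of an index word includes the number of documents df in which the index word appears and the number of times tf the index word appears in the corresponding document.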
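A typical inverted index table records all index words and the frequency with which each index word appears in each document; its general form is shown in Table 2.
The modified inverted index table is shown in Table 3. In Table 2, each index word corresponds to a list of information about that word in the documents; each entry consists of a document number and the number of times tf the index word appears in that document. In Table 3, the information for each index word additionally records how many documents the word appears in, that is, the number of documents df in which the index word appears, also called the document frequency.
Table 2: General form of the inverted index table
Figure PCTCN2016108775-appb-000014
Figure PCTCN2016108775-appb-000015
Table 3: Modified form of the inverted index table
Figure PCTCN2016108775-appb-000016
From the document frequency in Table 3 and the total number of documents, the value of the first calculation term in the formula,
Figure PCTCN2016108775-appb-000017
can be calculated.
The calculation terms are prepared in the data preprocessing stage.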
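The document library described above (a document information table plus an inverted index that also stores df) could be laid out as in the following sketch. The function name build_library, the dictionary layout, and the assumption that documents arrive already tokenized are all illustrative choices; Tables 2 and 3 are images in this text, so their exact column layout is not reproduced.

```python
from collections import Counter


def build_library(docs):
    """Build the two tables from docs, a dict of doc ID -> list of tokens.

    Returns (doc_info, inverted, n_docs, adl):
      doc_info: doc ID -> document length dl
      inverted: index word -> {"df": ..., "postings": {doc ID: tf}}
      n_docs:   total number of documents N
      adl:      total document length
    """
    doc_info = {}
    postings = {}
    for doc_id, tokens in docs.items():
        doc_info[doc_id] = len(tokens)
        for term, tf in Counter(tokens).items():
            postings.setdefault(term, {})[doc_id] = tf
    # Store df next to each posting list so idf needs no extra pass at query time.
    inverted = {
        term: {"df": len(plist), "postings": plist}
        for term, plist in postings.items()
    }
    return doc_info, inverted, len(doc_info), sum(doc_info.values())


if __name__ == "__main__":
    library = build_library({1: ["a", "b", "a"], 2: ["b", "c"]})
    print(library)
```

Keeping df alongside the posting list is what lets the idf part of ipp be computed during preprocessing from N and df alone.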
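Step 2: formula conversion, converting Formula 11 into Formula 1:
As noted in the Background, the classic Okapi BM25 algorithm is computed as follows (Formula 11):
Figure PCTCN2016108775-appb-000018
where t denotes a word, Q denotes the query document, and D denotes the document to be matched. k1 generally takes a value of 1.0 to 2.0, b generally takes 0.75, and k3 generally takes a value of 0 to 1000; all three are constants.
The present invention converts this classical formula to facilitate optimization; the result is Formula 1, which is:
Figure PCTCN2016108775-appb-000019
where Q denotes the query document, D denotes the document to be matched, t denotes an index word obtained from the query document, tf is the number of times the index word appears in the corresponding document, and dl is the document length;
the calculation factors tf, dl, and ipp of Formula 1 are obtained by step (a) of data preprocessing;
the calculation factor pk1b of Formula 1 is obtained by step (b) of data preprocessing;
the calculation factor pbavdl of Formula 1 is obtained by step (c) of data preprocessing.
ipp is calculated by Formula 2, pk1b by Formula 3, and pbavdl by Formula 4;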
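The steps for converting Formula 11 into Formula 1 are:
(1) let
Figure PCTCN2016108775-appb-000020
(2) let k1+1 = pk1;
(3) let
Figure PCTCN2016108775-appb-000021
(4) let idf·pk1·pqtf = ipp;
(5) let k1(1-b) = pk1b;
(6) let
Figure PCTCN2016108775-appb-000022
where Formula 2,
Figure PCTCN2016108775-appb-000023
is obtained by combining steps (1), (2), (3), and (4) above.
Here idf is the inverse document frequency of the index word, obtained from the total number of documents N and the number of documents df in which the index word appears; it can be calculated in step (2) of data preprocessing.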
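Step 3: using the converted formula, that is, Formula 1, calculate the relevance between the query document and each document to be matched, one by one. The converted formula, Formula 1, is:
Figure PCTCN2016108775-appb-000024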
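The scoring step can be illustrated by placing a direct BM25 evaluation next to the converted form. This is a sketch under stated assumptions: both formulas are images in this text, so the standard Okapi BM25 expression is assumed for Formula 11 and ipp·tf/(pk1b + pbavdl·dl + tf) for Formula 1; the function names and the dictionary-based inputs are hypothetical.

```python
import math


def score_direct(query_tf, doc_tf, dl, n_docs, df, avdl, k1=1.2, b=0.75, k3=8.0):
    """Straightforward BM25 (assumed form of Formula 11) for one document."""
    total = 0.0
    for term, qtf in query_tf.items():
        tf = doc_tf.get(term, 0)
        if tf == 0:
            continue  # a term absent from the document contributes nothing
        idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5))
        tf_part = (k1 + 1) * tf / (k1 * ((1 - b) + b * dl / avdl) + tf)
        qtf_part = (k3 + 1) * qtf / (k3 + qtf)
        total += idf * tf_part * qtf_part
    return total


def score_converted(ipp, doc_tf, dl, pk1b, pbavdl):
    """Converted form (assumed Formula 1); ipp maps index word -> precomputed ipp."""
    total = 0.0
    for term, ipp_t in ipp.items():
        tf = doc_tf.get(term, 0)
        if tf:
            # Per term: 2 additions, 2 multiplications, 1 division, no logarithm.
            total += ipp_t * tf / (pk1b + pbavdl * dl + tf)
    return total


if __name__ == "__main__":
    k1, b, k3, n_docs, avdl = 1.2, 0.75, 8.0, 10, 5.0
    df = {"a": 2, "b": 3}
    query_tf = {"a": 1, "b": 2}
    doc_tf = {"a": 3, "b": 1, "c": 1}
    ipp = {
        t: math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5)) * (k1 + 1) * (k3 + 1) * q / (k3 + q)
        for t, q in query_tf.items()
    }
    print(score_direct(query_tf, doc_tf, 5, n_docs, df, avdl, k1, b, k3))
    print(score_converted(ipp, doc_tf, 5, k1 * (1 - b), k1 * b / avdl))
```

The two functions return the same score up to floating-point rounding; the converted form simply moves every quantity that does not depend on the candidate document out of the per-document loop.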
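The above steps constitute an optimized algorithm for document relevance calculation. Its advantages are as follows:
When the Okapi BM25 algorithm is implemented, the relevance score of each candidate document is computed over every index word that appears in that document. Table 4 compares, per index word, the operation count of a straightforward implementation of the formula with that of the optimized implementation. The number of additions and subtractions is reduced by 7/9, the number of multiplications and divisions is reduced by 7/10, and the number of logarithm evaluations drops from 1 to 0.
Table 4: Comparison of per-term operation counts of the implementations
Figure PCTCN2016108775-appb-000025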
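The quoted reductions can be checked by tallying operations per index word. The tally below assumes the standard Okapi BM25 expression for Formula 11 and ipp·tf/(pk1b + pbavdl·dl + tf) for Formula 1; both are reconstructions, since the formulas and Table 4 appear in this text only as images.

```latex
\begin{array}{lccc}
\text{term} & +/- & \times/\div & \log \\
\hline
\log\frac{N - df + 0.5}{df + 0.5} & 3 & 1 & 1 \\
\frac{(k_1 + 1)\, tf}{k_1\left((1 - b) + b\, dl / avdl\right) + tf} & 4 & 5 & 0 \\
\frac{(k_3 + 1)\, qtf}{k_3 + qtf} & 2 & 2 & 0 \\
\text{product of the three factors} & 0 & 2 & 0 \\
\hline
\text{straightforward total} & 9 & 10 & 1 \\
\frac{ipp \cdot tf}{pk_1b + pbavdl \cdot dl + tf} & 2 & 3 & 0
\end{array}
```

Going from 9 to 2 additions or subtractions is a reduction of 7/9, and going from 10 to 3 multiplications or divisions is a reduction of 7/10, in agreement with the text.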
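The optimized implementation also depends on preprocessing the index words in the query document, that is, computing the values of ipp, pk1b, and pbavdl. The corresponding operation counts are shown in Table 5, where the complete formula for ipp is:
Figure PCTCN2016108775-appb-000026
Table 5: Preprocessing operation counts for index words
Figure PCTCN2016108775-appb-000027
Let the number of words used to compute relevance, that is, the number of index words, be M, and let the number of candidate documents be N. The operation counts of the straightforward and optimized implementations are then as shown in Table 6, where M-1 is the operation count of the summation.
Table 6: Comparison of operation counts of the implementations
Figure PCTCN2016108775-appb-000028
The operation count required by the optimized implementation relative to the straightforward implementation, computed for each type of operation, is:
Figure PCTCN2016108775-appb-000029
Figure PCTCN2016108775-appb-000030
Figure PCTCN2016108775-appb-000031
In general N >> M >> 1, so the cost of computing document relevance can be reduced to about 3/10 of the original. Although the theoretical complexity of the algorithm is unchanged, the operation count of the implementation is reduced by about 7/10, and the computation time can be shortened by about 7/10 accordingly.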
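A rough check of the 3/10 figure, under the assumption that each index word costs 9 additions or subtractions and 10 multiplications or divisions in the straightforward implementation versus 2 and 3 in the optimized one, with M-1 further additions per document for the summation. The preprocessing cost is paid once per query rather than once per candidate document, so it is dropped here as negligible for N >> M; Tables 5 and 6 are images in this text, so their exact entries are not reproduced.

```latex
\begin{aligned}
\frac{\text{add/sub (optimized)}}{\text{add/sub (straightforward)}}
  &\approx \frac{N\,(2M + M - 1)}{N\,(9M + M - 1)}
   = \frac{3M - 1}{10M - 1} \approx \frac{3}{10} \quad (M \gg 1)\\[4pt]
\frac{\text{mul/div (optimized)}}{\text{mul/div (straightforward)}}
  &\approx \frac{N \cdot 3M}{N \cdot 10M} = \frac{3}{10}
\end{aligned}
```

For example, with M = 10 index words the first ratio is 29/99, or about 0.29.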
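Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solution may be modified or replaced by equivalents without departing from its spirit and scope, and all such changes shall fall within the scope of the claims of the present invention.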

Claims (9)

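  1. A document retrieval matching method, comprising a step of calculating the relevance between a query document and a document to be matched, characterized in that:
    the relevance between the query document and the document to be matched is calculated according to Formula 1, which is:
    Figure PCTCN2016108775-appb-100001
    where Q denotes the query document, D denotes the document to be matched, t denotes an index word, tf is the number of times the index word appears in the document to be matched, and dl is the document length of the document to be matched;
    the calculation factors tf, dl, ipp, pk1b, and pbavdl of Formula 1 are obtained by the data preprocessing steps;
    the data preprocessing steps are:
    (a) record the total number of documents N and the total document length adl in the document library, and according to the formula
    Figure PCTCN2016108775-appb-100002
    calculate the average document length avdl;
    set b to 0.75 and calculate pbavdl according to Formula 4, which is:
    Figure PCTCN2016108775-appb-100003
    record the length dl of the document to be matched, and record the number of times tf the index word appears in the document to be matched;
    (b) obtain the term frequency qtf of the index word in the query document, record the number of documents df in which the index word appears, set k1 to a value of 1.0 to 2.0 and k3 to a value of 0 to 1000, and calculate ipp according to Formula 2, which is:
    Figure PCTCN2016108775-appb-100004
    (c) calculate pk1b according to Formula 3, which is: pk1b = k1(1-b);
    the data preprocessing steps precede the step of calculating the relevance between the query document and the document to be matched.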
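  2. The document retrieval matching method according to claim 1, characterized in that Formula 1 is derived from Formula 11, which is:
    Figure PCTCN2016108775-appb-100005
    the steps for converting Formula 11 into Formula 1 are:
    (1) let
    Figure PCTCN2016108775-appb-100006
    (2) let k1+1 = pk1;
    (3) let
    Figure PCTCN2016108775-appb-100007
    (4) let idf·pk1·pqtf = ipp;
    (5) let k1(1-b) = pk1b;
    (6) let
    Figure PCTCN2016108775-appb-100008
  3. The document retrieval matching method according to claims 1 and 2, characterized in that Formula 2 is obtained by combining steps (1), (2), (3), and (4) of the conversion from Formula 11 to Formula 1.
  4. The document retrieval matching method according to claim 2, characterized in that idf is the inverse document frequency of the index word, obtained from the total number of documents N and the number of documents df in which the index word appears.
  5. The document retrieval matching method according to claim 1, characterized in that the total number of documents N and the total document length adl are recorded in the document library.
  6. The document retrieval matching method according to claim 5, characterized in that the document library includes an inverted index table and a document information table; the document information table records each document, its unique ID, and its document length dl, and the inverted index table records the index words and the list information of each index word.
  7. The document retrieval matching method according to claim 6, characterized in that the list information of an index word includes the number of documents df in which the index word appears and the number of times tf the index word appears in the corresponding document.
  8. The document retrieval matching method according to claim 1, characterized in that the index words in Formula 1 are the index words obtained from both the query document and the document to be matched.
  9. The document retrieval matching method according to claim 1, characterized in that:
    the calculation factors tf, dl, and ipp of Formula 1 are obtained by step (a) of data preprocessing;
    the calculation factor pk1b of Formula 1 is obtained by step (b) of data preprocessing;
    the calculation factor pbavdl of Formula 1 is obtained by step (c) of data preprocessing.
PCT/CN2016/108775 2015-12-31 2016-12-07 Document retrieval matching method WO2017114110A1 (zh)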

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201511026068.8A CN105653703A (zh) 2015-12-31 2015-12-31 Document retrieval matching method
CN201511026068.8 2015-12-31

Publications (1)

Publication Number Publication Date
WO2017114110A1 true WO2017114110A1 (zh) 2017-07-06

Family

ID=56490410

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/108775 WO2017114110A1 (zh) 2015-12-31 2016-12-07 Document retrieval matching method

Country Status (2)

Country Link
CN (1) CN105653703A (zh)
WO (1) WO2017114110A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653703A (zh) * 2015-12-31 2016-06-08 武汉传神信息技术有限公司 Document retrieval matching method

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080215574A1 (en) * 2007-03-01 2008-09-04 Microsoft Corporation Efficient Retrieval Algorithm by Query Term Discrimination
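CN101876979A (zh) * 2009-04-28 2010-11-03 株式会社理光 Query expansion method and query expansion device
CN104765769A (zh) * 2015-03-06 2015-07-08 大连理工大学 Short-text query expansion and retrieval method based on word vectors
CN105653703A (zh) * 2015-12-31 2016-06-08 武汉传神信息技术有限公司 Document retrieval matching method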

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7822752B2 (en) * 2007-05-18 2010-10-26 Microsoft Corporation Efficient retrieval algorithm by query term discrimination
US20130179418A1 (en) * 2012-01-06 2013-07-11 Microsoft Corporation Search ranking features
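CN103246681B (zh) * 2012-02-13 2018-10-26 深圳市世纪光速信息技术有限公司 Search method and device
CN103049470B (zh) * 2012-09-12 2016-09-21 北京航空航天大学 Opinion retrieval method based on sentiment relevance
CN103699574B (zh) * 2013-11-28 2017-01-11 科大讯飞股份有限公司 Method and system for retrieval optimization of complex search expressions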

Also Published As

Publication number Publication date
CN105653703A (zh) 2016-06-08

Similar Documents

Publication Publication Date Title
US10394851B2 (en) Methods and systems for mapping data items to sparse distributed representations
CN110083696B (zh) Global citation recommendation method and recommendation system based on meta-structure technology
Zvonarev et al. A Comparison of Machine Learning Methods of Sentiment Analysis Based on Russian Language Twitter Data.
Zhang et al. Cross Lingual Entity Linking with Bilingual Topic Model.
Martín et al. Using semi-structured data for assessing research paper similarity
Li et al. An approach to improve kernel-based protein–protein interaction extraction by learning from large-scale network data
Bian et al. Research on multi-document summarization based on LDA topic model
CN106933824A (zh) Method and device for determining, among multiple documents, a set of documents similar to a target document
Wang et al. Topic-driven multi-document summarization
Graus et al. Context-Based Entity Linking-University of Amsterdam at TAC 2012.
Jin et al. A multi-strategy query processing approach for biomedical question answering: USTB_PRIR at BioASQ 2017 Task 5B
Gao et al. The Math Retrieval System of ICST for NTCIR-12 MathIR Task.
WO2017114110A1 (zh) Document retrieval matching method
Sawarkar et al. Blended RAG: Improving RAG (Retriever-Augmented Generation) Accuracy with Semantic Search and Hybrid Query-Based Retrievers
US20220138241A1 (en) User-Focused, Ontological, Automatic Text Summarization
US9811780B1 (en) Identifying subjective attributes by analysis of curation signals
Zhang et al. Automatic web news extraction based on DS theory considering content topics
Ahmed et al. K-means based algorithm for islamic document clustering
Znaidi et al. Answering PICO clinical questions: a semantic graph-based approach
CN108920449A (zh) Document model extension method based on large-scale topic modeling
Berenguer et al. Word embeddings for retrieving tabular data from research publications
Saad Missen et al. Using passage-based language model for opinion detection in blogs
WO2024130741A1 (zh) Data processing method, apparatus, device, storage medium, and program product
Shen et al. A deep transfer learning method for medical question matching
Idris et al. Semantics based intelligent search in large digital repositories using Hadoop MapReduce

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 16880900

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 16880900

Country of ref document: EP

Kind code of ref document: A1