CN108804443A - A kind of judicial class case searching method based on multi-feature fusion - Google Patents

Info

Publication number
CN108804443A
Authority
CN
China
Prior art keywords
fusion
query
word
words
method based
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710289597.XA
Other languages
Chinese (zh)
Inventor
耿伟
司华建
贾真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Fu Chi Information Technology Co Ltd
Original Assignee
Anhui Fu Chi Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2017-04-27
Filing date
2017-04-27
Publication date
2018-11-13
Application filed by Anhui Fu Chi Information Technology Co Ltd
Priority to CN201710289597.XA
Publication of CN108804443A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a judicial similar-case search method based on multi-feature fusion, comprising the following steps: a user inputs a query request; the query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords; the query word set is traversed in turn and, for each query word in it, semantic query expansion is performed using a semantic dictionary to obtain an expanded list of query keywords; documents are filtered using information points, the feature inverted indexes are searched to obtain the different feature vectors for the keyword list, and multi-feature fusion is then performed; the fused similarity between each document and the query statement is computed to obtain a final similarity score; and the final search results are sorted and output. The invention offers advantages such as high accuracy.

Description

A kind of judicial class case searching method based on multi-feature fusion
Technical Field
The present invention relates to the field of judicial similar-case search, and specifically to a judicial similar-case search method based on multi-feature fusion.
Background
Law is a product of the state. It refers to the basic laws and ordinary laws that the ruling class (the ruling group, that is, a political party, king, or monarch) enacts through a defined legislative procedure in order to rule and administer the country. Law is the embodiment of the will of the whole people and a tool of state governance.
As social information becomes more open, society pays increasing attention to the outcomes of legal cases. If similar judgment documents can be recommended promptly for reference during a trial, the quality of the trial can be effectively improved. At present, keyword-based text retrieval systems are generally used; they compare two cases simply by word matching, and it is difficult to obtain ideal search results accurately. The reasons can be summarized in three points: keyword features describe the document information incompletely, which makes the similarity calculation inaccurate; keywords located in different blocks of a document influence the final similarity judgment differently; and the constraint that contextual information places on keyword semantics is not considered carefully, so differences caused by changes in context cannot be distinguished effectively. Developing a highly accurate search method has therefore become an important current task.
Summary of the Invention
The technical problem to be solved by the invention is to overcome the low retrieval efficiency and limited accuracy of the prior art by providing a judicial similar-case search method based on multi-feature fusion.
The technical solution provided by the invention is a judicial similar-case search method based on multi-feature fusion, with the following steps:
(1) A user inputs a query request;
(2) The query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords;
(3) The query word set is traversed in turn; for each query word in the set, semantic query expansion is performed using a semantic dictionary to obtain the expanded list of query keywords;
(4) Documents are filtered using information points, the feature inverted indexes are searched to obtain the different feature vectors for the keyword list, and multi-feature fusion is then performed;
(5) The fused similarity between each document and the query statement is computed to obtain the final similarity score;
(6) The final search results are sorted and output.
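Steps (2) to (4) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration: the stop-word list, the toy semantic dictionary, whitespace tokenization (Chinese text would need a real word segmenter), and the use of a plain term-to-document inverted index for candidate filtering. The patent does not specify how "information points" are defined, so this is not the patented filtering procedure.

```python
from collections import defaultdict

# Hypothetical resources: the patent does not list its stop words or dictionary.
STOP_WORDS = {"the", "a", "of", "in", "for"}
SEMANTIC_DICT = {"theft": ["larceny", "stealing"], "contract": ["agreement"]}

def preprocess(query):
    """Step (2): lowercase, split on whitespace, drop stop words."""
    return [w for w in query.lower().split() if w not in STOP_WORDS]

def expand(keywords):
    """Step (3): add dictionary synonyms for each query word, keeping order."""
    expanded = []
    for word in keywords:
        for term in [word] + SEMANTIC_DICT.get(word, []):
            if term not in expanded:
                expanded.append(term)
    return expanded

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in preprocess(text):
            index[term].add(doc_id)
    return index

def candidates(index, terms):
    """Step (4), filtering part: documents matching at least one expanded term."""
    found = set()
    for term in terms:
        found |= index.get(term, set())
    return found
```

Expansion happens before the index lookup, so a document that only mentions a synonym of a query word still becomes a candidate for scoring.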
Preferably, in step (4), the feature vectors include a block-partitioned keyword feature vector, a language model feature vector, and a topic word set feature vector.
Preferably, the block-partitioned keyword feature vector is obtained by counting the tf-idf information of the terms in each block and then applying block-wise weighting.
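One way to read the block-partitioned keyword feature is sketched below. The block names, the block weights, and the smoothed idf variant are all assumptions chosen for illustration; the patent gives no concrete values or formula for them.

```python
import math
from collections import Counter

# Hypothetical block names and weights; the patent does not specify them.
BLOCK_WEIGHTS = {"facts": 0.5, "reasoning": 0.3, "judgment": 0.2}

def idf(term, corpus):
    """Smoothed inverse document frequency over whole documents."""
    df = sum(1 for doc in corpus if any(term in block for block in doc.values()))
    return math.log((1 + len(corpus)) / (1 + df)) + 1.0

def block_weighted_tfidf(doc, corpus):
    """Return {term: weight}: per-block tf-idf scaled by the block weight."""
    weights = Counter()
    for name, tokens in doc.items():
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)
            weights[term] += BLOCK_WEIGHTS.get(name, 0.0) * tf * idf(term, corpus)
    return dict(weights)
```

Each document is a dictionary from block name to token list, so the same term contributes more when it occurs in a block with a higher weight.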
Preferably, the language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word segments of length N. Each word segment is called a gram. The occurrence frequency of all grams is counted and filtered against a preset threshold to form the key gram list.
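The sliding-window extraction and threshold filtering can be sketched as follows. The function names and the default threshold `min_count=2` are illustrative assumptions; the patent only states that the threshold is set in advance.

```python
from collections import Counter

def ngrams(tokens, n):
    """Slide a window of size n over the tokens; each window is one gram."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def key_grams(documents, n=2, min_count=2):
    """Count grams over all documents and keep those meeting the threshold."""
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}
```

A token list shorter than N yields no grams, so very short inputs simply contribute nothing to the key gram list.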
Preferably, the topic word set feature vector uses a topic to represent a concept or an aspect; a topic appears as a series of related key topic words and is expressed as the conditional probability of these key words.
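As a rough illustration of a topic expressed through conditional word probabilities, the sketch below estimates P(word | topic) by relative frequency from documents that already carry a topic label, and keeps the most probable words as the topic word set. The labelled input and the `top_k` cutoff are assumptions; the patent does not say how topics are obtained, and a topic model could be used instead.

```python
from collections import Counter, defaultdict

def topic_word_sets(labelled_docs, top_k=3):
    """Estimate P(word | topic) by relative frequency and keep the top_k words."""
    counts = defaultdict(Counter)
    for topic, tokens in labelled_docs:
        counts[topic].update(tokens)
    result = {}
    for topic, counter in counts.items():
        total = sum(counter.values())
        result[topic] = {w: c / total for w, c in counter.most_common(top_k)}
    return result
```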
Preferably, in step (5), the similarity scoring formula after multi-feature fusion is as follows:
$$\mathrm{score}(q,d) = a \cdot \mathrm{weightword}(q,d) + b \cdot \mathrm{gramScore}(q,d) + c \cdot \mathrm{Sim}_{\mathrm{topic}}(q,d)$$
where a + b + c = 1. The objective is to find a parameter combination {a, b, c}; through the formulation and solution of a mathematical model together with training data, the combination (a, b, c) is adaptively adjusted to its optimum. Specifically, the value range of the three parameters a, b, and c is first limited to (0, 1), and suitable values are then chosen empirically.
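The weighted fusion and a simple search for (a, b, c) can be sketched as below. The grid step and, in particular, the training objective (counting how often a relevant document outscores an irrelevant one) are assumptions for illustration; the patent does not state which objective function or solver is used.

```python
def fused_score(features, a, b, c):
    """features = (weightword, gramScore, topic similarity) for one (q, d) pair."""
    word, gram, topic = features
    return a * word + b * gram + c * topic

def weight_grid(step=0.1):
    """Yield (a, b, c) with each value in (0, 1) and a + b + c = 1."""
    n = round(1 / step)
    for i in range(1, n - 1):
        for j in range(1, n - i):
            k = n - i - j
            yield i / n, j / n, k / n

def fit_weights(training, step=0.1):
    """training: list of (features_relevant, features_irrelevant) pairs.
    Pick the weights that rank the relevant document higher most often."""
    def correct(w):
        return sum(fused_score(p, *w) > fused_score(q, *w) for p, q in training)
    return max(weight_grid(step), key=correct)
```

Enumerating integer triples and dividing by `n` keeps every candidate strictly inside (0, 1) and makes the three weights sum to 1 up to floating-point rounding.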
Compared with the prior art, the invention has the following advantages:
The invention first performs semantic query expansion using a semantic dictionary, so that the relationships between query keywords and words are described more fully and a comprehensive, accurate keyword description is built. It then constructs a similarity model from multiple features, namely block-wise term weighting, the language model, and the topic word set, and ranks the search results on the combined score, which greatly improves the precision and recall of similar-case retrieval.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the offline construction of the multi-feature model in Embodiment 1 of the invention;
Fig. 2 is a flow diagram of the judicial similar-case search method based on multi-feature fusion in Embodiment 1 of the invention;
Fig. 3 is a schematic diagram of the multi-feature fusion in Embodiment 1 of the invention;
Fig. 4 is a schematic diagram of the vector space model principle in Embodiment 1 of the invention.
Detailed Description
The invention, a judicial similar-case search method based on multi-feature fusion, is described below with reference to Figs. 1-4; its specific steps are as follows:
Embodiment 1
The invention discloses a judicial similar-case search method based on multi-feature fusion, with the following steps:
(1) A user inputs a query request;
(2) The query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords;
(3) The query word set is traversed in turn; for each query word in the set, semantic query expansion is performed using a semantic dictionary to obtain the expanded list of query keywords;
(4) Documents are filtered using information points, and the feature inverted indexes are searched to obtain the different feature vectors for the keyword list: the block-partitioned keyword feature vector, the language model feature vector, and the topic word set feature vector. The block-partitioned keyword feature vector is obtained by counting the tf-idf information of the terms in each block and then applying block-wise weighting. The language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word segments of length N; each word segment is called a gram, and the occurrence frequency of all grams is counted and filtered against a preset threshold to form the key gram list. Taking the 2-gram model as an example, the word-adjacency similarity score is calculated by the following formula:
[formula not reproduced in the source text]
where gramScore(q, d) denotes the word-adjacency similarity score between query string q and document d, 2-gram(q) denotes the 2-gram set of the query string, and 2-gram(d) denotes the 2-gram set of the document.
The algorithm is as follows:
Input: the preprocessed query string q and document d.
Output: the word-adjacency similarity score between q and d.
A. Obtain the 2-gram set 2-gram(q) of q;
B. Obtain the 2-gram set 2-gram(d) of d;
C. Compute the word-adjacency similarity score gramScore(q, d) from 2-gram(q) and 2-gram(d);
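Steps A to C can be sketched as below. The formula for gramScore is not reproduced in this text, so the overlap measure used here (Jaccard similarity of the two 2-gram sets) is an assumption standing in for it, not the patent's own definition.

```python
def two_grams(tokens):
    """Steps A and B: the set of adjacent word pairs in a token sequence."""
    return {(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)}

def gram_score(q_tokens, d_tokens):
    """Step C with an assumed measure: shared 2-grams over all distinct 2-grams."""
    gq, gd = two_grams(q_tokens), two_grams(d_tokens)
    union = gq | gd
    return len(gq & gd) / len(union) if union else 0.0
```

Because the score depends on adjacent pairs rather than single words, two texts that share vocabulary but order it differently receive a lower score than texts that share whole phrases.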
The topic word set feature vector uses a topic to represent a concept or an aspect; a topic appears as a series of related key topic words and is expressed as the conditional probability of these key words.
Multi-feature fusion is then performed on the above feature vectors;
(5) The fused similarity between each document and the query statement is computed to obtain the final similarity score. The specific steps are as follows.
The model treats a document as a vector of t feature dimensions. Features are usually represented by words, and each feature's weight can be calculated according to some criterion; the t weighted features together constitute the document.
To compute the score, both the document and the query are expressed as vectors. A document is regarded as a series of words (terms), each with a weight (term weight); different terms influence the document's relevance score according to their weights in the document.
The weights of all the terms in a document are then regarded as a vector:
Document = {term1, term2, ..., termN}
Document Vector = {weight1, weight2, ..., weightN}
Likewise, the query statement is regarded as a simple document and is also represented as a vector:
Query = {term1, term2, ..., termN}
Query Vector = {weight1, weight2, ..., weightN}
All retrieved document vectors and the query vector are placed in an N-dimensional space in which each term is one dimension. The principle of the vector space model is shown in Fig. 4.
The similarity between a document and the query statement is then obtained by the following formula:
[formula not reproduced in the source text]
Semantic query expansion describes the relationships between query keywords and words more fully. The block-partitioned keyword feature captures how keywords are distributed across the document. The language-model keyword feature captures the dependencies between keywords and the constraint that context places on keyword semantics. The topic-word-set keyword feature introduces the correlation between query terms and topic words and reflects the likelihood between the query and a document block. The goal is to combine the block-partitioned keyword feature, the language model feature, and the topic word feature so that they offset one another's weaknesses, complement one another, and jointly describe a document, from which the similarity between the query and the document is calculated.
The similarity scoring formula after multi-feature fusion is as follows:
$$\mathrm{score}(q,d) = a \cdot \mathrm{weightword}(q,d) + b \cdot \mathrm{gramScore}(q,d) + c \cdot \mathrm{Sim}_{\mathrm{topic}}(q,d)$$
where a + b + c = 1. The objective is to find a parameter combination {a, b, c}; through the formulation and solution of a mathematical model together with training data, the combination (a, b, c) is adaptively adjusted to its optimum. Specifically, the value range of the three parameters a, b, and c is first limited to (0, 1), and suitable values are then chosen empirically;
(6) The final search results are sorted and output.
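Step (5) places the query and document weight vectors in a common term space, and step (6) sorts by score. The similarity formula itself is not reproduced in this text; the sketch below assumes the cosine of the angle between the two vectors, which is the usual similarity in a vector space model, and represents each vector as a sparse `{term: weight}` dictionary (a representation chosen here for illustration).

```python
import math

def cosine(query_vec, doc_vec):
    """Cosine of the angle between two sparse {term: weight} vectors."""
    dot = sum(w * doc_vec.get(t, 0.0) for t, w in query_vec.items())
    nq = math.sqrt(sum(w * w for w in query_vec.values()))
    nd = math.sqrt(sum(w * w for w in doc_vec.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rank(query_vec, doc_vecs):
    """Step (6): document ids sorted by descending similarity."""
    scores = {doc_id: cosine(query_vec, v) for doc_id, v in doc_vecs.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

Cosine similarity ignores vector length, so a long judgment document is not favoured over a short one merely because its term weights are larger.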
The above embodiment merely illustrates the principles and effects of the invention and is not intended to limit it. Anyone skilled in the art may modify or change the above embodiment without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall be covered by the claims of the invention.

Claims (6)

1. A judicial similar-case search method based on multi-feature fusion, characterized by the following steps:
(1) A user inputs a query request;
(2) The query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords;
(3) The query word set is traversed in turn; for each query word in the set, semantic query expansion is performed using a semantic dictionary to obtain the expanded list of query keywords;
(4) Documents are filtered using information points, the feature inverted indexes are searched to obtain the different feature vectors for the keyword list, and multi-feature fusion is then performed;
(5) The fused similarity between each document and the query statement is computed to obtain the final similarity score;
(6) The final search results are sorted and output.
2. The judicial similar-case search method based on multi-feature fusion according to claim 1, characterized in that: in step (4), the feature vectors include a block-partitioned keyword feature vector, a language model feature vector, and a topic word set feature vector.
3. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that: the block-partitioned keyword feature vector is obtained by counting the tf-idf information of the terms in each block and then applying block-wise weighting.
4. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that: the language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word segments of length N; each word segment is called a gram, and the occurrence frequency of all grams is counted and filtered against a preset threshold to form the key gram list.
5. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that: the topic word set feature vector uses a topic to represent a concept or an aspect; a topic appears as a series of related key topic words and is expressed as the conditional probability of these key words.
6. The judicial similar-case search method based on multi-feature fusion according to claim 1, characterized in that: in step (5), the similarity scoring formula after multi-feature fusion is as follows:
$$\mathrm{score}(q,d) = a \cdot \mathrm{weightword}(q,d) + b \cdot \mathrm{gramScore}(q,d) + c \cdot \mathrm{Sim}_{\mathrm{topic}}(q,d)$$
where a + b + c = 1. The objective is to find a parameter combination {a, b, c}; through the formulation and solution of a mathematical model together with training data, the combination (a, b, c) is adaptively adjusted to its optimum. Specifically, the value range of the three parameters a, b, and c is first limited to (0, 1), and suitable values are then chosen empirically.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710289597.XA CN108804443A (en) 2017-04-27 2017-04-27 A kind of judicial class case searching method based on multi-feature fusion

Publications (1)

Publication Number Publication Date
CN108804443A (en) 2018-11-13

Family

ID=64070316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710289597.XA Pending CN108804443A (en) 2017-04-27 2017-04-27 A kind of judicial class case searching method based on multi-feature fusion

Country Status (1)

Country Link
CN (1) CN108804443A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20181113