CN108804443A - A judicial similar-case search method based on multi-feature fusion
- Publication number
- CN108804443A (application number CN201710289597.XA)
- Authority
- CN
- China
- Prior art keywords
- fusion
- query
- word
- words
- method based
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a judicial similar-case search method based on multi-feature fusion, comprising the following steps: a user inputs a query request; the query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords; the query word set is traversed in order and, for each query word in the set, query semantic expansion is performed through a semantic dictionary to obtain an expanded list of query semantic keywords; documents are filtered by information points, the feature inverted index is searched to obtain the different feature vectors for the keyword list, and multi-feature fusion is performed; the fused similarity value between each document and the query statement is computed to obtain the final similarity score; and the final search results are ranked and output. The invention has advantages such as high accuracy.
Description
Technical field
The present invention relates to the field of judicial similar-case search, and specifically to a judicial similar-case search method based on multi-feature fusion.
Background
Law is a product of the state. It refers to the basic laws and general laws promulgated through a defined legislative procedure by the ruling class (the ruling group, that is, political parties, including kings and monarchs) in order to rule and administer the country. Law embodies the will of the whole people and is an instrument of state governance.
As social information becomes more open, society pays increasing attention to the outcomes of legal trials. If similar judgment documents can be recommended promptly as references during a trial, the effectiveness of the trial can be improved. At present, keyword-based text retrieval systems are generally used. They compare two cases only by simple word matching and rarely return ideal search results, for three reasons: keyword features describe document information incompletely, which makes the similarity calculation inaccurate; keywords located in different blocks of a document influence the final similarity judgment to different degrees; and the constraint that context places on keyword semantics is not considered in fine detail, so differences caused by a change of context cannot be distinguished effectively. Developing a search method with high accuracy has therefore become an important topic.
Summary of the invention
The technical problem to be solved by the invention is to overcome the low retrieval efficiency and limited accuracy of the prior art by providing a judicial similar-case search method based on multi-feature fusion.
The technical solution provided by the invention is a judicial similar-case search method based on multi-feature fusion, with the following specific steps:
(1) The user inputs a query request.
(2) The query request is preprocessed and segmented into words, and stop words are removed, yielding a set of query keywords.
(3) The query word set is traversed in order; for each query word in the set, query semantic expansion is performed through a semantic dictionary, yielding an expanded list of query semantic keywords.
(4) Documents are filtered by information points, the feature inverted index is searched to obtain the different feature vectors for the keyword list, and multi-feature fusion is then performed.
(5) The fused similarity value between each document and the query statement is computed, giving the final similarity score.
(6) The final search results are ranked and output.
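Steps (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the stop-word list, the semantic dictionary, and the function names are hypothetical, and the regular-expression tokenizer is only a stand-in for the Chinese word segmenter that a real system would need.

```python
import re

# Hypothetical stop-word list and semantic dictionary, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in", "for"}
SEMANTIC_DICT = {
    "theft": ["larceny", "stealing"],
    "injury": ["harm"],
}


def tokenize(query: str) -> list[str]:
    """Lowercase the query and split it into word tokens (stand-in segmenter)."""
    return re.findall(r"\w+", query.lower())


def extract_keywords(query: str) -> list[str]:
    """Step (2): segment the query and drop stop words, keeping word order."""
    return [tok for tok in tokenize(query) if tok not in STOP_WORDS]


def expand_keywords(keywords: list[str]) -> list[str]:
    """Step (3): add dictionary synonyms after each keyword, without duplicates."""
    expanded: list[str] = []
    for word in keywords:
        for candidate in [word, *SEMANTIC_DICT.get(word, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded
```

Keeping each original keyword ahead of its synonyms preserves the query order, which matters if a later stage weights original terms more heavily than expanded ones.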
Preferably, in step (4), the feature vectors comprise a block-partitioned keyword feature vector, a language model feature vector, and a topic word set feature vector.
Preferably, the block-partitioned keyword feature vector is obtained by computing tf-idf statistics for the terms in each document block and then grouping them by block.
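A block-level tf-idf vector of this kind might be computed as in the sketch below. The patent does not give the weighting formula, so several things here are assumptions: the smoothed idf variant, the per-block weights in `block_weights`, the block names, and summing a term's weighted scores across blocks.

```python
import math
from collections import Counter


def block_tfidf(
    doc_blocks: dict[str, list[str]],
    doc_freq: dict[str, int],
    num_docs: int,
    block_weights: dict[str, float],
) -> dict[str, float]:
    """Return a term -> weight map, summing block-weighted tf-idf over blocks."""
    vector: dict[str, float] = {}
    for block_name, tokens in doc_blocks.items():
        counts = Counter(tokens)
        total = len(tokens)
        for term, count in counts.items():
            tf = count / total
            # Smoothed idf (an assumed variant); stays positive for common terms.
            idf = math.log(num_docs / (1 + doc_freq.get(term, 0))) + 1.0
            # Blocks without an explicit weight default to 1.0.
            weight = block_weights.get(block_name, 1.0) * tf * idf
            vector[term] = vector.get(term, 0.0) + weight
    return vector
```

With hypothetical blocks such as `facts` and `judgment`, giving `judgment` a larger weight lets a term in that block contribute more to the document vector than the same term in `facts`.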
Preferably, the language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word fragments of length N. Each word fragment is called a gram. The occurrence frequency of every gram is counted, the grams are filtered against a preset threshold, and the remaining grams form the key gram list.
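The sliding-window extraction and threshold filtering can be illustrated with a short sketch. The function names and the choice to count grams over a whole corpus are assumptions made for the example; the patent states only that frequencies are counted and filtered against a preset threshold.

```python
from collections import Counter


def sliding_grams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Slide a window of size n over the tokens, one position at a time."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def key_grams(
    corpus: list[list[str]], n: int, min_count: int
) -> list[tuple[str, ...]]:
    """Count every gram in the corpus and keep those seen at least min_count times."""
    counts: Counter = Counter()
    for tokens in corpus:
        counts.update(sliding_grams(tokens, n))
    return sorted(gram for gram, count in counts.items() if count >= min_count)
```

A token list shorter than `n` yields no grams, so very short queries simply contribute nothing to this feature.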
Preferably, the topic word set feature vector uses a topic to represent a concept or an aspect. A topic manifests as a series of related key topic words and is expressed as the conditional probability of those words.
Preferably, in step (5), the similarity scoring formula after multi-feature fusion is:
score(q, d) = a * weightword(q, d) + b * gramScore(q, d) + c * Sim_topic(q, d)
where a + b + c = 1. The objective is to find a parameter combination {a, b, c}: by formulating and solving a mathematical model on training data, the combination (a, b, c) is adaptively tuned to the optimum. Specifically, the value range of each of the three parameters a, b, and c is first restricted to (0, 1), and appropriate values are then chosen empirically.
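One way to read the weight selection is as a search over the constraint set a + b + c = 1 with each weight in (0, 1). The sketch below is an assumption-laden illustration: the patent does not specify the training objective, so the pairwise-ordering criterion, the grid step, and all function names here are hypothetical.

```python
from itertools import product


def fused_score(a, b, c, word_score, gram_score, topic_score):
    """Linear fusion of the three per-feature similarity scores."""
    return a * word_score + b * gram_score + c * topic_score


def candidate_weights(step: float = 0.1):
    """Yield (a, b, c) on a grid with each weight in (0, 1) and a + b + c = 1."""
    n = round(1 / step)
    for i, j in product(range(1, n), repeat=2):
        k = n - i - j
        if k >= 1:
            yield i / n, j / n, k / n


def fit_weights(samples, step: float = 0.1):
    """Pick the grid point that orders the most training pairs correctly.

    Each sample is (relevant_features, irrelevant_features), where each
    element is a (word_score, gram_score, topic_score) triple.
    """
    def correct(weights):
        return sum(
            fused_score(*weights, *pos) > fused_score(*weights, *neg)
            for pos, neg in samples
        )

    return max(candidate_weights(step), key=correct)
```

A grid search is easy to reason about for three parameters; a gradient-based or learning-to-rank method would be a natural replacement if the labeled data set is large.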
Compared with the prior art, the invention has the following advantages:
The invention first performs query semantic expansion through a semantic dictionary, so that the relationships between query keywords and words are described more comprehensively and a comprehensive, accurate keyword description is built. It then builds a similarity model from multiple features (block-based term weighting, the language model, and the topic word set) and ranks the search results on the combined score, greatly improving the precision and recall of similar-case retrieval.
Brief description of the drawings
Fig. 1 is a schematic diagram of the offline construction of the multi-feature model in Embodiment 1 of the invention;
Fig. 2 is a flow diagram of the judicial similar-case search method based on multi-feature fusion in Embodiment 1 of the invention;
Fig. 3 is a schematic diagram of multi-feature fusion in Embodiment 1 of the invention;
Fig. 4 is a schematic diagram of the vector space model principle in Embodiment 1 of the invention.
Detailed description
Referring to Figs. 1-4, the invention discloses a judicial similar-case search method based on multi-feature fusion, with the following specific steps:
Embodiment 1
The invention discloses a judicial similar-case search method based on multi-feature fusion. The specific steps are as follows:
(1) The user inputs a query request.
(2) The query request is preprocessed and segmented into words, and stop words are removed, yielding a set of query keywords.
(3) The query word set is traversed in order; for each query word in the set, query semantic expansion is performed through a semantic dictionary, yielding an expanded list of query semantic keywords.
(4) Documents are filtered by information points, and the feature inverted index is searched to obtain the different feature vectors for the keyword list: the block-partitioned keyword feature vector, the language model feature vector, and the topic word set feature vector.
The block-partitioned keyword feature vector is obtained by computing tf-idf statistics for the terms in each document block and then grouping them by block.
The language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word fragments of length N. Each word fragment is called a gram. The occurrence frequency of every gram is counted, the grams are filtered against a preset threshold, and the remaining grams form the key gram list. The calculation of the adjacent-word similarity score is illustrated with the 2-gram model. In the calculation formula, gramScore(q, d) denotes the adjacent-word similarity score between the query string q and the document d, 2-gram(q) denotes the 2-gram set of the query string, and 2-gram(d) denotes the 2-gram set of the document.
The algorithm is as follows:
Input: the preprocessed query string q and document d.
Output: the adjacent-word similarity score between q and d.
A. Compute the 2-gram set 2-gram(q) of q;
B. Compute the 2-gram set 2-gram(d) of d;
C. Compute the adjacent-word similarity score gramScore(q, d) of q and d from 2-gram(q) and 2-gram(d).
The topic word set feature vector uses a topic to represent a concept or an aspect. A topic manifests as a series of related key topic words and is expressed as the conditional probability of those words.
Multi-feature fusion is then performed on the feature vectors above.
(5) The fused similarity value between each document and the query statement is computed, giving the final similarity score. The specific steps are as follows.
The model treats a document as a vector of t feature dimensions. Features are usually represented by words, the weight of each feature can be computed according to some criterion, and the t weighted features together constitute a document.
To compute the score, both the document and the query are expressed as vectors. A document is regarded as a series of words (terms), each with a weight (term weight), and different terms influence the document's relevance score according to their weights in the document.
The weights of all terms in a document are then regarded as a vector:
Document = {term1, term2, ..., termN}
Document Vector = {weight1, weight2, ..., weightN}
The query statement is likewise regarded as a simple document and is also represented as a vector:
Query = {term1, term2, ..., termN}
Query Vector = {weight1, weight2, ..., weightN}
All retrieved document vectors and the query vector are placed in one N-dimensional space in which each term is one dimension. The vector space model principle is shown in Fig. 4.
The similarity value between the document and the query statement is then obtained by a formula over the document vector and the query vector.
Query semantic expansion makes the description of the relationships between query keywords and words more comprehensive. The block-partitioned keyword feature captures how keywords are distributed. The language-model keyword feature captures keyword dependencies and the constraint that context places on keyword semantics. The topic-word-set keyword feature introduces the correlation between query terms and topic words and captures the likelihood between the query and document blocks. The goal is to combine the block-partitioned keyword feature, the language model feature, and the topic word feature so that they offset one another's weaknesses, complement one another, and jointly describe a document, and to compute the similarity between the query and the document from these features.
The similarity scoring formula after multi-feature fusion is:
score(q, d) = a * weightword(q, d) + b * gramScore(q, d) + c * Sim_topic(q, d)
where a + b + c = 1. The objective is to find a parameter combination {a, b, c}: by formulating and solving a mathematical model on training data, the combination (a, b, c) is adaptively tuned to the optimum. Specifically, the value range of each of the three parameters a, b, and c is first restricted to (0, 1), and appropriate values are then chosen empirically.
(6) The final search results are ranked and output.
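The text of steps (4) and (5) names the adjacent-word score gramScore(q, d) and the vector space similarity but reproduces neither formula. The sketch below shows one plausible instance of each, and both choices are assumptions rather than the patent's definitions: a Dice overlap between the two 2-gram sets, and the standard cosine similarity between sparse term-weight vectors. The function names are hypothetical.

```python
import math


def bigrams(tokens: list[str]) -> set[tuple[str, str]]:
    """Return the set of adjacent word pairs (2-grams) in a token list."""
    return set(zip(tokens, tokens[1:]))


def gram_score(query_tokens: list[str], doc_tokens: list[str]) -> float:
    """Assumed Dice overlap between 2-gram(q) and 2-gram(d), in [0, 1]."""
    q, d = bigrams(query_tokens), bigrams(doc_tokens)
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))


def cosine_similarity(
    query_vec: dict[str, float], doc_vec: dict[str, float]
) -> float:
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0
    return dot / (q_norm * d_norm)
```

Both functions return values between 0 and 1 for non-negative weights, which keeps the three fused terms on a comparable scale before the weights a, b, and c are applied.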
The above embodiment merely illustrates the principles and effects of the invention and is not intended to limit it. Anyone skilled in the art may modify or change the embodiment without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall be covered by the claims of the invention.
Claims (6)
1. A judicial similar-case search method based on multi-feature fusion, characterized in that the specific steps are as follows:
(1) the user inputs a query request;
(2) the query request is preprocessed and segmented into words, and stop words are removed, yielding a set of query keywords;
(3) the query word set is traversed in order and, for each query word in the set, query semantic expansion is performed through a semantic dictionary, yielding an expanded list of query semantic keywords;
(4) documents are filtered by information points, the feature inverted index is searched to obtain the different feature vectors for the keyword list, and multi-feature fusion is then performed;
(5) the fused similarity value between each document and the query statement is computed, giving the final similarity score;
(6) the final search results are ranked and output.
2. The judicial similar-case search method based on multi-feature fusion according to claim 1, characterized in that, in step (4), the feature vectors comprise a block-partitioned keyword feature vector, a language model feature vector, and a topic word set feature vector.
3. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that the block-partitioned keyword feature vector is obtained by computing tf-idf statistics for the terms in each document block and then grouping them by block.
4. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that the language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word fragments of length N, each word fragment being called a gram; the occurrence frequency of every gram is counted, the grams are filtered against a preset threshold, and the remaining grams form the key gram list.
5. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that the topic word set feature vector uses a topic to represent a concept or an aspect, the topic manifesting as a series of related key topic words and being expressed as the conditional probability of those words.
6. The judicial similar-case search method based on multi-feature fusion according to claim 1, characterized in that, in step (5), the similarity scoring formula after multi-feature fusion is:
score(q, d) = a * weightword(q, d) + b * gramScore(q, d) + c * Sim_topic(q, d)
where a + b + c = 1. The objective is to find a parameter combination {a, b, c}: by formulating and solving a mathematical model on training data, the combination (a, b, c) is adaptively tuned to the optimum. Specifically, the value range of each of the three parameters a, b, and c is first restricted to (0, 1), and appropriate values are then chosen empirically.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710289597.XA CN108804443A (en) | 2017-04-27 | 2017-04-27 | A kind of judicial class case searching method based on multi-feature fusion |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710289597.XA CN108804443A (en) | 2017-04-27 | 2017-04-27 | A kind of judicial class case searching method based on multi-feature fusion |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108804443A true CN108804443A (en) | 2018-11-13 |
Family
ID=64070316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710289597.XA Pending CN108804443A (en) | 2017-04-27 | 2017-04-27 | A kind of judicial class case searching method based on multi-feature fusion |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108804443A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110222260A (en) * | 2019-05-21 | 2019-09-10 | 深圳壹账通智能科技有限公司 | A kind of searching method, device and storage medium |
CN110347812A (en) * | 2019-06-25 | 2019-10-18 | 银江股份有限公司 | A kind of search ordering method and system towards judicial style |
CN110582761A (en) * | 2018-10-24 | 2019-12-17 | 阿里巴巴集团控股有限公司 | Intelligent customer service based on vector propagation model on click graph |
CN111368022A (en) * | 2020-02-28 | 2020-07-03 | 山东汇贸电子口岸有限公司 | Method and tool for realizing book screening by using reverse index |
CN111797247A (en) * | 2020-09-10 | 2020-10-20 | 平安国际智慧城市科技股份有限公司 | Case pushing method and device based on artificial intelligence, electronic equipment and medium |
CN112131456A (en) * | 2019-06-24 | 2020-12-25 | 腾讯科技(北京)有限公司 | Information pushing method, device, equipment and storage medium |
CN113535805A (en) * | 2021-06-17 | 2021-10-22 | 科大讯飞股份有限公司 | Data mining method and related device, electronic equipment and storage medium |
CN115017257A (en) * | 2022-04-21 | 2022-09-06 | 南京坤爵信息技术有限公司 | Intelligent super retrieval method based on KTree algorithm |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101540017A (en) * | 2009-04-28 | 2009-09-23 | 黑龙江工程学院 | Feature extraction method based on byte level n-gram and junk mail filter |
CN104050243A (en) * | 2014-05-28 | 2014-09-17 | 黄斌 | Network searching method and system combined with searching and social contact |
CN104050235A (en) * | 2014-03-27 | 2014-09-17 | 浙江大学 | Distributed information retrieval method based on set selection |
CN104143005A (en) * | 2014-08-04 | 2014-11-12 | 五八同城信息技术有限公司 | Related searching system and method |
CN104778201A (en) * | 2015-01-23 | 2015-07-15 | 湖南科技大学 | Multi-query result combination-based prior art retrieval method |
CN105117386A (en) * | 2015-09-19 | 2015-12-02 | 杭州电子科技大学 | Semantic association method based on book content structures |
CN106294662A (en) * | 2016-08-05 | 2017-01-04 | 华东师范大学 | Inquiry based on context-aware theme represents and mixed index method for establishing model |
Non-Patent Citations (1)
Title |
---|
斯日古楞等: ""融合主题与语言模型的蒙古文信息检索方法研究"", 《计算机应用研究》 * |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110582761A (en) * | 2018-10-24 | 2019-12-17 | 阿里巴巴集团控股有限公司 | Intelligent customer service based on vector propagation model on click graph |
CN110582761B (en) * | 2018-10-24 | 2023-05-30 | 创新先进技术有限公司 | Smart customer service based on vector propagation model on click graph |
CN110222260A (en) * | 2019-05-21 | 2019-09-10 | 深圳壹账通智能科技有限公司 | A kind of searching method, device and storage medium |
CN112131456A (en) * | 2019-06-24 | 2020-12-25 | 腾讯科技(北京)有限公司 | Information pushing method, device, equipment and storage medium |
CN110347812A (en) * | 2019-06-25 | 2019-10-18 | 银江股份有限公司 | A kind of search ordering method and system towards judicial style |
CN110347812B (en) * | 2019-06-25 | 2021-09-10 | 银江股份有限公司 | Search ordering method and system for judicial texts |
CN111368022A (en) * | 2020-02-28 | 2020-07-03 | 山东汇贸电子口岸有限公司 | Method and tool for realizing book screening by using reverse index |
CN111797247A (en) * | 2020-09-10 | 2020-10-20 | 平安国际智慧城市科技股份有限公司 | Case pushing method and device based on artificial intelligence, electronic equipment and medium |
CN113535805A (en) * | 2021-06-17 | 2021-10-22 | 科大讯飞股份有限公司 | Data mining method and related device, electronic equipment and storage medium |
CN113535805B (en) * | 2021-06-17 | 2024-06-04 | 科大讯飞股份有限公司 | Data mining method, related device, electronic equipment and storage medium |
CN115017257A (en) * | 2022-04-21 | 2022-09-06 | 南京坤爵信息技术有限公司 | Intelligent super retrieval method based on KTree algorithm |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108804443A (en) | A kind of judicial class case searching method based on multi-feature fusion | |
CN109101479B (en) | Clustering method and device for Chinese sentences | |
CN104765769B (en) | The short text query expansion and search method of a kind of word-based vector | |
CN112100344B (en) | Knowledge graph-based financial domain knowledge question-answering method | |
CN106502994B (en) | method and device for extracting keywords of text | |
El-Fishawy et al. | Arabic summarization in twitter social network | |
CN105528437B (en) | A kind of question answering system construction method extracted based on structured text knowledge | |
Radu et al. | Clustering documents using the document to vector model for dimensionality reduction | |
Asyaky et al. | Improving the performance of HDBSCAN on short text clustering by using word embedding and UMAP | |
CN101751455A (en) | Method for automatically generating title by adopting artificial intelligence technology | |
US20220114340A1 (en) | System and method for an automatic search and comparison tool | |
CN109614493B (en) | Text abbreviation recognition method and system based on supervision word vector | |
Halevy et al. | Discovering structure in the universe of attribute names | |
CN112632261A (en) | Intelligent question and answer method, device, equipment and storage medium | |
Zu et al. | Graph-based keyphrase extraction using word and document em beddings | |
CN115357691B (en) | Semantic retrieval method, system, equipment and computer readable storage medium | |
Shuai et al. | Question answering system based on knowledge graph of film culture | |
CN114298020B (en) | Keyword vectorization method based on topic semantic information and application thereof | |
Cherif et al. | Text categorization based on a new classification by thresholds | |
Akhgari et al. | Sem-TED: semantic twitter event detection and adapting with news stories | |
CN107220354A (en) | A kind of big data search method | |
Liu et al. | The short text matching model enhanced with knowledge via contrastive learning | |
Rautaray et al. | An Empirical and Comparative Study of Graph based Summarization Algorithms | |
CN112380830B (en) | Matching method, system and computer readable storage medium for related sentences in different documents | |
Wei | An iterative approach to keywords extraction |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication |
 | SE01 | Entry into force of request for substantive examination |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20181113