CN108804443A - A judicial similar-case search method based on multi-feature fusion

Info

Publication number: CN108804443A
Application number: CN201710289597.XA
Authority: CN
Inventors: 耿伟; 司华建; 贾真
Original assignee: Anhui Fu Chi Information Technology Co Ltd
Current assignee: Anhui Fu Chi Information Technology Co Ltd
Priority date: 2017-04-27
Filing date: 2017-04-27
Publication date: 2018-11-13

Abstract

The invention discloses a judicial similar-case search method based on multi-feature fusion, comprising the following steps: a user inputs a query request; the query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords; the query word set is traversed in order and, for each query word, semantic query expansion is performed using a semantic dictionary to obtain an expanded list of query keywords; documents are filtered by information points, the feature inverted indexes are searched to obtain the different feature vectors of the keyword list, and multi-feature fusion is performed; the fused similarity between each document and the query statement is computed to obtain the final similarity score; and the final search results are sorted and output. The invention offers advantages such as high accuracy.

Description

A judicial similar-case search method based on multi-feature fusion

Technical field

The present invention relates to the field of judicial similar-case search, and specifically to a judicial similar-case search method based on multi-feature fusion.

Background art

Law is a product of the state. It refers to the basic laws and ordinary laws enacted through a certain legislative procedure by the ruling class (the ruling group, that is, a political party, including kings and monarchs) in order to rule and administer the country. Law is the embodiment of the will of all citizens and an instrument of state governance.

As public information becomes more open, society pays increasing attention to the outcomes of legal cases. If similar judgment documents can be recommended promptly as references during a trial, the quality of the trial can be improved effectively. At present, keyword-based text retrieval systems are generally used; they compare the similarity of two cases by simple word matching and rarely return ideal search results. The reasons fall into three aspects: keyword features describe document information incompletely, which makes the similarity calculation inaccurate; keywords located in different sections of a document affect the final similarity judgment differently; and the constraint that context places on keyword semantics is not considered in fine enough detail, so differences caused by changes in context cannot be distinguished effectively. Developing a highly accurate search method has therefore become an important task.

Summary of the invention

The technical problem to be solved by the present invention is to overcome the low retrieval efficiency and limited accuracy of the prior art by providing a judicial similar-case search method based on multi-feature fusion.

The technical solution provided by the present invention to solve the above technical problem is a judicial similar-case search method based on multi-feature fusion, comprising the following steps:

(1) A user inputs a query request;

(2) The user's query request is preprocessed and segmented into words, and stop words are removed to obtain a set of query keywords;

(3) The query word set is traversed in order; for each query word in the set, semantic query expansion is performed using a semantic dictionary to obtain the expanded list of query keywords;

(4) Documents are filtered by information points, and the feature inverted indexes are searched to obtain the different feature vectors of the keyword list, after which multi-feature fusion is performed;

(5) The fused similarity value between each document and the query statement is computed to obtain the final similarity score;

(6) The final search results are sorted and output.
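Steps (2) and (3) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the stop-word list, the semantic dictionary and the function names are hypothetical, and the regular-expression tokenizer stands in for a real Chinese word segmenter.

```python
import re

# Hypothetical stop-word list and semantic dictionary, for illustration only.
STOP_WORDS = {"the", "a", "of", "and", "in"}
SEMANTIC_DICT = {
    "theft": ["larceny", "stealing"],
    "vehicle": ["car", "automobile"],
}


def segment(query):
    """Lowercase the query and split it into word tokens (stand-in segmenter)."""
    return re.findall(r"\w+", query.lower())


def extract_keywords(query):
    """Step (2): segment the query and drop stop words, preserving order."""
    return [t for t in segment(query) if t not in STOP_WORDS]


def expand_keywords(keywords, dictionary):
    """Step (3): follow each keyword with its dictionary entries, skipping duplicates."""
    expanded = []
    for word in keywords:
        for candidate in [word, *dictionary.get(word, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


keywords = extract_keywords("Theft of a vehicle in the city")
expanded = expand_keywords(keywords, SEMANTIC_DICT)
```

Keeping the original keyword ahead of its expansions preserves the query order, which makes it easy to weight original terms differently from expanded ones later if desired.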

Preferably, in step (4), the feature vectors include a block-grouped keyword feature vector, a language model feature vector, and a topic word set feature vector.

Preferably, the block-grouped keyword feature vector is obtained by computing the tf-idf information of the terms in each block and then grouping the terms by block.
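A per-block tf-idf computation might look like the sketch below. The block names ("facts", "judgment"), the `block_tfidf` function and the particular tf-idf variant (relative term frequency times log of inverse document frequency) are assumptions for illustration; the text does not specify them.

```python
import math
from collections import Counter


def block_tfidf(corpus):
    """corpus: list of documents, each mapping block name -> list of terms.

    Returns one dict per document: block name -> {term: tf-idf weight}.
    """
    n_docs = len(corpus)
    # Document frequency: number of documents containing the term in any block.
    df = Counter()
    for doc in corpus:
        df.update({t for terms in doc.values() for t in terms})

    weighted = []
    for doc in corpus:
        per_block = {}
        for block, terms in doc.items():
            counts = Counter(terms)
            total = len(terms)
            # Term frequency is taken relative to the block, not the whole document.
            per_block[block] = {
                t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()
            }
        weighted.append(per_block)
    return weighted


corpus = [
    {"facts": ["theft", "car", "theft"], "judgment": ["guilty"]},
    {"facts": ["fraud", "contract"], "judgment": ["guilty"]},
]
weights = block_tfidf(corpus)
```

Because weights are kept per block, the same keyword can contribute differently depending on the section of the judgment document in which it appears.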

Preferably, the language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form the key gram list.
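The sliding-window extraction and threshold filtering can be sketched as follows. The function names and the `min_count` parameter are hypothetical, and the inputs are assumed to be already segmented token lists.

```python
from collections import Counter


def sliding_grams(tokens, n):
    """Slide a window of size n over the tokens and return each length-n fragment."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def key_gram_list(documents, n, min_count):
    """Count every gram across the token lists and keep those reaching min_count."""
    counts = Counter()
    for tokens in documents:
        counts.update(sliding_grams(tokens, n))
    return {gram: c for gram, c in counts.items() if c >= min_count}


docs = [
    ["intentional", "injury", "case"],
    ["intentional", "injury", "dispute"],
]
key_grams = key_gram_list(docs, n=2, min_count=2)
```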

Preferably, the topic word set feature vector uses a topic to represent a concept or an aspect; a topic appears as a series of related key topic words and is expressed as the conditional probabilities of these words;

Preferably, in step (5), the similarity scoring formula after multi-feature fusion is as follows:

$$\mathrm{Score}(q, d) = a \cdot \mathrm{weightword}(q, d) + b \cdot \mathrm{gramScore}(q, d) + c \cdot \mathrm{Sim\_topic}(q, d)$$

where a + b + c = 1. The objective is to find a feasible parameter combination {a, b, c}: through the formulation and solution of a mathematical model together with training data, the combination (a, b, c) is adjusted adaptively until it is optimal. Specifically, the value range of the three parameters a, b and c is first restricted to (0, 1), and suitable values are chosen from experience.
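The weighted fusion and the subsequent ranking can be sketched as follows. This is a minimal illustration, not the patent's implementation: `fused_score` and `rank` are hypothetical names, and the three per-feature similarities (keyword weight, gram score, topic word set similarity) are assumed to have been computed already.

```python
def fused_score(weightword, gram_score, sim_topic, a, b, c):
    """Weighted sum of the three feature similarities, with a + b + c = 1."""
    if not all(0.0 < w < 1.0 for w in (a, b, c)):
        raise ValueError("each weight must lie in the open interval (0, 1)")
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return a * weightword + b * gram_score + c * sim_topic


def rank(candidates, a, b, c):
    """candidates: doc id -> (weightword, gramScore, topic similarity). Best first."""
    scored = {doc: fused_score(*feats, a, b, c) for doc, feats in candidates.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


ranking = rank(
    {"doc1": (0.8, 0.5, 0.2), "doc2": (0.1, 0.9, 0.9)},
    a=0.5, b=0.3, c=0.2,
)
```

With these illustrative weights, a strong keyword match outweighs strong gram and topic matches; different weights would reverse the order, which is why the weights are tuned on training data.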

Compared with the prior art, the present invention has the following advantages:

The present invention first performs semantic query expansion using a semantic dictionary, so that the relationships between query keywords and other words are described more completely and a comprehensive, accurate keyword description is built. It then constructs a similarity model from multiple features, namely block-weighted terms, the language model and the topic word set, and ranks the search results on the combined score, which greatly improves the precision and recall of similar-case retrieval.

Description of the drawings

Fig. 1 is a schematic diagram of the offline construction of the multi-feature model in Embodiment 1 of the present invention;

Fig. 2 is a flow diagram of the judicial similar-case search method based on multi-feature fusion in Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of multi-feature fusion in Embodiment 1 of the present invention;

Fig. 4 is a schematic diagram of the vector space model principle in Embodiment 1 of the present invention.

Detailed description

Referring to Figs. 1-4, the invention discloses a judicial similar-case search method based on multi-feature fusion, comprising the following steps:

(1) A user inputs a query request;

(6) The final search results are sorted and output.

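The similarity scoring formula after multi-feature fusion is:

$$\mathrm{Score}(q, d) = a \cdot \mathrm{weightword}(q, d) + b \cdot \mathrm{gramScore}(q, d) + c \cdot \mathrm{Sim\_topic}(q, d)$$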

Embodiment 1

The invention discloses a judicial similar-case search method based on multi-feature fusion, comprising the following steps:

(1) A user inputs a query request;

(4) Documents are filtered by information points, and the feature inverted indexes are searched to obtain the different feature vectors of the keyword list, including the block-grouped keyword feature vector, the language model feature vector and the topic word set feature vector.

The block-grouped keyword feature vector is obtained by computing the tf-idf information of the terms in each block and then grouping the terms by block.

The language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form the key gram list. Taking the 2-gram model as an example, the word-adjacency similarity score is calculated by the following formula:

Here gramScore(q, d) denotes the word-adjacency similarity score between query string q and document d; 2-gram(q) denotes the 2-gram set of the query string, and 2-gram(d) denotes the 2-gram set of the document.

The algorithm is as follows:

Input: the preprocessed query string q and document d.

Output: the word-adjacency similarity score between q and d.

a. Obtain the 2-gram set of q, 2-gram(q);

b. Obtain the 2-gram set of d, 2-gram(d);

c. Compute the word-adjacency similarity score gramScore(q, d) of q and d from 2-gram(q) and 2-gram(d).
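Steps a to c can be sketched as follows. Because the exact scoring formula is not shown in the text above, the sketch assumes a Dice overlap between the two 2-gram sets; that choice, and the function names, are illustrative assumptions rather than the patent's definition.

```python
def two_grams(tokens):
    """Return the set of adjacent token pairs (2-grams) in a token list."""
    return set(zip(tokens, tokens[1:]))


def gram_score(query_tokens, doc_tokens):
    """Dice overlap of the two 2-gram sets (an assumed stand-in for the formula)."""
    q = two_grams(query_tokens)   # step a
    d = two_grams(doc_tokens)     # step b
    if not q or not d:
        return 0.0
    return 2 * len(q & d) / (len(q) + len(d))  # step c


score = gram_score(
    ["intentional", "injury", "minor"],
    ["intentional", "injury", "serious", "harm"],
)
```

Any other set-overlap measure (for example Jaccard) could be substituted in step c without changing steps a and b.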

The topic word set feature vector uses a topic to represent a concept or an aspect; a topic appears as a series of related key topic words and is expressed as the conditional probabilities of these words.

Multi-feature fusion is then performed on the above feature vectors;

(5) The fused similarity value between each document and the query statement is computed to obtain the final similarity score. The specific steps are as follows:

The model assumes that a document is a vector of t feature dimensions. Features are usually represented by words, and the weight of each feature can be computed according to some criterion; these t weighted features together constitute a document;

To compute the score, both documents and queries are expressed as vectors. A document is regarded as a series of words (terms), each with a weight (term weight); different terms influence the document's relevance score according to their own weights in the document.

The weights (term weights) of all terms in a document are then regarded as a vector:

Document = {term1, term2, ..., termN}

Document Vector = {weight1, weight2, ..., weightN}

Likewise, the query statement is regarded as a simple document and is also represented as a vector:

Query = {term1, term2, ..., termN}

Query Vector = {weight1, weight2, ..., weightN}

All retrieved document vectors and the query vector are placed in an N-dimensional space in which each word (term) is one dimension. The principle of the vector space model is shown in Fig. 4.

The similarity value between a document and the query statement is then obtained by the following formula:
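A common choice for this similarity in the vector space model is the cosine of the angle between the query vector and the document vector. The sketch below assumes that choice, which the text does not confirm; the sparse dictionary representation (term to weight) and the function name are likewise illustrative.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors (term -> weight)."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector is treated as having no similarity
    return dot / (q_norm * d_norm)


similarity = cosine_similarity({"theft": 1.0, "car": 1.0}, {"theft": 1.0})
```

Terms missing from one vector contribute zero to the dot product, which corresponds to a zero coordinate on that term's dimension in the N-dimensional space.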

Semantic query expansion makes the description of the relationships between query keywords and other words more complete. The keyword feature based on block grouping captures how keywords are distributed across the document; the keyword feature based on the language model captures the dependencies between keywords and the constraint that context places on keyword semantics; and the keyword feature based on the topic word set introduces the correlation between query terms and topic words, reflecting the likelihood between the query and document blocks. The goal is to combine the block-grouped keyword feature, the language model feature and the topic word feature so that they offset one another's weaknesses and jointly describe a document, and to compute the similarity between the query and the document from these features.

The similarity scoring formula after multi-feature fusion is as follows:

$$\mathrm{Score}(q, d) = a \cdot \mathrm{weightword}(q, d) + b \cdot \mathrm{gramScore}(q, d) + c \cdot \mathrm{Sim\_topic}(q, d)$$

where a + b + c = 1. The objective is to find a feasible parameter combination {a, b, c}: through the formulation and solution of a mathematical model together with training data, the combination (a, b, c) is adjusted adaptively until it is optimal. Specifically, the value range of the three parameters a, b and c is first restricted to (0, 1), and suitable values are chosen from experience;
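One simple way to tune (a, b, c) on training data is an exhaustive search over a grid of weights that satisfy the constraints. The text does not say which optimization procedure is used, so the grid search, the squared-error loss, the labelled-sample format and the function names below are all assumptions made for illustration.

```python
def weight_grid(step=0.1):
    """Yield (a, b, c) with each weight in (0, 1) and a + b + c = 1."""
    n = round(1 / step)
    for i in range(1, n - 1):
        for j in range(1, n - i):
            yield i / n, j / n, (n - i - j) / n


def fit_weights(samples, step=0.1):
    """samples: list of ((weightword, gramScore, topic_sim), label), label in [0, 1].

    Returns the grid point minimizing squared error between fused score and label.
    """
    def loss(weights):
        a, b, c = weights
        return sum((a * f[0] + b * f[1] + c * f[2] - y) ** 2 for f, y in samples)

    return min(weight_grid(step), key=loss)


# Hypothetical training samples: per-feature similarities with a relevance label.
samples = [
    ((1.0, 0.0, 0.0), 0.2),
    ((0.0, 1.0, 0.0), 0.5),
    ((0.0, 0.0, 1.0), 0.3),
]
best = fit_weights(samples)
```

A finer `step` gives a denser grid at higher cost; a gradient-based or learning-to-rank method could replace the exhaustive search when more training data is available.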

(6) The final search results are sorted and output.

The above embodiment merely illustrates the principles and effects of the present invention and is not intended to limit it. Anyone skilled in the art may modify or change the above embodiment without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein shall be covered by the claims of the present invention.

Claims

1. A judicial similar-case search method based on multi-feature fusion, characterized by comprising the following steps:

(1) A user inputs a query request;

(3) The query word set is traversed in order; for each query word in the set, semantic query expansion is performed using a semantic dictionary to obtain the expanded list of query keywords;

(4) Documents are filtered by information points, and the feature inverted indexes are searched to obtain the different feature vectors of the keyword list, after which multi-feature fusion is performed;

(6) The final search results are sorted and output.

2. The judicial similar-case search method based on multi-feature fusion according to claim 1, characterized in that: in step (4), the feature vectors include a block-grouped keyword feature vector, a language model feature vector, and a topic word set feature vector.

3. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that: the block-grouped keyword feature vector is obtained by computing the tf-idf information of the terms in each block and then grouping the terms by block.

4. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that: the language model feature vector is obtained by sliding a window of size N over the text to form a sequence of word fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form the key gram list.

5. The judicial similar-case search method based on multi-feature fusion according to claim 2, characterized in that: the topic word set feature vector uses a topic to represent a concept or an aspect; a topic appears as a series of related key topic words and is expressed as the conditional probabilities of these words.

6. The judicial similar-case search method based on multi-feature fusion according to claim 1, characterized in that: in step (5), the similarity scoring formula after multi-feature fusion is as follows:

$$\mathrm{Score}(q, d) = a \cdot \mathrm{weightword}(q, d) + b \cdot \mathrm{gramScore}(q, d) + c \cdot \mathrm{Sim\_topic}(q, d)$$

where a + b + c = 1. The objective is to find a feasible parameter combination {a, b, c}: through the formulation and solution of a mathematical model together with training data, the combination (a, b, c) is adjusted adaptively until it is optimal. Specifically, the value range of the three parameters a, b and c is first restricted to (0, 1), and suitable values are chosen from experience.