CN106649308A - Updating method and system of word segmentation library - Google Patents
- Publication number: CN106649308A
- Application number: CN201510715638.8A
- Authority: CN (China)
- Prior art keywords: word segmentation, word group, word, variable, base
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention provides a system for updating a word segmentation library. The system comprises a log acquisition module, a log analysis module, a word segmentation evaluation module, and a word segmentation correction and filtering module, the last of which comprises a construction submodule, a cutting submodule, and a filter submodule. The invention also provides a corresponding method. Based on analysis of word segmentation service logs, the system and method evaluate the segmentation results of a word segmentation service system and extract the inputs that were segmented poorly. Using a Z segmentation filtering algorithm that relies on a reference probability table, they correct the segmentation of those inputs, filter out new word groups, and update the word segmentation library with them. The library is thereby improved continuously, which solves the problem that a word segmentation library cannot adapt in time to the actual application environment, and segmentation quality is effectively improved.
Description
Technical field
The present invention relates to the technical field of data processing and, more particularly, to a method and system for updating a word segmentation library.
Background art
In a search system, the quality of word segmentation is a key factor affecting search quality, and the library on which segmentation relies is an important component of word segmentation technology.
The most common way to generate a word library at present is statistical: the frequency of each combination of adjacent, co-occurring characters (that is, each character group) in the input corpus is counted, and its mutual information is calculated. Mutual information reflects how tightly the Chinese characters are bound to one another; when it exceeds a certain threshold, the character group is considered likely to form a word. A library is generated in this way and then applied to the online word segmentation service.
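The statistical approach described above can be sketched as follows. The text does not give a formula, so this sketch assumes pointwise mutual information over adjacent character pairs, which is one common formulation; the function name `adjacent_pmi` and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def adjacent_pmi(corpus):
    """Return the pointwise mutual information of every adjacent character pair."""
    char_counts = Counter()
    pair_counts = Counter()
    for line in corpus:
        char_counts.update(line)
        pair_counts.update(line[i:i + 2] for i in range(len(line) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for pair, n in pair_counts.items():
        p_xy = n / total_pairs                      # probability of the pair
        p_x = char_counts[pair[0]] / total_chars    # probability of the first character
        p_y = char_counts[pair[1]] / total_chars    # probability of the second character
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi

pmi = adjacent_pmi(["AB", "AB", "AC"])
# Pairs scoring above a chosen threshold (an assumed tuning parameter) become candidate words.
candidates = [pair for pair, score in pmi.items() if score >= 1.0]
```

A pair is kept only when its score clears the threshold, which is also why frequent but meaningless character groups can slip into a library built this way.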
However, a Chinese word library generated by the word-frequency statistics method described above has the following main technical problems: commonly used character groups that co-occur frequently but are not words are often segmented out as words; most libraries are general-purpose and do not suit vertical search scenarios such as product name search, place name search, and personal name search; the library is usually static, generated offline and then used online, so it cannot be updated and improved quickly according to actual usage; and the library recognizes new words poorly.
Summary of the invention
The technical problem to be solved by the present invention is to provide a method and system for updating a word segmentation library that address the above drawbacks of Chinese word libraries generated by existing word-frequency statistics methods.
To solve the above problem, the present invention provides a system for updating a word segmentation library, comprising:
a log acquisition module, for collecting the word segmentation service logs that a word segmentation service system outputs while running;
a log analysis module, for performing statistical analysis on the word segmentation service logs collected by the log acquisition module and extracting relevant valid data;
a word segmentation evaluation module, for evaluating the relevant valid data according to evaluation rules to obtain the word segmentation inputs that were segmented poorly; and
a word segmentation correction and filtering module, for performing segmentation correction on the poorly segmented inputs obtained by the word segmentation evaluation module, filtering out a new word group, and updating the word segmentation library with the new word group.
In the above system, the word segmentation service system includes a search system. The relevant valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a word segmentation input. The evaluation rules include: the conversion rate of a search keyword is below a first preset threshold, and/or the number of search results is below a second preset threshold, and/or the usage is below a preset threshold, and/or the segmentation result of a word segmentation input is greater than a third preset threshold.
In the above system, the word segmentation correction and filtering module includes a construction submodule and a cutting submodule, wherein:
the construction submodule scans corpus data and calculates the probability of each word being followed by the next word, so as to construct a reference probability table; and
the cutting submodule performs full cutting on the poorly segmented inputs to obtain a base segmentation word group.
In the above system, the word segmentation correction and filtering module further includes a filter submodule. Using the Z segmentation filtering algorithm with the reference probability table, the filter submodule filters the base segmentation word group produced by the cutting submodule's full cutting to obtain the new word group, and updates the word segmentation library with the new word group.
In the above system, the filter submodule includes:
a scanning unit, for scanning the base segmentation word group and obtaining a forward word list of words that are shared by the base segments in the group but are not contained in the group;
a first judging unit, for judging whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0;
a first adding unit, for, when the length of the forward word list is judged to be greater than i, looking up the probability of the i-th forward word of the list in the reference probability table and, when that probability is judged to exist or to be greater than or equal to a preset first threshold, adding the i-th forward word to the base segmentation word group;
a first incrementing unit, for incrementing i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the base segmentation word group;
a second scanning unit, for, when the length of the forward word list is judged to be less than or equal to i, scanning the base segmentation word group to obtain a set of phrases having a forward relation, where each such phrase is expressed as {A, B}, A being the first term and B the second term;
a second judging unit, for judging whether the size of the set is less than a second variable j, where the initial value of j is 0;
a second adding unit, for, when the size of the set is judged to be less than j, taking the first term A and the second term B of the j-th phrase in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); and, when P(B|A) is judged to be less than a preset second threshold, judging whether the second term B already exists in the word segmentation library and, if not, adding B to the base segmentation word group;
a second incrementing unit, for incrementing j when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second term B is judged to exist already in the word segmentation library, or after the second term B has been added to the base segmentation word group; and
a third adding unit, for, when the size of the set is judged to be greater than or equal to j, deduplicating the base segmentation word group and adding the resulting new word group to the word segmentation library.
The present invention also provides a method for updating a word segmentation library, comprising the following steps:
S1: collecting the word segmentation service logs that a word segmentation service system outputs while running;
S2: performing statistical analysis on the collected word segmentation service logs and extracting relevant valid data;
S3: evaluating the relevant valid data according to evaluation rules to obtain the word segmentation inputs that were segmented poorly;
S4: performing segmentation correction on the poorly segmented inputs, filtering out a new word group, and updating the word segmentation library with the new word group.
In the above method, the word segmentation service system includes a search system. The relevant valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a word segmentation input. The evaluation rules include: the conversion rate of a search keyword is below a first preset threshold, and/or the number of search results is below a second preset threshold, and/or the usage is below a preset threshold, and/or the segmentation result of a word segmentation input is greater than a third preset threshold.
In the above method, step S4 includes:
S41: scanning corpus data and calculating the probability of each word being followed by the next word, so as to construct a reference probability table;
S42: performing full cutting on the poorly segmented inputs to obtain a base segmentation word group.
In the above method, step S4 further includes:
S43: using the Z segmentation filtering algorithm with the reference probability table, filtering the base segmentation word group obtained by the full cutting to obtain the new word group, and updating the word segmentation library with the new word group.
In the above method, step S43 includes:
S431: scanning the base segmentation word group and obtaining a forward word list of words that are shared by the base segments in the group but are not contained in the group;
S432: judging whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0; if so, executing step S433; if not, executing step S435;
S433: when the length of the forward word list is judged to be greater than i, looking up the probability of the i-th forward word of the list in the reference probability table and, when that probability is judged to exist or to be greater than or equal to a preset first threshold, adding the i-th forward word to the base segmentation word group;
S434: incrementing i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the base segmentation word group; after i is incremented, repeating steps S432 to S434;
S435: when the length of the forward word list is judged to be less than or equal to i, scanning the base segmentation word group to obtain a set of phrases having a forward relation, where each such phrase is expressed as {A, B}, A being the first term and B the second term;
S436: judging whether the size of the set is less than a second variable j, where the initial value of j is 0; if so, executing step S437; if not, executing step S439;
S437: when the size of the set is judged to be less than j, taking the first term A and the second term B of the j-th phrase in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); when P(B|A) is judged to be less than a preset second threshold, judging whether the second term B already exists in the word segmentation library and, if not, adding B to the base segmentation word group;
S438: incrementing j when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second term B is judged to exist already in the word segmentation library, or after the second term B has been added to the base segmentation word group; after j is incremented, repeating steps S436 to S438;
S439: when the size of the set is judged to be greater than or equal to j, deduplicating the base segmentation word group and adding the resulting new word group to the word segmentation library.
The method and system of the present invention have the following beneficial effects. Based on analysis of word segmentation service logs, the segmentation results of the word segmentation service system are evaluated and the poorly segmented inputs are extracted. Using the Z segmentation filtering algorithm with the reference probability table, those inputs undergo segmentation correction, a new word group is filtered out, and the word segmentation library is updated with it. The library is thereby improved continuously, which solves the problem that a word segmentation library cannot adapt in time to the actual application environment and effectively improves segmentation quality. In addition, the word segmentation service system can periodically load the updated library and then continue to provide the word segmentation service, so updates take effect quickly.
Brief description of the drawings
Fig. 1 is a structural diagram of an embodiment of the system for updating a word segmentation library according to the present invention.
Fig. 2 is a flowchart of an embodiment of the method for updating a word segmentation library according to the present invention.
Fig. 3 is a detailed flowchart of an embodiment of the method for updating a word segmentation library according to the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
The system and method of the present invention are based on analysis of word segmentation service logs. They evaluate the segmentation results of the word segmentation service system and extract the poorly segmented inputs; using the Z segmentation filtering algorithm with the reference probability table, they perform segmentation correction on those inputs, filter out a new word group, and update the word segmentation library with it. The library is thereby improved continuously, which solves the problem that a word segmentation library cannot adapt in time to the actual application environment and effectively improves segmentation quality.
Fig. 1 is a structural diagram of an embodiment of the system for updating a word segmentation library according to the present invention. The system 100 includes a log acquisition module 110, a log analysis module 120, a word segmentation evaluation module 130, and a word segmentation correction and filtering module 140, wherein:
The input of the log acquisition module 110 is connected to the word segmentation service system and is used to collect the word segmentation service logs that the service system outputs while running. A word segmentation service system is any system that uses a word segmentation function, including a search system. In that case, the word segmentation service logs output by the search system while running are search service logs, which include the user's search input, the results returned by the search system, and the user's browsing and ordering behavior on the search results.
The input of the log analysis module 120 is connected to the output of the log acquisition module 110. The log analysis module performs statistical analysis on the word segmentation service logs collected by the log acquisition module and extracts relevant valid data. Taking a word segmentation service system that includes a search system as an example, the valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a word segmentation input, where:
the number of orders or views for search results is, for a given search term, the number of orders placed by users or the number of times users viewed a detail page;
the conversion rate of a search keyword is, for a given search term, the ratio of the number of detail-page views or orders to the number of searches;
the first-page hit rate of search results is, for a given search term, the ratio of the number of times users found the required result on the first page of search results to the number of searches;
the recall rate of a search keyword is the number of results the search system returns for a given search keyword;
the segmentation result of a word segmentation input is the number of words in the final segmentation result for the search keyword entered by the user.
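The metrics defined above can be sketched as a small aggregation over search log records. The record schema (`keyword`, `orders`, `detail_views`, `first_page_hit`, `result_count`, `tokens`) and the function name are hypothetical, since the text does not specify a log format; the sketch also assumes the conversion rate is computed from orders rather than detail-page views.

```python
from collections import defaultdict

def summarize_search_logs(records):
    """Aggregate per-keyword metrics from search log records (hypothetical schema)."""
    totals = defaultdict(lambda: {"searches": 0, "orders": 0, "views": 0, "first_page_hits": 0})
    latest = {}
    for rec in records:
        t = totals[rec["keyword"]]
        t["searches"] += 1
        t["orders"] += rec["orders"]
        t["views"] += rec["detail_views"]
        t["first_page_hits"] += 1 if rec["first_page_hit"] else 0
        latest[rec["keyword"]] = rec  # the latest record supplies result count and tokens
    summary = {}
    for keyword, t in totals.items():
        rec = latest[keyword]
        summary[keyword] = {
            "orders": t["orders"],
            "views": t["views"],
            "conversion_rate": t["orders"] / t["searches"],        # assumption: orders / searches
            "first_page_hit_rate": t["first_page_hits"] / t["searches"],
            "recall": rec["result_count"],                          # number of results returned
            "token_count": len(rec["tokens"]),                      # words in the segmentation result
        }
    return summary

records = [
    {"keyword": "red shoes", "orders": 1, "detail_views": 3, "first_page_hit": True,
     "result_count": 120, "tokens": ["red", "shoes"]},
    {"keyword": "red shoes", "orders": 0, "detail_views": 1, "first_page_hit": False,
     "result_count": 120, "tokens": ["red", "shoes"]},
]
summary = summarize_search_logs(records)
```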
The input of the word segmentation evaluation module 130 is connected to the output of the log analysis module 120. The evaluation module evaluates the relevant valid data according to evaluation rules to obtain the poorly segmented inputs. The evaluation rules are set in advance, and their number is determined by the kinds of relevant valid data. Taking a word segmentation service system that includes a search system as an example, the evaluation rules include: the conversion rate of a search keyword is below a first preset threshold, and/or the number of search results is below a second preset threshold, and/or the usage is below a preset threshold, and/or the segmentation result of a word segmentation input is greater than a third preset threshold. The word segmentation inputs obtained under the rules on conversion rate and number of search results are search keywords. Usage includes the number of views and the number of orders for a product detail page; the word segmentation inputs obtained under the rule that usage is below the preset threshold are popular search records, for example the product names, labels, and detailed descriptions that were searched.
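The evaluation rules above can be sketched as a simple filter. This is a minimal illustration, assuming per-keyword metrics are already available as dictionaries; the function name, the metric keys, and the threshold values are hypothetical, and the "and/or" in the text is read here as "any rule triggers".

```python
def find_poor_inputs(summary, min_conversion, min_results, max_tokens):
    """Return the keywords that trip at least one evaluation rule."""
    poor = []
    for keyword, m in summary.items():
        if (m["conversion_rate"] < min_conversion   # first preset threshold
                or m["recall"] < min_results        # second preset threshold
                or m["token_count"] > max_tokens):  # third preset threshold
            poor.append(keyword)
    return poor

summary = {
    "red shoes": {"conversion_rate": 0.30, "recall": 120, "token_count": 2},
    "smartwordcut": {"conversion_rate": 0.01, "recall": 3, "token_count": 6},
}
poor = find_poor_inputs(summary, min_conversion=0.05, min_results=10, max_tokens=4)
```

A keyword that fragments into many tokens, returns few results, or rarely converts is a plausible sign that the library lacks the word the user meant.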
The input of the word segmentation correction and filtering module 140 is connected to the output of the word segmentation evaluation module 130. The module performs segmentation correction on the poorly segmented inputs obtained by the evaluation module, filters out a new word group, and updates the word segmentation library with the new word group. This completes the update of the word segmentation library and improves it continuously. The word segmentation service system can then periodically load the updated library and continue to provide the word segmentation service, so updates take effect quickly. It should be noted that a word segmentation input in the present invention refers to any data in the word segmentation service system that needs to be segmented: for example, data that must be segmented when a search index is created, such as product titles and descriptions, and user input that must be segmented during a search.
Specifically, in this embodiment the word segmentation correction and filtering module 140 includes a construction submodule 142, a cutting submodule 141, and a filter submodule 143. The input of the cutting submodule serves as the input of the word segmentation correction and filtering module 140, and its output is connected to the first input of the filter submodule 143; the output of the construction submodule 142 is connected to the second input of the filter submodule. The construction submodule 142 scans corpus data and calculates the probability of each word being followed by the next word, so as to construct a reference probability table. It should be noted that the corpus data may come from a specific search environment, such as the titles, detailed descriptions, labels, and supplier names of all products in a product search system, or may be everyday general corpus data, such as news, novels, and biographies.
For example, suppose a corpus contains the entries AA, AB, AC, ABC, and ABCD. The number of entries in which the word following A is A is 1, and the number of entries in the corpus that begin with A is 5, so the probability of A-A is 1/5, that is, 0.2. Likewise, the probability of A-C is 0.2; the probability of A-B, P(B|A), is 0.6; the probability of A-B-C, P(C|AB), is 1; and the probability of A-B-C-D, P(D|ABC), is 1. The probabilities of A-A, A-C, A-B, A-B-C, and A-B-C-D together constitute a reference probability table.
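The worked example above can be reproduced with a short sketch. It assumes that the table is keyed by a prefix and the word that follows it, and that only entries which actually continue past the prefix count toward the denominator; that reading matches all five probabilities in the example. The function name and the key layout are hypothetical.

```python
from collections import Counter, defaultdict

def build_reference_table(entries):
    """Map (prefix, next_word) to the share of continuations of prefix that use next_word."""
    counts = defaultdict(Counter)
    for entry in entries:
        # Every proper prefix of an entry is followed by exactly one next word.
        for k in range(1, len(entry)):
            counts[entry[:k]][entry[k]] += 1
    table = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        for next_word, n in followers.items():
            table[(prefix, next_word)] = n / total
    return table

table = build_reference_table(["AA", "AB", "AC", "ABC", "ABCD"])
# table[("A", "B")] is P(B|A), table[("AB", "C")] is P(C|AB), and so on.
```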
The cutting submodule 141 performs full cutting on a poorly segmented input to obtain a base segmentation word group. For example, if the poorly segmented input is "智能分词" ("intelligent word segmentation"), the base segmentation word group obtained by full cutting consists of the base segments "智", "能", "分", "词", "智能", "能分", "分词", "智能分", "能分词", and "智能分词".
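Full cutting as illustrated above amounts to enumerating every contiguous substring of the input. A minimal sketch, with a hypothetical function name, ordered by length and then by position to match the example:

```python
def full_cut(text):
    """Return every contiguous substring of text, shortest first, left to right."""
    return [
        text[start:start + length]
        for length in range(1, len(text) + 1)
        for start in range(len(text) - length + 1)
    ]

base_group = full_cut("智能分词")
```

An input of n characters yields n(n+1)/2 base segments, so the four-character example produces ten.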
Using the Z segmentation filtering algorithm with the reference probability table, the filter submodule 143 filters the base segmentation word group produced by the full cutting of the cutting submodule 141 to obtain the new word group, and updates the word segmentation library with the new word group. Specifically, the filter submodule 143 includes:
a scanning unit, for scanning the base segmentation word group and obtaining a forward word list of words that are shared by the base segments in the group but are not contained in the group;
a first judging unit, for judging whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0;
a first adding unit, for, when the length of the forward word list is judged to be greater than i, looking up the probability of the i-th forward word of the list in the reference probability table and, when that probability is judged to exist or to be greater than or equal to a preset first threshold a, adding the i-th forward word to the base segmentation word group;
a first incrementing unit, for incrementing i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold a, or after the i-th forward word has been added to the base segmentation word group. The output of the first incrementing unit is connected to the input of the first judging unit. After the increment, the value of i increases by 1 and is output to the first judging unit, which judges again. Through this loop, the forward words of the forward word list whose probability, as looked up in the reference probability table, does not exist or is less than the first threshold a can be added to the base segmentation word group, so that once the judging is complete the group can be scanned to obtain the set of phrases having a forward relation;
a second scanning unit, for, when the length of the forward word list is judged to be less than or equal to i, scanning the base segmentation word group to obtain a set of phrases having a forward relation, where each such phrase is expressed as {A, B}, A being the first term and B the second term;
a second judging unit, for judging whether the size of the set is less than a second variable j, where the initial value of j is 0;
a second adding unit, for, when the size of the set is judged to be less than j, taking the first term A and the second term B of the j-th phrase in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); and, when P(B|A) is judged to be less than a preset second threshold b, judging whether the second term B already exists in the word segmentation library and, if not, adding B to the base segmentation word group;
a second incrementing unit, for incrementing j when P(B|A) is judged to be greater than or equal to the preset second threshold b, or when the second term B is judged to exist already in the word segmentation library, or after the second term B has been added to the base segmentation word group. The output of the second incrementing unit is connected to the input of the second judging unit. After the increment, the value of j increases by 1 and is output to the second judging unit, which judges again. Through this loop, the second terms in the set whose probability, as looked up in the reference probability table, is less than the second threshold b and which do not exist in the word segmentation library can be added to the base segmentation word group, so that once the judging is complete the group can be deduplicated and the resulting new word group added to the word segmentation library. The poorly segmented inputs are thus filtered, the resulting new word group is added to the word segmentation library, and the library is updated;
a third adding unit, for, when the size of the set is judged to be greater than or equal to j, deduplicating the base segmentation word group and adding the resulting new word group to the word segmentation library.
In this embodiment, the first threshold a and the second threshold b are configurable and can be adjusted and optimized according to actual conditions.
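The two filtering passes described above can be sketched roughly as follows. This is an illustration under several assumptions, not a definitive implementation: both loops are taken to run while the index is smaller than the list length; the first-pass rule is read as "the probability is found in the table and is at least a"; P(B|A) is computed as P(AB)/P(A) and treated as 0 when P(A) is missing; and the forward word list and the forward-relation phrases are supplied by the caller, because the text does not spell out how they are derived. All names (`z_filter`, `prob`, `library`) are hypothetical.

```python
def z_filter(base_group, forward_words, forward_pairs, prob, library, a, b):
    """Sketch of the two filtering passes; returns the deduplicated new word group."""
    group = list(base_group)

    # Pass 1: walk the forward word list with index i (first variable).
    i = 0
    while i < len(forward_words):
        word = forward_words[i]
        p = prob.get(word)
        if p is not None and p >= a:   # assumed reading of the first-threshold rule
            group.append(word)
        i += 1

    # Pass 2: walk the forward-relation phrases {A, B} with index j (second variable).
    j = 0
    while j < len(forward_pairs):
        first, second = forward_pairs[j]
        p_a = prob.get(first, 0.0)
        p_ab = prob.get(first + second, 0.0)
        p_b_given_a = p_ab / p_a if p_a > 0 else 0.0   # P(B|A) = P(AB) / P(A)
        if p_b_given_a < b and second not in library:
            group.append(second)
        j += 1

    # Deduplicate while keeping first-seen order.
    return list(dict.fromkeys(group))

library = {"AB"}
new_words = z_filter(
    base_group=["AB", "CD"],
    forward_words=["ABC", "BC"],
    forward_pairs=[("AB", "CD")],
    prob={"ABC": 0.5, "AB": 0.4, "ABCD": 0.02},
    library=library,
    a=0.3,
    b=0.1,
)
library.update(new_words)   # the new word group is added to the library
```

Keeping a and b as parameters reflects the statement that both thresholds are configurable.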
Fig. 2 is a flowchart of an embodiment of the method for updating a word segmentation library according to the present invention. The method starts at step S1.
In step S1, the word segmentation service logs that the word segmentation service system outputs while running are collected. In this step, a word segmentation service system is any system that uses a word segmentation function, including a search system. In that case, the word segmentation service logs output by the search system while running are search service logs, which include the user's search input, the results returned by the search system, and the user's browsing and ordering behavior on the search results.
In step S2, statistical analysis is performed on the collected word segmentation service logs, and relevant valid data are extracted. In this step, taking a word segmentation service system that includes a search system as an example, the valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a word segmentation input, where:
the number of orders or views for search results is, for a given search term, the number of orders placed by users or the number of times users viewed a detail page;
the conversion rate of a search keyword is, for a given search term, the ratio of the number of detail-page views or orders to the number of searches;
the first-page hit rate of search results is, for a given search term, the ratio of the number of times users found the required result on the first page of search results to the number of searches;
the recall rate of a search keyword is the number of results the search system returns for a given search keyword;
the segmentation result of a word segmentation input is the number of words in the final segmentation result for the search keyword entered by the user.
In step S3, the associated valid data are evaluated according to evaluation rules to obtain the poorly segmented inputs, that is, the segmentation inputs whose word segmentation effect is bad. The evaluation rules are preset, and the number of evaluation rules is determined by the types of associated valid data. Taking a word segmentation business system that includes a search system as an example, the evaluation rules include: the conversion rate of the search keyword is less than a first preset threshold, and/or the number of search results is less than a second preset threshold, and/or the usage amount is less than a preset threshold, and/or the word segmentation result of the segmentation input is greater than a third preset threshold. The valid data obtained by evaluating against the conversion-rate rule and/or the search-result-count rule are search keywords. The usage amount includes the page views and order volume of the product detail page, and the valid data obtained by evaluating against the usage-amount rule are popular search records, for example the product name, label, and detailed description that were searched.
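The keyword-level evaluation rules of step S3 can be sketched as simple threshold checks. The function name, the dictionary keys, and the threshold values are hypothetical. Because the text joins the rules with "and/or", this sketch assumes that tripping any single rule is enough to flag an input; the usage-amount rule is left out because it applies to search records rather than keywords.

```python
def find_poorly_segmented(stats, min_conversion, min_results, max_segments):
    """Return (keyword, reasons) pairs for keywords that trip at least one rule.

    stats maps keyword -> dict with 'conversion', 'result_count' and
    'segment_count' entries (hypothetical layout).
    """
    flagged = []
    for keyword, s in stats.items():
        reasons = []
        if s["conversion"] < min_conversion:      # first preset threshold
            reasons.append("low_conversion")
        if s["result_count"] < min_results:       # second preset threshold
            reasons.append("few_results")
        if s["segment_count"] > max_segments:     # third preset threshold
            reasons.append("too_many_segments")
        if reasons:
            flagged.append((keyword, reasons))
    return flagged
```

Returning the reasons alongside each keyword makes it easier to tune the thresholds later.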
In step S4, word segmentation correction and filtering are applied to the poorly segmented inputs obtained in step S3 to output a new-word phrase group, and the new-word phrase group is updated into the word segmentation library. This completes the update of the word segmentation library and continuously improves it. The word segmentation business system can then periodically load the updated word segmentation library and continue to provide the word segmentation service, so updates take effect quickly. It should be noted that the segmentation input in the present invention refers to all data in the word segmentation business system that needs to be segmented, for example the data that must be segmented when the search index is created, such as product titles and descriptions, and the user input that must be segmented during a search.
Specifically, with reference to Fig. 3, in the present embodiment, step S4 includes:
In step S41, corpus data are scanned and the probability of each word being followed by the next word is calculated to construct a reference probability table. It should be noted that the corpus data may come from a specific search environment, such as the titles, detailed descriptions, labels, and supplier names of all products in a product search system, or may be everyday general-purpose corpus data, such as news, novels, and biographies. For example, suppose a corpus contains the entries AA, AB, AC, ABC, and ABCD. The number of entries in which the word following A is A is 1, and the number of entries in the corpus that begin with A is 5, so the probability of A-A is 1/5, i.e. 0.2. Correspondingly, the probability of A-C is 0.2; the probability of A-B, P(B|A), is 0.6; the probability of A-B-C, P(C|AB), is 1; and the probability of A-B-C-D, P(D|ABC), is 1. The probabilities of A-A, A-C, A-B, A-B-C, and A-B-C-D together constitute a reference probability table.
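The table construction in step S41 can be sketched as prefix counting. Two assumptions are made here that the source does not state: each character is treated as one word, and only entries that continue past a prefix count toward that prefix's denominator. Under those assumptions the sketch reproduces the five probabilities given for the corpus AA, AB, AC, ABC, ABCD. The function name and the `(prefix, next_word)` key layout are hypothetical.

```python
from collections import Counter, defaultdict


def build_reference_table(corpus):
    """Map (prefix, next_word) -> P(next_word | prefix).

    Each character is treated as one word, and only entries that continue
    past a prefix count toward that prefix's denominator (both assumptions).
    """
    counts = defaultdict(Counter)
    for entry in corpus:
        for k in range(1, len(entry)):
            # entry[:k] is the prefix, entry[k] is the word that follows it.
            counts[entry[:k]][entry[k]] += 1

    table = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        for word, n in followers.items():
            table[(prefix, word)] = n / total
    return table


table = build_reference_table(["AA", "AB", "AC", "ABC", "ABCD"])
```

With this corpus the table has five entries, one for each of A-A, A-B, A-C, A-B-C, and A-B-C-D.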
In step S42, full segmentation is performed on the poorly segmented input to obtain a basic segmentation phrase group. If the poorly segmented input is "智能分词" ("intelligent word segmentation"), the basic segmentation phrase group obtained after full segmentation consists of the basic segments "智", "能", "分", "词", "智能", "能分", "分词", "智能分", "能分词", and "智能分词".
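Full segmentation as described in step S42 amounts to listing every contiguous substring of the input. A minimal sketch follows, using the four-character string 智能分词 as an assumed example input; the function name is hypothetical.

```python
def full_segmentation(text):
    """Return every contiguous substring of text, shortest first."""
    return [
        text[start:start + length]
        for length in range(1, len(text) + 1)
        for start in range(len(text) - length + 1)
    ]


segments = full_segmentation("智能分词")
```

A string of n characters yields n(n+1)/2 segments, so the four-character input gives ten.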
In step S43, the basic segmentation phrase group obtained after full segmentation is filtered with the Z word segmentation filtering algorithm, which uses the reference probability table, to obtain the new-word phrase group, and the new-word phrase group is updated into the word segmentation library.
Specifically, with reference to Fig. 3, in the present embodiment, step S43 includes:
In step S431, the basic segmentation phrase group is scanned to obtain the forward word list, that is, the list of forward words that are shared by the basic segments in the basic segmentation phrase group but are not contained in the basic segmentation phrase group. In step S432, it is judged whether the length of the forward word list is greater than a first variable i, where the initial value of the first variable i is 0; if so, step S433 is executed, and if not, step S435 is executed. In step S433, the probability of the i-th forward word in the forward word list is queried from the reference probability table, and when that probability is judged to exist or to be greater than or equal to a preset first threshold a, the i-th forward word is added to the basic segmentation phrase group. In step S434, when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold a, or after the i-th forward word has been added to the basic segmentation phrase group, the first variable i is incremented (after the first increment its value becomes 1), and steps S432 to S434 are repeated. By repeating this loop, every forward word in the forward word list whose probability, looked up in the reference probability table, satisfies the condition of step S433 is added to the basic segmentation phrase group. Once all forward words have been judged, the basic segmentation phrase group is scanned to obtain the set of phrases having a forward relationship.
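The loop of steps S432 to S434 can be sketched as follows. The `lookup` callable, which returns a probability or `None` when the reference probability table has no entry, is a hypothetical interface. The sketch also assumes that a forward word is added only when its probability both exists and reaches the first threshold a, which is the complement of the skip condition stated for step S434.

```python
def add_forward_words(basic_group, forward_words, lookup, threshold_a):
    """Steps S432-S434: append qualifying forward words to a copy of basic_group.

    lookup(word) returns a probability from the reference table, or None
    when the table has no entry for the word (hypothetical interface).
    """
    result = list(basic_group)
    i = 0                                   # first variable i starts at 0
    while len(forward_words) > i:           # S432: list length greater than i
        word = forward_words[i]
        p = lookup(word)                    # S433: query the reference table
        if p is not None and p >= threshold_a:
            result.append(word)
        i += 1                              # S434: increment i and loop again
    return result
```

For example, with `lookup = {"x": 0.5, "y": 0.1}.get` and a threshold of 0.3, only "x" from the forward word list `["x", "y", "z"]` is appended.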
In step S435, the basic segmentation phrase group is scanned to obtain the set of phrases having a forward relationship, where a phrase having a forward relationship is expressed as {A, B}, A being the first lemma and B being the second lemma.
In step S436, it is judged whether the size of the set is less than a second variable j, where the initial value of the second variable j is 0; if so, step S437 is executed, and if not, step S439 is executed. In step S437, the first lemma A and the second lemma B of the j-th phrase in the set are taken out, P(A) and P(AB) are queried from the reference probability table, and P(B|A) is calculated. When P(B|A) is judged to be less than a preset second threshold b, it is judged whether the second lemma B already exists in the word segmentation library, and if not, the second lemma B is added to the basic segmentation phrase group. In step S438, when P(B|A) is judged to be greater than or equal to the preset second threshold b, or when the second lemma B is judged to already exist in the word segmentation library, or after the second lemma B has been added to the basic segmentation phrase group, the second variable j is incremented, and steps S436 to S438 are repeated. By repeating this loop, every second lemma in the set whose probability, looked up in the reference probability table, is less than the second threshold b and which does not yet exist in the word segmentation library is added to the basic segmentation phrase group. Once all phrases have been judged, the new-word phrase group obtained by deduplicating the basic segmentation phrase group is added to the word segmentation library. This filters the poorly segmented input and updates the word segmentation library with the resulting new-word phrase group. In step S439, the new-word phrase group obtained by deduplicating the basic segmentation phrase group is added to the word segmentation library.
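Steps S436 to S439 can be sketched as a second loop followed by deduplication. The dictionaries `p_single` and `p_joint` stand in for lookups of P(A) and P(AB) in the reference probability table and are a hypothetical layout, as is the function name. The text words the test in step S436 as "the size of the set is less than j"; since j starts at 0, the sketch assumes the intended loop runs while j is less than the size of the set.

```python
def filter_new_words(basic_group, pairs, p_single, p_joint, library, threshold_b):
    """Steps S436-S439 over a list of (A, B) forward-relation phrases.

    p_single maps A -> P(A) and p_joint maps (A, B) -> P(AB); both stand in
    for lookups in the reference probability table (hypothetical layout).
    """
    result = list(basic_group)
    j = 0                                        # second variable j starts at 0
    while j < len(pairs):                        # S436, under the stated assumption
        a, b = pairs[j]
        p_b_given_a = p_joint[(a, b)] / p_single[a]   # S437: P(B|A) = P(AB) / P(A)
        if p_b_given_a < threshold_b and b not in library:
            result.append(b)                     # B is not yet in the library
        j += 1                                   # S438: increment j and loop again
    # S439: deduplicate while keeping first-seen order.
    return list(dict.fromkeys(result))
```

The returned list corresponds to the deduplicated phrase group that step S439 adds to the word segmentation library.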
The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be defined by the scope of the claims. Finally, some symbols used in the present invention are explained: P(A) denotes the probability that A occurs; P(A|B) denotes the probability that A occurs given that B has occurred; and P(AB) denotes the probability that A and B occur together.
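For reference, the three symbols are linked by the standard definition of conditional probability, which is the relation step S437 relies on when it derives P(B|A) from the queried values P(A) and P(AB):

```latex
P(B \mid A) = \frac{P(AB)}{P(A)}, \qquad P(A) > 0
```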
Claims (10)
1. A word segmentation library updating system, characterized by comprising:
a log collection module, configured to collect the word segmentation business log output by a word segmentation business system during operation;
a log analysis module, configured to statistically analyze the word segmentation business log collected by the log collection module and to extract associated valid data;
a word segmentation evaluation module, configured to evaluate the associated valid data according to evaluation rules to obtain poorly segmented inputs; and
a word segmentation correction and filtering module, configured to perform word segmentation correction and filtering on the poorly segmented inputs obtained by the word segmentation evaluation module to output a new-word phrase group, and to update the new-word phrase group into the word segmentation library.
2. The word segmentation library updating system according to claim 1, characterized in that the word segmentation business system includes a search system, and the associated valid data include the order count or browse count of the search results, and/or the conversion rate of the search keyword, and/or the first-page hit rate of the search results, and/or the recall rate of the search keyword, and/or the word segmentation result of the segmentation input; and the evaluation rules include: the conversion rate of the search keyword is less than a first preset threshold, and/or the number of search results is less than a second preset threshold, and/or the usage amount is less than a preset threshold, and/or the word segmentation result of the segmentation input is greater than a third preset threshold.
3. The word segmentation library updating system according to claim 1, characterized in that the word segmentation correction and filtering module includes a construction submodule and a cutting submodule, wherein:
the construction submodule is configured to scan corpus data and calculate the probability of each word being followed by the next word to construct a reference probability table; and
the cutting submodule is configured to perform full segmentation on the poorly segmented input to obtain a basic segmentation phrase group.
4. The word segmentation library updating system according to claim 3, characterized in that the word segmentation correction and filtering module further includes a filter submodule, the filter submodule being configured to filter, according to the Z word segmentation filtering algorithm using the reference probability table, the basic segmentation phrase group obtained after full segmentation by the cutting submodule to obtain the new-word phrase group, and to update the new-word phrase group into the word segmentation library.
5. The word segmentation library updating system according to claim 4, characterized in that the filter submodule includes:
a scanning unit, configured to scan the basic segmentation phrase group and obtain the forward word list of forward words that are shared by the basic segments in the basic segmentation phrase group but are not contained in the basic segmentation phrase group;
a first judging unit, configured to judge whether the length of the forward word list is greater than a first variable i, wherein the initial value of the first variable i is 0;
a first adding unit, configured to, when the length of the forward word list is judged to be greater than the first variable i, query the probability of the i-th forward word in the forward word list from the reference probability table, and, when the probability of the i-th forward word is judged to exist or to be greater than or equal to a preset first threshold, add the i-th forward word to the basic segmentation phrase group;
a first self-increment unit, configured to increment the first variable i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the basic segmentation phrase group;
a second scanning unit, configured to, when the length of the forward word list is judged to be less than or equal to the first variable i, scan the basic segmentation phrase group to obtain the set of phrases having a forward relationship, wherein a phrase having a forward relationship is expressed as {A, B}, A being the first lemma and B being the second lemma;
a second judging unit, configured to judge whether the size of the set is less than a second variable j, wherein the initial value of the second variable j is 0;
a second adding unit, configured to, when the size of the set is judged to be less than the second variable j, take out the first lemma A and the second lemma B of the j-th phrase in the set, query P(A) and P(AB) from the reference probability table, and calculate P(B|A); and, when P(B|A) is judged to be less than a preset second threshold, judge whether the second lemma B already exists in the word segmentation library, and if not, add the second lemma B to the basic segmentation phrase group;
a second self-increment unit, configured to increment the second variable j when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second lemma B is judged to already exist in the word segmentation library, or after the second lemma B has been added to the basic segmentation phrase group; and
a third adding unit, configured to, when the size of the set is judged to be greater than or equal to the second variable j, add the new-word phrase group obtained by deduplicating the basic segmentation phrase group to the word segmentation library.
6. A word segmentation library updating method, characterized in that the method comprises the following steps:
S1, collecting the word segmentation business log output by a word segmentation business system during operation;
S2, statistically analyzing the collected word segmentation business log and extracting associated valid data;
S3, evaluating the associated valid data according to evaluation rules to obtain poorly segmented inputs;
S4, performing word segmentation correction and filtering on the obtained poorly segmented inputs to output a new-word phrase group, and updating the new-word phrase group into the word segmentation library.
7. The word segmentation library updating method according to claim 6, characterized in that the word segmentation business system includes a search system, and the associated valid data include the order count or browse count of the search results, and/or the conversion rate of the search keyword, and/or the first-page hit rate of the search results, and/or the recall rate of the search keyword, and/or the word segmentation result of the segmentation input; and the evaluation rules include: the conversion rate of the search keyword is less than a first preset threshold, and/or the number of search results is less than a second preset threshold, and/or the usage amount is less than a preset threshold, and/or the word segmentation result of the segmentation input is greater than a third preset threshold.
8. The word segmentation library updating method according to claim 6, characterized in that step S4 includes:
S41, scanning corpus data and calculating the probability of each word being followed by the next word to construct a reference probability table;
S42, performing full segmentation on the poorly segmented input to obtain a basic segmentation phrase group.
9. The word segmentation library updating method according to claim 8, characterized in that step S4 further includes:
S43, filtering, according to the Z word segmentation filtering algorithm using the reference probability table, the basic segmentation phrase group obtained after full segmentation by the cutting submodule to obtain the new-word phrase group, and updating the new-word phrase group into the word segmentation library.
10. The word segmentation library updating method according to claim 8, characterized in that step S43 includes:
S431, scanning the basic segmentation phrase group and obtaining the forward word list of forward words that are shared by the basic segments in the basic segmentation phrase group but are not contained in the basic segmentation phrase group;
S432, judging whether the length of the forward word list is greater than a first variable i, wherein the initial value of the first variable i is 0; if so, executing step S433, and if not, executing step S435;
S433, when the length of the forward word list is judged to be greater than the first variable i, querying the probability of the i-th forward word in the forward word list from the reference probability table, and, when the probability of the i-th forward word is judged to exist or to be greater than or equal to a preset first threshold, adding the i-th forward word to the basic segmentation phrase group;
S434, when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the basic segmentation phrase group, incrementing the first variable i, and, after the first variable i has been incremented, repeating steps S432 to S434;
S435, when the length of the forward word list is judged to be less than or equal to the first variable i, scanning the basic segmentation phrase group to obtain the set of phrases having a forward relationship, wherein a phrase having a forward relationship is expressed as {A, B}, A being the first lemma and B being the second lemma;
S436, judging whether the size of the set is less than a second variable j, wherein the initial value of the second variable j is 0; if so, executing step S437, and if not, executing step S439;
S437, when the size of the set is judged to be less than the second variable j, taking out the first lemma A and the second lemma B of the j-th phrase in the set, querying P(A) and P(AB) from the reference probability table, and calculating P(B|A); when P(B|A) is judged to be less than a preset second threshold, judging whether the second lemma B already exists in the word segmentation library, and if not, adding the second lemma B to the basic segmentation phrase group;
S438, when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second lemma B is judged to already exist in the word segmentation library, or after the second lemma B has been added to the basic segmentation phrase group, incrementing the second variable j, and, after the second variable j has been incremented, repeating steps S436 to S438;
S439, when the size of the set is judged to be greater than or equal to the second variable j, adding the new-word phrase group obtained by deduplicating the basic segmentation phrase group to the word segmentation library.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510715638.8A CN106649308B (en) | 2015-10-28 | 2015-10-28 | Word segmentation and word library updating method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649308A (en) | 2017-05-10 |
CN106649308B (en) | 2020-05-01 |
Family
ID=58831014
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510715638.8A Active CN106649308B (en) | 2015-10-28 | 2015-10-28 | Word segmentation and word library updating method and system |
Also Published As
Publication number | Publication date |
---|---|
CN106649308B (en) | 2020-05-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |