CN106649308A - Updating method and system of word segmentation library - Google Patents

Info

Publication number
CN106649308A
Authority
CN
China
Prior art keywords
participle
phrase
word
variable
basic
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510715638.8A
Other languages
Chinese (zh)
Other versions
CN106649308B (en)
Inventor
杨睛龙
胡正才
周美芳
刘平华
李海平
曲晓园
高宝兵
陈国锐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aspire Digital Technologies Shenzhen Co Ltd
Original Assignee
Aspire Digital Technologies Shenzhen Co Ltd
Application filed by Aspire Digital Technologies Shenzhen Co Ltd
Priority to CN201510715638.8A
Publication of CN106649308A
Application granted
Publication of CN106649308B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a system for updating a word segmentation library. The system comprises a log collection module, a log analysis module, a word segmentation evaluation module, and a word segmentation correction and filtering module; the last of these comprises a construction submodule, a segmentation submodule, and a filtering submodule. The invention also provides a corresponding method. Both are based on analysis of the logs of a word segmentation service: the segmentation performance of the service system is evaluated, the inputs that were segmented poorly are extracted, and those inputs are corrected and filtered with the Z word segmentation filtering algorithm, which uses a reference probability table, to output a new word group. The new word group is then added to the word segmentation library. The library is thereby continuously improved, which solves the problem that a word segmentation library cannot adapt in time to the actual application environment and effectively improves segmentation performance.

Description

A word segmentation library updating method and system
Technical field
The present invention relates to the technical field of data processing, and more particularly to a word segmentation library updating method and system.
Background
In a search system, the quality of word segmentation is a key factor affecting search quality, and the library on which segmentation relies is an important component of word segmentation technology.
The common way of generating a library at present is statistical: the frequency of each combination of adjacent, co-occurring characters (that is, each character group) in the input corpus is counted and its mutual information is calculated. Mutual information reflects how tightly the characters are bound together; when it exceeds a certain threshold, the character group is considered likely to form a word. A library is generated in this way and then applied to the online word segmentation service.
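The statistical approach described above can be sketched as follows. The source does not give a formula, so this sketch assumes the usual pointwise mutual information, log(p(xy) / (p(x) p(y))), over adjacent character pairs; the function names and the threshold value are hypothetical.

```python
import math
from collections import Counter

def adjacent_pair_pmi(corpus):
    """Pointwise mutual information of every adjacent character pair (assumed formula)."""
    char_counts = Counter()
    pair_counts = Counter()
    for line in corpus:
        char_counts.update(line)
        pair_counts.update(line[i:i + 2] for i in range(len(line) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for pair, n in pair_counts.items():
        p_xy = n / total_pairs
        p_x = char_counts[pair[0]] / total_chars
        p_y = char_counts[pair[1]] / total_chars
        pmi[pair] = math.log(p_xy / (p_x * p_y))
    return pmi

def candidate_words(corpus, threshold):
    """Pairs whose mutual information exceeds the threshold are treated as likely words."""
    scores = adjacent_pair_pmi(corpus)
    return sorted(pair for pair, score in scores.items() if score > threshold)
```

A pair that scores above the threshold is only a candidate; as the next paragraph of the background notes, frequent co-occurrence alone does not guarantee that the pair is a real word.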
However, a Chinese library generated by the word-frequency statistics described above has the following main technical problems: it often segments out character groups that co-occur frequently but are not words in common use; most libraries are general-purpose and are not suited to vertical search scenarios such as product name search, place name search, or personal name search; libraries are often static, generated offline and then used online, so they cannot be quickly updated and improved according to actual use; and libraries recognize new words poorly.
Summary of the invention
The technical problem to be solved by the present invention is to provide a word segmentation library updating method and system that address the above drawbacks of generating a Chinese library by existing word-frequency statistics.
To solve the above problem, the technical solution of the present invention provides a word segmentation library updating system, comprising:
A log collection module, for collecting the word segmentation service logs output by a word segmentation service system during operation;
A log analysis module, for performing statistical analysis on the word segmentation service logs collected by the log collection module and extracting relevant valid data;
A word segmentation evaluation module, for evaluating the relevant valid data according to evaluation rules to obtain the poorly segmented inputs; and
A word segmentation correction and filtering module, for performing segmentation correction and filtering on the poorly segmented inputs obtained by the word segmentation evaluation module to output a new word group, and updating the new word group into the word segmentation library.
In the above system, the word segmentation service system includes a search system; the relevant valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a segmentation input; the evaluation rules include the conversion rate of a search keyword being below a first preset threshold, and/or the number of search results being below a second preset threshold, and/or the usage being below a preset threshold, and/or the segmentation result of a segmentation input being greater than a third preset threshold.
In the above system, the word segmentation correction and filtering module includes a construction submodule and a segmentation submodule, wherein:
The construction submodule scans corpus data and calculates the probability of each character being followed by the next character, so as to construct a reference probability table;
The segmentation submodule performs full segmentation on the poorly segmented inputs to obtain a base word group.
In the above system, the word segmentation correction and filtering module further includes a filtering submodule, which filters the base word group produced by the segmentation submodule's full segmentation, according to the Z word segmentation filtering algorithm that uses the reference probability table, to obtain the new word group, and updates the new word group into the word segmentation library.
In the above system, the filtering submodule includes:
A scanning unit, for scanning the base word group and obtaining a forward word list, that is, the forward words that are shared by the base words in the base word group but are not contained in the base word group;
A first judging unit, for judging whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0;
A first adding unit, for, when the length of the forward word list is judged to be greater than i, looking up the probability of the i-th forward word of the list in the reference probability table and, when that probability is judged to exist or to be greater than or equal to a preset first threshold, adding the i-th forward word to the base word group;
A first increment unit, for incrementing i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the base word group;
A second scanning unit, for, when the length of the forward word list is judged to be less than or equal to i, scanning the base word group to obtain a set of word pairs having a forward relation, where each such pair is written {A, B}, A being the first term and B the second term;
A second judging unit, for judging whether the size of the set is less than a second variable j, where the initial value of j is 0;
A second adding unit, for, when the size of the set is judged to be less than j, taking the first term A and the second term B of the j-th pair in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); and, when P(B|A) is judged to be less than a preset second threshold, judging whether the second term B already exists in the word segmentation library and, if not, adding B to the base word group;
A second increment unit, for incrementing j when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second term B is judged to already exist in the word segmentation library, or after B has been added to the base word group;
A third adding unit, for, when the size of the set is judged to be greater than or equal to j, deduplicating the base word group and adding the resulting new word group to the word segmentation library.
The present invention also provides a word segmentation library updating method, comprising the following steps:
S1, collecting the word segmentation service logs output by a word segmentation service system during operation;
S2, performing statistical analysis on the collected word segmentation service logs and extracting relevant valid data;
S3, evaluating the relevant valid data according to evaluation rules to obtain the poorly segmented inputs;
S4, performing segmentation correction and filtering on the poorly segmented inputs to output a new word group, and updating the new word group into the word segmentation library.
In the above method, the word segmentation service system includes a search system; the relevant valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a segmentation input; the evaluation rules include the conversion rate of a search keyword being below a first preset threshold, and/or the number of search results being below a second preset threshold, and/or the usage being below a preset threshold, and/or the segmentation result of a segmentation input being greater than a third preset threshold.
In the above method, step S4 includes:
S41, scanning corpus data and calculating the probability of each character being followed by the next character, so as to construct a reference probability table;
S42, performing full segmentation on the poorly segmented inputs to obtain a base word group.
In the above method, step S4 further includes:
S43, filtering the base word group obtained by the full segmentation, according to the Z word segmentation filtering algorithm that uses the reference probability table, to obtain the new word group, and updating the new word group into the word segmentation library.
In the above method, step S43 includes:
S431, scanning the base word group and obtaining a forward word list, that is, the forward words that are shared by the base words in the base word group but are not contained in the base word group;
S432, judging whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0; if so, executing step S433; if not, executing step S435;
S433, when the length of the forward word list is judged to be greater than i, looking up the probability of the i-th forward word of the list in the reference probability table and, when that probability is judged to exist or to be greater than or equal to a preset first threshold, adding the i-th forward word to the base word group;
S434, incrementing i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the base word group, and, after i has been incremented, repeating steps S432 to S434;
S435, when the length of the forward word list is judged to be less than or equal to i, scanning the base word group to obtain a set of word pairs having a forward relation, where each such pair is written {A, B}, A being the first term and B the second term;
S436, judging whether the size of the set is less than a second variable j, where the initial value of j is 0; if so, executing step S437; if not, executing step S439;
S437, when the size of the set is judged to be less than j, taking the first term A and the second term B of the j-th pair in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); when P(B|A) is judged to be less than a preset second threshold, judging whether the second term B already exists in the word segmentation library and, if not, adding B to the base word group;
S438, incrementing j when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second term B is judged to already exist in the word segmentation library, or after B has been added to the base word group, and, after j has been incremented, repeating steps S436 to S438;
S439, when the size of the set is judged to be greater than or equal to j, deduplicating the base word group and adding the resulting new word group to the word segmentation library.
The word segmentation library updating method and system of the present invention have the following beneficial effects. Based on analysis of the word segmentation service logs, the segmentation performance of the word segmentation service system is evaluated and the poorly segmented inputs are extracted; those inputs are corrected and filtered with the Z word segmentation filtering algorithm, which uses the reference probability table, to output a new word group; and the new word group is updated into the word segmentation library. The library is thereby continuously improved, which solves the problem that a word segmentation library cannot adapt in time to the actual application environment and effectively improves segmentation performance. In addition, the word segmentation service system can periodically load the updated library and then continue to provide the segmentation service, so updates take effect quickly.
Brief description of the drawings
Fig. 1 is a structural diagram of an embodiment of the word segmentation library updating system of the present invention.
Fig. 2 is a flow chart of an embodiment of the word segmentation library updating method of the present invention.
Fig. 3 is a detailed flow chart of an embodiment of the word segmentation library updating method of the present invention.
Detailed description
To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and are not intended to limit it.
The word segmentation library updating system and method of the present invention are based on analysis of the word segmentation service logs. The segmentation performance of the word segmentation service system is evaluated and the poorly segmented inputs are extracted; those inputs are corrected and filtered with the Z word segmentation filtering algorithm, which uses the reference probability table, to output a new word group; and the new word group is updated into the word segmentation library. The library is thereby continuously improved, which solves the problem that a word segmentation library cannot adapt in time to the actual application environment and effectively improves segmentation performance.
Fig. 1 is a structural diagram of an embodiment of the word segmentation library updating system of the present invention. The system 100 includes a log collection module 110, a log analysis module 120, a word segmentation evaluation module 130, and a word segmentation correction and filtering module 140, wherein:
The input of the log collection module 110 is connected to the word segmentation service system, and the module collects the word segmentation service logs that the service system outputs during operation. A word segmentation service system is any system that uses word segmentation, including a search system. In that case the logs output during operation are search service logs, which include the user's search input, the results returned by the search system, and the user's viewing and ordering behavior on the search results.
The input of the log analysis module 120 is connected to the output of the log collection module 110. The module performs statistical analysis on the word segmentation service logs collected by the log collection module and extracts relevant valid data. Taking a word segmentation service system that includes a search system as the example, the valid data include the number of orders or views for search results, and/or the conversion rate of a search keyword, and/or the first-page hit rate of search results, and/or the recall rate of a search keyword, and/or the segmentation result of a segmentation input. Here, the number of orders or views for search results is, for a given search term, the number of orders placed by users or the number of times they viewed a detail page; the conversion rate of a search keyword is, for a given search term, the ratio of the number of detail-page views or orders to the number of searches; the first-page hit rate of search results is, for a given search term, the ratio of the number of times users obtained the result they needed on the first page of results to the number of searches; the recall rate of a search keyword is the number of results the search system returns for that keyword; and the segmentation result of a segmentation input is the number of words in the final segmentation of the search keyword the user entered.
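As a rough illustration of the statistics described above, the sketch below aggregates per-keyword metrics from log records. The `SearchEvent` record layout and all field names are hypothetical, since the source does not specify a log format; the conversion rate here uses detail-page views as the numerator, which is one of the two options the text allows.

```python
from dataclasses import dataclass

@dataclass
class SearchEvent:
    """One search log record (hypothetical layout, not taken from the source)."""
    keyword: str
    result_count: int     # number of results returned ("recall rate" in the text)
    token_count: int      # number of words in the final segmentation
    first_page_hit: bool  # user found the needed result on the first page
    detail_views: int     # detail pages viewed after this search
    orders: int           # orders placed after this search

def keyword_metrics(events):
    """Aggregate the per-keyword valid data from a list of search events."""
    totals = {}
    for e in events:
        t = totals.setdefault(e.keyword, {
            "searches": 0, "detail_views": 0, "orders": 0,
            "first_page_hits": 0, "result_count": 0, "token_count": 0,
        })
        t["searches"] += 1
        t["detail_views"] += e.detail_views
        t["orders"] += e.orders
        t["first_page_hits"] += int(e.first_page_hit)
        t["result_count"] = e.result_count  # keep the most recent value
        t["token_count"] = e.token_count
    metrics = {}
    for keyword, t in totals.items():
        metrics[keyword] = {
            **t,
            "conversion_rate": t["detail_views"] / t["searches"],
            "first_page_hit_rate": t["first_page_hits"] / t["searches"],
        }
    return metrics
```

The resulting dictionary is the kind of "relevant valid data" that the evaluation module would consume.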
The input of the word segmentation evaluation module 130 is connected to the output of the log analysis module 120. The module evaluates the relevant valid data according to evaluation rules to obtain the poorly segmented inputs. The evaluation rules are set in advance, and their number depends on the kinds of relevant valid data. Taking a word segmentation service system that includes a search system as the example, the evaluation rules include the conversion rate of a search keyword being below a first preset threshold, and/or the number of search results being below a second preset threshold, and/or the usage being below a preset threshold, and/or the segmentation result of a segmentation input being greater than a third preset threshold. The segmentation inputs obtained by evaluating against the conversion-rate rule and/or the result-count rule are search keywords. Usage includes the number of views of a product detail page and the number of orders; the segmentation inputs obtained by evaluating against the rule that usage is below the preset threshold are popular search records, for example the name, labels, and detailed description of a searched product.
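A minimal sketch of such an evaluation rule set follows. The metric names and the default threshold values are hypothetical; the source only says that the thresholds are preset. The text combines the rules with "and/or", and this sketch flags an input as soon as any one rule fires.

```python
def is_poorly_segmented(metrics, min_conversion=0.05, min_results=1,
                        min_usage=1, max_tokens=6):
    """Return True if any evaluation rule marks the input as poorly segmented.

    All thresholds are illustrative assumptions, not values from the source.
    """
    return (
        metrics["conversion_rate"] < min_conversion   # first preset threshold
        or metrics["result_count"] < min_results      # second preset threshold
        or metrics["usage"] < min_usage               # usage threshold
        or metrics["token_count"] > max_tokens        # third preset threshold
    )

def poorly_segmented_inputs(metrics_by_input, **thresholds):
    """Select the segmentation inputs that fail at least one rule."""
    return [text for text, m in metrics_by_input.items()
            if is_poorly_segmented(m, **thresholds)]
```

In practice each threshold would be tuned per deployment, and a stricter variant could require several rules to fire together.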
The input of the word segmentation correction and filtering module 140 is connected to the output of the word segmentation evaluation module 130. The module performs segmentation correction and filtering on the poorly segmented inputs obtained by the evaluation module to output a new word group, and updates the new word group into the word segmentation library. This completes the update of the library and continuously improves it; the word segmentation service system can periodically load the updated library and then continue to provide the segmentation service, so updates take effect quickly. Note that a segmentation input in the present invention means any data in the word segmentation service system that needs to be segmented, for example data that must be segmented when the search index is created, such as product titles and descriptions, and user input that must be segmented during a search.
Specifically, in the present embodiment, the word segmentation correction and filtering module 140 includes a construction submodule 142, a segmentation submodule 141, and a filtering submodule 143. The input of the segmentation submodule serves as the input of the word segmentation correction and filtering module 140, and its output is connected to the first input of the filtering submodule 143; the output of the construction submodule 142 is connected to the second input of the filtering submodule. The construction submodule 142 scans corpus data and calculates the probability of each character being followed by the next character, so as to construct a reference probability table. The corpus data may come from a specific search environment, such as the titles, detailed descriptions, labels, and supplier names of all products in a product search system, or may be everyday corpus data such as news, novels, and biographies. As an example, suppose a corpus contains the entries AA, AB, AC, ABC, and ABCD. Given A, the number of entries in which the next character is A is 1, and the number of entries beginning with A is 5, so the probability of A-A is 1/5, that is, 0.2. Likewise, the probability of A-C is 0.2, and the probability of A-B, P(B|A), is 0.6, since AB, ABC, and ABCD all continue A with B. The probability of A-B-C, P(C|AB), is 1, and the probability of A-B-C-D, P(D|ABC), is 1. The probabilities of A-A, A-C, A-B, A-B-C, and A-B-C-D together constitute a reference probability table.
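The worked example above can be reproduced with a short sketch. It assumes that the probability of a next character given a prefix is counted over the corpus entries that start with that prefix and continue past it, which is the reading that yields the figures in the example (0.2, 0.2, 0.6, 1, 1). The function name and the table layout are hypothetical.

```python
from collections import Counter, defaultdict

def build_reference_table(corpus):
    """Map (prefix, next_char) to P(next_char | prefix), counted over corpus entries."""
    counts = defaultdict(Counter)
    for entry in corpus:
        # Every proper prefix of the entry votes for the character that follows it.
        for k in range(1, len(entry)):
            counts[entry[:k]][entry[k]] += 1
    table = {}
    for prefix, next_chars in counts.items():
        total = sum(next_chars.values())
        for ch, n in next_chars.items():
            table[(prefix, ch)] = n / total
    return table

table = build_reference_table(["AA", "AB", "AC", "ABC", "ABCD"])
for (prefix, ch), p in sorted(table.items()):
    print(f"P({ch}|{prefix}) = {p:.1f}")
```

A real corpus would be far larger, and the table would likely be stored in a trie or key-value store rather than an in-memory dictionary; the source does not say which.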
The segmentation submodule 141 performs full segmentation on the poorly segmented inputs to obtain a base word group. If the poorly segmented input is "智能分词" ("intelligent word segmentation"), the base word group obtained after full segmentation consists of the base words "智", "能", "分", "词", "智能", "能分", "分词", "智能分", "能分词", and "智能分词".
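Full segmentation in the example above amounts to enumerating every contiguous substring of the input: a four-character input yields 4 + 3 + 2 + 1 = 10 base words. A minimal sketch follows, with a hypothetical function name and with the example input assumed to be the four characters 智能分词.

```python
def full_segmentation(text):
    """Return every contiguous substring of text, shortest first."""
    pieces = []
    for length in range(1, len(text) + 1):
        for start in range(len(text) - length + 1):
            pieces.append(text[start:start + length])
    return pieces

base_word_group = full_segmentation("智能分词")
print(len(base_word_group), base_word_group)
```

The number of substrings grows quadratically with input length, which is acceptable for short search keywords but may need a length cap for long product descriptions.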
The filtering submodule 143 filters the base word group obtained from the full segmentation performed by the segmentation submodule 141, according to the Z word segmentation filtering algorithm that uses the reference probability table, to obtain the new word group, and updates the new word group into the word segmentation library. Specifically, the filtering submodule 143 includes:
A scanning unit, for scanning the base word group and obtaining a forward word list, that is, the forward words that are shared by the base words in the base word group but are not contained in the base word group;
A first judging unit, for judging whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0.
A first adding unit, for, when the length of the forward word list is judged to be greater than i, looking up the probability of the i-th forward word of the list in the reference probability table and, when that probability is judged to exist or to be greater than or equal to a preset first threshold a, adding the i-th forward word to the base word group; and a first increment unit, for incrementing i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold a, or after the i-th forward word has been added to the base word group. The output of the first increment unit is connected to the input of the first judging unit. After i is incremented its value is 1; when this value is output to the first judging unit, the first judging unit judges again, and the loop repeats. In this way the forward words in the forward word list whose probability, looked up in the reference probability table, does not exist or is less than the first threshold a are added to the base word group, and once the judging is complete the base word group is scanned to obtain the set of word pairs having a forward relation.
A second scanning unit, for, when the length of the forward word list is judged to be less than or equal to i, scanning the base word group to obtain a set of word pairs having a forward relation, where each such pair is written {A, B}, A being the first term and B the second term;
A second judging unit, for judging whether the size of the set is less than a second variable j, where the initial value of j is 0.
A second adding unit, for, when the size of the set is judged to be less than j, taking the first term A and the second term B of the j-th pair in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); and, when P(B|A) is judged to be less than a preset second threshold b, judging whether the second term B already exists in the word segmentation library and, if not, adding B to the base word group. A second increment unit, for incrementing j when P(B|A) is judged to be greater than or equal to the preset second threshold b, or when the second term B is judged to already exist in the word segmentation library, or after B has been added to the base word group. The output of the second increment unit is connected to the input of the second judging unit. After j is incremented its value becomes 1; when this value is output to the second judging unit, the second judging unit judges again, and the loop repeats. In this way the second terms in the set whose probability, looked up in the reference probability table, is less than the second threshold b and which do not exist in the word segmentation library are added to the base word group. Once the judging is complete, the base word group is deduplicated and the resulting new word group is added to the word segmentation library. The poorly segmented inputs are thereby filtered, and adding the resulting new word group to the library completes the update.
A third adding unit, for, when the size of the set is judged to be greater than or equal to j, deduplicating the base word group and adding the resulting new word group to the word segmentation library.
In the present embodiment, the first threshold a and the second threshold b are configurable and are adjusted and optimized according to actual conditions.
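The two loops of the filtering submodule can be sketched as below. This is an interpretation under stated assumptions, not the patented implementation: the text does not say how the forward word list or the forward-relation pairs are derived, so both are passed in as arguments; the first test is read as "probability exists and is at least a"; the second loop runs while j is less than the set size, which appears to be the intended iteration; P(B|A) is computed as P(AB) / P(A); and pairs with a missing probability are skipped. All names are hypothetical.

```python
def z_filter(base_group, forward_words, forward_pairs, prob_table, lexicon, a, b):
    """Sketch of the two-phase filter; returns the deduplicated new word group.

    prob_table maps a string to its probability; lexicon is a set updated in place.
    """
    group = list(base_group)

    # Phase 1: walk the forward word list with the first variable i.
    i = 0
    while len(forward_words) > i:
        p = prob_table.get(forward_words[i])
        if p is not None and p >= a:        # assumed reading of the first test
            group.append(forward_words[i])
        i += 1

    # Phase 2: walk the forward-relation pairs {A, B} with the second variable j.
    j = 0
    while j < len(forward_pairs):           # assumed loop condition
        first, second = forward_pairs[j]
        p_a = prob_table.get(first)
        p_ab = prob_table.get(first + second)
        if p_a and p_ab is not None:        # skip pairs with unknown or zero P(A)
            p_b_given_a = p_ab / p_a
            if p_b_given_a < b and second not in lexicon:
                group.append(second)
        j += 1

    # Deduplicate while preserving order, then update the lexicon.
    new_words = list(dict.fromkeys(group))
    lexicon.update(new_words)
    return new_words
```

Because a and b are configurable, a deployment would likely tune them against a held-out sample of queries before letting the filter write to the live library.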
As shown in Fig. 2 being the flow chart of the participle Word library updating embodiment of the method for the present invention.The method is started from Step S1.
In step sl, the participle business diary that participle operation system is exported in running is gathered;Here In step, participle operation system refers to the system using participle function, including search system, now, search The participle business diary that system is exported in running is search service daily record, including the search of user is defeated Enter, the result that the search system is returned and user browsing and order behavior etc. to Search Results.
In step s 2, the participle business diary for collecting to log acquisition module carries out statistical analysis, and carries Take associated valid data.In this step, so that participle operation system includes search system as an example, the significant figure According to the Number of Orders including Search Results or the conversion ratio and/or search of number of visits and/or search keyword The word segmentation result of the recall rate and/or participle input of homepage hit ratio and/or search keyword as a result, its In, the Number of Orders or number of visits of Search Results are represented for certain search word, the Number of Orders of user Or browse the number of times of details page;The conversion ratio of search keyword represents that for certain search word user browses The number of times or Number of Orders of details page and the ratio value of searching times;The homepage hit ratio table of Search Results Show for certain search word, number of times and the search of result of the user required for the homepage of Search Results is acquired Ratio value between number of times;The recall rate of search keyword is represented for certain search keyword, search system The number of the result of return;The word segmentation result of participle input represents final for user input search keyword Word number in word segmentation result.
In step S3, the associated valid data is evaluated according to evaluation rules to obtain the segmentation inputs with poor segmentation results. The evaluation rules are preset, and the number of rules is determined by the types of associated valid data. Taking a search system as the example, the evaluation rules include: the conversion rate of the search keyword is less than a first preset threshold, and/or the number of search results is less than a second preset threshold, and/or the usage amount is less than a preset threshold, and/or the segmentation result of the segmentation input is greater than a third preset threshold. The valid data obtained by evaluating against the conversion-rate rule and/or the search-result-count rule are search keywords. The usage amount includes the number of views of the product detail page and the order volume, and the valid data obtained by evaluating against the usage-amount rule are popular search records, for example searched product names, tags, and detailed descriptions.
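A threshold-based evaluation of this kind might look like the sketch below. The threshold values, the dictionary keys, and the decision to flag an input when any single rule fires are assumptions for illustration; the text only states that the thresholds are preset and that rules may be combined with "and/or". The usage-amount rule is omitted for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    # Illustrative values only; the source does not give concrete numbers.
    min_conversion_rate: float = 0.05  # "first preset threshold"
    min_result_count: int = 3          # "second preset threshold"
    max_segment_count: int = 6         # "third preset threshold"


def is_poorly_segmented(metrics, t=Thresholds()):
    """Return True if any evaluation rule flags this keyword's metrics."""
    return (
        metrics["conversion_rate"] < t.min_conversion_rate
        or metrics["result_count"] < t.min_result_count
        or metrics["segment_count"] > t.max_segment_count
    )


def select_poor_inputs(metrics_by_keyword, t=Thresholds()):
    """Return the keywords whose segmentation is judged poor, sorted by name."""
    return sorted(
        kw for kw, m in metrics_by_keyword.items() if is_poorly_segmented(m, t)
    )
```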
In step S4, segmentation correction and filtering are performed on the poorly segmented inputs obtained in step S3 to output a new-word group, and the new-word group is updated into the word segmentation library. This completes the update of the word segmentation library and continuously improves it. The word segmentation service system can then periodically load the updated word segmentation library and continue to provide segmentation service, so that updates take effect quickly. Note that a segmentation input in the present invention refers to any data in the word segmentation service system that needs to be segmented, for example data segmented when the search index is built, such as product titles and descriptions, and user input that needs to be segmented during search.
Specifically, referring to Fig. 3, in this embodiment step S4 includes:
In step S41, corpus data is scanned, and the probability of each word being followed by the next word is calculated to construct a reference probability table. The corpus data may come from a specific search environment, for example the titles, detailed descriptions, tags, and supplier names of all products in a product search system, or it may be everyday general corpus data such as news, novels, and biographies. For example, suppose a corpus contains the entries AA, AB, AC, ABC and ABCD. The number of entries in which the word following A is A is 1, and the number of entries in the corpus beginning with A is 5, so the probability of A-A is 1/5, i.e. 0.2. Likewise, the probability of A-C is 0.2; the probability of A-B, P(B|A), is 0.6; the probability of A-B-C, P(C|AB), is 1; and the probability of A-B-C-D, P(D|ABC), is 1. These five probabilities together form a reference probability table.
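The worked example can be reproduced with a short Python sketch. It assumes that each corpus entry is treated as a sequence of single-character units, that prefixes are anchored at the start of an entry, and that the denominator counts only entries that continue past the prefix; the function name and table layout are illustrative, not taken from the source.

```python
from collections import Counter, defaultdict


def build_reference_table(corpus):
    """Map each entry prefix to {next_unit: P(next_unit | prefix)}."""
    counts = defaultdict(Counter)
    for entry in corpus:
        # For every proper prefix of the entry, record which unit follows it.
        for i in range(1, len(entry)):
            counts[entry[:i]][entry[i]] += 1
    table = {}
    for prefix, followers in counts.items():
        total = sum(followers.values())
        table[prefix] = {unit: n / total for unit, n in followers.items()}
    return table


table = build_reference_table(["AA", "AB", "AC", "ABC", "ABCD"])
print(table["A"])    # probabilities of A-A, A-B and A-C
print(table["AB"])   # probability of A-B-C
print(table["ABC"])  # probability of A-B-C-D
```

Under these assumptions the table contains exactly the five probabilities listed in the example.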
In step S42, full segmentation is performed on the poorly segmented input to obtain a basic segment group. For example, if the poorly segmented input is "智能分词" ("intelligent word segmentation"), the basic segment group obtained after full segmentation consists of the basic segments "智", "能", "分", "词", "智能", "能分", "分词", "智能分", "能分词" and "智能分词".
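Full segmentation as described here amounts to enumerating every contiguous substring of the input. The sketch below is one straightforward way to do that; the function name is an assumption, and the example string assumes the sample input is the four-character phrase 智能分词.

```python
def full_segmentation(text):
    """Return every contiguous substring of text, shortest segments first."""
    n = len(text)
    return [
        text[start:start + size]
        for size in range(1, n + 1)          # segment length
        for start in range(n - size + 1)     # starting position
    ]


segments = full_segmentation("智能分词")
print(len(segments))  # a string of length n yields n * (n + 1) / 2 segments
print(segments)
```

Because the number of segments grows quadratically with input length, this step is practical for short inputs such as search keywords and product titles.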
In step S43, the basic segment group obtained after full segmentation is filtered using the Z word segmentation filtering algorithm, which relies on the reference probability table, to obtain the new-word group, and the new-word group is updated into the word segmentation library.
Specifically, referring to Fig. 3, in this embodiment step S43 includes:
In step S431, the basic segment group is scanned to obtain a forward word list, that is, the forward words that the basic segments in the group have but that are not themselves contained in the group.
In step S432, it is judged whether the length of the forward word list is greater than a first variable i, where the initial value of i is 0. If so, step S433 is executed; if not, step S435 is executed.
In step S433, the probability of the i-th forward word in the forward word list is looked up in the reference probability table, and when it is judged that the probability of the i-th forward word exists or is greater than or equal to a preset first threshold a, the i-th forward word is added to the basic segment group.
In step S434, when it is judged that the probability of the i-th forward word does not exist or is less than the preset first threshold a, or after the i-th forward word has been added to the basic segment group, the first variable i is incremented (after the first increment its value is 1), and steps S432 to S434 are repeated.
Through this loop, the forward words in the forward word list whose probability cannot be found in the reference probability table or is less than the first threshold a are added to the basic segment group, so that after the judgment the group can be scanned to obtain the set of phrases having a forward relation.
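The control flow of steps S432 to S434 can be sketched as a simple indexed loop. How the forward word list is built and how probabilities are keyed are not specified in enough detail to reproduce, so both are passed in as plain inputs here. The acceptance test used below (the probability is found in the table and is at least `a`) is one possible reading of step S433 and should be treated as an assumption.

```python
def extend_with_forward_words(basic_group, forward_words, prob_of, a):
    """Run the S432-S434 loop and return the extended basic segment group.

    basic_group   -- list of basic segments (not modified)
    forward_words -- forward word list from step S431
    prob_of       -- dict mapping a forward word to its table probability
    a             -- preset first threshold
    """
    group = list(basic_group)
    i = 0                              # first variable i, initially 0
    while len(forward_words) > i:      # S432: is the list longer than i?
        word = forward_words[i]
        p = prob_of.get(word)          # S433: look up the reference table
        if p is not None and p >= a:   # assumed reading of the S433 test
            group.append(word)
        i += 1                         # S434: increment i and repeat
    return group                       # S435 continues from this group
```

For example, `extend_with_forward_words(["x"], ["y", "z"], {"y": 0.7, "z": 0.1}, 0.5)` keeps `"y"` and skips `"z"` under this reading.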
In step S435, the basic segment group is scanned to obtain the set of phrases having a forward relation, where a phrase having a forward relation is expressed as {A, B}, A being the first token and B the second token.
In step S436, it is judged whether the size of the set is less than a second variable j, where the initial value of j is 0. If so, step S437 is executed; if not, step S439 is executed.
In step S437, the first token A and the second token B of the j-th phrase in the set are taken out, P(A) and P(AB) are looked up in the reference probability table, and P(B|A) is calculated. When P(B|A) is judged to be less than a preset second threshold b, it is judged whether the second token B already exists in the word segmentation library; if not, the second token B is added to the basic segment group.
In step S438, when P(B|A) is judged to be greater than or equal to the preset second threshold b, or when the second token B is judged to already exist in the word segmentation library, or after the second token B has been added to the basic segment group, the second variable j is incremented, and steps S436 to S438 are repeated.
Through this loop, the second tokens in the set whose probability, as looked up in the reference probability table, is less than the second threshold b and which do not yet exist in the word segmentation library are added to the basic segment group. After the judgment, the new-word group obtained by deduplicating the basic segment group is added to the word segmentation library. This filters the poorly segmented input and updates the word segmentation library with the resulting new-word group.
In step S439, the new-word group obtained by deduplicating the basic segment group is added to the word segmentation library.
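Steps S436 to S439 can be sketched in the same style. Several details are assumptions made so that the example runs: the loop continues while j is a valid index into the set, a missing or zero P(A) is treated as giving P(B|A) = 0, the library is a Python `set`, and the probability inputs are plain dictionaries with hypothetical names.

```python
def filter_new_words(basic_group, pairs, p_single, p_joint, library, b):
    """Run the S436-S439 loop; update `library` and return the new-word group.

    pairs    -- list of (A, B) phrases having a forward relation (step S435)
    p_single -- dict mapping A to P(A)
    p_joint  -- dict mapping (A, B) to P(AB)
    library  -- set of words already in the word segmentation library
    b        -- preset second threshold
    """
    group = list(basic_group)
    j = 0                                        # second variable j
    while j < len(pairs):                        # assumed loop condition
        first, second = pairs[j]                 # S437: take the j-th phrase
        p_a = p_single.get(first, 0.0)
        p_ab = p_joint.get((first, second), 0.0)
        cond = p_ab / p_a if p_a > 0 else 0.0    # P(B|A) = P(AB) / P(A)
        if cond < b and second not in library:
            group.append(second)
        j += 1                                   # S438: increment j and repeat
    new_words = list(dict.fromkeys(group))       # S439: order-preserving dedup
    library.update(new_words)
    return new_words
```

Returning the deduplicated list while also updating the library in place keeps the sketch close to the wording of step S439; a production version would more likely return the additions and let the caller persist them.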
The above is only a preferred specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims. Finally, some symbols used in the present invention are explained: P(A) denotes the probability that A occurs; P(A|B) denotes the probability that A occurs given that B has occurred; P(AB) denotes the probability that A and B occur together.
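The three symbols are linked by the standard definition of conditional probability, which is the relation step S437 relies on when it derives P(B|A) from the looked-up values P(A) and P(AB):

```latex
P(B \mid A) = \frac{P(AB)}{P(A)}, \qquad P(A) > 0
```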

Claims (10)

1. A word segmentation library updating system, characterized by comprising:
a log collection module, configured to collect word segmentation service logs output by a word segmentation service system during operation;
a log analysis module, configured to statistically analyze the word segmentation service logs collected by the log collection module and extract associated valid data;
a segmentation evaluation module, configured to evaluate the associated valid data according to evaluation rules to obtain segmentation inputs with poor segmentation results; and
a segmentation correction and filtering module, configured to perform segmentation correction and filtering on the poorly segmented inputs obtained by the segmentation evaluation module to output a new-word group, and to update the new-word group into the word segmentation library.
2. The word segmentation library updating system according to claim 1, characterized in that the word segmentation service system includes a search system; the associated valid data includes the number of orders or number of views of the search results, and/or the conversion rate of the search keyword, and/or the first-page hit rate of the search results, and/or the recall rate of the search keyword, and/or the segmentation result of the segmentation input; and the evaluation rules include: the conversion rate of the search keyword is less than a first preset threshold, and/or the number of search results is less than a second preset threshold, and/or the usage amount is less than a preset threshold, and/or the segmentation result of the segmentation input is greater than a third preset threshold.
3. The word segmentation library updating system according to claim 1, characterized in that the segmentation correction and filtering module comprises a construction submodule and a segmentation submodule, wherein:
the construction submodule is configured to scan corpus data and calculate the probability of each word being followed by the next word to construct a reference probability table;
the segmentation submodule is configured to perform full segmentation on the poorly segmented input to obtain a basic segment group.
4. The word segmentation library updating system according to claim 3, characterized in that the segmentation correction and filtering module further comprises a filtering submodule, the filtering submodule being configured to filter, using the Z word segmentation filtering algorithm that relies on the reference probability table, the basic segment group obtained after full segmentation by the segmentation submodule to obtain the new-word group, and to update the new-word group into the word segmentation library.
5. The word segmentation library updating system according to claim 4, characterized in that the filtering submodule comprises:
a scanning unit, configured to scan the basic segment group to obtain a forward word list of the forward words that the basic segments in the group have but that are not themselves contained in the group;
a first judging unit, configured to judge whether the length of the forward word list is greater than a first variable i, wherein the initial value of the first variable i is 0;
a first adding unit, configured to, when the length of the forward word list is judged to be greater than the first variable i, look up the probability of the i-th forward word in the forward word list in the reference probability table, and, when the probability of the i-th forward word is judged to exist or to be greater than or equal to a preset first threshold, add the i-th forward word to the basic segment group;
a first increment unit, configured to increment the first variable i when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the basic segment group;
a second scanning unit, configured to, when the length of the forward word list is judged to be less than or equal to the first variable i, scan the basic segment group to obtain the set of phrases having a forward relation, wherein a phrase having a forward relation is expressed as {A, B}, A being the first token and B the second token;
a second judging unit, configured to judge whether the size of the set is less than a second variable j, wherein the initial value of the second variable j is 0;
a second adding unit, configured to, when the size of the set is judged to be less than the second variable j, take out the first token A and the second token B of the j-th phrase in the set, look up P(A) and P(AB) in the reference probability table, and calculate P(B|A); and, when P(B|A) is judged to be less than a preset second threshold, judge whether the second token B already exists in the word segmentation library and, if not, add the second token B to the basic segment group;
a second increment unit, configured to increment the second variable j when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second token B is judged to already exist in the word segmentation library, or after the second token B has been added to the basic segment group;
a third adding unit, configured to, when the size of the set is judged to be greater than or equal to the second variable j, add the new-word group obtained by deduplicating the basic segment group to the word segmentation library.
6. A word segmentation library updating method, characterized in that the method comprises the following steps:
S1, collecting word segmentation service logs output by a word segmentation service system during operation;
S2, statistically analyzing the collected word segmentation service logs and extracting associated valid data;
S3, evaluating the associated valid data according to evaluation rules to obtain segmentation inputs with poor segmentation results;
S4, performing segmentation correction and filtering on the obtained poorly segmented inputs to output a new-word group, and updating the new-word group into the word segmentation library.
7. The word segmentation library updating method according to claim 6, characterized in that the word segmentation service system includes a search system; the associated valid data includes the number of orders or number of views of the search results, and/or the conversion rate of the search keyword, and/or the first-page hit rate of the search results, and/or the recall rate of the search keyword, and/or the segmentation result of the segmentation input; and the evaluation rules include: the conversion rate of the search keyword is less than a first preset threshold, and/or the number of search results is less than a second preset threshold, and/or the usage amount is less than a preset threshold, and/or the segmentation result of the segmentation input is greater than a third preset threshold.
8. The word segmentation library updating method according to claim 6, characterized in that step S4 comprises:
S41, scanning corpus data and calculating the probability of each word being followed by the next word to construct a reference probability table;
S42, performing full segmentation on the poorly segmented input to obtain a basic segment group.
9. The word segmentation library updating method according to claim 8, characterized in that step S4 further comprises:
S43, filtering, using the Z word segmentation filtering algorithm that relies on the reference probability table, the basic segment group obtained after full segmentation by the segmentation submodule to obtain the new-word group, and updating the new-word group into the word segmentation library.
10. The word segmentation library updating method according to claim 8, characterized in that step S43 comprises:
S431, scanning the basic segment group to obtain a forward word list of the forward words that the basic segments in the group have but that are not themselves contained in the group;
S432, judging whether the length of the forward word list is greater than a first variable i, wherein the initial value of the first variable i is 0; if so, executing step S433; if not, executing step S435;
S433, when the length of the forward word list is judged to be greater than the first variable i, looking up the probability of the i-th forward word in the forward word list in the reference probability table, and, when the probability of the i-th forward word is judged to exist or to be greater than or equal to a preset first threshold, adding the i-th forward word to the basic segment group;
S434, when the probability of the i-th forward word is judged not to exist or to be less than the preset first threshold, or after the i-th forward word has been added to the basic segment group, incrementing the first variable i, and after the increment repeating steps S432 to S434;
S435, when the length of the forward word list is judged to be less than or equal to the first variable i, scanning the basic segment group to obtain the set of phrases having a forward relation, wherein a phrase having a forward relation is expressed as {A, B}, A being the first token and B the second token;
S436, judging whether the size of the set is less than a second variable j, wherein the initial value of the second variable j is 0; if so, executing step S437; if not, executing step S439;
S437, when the size of the set is judged to be less than the second variable j, taking out the first token A and the second token B of the j-th phrase in the set, looking up P(A) and P(AB) in the reference probability table, and calculating P(B|A); and, when P(B|A) is judged to be less than a preset second threshold, judging whether the second token B already exists in the word segmentation library and, if not, adding the second token B to the basic segment group;
S438, when P(B|A) is judged to be greater than or equal to the preset second threshold, or when the second token B is judged to already exist in the word segmentation library, or after the second token B has been added to the basic segment group, incrementing the second variable j, and after the increment repeating steps S436 to S438;
S439, when the size of the set is judged to be greater than or equal to the second variable j, adding the new-word group obtained by deduplicating the basic segment group to the word segmentation library.
CN201510715638.8A 2015-10-28 2015-10-28 Word segmentation and word library updating method and system Active CN106649308B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510715638.8A CN106649308B (en) 2015-10-28 2015-10-28 Word segmentation and word library updating method and system

Publications (2)

Publication Number Publication Date
CN106649308A true CN106649308A (en) 2017-05-10
CN106649308B CN106649308B (en) 2020-05-01

Family

ID=58831014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510715638.8A Active CN106649308B (en) 2015-10-28 2015-10-28 Word segmentation and word library updating method and system

Country Status (1)

Country Link
CN (1) CN106649308B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH10240735A (en) * 1997-02-27 1998-09-11 Mitsubishi Electric Corp Method and device for analyzing morpheme
CN102663139A (en) * 2012-05-07 2012-09-12 苏州大学 Method and system for constructing emotional dictionary
CN103544165A (en) * 2012-07-12 2014-01-29 腾讯科技(深圳)有限公司 Neologism mining method and system
CN104035969A (en) * 2014-05-20 2014-09-10 微梦创科网络科技(中国)有限公司 Method and system for building feature word banks in social network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
姚磊岳 等: "一种基于中文分词算法的信息过滤技术", 《科技广场》 *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182173A (en) * 2017-12-27 2018-06-19 福建中金在线信息科技有限公司 A kind of method, apparatus and electronic equipment for extracting keyword
CN108920576A (en) * 2018-06-25 2018-11-30 中科点击(北京)科技有限公司 A kind of adaptive text searching method
CN108984735A (en) * 2018-07-12 2018-12-11 广州资宝科技有限公司 Label Word library updating method, apparatus and electronic equipment
CN108984735B (en) * 2018-07-12 2019-08-13 广州资宝科技有限公司 Label Word library updating method, apparatus and electronic equipment
CN109858011A (en) * 2018-11-30 2019-06-07 平安科技(深圳)有限公司 Standard dictionary segmenting method, device, equipment and computer readable storage medium
CN109858011B (en) * 2018-11-30 2022-08-19 平安科技(深圳)有限公司 Standard word bank word segmentation method, device, equipment and computer readable storage medium
CN110231955A (en) * 2019-05-13 2019-09-13 平安科技(深圳)有限公司 Code process method, apparatus, computer equipment and storage medium
CN110231955B (en) * 2019-05-13 2024-05-07 平安科技(深圳)有限公司 Code processing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN106649308B (en) 2020-05-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant