CN106126495B - Method and apparatus for word discovery based on a large-scale corpus - Google Patents
Method and apparatus for word discovery based on a large-scale corpus
- Publication number: CN106126495B (application CN201610429967.0A)
- Authority
- CN
- China
- Prior art keywords
- word
- candidate word
- candidate
- indicate
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
Abstract
The present invention provides a word-discovery method and apparatus based on a large-scale corpus, comprising the steps of: counting, from the collected corpus, the cohesion and the freedom degree of each candidate word; multiplying the cohesion of the candidate word by its freedom degree to obtain a word-formation score; and extracting the candidate words whose word-formation score exceeds a preset threshold. In the absence of a standard definition and segmentation of Chinese words, the method enables a computer system to effectively identify and extract words from a large-scale corpus.
Description
Technical field
The present invention relates to the field of language analysis, and more particularly to a word-discovery method and apparatus based on a large-scale corpus.
Background art
In natural language processing of Chinese material, it is often necessary to extract words from a corpus. In the field of Chinese text processing, however, the definition of a word has always been ambiguous: there is still no generally acknowledged, authoritative standard for which single characters or character combinations count as a word. Chinese word discovery must therefore, in the absence of a standard dictionary, filter out of the corpus the text fragments most likely to form words; it is mainly used to find words in a corpus. Given that Chinese has no standard word definition and segmentation, the standard by which words are defined is the key to extracting words from a corpus.
The key for a computer to handle Chinese word discovery is how to let a computer system find words in a Chinese text corpus and extract them. A Chinese word is a symbol that records meaning; a word is composed of morphemes and is the smallest linguistic unit that can be used independently. But in the text of an isolating language such as Chinese, there is no explicit marker such as a space to indicate the boundary between words. Chinese word discovery has therefore become an important task faced when computers process isolating languages.
Accordingly, how to construct a word-discovery method and apparatus based on a large-scale corpus has become a technical problem urgently to be solved.
Summary of the invention
Embodiments of the present invention provide a word-discovery method and apparatus based on a large-scale corpus, to overcome the defect in the prior art that words cannot be effectively identified in and extracted from a large-scale corpus, and to enable a computer system to identify and extract words effectively in a large-scale corpus.
To solve the above problems, the invention discloses a word-discovery method based on a large-scale corpus, comprising the steps of:
counting, from the collected corpus, the cohesion of each candidate word and the freedom degree of each candidate word;
multiplying the cohesion of the candidate word by its freedom degree to obtain a word-formation score;
extracting the candidate words whose word-formation score exceeds a preset threshold.
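As a minimal sketch (not the patent's reference implementation), the three steps above can be expressed as follows; `cohesion` and `freedom` stand for the scoring functions whose concrete formulas are given later in the specification, and all names here are illustrative assumptions:

```python
def discover_words(candidates, cohesion, freedom, threshold):
    """Score every candidate word and keep those above the threshold.

    candidates: iterable of candidate strings counted from the corpus
    cohesion, freedom: functions mapping a candidate word to a float
    threshold: preset word-formation score cutoff
    """
    extracted = []
    for w in candidates:
        score = cohesion(w) * freedom(w)  # word-formation score G = T * H
        if score > threshold:
            extracted.append(w)
    return extracted

# Toy usage with constant placeholder scores (illustration only):
words = discover_words(["ab", "cd"], lambda w: 2.0, lambda w: 0.5, 0.9)
```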
In the method of the present invention, the cohesion of the candidate word is obtained by calculating the inter-character information entropy and the character frequencies of the candidate word in the corpus.
In the method of the present invention, the cohesion of the candidate word is
T = min{p_1, …, p_d} / max{ S'_i · S''_(i+1) : 1 ≤ i ≤ d−1 }
wherein T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word;
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_(j=1..K) (n_kj / n_i) · log(n_kj / n_i)
wherein n_kj denotes the number of times the character "k_j" in the right-neighbor character set of the i-th character appears to the right of that character, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of distinct characters in the right-neighbor set of the i-th character;
S''_(i+1) denotes the left entropy of the (i+1)-th character of the candidate word,
S''_(i+1) = −Σ_(j=1..M) (n_mj / n_(i+1)) · log(n_mj / n_(i+1))
wherein n_mj denotes the number of times the character "m_j" in the left-neighbor character set of the (i+1)-th character appears to the left of that character, n_(i+1) denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of distinct characters in the left-neighbor set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
wherein n denotes the number of times the candidate word of length d occurs in the corpus and n_j denotes the number of times its j-th character occurs in the corpus.
In the method of the present invention, the freedom degree of the candidate word is obtained by computing the information entropies of the left-neighbor and right-neighbor character sets of the candidate word and selecting the smaller as the freedom degree.
In the method of the present invention, the freedom degree of the candidate word is
H = min{S', S''}
wherein H denotes the freedom degree of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_(i=1..K) p_bi · log(p_bi)
wherein b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of distinct characters in the right-neighbor set of the candidate word;
S'' is the left entropy of the candidate word,
S'' = −Σ_(i=1..M) p_mi · log(p_mi)
wherein m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of distinct characters in the left-neighbor set of the candidate word.
To solve the above problems, the invention also discloses a word-discovery apparatus based on a large-scale corpus, comprising a corpus collection unit and further comprising:
a cohesion computing unit, configured to compute the cohesion of each candidate word;
a freedom-degree computing unit, configured to compute the freedom degree of each candidate word;
a word-formation score computing unit, configured to multiply the cohesion of the candidate word by its freedom degree to obtain a word-formation score;
a word-discovery unit, configured to extract the candidate words whose word-formation score exceeds a preset threshold.
In the apparatus of the present invention, the cohesion computing unit is further configured to obtain the cohesion of the candidate word by calculating the inter-character information entropy and the character frequencies of the candidate word in the corpus.
In the apparatus of the present invention, the cohesion computing unit is further configured to compute the cohesion of the candidate word by the following formula:
T = min{p_1, …, p_d} / max{ S'_i · S''_(i+1) : 1 ≤ i ≤ d−1 }
wherein T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_(j=1..K) (n_kj / n_i) · log(n_kj / n_i)
wherein n_kj denotes the number of times the character "k_j" in the right-neighbor character set of the i-th character appears to the right of that character, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of distinct characters in the right-neighbor set of the i-th character;
S''_(i+1) denotes the left entropy of the (i+1)-th character of the candidate word,
S''_(i+1) = −Σ_(j=1..M) (n_mj / n_(i+1)) · log(n_mj / n_(i+1))
wherein n_mj denotes the number of times the character "m_j" in the left-neighbor character set of the (i+1)-th character appears to the left of that character, n_(i+1) denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of distinct characters in the left-neighbor set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
wherein n denotes the number of times the candidate word of length d occurs in the corpus and n_j denotes the number of times its j-th character occurs in the corpus.
In the apparatus of the present invention, the freedom-degree computing unit is further configured to compute the information entropies of the left- and right-neighbor character sets of the candidate word and select the smaller as the freedom degree of the candidate word.
In the apparatus of the present invention, the freedom-degree computing unit is further configured to compute the freedom degree of the candidate word by the following formula:
H = min{S', S''}
wherein H denotes the freedom degree of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_(i=1..K) p_bi · log(p_bi)
wherein b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of distinct characters in the right-neighbor set of the candidate word;
S'' is the left entropy of the candidate word,
S'' = −Σ_(i=1..M) p_mi · log(p_mi)
wherein m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of distinct characters in the left-neighbor set of the candidate word.
With the word-discovery method and apparatus based on a large-scale corpus provided by embodiments of the present invention, the cohesion of a candidate word is computed by combining the information entropy between its internal characters with the frequency of the candidate word relative to each internal character; the freedom degree of the candidate word is computed as the smaller of the information entropies of its left- and right-neighbor character sets; and the product of the cohesion and the freedom degree is taken as the word-formation score. When word discovery is performed on a large-scale corpus, the candidate words whose word-formation score exceeds a preset threshold are extracted. A computer system can thereby effectively identify and extract words from a large-scale corpus.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of an embodiment of the word-discovery method based on a large-scale corpus of the present invention;
Fig. 2 is a structural block diagram of an embodiment of the word-discovery apparatus based on a large-scale corpus of the present invention.
Specific embodiment
To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1, a flow chart of the steps of a word-discovery method based on a large-scale corpus according to an embodiment of the present invention is shown.
The method of the present embodiment comprises the following steps:
Step 100: counting, from the collected corpus, the cohesion and the freedom degree of each candidate word. In the present embodiment, the cohesion of a candidate word may be obtained by calculating the inter-character information entropy and character frequencies of the candidate word in the corpus; the freedom degree of a candidate word may be obtained by computing the information entropies of its left- and right-neighbor character sets and taking the smaller.
Step 200: multiplying the cohesion of the candidate word by its freedom degree to obtain a word-formation score.
Step 300: extracting the candidate words whose word-formation score exceeds a preset threshold.
In the present embodiment, news corpus data from every field of sohu.com for the first half of April 2016 were collected. In a large collected corpus, if a text fragment that forms a word is sufficiently well distributed, its cohesion will be higher, and its freedom degree larger, than those of fragments that do not form words. If the left and right neighboring characters of a word are regarded as random variables, the information entropy of the left- and right-neighbor character sets of a word reflects the randomness of those neighbors: the smaller the entropy, the more stable the word's left or right neighbor set. Multiplying the right information entropy of each character in a text fragment by the left information entropy of its right-adjacent character and taking the maximum reflects the minimum stability between the characters inside the fragment: the smaller this value, the more stable the fragment. The ratio of the frequency of the candidate word to the frequency of each of its characters in turn reflects, as a whole, how tightly the candidate word binds the characters it contains: the larger the ratio, the closer the relationship between the candidate word and each character. Multiplying the inverse of the former by the latter therefore reflects the internal cohesion of the fragment. Hence, in the present invention we use the product of the inverse of the maximum entropy product between the candidate word's internal characters and the relative frequency of the candidate word as its internal cohesion T; the larger the value, the higher the internal cohesion of the candidate word and the more likely it is to form a word.
For a candidate word of length d, the internal cohesion is
T = min{p_1, …, p_d} / max{ S'_i · S''_(i+1) : 1 ≤ i ≤ d−1 }
wherein T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_(j=1..K) (n_kj / n_i) · log(n_kj / n_i)
wherein n_kj denotes the number of times the character "k_j" in the right-neighbor character set of the i-th character appears to the right of that character, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of distinct characters in the right-neighbor set of the i-th character. S''_(i+1) denotes the left entropy of the (i+1)-th character of the candidate word,
S''_(i+1) = −Σ_(j=1..M) (n_mj / n_(i+1)) · log(n_mj / n_(i+1))
wherein n_mj denotes the number of times the character "m_j" in the left-neighbor character set of the (i+1)-th character appears to the left of that character, n_(i+1) denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of distinct characters in the left-neighbor set of the (i+1)-th character. p_j denotes the frequency of the candidate word relative to its j-th character:
p_j = n / n_j
wherein n denotes the number of times the candidate word of length d occurs in the corpus and n_j denotes the number of times its j-th character occurs in the corpus.
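Under one consistent reading of the definitions above, the internal cohesion T can be sketched as below. Because the original formula images are not reproduced in this text, the exact placement of the minimum over p_j and the maximum over the entropy products is an assumption, and the function names are illustrative:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy of a neighbor-character count distribution."""
    total = sum(counts.values())
    return -sum(c / total * math.log(c / total) for c in counts.values())

def cohesion(word, corpus):
    """Internal cohesion T of `word` within `corpus` (a single string).

    Assumed reading: T = min_j p_j / max_i (S'_i * S''_{i+1}),
    with p_j = n / n_j, S'_i the right entropy of the i-th character
    and S''_{i+1} the left entropy of the (i+1)-th character.
    """
    n = corpus.count(word)
    # p_j: frequency of the candidate word relative to each character
    p = [n / corpus.count(ch) for ch in word]
    pair_products = []
    for i in range(len(word) - 1):
        # right neighbors of the i-th character, anywhere in the corpus
        right = Counter(corpus[k + 1] for k in range(len(corpus) - 1)
                        if corpus[k] == word[i])
        # left neighbors of the (i+1)-th character
        left = Counter(corpus[k - 1] for k in range(1, len(corpus))
                       if corpus[k] == word[i + 1])
        pair_products.append(entropy(right) * entropy(left))
    return min(p) / max(pair_products)
```

This direct scan is quadratic and only for illustration; the embodiment's suffix-sorting procedure computes the same counts in a single pass per direction.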
In the present embodiment, the freedom degree of a candidate word reflects the richness of its context. By the definition of information entropy, the larger the entropy, the stronger the randomness of the random variable. Here the information entropies of the left- and right-neighbor character sets of the candidate word are taken to reflect how freely it is used; to reflect the freedom degree of the candidate word as a whole, the smaller of the two entropies is taken as its freedom degree H. The larger the value, the freer the candidate word and the more likely it is to form a word. The freedom degree is
H = min{S', S''}
wherein H denotes the freedom degree of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_(i=1..K) p_bi · log(p_bi)
wherein b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of distinct characters in the right-neighbor set of the candidate word. S'' is the left entropy of the candidate word:
S'' = −Σ_(i=1..M) p_mi · log(p_mi)
wherein m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of distinct characters in the left-neighbor set of the candidate word.
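A minimal sketch of the freedom degree H = min{S', S''} defined above, with the neighbor entropies computed directly from the corpus string (function names are illustrative assumptions):

```python
import math
from collections import Counter

def neighbor_entropy(word, corpus, side):
    """Entropy of the characters adjacent to `word` in `corpus`.

    side='right' collects the character after each occurrence of the
    word; side='left' collects the character before it.
    """
    neighbors = Counter()
    start = corpus.find(word)
    while start != -1:
        if side == 'right' and start + len(word) < len(corpus):
            neighbors[corpus[start + len(word)]] += 1
        elif side == 'left' and start > 0:
            neighbors[corpus[start - 1]] += 1
        start = corpus.find(word, start + 1)
    total = sum(neighbors.values())
    return -sum(c / total * math.log(c / total)
                for c in neighbors.values())

def freedom(word, corpus):
    """Freedom degree H: the smaller of the left and right entropies."""
    return min(neighbor_entropy(word, corpus, 'right'),
               neighbor_entropy(word, corpus, 'left'))
```

For example, a word that always follows the same character has left entropy 0 and therefore freedom degree 0, matching the intuition that it is not used freely.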
In the present embodiment the word-formation score of a candidate word is denoted G; the larger the score, the more likely the candidate word is to form a word, where
G = T * H
Word discovery is realized by treating every substring of length at most d occurring in the text as a potential word: the word-formation score of each candidate is computed, a threshold on the score is set, and finally all candidate words above the threshold are taken as the extracted words. In concrete operation, the entire corpus can be regarded as one character string, and all suffixes of this string are sorted in lexicographical order so that identical candidate words are grouped together. One forward scan then computes the right-neighbor information entropy of each character inside each candidate word, the right-neighbor entropy of the candidate word itself, and the frequencies of the candidate word and of each of its internal characters. Next, the entire corpus is reversed and all suffixes are sorted again; a second scan yields the left-neighbor information entropy of each character inside each candidate word and the left-neighbor entropy of the candidate word itself. The word-formation score of each candidate word is then computed, the candidates are sorted in descending order of score, and the words above the threshold are extracted. This completes the word-discovery algorithm for a large-scale corpus.
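The two-pass suffix-sorting procedure described above can be sketched as follows. This is a simplified in-memory version under stated assumptions (a production implementation would use a proper suffix array rather than materializing every suffix), and the function name is illustrative:

```python
from collections import defaultdict

def count_candidates_and_right_neighbors(corpus, max_len):
    """One scan over the lexicographically sorted suffixes.

    Sorting the suffixes groups identical candidate prefixes together,
    so a single pass yields each candidate's frequency and its
    right-neighbor character counts (from which the right entropy
    follows).
    """
    suffixes = sorted(corpus[i:] for i in range(len(corpus)))
    freq = defaultdict(int)
    right = defaultdict(lambda: defaultdict(int))
    for suf in suffixes:
        for d in range(1, min(max_len, len(suf)) + 1):
            cand = suf[:d]
            freq[cand] += 1
            if d < len(suf):              # character to the right
                right[cand][suf[d]] += 1
    return freq, right

# Left neighbors come from the same scan over the reversed corpus:
#   freq_rev, left_rev = count_candidates_and_right_neighbors(corpus[::-1], max_len)
# The right-neighbor counts of a reversed candidate are exactly the
# left-neighbor counts of the original candidate.
```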
The method of the present embodiment enables a computer system to effectively identify and extract words from a large-scale corpus in the absence of a standard definition and segmentation of Chinese words.
Embodiment two
Referring to Fig. 2, a structural block diagram of a word-discovery apparatus based on a large-scale corpus according to an embodiment of the present invention is shown.
The apparatus of the present embodiment comprises a corpus collection unit and further comprises:
a cohesion computing unit, configured to compute the cohesion of each candidate word; in the present embodiment it may obtain the cohesion by calculating the inter-character information entropy and the character frequencies of the candidate word in the corpus;
a freedom-degree computing unit, configured to compute the freedom degree of each candidate word; in the present embodiment it may compute the information entropies of the left- and right-neighbor character sets of the candidate word and select the smaller as the freedom degree;
a word-formation score computing unit, configured to multiply the cohesion of the candidate word by its freedom degree to obtain a word-formation score;
a word-discovery unit, configured to extract the candidate words whose word-formation score exceeds a preset threshold.
In the present embodiment, the cohesion computing unit may compute the cohesion of the candidate word by the following formula:
T = min{p_1, …, p_d} / max{ S'_i · S''_(i+1) : 1 ≤ i ≤ d−1 }
wherein T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word;
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_(j=1..K) (n_kj / n_i) · log(n_kj / n_i)
wherein n_kj denotes the number of times the character "k_j" in the right-neighbor character set of the i-th character appears to the right of that character, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of distinct characters in the right-neighbor set of the i-th character;
S''_(i+1) denotes the left entropy of the (i+1)-th character of the candidate word,
S''_(i+1) = −Σ_(j=1..M) (n_mj / n_(i+1)) · log(n_mj / n_(i+1))
wherein n_mj denotes the number of times the character "m_j" in the left-neighbor character set of the (i+1)-th character appears to the left of that character, n_(i+1) denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of distinct characters in the left-neighbor set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
wherein n denotes the number of times the candidate word of length d occurs in the corpus and n_j denotes the number of times its j-th character occurs in the corpus.
In the present embodiment, the freedom-degree computing unit may compute the freedom degree of the candidate word by the following formula:
H = min{S', S''}
wherein H denotes the freedom degree of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_(i=1..K) p_bi · log(p_bi)
wherein b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of distinct characters in the right-neighbor set of the candidate word;
S'' is the left entropy of the candidate word,
S'' = −Σ_(i=1..M) p_mi · log(p_mi)
wherein m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of distinct characters in the left-neighbor set of the candidate word.
The word-discovery apparatus based on a large-scale corpus of the present embodiment implements the corresponding word-discovery method of Embodiment one and has the corresponding beneficial effects, which are not repeated here.
The apparatus embodiments described above are merely exemplary. Units described as separate components may or may not be physically separated, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network nodes. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which those of ordinary skill in the art can understand and implement without creative effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, or of course by hardware. Based on this understanding, the technical solution in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of the embodiments.
Finally, it should be noted that the above embodiments are merely illustrative of the technical solutions of the present invention and do not limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (6)
1. A word-discovery method based on a large-scale corpus, characterized by comprising the steps of:
counting, from the collected corpus, the cohesion of each candidate word and the freedom degree of each candidate word;
multiplying the cohesion of the candidate word by its freedom degree to obtain a word-formation score;
extracting the candidate words whose word-formation score exceeds a preset threshold;
wherein the cohesion of the candidate word is obtained by calculating the inter-character information entropy and the character frequencies of the candidate word in the corpus;
wherein the cohesion of the candidate word is
T = min{p_1, …, p_d} / max{ S'_i · S''_(i+1) : 1 ≤ i ≤ d−1 }
wherein T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word;
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_(j=1..K) (n_kj / n_i) · log(n_kj / n_i)
wherein n_kj denotes the number of times the character "k_j" in the right-neighbor character set of the i-th character appears to the right of that character, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of distinct characters in the right-neighbor set of the i-th character;
S''_(i+1) denotes the left entropy of the (i+1)-th character of the candidate word,
S''_(i+1) = −Σ_(j=1..M) (n_mj / n_(i+1)) · log(n_mj / n_(i+1))
wherein n_mj denotes the number of times the character "m_j" in the left-neighbor character set of the (i+1)-th character appears to the left of that character, n_(i+1) denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of distinct characters in the left-neighbor set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
wherein n denotes the number of times the candidate word of length d occurs in the corpus and n_j denotes the number of times its j-th character occurs in the corpus.
2. The method according to claim 1, characterized in that:
the freedom degree of the candidate word takes the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word.
3. The method according to claim 2, characterized in that:
the freedom degree of the candidate word is
H = min{S', S''}
wherein H denotes the freedom degree of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_(i=1..U) p_bi · log(p_bi)
wherein b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and U denotes the number of distinct characters in the right-neighbor set of the candidate word;
S'' is the left entropy of the candidate word,
S'' = −Σ_(i=1..q) p_mi · log(p_mi)
wherein m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and q denotes the number of distinct characters in the left-neighbor set of the candidate word.
4. A word-discovery apparatus based on a large-scale corpus, comprising a corpus collection unit, characterized by further comprising:
a cohesion computing unit, configured to compute the cohesion of each candidate word;
a freedom-degree computing unit, configured to compute the freedom degree of each candidate word;
a word-formation score computing unit, configured to multiply the cohesion of the candidate word by its freedom degree to obtain a word-formation score;
a word-discovery unit, configured to extract the candidate words whose word-formation score exceeds a preset threshold;
wherein the cohesion computing unit is further configured to obtain the cohesion of the candidate word by calculating the inter-character information entropy and the character frequencies of the candidate word in the corpus;
wherein the cohesion computing unit is further configured to compute the cohesion of the candidate word by the following formula:
T = min{p_1, …, p_d} / max{ S'_i · S''_(i+1) : 1 ≤ i ≤ d−1 }
wherein T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_(j=1..K) (n_kj / n_i) · log(n_kj / n_i)
wherein n_kj denotes the number of times the character "k_j" in the right-neighbor character set of the i-th character appears to the right of that character, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of distinct characters in the right-neighbor set of the i-th character;
S''_(i+1) denotes the left entropy of the (i+1)-th character of the candidate word,
S''_(i+1) = −Σ_(j=1..M) (n_mj / n_(i+1)) · log(n_mj / n_(i+1))
wherein n_mj denotes the number of times the character "m_j" in the left-neighbor character set of the (i+1)-th character appears to the left of that character, n_(i+1) denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of distinct characters in the left-neighbor set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
wherein n denotes the number of times the candidate word of length d occurs in the corpus and n_j denotes the number of times its j-th character occurs in the corpus.
5. The apparatus according to claim 4, characterized in that:
the freedom-degree computing unit is further configured to take the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word.
6. The apparatus according to claim 5, characterized in that:
the freedom-degree computing unit is further configured to compute the freedom degree of the candidate word by the following formula:
H = min{S', S''}
wherein H denotes the freedom degree of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_(i=1..U) p_bi · log(p_bi)
wherein b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and U denotes the number of distinct characters in the right-neighbor set of the candidate word;
S'' is the left entropy of the candidate word,
S'' = −Σ_(j=1..q) p_mj · log(p_mj)
wherein m_j belongs to the left-neighbor character set of the candidate word, p_mj denotes the frequency with which m_j appears to the left of the candidate word, and q denotes the number of distinct characters in the left-neighbor set of the candidate word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610429967.0A CN106126495B (en) | 2016-06-16 | 2016-06-16 | Method and apparatus for word discovery based on a large-scale corpus
Publications (2)
Publication Number | Publication Date |
---|---|
CN106126495A CN106126495A (en) | 2016-11-16 |
CN106126495B true CN106126495B (en) | 2019-03-12 |
Family
ID=57469834
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610429967.0A Active CN106126495B (en) | Method and apparatus for word discovery based on a large-scale corpus
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106126495B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108845982B (en) * | 2017-12-08 | 2021-08-20 | 昆明理工大学 | Chinese word segmentation method based on word association characteristics |
CN112182448A (en) * | 2019-07-05 | 2021-01-05 | 百度在线网络技术(北京)有限公司 | Page information processing method, device and equipment |
CN110991173B (en) * | 2019-11-29 | 2023-09-29 | 支付宝(杭州)信息技术有限公司 | Word segmentation method and system |
CN115034211B (en) * | 2022-05-19 | 2023-04-18 | 一点灵犀信息技术(广州)有限公司 | Unknown word discovery method and device, electronic equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105260362A (en) * | 2015-10-30 | 2016-01-20 | 小米科技有限责任公司 | New word extraction method and device |
CN105488098A (en) * | 2015-10-28 | 2016-04-13 | 北京理工大学 | Field difference based new word extraction method |
Non-Patent Citations (1)
Title |
---|
Design and Implementation of a User Feedback Management System for Mobile Applications; Lin Zhenbin; China Master's Theses Full-text Database, Information Science and Technology; 2015-03-15; page 10 paragraph 5 to page 14 paragraph 4
Also Published As
Publication number | Publication date |
---|---|
CN106126495A (en) | 2016-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019227710A1 (en) | Network public opinion analysis method and apparatus, and computer-readable storage medium | |
CN110968684B (en) | Information processing method, device, equipment and storage medium | |
CN110457672B (en) | Keyword determination method and device, electronic equipment and storage medium | |
CN106126495B (en) | Method and apparatus for word discovery based on a large-scale corpus | |
KR20190038751A (en) | User keyword extraction apparatus, method and computer readable storage medium | |
CN111460153B (en) | Hot topic extraction method, device, terminal equipment and storage medium | |
US20140344230A1 (en) | Methods and systems for node and link identification | |
CN105630767B (en) | The comparative approach and device of a kind of text similarity | |
CN105357586A (en) | Video bullet screen filtering method and device | |
RU2016122051A (en) | METHOD AND DEVICE FOR RECOGNIZING IMAGE OBJECT CATEGORY | |
CN103279478A (en) | Method for extracting features based on distributed mutual information documents | |
CN109657058A (en) | A kind of abstracting method of notice information | |
CN105787121B (en) | A kind of microblogging event summary extracting method based on more story lines | |
MY189086A (en) | System and method for dynamic entity sentiment analysis | |
CN105550359B (en) | Webpage sorting method and device based on vertical search and server | |
CN110188359B (en) | Text entity extraction method | |
Alghamdi et al. | Topic detections in Arabic dark websites using improved vector space model | |
CN112883730B (en) | Similar text matching method and device, electronic equipment and storage medium | |
CN110674365A (en) | Searching method, device, equipment and storage medium | |
US9830533B2 (en) | Analyzing and exploring images posted on social media | |
CN105512300B (en) | information filtering method and system | |
CN107688621B (en) | Method and system for optimizing file | |
CN108536676A (en) | Data processing method, device, electronic equipment and storage medium | |
CN108763192A (en) | Entity relation extraction method and device for text-processing | |
CN108460016A (en) | A kind of entity name analysis recognition method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |