CN106126495B - Word-discovery method and apparatus based on a large-scale corpus - Google Patents

Word-discovery method and apparatus based on a large-scale corpus

Info

Publication number
CN106126495B
CN106126495B CN201610429967.0A
Authority
CN
China
Prior art keywords
word
candidate word
candidate
indicate
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610429967.0A
Other languages
Chinese (zh)
Other versions
CN106126495A (en)
Inventor
曹骥
王富田
李健
张连毅
武卫东
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Original Assignee
BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP filed Critical BEIJING INFOQUICK SINOVOICE SPEECH TECHNOLOGY CORP
Priority to CN201610429967.0A priority Critical patent/CN106126495B/en
Publication of CN106126495A publication Critical patent/CN106126495A/en
Application granted granted Critical
Publication of CN106126495B publication Critical patent/CN106126495B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a word-discovery method and device based on a large-scale corpus, comprising the steps of: computing, from a collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word; multiplying the cohesion of the candidate word by its degree of freedom to obtain a word-formation score; and extracting the candidate words whose word-formation score exceeds a preset threshold. In the absence of a standard definition and segmentation of Chinese words, this enables a computer system to effectively identify and extract words from a large-scale corpus.

Description

Word-discovery method and apparatus based on a large-scale corpus
Technical field
The present invention relates to the field of language analysis, and in particular to a word-discovery method and apparatus based on a large-scale corpus.
Background art
In natural language processing of Chinese material, it is often necessary to extract words from a corpus. In the field of Chinese text processing, however, the definition of a word has always been ambiguous: there is still no generally accepted, authoritative standard for which single characters or character combinations count as a word. Chinese word discovery must therefore, in the absence of a standard dictionary, filter out of a corpus the text fragments most likely to form words; it is chiefly used to find words in a corpus. Since Chinese lacks a standard definition and segmentation of words, the standard by which words are defined is the key to extracting them from a corpus.
The key for a computer to perform Chinese word discovery is how to let a computer system find words in a Chinese text corpus and extract them. Chinese characters are the symbols that record the language; a word is composed of morphemes and is the smallest linguistic unit that can be used independently. In the text of an isolating language such as Chinese, however, there is no explicit boundary marker between words, such as a space. Chinese word discovery has therefore become an important task that computers face when processing isolating languages.
How to construct a word-discovery method and apparatus based on a large-scale corpus has thus become a technical problem in urgent need of a solution.
Summary of the invention
Embodiments of the present invention provide a word-discovery method and apparatus based on a large-scale corpus, to overcome the defect of the prior art that words cannot be effectively identified and extracted from a large-scale corpus, so that a computer system can effectively identify and extract words in a large-scale corpus.
To solve the above problems, the invention discloses a word-discovery method based on a large-scale corpus, comprising the steps of:
computing, from the collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word;
multiplying the cohesion of the candidate word by its degree of freedom to obtain a word-formation score;
extracting the candidate words whose word-formation score exceeds a preset threshold.
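The three claimed steps can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the cohesion stand-in below compares a candidate's observed count with the count expected if its characters were independent, rather than the entropy-based formula given later in the description, and `max_len` and `threshold` are assumed example parameters.

```python
import math
from collections import Counter

def entropy(counter):
    # Shannon entropy of a neighbor-character count distribution.
    total = sum(counter.values())
    return -sum(c / total * math.log(c / total) for c in counter.values())

def discover_words(corpus, max_len=4, threshold=0.5):
    # Step 1: count every substring of length <= max_len and its neighbors.
    freq, left, right = Counter(), {}, {}
    n_total = len(corpus)
    for i in range(n_total):
        for d in range(1, max_len + 1):
            if i + d > n_total:
                break
            w = corpus[i:i + d]
            freq[w] += 1
            if i > 0:
                left.setdefault(w, Counter())[corpus[i - 1]] += 1
            if i + d < n_total:
                right.setdefault(w, Counter())[corpus[i + d]] += 1
    scored = {}
    for w, n in freq.items():
        if len(w) < 2:
            continue  # single characters are not scored as candidates here
        # Cohesion stand-in: observed count vs. count expected if the
        # characters were independent (NOT the patented entropy formula).
        expected = n_total * math.prod(freq[c] / n_total for c in w)
        cohesion = n / expected
        # Degree of freedom: H = min{S', S''} over left/right neighbor sets.
        h = min(entropy(left.get(w, Counter())),
                entropy(right.get(w, Counter())))
        scored[w] = cohesion * h  # word-formation score G = T * H
    # Step 3: keep the candidates whose score clears the threshold.
    return {w: g for w, g in scored.items() if g > threshold}
```

On a toy corpus such as "的苹果和苹果树与苹果汁", the fragment "苹果" repeats with varied neighbors on both sides and clears the threshold, while one-off fragments like "果和" have a zero-entropy neighbor set and are rejected.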
In the method of the present invention,
the cohesion of a candidate word is obtained by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies.
In the method of the present invention,
the cohesion of the candidate word is
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word;
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
In the method of the present invention,
the degree of freedom of the candidate word is obtained by computing the information entropies of the left- and right-neighbor character sets of the candidate word and taking the smaller of the two as the degree of freedom.
In the method of the present invention,
the degree of freedom of the candidate word is
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
To solve the above problems, the invention also discloses a word-discovery device based on a large-scale corpus, comprising a corpus collection unit and further comprising:
a candidate-word cohesion computing unit, for computing the cohesion of candidate words;
a candidate-word freedom computing unit, for computing the degree of freedom of candidate words;
a word-formation score computing unit, for multiplying the cohesion of a candidate word by its degree of freedom to obtain a word-formation score;
a word-discovery unit, for extracting the candidate words whose word-formation score exceeds a preset threshold.
In the device of the present invention,
the candidate-word cohesion computing unit is further configured to obtain the cohesion of a candidate word by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies.
In the device of the present invention,
the candidate-word cohesion computing unit is further configured to compute the cohesion of the candidate word by the following formula:
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
In the device of the present invention,
the candidate-word freedom computing unit is further configured to take the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word as its degree of freedom.
In the device of the present invention,
the candidate-word freedom computing unit is further configured to compute the degree of freedom of the candidate word by the following formula:
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
The word-discovery method and device based on a large-scale corpus provided by the embodiments of the present invention compute the cohesion of each candidate word by combining the information entropy between its internal characters with the frequency of the candidate word relative to each of those characters; compute the degree of freedom of each candidate word as the smaller of the information entropies of its left- and right-neighbor character sets; and take the product of cohesion and degree of freedom as the word-formation score. When performing word discovery on a large-scale corpus, the candidate words whose word-formation score exceeds a preset threshold are extracted. This enables a computer system to effectively identify and discover words in a large-scale corpus.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the steps of an embodiment of a word-discovery method based on a large-scale corpus according to the present invention;
Fig. 2 is a structural block diagram of an embodiment of a word-discovery device based on a large-scale corpus according to the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Embodiment one
Referring to Fig. 1, a flow chart of the steps of a word-discovery method based on a large-scale corpus according to an embodiment of the present invention is shown.
The method of the present embodiment comprises the following steps:
Step 100: compute, from the collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word. In the present embodiment, the cohesion of a candidate word may be obtained by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies; the degree of freedom of a candidate word may be obtained by computing the information entropies of its left- and right-neighbor character sets and taking the smaller of the two.
Step 200: multiply the cohesion of the candidate word by its degree of freedom to obtain a word-formation score.
Step 300: extract the candidate words whose word-formation score exceeds a preset threshold.
For the present embodiment, news corpus data from every domain of sohu.com for the first half of April 2016 were collected. In a large-scale corpus, if a text fragment that forms a word occurs in sufficiently varied contexts, then its cohesion and its degree of freedom will both be higher than those of fragments that do not form words. If the characters adjacent to a word on the left and right are regarded as random variables, the information entropies of the word's left- and right-neighbor character sets reflect the randomness of those neighbors: the smaller the entropy, the more stable the word's left or right neighbor set.
Multiplying the right entropy of each character in a text fragment by the left entropy of the character to its right, and taking the maximum of these products, reflects the weakest internal bond of the fragment: the smaller this maximum, the more stable the fragment. The frequency of the candidate word relative to each of its characters, in turn, reflects how tightly the candidate word binds each character it contains: the larger this ratio, the closer the relationship between the candidate word and its characters. The reciprocal of the former multiplied by the latter therefore reflects the internal cohesion of the fragment. Accordingly, in the present invention we take the product of the reciprocal of the maximum entropy product between the internal characters of the candidate word and the relative frequency of the candidate word as its internal cohesion T; the larger its value, the higher the internal cohesion of the candidate word and the more likely it is to form a word.
For a candidate word of length d, the internal cohesion is
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character. S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character. p_j denotes the frequency of the candidate word relative to its j-th character:
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
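The cohesion computation just described can be sketched as follows. This is a hypothetical sketch: the patent's formula images are not reproduced in this text, so taking the minimum over p_j and the maximum over the entropy products is an assumption drawn from the surrounding description, and `neighbor_entropy` and `cohesion` are illustrative helper names, not functions named by the patent.

```python
import math
from collections import Counter

def neighbor_entropy(corpus, ch, side):
    # Entropy of the characters adjacent to `ch` on the given side,
    # normalized by n_i, the number of occurrences of `ch` in the corpus
    # (as in the patent's definition, boundary occurrences contribute no
    # neighbor, so the weights need not sum to exactly 1).
    counts, n_i = Counter(), 0
    for pos, c in enumerate(corpus):
        if c != ch:
            continue
        n_i += 1
        j = pos + 1 if side == "right" else pos - 1
        if 0 <= j < len(corpus):
            counts[corpus[j]] += 1
    return -sum((k / n_i) * math.log(k / n_i) for k in counts.values())

def cohesion(corpus, word):
    # T = min_j p_j / max_i (S'_i * S''_{i+1}), with p_j = n / n_j
    # (the min/max aggregation is the reconstructed, assumed form).
    assert len(word) >= 2, "cohesion is defined for multi-character candidates"
    n = corpus.count(word)                       # occurrences of the candidate
    p = [n / corpus.count(c) for c in word]      # p_j = n / n_j
    pairs = [neighbor_entropy(corpus, word[i], "right")
             * neighbor_entropy(corpus, word[i + 1], "left")
             for i in range(len(word) - 1)]
    denom = max(pairs)
    # A zero denominator means some internal bond is perfectly rigid.
    return min(p) / denom if denom > 0 else float("inf")
```

In a corpus where every internal character also appears in other contexts, both entropies are positive and T is finite; fully rigid fragments come out as infinitely cohesive under this sketch.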
In the present embodiment, the degree of freedom of a candidate word reflects the richness of the contexts in which it occurs. By the definition of information entropy, a larger value indicates stronger randomness of the random variable. Here the information entropies of the left- and right-neighbor character sets of the candidate word reflect how freely it is used; to capture the degree of freedom of the candidate word as a whole, the smaller of the two entropies is taken as its degree of freedom H. The larger its value, the more freely the candidate word is used and the more likely it is to form a word. The degree of freedom is
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word. S″ is the left entropy of the candidate word:
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
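The degree of freedom H = min{S', S″} can be computed directly from the characters seen immediately left and right of the candidate word. A minimal sketch follows; `freedom` is an illustrative helper name, and the naive `str.find` scan is an assumption suitable only for a small corpus.

```python
import math
from collections import Counter

def freedom(corpus, word):
    # H = min{S', S''}: the smaller of the entropies of the characters
    # appearing immediately to the left and right of the candidate word.
    left, right = Counter(), Counter()
    start = corpus.find(word)
    while start != -1:
        if start > 0:
            left[corpus[start - 1]] += 1
        end = start + len(word)
        if end < len(corpus):
            right[corpus[end]] += 1
        start = corpus.find(word, start + 1)  # allow overlapping matches

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0  # the word only ever touches the corpus boundary
        return -sum(c / total * math.log(c / total) for c in counter.values())

    return min(entropy(left), entropy(right))
```

For example, in "的苹果和苹果树与苹果汁" the fragment "苹果" has three distinct neighbors on each side, so H = log 3, while "果" always follows "苹" and gets H = 0.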
In the present embodiment, the word-formation score of a candidate word is G; the larger the score, the more likely the candidate word is to form a word, where
G = T × H
Word discovery is realized by treating every substring of the text of length no more than d as a potential candidate word, computing a word-formation score for each, setting a threshold on the score, and finally taking out all candidate words above the threshold as the extracted words. In concrete operation, the entire corpus can be regarded as a single character string, and all suffixes of this string are sorted in lexicographic order so that identical candidate words are grouped together. One scan from beginning to end then computes the right-neighbor entropy of each internal character of each candidate word, the right-neighbor entropy corresponding to the candidate word itself, and the frequencies of the candidate word and of each of its internal characters. The entire corpus is then reversed, all suffixes are sorted again, and a second scan computes the left-neighbor entropy of each internal character and the left-neighbor entropy corresponding to the candidate word. The word-formation score of each candidate word is computed, the candidates are sorted in descending order of score, and the words above the threshold are extracted. This completes the word-discovery algorithm for a large-scale corpus.
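The two-pass suffix-sorting scheme described above can be sketched as follows. Materializing suffix slices with Python's `sorted` is an assumed simplification that is only practical for a toy corpus (a production system would use a proper suffix array), and the function names are illustrative; with hash maps the sorted order is not strictly required, but it mirrors the sequential-scan scheme in the text.

```python
from collections import Counter

def candidate_counts(corpus, max_len):
    # Sort all suffixes lexicographically so that identical candidate words
    # sit next to each other, then scan once to count each candidate and
    # the characters appearing immediately to its right.
    order = sorted(range(len(corpus)), key=lambda i: corpus[i:])
    freq, right = Counter(), {}
    for i in order:                      # scan suffixes in sorted order
        for d in range(1, max_len + 1):
            if i + d > len(corpus):
                break
            w = corpus[i:i + d]
            freq[w] += 1
            if i + d < len(corpus):
                right.setdefault(w, Counter())[corpus[i + d]] += 1
    return freq, right

def both_sides(corpus, max_len=4):
    freq, right = candidate_counts(corpus, max_len)
    # Second pass over the reversed corpus: a right neighbor of a reversed
    # candidate is exactly a left neighbor of the original candidate.
    _, rev_right = candidate_counts(corpus[::-1], max_len)
    left = {w[::-1]: c for w, c in rev_right.items()}
    return freq, left, right
```

From the returned frequency and neighbor tables, the entropies, cohesion, and degree of freedom of every candidate can be computed and the candidates sorted by score.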
The method of the present embodiment enables a computer system, in the absence of a standard definition and segmentation of Chinese words, to effectively identify and extract words from a large-scale corpus.
Embodiment two
Referring to Fig. 2, a structural block diagram of a word-discovery device based on a large-scale corpus according to an embodiment of the present invention is shown.
The device of the present embodiment comprises a corpus collection unit and further comprises:
a candidate-word cohesion computing unit, for computing the cohesion of candidate words; in the present embodiment it may obtain the cohesion of a candidate word by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies;
a candidate-word freedom computing unit, for computing the degree of freedom of candidate words; in the present embodiment it may compute the information entropies of the left- and right-neighbor character sets of the candidate word and take the smaller of the two as the degree of freedom;
a word-formation score computing unit, for multiplying the cohesion of a candidate word by its degree of freedom to obtain a word-formation score;
a word-discovery unit, for extracting the candidate words whose word-formation score exceeds a preset threshold.
In the present embodiment, the candidate-word cohesion computing unit may compute the cohesion of a candidate word by the following formula:
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
In the present embodiment, the candidate-word freedom computing unit may compute the degree of freedom of the candidate word by the following formula:
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..K} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and K denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..M} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and M denotes the number of elements in the left-neighbor character set of the candidate word.
The word-discovery device based on a large-scale corpus of the present embodiment is used to implement the corresponding word-discovery method of Embodiment One above and has the beneficial effects of the corresponding method embodiment, which are not repeated here.
The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be realized by means of software plus a necessary general hardware platform, and naturally also by hardware. Based on this understanding, the above technical solution, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the method described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or replace some of the technical features with equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. A word-discovery method based on a large-scale corpus, characterized by comprising the steps of:
computing, from the collected corpus, the cohesion of each candidate word and the degree of freedom of each candidate word;
multiplying the cohesion of the candidate word by its degree of freedom to obtain a word-formation score;
extracting the candidate words whose word-formation score exceeds a preset threshold;
wherein the cohesion of the candidate word is obtained by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies;
wherein the cohesion of the candidate word is
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word and d denotes the length of the candidate word,
S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
2. The method according to claim 1, characterized in that:
the degree of freedom of the candidate word is the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word.
3. The method according to claim 2, characterized in that:
the degree of freedom of the candidate word is
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..U} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and U denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{i=1..q} p_mi log p_mi
where m_i belongs to the left-neighbor character set of the candidate word, p_mi denotes the frequency with which m_i appears to the left of the candidate word, and q denotes the number of elements in the left-neighbor character set of the candidate word.
4. A word-discovery device based on a large-scale corpus, comprising a corpus collection unit, characterized by further comprising:
a candidate-word cohesion computing unit, for computing the cohesion of candidate words;
a candidate-word freedom computing unit, for computing the degree of freedom of candidate words;
a word-formation score computing unit, for multiplying the cohesion of a candidate word by its degree of freedom to obtain a word-formation score;
a word-discovery unit, for extracting the candidate words whose word-formation score exceeds a preset threshold;
wherein the candidate-word cohesion computing unit is further configured to obtain the cohesion of a candidate word by computing the information entropy between the characters of the candidate word in the corpus together with their word frequencies;
wherein the candidate-word cohesion computing unit is further configured to compute the cohesion of the candidate word by the following formula:
T = min_{1≤j≤d} p_j / max_{1≤i≤d-1} (S'_i × S″_{i+1})
where T denotes the internal cohesion of the candidate word, d denotes the length of the candidate word, and S'_i is the right entropy of the i-th character of the candidate word,
S'_i = −Σ_{j=1..K} (n_kj / n_i) log(n_kj / n_i)
where n_kj denotes the number of times the element "k_j" of the right-neighbor character set of the i-th character appears to its right, n_i denotes the number of times the i-th character occurs in the corpus, and K denotes the number of elements in the right-neighbor character set of the i-th character;
S″_{i+1} denotes the left entropy of the (i+1)-th character of the candidate word,
S″_{i+1} = −Σ_{j=1..M} (n_mj / n_{i+1}) log(n_mj / n_{i+1})
where n_mj denotes the number of times the element "m_j" of the left-neighbor character set of the (i+1)-th character appears to its left, n_{i+1} denotes the number of times the (i+1)-th character of the candidate word occurs in the corpus, and M denotes the number of elements in the left-neighbor character set of the (i+1)-th character;
p_j denotes the frequency of the candidate word relative to its j-th character,
p_j = n / n_j
where n denotes the number of times the candidate word of length d occurs in the corpus, and n_j denotes the number of times the j-th character occurs in the corpus.
5. The device according to claim 4, characterized in that:
the candidate-word freedom computing unit is further configured to take the smaller of the information entropies of the left- and right-neighbor character sets of the candidate word as its degree of freedom.
6. The device according to claim 5, characterized in that:
the candidate-word freedom computing unit is further configured to compute the degree of freedom of the candidate word by the following formula:
H = min{S', S″}
where H denotes the degree of freedom of the candidate word and S' denotes the right entropy of the candidate word,
S' = −Σ_{i=1..U} p_bi log p_bi
where b_i belongs to the right-neighbor character set of the candidate word, p_bi denotes the frequency with which b_i appears to the right of the candidate word, and U denotes the number of elements in the right-neighbor character set of the candidate word;
S″ is the left entropy of the candidate word,
S″ = −Σ_{j=1..q} p_mj log p_mj
where m_j belongs to the left-neighbor character set of the candidate word, p_mj denotes the frequency with which m_j appears to the left of the candidate word, and q denotes the number of elements in the left-neighbor character set of the candidate word.
CN201610429967.0A 2016-06-16 2016-06-16 Word-discovery method and apparatus based on a large-scale corpus Active CN106126495B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610429967.0A CN106126495B (en) 2016-06-16 2016-06-16 Word-discovery method and apparatus based on a large-scale corpus


Publications (2)

Publication Number Publication Date
CN106126495A CN106126495A (en) 2016-11-16
CN106126495B true CN106126495B (en) 2019-03-12

Family

ID=57469834

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610429967.0A Active CN106126495B (en) 2016-06-16 2016-06-16 Word extraction method and apparatus based on a large-scale corpus

Country Status (1)

Country Link
CN (1) CN106126495B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108845982B (en) * 2017-12-08 2021-08-20 昆明理工大学 Chinese word segmentation method based on word association characteristics
CN112182448A (en) * 2019-07-05 2021-01-05 百度在线网络技术(北京)有限公司 Page information processing method, device and equipment
CN110991173B (en) * 2019-11-29 2023-09-29 支付宝(杭州)信息技术有限公司 Word segmentation method and system
CN115034211B (en) * 2022-05-19 2023-04-18 一点灵犀信息技术(广州)有限公司 Unknown word discovery method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105260362A (en) * 2015-10-30 2016-01-20 小米科技有限责任公司 New word extraction method and device
CN105488098A (en) * 2015-10-28 2016-04-13 北京理工大学 Field difference based new word extraction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design and Implementation of a Mobile Application User Feedback Management System; Lin Zhenbin (林贞斌); China Master's Theses Full-text Database, Information Science and Technology; 2015-03-15; page 10, paragraph 5 to page 14, paragraph 4

Similar Documents

Publication Publication Date Title
WO2019227710A1 (en) Network public opinion analysis method and apparatus, and computer-readable storage medium
CN110968684B (en) Information processing method, device, equipment and storage medium
CN110457672B (en) Keyword determination method and device, electronic equipment and storage medium
CN106126495B (en) Word extraction method and apparatus based on a large-scale corpus
KR20190038751A (en) User keyword extraction apparatus, method and computer readable storage medium
CN111460153B (en) Hot topic extraction method, device, terminal equipment and storage medium
US20140344230A1 (en) Methods and systems for node and link identification
CN105630767B Text similarity comparison method and device
CN105357586A (en) Video bullet screen filtering method and device
RU2016122051A (en) METHOD AND DEVICE FOR RECOGNIZING IMAGE OBJECT CATEGORY
CN103279478A Document feature extraction method based on distributed mutual information
CN109657058A Method for extracting announcement information
CN105787121B Microblog event summary extraction method based on multiple story lines
MY189086A (en) System and method for dynamic entity sentiment analysis
CN105550359B Webpage ranking method and device based on vertical search, and server
CN110188359B (en) Text entity extraction method
Alghamdi et al. Topic detections in Arabic dark websites using improved vector space model
CN112883730B (en) Similar text matching method and device, electronic equipment and storage medium
CN110674365A (en) Searching method, device, equipment and storage medium
US9830533B2 (en) Analyzing and exploring images posted on social media
CN105512300B Information filtering method and system
CN107688621B (en) Method and system for optimizing file
CN108536676A (en) Data processing method, device, electronic equipment and storage medium
CN108763192A Entity relation extraction method and device for text processing
CN108460016A Entity name analysis and recognition method

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant