CN110413998A - Self-adaptive Chinese word segmentation method, system and medium for power industry - Google Patents
- Publication number
- CN110413998A CN201910638948.2A
- Authority
- CN
- China
- Prior art keywords
- candidate
- participle
- text
- word
- sentence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y04—INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
- Y04S—SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
- Y04S10/00—Systems supporting electrical power generation, transmission or distribution
- Y04S10/50—Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications
Landscapes
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention relates to a self-adaptive Chinese word segmentation method for the power industry, and a system and medium thereof. The method comprises the following steps: S1, obtaining candidate text terms, which are short sentences or paragraphs to be segmented; S2, splitting the candidate text terms to obtain a plurality of candidate text sentences; S3, segmenting each candidate text sentence to obtain one or more segmented words; S4, replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic judgment, returning to S3 if ambiguity arises, and keeping the segmented word as a candidate segmented word if no ambiguity arises; S5, obtaining one or more power-domain professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and those terms, and determining the final segmented word according to the similarity; and S6, sorting the final segmented words by their frequency of occurrence in the candidate text terms and outputting them.
Description
Technical field
The present invention relates to the technical field of power equipment data processing, and in particular to a self-adaptive Chinese word segmentation method for the power industry, a system thereof, and a computer-readable storage medium.
Background
In recent years, as networks have become increasingly widespread, the volume of text on the Internet has grown steadily and information resources have multiplied. To retrieve and mine valuable information from this mass of resources, Internet companies have invested heavily in natural language processing technology. Chinese word segmentation is the basis and prerequisite of natural language processing; it plays an important role in information processing tasks such as information retrieval, machine translation, and information filtering, and is both a key technology and a difficulty of information processing. To date, the State Grid Corporation has established a large number of data management systems, and the volume of business data is very large.
This gives rise to the following technical problem: because the business departments and business systems define data and information according to different rules, data from the same real-world source appears inconsistently (for example, under different names) in different business systems. One datum thus has multiple sources, which makes it difficult to keep data consistent across the business systems.
Summary of the invention
An object of the invention is to provide a self-adaptive Chinese word segmentation method for the power industry, a system thereof, and a computer-readable storage medium, so as to solve the above technical problem.
To achieve this object, according to a first aspect of the invention, an embodiment of the invention provides a self-adaptive Chinese word segmentation method for the power industry, comprising the following steps:
Step S1, obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
Step S2, splitting the candidate text terms to obtain a plurality of candidate text sentences;
Step S3, segmenting each candidate text sentence to obtain one or more segmented words;
Step S4, replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic judgment; if the text terms before and after replacement are ambiguous, returning to step S3; if they are not ambiguous, retaining the segmented word as a candidate segmented word;
Step S5, obtaining one or more power-domain professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and the one or more power-domain professional terms, and determining the final segmented word according to the similarity;
Step S6, sorting the final segmented words by their frequency of occurrence in the candidate text terms and outputting them.
Preferably, step S2 includes:
Separating the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
Judging whether a character in each text sentence to be filtered is a power-industry professional segmented word; if so, extracting all identical characters in the text sentence and cutting them into words; if not, extracting all identical characters in the text sentence and discarding them; wherein cutting into words means cutting the character together with the text that follows it to obtain a candidate text sentence.
Preferably, step S3 includes:
Extracting from the candidate text sentence the words that correspond to words in a dictionary database, and taking them as segmented words; wherein the words in the dictionary database are the words of a power-domain dedicated segmentation dictionary.
Preferably, step S4 includes:
When one candidate text sentence corresponds to a plurality of candidate segmented words, calculating the similarity value between each candidate segmented word in the candidate text sentence and one or more power-domain professional terms, and accumulating these values to obtain the similarity value of that candidate segmented word;
Selecting the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Preferably, step S6 includes:
Outputting the sorted final segmented words separated by spaces, highlighting the top ten after sorting, and hiding the other final segmentation results.
According to a second aspect of the invention, an embodiment of the invention provides a self-adaptive Chinese word segmentation system for the power industry, comprising:
A text acquisition unit, configured to obtain candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
A text splitting unit, configured to split the candidate text terms to obtain a plurality of candidate text sentences;
A word segmentation unit, configured to segment each candidate text sentence to obtain one or more segmented words;
A first screening unit, configured to replace the segmented words in the candidate text terms one by one with words of the same meaning and perform semantic judgment; if the text terms before and after replacement are ambiguous, returning to step S3; if they are not ambiguous, retaining the segmented word as a candidate segmented word;
A second screening unit, configured to obtain one or more power-domain professional terms semantically similar to the candidate segmented word, calculate the similarity between the candidate segmented word and the one or more power-domain professional terms, and determine the final segmented word according to the similarity;
An output unit, configured to sort the final segmented words by their frequency of occurrence in the candidate text terms and output them.
Preferably, the text splitting unit includes:
A first splitting unit, configured to separate the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and to remove the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
A second splitting unit, configured to judge whether a character in each text sentence to be filtered is a power-industry professional segmented word; if so, extracting all identical characters in the text sentence and cutting them into words; if not, extracting all identical characters in the text sentence and discarding them; wherein cutting into words means cutting the character together with the text that follows it to obtain a candidate text sentence.
Preferably, the word segmentation unit is specifically configured to extract from the candidate text sentence the words that correspond to words in a dictionary database and take them as segmented words; wherein the words in the dictionary database are the words of a power-domain dedicated segmentation dictionary;
The output unit includes:
A similarity calculation unit, configured to, when one candidate text sentence corresponds to a plurality of candidate segmented words, calculate the similarity value between each candidate segmented word in the candidate text sentence and one or more power-domain professional terms, and accumulate these values to obtain the similarity value of that candidate segmented word;
A final segmented word determination unit, configured to select the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Preferably, the output unit includes:
A display unit, configured to output the sorted final segmented words separated by spaces, highlight the top ten after sorting, and hide the other final segmentation results.
According to a third aspect of the invention, an embodiment of the invention provides a computer-readable storage medium on which a computer program is stored, the program implementing the self-adaptive Chinese word segmentation method for the power industry when executed by a processor.
In the embodiments of the invention, a segmentation dictionary library dedicated to the power domain is established in light of the characteristics of electric power data. Candidate text sentences are split according to the words in this dictionary library and subjected to ambiguity judgment to obtain candidate segmented words, and the final segmented word is then determined from the similarity between each candidate segmented word and similar words in the dictionary library, which greatly improves segmentation accuracy. Matching and analyzing data across the business systems on this basis can significantly improve work efficiency and the efficiency of data use.
Other features and advantages of the invention will be set forth in the description that follows, and in part will become apparent from the description or be learned by practicing the invention. The objects and other advantages of the invention can be realized and obtained by the structure particularly pointed out in the description, the claims, and the drawings. Of course, a product or method implementing the invention need not achieve all of the above advantages at the same time.
Brief description of the drawings
To explain the embodiments of the invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flow chart of a self-adaptive Chinese word segmentation method for the power industry in embodiment one of the invention.
Fig. 2 is a schematic diagram of a self-adaptive Chinese word segmentation system for the power industry in embodiment two of the invention.
Detailed description of the embodiments
Various exemplary embodiments, features, and aspects of the disclosure are described in detail below with reference to the drawings. Identical reference numerals in the drawings denote elements that are functionally identical or similar. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.
In addition, numerous specific details are given in the embodiments below in order to better illustrate the invention. A person skilled in the art will understand that the invention can be practiced without certain of these details. In some instances, means well known to those skilled in the art are not described in detail, so as to highlight the gist of the invention.
As shown in Fig. 1, an embodiment of the invention provides a self-adaptive Chinese word segmentation method for the power industry, comprising the following steps:
Step S1, obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
Step S2, splitting the candidate text terms to obtain a plurality of candidate text sentences;
Step S3, segmenting each candidate text sentence to obtain one or more segmented words;
Step S4, replacing the segmented words in the candidate text terms one by one with words of the same meaning and performing semantic judgment; if the text terms before and after replacement are ambiguous, returning to step S3; if they are not ambiguous, retaining the segmented word as a candidate segmented word;
Step S5, obtaining one or more power-domain professional terms semantically similar to the candidate segmented word, calculating the similarity between the candidate segmented word and the one or more power-domain professional terms, and determining the final segmented word according to the similarity;
Step S6, sorting the final segmented words by their frequency of occurrence in the candidate text terms and outputting them.
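The substitution check of step S4 can be illustrated with a minimal sketch. The text does not say how same-meaning words are obtained or how ambiguity is judged, so `synonyms` (a plain dictionary) and `is_ambiguous` (a caller-supplied predicate) are hypothetical placeholders, and the function name is an assumption made only for illustration.

```python
from typing import Callable


def screen_by_substitution(
    text: str,
    words: list[str],
    synonyms: dict[str, list[str]],
    is_ambiguous: Callable[[str, str], bool],
) -> tuple[list[str], list[str]]:
    """Split segmented words into (retained candidates, words to re-segment).

    A word is retained only if none of its same-meaning replacements makes
    the text ambiguous according to the supplied predicate (hypothetical).
    """
    retained: list[str] = []
    to_resegment: list[str] = []
    for word in words:
        ambiguous = any(
            is_ambiguous(text, text.replace(word, alternative))
            for alternative in synonyms.get(word, [])
        )
        (to_resegment if ambiguous else retained).append(word)
    return retained, to_resegment
```

Words in the second list would be handed back to step S3; how the segmentation is varied on that second pass is not specified in the text and is left open here.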
Wherein, step S2 specifically includes:
Separating the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and removing the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
Judging whether a character in each text sentence to be filtered is a power-industry professional segmented word; if so, extracting all identical characters in the text sentence and cutting them into words; if not, extracting all identical characters in the text sentence and discarding them; wherein cutting into words means cutting the character together with the text that follows it to obtain a candidate text sentence.
Specifically, for one text sentence to be filtered, the first character is extracted first, and it is judged whether this character is a power-industry professional segmented word; if so, all identical characters in the text sentence are extracted and cut into words; if not, all identical characters in the text sentence are extracted and discarded. The judgment then continues with the subsequent characters until the last character of the text sentence to be filtered has been taken out, thereby filtering the candidate text sentence. Here, based on the constructed power-industry special-term table and an everyday-vocabulary segmentation dictionary, each character taken from the text sentence is compared against the power-industry dedicated vocabulary to judge whether it is a power-industry dedicated segmented word.
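One possible reading of step S2 is sketched below. The punctuation set, the function names, and the decision to treat a character as "professional" when it occurs in any domain term are all assumptions; the text does not fix these details. The per-character decision is cached so that identical characters are judged only once, which mirrors the "extract all identical characters" wording.

```python
import re

# Assumed separator set (whitespace plus common ASCII and CJK punctuation).
_SEPARATORS = re.compile(r"[\s,.;:!?，。；：！？、]+")


def split_candidate_text(text: str) -> list[str]:
    """Split text at punctuation and spaces, dropping the separators."""
    return [part for part in _SEPARATORS.split(text) if part]


def filter_characters(sentence: str, domain_terms: set[str]) -> str:
    """Keep characters that occur in some domain term; discard the others."""
    domain_chars = {ch for term in domain_terms for ch in term}
    decisions: dict[str, bool] = {}  # one decision per distinct character
    kept = []
    for ch in sentence:
        if ch not in decisions:
            decisions[ch] = ch in domain_chars
        if decisions[ch]:
            kept.append(ch)
    return "".join(kept)
```

For example, `split_candidate_text("变压器 过载，断路器跳闸。")` yields three text sentences to be filtered, and `filter_characters` then removes characters that never appear in the assumed domain vocabulary.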
Wherein, step S3 includes:
Extracting from the candidate text sentence the words that correspond to words in a dictionary database, and taking them as segmented words; wherein the words in the dictionary database are the words of a power-domain dedicated segmentation dictionary.
Specifically, the extracted words are those that correspond to words in the dictionary database, together with semantically similar words; one candidate text sentence may yield zero or more segmented words.
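The dictionary lookup of step S3 could be realized in many ways; the sketch below uses forward maximum matching, a common dictionary-based technique that the text itself does not name, so it should be read as a stand-in rather than the method of the invention. The `lexicon` contents and function name are hypothetical, and the semantically similar words mentioned above are not handled.

```python
def dictionary_segment(sentence: str, lexicon: set[str]) -> list[str]:
    """Forward maximum matching: at each position take the longest
    lexicon word that starts there; skip one character if none matches."""
    if not lexicon:
        return []
    max_len = max(len(word) for word in lexicon)
    words: list[str] = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in lexicon:
                words.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts at this position
    return words
```

A sentence with no dictionary words returns an empty list, which matches the statement that a candidate text sentence may yield zero segmented words.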
Wherein, step S4 includes:
When one candidate text sentence corresponds to a plurality of candidate segmented words, calculating the similarity value between each candidate segmented word in the candidate text sentence and one or more power-domain professional terms, and accumulating these values to obtain the similarity value of that candidate segmented word;
Selecting the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Specifically, one candidate text sentence may correspond to a plurality of candidate segmented words. In this step the candidates are screened by similarity value, so that each candidate text sentence finally outputs only one segmented word, which reduces the segmentation error rate.
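The accumulate-and-select logic can be sketched as follows. The text does not define the similarity measure, so a character-set Jaccard index is used purely as an assumed placeholder; the function names are likewise hypothetical.

```python
from typing import Optional


def char_jaccard(a: str, b: str) -> float:
    """Placeholder similarity: overlap of the two character sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def accumulated_similarity(candidate: str, domain_terms: list[str]) -> float:
    """Sum the candidate's similarity to every domain term."""
    return sum(char_jaccard(candidate, term) for term in domain_terms)


def pick_final_word(candidates: list[str], domain_terms: list[str]) -> Optional[str]:
    """Return the candidate with the highest accumulated similarity."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: accumulated_similarity(c, domain_terms))
```

Any other similarity function with the same signature (for example one based on word embeddings) could replace `char_jaccard` without changing the selection step.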
Wherein, step S6 includes:
Outputting the sorted final segmented words separated by spaces, highlighting the top ten after sorting, and hiding the other final segmentation results.
Specifically, in this embodiment the calculated segmentation results are sorted by frequency of occurrence and output separated by spaces. The top ten after sorting are highlighted and the subsequent results are hidden; when they need to be viewed, the user can click the corresponding button to show the remaining results. All segmentation results are also output to a display device as a bar chart and presented to the user.
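The sorting and display split of step S6 can be sketched as below. Counting frequency with a simple substring count and the function name `rank_by_frequency` are assumptions; the highlighting, the button, and the bar chart are presentation concerns and are not modeled.

```python
def rank_by_frequency(
    final_words: list[str], candidate_text: str, top_n: int = 10
) -> tuple[list[str], list[str]]:
    """Sort distinct final words by how often they occur in the text and
    return (words to highlight, words to hide)."""
    unique = list(dict.fromkeys(final_words))  # de-duplicate, keep order
    ranked = sorted(unique, key=candidate_text.count, reverse=True)
    return ranked[:top_n], ranked[top_n:]


def render(words: list[str]) -> str:
    """Join words with spaces, as the output format requires."""
    return " ".join(words)
```

Because Python's sort is stable, words with equal frequency keep the order in which they were first produced.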
By drawing on the word data in the power-domain dedicated segmentation dictionary, the embodiment of the invention separates the extracted candidate text terms at punctuation marks and spaces, splits them into a plurality of text sentences, and outputs them. This preprocessing reduces the segmentation interference caused by the punctuation marks and spaces contained in the text terms, improves preprocessing efficiency, and addresses the low efficiency of existing text processing. One character of a split text sentence is taken out at a time and compared to judge whether it is a power-industry dedicated segmented word, until the last character of the text sentence has been taken out. Because all identical characters are extracted together, not every character needs to be compared, which reduces the workload of character comparison and makes it more efficient. The filtered candidate text terms are then segmented, and ambiguity judgment is performed on the resulting segmented words until they contain no ambiguity. This reduces the cases in which ambiguity arises after segmentation and prevents users from forming a wrong understanding when viewing the results, thereby increasing the accuracy of segmenting the text data. The weight score of every segmentation result is calculated and accumulated, the result with the largest value is selected, and the results are sorted by frequency of occurrence and output. Sorting the segmented words obtained from the text terms makes them more intuitive and better organized to view, so that the user's thinking is clearer, which significantly improves work efficiency and the efficiency of data use.
As shown in Fig. 2, embodiment two of the invention provides a self-adaptive Chinese word segmentation system for the power industry, comprising:
A text acquisition unit 1, configured to obtain candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;
A text splitting unit 2, configured to split the candidate text terms to obtain a plurality of candidate text sentences;
A word segmentation unit 3, configured to segment each candidate text sentence to obtain one or more segmented words;
A first screening unit 4, configured to replace the segmented words in the candidate text terms one by one with words of the same meaning and perform semantic judgment; if the text terms before and after replacement are ambiguous, returning to step S3; if they are not ambiguous, retaining the segmented word as a candidate segmented word;
A second screening unit 5, configured to obtain one or more power-domain professional terms semantically similar to the candidate segmented word, calculate the similarity between the candidate segmented word and the one or more power-domain professional terms, and determine the final segmented word according to the similarity;
An output unit 6, configured to sort the final segmented words by their frequency of occurrence in the candidate text terms and output them.
Wherein, the text splitting unit 2 includes:
A first splitting unit, configured to separate the candidate text terms at punctuation marks and spaces to obtain a plurality of text parts, and to remove the punctuation marks and spaces from the text parts to obtain a plurality of text sentences to be filtered;
A second splitting unit, configured to judge whether a character in each text sentence to be filtered is a power-industry professional segmented word; if so, extracting all identical characters in the text sentence and cutting them into words; if not, extracting all identical characters in the text sentence and discarding them; wherein cutting into words means cutting the character together with the text that follows it to obtain a candidate text sentence.
Wherein, the word segmentation unit 3 is specifically configured to extract from the candidate text sentence the words that correspond to words in a dictionary database and take them as segmented words; wherein the words in the dictionary database are the words of a power-domain dedicated segmentation dictionary;
The output unit 6 includes:
A similarity calculation unit, configured to, when one candidate text sentence corresponds to a plurality of candidate segmented words, calculate the similarity value between each candidate segmented word in the candidate text sentence and one or more power-domain professional terms, and accumulate these values to obtain the similarity value of that candidate segmented word;
A final segmented word determination unit, configured to select the candidate segmented word with the highest similarity value as the final segmented word of the candidate text sentence.
Wherein, the output unit 6 includes:
A display unit, configured to output the sorted final segmented words separated by spaces, highlight the top ten after sorting, and hide the other final segmentation results.
It should be noted that the system of embodiment two corresponds to the method of embodiment one and is used to implement that method. Any content of the system of embodiment two that is not described here can therefore be obtained by referring to the method of embodiment one, and is not repeated.
It should also be understood that the method of embodiment one and the system of embodiment two can be implemented in many ways, including as a process, a device, or a system. The method described herein can be implemented in part by program instructions that direct a processor to perform the method, the instructions being recorded on a non-transitory computer-readable storage medium such as a hard disk drive, a floppy disk, an optical disc (a compact disc (CD) or digital versatile disc (DVD)), flash memory, and the like. In some embodiments, the program instructions can be stored remotely and sent over a network via an optical or electronic communication link.
Embodiment three of the invention provides a computer-readable storage medium on which a computer program is stored, the program implementing the self-adaptive Chinese word segmentation method for the power industry of embodiment one when executed by a processor.
Various embodiments of the invention have been described above. The description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. a kind of adaptive Chinese word cutting method towards power industry, which comprises the steps of:
Step S1, candidate text terms are obtained, candidate's text terms are short sentence or paragraph to be segmented;
Step S2, processing is split to the candidate text terms and obtains multiple candidate text sentences;
Step S3, cutting is carried out to each candidate text sentence and obtains one or more participles;
Step S4, one by one the participle in candidate text terms is replaced with to anticipate with participle word and identical vocabulary and carries out semanteme and sentence
Not, if ambiguity, return step S3 occur in the text terms of front and back after replacement, if the text terms of front and back do not have discrimination after replacement
Justice then retains the participle as candidate participle;
Step S5, acquisition and the semantic similar one or more power domain specialized vocabularies of candidate participle, calculate candidate participle with
The similarity of one or more power domain specialized vocabularies simultaneously determines final participle according to similarity;
Step S6, it is exported after being ranked up final participle by the frequency that participle occurs in the candidate text terms.
2. the adaptive Chinese word cutting method towards power industry as described in claim 1, which is characterized in that the step S2
Include:
By in the candidate text terms punctuate and space be separated to obtain multiple textual portions, and remove the multiple text
Punctuate and space in this part obtain multiple text sentences to be filtered;
Judge whether the character in each text sentence to be filtered is power industry profession participle, if so, extracting text sentence
In all identical characters and cutting be word, if it is not, then extracting all identical characters in text sentence and giving up;Wherein, described to cut
Being divided into word is that the text after character and character together cutting is obtained candidate text sentence.
3. the adaptive Chinese word cutting method towards power industry as described in claim 1, which is characterized in that the step S3
Include:
Vocabulary corresponding with vocabulary in dictionary database in candidate text sentence is extracted and is segmented;Wherein, institute's predicate
Vocabulary is vocabulary in the dedicated dictionary for word segmentation of power domain in allusion quotation database.
4. the adaptive Chinese word cutting method towards power industry as described in claim 1, which is characterized in that the step S4
Include:
When a candidate text sentence is corresponding with multiple candidate participles, each candidate participle and one in candidate's text sentence is calculated
The similarity value of a or multiple power domain specialized vocabularies simultaneously carries out being accumulated by the corresponding similarity value of candidate participle;
Choose final participle of the highest candidate participle of similarity value as candidate text sentence.
5. the adaptive Chinese word cutting method towards power industry as claimed in claim 4, which is characterized in that the step S6
Include:
Final participle after sequence is exported by interval of space, and the top ten after selected and sorted carries out emphasis and shows,
Other final word segmentation results are then hidden.
6. An adaptive Chinese word segmentation system for the power industry, characterized by comprising:
a text acquisition unit, configured to acquire a candidate text, the candidate text being a short sentence or paragraph to be segmented;
a text segmentation unit, configured to split the candidate text to obtain multiple candidate text sentences;
a word segmentation unit, configured to segment each candidate text sentence to obtain one or more word segments;
a first segment screening unit, configured to replace the word segments in the candidate text one by one with words having the same meaning as the word segment and to perform semantic discrimination; if the text before and after the replacement is ambiguous, return to step S3; if the text before and after the replacement is not ambiguous, retain the word segment as a candidate word segment;
a second segment screening unit, configured to obtain one or more power-domain specialized words semantically similar to the candidate word segment, calculate the similarity between the candidate word segment and the one or more power-domain specialized words, and determine the final word segment according to the similarity;
an output unit, configured to sort the final word segments by the frequency with which they occur in the candidate text and then output them.
7. The adaptive Chinese word segmentation system for the power industry according to claim 6, characterized in that the text segmentation unit comprises:
a first splitting unit, configured to split the candidate text at punctuation marks and spaces to obtain multiple text portions, and to remove the punctuation marks and spaces from the multiple text portions to obtain multiple text sentences to be filtered;
a second splitting unit, configured to judge whether a character in each text sentence to be filtered belongs to a power-industry specialized word segment; if so, to extract all identical characters in the text sentence and segment them into words; if not, to extract all identical characters in the text sentence and discard them; wherein segmenting into words means cutting out a character together with the text that follows it to obtain a candidate text sentence.
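The first splitting unit's behaviour (split at punctuation and spaces, then discard them) can be sketched with a regular expression. The delimiter set below is an assumption, since the claim does not enumerate which punctuation marks count, and `split_candidate_text` is a hypothetical name.

```python
import re

# Assumed delimiter set: whitespace plus common Chinese and ASCII punctuation.
_DELIMITERS = re.compile(r"[\s，。！？；：、,.!?;:]+")


def split_candidate_text(text):
    """Split at punctuation/whitespace and drop the delimiters and empty pieces."""
    return [part for part in _DELIMITERS.split(text) if part]


print(split_candidate_text("主变压器跳闸，断路器 拒动。"))
```

The second splitting unit's domain-word filtering is not covered by this sketch.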
8. The adaptive Chinese word segmentation system for the power industry according to claim 6, characterized in that the word segmentation unit is specifically configured to extract from the candidate text sentence the words that match words in a dictionary database and use them as word segments; wherein the words in the dictionary database are words from a dedicated power-domain word segmentation dictionary;
the output unit comprises:
a similarity calculation unit, configured to, when one candidate text sentence corresponds to multiple candidate word segments, calculate the similarity values between each candidate word segment in the candidate text sentence and one or more power-domain specialized words and accumulate them to obtain the similarity value of that candidate word segment;
a final segment determination unit, configured to select the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
9. The adaptive Chinese word segmentation system for the power industry according to claim 8, characterized in that the output unit comprises:
a display unit, configured to output the sorted final word segments separated by spaces, select the top ten after sorting for highlighted display, and hide the other final word segmentation results.
10. A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the adaptive Chinese word segmentation method for the power industry according to any one of claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638948.2A CN110413998B (en) | 2019-07-16 | 2019-07-16 | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910638948.2A CN110413998B (en) | 2019-07-16 | 2019-07-16 | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110413998A (en) | 2019-11-05 |
CN110413998B (en) | 2023-04-21 |
Family
ID=68361553
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910638948.2A Active CN110413998B (en) | 2019-07-16 | 2019-07-16 | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110413998B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079428A (en) * | 2019-12-27 | 2020-04-28 | 出门问问信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN112257425A (en) * | 2020-09-29 | 2021-01-22 | 国网天津市电力公司 | Power data analysis method and system based on data classification model |
CN112926320A (en) * | 2021-03-24 | 2021-06-08 | 山东亿云信息技术有限公司 | Text key content intelligent extraction method and system based on subject term optimization |
CN114881017A (en) * | 2022-04-25 | 2022-08-09 | 南京烽火星空通信发展有限公司 | Self-adaptive dynamic word segmentation method |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104077275A (en) * | 2014-06-27 | 2014-10-01 | 北京奇虎科技有限公司 | Method and device for performing word segmentation based on context |
CN106844326A (en) * | 2015-12-04 | 2017-06-13 | 北京国双科技有限公司 | A kind of method and device for obtaining word |
CN107608968A (en) * | 2017-09-22 | 2018-01-19 | 深圳市易图资讯股份有限公司 | Chinese word cutting method, the device of text-oriented big data |
CN107918604A (en) * | 2017-11-13 | 2018-04-17 | 彩讯科技股份有限公司 | A kind of Chinese segmenting method and device |
CN109828981A (en) * | 2017-11-22 | 2019-05-31 | 阿里巴巴集团控股有限公司 | A kind of data processing method and calculate equipment |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079428A (en) * | 2019-12-27 | 2020-04-28 | 出门问问信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN111079428B (en) * | 2019-12-27 | 2023-09-19 | 北京羽扇智信息科技有限公司 | Word segmentation and industry dictionary construction method and device and readable storage medium |
CN112257425A (en) * | 2020-09-29 | 2021-01-22 | 国网天津市电力公司 | Power data analysis method and system based on data classification model |
CN112926320A (en) * | 2021-03-24 | 2021-06-08 | 山东亿云信息技术有限公司 | Text key content intelligent extraction method and system based on subject term optimization |
CN112926320B (en) * | 2021-03-24 | 2022-12-27 | 山东亿云信息技术有限公司 | Text key content intelligent extraction method and system based on subject term optimization |
CN114881017A (en) * | 2022-04-25 | 2022-08-09 | 南京烽火星空通信发展有限公司 | Self-adaptive dynamic word segmentation method |
Also Published As
Publication number | Publication date |
---|---|
CN110413998B (en) | 2023-04-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110413998A (en) | Self-adaptive Chinese word segmentation method, system and medium for power industry | |
CN106649783B (en) | Synonym mining method and device | |
US20150074112A1 (en) | Multimedia Question Answering System and Method | |
CN104881458B (en) | A kind of mask method and device of Web page subject | |
JP6394388B2 (en) | Synonym relation determination device, synonym relation determination method, and program thereof | |
CN109710947A (en) | Power specialty word stock generating method and device | |
CN111475607B (en) | Web data clustering method based on Mashup service function feature representation and density peak detection | |
CN108304382A (en) | Mass analysis method based on manufacturing process text data digging and system | |
CN110704638A (en) | Clustering algorithm-based electric power text dictionary construction method | |
CN108536676B (en) | Data processing method and device, electronic equipment and storage medium | |
CN114579104A (en) | Data analysis scene generation method, device, equipment and storage medium | |
CN112784009A (en) | Subject term mining method and device, electronic equipment and storage medium | |
CN111861596A (en) | Text classification method and device | |
CN114239588A (en) | Article processing method and device, electronic equipment and medium | |
CN107577713B (en) | Text handling method based on electric power dictionary | |
CN113779983B (en) | Text data processing method and device, storage medium and electronic device | |
CN110413997A (en) | New word discovery method, system and readable storage medium for power industry | |
CN107291952B (en) | Method and device for extracting meaningful strings | |
CN108733733B (en) | Biomedical text classification method, system and storage medium based on machine learning | |
CN106933797B (en) | Target information generation method and device | |
CN113221538A (en) | Event library construction method and device, electronic equipment and computer readable medium | |
JP4985096B2 (en) | Document analysis system, document analysis method, and computer program | |
CN109446239A (en) | Text method for digging, device and computer readable storage medium under line | |
CN112836529B (en) | Method and device for generating target corpus sample | |
CN115905297B (en) | Method, apparatus and medium for retrieving data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||