CN110413998A

CN110413998A - Self-adaptive Chinese word segmentation method, system and medium for power industry

Info

Publication number: CN110413998A
Application number: CN201910638948.2A
Authority: CN
Inventors: 张云翔; 饶竹一
Original assignee: Shenzhen Power Supply Co ltd
Current assignee: Shenzhen Power Supply Co ltd
Priority date: 2019-07-16
Filing date: 2019-07-16
Publication date: 2019-11-05
Anticipated expiration: 2039-07-16
Also published as: CN110413998B

Abstract

The invention relates to an adaptive Chinese word segmentation method for the power industry, and a system and medium thereof. The method comprises the following steps: S1, obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented; S2, splitting the candidate text terms to obtain a plurality of candidate text sentences; S3, segmenting each candidate text sentence to obtain one or more word segments; S4, replacing the word segments in the candidate text terms, one by one, with words of the same meaning and performing semantic discrimination; if ambiguity arises, returning to S3, and if there is no ambiguity, retaining the word segment as a candidate word segment; S5, obtaining one or more power-domain technical terms semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and those terms, and determining the final word segments according to the similarity; and S6, sorting the final word segments by the frequency with which they occur in the candidate text terms, and outputting them.

Description

An adaptive Chinese word segmentation method for the power industry, and a system and medium thereof

Technical field

The present invention relates to the technical field of power equipment data processing, and in particular to an adaptive Chinese word segmentation method for the power industry, together with a corresponding system and computer-readable storage medium.

Background technique

In recent years, as networks have become increasingly widespread, the volume of text on the internet has grown steadily and information resources have continued to increase. To retrieve and mine valuable information from this mass of resources, internet companies have invested heavily in natural language processing technology. Chinese word segmentation is the basis and prerequisite of natural language processing; it plays an important role in information processing tasks such as information retrieval, machine translation and information filtering, and is both a key technology and a difficulty of information processing. To date, the national grid company has established a large number of data management systems, and the volume of business data is enormous.

This gives rise to the following technical problem: because each business department and each business system defines data by different rules, the same source data may appear under inconsistent names (among other inconsistencies) in different business systems. One piece of data thus has multiple sources, which makes it difficult to keep data consistent across business systems.

Summary of the invention

An object of the invention is to provide an adaptive Chinese word segmentation method for the power industry, together with a corresponding system and computer-readable storage medium, so as to solve the above technical problem.

To achieve this object, according to a first aspect of the present invention, an embodiment provides an adaptive Chinese word segmentation method for the power industry, comprising the following steps:

Step S1: obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;

Step S2: splitting the candidate text terms to obtain a plurality of candidate text sentences;

Step S3: segmenting each candidate text sentence to obtain one or more word segments;

Step S4: replacing the word segments in the candidate text terms, one by one, with words of the same meaning and performing semantic discrimination; if ambiguity arises between the text terms before and after replacement, returning to step S3; if there is no ambiguity between them, retaining the word segment as a candidate word segment;

Step S5: obtaining one or more power-domain technical terms semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and those terms, and determining the final word segments according to the similarity;

Step S6: sorting the final word segments by the frequency with which they occur in the candidate text terms, and outputting them.
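Step S4 can be pictured as a substitution test. The sketch below is only an illustration under assumptions: the synonym table `synonyms` and the ambiguity check `is_ambiguous` are hypothetical stand-ins supplied by the caller, since the method does not state how synonyms are found or how ambiguity is judged. The return to step S3 is represented simply by reporting which word segments were rejected.

```python
from typing import Callable


def screen_by_substitution(
    text: str,
    segments: list[str],
    synonyms: dict[str, list[str]],
    is_ambiguous: Callable[[str, str], bool],
) -> tuple[list[str], list[str]]:
    """Return (kept, rejected) word segments after a synonym-substitution test."""
    kept, rejected = [], []
    for seg in segments:
        # Build one variant of the text per synonym of this segment.
        variants = [text.replace(seg, syn) for syn in synonyms.get(seg, [])]
        if any(is_ambiguous(text, variant) for variant in variants):
            rejected.append(seg)  # would be sent back to S3 for re-segmentation
        else:
            kept.append(seg)      # retained as a candidate word segment
    return kept, rejected


# Toy ambiguity check (assumption): a fixed set of texts treated as ambiguous.
ambiguous_texts = {"主变断路器跳闸"}
kept, rejected = screen_by_substitution(
    "主变开关跳闸",
    ["主变", "开关"],
    {"主变": ["主变压器"], "开关": ["断路器"]},
    lambda before, after: after in ambiguous_texts,
)
print(kept, rejected)
```

A word segment with no entry in `synonyms` produces no variants and is therefore kept; a real implementation would need to decide how to treat that case.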

Preferably, the step S2 includes:

Splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text portions, and removing the punctuation marks and spaces from those text portions to obtain a plurality of text sentences to be filtered;

Judging whether a character in each text sentence to be filtered is a power-industry technical word segment; if so, extracting all identical characters in the text sentence and segmenting them into words; if not, extracting all identical characters in the text sentence and discarding them. Here, segmenting into words means cutting the character together with the text that follows it, which yields a candidate text sentence.
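The first part of step S2, splitting at punctuation marks and spaces, might look like the following minimal sketch. The set of separator characters is an assumption; the text does not enumerate which punctuation marks are used, and the function name `split_text_term` is hypothetical.

```python
import re

# Assumed separators: whitespace plus common ASCII and Chinese punctuation marks.
_SEPARATORS = re.compile(r"[\s,.;:!?，。；：！？、（）()]+")


def split_text_term(term: str) -> list[str]:
    """Split a candidate text term on punctuation and spaces, dropping empty parts."""
    return [part for part in _SEPARATORS.split(term) if part]


print(split_text_term("主变压器 油温过高，请检查冷却器。"))
```

Because the separators are consumed by the split, the resulting text sentences contain no punctuation or spaces, which matches the "remove punctuation and spaces" requirement of this step.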

Preferably, the step S3 includes:

Extracting from the candidate text sentence the words that match words in a dictionary database, and using them as word segments; wherein the words in the dictionary database are the words of a segmentation dictionary dedicated to the power domain.
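One common way to extract dictionary words from a sentence is forward maximum matching. The text does not say which matching strategy is used, so the sketch below is a swapped-in illustration rather than the claimed procedure; `dictionary_segment` and the sample dictionary are hypothetical.

```python
def dictionary_segment(sentence: str, dictionary: set[str]) -> list[str]:
    """Forward maximum matching that keeps only words found in the dictionary."""
    if not dictionary:
        return []
    max_len = max(len(word) for word in dictionary)
    segments = []
    i = 0
    while i < len(sentence):
        # Try the longest possible piece first, then shorter ones.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if piece in dictionary:
                segments.append(piece)
                i += size
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return segments


power_dictionary = {"变压器", "主变压器", "油温", "冷却器"}
print(dictionary_segment("主变压器油温过高", power_dictionary))
```

Characters that start no dictionary word are skipped, so a sentence can legitimately produce an empty list.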

Preferably, the step S4 includes:

When one candidate text sentence corresponds to multiple candidate word segments, calculating, for each candidate word segment in that sentence, its similarity values to one or more power-domain technical terms, and accumulating them to obtain the similarity value of that candidate word segment;

Selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.
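The accumulate-and-select rule can be sketched as follows. The similarity measure is not specified in the text; character-level Jaccard similarity is used here purely as a placeholder assumption, and the function names are hypothetical.

```python
from typing import Optional


def char_jaccard(a: str, b: str) -> float:
    """Placeholder similarity (assumption): Jaccard overlap of character sets."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def pick_final_segment(candidates: list[str], domain_terms: list[str]) -> Optional[str]:
    """Sum each candidate's similarity to every domain term and keep the top scorer."""
    if not candidates:
        return None
    scores = {
        cand: sum(char_jaccard(cand, term) for term in domain_terms)
        for cand in candidates
    }
    return max(candidates, key=lambda cand: scores[cand])


print(pick_final_segment(["变压", "变压器"], ["变压器", "主变压器"]))
```

Any other similarity function with the same signature, for example one based on word embeddings, could be substituted for `char_jaccard` without changing the selection logic.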

Preferably, the step S6 includes:

Outputting the sorted final word segments separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final word segments.

According to a second aspect of the present invention, an embodiment provides an adaptive Chinese word segmentation system for the power industry, comprising:
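The ordering and top-ten selection of step S6 can be illustrated with a short sketch. Counting occurrences with `str.count` and the function name `rank_segments` are assumptions made for illustration; how highlighted and hidden results are rendered is left to the display layer.

```python
def rank_segments(
    final_segments: list[str], text: str, top_n: int = 10
) -> tuple[list[str], list[str]]:
    """Sort distinct final word segments by how often they occur in the text.

    Returns (shown, hidden): the first top_n segments and the remainder.
    """
    unique = list(dict.fromkeys(final_segments))  # de-duplicate, keep first-seen order
    ranked = sorted(unique, key=lambda seg: text.count(seg), reverse=True)
    return ranked[:top_n], ranked[top_n:]


text = "变压器油温过高，变压器冷却器故障，油温异常，变压器停运"
shown, hidden = rank_segments(["油温", "冷却器", "变压器"], text, top_n=2)
print(" ".join(shown))  # highlighted results, separated by spaces
print(" ".join(hidden))  # results hidden until the user asks for them
```

Python's sort is stable, so word segments with equal frequency keep the order in which they were first produced.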

A text acquisition unit, for obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;

A text splitting unit, for splitting the candidate text terms to obtain a plurality of candidate text sentences;

A word segmentation unit, for segmenting each candidate text sentence to obtain one or more word segments;

A first word segment screening unit, for replacing the word segments in the candidate text terms, one by one, with words of the same meaning and performing semantic discrimination; if ambiguity arises between the text terms before and after replacement, returning to step S3; if there is no ambiguity between them, retaining the word segment as a candidate word segment;

A second word segment screening unit, for obtaining one or more power-domain technical terms semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and those terms, and determining the final word segments according to the similarity;

An output unit, for sorting the final word segments by the frequency with which they occur in the candidate text terms and outputting them.

Preferably, the text segmentation unit includes:

A first splitting unit, for splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text portions, and removing the punctuation marks and spaces from those text portions to obtain a plurality of text sentences to be filtered;

A second splitting unit, for judging whether a character in each text sentence to be filtered is a power-industry technical word segment; if so, extracting all identical characters in the text sentence and segmenting them into words; if not, extracting all identical characters in the text sentence and discarding them. Here, segmenting into words means cutting the character together with the text that follows it, which yields a candidate text sentence.

Preferably, the word segmentation unit is specifically configured to extract from the candidate text sentence the words that match words in a dictionary database and use them as word segments; wherein the words in the dictionary database are the words of a segmentation dictionary dedicated to the power domain;

The output unit includes:

A similarity calculation unit, for calculating, when one candidate text sentence corresponds to multiple candidate word segments, the similarity values between each candidate word segment in that sentence and one or more power-domain technical terms, and accumulating them to obtain the similarity value of that candidate word segment;

A final word segment determination unit, for selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.

Preferably, the output unit includes:

A display unit, for outputting the sorted final word segments separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final word segments.

According to a third aspect of the present invention, an embodiment provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the adaptive Chinese word segmentation method for the power industry.

In the embodiments of the present invention, a segmentation dictionary dedicated to the power domain is built according to the characteristics of power data. Candidate text sentences are split against the words in this dictionary and checked for ambiguity to obtain candidate word segments, and the final word segments are then determined from the similarity between each candidate word segment and similar words in the dictionary. This greatly improves segmentation accuracy and, by supporting data matching and analysis across business systems, can significantly improve working efficiency and data utilization.

Other features and advantages of the present invention will be set forth in the description that follows; in part they will become apparent from the description, or may be learned by practicing the invention. The objects and other advantages of the invention can be realized and obtained through the structures particularly pointed out in the description, the claims and the drawings. Of course, no single product or method implementing the invention need achieve all of the above advantages at once.

Brief description of the drawings

To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow chart of an adaptive Chinese word segmentation method for the power industry according to Embodiment 1 of the present invention.

Fig. 2 is a schematic diagram of an adaptive Chinese word segmentation system for the power industry according to Embodiment 2 of the present invention.

Specific embodiment

Various exemplary embodiments, features and aspects of the disclosure are described in detail below with reference to the drawings. Identical reference numerals in the drawings denote elements with identical or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically noted.

In addition, numerous specific details are given in the embodiments below to better illustrate the invention. Those skilled in the art will understand that the invention can be practiced without some of these details. In some instances, means well known to those skilled in the art are not described in detail, so as to highlight the gist of the invention.

As shown in Fig. 1, Embodiment 1 of the present invention provides an adaptive Chinese word segmentation method for the power industry, comprising the following steps:

Wherein, the step S2 is specifically included:

Specifically, for one text sentence to be filtered, the first character is extracted and judged as to whether it is a power-industry technical word segment. If so, all identical characters in the text sentence are extracted and segmented into words; if not, all identical characters in the text sentence are extracted and discarded. The judgment then continues with the subsequent characters until the last character of the text sentence to be filtered has been taken out, which completes the filtering of the candidate text sentence. To make the judgment, each character taken from the text sentence is compared against the power-industry vocabulary, using the constructed power-industry terminology table and the everyday-vocabulary segmentation dictionary.
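The character-by-character filtering admits more than one reading. The sketch below shows one possible interpretation, offered only as an assumption: a character is kept when it occurs in some power-domain term, and each distinct character is judged once so that identical characters share the same decision. The function name `filter_sentence` is hypothetical.

```python
def filter_sentence(sentence: str, domain_terms: list[str]) -> str:
    """Keep characters that occur in any domain term; drop all other characters."""
    domain_chars = {ch for term in domain_terms for ch in term}
    decided: dict[str, bool] = {}  # one decision per distinct character
    kept = []
    for ch in sentence:
        if ch not in decided:
            decided[ch] = ch in domain_chars  # judged only on first sight
        if decided[ch]:
            kept.append(ch)
    return "".join(kept)


print(filter_sentence("请检查变压器油温", ["变压器", "油温"]))
```

Caching the decision per distinct character reflects the stated aim of not comparing every character individually.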

Wherein, the step S3 includes:

Specifically, the words that match words in the dictionary database, together with semantically similar words, are extracted; one candidate text sentence may yield zero or more word segments.

Wherein, the step S4 includes:

Specifically, one candidate text sentence may correspond to multiple candidate word segments. In this step these candidates are screened by similarity value so that each candidate text sentence finally outputs only one word segment, which reduces the segmentation error rate.

Wherein, the step S6 includes:

Specifically, in this embodiment each computed word segmentation result is sorted by frequency of occurrence, and the sorted results are output separated by spaces. The top ten are highlighted and the subsequent results are hidden; when they need to be viewed, the user can click the corresponding button to display the remaining results. All word segmentation results are also output to a display device as a bar chart for the user.

In the embodiments of the present invention, word segment data are drawn from the segmentation dictionary dedicated to the power domain. The extracted candidate text terms are split at punctuation marks and spaces into multiple text sentences. This preprocessing reduces the segmentation interference caused by punctuation and spaces in the text terms and improves preprocessing efficiency, addressing the efficiency problem of existing text-term processing. Characters are then taken from each split text sentence one at a time and compared to judge whether each is a power-industry word segment, until the last character in the sentence has been taken out. Because all identical characters are taken out together, not every character has to be compared individually, which reduces the workload of character comparison and makes it more efficient. The filtered candidate text terms are then segmented, and the resulting word segments are checked for ambiguity until no word segment is ambiguous. This reduces the cases in which ambiguity remains after segmentation and would mislead the user, and so improves segmentation accuracy. Finally, weight scores are calculated for all word segmentation results and accumulated, the result with the largest value is selected, and the results are sorted by frequency of occurrence and output. The sorted output is more intuitive and better organized, so that the user can follow it more clearly, which significantly improves working efficiency and data utilization.

As shown in Fig. 2, Embodiment 2 of the present invention provides an adaptive Chinese word segmentation system for the power industry, comprising:

A text acquisition unit 1, for obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;

A text splitting unit 2, for splitting the candidate text terms to obtain a plurality of candidate text sentences;

A word segmentation unit 3, for segmenting each candidate text sentence to obtain one or more word segments;

A first word segment screening unit 4, for replacing the word segments in the candidate text terms, one by one, with words of the same meaning and performing semantic discrimination; if ambiguity arises between the text terms before and after replacement, returning to step S3; if there is no ambiguity between them, retaining the word segment as a candidate word segment;

A second word segment screening unit 5, for obtaining one or more power-domain technical terms semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and those terms, and determining the final word segments according to the similarity;

An output unit 6, for sorting the final word segments by the frequency with which they occur in the candidate text terms and outputting them.

Wherein, the text segmentation unit 2 includes:

Wherein, the word segmentation unit 3 is specifically configured to extract from the candidate text sentence the words that match words in a dictionary database and use them as word segments; wherein the words in the dictionary database are the words of a segmentation dictionary dedicated to the power domain;

The output unit 6 includes:

Wherein, the output unit 6 includes:

It should be noted that the system of Embodiment 2 corresponds to the method of Embodiment 1 and is used to implement that method. Accordingly, any aspects of the system of Embodiment 2 not described here can be found in the description of the method of Embodiment 1 and are not repeated.

It should also be understood that the method of Embodiment 1 and the system of Embodiment 2 can be implemented in many ways, including as a process, a device or a system. The method described herein may be implemented in part by program instructions that direct a processor to perform the method, the instructions being recorded on a non-transitory computer-readable storage medium such as a hard disk drive, floppy disk, optical disc (compact disc (CD) or digital versatile disc (DVD)) or flash memory. In some embodiments, the program instructions may be stored remotely and transmitted over a network via an optical or electronic communication link.

Embodiment 3 of the present invention provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the adaptive Chinese word segmentation method for the power industry described in Embodiment 1.

Embodiments of the present invention have been described above. The description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application or their technical improvement over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims

1. An adaptive Chinese word segmentation method for the power industry, characterized by comprising the following steps:

Step S4: replacing the word segments in the candidate text terms, one by one, with words of the same meaning and performing semantic discrimination; if ambiguity arises between the text terms before and after replacement, returning to step S3; if there is no ambiguity between them, retaining the word segment as a candidate word segment;

Step S5: obtaining one or more power-domain technical terms semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and those terms, and determining the final word segments according to the similarity;

2. The adaptive Chinese word segmentation method for the power industry according to claim 1, characterized in that step S2 includes:

Splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text portions, and removing the punctuation marks and spaces from those text portions to obtain a plurality of text sentences to be filtered;

Judging whether a character in each text sentence to be filtered is a power-industry technical word segment; if so, extracting all identical characters in the text sentence and segmenting them into words; if not, extracting all identical characters in the text sentence and discarding them; wherein segmenting into words means cutting the character together with the text that follows it to obtain a candidate text sentence.

3. The adaptive Chinese word segmentation method for the power industry according to claim 1, characterized in that step S3 includes:

Extracting from the candidate text sentence the words that match words in a dictionary database, and using them as word segments; wherein the words in the dictionary database are the words of a segmentation dictionary dedicated to the power domain.

4. The adaptive Chinese word segmentation method for the power industry according to claim 1, characterized in that step S4 includes:

When one candidate text sentence corresponds to multiple candidate word segments, calculating, for each candidate word segment in that sentence, its similarity values to one or more power-domain technical terms, and accumulating them to obtain the similarity value of that candidate word segment;

5. The adaptive Chinese word segmentation method for the power industry according to claim 4, characterized in that step S6 includes:

Outputting the sorted final word segments separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final word segments.

6. An adaptive Chinese word segmentation system for the power industry, characterized by comprising:

A text acquisition unit, for obtaining candidate text terms, the candidate text terms being short sentences or paragraphs to be segmented;

A first word segment screening unit, for replacing the word segments in the candidate text terms, one by one, with words of the same meaning and performing semantic discrimination; if ambiguity arises between the text terms before and after replacement, returning to step S3; if there is no ambiguity between them, retaining the word segment as a candidate word segment;

A second word segment screening unit, for obtaining one or more power-domain technical terms semantically similar to the candidate word segment, calculating the similarity between the candidate word segment and those terms, and determining the final word segments according to the similarity;

An output unit, for sorting the final word segments by the frequency with which they occur in the candidate text terms and outputting them.

7. The adaptive Chinese word segmentation system for the power industry according to claim 6, characterized in that the text splitting unit includes:

A first splitting unit, for splitting the candidate text terms at punctuation marks and spaces to obtain a plurality of text portions, and removing the punctuation marks and spaces from those text portions to obtain a plurality of text sentences to be filtered;

A second splitting unit, for judging whether a character in each text sentence to be filtered is a power-industry technical word segment; if so, extracting all identical characters in the text sentence and segmenting them into words; if not, extracting all identical characters in the text sentence and discarding them; wherein segmenting into words means cutting the character together with the text that follows it to obtain a candidate text sentence.

8. The adaptive Chinese word segmentation system for the power industry according to claim 6, characterized in that the word segmentation unit is specifically configured to extract from the candidate text sentence the words that match words in a dictionary database and use them as word segments; wherein the words in the dictionary database are the words of a segmentation dictionary dedicated to the power domain;

The output unit includes:

A similarity calculation unit, for calculating, when one candidate text sentence corresponds to multiple candidate word segments, the similarity values between each candidate word segment in that sentence and one or more power-domain technical terms, and accumulating them to obtain the similarity value of that candidate word segment;

A final word segment determination unit, for selecting the candidate word segment with the highest similarity value as the final word segment of the candidate text sentence.

9. The adaptive Chinese word segmentation system for the power industry according to claim 8, characterized in that the output unit includes:

A display unit, for outputting the sorted final word segments separated by spaces, highlighting the top ten in the sorted order, and hiding the remaining final word segments.

10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the adaptive Chinese word segmentation method for the power industry according to any one of claims 1 to 5.