CN111191448A - Word processing method, device, storage medium and processor - Google Patents

Word processing method, device, storage medium and processor

Info

Publication number
CN111191448A
CN111191448A (application CN201911360710.4A)
Authority
CN
China
Prior art keywords
word
text
special
words
tested
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911360710.4A
Other languages
Chinese (zh)
Inventor
王培祎
王艳松
胡彩娥
姚晓明
李香龙
王健
马龙飞
陆斯悦
张禄
徐蕙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Beijing Electric Power Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Beijing Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Beijing Electric Power Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201911360710.4A priority Critical patent/CN111191448A/en
Publication of CN111191448A publication Critical patent/CN111191448A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word processing method, a word processing device, a storage medium and a processor. The method comprises the following steps: acquiring pre-configured special words; acquiring a plurality of texts to be processed; performing word segmentation on each text by using the special words; marking each text obtained after word segmentation; inputting training data into a conditional random field (CRF) model for training, wherein the training data comprises a plurality of groups of data, each group of data comprises a text and the result obtained by marking that text after word segmentation, and the text is one of the plurality of texts; inputting a text to be tested into the CRF model to obtain a marked text to be tested; and obtaining the word segmentation result of the text to be tested according to the marked text to be tested. The invention solves the technical problem of poor word processing performance in the prior art.

Description

Word processing method, device, storage medium and processor
Technical Field
The present invention relates to the field of word processing, and in particular, to a word processing method, apparatus, storage medium, and processor.
Background
Against the background of the ubiquitous power Internet of Things, the 95598 customer service system, as an important component of ubiquitous power Internet of Things applications, records massive amounts of customer information. At present, work order analysis relies mainly on manual statistics; because of insufficient efficiency and similar problems, risk warnings cannot be issued when customer appeals change. With continuous development, more and more unregistered (out-of-vocabulary) words appear in work orders, such as "coal to electricity". How to effectively process text containing unregistered words, comprehensively carry out multi-dimensional research and intelligent analysis of 95598 work orders, and introduce artificial intelligence technologies such as text mining from the natural language field to realize big-data research in the customer service domain has become an urgent need and problem.
In the prior art, hidden Markov models and maximum entropy models are usually adopted. Because of the complexity of entity structures, these models cannot cover all characteristics with simple feature functions; their binarized features only record whether a feature appears, whereas feature intensity also matters in text classification, so binarized features are not optimal. In addition, the training algorithm converges slowly, so the maximum entropy model is computationally expensive and suffers from a serious data sparsity problem.
For the problem of poor word processing performance in the prior art, no effective solution has been proposed so far.
Disclosure of Invention
Embodiments of the present invention provide a word processing method, apparatus, storage medium and processor, so as to at least solve the technical problem of poor word processing performance in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a word processing method including: acquiring pre-configured special words, wherein the special words are special words of a predetermined field, there are a plurality of special words, and each special word is composed of at least one Chinese character; acquiring a plurality of texts to be processed, wherein each of the plurality of texts comprises one or more sentences; performing word segmentation on each text by using the special words, wherein each text after word segmentation consists of one or more words; marking each text obtained after word segmentation, wherein the marks record the position of each character in its word; inputting training data into a conditional random field (CRF) model for training, wherein the training data comprises a plurality of groups of data, each group comprises a text and the result obtained by marking that text after word segmentation, and the text is one of the plurality of texts; inputting a text to be tested into the CRF model to obtain a marked text to be tested; and obtaining a word segmentation result of the text to be tested according to the marked text to be tested, wherein the word segmentation result comprises at least one word.
Optionally, after obtaining the word segmentation result of the text to be tested according to the marked text to be tested, the method further includes: comparing each word in the word segmentation result with the special words to obtain words not included in the special words; and adding the words not included in the special words to the special words.
Optionally, adding a word not included in the special words to the special words comprises: obtaining the words in the word segmentation results of a predetermined number of texts to be tested, and determining the number of times the word not included in the special words appears in those word segmentation results; and adding the word to the special words if that number exceeds a threshold.
Optionally, after adding a word not included in the special words to the special words, the method further comprises: segmenting and marking the plurality of texts again by using the updated special words; and training the CRF model again using the results obtained after the re-segmentation and re-marking as training data.
According to another aspect of the embodiments of the present invention, there is also provided a word processing apparatus, including: a first acquisition module, configured to acquire pre-configured special words, wherein the special words are special words of a predetermined field, there are a plurality of special words, and each special word is composed of at least one Chinese character; a second acquisition module, configured to acquire a plurality of texts to be processed, wherein each of the plurality of texts comprises one or more sentences; a word segmentation module, configured to segment each text by using the special words, wherein each text after word segmentation consists of one or more words; a marking module, configured to mark each text obtained after word segmentation, wherein the marks record the position of each character in its word; a training module, configured to input training data into a conditional random field (CRF) model for training, wherein the training data comprises a plurality of groups of data, each group comprises a text and the result obtained by marking that text after word segmentation, and the text is one of the plurality of texts; an input module, configured to input a text to be tested into the CRF model to obtain a marked text to be tested; and an obtaining module, configured to obtain a word segmentation result of the text to be tested according to the marked text to be tested, wherein the word segmentation result comprises at least one word.
Optionally, the method further comprises: and the adding module is used for comparing each word in the word segmentation result with the special word to obtain a word which is not included in the special word, and adding the word which is not included in the special word into the special word.
Optionally, the adding module is configured to: obtaining words in the word segmentation results of the texts to be tested in a preset number, and determining the times of the words which do not appear in the special words appearing in the word segmentation results of the texts to be tested in the preset number; adding a word not present in the special word to the special word if the number of times exceeds a threshold.
Optionally, after the adding module adds a word that is not included in the special word to the special word, the word segmentation module and the marking module are configured to segment and mark the plurality of texts again by using the added special word; and the training module is used for training the CRF model again by taking the result obtained after the re-word segmentation and marking as training data.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and when the program runs, a device in which the storage medium is located is controlled to execute any one of the above methods.
According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes to perform the method of any one of the above.
In the embodiment of the invention, pre-configured special words are acquired, wherein the special words are special words of a predetermined field, there are a plurality of special words, and each special word is composed of at least one Chinese character; a plurality of texts to be processed are acquired, wherein each of the plurality of texts comprises one or more sentences; each text is segmented by using the special words, wherein each text after word segmentation consists of one or more words; each text obtained after word segmentation is marked, wherein the marks record the position of each character in its word; training data is input into a conditional random field (CRF) model for training, wherein the training data comprises a plurality of groups of data, each group comprises a text and the result obtained by marking that text after word segmentation, and the text is one of the plurality of texts; a text to be tested is input into the CRF model to obtain a marked text to be tested; and a word segmentation result of the text to be tested, comprising at least one word, is obtained according to the marked text to be tested. By performing the corresponding word segmentation processing on the text through the CRF model, the aim of quickly and accurately obtaining word segmentation results is fulfilled, words in the text are effectively identified, the technical effect of improving word processing efficiency is achieved, and the technical problem of poor word processing performance in the prior art is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow diagram of a word processing method according to an embodiment of the invention;
FIG. 2 is a flow diagram of a word processing method in accordance with an alternative embodiment of the present invention;
fig. 3 is a schematic diagram of a word processing apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, an embodiment of a word processing method is provided. It should be noted that the steps illustrated in the flowchart of the accompanying drawings may be executed in a computer system, such as a set of computer-executable instructions, and that, although a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be executed in an order different from that given here.
Fig. 1 is a flowchart of a word processing method according to an embodiment of the present invention, as shown in fig. 1, the method including the steps of:
step S102, acquiring pre-configured special words, wherein the special words are special words in a preset field, the special words are multiple, and the special words consist of at least one Chinese character;
the special words are special words of a predetermined field, wherein the predetermined field may include, but is not limited to, an electric power field, a financial field, a translation field, and the like. Optionally, the predetermined field is a power field, specifically a special word included in the customer service work order. Alternatively, the pre-configured specific words may be stored in a specific thesaurus.
Step S104, acquiring a plurality of texts to be processed, wherein each text in the plurality of texts comprises one or more sentences;
step S106, performing word segmentation on each text by using special words, wherein each text after word segmentation consists of one or more words;
step S108, marking each text obtained after word segmentation, wherein the marks are used for marking the position of each character in the word;
step S110, inputting training data into a conditional random field CRF model for training, wherein the training data comprises a plurality of groups of data, each group of data in the plurality of groups of data comprises a text and a result obtained by marking the text after word segmentation, and one text is one of a plurality of texts;
step S112, inputting a text to be tested into a CRF model to obtain a marked text to be tested;
and S114, obtaining a word segmentation result of the text to be tested according to the marked text to be tested, wherein the word segmentation result comprises at least one word.
Through the above steps, corresponding word segmentation processing can be performed on texts through the CRF model, and the method is also suitable for multi-text classification. The purpose of quickly and accurately obtaining word segmentation results is thus achieved, words in the text are effectively identified, word processing efficiency is improved, and the technical problem of poor word processing performance in the prior art is solved. A code sketch of steps S102 to S114 is given below.
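The following is a minimal end-to-end sketch of steps S102-S114 under stated assumptions: the sklearn_crfsuite library, a common BMES character-position tag scheme, simple character-window features, and illustrative toy data. None of these choices are prescribed by the patent, which only requires that the marks record each character's position in its word.

```python
# Sketch of steps S102-S114 under stated assumptions: BMES position tags,
# character-window features, and the sklearn_crfsuite library. All names
# (tag scheme, features, corpora) are illustrative, not taken from the patent.
import sklearn_crfsuite

def tag_words(words):
    """Step S108: mark each character with its position in its word (B/M/E/S)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags

def char_features(chars, i):
    """Simple character-window features; the patent does not fix a feature set."""
    feats = {"char": chars[i], "is_first": i == 0, "is_last": i == len(chars) - 1}
    if i > 0:
        feats["prev_char"] = chars[i - 1]
    if i < len(chars) - 1:
        feats["next_char"] = chars[i + 1]
    return feats

def to_features(text):
    chars = list(text)
    return [char_features(chars, i) for i in range(len(chars))]

def decode(text, tags):
    """Step S114: rebuild words from the predicted position tags."""
    words, current = [], ""
    for ch, t in zip(text, tags):
        current += ch
        if t in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words

# Training data: texts already segmented with the special lexicon (step S106).
segmented_texts = [["煤改电", "采暖", "电价"], ["用户", "咨询", "停电", "原因"]]
X_train = [to_features("".join(words)) for words in segmented_texts]
y_train = [tag_words(words) for words in segmented_texts]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)                               # step S110: train the CRF model

test_text = "用户咨询煤改电电价"
pred_tags = crf.predict([to_features(test_text)])[0]    # step S112
print(decode(test_text, pred_tags))                     # step S114
```

In practice the training texts, features and regularization settings would come from the 95598 work order corpus; the toy data above only illustrate the data flow.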
Optionally, after obtaining the word segmentation result of the text to be tested according to the marked text to be tested, the method further includes: comparing each word in the word segmentation result with the special word to obtain a word which is not included in the special word; words not present in the special words are added to the special words.
By this method, whether the words in the word segmentation result are special words can be effectively determined; specifically, a word-by-word comparison is adopted, that is, each word in the word segmentation result is compared with the special words to obtain the words not included in the special words. Further, in order to update the special words in time, the words not included in the special words can be added to them, so that the special words are enriched and an important basis is provided for subsequent recognition work.
Optionally, adding a word to the special word that does not appear in the special word comprises: obtaining words in the word segmentation results of the text to be tested in a preset number, and determining the times of the words which do not appear in the special words appearing in the word segmentation results of the text to be tested in the preset number; in the case where the number of times exceeds the threshold value, a word that is not present in the special word is added to the special word.
When updating the special words, in order to reduce workload while effectively retaining valuable special words, a predetermined condition may be set for adding a word not included in the special words, and words used only rarely can be filtered out according to this condition. For example, the number of times a word not included in the special words appears in the segmentation results of a predetermined number of texts to be tested is compared with a preset threshold, and the word is added to the special words if that number is greater than or equal to the threshold. In a specific implementation, the threshold may be set according to the application scenario; optionally, the threshold is 15. A sketch of this update step is given below.
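A small sketch of this update rule, assuming the special lexicon is held as a Python set and the segmentation results as a list of word lists (the storage form is not specified by the patent), could be:

```python
# Sketch of the optional lexicon update: count how often each out-of-lexicon
# word appears across the segmentation results of the tested texts, and add
# it to the special lexicon once the count reaches the threshold (15 is the
# optional value mentioned above; all names are illustrative).
from collections import Counter

def update_lexicon(special_words, segmentation_results, threshold=15):
    # special_words: set of str; segmentation_results: one word list per tested text
    counts = Counter(
        word
        for words in segmentation_results
        for word in words
        if word not in special_words
    )
    new_words = {w for w, n in counts.items() if n >= threshold}
    return special_words | new_words, new_words
```

Calling update_lexicon after each batch of tested texts yields the enlarged lexicon together with the set of newly admitted words.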
Optionally, after adding a word that is not included in the special word to the special word, the method further includes: segmenting and marking a plurality of texts again by using the added special words; and taking the result obtained after the re-word segmentation and marking as training data to train the CRF model again.
In this word processing method, the updated special words can be used to segment and mark the plurality of texts again, and the results obtained after re-segmentation and re-marking are used as training data to train the CRF model again. Training the CRF model in this iterative manner improves the stability of the model.
An alternative embodiment of the invention is described below.
A conditional random field (CRF) is a sequence modeling algorithm: it performs sequence labeling based on sequence information, so new words can be obtained directly, or judged after candidate words have been obtained. It combines the characteristics of the maximum entropy model and the hidden Markov model, is an undirected graph model, and achieves good results in sequence labeling tasks such as word segmentation, part-of-speech tagging and named entity recognition. When performing related tasks such as entity recognition and part-of-speech recognition, the conditional random field operates on word sequences rather than isolated words, and is suitable for the multi-text classification problem.
A conditional random field (CRF) models a random variable Y, corresponding to the hidden state sequence, conditioned on a random variable X, the observation sequence, such that Y forms a Markov random field given X. The general definition of a CRF is P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, w ∼ v), where w ∼ v means that w and v are neighbours in the graph. In general, the CRF is used for sequence modeling, specifically in the form of a linear-chain CRF.
The joint probability distribution of a probabilistic undirected graph can be factorized as

P(Y) = (1/Z) ∏_C Ψ_C(Y_C),

where the product runs over the maximal cliques C of the graph, Ψ_C is the potential function on clique C, and Z is the normalization factor. In a linear-chain CRF, each (I_i, O_i) pair forms a maximal clique, i.e., C = {i} in the above formula, and the linear chain satisfies P(I_i | O, I_1, ..., I_n) = P(I_i | O, I_{i-1}, I_{i+1}).
The CRF modeling formula is therefore as follows:

P(I | O) = (1/Z(O)) exp( Σ_i Σ_k λ_k f_k(O, I_{i-1}, I_i, i) ),

where the f_k are feature functions, the λ_k are their weights, and Z(O) is the normalization factor over all hidden state sequences.
(1) Feature functions
Each feature function is assigned a weight, the features are constructed, and the feature functions are summed with their weights; the sum is normalized to form a probability value. Given an observation sequence, the CRF is used to obtain the probability of the hidden state sequence. The observation sequences may be all observed sequences of the entire corpus; for the sequence labeling problem, samples are predicted and the optimal path is selected by Viterbi decoding.
(2) Sequence labeling process
The learned CRF model is used to find the most probable hidden state sequence on a new sample, and the Viterbi algorithm is used to compute the correct probability of each node (a minimal decoding sketch is given after item (3) below).
(3) Sequence probability process
A specific CRF model is trained and constructed for each batch of data, and according to the different score probabilities of the sequence under each model, the model with the highest score is selected as the required category.
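As referenced in item (2), decoding the most probable hidden state sequence is typically done with the Viterbi algorithm. A minimal decoding sketch, assuming per-position emission scores and tag-transition scores are already available as plain Python structures (a CRF toolkit normally computes these internally), is:

```python
# Minimal Viterbi decoding sketch for item (2): given per-position scores for
# each hidden tag and tag-to-tag transition scores (both assumed available as
# plain dicts/lists), recover the highest-scoring tag sequence.
def viterbi(emission, transition, tags):
    # emission: list over positions of {tag: score}; transition: {(t1, t2): score}
    n = len(emission)
    best = [{t: emission[0][t] for t in tags}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + transition[(p, t)] + emission[i][t]) for p in tags),
                key=lambda x: x[1],
            )
            best[i][t], back[i][t] = score, prev
    # backtrack from the best final tag
    last = max(best[-1], key=best[-1].get)
    path = [last]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))
```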
Compared with the prior art, the invention has the advantages that:
1. The CRF has strong reasoning capability and can use complex, overlapping and non-independent features for training and inference; it can make full use of context information as features and can also incorporate arbitrary external features, so the information available to the model is very rich.
2. The CRF model itself has advantages in combining multiple features and avoids the label bias problem.
3. The CRF performs better and fuses features more strongly; for categories with few examples, such as the time category, the recognition performance of the CRF is clearly higher than that of the maximum entropy (ME) model.
Fig. 2 is a flow chart of a word processing method according to an alternative embodiment of the invention. As shown in fig. 2, the method includes three aspects: training text preprocessing, test text preprocessing and lexicon updating.
The original training data are preprocessed using the electric power special words: the training data are segmented and marked to obtain the preprocessed training text, which is used to train a CRF model with a CRF tool. The test set is then verified with the model trained on the training data: the test-set texts are segmented to obtain the final word segmentation results, and unregistered words and hot words are extracted from those results.
The newly obtained unregistered words and hot words are combined with the electric power special lexicon used for the original training data to obtain a new lexicon; the training set is preprocessed, segmented and marked again, a new CRF model is trained, and these steps are repeated. A schematic sketch of this loop is given below.
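A schematic sketch of this retraining loop, reusing the hypothetical helpers from the earlier sketches (to_features, tag_words, decode, update_lexicon) and representing lexicon-based segmentation by a placeholder function, could be:

```python
# Schematic sketch of the iterative loop described above. segment_with_lexicon
# is a placeholder for lexicon-based segmentation (e.g. a tokenizer loaded with
# the current special lexicon); crf_factory builds a fresh, untrained CRF model.
def train_iteration(raw_texts, test_texts, special_words, segment_with_lexicon, crf_factory):
    # preprocess the training set with the current special lexicon (a set of words)
    segmented = [segment_with_lexicon(t, special_words) for t in raw_texts]
    X = [to_features("".join(words)) for words in segmented]
    y = [tag_words(words) for words in segmented]
    crf = crf_factory()
    crf.fit(X, y)                                            # train a new CRF model
    # segment the test set and harvest unregistered words / hot words
    results = [decode(t, crf.predict([to_features(t)])[0]) for t in test_texts]
    special_words, new_words = update_lexicon(special_words, results)
    return crf, special_words, new_words                     # repeat until no new words appear
```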
The following description will be made in detail by taking the 95598 text as an example.
Based on the 95598 customer service work order electric power opinion database, work order data are imported into the database in real time or in batches. The work order contents are feature-processed using the professional vocabulary of the power system to obtain the feature functions; the feature functions are summed with their weights to form a probability value; the CRF is used to compute the probability of the hidden state sequence; sequence labeling is performed based on the whole training corpus; and the learned CRF model is used to find the most probable hidden state sequence on new samples and to compute the probability of each node.
A specific CRF model is trained for each batch of data; according to the different score probabilities of the sequence under each CRF model, the test-set text is verified with the highest-scoring CRF model, and the corresponding unregistered words and hot words are extracted from the final word segmentation result obtained by segmenting the test-set text.
The extracted unregistered words and hot words are combined with the power system professional vocabulary used when training on the original data to obtain an updated vocabulary; feature extraction and related work are performed on the work order contents again, segmentation and sequence marking are carried out again, and the CRF model is trained iteratively.
In this way, the work of constructing a power system professional vocabulary (containing unregistered words and hot words) and effectively splitting the work order contents in the 95598 text analysis project can be completed.
It should be noted that the above method converts manual statistical work order analysis into natural language processing, which improves overall efficiency; through automatic recognition, customer appeals can be distinguished and managed more accurately; iterative training improves the stability of the model; and a power system special vocabulary is constructed and continuously updated, so that the work order contents are effectively split and unregistered words and hot words are effectively recognized.
Therefore, in 95598 text analysis, the work order contents are split, a CRF model is trained to obtain the hidden state sequences, new hot words and unknown words are discovered and recorded in batches on the test data, and the registered lexicon is gradually perfected. Work order text containing new hot words or unknown words can then be processed effectively, risk warnings about customer appeals can be given accurately, the manual statistical mode is replaced, and working efficiency is improved. In addition, the CRF model can effectively identify unknown words and hot words in the 95598 work order contents.
Example 2
According to another aspect of the embodiments of the present invention, there is also provided an embodiment of an apparatus for performing the word processing method in embodiment 1 above, and fig. 3 is a schematic diagram of a word processing apparatus according to an embodiment of the present invention, as shown in fig. 3, the word processing apparatus includes: a first obtaining module 302, a second obtaining module 304, a word segmentation module 306, a labeling module 308, a training module 310, an input module 312, and a deriving module 314. The word processing device will be described in detail below.
A first obtaining module 302, configured to obtain pre-configured special words, where the special words are special words in a predetermined field, the special words are multiple, and the special words are composed of at least one Chinese character;
a second obtaining module 304, connected to the first obtaining module 302, configured to obtain multiple texts to be processed, where each text in the multiple texts includes one or more sentences;
a word segmentation module 306, connected to the second obtaining module 304, configured to perform word segmentation on each piece of text by using a special word, where each piece of text after word segmentation is composed of one or more words;
a marking module 308, connected to the word segmentation module 306, configured to mark each piece of text obtained after word segmentation, where the mark is used to mark a position of each word in the word;
a training module 310, connected to the labeling module 308, configured to input training data into a conditional random field CRF model for training, where the training data includes multiple sets of data, each set of data in the multiple sets of data includes a text and a result obtained by performing labeling after word segmentation on the text, and a text is one of multiple texts;
an input module 312, connected to the training module 310, for inputting the text to be tested into the CRF model to obtain a marked text to be tested;
an obtaining module 314, connected to the input module 312, configured to obtain a word segmentation result of the text to be tested according to the marked text to be tested, where the word segmentation result includes at least one word.
The device is suitable for multi-text classification, corresponding word segmentation processing can be carried out on the text through the CRF model, and the purpose of quickly and accurately obtaining word segmentation results is achieved, so that the technical effects of effectively identifying words in the text and improving word processing efficiency are achieved, and the technical problem that the word processing effect in the prior art is poor is solved.
It should be noted that the first obtaining module 302, the second obtaining module 304, the word segmentation module 306, the labeling module 308, the training module 310, the input module 312 and the obtaining module 314 correspond to steps S102 to S114 in embodiment 1, and the modules are the same as the corresponding steps in the implementation example and application scenario, but are not limited to the disclosure in embodiment 1. It should be noted that the modules described above as part of an apparatus may be implemented in a computer system such as a set of computer-executable instructions.
Optionally, the method further comprises: and the adding module is used for comparing each word in the word segmentation result with the special word to obtain a word which does not exist in the special word, and adding the word which does not exist in the special word into the special word.
The module can effectively distinguish whether the words in the word segmentation result are special words, and particularly, a word-to-word comparison mode is adopted, for example, each word in the word segmentation result is compared with the special words to obtain words which do not exist in the special words. Further, in order to update the special words in time, words which are not included in the special words can be added into the special words, so that the special words are enriched, and important basis can be provided for subsequent recognition work.
Optionally, the adding module is configured to: obtaining words in the word segmentation results of the text to be tested in a preset number, and determining the times of the words which do not appear in the special words appearing in the word segmentation results of the text to be tested in the preset number; in the case where the number of times exceeds the threshold value, a word that is not present in the special word is added to the special word.
When updating the special words, in order to reduce the workload and effectively reserve valuable special words, a predetermined condition may be set when a word not included in the special words is added to the special words, and some words with a small number of usage may be filtered according to the predetermined condition. For example, the number of times that a word that does not appear in the special word appears in the segmentation results of a predetermined number of texts to be tested is determined to be compared with a preset threshold, and in the case that the number of times is greater than or equal to the threshold, a word that does not appear in the special word is added to the special word. In a specific implementation process, the threshold may be set according to a specific application scenario, and optionally, the threshold is 15.
Optionally, after the adding module adds a word which is not included in the special word to the special word, the word segmentation module and the marking module are used for segmenting and marking the plurality of texts again by using the added special word; and the training module is used for training the CRF model again by taking the result obtained after the re-word segmentation and marking as training data.
In word processing, the added special words can be used for segmenting and marking a plurality of texts again, and the result obtained after re-segmenting and marking is used as training data to train the CRF model again. And training the CRF model in the iteration mode to improve the stability of the model.
Example 3
According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium includes a stored program, and where the program is executed to control a device in which the storage medium is located to perform any one of the above methods.
Example 4
According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program executes to perform any one of the methods described above.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A word processing method, comprising:
acquiring a preset special word, wherein the special word is a special word in a preset field, the special word is multiple, and the special word is composed of at least one Chinese character;
acquiring a plurality of texts to be processed, wherein each text in the plurality of texts comprises one or more sentences;
performing word segmentation on each piece of text by using the special word, wherein each piece of text after word segmentation consists of one or more words;
marking each text obtained after word segmentation, wherein the marks are used for marking the position of each character in the word;
inputting training data into a conditional random field CRF model for training, wherein the training data comprises a plurality of groups of data, each group of data in the plurality of groups of data comprises a text and a result obtained after the text is subjected to word segmentation and then is marked, and the text is one of the texts;
inputting a text to be tested into the CRF model to obtain a marked text to be tested;
and obtaining a word segmentation result of the text to be tested according to the marked text to be tested, wherein the word segmentation result comprises at least one word.
2. The method of claim 1, wherein after deriving the segmentation results of the text to be tested from the labeled text to be tested, the method further comprises:
comparing each word in the word segmentation result with the special word to obtain a word which is not included in the special word;
adding words not present in the special words to the special words.
3. The method of claim 2, wherein adding words to the special word that do not appear in the special word comprises:
obtaining words in the word segmentation results of the texts to be tested in a preset number, and determining the times of the words which do not appear in the special words appearing in the word segmentation results of the texts to be tested in the preset number;
adding a word not present in the special word to the special word if the number of times exceeds a threshold.
4. The method according to claim 3, wherein after adding words to the specialized word that are not in the specialized word, the method further comprises:
segmenting and marking the plurality of texts again by using the added special words;
and taking the result obtained after the re-word segmentation and marking as training data to train the CRF model again.
5. A word processing apparatus, comprising:
a first acquisition module, configured to acquire pre-configured special words, wherein the special words are special words of a predetermined field, there are a plurality of special words, and each special word is composed of at least one Chinese character;
the second acquisition module is used for acquiring a plurality of texts to be processed, wherein each text in the plurality of texts comprises one or more sentences;
the word segmentation module is used for segmenting each text by using the special word, wherein each text after word segmentation consists of one or more words;
the marking module is used for marking each text obtained after word segmentation, wherein the marks are used for marking the position of each character in the word;
the training module is used for inputting training data into a conditional random field CRF model for training, wherein the training data comprises a plurality of groups of data, each group of data in the plurality of groups of data comprises a text and a result obtained after the text is subjected to word segmentation and then is marked, and the text is one of the texts;
the input module is used for inputting the text to be tested into the CRF model to obtain the marked text to be tested;
and the obtaining module is used for obtaining a word segmentation result of the text to be tested according to the marked text to be tested, wherein the word segmentation result comprises at least one word.
6. The apparatus of claim 5, further comprising:
and the adding module is used for comparing each word in the word segmentation result with the special word to obtain a word which is not included in the special word, and adding the word which is not included in the special word into the special word.
7. The apparatus of claim 6, wherein the adding module is configured to:
obtaining words in the word segmentation results of the texts to be tested in a preset number, and determining the times of the words which do not appear in the special words appearing in the word segmentation results of the texts to be tested in the preset number;
adding a word not present in the special word to the special word if the number of times exceeds a threshold.
8. The apparatus according to claim 7, wherein after said adding module adds a word to said special word that is not in said special word,
the word segmentation module and the marking module are used for segmenting and marking the plurality of texts again by using the added special words;
and the training module is used for training the CRF model again by taking the result obtained after the re-word segmentation and marking as training data.
9. A storage medium comprising a stored program, wherein the program, when executed, controls an apparatus in which the storage medium is located to perform the method of any one of claims 1 to 4.
10. A processor configured to execute a program, wherein the program when executed performs the method of any one of claims 1 to 4.
CN201911360710.4A 2019-12-25 2019-12-25 Word processing method, device, storage medium and processor Pending CN111191448A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911360710.4A CN111191448A (en) 2019-12-25 2019-12-25 Word processing method, device, storage medium and processor


Publications (1)

Publication Number Publication Date
CN111191448A true CN111191448A (en) 2020-05-22

Family

ID=70705822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911360710.4A Pending CN111191448A (en) 2019-12-25 2019-12-25 Word processing method, device, storage medium and processor

Country Status (1)

Country Link
CN (1) CN111191448A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114218893A (en) * 2022-02-21 2022-03-22 湖南星汉数智科技有限公司 Hierarchical ordered list identification method and device, computer equipment and storage medium

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015135452A1 (en) * 2014-03-14 2015-09-17 Tencent Technology (Shenzhen) Company Limited Text information processing method and apparatus
CN109992766A (en) * 2017-12-29 2019-07-09 北京京东尚科信息技术有限公司 The method and apparatus for extracting target word



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination