CN116011450A - Word segmentation model training method, system, equipment, storage medium and word segmentation method - Google Patents


Info

Publication number
CN116011450A
Authority
CN
China
Prior art keywords
word segmentation
segmentation model
word
training
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310157115.0A
Other languages
Chinese (zh)
Inventor
胡意仪
阮晓雯
吴振宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310157115.0A
Publication of CN116011450A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to artificial intelligence technology and discloses a word segmentation model training method, a word segmentation model training system, a computer device, and a storage medium, suitable for segmenting technical terms in the medical field. Since anchor words are terms strongly related to the field, other domain-related words tend to appear around them; by matching the anchor words against a corpus, text content potentially containing professional terms can be screened out and used to train the word segmentation model. No data labeling is needed, the algorithm is simple, and the efficiency is high.

Description

Word segmentation model training method, system, equipment, storage medium and word segmentation method
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a word segmentation model training method, a word segmentation model training system, a computer device, and a storage medium.
Background
Word segmentation is the basis of natural language processing (e.g., machine translation, automatic summarization, automatic classification, full-text retrieval of document libraries, search engines), and its accuracy directly influences the results of natural language processing.
With the development of artificial intelligence and digital medical technology, word segmentation algorithms are increasingly widely applied. For example, in an intelligent consultation scenario, the text entered by a user is segmented and then semantically analyzed to understand the user's intention, so that the intelligent consultation can proceed on the basis of the user's input. If the text cannot be segmented accurately, the user's input cannot be understood accurately, and the user's intention may even be seriously misread, so that the intelligent consultation program cannot execute normally.
In the medical field there are a large number of specialized terms that differ greatly from everyday vocabulary. This is especially true of traditional Chinese medicine, which has many special terms for explaining etiology and pathogenesis, classifying diseases, and prescribing medication; these terms are mostly compounds of several parts of speech, such as "the lung failing to disperse and descend" and "the liver affecting the stomach", which are traditional Chinese medicine concepts that express the pathogenesis of a disease.
A general-domain word segmentation model cannot semantically understand the pathogenesis descriptions contained in such text, and reusing one usually requires preparing a large amount of domain-labeled text and applying supervised neural-network techniques, which is costly.
Disclosure of Invention
In order to solve the word segmentation problem of technical terms in a specific field, the application provides a word segmentation model training method, a word segmentation model training system, computer equipment and a storage medium.
According to an aspect of an embodiment of the present application, a word segmentation model training method is disclosed, including:
acquiring a term set corresponding to a target field, wherein the term set comprises a plurality of professional terms of the target field;
matching the plurality of technical terms with a corpus, and taking the technical terms matched in the corpus as anchor words; the corpus comprises application texts of at least part of professional terms in the target field;
acquiring text fragments containing the anchor words in the corpus, and taking the text fragments as training data;
training a word segmentation model by adopting the training data until the word segmentation model meets the preset requirement.
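A minimal end-to-end sketch of these four steps is given below. The helper names, the plain substring matching, and the sentence-level fragments are illustrative assumptions, not the reference implementation; the embodiments that follow refine steps S102-S103 considerably.

```python
def find_anchor_words(term_set, corpus):
    """Step 2: terms that literally occur in the corpus become anchor words."""
    return [term for term in term_set if term in corpus]

def extract_fragments(corpus, anchors):
    """Step 3: keep the sentences that contain an anchor word."""
    sentences = [s for s in corpus.split("。") if s]
    return [s for s in sentences if any(a in s for a in anchors)]

term_set = ["肝气郁结", "阴虚火旺"]                    # step 1: domain term set
corpus = "抑郁症的发生与肝气郁结有关。今天天气晴朗。"
anchors = find_anchor_words(term_set, corpus)
fragments = extract_fragments(corpus, anchors)
print(anchors, fragments)  # step 4 would train the model on these fragments
```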
In an exemplary embodiment, the obtaining the text segment including the anchor word in the corpus, and taking the text segment as training data includes:
acquiring sentences in which the anchor words are located in the corpus, and taking the whole sentences in which the anchor words are located as anchor sentences;
intercepting a plurality of first fragments in the anchor sentence by adopting a first sliding window, and acquiring the weight of each word in each first fragment by adopting an unsupervised training algorithm;
intercepting the anchor sentences by adopting a plurality of different second sliding windows respectively to obtain a plurality of second fragments containing the anchor words;
based on the weights of the words in the first segment, obtaining a weight average value of all words in each second segment;
and comparing the weight average value with a first threshold value, and taking a second segment corresponding to the weight average value which is larger than or equal to the first threshold value as training data.
In an exemplary embodiment, said obtaining weights of the words in each of said first segments using an unsupervised training algorithm includes:
and obtaining the weight of each word in each first segment based on a TextRank algorithm.
In an exemplary embodiment, the training the word segmentation model using the training data until the word segmentation model meets a preset requirement includes:
Masking part of the words in the training data, and inputting the data after masking the part of the words into a word segmentation model;
acquiring data reversely generated by the word segmentation model based on the data after masking part characters, and acquiring a first loss function based on the difference between the reversely generated data and the training data;
acquiring a second loss function corresponding to the semantic division performed by the word segmentation model;
and obtaining a total loss function based on the first loss function and the second loss function, and training the word segmentation model until the total loss function meets a preset requirement.
According to an aspect of an embodiment of the present application, a word segmentation method is disclosed, including:
receiving a target text to be segmented;
and performing word segmentation on the target text by using the word segmentation model trained by the word segmentation model training method to obtain a word segmentation result.
In an exemplary embodiment, the word segmentation is performed on the target text by using the word segmentation model trained with the word segmentation model training method described above, to obtain a word segmentation result, including:
dividing the target text into a plurality of words by adopting a plurality of dividing modes;
obtaining the score of each word based on a dynamic programming algorithm;
The word with the highest score is used as the word segmentation result.
In an exemplary embodiment, the obtaining the score of each word based on the dynamic programming algorithm includes:
based on the word segmentation model, respectively acquiring first probabilities corresponding to the multiple segmentation modes;
obtaining a second probability based on the frequency of occurrence of the term in the term set;
based on the first probability and the second probability, obtaining a total probability by adopting an interpolation method;
and obtaining the score of each word by solving a recursive algorithm of the dynamic programming algorithm based on the total probability.
According to an aspect of an embodiment of the present application, a word segmentation model training system is disclosed, including:
the acquisition module is used for acquiring a term set corresponding to the target field, wherein the term set comprises a plurality of professional terms of the target field;
the matching module is used for matching the plurality of technical terms with the corpus, and taking the technical terms matched in the corpus as anchor words; the corpus comprises application texts of at least part of professional terms in the target field;
the processing module is used for acquiring text fragments containing the anchor words in the corpus, and taking the text fragments as training data;
And the training module is used for training the word segmentation model by adopting the training data until the word segmentation model meets the preset requirement.
According to an aspect of embodiments of the present application, a computer device is disclosed, the computer device comprising one or more processors and a memory for storing one or more programs, which when executed by the one or more processors, cause the computer device to implement the aforementioned word segmentation model training method.
According to an aspect of embodiments of the present application, a computer apparatus is disclosed, the computer apparatus comprising one or more processors and a memory for storing one or more programs, which when executed by the one or more processors, cause the computer apparatus to implement the aforementioned word segmentation method.
According to an aspect of an embodiment of the present application, a computer-readable storage medium is disclosed, which stores computer-readable instructions that, when executed by a processor of a computer, cause the computer to perform the aforementioned word segmentation model training method.
According to an aspect of an embodiment of the present application, a computer-readable storage medium storing computer-readable instructions that, when executed by a processor of a computer, cause the computer to perform the aforementioned word segmentation method is disclosed.
The technical scheme provided by the embodiment of the application at least comprises the following beneficial effects:
According to the word segmentation model training scheme, the term set corresponding to the target field is acquired, the professional terms in the term set are matched against a corpus to obtain anchor words, the text fragments in the corpus that contain the anchor words are taken as training data, and the training data are then used to train the word segmentation model. Since anchor words are terms strongly related to the field, other domain-related words tend to appear around them; by matching the anchor words against the corpus, text content potentially containing professional terms can be screened out for training the word segmentation model. No data labeling is needed, the algorithm is simple, and the efficiency is high.
According to the word segmentation model training scheme, the text generation task uses random character masking so that the word segmentation model learns the probability of predicting the next word given an input text segment; the semantic representation task pulls the terms of the term set close to the semantics of their context in the corpus while keeping them as distinguishable as possible from other terms, which amounts to implicitly learning the semantic boundary division between words.
According to the word segmentation scheme using the word segmentation model, the model is combined with dynamic programming: the probabilities output by the trained word segmentation model effectively correct the probability obtained from simple word-frequency statistics over the term set, so the score of each word is obtained more accurately. Further, with a dynamic programming algorithm the score of each word can be calculated recursively over the text segment from back to front, which effectively realizes segmentation of professional terms that are important but uncommon in a specific field. The model only needs to be trained once and the dynamic programming method runs efficiently, so overall the cost can be reduced and the performance improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.
FIG. 1 is a flowchart illustrating a word segmentation model training method, according to an example embodiment.
Fig. 2 is a detailed flowchart of step S103 in the corresponding embodiment of fig. 1.
FIG. 3 is a diagram illustrating a second sliding window word taking according to an example embodiment.
Fig. 4 is a detailed flowchart of step S104 in the corresponding embodiment of fig. 1.
FIG. 5 is a schematic diagram of a word segmentation model, according to an example embodiment.
FIG. 6 is a flowchart illustrating a method of word segmentation according to an example embodiment.
Fig. 7 is a detailed flowchart of step S202 in the corresponding embodiment of fig. 6.
Fig. 8 is a schematic diagram of a cut text shown in accordance with an exemplary embodiment.
Fig. 9 is a detailed flowchart of step S2022 in the corresponding embodiment of fig. 7.
FIG. 10 is a block diagram illustrating a word segmentation model training system, according to an example embodiment.
FIG. 11 is a block diagram illustrating a word segmentation model, according to an example embodiment.
FIG. 12 is a block diagram illustrating the architecture of a computer system for implementing computer devices of embodiments of the present application, according to an exemplary embodiment.
The reference numerals are explained as follows:
101. a coding unit; 102. a decoding unit; 200. the word segmentation model training system; 201. an acquisition module; 202. a matching module; 203. a processing module; 204. a training module; 300. a word segmentation model; 301. a text receiving module; 302. a segmentation module; 303. a score calculation module; 304. a word segmentation module; 400. a computer system; 401. a CPU; 402. a ROM; 403. a storage section; 404. a RAM; 405. a bus; 406. an I/O interface; 407. an input section; 408. an output section; 409. a communication section; 410. a driver; 411. removable media.
Detailed Description
While this application is susceptible of embodiment in different forms, specific embodiments thereof are shown in the drawings and described herein in detail, with the understanding that the present disclosure is to be considered an exemplification of the principles of the application and is not intended to limit the application to what is illustrated herein.
Furthermore, references in this application to the terms "comprising," "including," "having," and any variations thereof are intended to cover a non-exclusive inclusion: a process, method, system, article, or apparatus that comprises a list of steps or modules is not limited to the listed steps or modules but may include other steps or modules not listed or inherent to such process, method, article, or apparatus.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more features.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "for example" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of the word "exemplary" or "for example" or "such as" is intended to present the relevant concepts in a concrete manner.
Exemplary embodiments will be described in detail below. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present application as detailed in the summary.
First, several terms referred to in the present application are explained:
Artificial intelligence (AI): a field of science and technology that studies and develops theories, methods, techniques, and application systems for simulating, extending, and expanding human intelligence. As a branch of computer science, artificial intelligence attempts to understand the nature of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking; it uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
Natural language processing (NLP): a branch of artificial intelligence and an interdisciplinary field of computer science and linguistics, often called computational linguistics, concerned with processing, understanding, and applying human languages (e.g., Chinese, English). Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, and public-opinion analysis and opinion mining, and it draws on data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, and linguistic research related to language computation.
Corpus: a large-scale collection of language material, specifically gathered for one or more application targets, that is structured, representative, and retrievable by computer programs. In essence, a corpus is a set of language samples of a certain size, obtained by random sampling of natural language use, that represents the language practice considered in a study. In statistical natural language processing, a corpus usually refers to large-scale language samples that could not practically be observed in full.
TextRank algorithm: a graph-based ranking algorithm for text. Its basic idea derives from Google's PageRank algorithm: a text is divided into constituent units (words, sentences) and a graph model is built, the important components of the text are ranked by a voting mechanism, and keyword extraction and summarization can be achieved using the information of a single document.
Dynamic programming (DP) algorithm: dynamic programming is typically used to solve an optimization problem, which may have many solutions, each with a value; the goal is to find a solution with the optimal value. Dynamic programming trades planning for computation by decomposing the original problem into a number of overlapping sub-problems that are the same as or similar in form to the original problem but smaller in scale. When solving a problem with dynamic programming, a set of values of the variables associated with a sub-problem is often called a "state"; a state corresponds to one or more sub-problems, and the "value" of a state is the solution of the corresponding sub-problem. The collection of all states constitutes the "state space" of the problem, and the size of the state space is directly related to the time complexity of solving the problem with dynamic programming.
Recursive algorithm: in computer science, a method that solves a problem by repeatedly decomposing it into sub-problems of the same kind. Recursion can be used to solve many problems in computer science and is therefore a very important concept. Most programming languages support functions calling themselves, in which case a function can recursively invoke itself. Computability theory shows that recursion can completely replace loops, so many functional programming languages (e.g., Scheme) customarily implement loops with recursion.
The embodiments of the application provide a word segmentation model training method, a word segmentation model training system, a computer device, and a computer-readable storage medium, which can intelligently segment text of a specific field based on artificial intelligence technology, require no data labeling, and are simple and efficient. The word segmentation model training method and word segmentation method provided by the embodiments can be applied to specialized fields containing a large number of professional terms, such as the medical field, and also to general fields.
The word segmentation model training method, the word segmentation model training system, the computer equipment and the computer readable storage medium provided by the embodiment of the application are specifically described through the following embodiments, and the word segmentation model training method in the embodiment of the application is described first.
The embodiment of the application firstly provides a word segmentation model training method. The word segmentation model training method comprises the following steps:
acquiring a term set corresponding to the target field, wherein the term set comprises a plurality of professional terms of the target field;
matching a plurality of technical terms with a corpus, and taking the technical terms matched in the corpus as anchor words; the corpus comprises application texts of at least part of terms in the target field;
acquiring text fragments containing anchor words in a corpus, and taking the text fragments as training data;
training the word segmentation model by adopting training data until the word segmentation model meets the preset requirement.
According to the technical scheme, the term set corresponding to the target field is acquired, the professional terms in the term set are matched against a corpus to obtain anchor words, the text fragments in the corpus that contain the anchor words are taken as training data, and the training data are then used to train the word segmentation model. Since anchor words are terms strongly related to the field, other domain-related words tend to appear around them; by matching the anchor words against the corpus, text content potentially containing professional terms can be screened out for training the word segmentation model. No data labeling is needed, the algorithm is simple, and the efficiency is high. The scheme is particularly suitable for highly specialized, term-dense fields.
Embodiments of the present application are further elaborated below in conjunction with the drawings in the examples of the present specification.
Referring to fig. 1, the word segmentation model training method according to an exemplary embodiment of the present application includes the following steps S101 to S104.
S101, acquiring a term set corresponding to the target field, wherein the term set comprises a plurality of professional terms of the target field.
In detail, the term set includes a term list of the target field and a term frequency (TF) for each term, where the term frequency is the frequency with which the term occurs in the term set. The professional terms contained in the term set may be collected in advance, or an existing technical-term lexicon may be used directly.
The target field is not limited to a specific field or specific fields, and may be, for example, a medical field.
One word segmentation model may be applied to word segmentation in one field or in several fields. It can be understood that if one word segmentation model is to serve several fields, the term set corresponding to each field can be acquired separately and the model trained further with the word segmentation model training method of the present application.
S102, matching a plurality of technical terms with a corpus, and taking the technical terms matched in the corpus as anchor words.
The corpus comprises application texts of at least part of terms in the target field. The corpus may be, for example, books, internet encyclopedia data, APP data, etc.
In detail, in step S102 a string-matching technique is used to match the terms against the corpus. Besides determining the anchor words, the sentences surrounding each anchor word in the corpus can be obtained, and further the related position information of the anchor word in the corpus, such as which paragraph and which sentence of the corpus text the anchor word appears in, and its position within that sentence.
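As a sketch of how such string matching with position bookkeeping might look (the record layout and the regular-expression sentence split are assumptions for illustration):

```python
import re

# Hypothetical sketch of step S102: exact matching that also records, for
# each hit, the sentence index and the anchor word's offset in that sentence.
def match_anchors(terms, corpus_text):
    sentences = [s for s in re.split(r"[。！？]", corpus_text) if s]
    hits = []
    for s_idx, sent in enumerate(sentences):
        for term in terms:
            for m in re.finditer(re.escape(term), sent):
                hits.append({"anchor": term, "sentence": s_idx, "offset": m.start()})
    return hits

print(match_anchors(["肝气郁结"], "抑郁症的发生与肝气郁结有关。"))
# [{'anchor': '肝气郁结', 'sentence': 0, 'offset': 7}]
```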
S103, obtaining text fragments containing anchor words in the corpus, and taking the text fragments as training data.
In one exemplary embodiment, as shown in fig. 2, step S103 includes the following steps S1031 to S1035.
S1031, obtaining sentences in which anchor point words are located in the corpus, and taking the whole sentences in which the anchor point words are located as anchor sentences.
In detail, starting from the anchor word and expanding leftwards and rightwards to the beginning and end of the sentence, the whole sentence in which the anchor word is located is obtained and taken as the anchor sentence.
S1032, adopting a first sliding window to intercept a plurality of first segments in the anchor sentence, and adopting an unsupervised training algorithm to obtain the weight of each word in each first segment.
In detail, weights of the words in each first segment are obtained based on TextRank algorithm.
More specifically, a plurality of first fragments are intercepted from the anchor sentence by a first sliding window of a set size N; then, taking words as units, the words that fall within the same sliding window (first fragment) of the anchor sentence are connected by an edge to form a graph; then unsupervised training is performed, and the score (weight) of each word is calculated by the following mapping relation:

$$W(V_i) = (1-d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} W(V_j)$$

where V_i and V_j denote word nodes; W(V_i) denotes the weight of word V_i; d denotes the damping coefficient, in the range 0-1, representing the probability of jumping from a given node of the graph to any other node, and usually set to 0.85; In(V_i) denotes the set of words pointing to V_i; w_{ji} denotes the weight of the edge from word V_j to word V_i; Out(V_j) denotes the set of words that V_j points to; and w_{jk} denotes the weight of the edge from word V_j to word V_k.
For example, if the anchor word obtained in step S102 is "body deficiency", then in step S1031 the whole sentence in which "body deficiency" is located is obtained, for example: "Body deficiency with chronic illness, failure of qi to control the blood, overstrain and excessive desire, or deficiency of the body due to chronic illness, deficiency of qi and yin of the heart, spleen and kidney, and bleeding caused by the blood failing to circulate within the vessels."
If the size of the first sliding window is set to 3, then for "deficiency" the connected nodes are the words that fall within the same three-character windows, and the score of each word in the first sliding window is calculated according to the above mapping relation.
In step S1032, the first fragments are intercepted from the anchor sentence before the score of each word is calculated; this not only prevents the sliding window from running past the sentence boundary, but the semantics of a whole sentence are also generally complete, which helps compute better scores.
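A compact sketch of this weighting step, assuming single characters as graph nodes, unit edge weights, and a fixed iteration count (all simplifications of the description above):

```python
import itertools

# Characters that fall inside the same size-3 first sliding window are linked
# by an edge, then node weights are iterated with damping d = 0.85 following
# the TextRank formula above (unit edge weights reduce w_ji / sum(w_jk) to
# 1 / out-degree).
def textrank_weights(anchor_sentence, window=3, d=0.85, iters=50):
    chars = list(anchor_sentence)
    neighbors = {}
    for i, j in itertools.combinations(range(len(chars)), 2):
        if j - i < window:  # co-occur in the same first fragment
            neighbors.setdefault(chars[i], set()).add(chars[j])
            neighbors.setdefault(chars[j], set()).add(chars[i])
    weights = {v: 1.0 for v in neighbors}
    for _ in range(iters):
        weights = {v: (1 - d) + d * sum(weights[u] / len(neighbors[u])
                                        for u in neighbors[v])
                   for v in neighbors}
    return weights

print(textrank_weights("体虚久病气血失摄"))  # higher weight = more central node
```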
S1033, adopting a plurality of different second sliding windows to intercept the anchor sentence respectively, and obtaining a plurality of second fragments containing anchor words.
Unlike the TextRank algorithm, after calculating the scores W(V_i), the present application does not go on to extract keywords by setting a score threshold or the like. Instead, taking the anchor word as the center again, several different second sliding windows are used to take words, yielding a plurality of second fragments for each anchor word; the second fragments are then screened in step S1034 so that those with a larger weight mean are retained as training data. An exemplary second-sliding-window word-taking is shown in fig. 3.
It will be appreciated that since words are typically 2-6 characters long, the second sliding window is typically sized to 2-6 characters.
S1034, obtaining the weight average value of all words in each second segment based on the weights of the words in the first segment.
In detail, since the score W(V_i) of each word V_i was obtained in the foregoing step S1032, the sum of the scores of all words in a second fragment can be calculated from the per-word scores, and from it the score mean (i.e., weight mean) of all words located within the second fragment.
It will be appreciated that if the second fragment contains a word of unknown score, i.e., none of the first fragments in step S1032 contains that word, the score of that word may be defined as zero.
S1035, comparing the weight average value with the first threshold value, and taking the second segment corresponding to the weight average value which is larger than or equal to the first threshold value as training data.
The first threshold may be set empirically. The larger the weight mean of all words in a second fragment, the higher the probability that they form words; therefore second fragments whose weight mean is smaller than the first threshold are eliminated, and only second fragments whose weight mean is greater than or equal to the first threshold are retained as training data.
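The window-taking and filtering of steps S1033-S1035 might be sketched as follows, with `weights` coming from the TextRank step and unknown characters scored as zero per the remark above; the window-positioning logic and the 0.5 threshold are illustrative assumptions:

```python
# Slide windows of width 2-6 so that each always covers the anchor word,
# then keep windows whose mean character weight clears the first threshold.
def second_segments(sentence, anchor, widths=range(2, 7)):
    start = sentence.find(anchor)
    segments = []
    for width in widths:
        for left in range(max(0, start + len(anchor) - width), start + 1):
            seg = sentence[left:left + width]
            if len(seg) == width and anchor in seg:
                segments.append(seg)
    return segments

def filter_by_mean_weight(segments, weights, threshold=0.5):
    keep = []
    for seg in segments:
        mean = sum(weights.get(ch, 0.0) for ch in seg) / len(seg)
        if mean >= threshold:  # larger mean = higher word-forming probability
            keep.append(seg)
    return keep

segs = second_segments("体虚久病气血失摄", "体虚")
print(filter_by_mean_weight(segs, {"体": 1.2, "虚": 1.1, "久": 0.4}))
```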
S104, training the word segmentation model by adopting training data until the word segmentation model meets the preset requirement.
In detail, the performance of the word segmentation model is measured by a loss function (loss function), the smaller the loss function is, the higher the word segmentation accuracy of the word segmentation model is, and on the contrary, the larger the loss function is, the lower the word segmentation accuracy of the word segmentation model is. In step S104, training the word segmentation model by using the training data until the loss function of the word segmentation model meets the preset requirement.
In more detail, the preset requirements may be: the loss function is less than a second threshold. When the loss function is greater than or equal to a second threshold value, continuing training the word segmentation model; and when the loss function is reduced to be smaller than the second threshold value, finishing training of the word segmentation model.
When training the word segmentation model with the training data, on one hand the model needs to learn how to generate the second fragments containing anchor words; on the other hand, the model needs to make the semantic representation of each anchor word as close as possible to the semantic representation of the fragment it is located in. Based on these two training objectives, a language generation loss function (i.e., the first loss function) and a semantic loss function (i.e., the second loss function) are designed, and the two are combined to train the word segmentation model until the loss meets the preset requirement.
In detail, to reduce the difficulty of training the word segmentation model, in one embodiment the model is not made to generate the text fragments containing anchor words from scratch; instead, characters in all second fragments containing anchor words are randomly masked, and the model then predicts the masked character content in the decoding process.
In detail, the semantic loss function is calculated by the following mapping relationship:
$$L_{SS} = -\log \frac{\exp(h_i \cdot h_{s_i})}{\sum_{j \neq i} \exp(h_i \cdot h_{s_j})}$$

where L_SS denotes the semantic loss function; h_i denotes the semantic vector of anchor word V_i* encoded by the word segmentation model; h_{s_i} denotes the semantic vector of the second fragment in which V_i* is located; h_{s_j} denotes the semantic vector of the second fragment in which another anchor word is located (hence i ≠ j); the dot product h_i · h_{s_i} between the anchor word and the fragment it is located in represents their similarity; and the dot product h_i · h_{s_j} between the anchor word and the semantics of other fragments likewise represents similarity.
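A hedged PyTorch sketch of this contrastive objective; the batch layout, in which row i of each matrix holds anchor word i and its own second fragment, is an assumption:

```python
import torch

# Diagonal entries give h_i · h_{s_i}; off-diagonal entries are the negatives
# (fragments of other anchor words). A batch of at least two anchors is needed.
def semantic_loss(anchor_vecs, fragment_vecs):
    sim = anchor_vecs @ fragment_vecs.T              # pairwise dot products
    pos = sim.diagonal()                             # h_i · h_{s_i}
    diag = torch.eye(sim.size(0), dtype=torch.bool)
    neg = sim.masked_fill(diag, float("-inf"))       # keep only j != i terms
    # L_SS = -log( exp(pos_i) / sum_{j != i} exp(sim_ij) )
    return (torch.logsumexp(neg, dim=1) - pos).mean()

anchors = torch.randn(4, 8)    # 4 anchor words, 8-dim semantic vectors
fragments = torch.randn(4, 8)  # their second fragments
print(semantic_loss(anchors, fragments))
```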
In this way, the text generation task uses random character masking so that the word segmentation model learns the probability of the next word given an input text segment; the semantic representation task pulls the terms of the term set close to the semantics of their context in the corpus while keeping them as distinguishable as possible from other terms, which amounts to implicitly learning the semantic boundary division between words.
In one exemplary embodiment, as shown in FIG. 4, step S104 includes the following steps S1041-S1044.
S1041, masking part of words in the training data, and inputting the data after masking part of words into the word segmentation model.
Referring to fig. 5, fig. 5 is a schematic diagram of a word segmentation model according to an exemplary embodiment; the word segmentation model includes an encoding unit 101 and a decoding unit 102. For example, for the word "yin deficiency fire hyperactivity", the characters corresponding to "deficiency" and "hyperactivity" are masked, the masked sequence is input to the encoding unit 101 for encoding, and the decoding unit 102 then restores the masked characters to recover "yin deficiency fire hyperactivity". In step S1042, a first loss function is obtained from the difference between the data restored by the decoding unit 102 and the original training data.
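Random character masking as in step S1041 might look like the following sketch; the 15% masking ratio and the `[MASK]` token are assumptions, since the description does not fix either value, and the example word is the assumed Chinese form of the fig. 5 example:

```python
import random

# Mask a fraction of the characters in a training fragment.
def mask_fragment(fragment, ratio=0.15, mask_token="[MASK]"):
    chars = list(fragment)
    n_mask = max(1, int(len(chars) * ratio))  # mask at least one character
    for i in random.sample(range(len(chars)), n_mask):
        chars[i] = mask_token
    return "".join(chars)

print(mask_fragment("阴虚火旺"))  # e.g. "阴[MASK]火旺"
```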
S1042, acquiring data reversely generated by the word segmentation model based on the data after masking part of the words, and acquiring a first loss function based on the difference between the reversely generated data and training data.
S1043, obtaining a corresponding second loss function when the word segmentation model performs semantic segmentation.
S1044, obtaining a total loss function based on the first loss function and the second loss function, and training the word segmentation model until the total loss function meets preset requirements.
For example, the first loss function is denoted L_MLM, the second loss function is denoted L_SS, and the total loss function is expressed as:

Loss = L_MLM + α · L_SS;

where α is a coefficient taking a value in 0-1; by setting the size of α, the influence of the second loss function L_SS on the total loss function is adjusted.
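As a trivial sketch of combining the two terms (α = 0.5 is an illustrative value, not one fixed by the description):

```python
def total_loss(loss_mlm, loss_ss, alpha=0.5):
    # Loss = L_MLM + α · L_SS, with α in [0, 1] scaling the semantic term
    return loss_mlm + alpha * loss_ss

print(total_loss(2.3, 1.1))  # 2.85
```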
It should be understood that, in some embodiments, only one of the first loss function and the second loss function may be considered. For example, steps S1043 and S1044 may be omitted, and after step S1042 the word segmentation model is trained until the first loss function meets the preset requirement, taking the first loss function as the criterion for deciding whether to end training.
Based on the above training method, the obtained word segmentation model can, given an input text X = {x_1, …, x_i}, predict the next word x_{i+1}.
Referring next to fig. 6, an exemplary embodiment of the present application provides a word segmentation method using the word segmentation model. The method can be applied in the medical field, for example to an intelligent consultation system: when conducting a consultation dialogue, the system first segments the text entered by the user with the word segmentation model, for example to find the proper-noun concepts appearing in the text, then performs semantic analysis to understand the user's intention, and finally processes the input on the basis of that understanding, e.g., answering the user's question or asking the user further questions.
As another example, when a medical information retrieval system retrieves patient groups with a specific symptom or disease, the input text is first segmented with the word segmentation model to obtain the professional terms denoting that symptom or disease; a character-matching algorithm is then run against an information base storing the symptom or disease information of patient groups to retrieve the matching patient groups. Accurate word segmentation greatly improves the accuracy of the retrieval results.
As a further example, in a medical research scenario, word segmentation can serve as a preprocessing step: a large number of research papers are segmented with the word segmentation model to obtain their professional terms, from which the keywords (professional terms) to be studied can be selected, making it convenient for researchers to organize and search the literature.
As shown in fig. 6, the word segmentation method includes the following steps S201 to S202.
S201, receiving target text to be segmented.
S202, word segmentation is carried out on the target text by utilizing the word segmentation model obtained through training, and a word segmentation result is obtained.
In one exemplary embodiment, as shown in FIG. 7, step S202 includes the following steps S2021 to S2023.
S2021, segmenting the target text into a plurality of words by adopting a plurality of segmentation modes.
In detail, for a piece of target text input to the word segmentation model, a directed acyclic graph (Directed Acyclic Graph) of all cutting possibilities of the target text may be generated. A directed acyclic graph is a directed graph that contains no cycles (loops).
For example, the target text is: "The occurrence of depression is related to liver qi stagnation." In step S2021, the target text is segmented into a plurality of words using a plurality of segmentation modes, as shown in fig. 8. Each arrow connection in fig. 8 represents one word segmentation possibility; for example, since cutting into single characters is also a segmentation mode, every character is connected by an arrow. The arrows represent the various branches: for example, from "liver" there are three cutting modes, "liver qi", "liver qi depression", and "liver qi stagnation".
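Building such a directed acyclic graph of cut possibilities might be sketched as follows; the dictionary representation and the toy vocabulary standing in for the term set are assumptions:

```python
# Map each start index to the end indices of every cut allowed by the
# vocabulary, with single characters always allowed.
def build_dag(text, vocab):
    dag = {}
    for start in range(len(text)):
        ends = [start + 1]  # a single character is always a valid cut
        for end in range(start + 2, len(text) + 1):
            if text[start:end] in vocab:
                ends.append(end)
        dag[start] = ends
    return dag

vocab = {"肝气", "肝气郁", "肝气郁结", "抑郁症"}
print(build_dag("肝气郁结", vocab))
# {0: [1, 2, 3, 4], 1: [2], 2: [3], 3: [4]}
```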
S2022, obtaining the score of each word based on a dynamic programming algorithm.
In one exemplary embodiment, as shown in FIG. 9, step S2022 includes the following steps S20221 to S20224.
S20221, based on the word segmentation model, respectively acquiring first probabilities corresponding to the multiple segmentation modes.
S20222, deriving a second probability based on the frequency of occurrence of the term in the term set.
S20223, based on the first probability and the second probability, interpolation is used to obtain the total probability.
S20224, based on the total probability, obtains the score of each term by solving a recursive algorithm of the dynamic programming algorithm.
Combining the word segmentation model with dynamic programming, the first probability output by the trained word segmentation model effectively corrects the probability obtained from simple word-frequency statistics over the term set (the second probability), so the score of each word is obtained more accurately. Further, with the dynamic programming algorithm the score of each word can be calculated recursively over the text segment from back to front, which effectively realizes segmentation of professional terms that are important but uncommon in a specific field. The word segmentation model only needs to be trained once and the dynamic programming method runs efficiently, so overall the cost can be reduced and the performance improved.
And S2023, taking the word with the highest score as a word segmentation result.
One embodiment is:
Assume the current processing position is "liver", as indicated by the arrow in fig. 8, and a recursive algorithm is used. The Score of the word "liver qi depression" is:

Score("liver qi depression") = W("liver qi depression") + Path_score("stagnation");

where W("liver qi depression") denotes the probability of the word "liver qi depression", and Path_score("stagnation") denotes the score of the path starting from "stagnation".

The score starting from "liver" shown in fig. 8 is:

Path_score("liver") = max(Score("liver qi"), Score("liver qi depression"), Score("liver qi stagnation"));

Path_score is taken as the score, and the words on the highest-scoring branch are taken as the word segmentation result. For example, if Score("liver qi stagnation") is the largest, "liver qi stagnation" is taken as one word.
The probability W is calculated from the first probability given by the word segmentation model and the second probability given by the frequency of the word in the term set. In detail, for a segmentation w = {w_1, …, w_n} composed of n words, the trained word segmentation model gives the first probability LM(w_1, …, w_n). Meanwhile, the second probability P_vocab(w) is obtained from the frequency in the term set of the word produced by the cut. Finally, the total probability W is calculated by interpolation:

W = λ_1 · LM(w_1, …, w_n) + (1 − λ_1) · P_vocab(w);

where λ_1 is a coefficient taking a value in 0-1; by setting the size of λ_1, the influence of the first probability and the second probability on the total probability is adjusted.
The obtained total probability W is substituted into the mapping relation Score("liver qi depression") = W("liver qi depression") + Path_score("stagnation") to obtain the score of each word.
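Putting the interpolated probability and the back-to-front recursion together, a sketch of the whole scoring step might look like this, reusing `build_dag` from the earlier sketch; the probability stubs, the 1e-8 floor for unseen words, and λ1 = 0.7 are all assumptions:

```python
def total_prob(word, lm_prob, vocab_freq, lam=0.7):
    # W = λ1 · LM(w1, …, wn) + (1 − λ1) · P_vocab(w)
    return lam * lm_prob.get(word, 1e-8) + (1 - lam) * vocab_freq.get(word, 1e-8)

def best_segmentation(text, dag, lm_prob, vocab_freq):
    n = len(text)
    path = {n: (0.0, n)}                 # Path_score of the empty suffix
    for start in range(n - 1, -1, -1):   # recurse from back to front
        path[start] = max(
            (total_prob(text[start:end], lm_prob, vocab_freq) + path[end][0], end)
            for end in dag[start])       # Path_score(start) = max over cuts
    words, pos = [], 0                   # walk the best path forward
    while pos < n:
        words.append(text[pos:path[pos][1]])
        pos = path[pos][1]
    return words

dag = build_dag("肝气郁结", {"肝气", "肝气郁", "肝气郁结"})
print(best_segmentation("肝气郁结", dag,
                        {"肝气郁结": 0.9}, {"肝气": 0.3, "肝气郁结": 0.5}))
# ['肝气郁结'] — the whole professional term wins over smaller cuts
```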
It will be appreciated that obtaining the score of each word based on a dynamic programming algorithm is not limited to a recursive implementation; other approaches, such as bottom-up analysis or top-down analysis, may also be used.
Referring next to fig. 10, fig. 10 is a block diagram illustrating a word segmentation model training system 200 that may perform all or part of the steps of the word segmentation model training method shown in any of fig. 1-2 and 4, according to an exemplary embodiment. As shown in fig. 10, the word segmentation model training system 200 includes, but is not limited to: an acquisition module 201, a matching module 202, a processing module 203 and a training module 204.
The obtaining module 201 is configured to obtain a term set corresponding to the target domain, where the term set includes a plurality of terms of the target domain.
The matching module 202 is configured to match a plurality of terms with a corpus, and take the terms matched in the corpus as anchor words; the corpus contains application text of at least part of the technical terms of the target field.
The processing module 203 is configured to obtain a text segment including anchor words in the corpus, and use the text segment as training data.
The training module 204 is configured to train the word segmentation model using the training data until the word segmentation model meets a preset requirement.
In detail, the processing module 203 obtains sentences in which anchor words are located in the corpus, and takes the whole sentences in which the anchor words are located as anchor sentences; intercepting a plurality of first fragments in the anchor sentence by adopting a first sliding window, and acquiring the weight of each word in each first fragment by adopting an unsupervised training algorithm; intercepting anchor sentences by adopting a plurality of different second sliding windows respectively to obtain a plurality of second fragments containing anchor words; based on the weights of the words in the first segment, obtaining the weight average value of all the words in each second segment; and comparing the weight average value with the first threshold value, and taking a second segment corresponding to the weight average value which is larger than or equal to the first threshold value as training data.
In detail, the training module 204 masks part of the characters in the training data and inputs the masked data into the word segmentation model; acquires the data reversely generated by the word segmentation model based on the masked data, and obtains a first loss function based on the difference between the reversely generated data and the training data; acquires a second loss function corresponding to the semantic division performed by the word segmentation model; and obtains a total loss function based on the first loss function and the second loss function, training the word segmentation model until the total loss function meets the preset requirement.
The more detailed implementation process of the functions and roles of each module in the word segmentation model training system 200 is specifically described in the implementation process of the corresponding steps in the word segmentation model training method, and will not be described herein.
Word segmentation model training system 200 may be loaded in any computer device having data processing capabilities, such as a desktop computer, a notebook computer, or the like.
Referring next to fig. 11, fig. 11 is a block diagram illustrating a word segmentation model 300 that may perform all or part of the steps of the word segmentation method shown in any of fig. 6, 7, and 9, according to an exemplary embodiment. As shown in fig. 11, the word segmentation model 300 includes, but is not limited to: text receiving module 301, segmentation module 302, score calculating module 303 and word segmentation module 304.
The text receiving module 301 is configured to receive a target text to be segmented.
The segmentation module 302 is configured to segment the target text into a plurality of words in a plurality of segmentation manners.
The score calculation module 303 is configured to obtain a score of each term based on a dynamic programming algorithm.
The word segmentation module 304 is configured to output the word with the highest score as a word segmentation result.
The more detailed implementation process of the functions and roles of each module in the word segmentation model 300 is specifically described in the implementation process of the corresponding steps in the word segmentation method, and will not be described herein.
Word segmentation model 300 may be loaded in any computer device having data processing capabilities, such as a desktop computer, a notebook computer, or the like.
Referring next to fig. 12, fig. 12 schematically shows a block diagram of a computer system architecture of a computer device for implementing a word segmentation model training method or a word segmentation method according to an embodiment of the present application.
It should be noted that, the computer system 400 of the computer device shown in fig. 12 is only an example, and should not impose any limitation on the functions and the application scope of the embodiments of the present application.
As shown in fig. 12, the computer system 400 includes a central processing unit 401 (Central Processing Unit, CPU) which can execute various appropriate actions and processes according to a program stored in a Read-Only Memory 402 (ROM) or a program loaded from a storage section 403 into a random access Memory 404 (Random Access Memory, RAM). In the random access memory 404, various programs and data required for the system operation are also stored. The central processing unit 401, the read only memory 402, and the random access memory 404 are connected to each other via a bus 405. An Input/Output interface 406 (i.e., an I/O interface) is also connected to bus 405.
The following components are connected to the input/output interface 406: an input section 407 including a keyboard, a mouse, and the like; an output section 408 including a cathode-ray tube (CRT) or liquid-crystal display (LCD), a speaker, and the like; a storage section 403 including a hard disk or the like; and a communication section 409 including a network interface card such as a local-area-network card or a modem. The communication section 409 performs communication processing via a network such as the Internet. The drive 410 is also connected to the input/output interface 406 as needed. A removable medium 411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 403 as needed.
In particular, according to embodiments of the present application, the processes described in the various method flowcharts may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 409 and/or installed from the removable medium 411. The computer programs, when executed by the central processor 401, perform the various functions defined in the system of the present application.
It should be noted that, the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-Only Memory (ROM), an erasable programmable read-Only Memory (Erasable Programmable Read Only Memory, EPROM), flash Memory, an optical fiber, a portable compact disc read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal that propagates in baseband or as part of a carrier wave, with the computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
Those of skill in the art will appreciate that in one or more of the examples described above, the functions described herein may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, these functions may be stored on or transmitted over as one or more instructions or code on a computer-readable storage medium.
From the foregoing description of the embodiments, it will be apparent to those skilled in the art that, for convenience and brevity of description, only the division into the functional modules described above is illustrated; in practical applications, the functions may be allocated to different functional modules as needed, i.e., the internal structure of the apparatus may be divided into different functional modules to implement all or part of the functions described above.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into modules is only a division of logical functions, and other manners of division are possible in actual implementation. For example, multiple units or components may be combined or integrated into another device, and some features may be omitted or not performed.
It is to be understood that the present application is not limited to the precise construction set forth above and shown in the drawings, and that various modifications and changes may be effected therein without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (10)

1. A word segmentation model training method, characterized by comprising the following steps:
acquiring a term set corresponding to a target field, wherein the term set comprises a plurality of professional terms of the target field;
matching the plurality of professional terms with a corpus, and taking the professional terms matched in the corpus as anchor words; the corpus comprises application texts of at least part of the professional terms in the target field;
acquiring text fragments containing the anchor words in the corpus, and taking the text fragments as training data;
training a word segmentation model by adopting the training data until the word segmentation model meets the preset requirement.
2. The word segmentation model training method according to claim 1, wherein the acquiring text fragments containing the anchor words in the corpus, and taking the text fragments as training data comprises:
acquiring sentences in which the anchor words are located in the corpus, and taking the whole sentences in which the anchor words are located as anchor sentences;
intercepting a plurality of first fragments in the anchor sentence by adopting a first sliding window, and acquiring the weight of each word in each first fragment by adopting an unsupervised training algorithm;
intercepting the anchor sentences by adopting a plurality of different second sliding windows respectively to obtain a plurality of second fragments containing the anchor words;
based on the weights of the words in the first fragments, obtaining a weight average value of all the words in each second fragment;
and comparing the weight average value with a first threshold, and taking second fragments whose weight average value is greater than or equal to the first threshold as training data.
3. The word segmentation model training method according to claim 2, wherein the acquiring the weight of each word in each first fragment by adopting an unsupervised training algorithm comprises:
obtaining the weight of each word in each first fragment based on the TextRank algorithm.
4. The word segmentation model training method according to claim 1, wherein the training the word segmentation model by adopting the training data until the word segmentation model meets the preset requirement comprises:
masking part of the words in the training data, and inputting the masked data into the word segmentation model;
acquiring data reversely generated by the word segmentation model based on the masked data, and obtaining a first loss function based on the difference between the reversely generated data and the training data;
obtaining a corresponding second loss function when the word segmentation model performs semantic segmentation;
and obtaining a total loss function based on the first loss function and the second loss function, and training the word segmentation model until the total loss function meets a preset requirement.
5. A word segmentation method, comprising:
receiving a target text to be segmented;
performing word segmentation on the target text by using a word segmentation model trained by the word segmentation model training method according to any one of claims 1 to 4, to obtain a word segmentation result.
6. The word segmentation method according to claim 5, wherein the performing word segmentation on the target text by using the word segmentation model trained by the word segmentation model training method according to any one of claims 1 to 4 to obtain a word segmentation result comprises:
segmenting the target text into a plurality of words by adopting a plurality of segmentation modes;
obtaining the score of each word based on a dynamic programming algorithm;
and taking the word with the highest score as the word segmentation result.
7. The word segmentation method according to claim 6, wherein the obtaining the score of each word based on the dynamic programming algorithm comprises:
based on the word segmentation model, respectively acquiring first probabilities corresponding to the multiple segmentation modes;
obtaining a second probability based on the frequency of occurrence of the term in the term set;
based on the first probability and the second probability, obtaining a total probability by adopting an interpolation method;
and obtaining the score of each word by solving the recursion of the dynamic programming algorithm based on the total probability.
8. A word segmentation model training system, comprising:
the acquisition module is used for acquiring a term set corresponding to the target field, wherein the term set comprises a plurality of professional terms of the target field;
the matching module is used for matching the plurality of professional terms with the corpus, and taking the professional terms matched in the corpus as anchor words; the corpus comprises application texts of at least part of the professional terms in the target field;
the processing module is used for acquiring text fragments containing the anchor words in the corpus, and taking the text fragments as training data;
and the training module is used for training the word segmentation model by adopting the training data until the word segmentation model meets the preset requirement.
9. A computer device, comprising:
one or more processors;
a memory for storing one or more programs that, when executed by the one or more processors, cause the computer device to implement the word segmentation model training method of any of claims 1-4, or cause the computer device to implement the word segmentation method of any of claims 5-7.
10. A computer-readable storage medium storing computer-readable instructions that, when executed by a processor of a computer, cause the computer to perform the word segmentation model training method of any one of claims 1 to 4 or the word segmentation method of any one of claims 5 to 7.
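
To make the claimed steps concrete, the sketches below illustrate them in Python under stated assumptions; they are non-limiting examples, not the patented implementation. For the data-collection steps of claim 1, a minimal sketch assuming an in-memory corpus of sentences; the names build_training_data, term_set, and corpus_sentences are illustrative:

```python
def build_training_data(term_set, corpus_sentences):
    """Collect unlabeled training fragments by anchor-word matching (claim 1)."""
    # Professional terms that actually occur in the corpus become anchor words.
    anchors = {term for term in term_set
               if any(term in sentence for sentence in corpus_sentences)}
    # Text fragments containing an anchor word are kept as training data;
    # here a whole sentence stands in for a fragment.
    training_data = [sentence for sentence in corpus_sentences
                     if any(anchor in sentence for anchor in anchors)]
    return anchors, training_data
```

Because no manual labels are involved, this step can run over an arbitrarily large domain corpus before any model training begins.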
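For claims 2 and 3, a sketch of the fragment filtering, using a simplified TextRank over a word co-occurrence graph as the unsupervised weighting step; the window sizes, damping factor, iteration count, and threshold are illustrative assumptions, not values from the specification:

```python
from collections import defaultdict

def textrank_weights(words, cooc_window=3, damping=0.85, iterations=20):
    """Weight each word of the first fragments via a simplified TextRank."""
    # Build a co-occurrence graph: words within the same window share an edge.
    graph = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + cooc_window, len(words))):
            if words[j] != w:
                graph[w].add(words[j])
                graph[words[j]].add(w)
    # Standard TextRank iteration: weight flows along co-occurrence edges.
    weight = {w: 1.0 for w in graph}
    for _ in range(iterations):
        weight = {w: (1 - damping) + damping *
                     sum(weight[n] / len(graph[n]) for n in nbrs)
                  for w, nbrs in graph.items()}
    return weight

def filter_second_fragments(anchor_sentence, anchor, weight,
                            window_sizes=(3, 5, 7), threshold=1.0):
    """Keep second fragments around the anchor whose mean weight clears the threshold."""
    kept = []
    for size in window_sizes:  # the plurality of different second sliding windows
        for start in range(max(1, len(anchor_sentence) - size + 1)):
            fragment = anchor_sentence[start:start + size]
            if anchor not in fragment:
                continue  # second fragments must contain the anchor word
            mean = sum(weight.get(w, 0.0) for w in fragment) / len(fragment)
            if mean >= threshold:  # compare the weight average with the first threshold
                kept.append(fragment)
    return kept
```

The threshold keeps only fragments whose words are, on average, central to the domain text, which is how low-value context around an anchor word gets discarded.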
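For claim 4, one way the two-part loss could be computed, sketched in PyTorch; the assumed model interface (returning both reconstruction and segmentation logits), the masking rate, and the weighting alpha are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def training_step(model, token_ids, seg_labels, mask_id=0, mask_prob=0.15, alpha=0.5):
    """One combined-loss step: masked-word reconstruction plus semantic segmentation."""
    # Mask part of the words in the training data (claim 4, step 1).
    mask = torch.rand(token_ids.shape) < mask_prob
    masked_ids = token_ids.masked_fill(mask, mask_id)
    # Assumed interface: the model returns (N, L, vocab) reconstruction logits
    # and (N, L, num_tags) segmentation logits for the masked input.
    recon_logits, seg_logits = model(masked_ids)
    # First loss: difference between the reversely generated data and the original data.
    first_loss = F.cross_entropy(recon_logits.transpose(1, 2), token_ids)
    # Second loss: the semantic segmentation loss.
    second_loss = F.cross_entropy(seg_logits.transpose(1, 2), seg_labels)
    # Total loss; training continues until it meets the preset requirement.
    return alpha * first_loss + (1 - alpha) * second_loss
```

Weighting the two terms lets the reconstruction objective supply unsupervised signal from the anchor-word fragments while the segmentation objective keeps the model on task.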
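For claims 6 and 7, a sketch of the dynamic-programming decoder with interpolated probabilities; model_prob (the first, model-derived probability), the interpolation weight lam, and max_word_len are illustrative assumptions:

```python
import math

def segment(text, model_prob, term_freq, lam=0.7, max_word_len=6):
    """Pick the segmentation with the highest interpolated log-probability (claims 6-7)."""
    total_freq = sum(term_freq.values()) or 1
    n = len(text)
    best = [0.0] + [-math.inf] * n   # best[i]: best log-score of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word in text[:i]
    for i in range(1, n + 1):
        for j in range(max(0, i - max_word_len), i):
            word = text[j:i]
            p1 = model_prob(word)                     # first probability (segmentation model)
            p2 = term_freq.get(word, 0) / total_freq  # second probability (term-set frequency)
            p = lam * p1 + (1 - lam) * p2             # total probability by interpolation
            if p > 0 and best[j] + math.log(p) > best[i]:
                best[i] = best[j] + math.log(p)
                back[i] = j
    # Recover the highest-scoring words by walking the back pointers.
    words, i = [], n
    while i > 0:
        words.append(text[back[i]:i])
        i = back[i]
    return list(reversed(words))
```

The interpolation lets domain terms that the model has seen rarely still win a segmentation path, because their term-set frequency contributes to the total probability.
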
CN202310157115.0A 2023-02-15 2023-02-15 Word segmentation model training method, system, equipment, storage medium and word segmentation method Pending CN116011450A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310157115.0A CN116011450A (en) 2023-02-15 2023-02-15 Word segmentation model training method, system, equipment, storage medium and word segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310157115.0A CN116011450A (en) 2023-02-15 2023-02-15 Word segmentation model training method, system, equipment, storage medium and word segmentation method

Publications (1)

Publication Number Publication Date
CN116011450A true CN116011450A (en) 2023-04-25

Family

ID=86035782

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310157115.0A Pending CN116011450A (en) 2023-02-15 2023-02-15 Word segmentation model training method, system, equipment, storage medium and word segmentation method

Country Status (1)

Country Link
CN (1) CN116011450A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117786427A (en) * 2024-02-26 2024-03-29 星云海数字科技股份有限公司 Vehicle type main data matching method and system
CN117786427B (en) * 2024-02-26 2024-05-24 星云海数字科技股份有限公司 Vehicle type main data matching method and system

Similar Documents

Publication Publication Date Title
CN109241524B (en) Semantic analysis method and device, computer-readable storage medium and electronic equipment
WO2022007823A1 (en) Text data processing method and device
Yao et al. An improved LSTM structure for natural language processing
US11068653B2 (en) System and method for context-based abbreviation disambiguation using machine learning on synonyms of abbreviation expansions
KR20180062321A (en) Method for drawing word related keyword based on deep learning and computerprogram
CN111914097A (en) Entity extraction method and device based on attention mechanism and multi-level feature fusion
US11170169B2 (en) System and method for language-independent contextual embedding
CN110688489B (en) Knowledge graph deduction method and device based on interactive attention and storage medium
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
Yang et al. Towards bidirectional hierarchical representations for attention-based neural machine translation
US20240111956A1 (en) Nested named entity recognition method based on part-of-speech awareness, device and storage medium therefor
CN113707307A (en) Disease analysis method and device, electronic equipment and storage medium
Ren et al. Detecting the scope of negation and speculation in biomedical texts by using recursive neural network
CN112052318A (en) Semantic recognition method and device, computer equipment and storage medium
WO2021129411A1 (en) Text processing method and device
CN116187282B (en) Training method of text review model, text review method and device
CN111597807B (en) Word segmentation data set generation method, device, equipment and storage medium thereof
Dmytriv et al. Comparative Analysis of Using Different Parts of Speech in the Ukrainian Texts Based on Stylistic Approach.
CN116011450A (en) Word segmentation model training method, system, equipment, storage medium and word segmentation method
CN108875024B (en) Text classification method and system, readable storage medium and electronic equipment
CN116757195B (en) Implicit emotion recognition method based on prompt learning
CN113705207A (en) Grammar error recognition method and device
Raharjo et al. Detecting proper nouns in indonesian-language translation of the quran using a guided method
Celikyilmaz et al. An empirical investigation of word class-based features for natural language understanding
Bender et al. Identifying and translating subjective content descriptions among texts

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination