CN111199150B - Text segmentation method, related device and readable storage medium - Google Patents

Text segmentation method, related device and readable storage medium

Info

Publication number
CN111199150B
CN111199150B (application CN201911398383.1A)
Authority
CN
China
Prior art keywords
text
segmentation
unit
word
text unit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911398383.1A
Other languages
Chinese (zh)
Other versions
CN111199150A (en)
Inventor
闫莉
孔常青
万根顺
高建清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201911398383.1A priority Critical patent/CN111199150B/en
Publication of CN111199150A publication Critical patent/CN111199150A/en
Application granted granted Critical
Publication of CN111199150B publication Critical patent/CN111199150B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The application discloses a text segmentation method, a related device and a readable storage medium. After a text to be segmented is acquired, the segmentation features of each text unit in the text are obtained, the segmentation boundaries of the text are determined according to those segmentation features, and the text is finally segmented at the determined boundaries. Based on this scheme, segmentation of the text to be segmented can be achieved.

Description

Text segmentation method, related device and readable storage medium
Technical Field
The present invention relates to the field of natural language processing, and more particularly, to a text segmentation method, a related device, and a readable storage medium.
Background
With the rapid development of statistical natural language processing technology, text segmentation has become an important research direction. Text segmentation determines segmentation boundaries in long, unsegmented text and splits it into text fragments at those boundaries. Compared with the original long text, the segmented fragments are shorter and better match users' reading habits; at the same time, each fragment carries a simple, clear topic, which helps users extract key information quickly and reduces reading pressure.
Accordingly, there is a need to provide a text segmentation method.
Disclosure of Invention
In view of the foregoing, the present application proposes a text segmentation method, related apparatus, and readable storage medium. The specific scheme is as follows:
a text segmentation method, comprising:
acquiring a text to be segmented;
obtaining segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Optionally, the obtaining the segmentation feature of each text unit in the text to be segmented includes:
acquiring the word sequence and clue word features of each text unit in the text to be segmented, the word sequence and clue word features of each text unit serving as the segmentation features of each text unit.
Optionally, acquiring the word sequence and clue word features of each text unit in the text to be segmented includes:
word segmentation is carried out on each text unit, and a word sequence of each text unit is obtained;
determining clue words from the word sequence based on a predetermined clue word set;
Acquiring position information of the clue words in the corresponding text units;
and generating clue word characteristics of each text unit according to the position information of clue words in each text unit.
Optionally, the determining the segmentation boundary of the text to be segmented according to the segmentation feature of each text unit includes:
inputting the segmentation features of each text unit into a text segmentation model to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is trained with the segmentation features of each text unit in a training text as training samples and with the segmentation boundary annotation information of the training text as sample labels.
Optionally, the text segmentation model includes:
a word coding layer, an attention layer, a fusion layer, a sentence coding layer and an output layer.
Optionally, inputting the segmentation feature of each text unit into a text segmentation model to obtain an output result of whether the starting position of each text unit is the segmentation boundary of the text to be segmented, including:
acquiring a segment length characteristic of each text unit by using a text segmentation model, wherein the segment length characteristic is used for representing segment length information from a last segmentation boundary of each text unit to each text unit;
Word coding is carried out on the segmentation features of each text unit by utilizing a word coding layer of the text segmentation model, so that semantic representation of each text unit is obtained;
performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
utilizing a fusion layer of the text segmentation model to fuse semantic representation of sentences of each text unit and segment length characteristics of each text unit to obtain complete word representation of the sentences of each text unit;
performing sentence coding on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and calculating, with an output layer of the text segmentation model, the sentence representation of each text unit together with the sentence representation of the previous time step, to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented.
Optionally, the word coding layer of the text segmentation model performs word coding on the segmentation feature of each text unit to obtain semantic representation of each text unit, including:
word coding is carried out on word sequences in the segmentation features of each text unit, so that word meaning characterization of each text unit is obtained;
Obtaining the clue word semantic representation of each text unit based on the word semantic representation of each text unit and clue word characteristics in the segmentation characteristics of each text unit; the word semantic representation and the cue word semantic representation serve as the semantic representation.
Optionally, the performing attention calculation on the semantic representation of each text unit by using the attention layer of the text segmentation model to obtain the sentence semantic representation of each text unit includes:
performing attention computation on the word semantic representation to obtain a first sentence semantic representation of each text unit;
and performing attention computation on the clue word semantic representation to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
A text segmentation apparatus comprising:
the segmentation text acquisition unit is used for acquiring a text to be segmented;
the segmentation feature acquisition unit is used for acquiring segmentation features of each text unit in the text to be segmented;
a segmentation boundary determining unit, configured to determine a segmentation boundary of the text to be segmented according to a segmentation feature of each text unit;
The segmentation unit is used for segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Optionally, the segmentation feature acquisition unit includes:
the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristic of each text unit in the text to be segmented, and the word sequence and clue word characteristic of each text unit are used as the segmentation characteristic of each text unit.
Optionally, the word sequence and clue word feature acquiring unit includes:
the word segmentation unit is used for segmenting each text unit to obtain a word sequence of each text unit;
a clue word determining unit configured to determine clue words from the word sequence based on a predetermined clue word set;
the clue word position information acquisition unit is used for acquiring the position information of the clue word in the corresponding text unit;
and the clue word characteristic generating unit is used for generating clue word characteristics of each text unit according to the position information of clue words in each text unit.
Optionally, the segmentation boundary determination unit includes:
the model application unit is used for inputting the segmentation features of each text unit into a text segmentation model to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is trained with the segmentation features of each text unit in a training text as training samples and with the segmentation boundary annotation information of the training text as sample labels.
Optionally, the text segmentation model includes:
a word coding layer, an attention layer, a fusion layer, a sentence coding layer and an output layer.
Optionally, the model application unit includes:
a segment length feature obtaining unit, configured to obtain a segment length feature of each text unit using a text segmentation model, where the segment length feature is used to represent segment length information from a last segmentation boundary of each text unit to each text unit;
the word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculating unit is used for carrying out attention calculation on the semantic representation of each text unit by utilizing the attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing semantic representation of sentences of each text unit and segment length characteristics of each text unit by utilizing a fusion layer of the text segmentation model to obtain complete word representation of the sentences of each text unit;
the sentence coding unit is used for carrying out sentence coding on the complete word representation of the sentence of each text unit by utilizing the sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
And the calculation unit is used for calculating, with the output layer of the text segmentation model, the sentence representation of each text unit together with the sentence representation of the previous time step, to obtain an output result indicating whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the word encoding unit includes:
the first word coding subunit is used for carrying out word coding on word sequences in the segmentation characteristics of each text unit to obtain word meaning representation of each text unit;
the second word coding subunit is used for obtaining the clue word semantic representation of each text unit based on the word semantic representation of each text unit and clue word features in the segmentation features of each text unit; the word semantic representation and the cue word semantic representation serve as the semantic representation.
Optionally, the attention calculating unit includes:
the first attention computing unit is used for performing attention computation on the word semantic representation to obtain a first sentence semantic representation of each text unit;
the second attention calculating unit is used for carrying out attention calculation on the clue word semantic representation to obtain a second sentence semantic representation of each text unit, and the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representation.
A text segmentation device comprising a memory and a processor;
the memory is used for storing programs;
the processor is configured to execute the program to implement the steps of the text segmentation method as described above.
A readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the text segmentation method as described above.
By means of the above technical scheme, the application discloses a text segmentation method, a related device and a readable storage medium. After a text to be segmented is obtained, the segmentation features of each text unit in the text are obtained, the segmentation boundaries of the text are determined according to those segmentation features, and the text is finally segmented at the determined boundaries. Based on this scheme, segmentation of the text to be segmented can be achieved.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
Fig. 1 is a schematic flow chart of a text segmentation method disclosed in an embodiment of the present application;
FIG. 2 is a schematic diagram of a text segmentation model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a text segmentation device according to an embodiment of the present disclosure;
fig. 4 is a block diagram of a hardware structure of a text segmentation device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The text segmentation method disclosed in the present application can be applied to a post-processing module of a voice recognition system, a man-machine question-answer system, an information retrieval system, and the like, and is described by the following embodiments.
Referring to fig. 1, fig. 1 is a flow chart of a text segmentation method disclosed in an embodiment of the present application, where the method may include:
S101: and obtaining the text to be segmented.
In the present application, the text to be segmented may be any unsegmented text, for example, an unsegmented speech transcript produced by running a speech recognition system on a user's speech, an unsegmented electronic book, and the like.
In the present application, the text to be segmented may be obtained by a user uploading manner, or may be obtained from the output of other natural language processing systems (such as a speech recognition system, a man-machine question-answering system, an information retrieval system, etc.), which is not limited in this application.
S102: and obtaining the segmentation characteristic of each text unit in the text to be segmented.
In the present application, each text unit in the text to be segmented may be a sentence delimited by ending punctuation (such as a period, an exclamation mark, a question mark, etc.) in the text, or may be a clause or phrase delimited by punctuation marks in the text; this application places no limitation on this.
In the present application, the segmentation feature of each text unit may be any feature that can be used to determine a segmentation boundary of the text to be segmented, for example, a word sequence of each text unit in the text to be segmented, a clue word feature of each text unit in the text to be segmented, a segment length segmentation threshold of the text to be segmented, and the like, which is not limited in this application.
It should be noted that the word sequence of each text unit is the sequence of words contained in the text unit. Clue words are a class of words with strong text segmentation guidance, such as "first", "next", "last", etc., that often appear at the beginning of a paragraph. The clue word feature may be a feature used to characterize information such as clue word content, number, location, etc. of each text unit in the text to be segmented. The segment length segmentation threshold is a threshold used to define the segment length of text to be segmented after segmentation.
As a preferred embodiment, acquiring the segmentation feature of each text unit in the text to be segmented may include acquiring a word sequence and a clue word feature of each text unit in the text to be segmented, the word sequence and clue word feature of each text unit as the segmentation feature of each text unit.
It should be noted that, in the subsequent embodiments of the present application, the text segmentation process is described based on the word sequence and the clue word feature of each text unit as the segmentation feature of each text unit. However, on the basis, other segmentation features are combined with the word sequence and the clue word features of each text unit, and the scheme as the segmentation feature of each text unit is also within the protection scope of the application.
S103: and determining the segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit.
In the present application, determining the segmentation boundary of the text to be segmented according to the segmentation feature of each text unit specifically means determining, according to the segmentation feature of each text unit, whether that text unit is a segmentation boundary of the text to be segmented; more precisely, whether the starting position of the text unit is a segmentation boundary. The segmentation boundaries of the text to be segmented can thus be determined from the segmentation features of the text units, and the text to be segmented may have one or more segmentation boundaries.
S104: and dividing the text to be divided based on the dividing boundary of the text to be divided.
In the present application, after the segmentation boundaries of the text to be segmented are determined, the text units on either side of each segmentation boundary are placed into different paragraphs.
The embodiment discloses a text segmentation method, after a text to be segmented is obtained, segmentation characteristics of each text unit in the text to be segmented are obtained, segmentation boundaries of the text to be segmented are determined according to the segmentation characteristics of each text unit, and finally the text to be segmented is segmented based on the segmentation boundaries of the text to be segmented. Based on the scheme, the segmentation of the text to be segmented can be realized.
In the application, a specific implementation manner for acquiring word sequences and clue word characteristics of each text unit in a text to be segmented is disclosed, and the method comprises the following steps:
s201: and segmenting each text unit to obtain a word sequence of each text unit.
In the application, an existing word segmentation system can be adopted to segment each text unit, so that a word sequence of each text unit is obtained.
S202: clue words are determined from the word sequence based on a predetermined set of clue words.
In the application, the words in the word sequence can be looked up one by one in a preset clue word dictionary; if a word is found, it is determined to be a clue word, and if none of the words is found in the clue word dictionary, the text unit is determined to contain no clue word.
However, in some cases the word sequence of a text unit contains a large number of words, and a word appearing near the end of a text unit is relatively unlikely to be a clue word. Therefore, to improve the efficiency of clue word determination, only the first few words of the word sequence may be looked up in the preset clue word dictionary: if a word is found, it is determined to be a clue word; if none is found, the text unit is determined to contain no clue word.
In the application, training texts may be determined in advance. For example, news texts, electronic books and the like may be collected from the web as training texts; such texts already carry natural paragraph information, are easy to obtain, and are available at large scale. Alternatively, unsegmented texts may be manually annotated with paragraph boundaries to obtain training texts. After the training texts are determined, the clue word dictionary is built from them.
The clue word dictionary may be constructed as follows: first, from the first few words of each paragraph in the training text, remove content words with concrete meaning such as nouns, adjectives and numerals, keeping prepositions, conjunctions and adverbs; then count the frequency of each retained word in the training text; finally, sort the retained words in descending order of frequency and take a preset number of the top-ranked words to form the clue word dictionary.
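The sketch below illustrates this dictionary construction under the assumption that each training paragraph has already been word-segmented and part-of-speech tagged; the tag set ("p"/"c"/"d" for prepositions, conjunctions and adverbs), FIRST_N and DICT_SIZE are illustrative assumptions, not values fixed by this application.

```python
from collections import Counter

FIRST_N = 3        # number of leading words of each paragraph to inspect (assumed)
DICT_SIZE = 100    # number of top-ranked words kept in the dictionary (assumed)
KEPT_POS = {"p", "c", "d"}   # keep prepositions, conjunctions, adverbs

def build_clue_word_dict(tagged_paragraphs):
    """tagged_paragraphs: iterable of paragraphs, each a list of (word, pos) tuples."""
    counter = Counter()
    for paragraph in tagged_paragraphs:
        for word, pos in paragraph[:FIRST_N]:
            if pos in KEPT_POS:              # drop content words, keep function words
                counter[word] += 1
    # descending-frequency sort; the most frequent retained words form the dictionary
    return {word for word, _ in counter.most_common(DICT_SIZE)}
```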
S203: and acquiring the position information of the clue word in the corresponding text unit.
In existing text segmentation methods that rely on clue words, the same clue word is described with the same word representation.
However, the same clue word has different meanings in different contexts, and its segmentation directivity also differs. For example, the segmentation directivity of "last" in the sentence "Last, natural language understanding tasks are very challenging" and in the sentence "The last lecture guest is XX" is clearly different. If the clue word "last" is given a unified word representation in all sentences, the semantic difference between these two occurrences of "last" cannot be reflected, which affects the accuracy of text segmentation.
To solve the above problem, the present application obtains the position information of the clue word in its text unit, so that the clue word feature of each text unit can be generated from that position information. Based on the position information of the clue word, the word semantic representation at the corresponding position can be selected from the word semantic representations of the whole text unit and used as the semantic representation of the clue word. Because the same word carries different semantics at different positions of different text units, the same clue word can thereby be described with different word representations.
S204: and generating clue word characteristics of each text unit according to the position information of clue words in each text unit.
In the present application, the position information of the clue word in each text unit may be determined as the clue word feature of each text unit.
It should be noted that if there is no clue word in the text unit, the clue word feature of the text unit is determined to be a preset feature, for example, it may be determined that the clue word semantic feature of the text unit without clue word is-1.
For example, suppose the first 3 words of each word sequence are looked up in the preset clue word dictionary. For the text unit "Next, the text segmentation system of the product will be introduced", a clue word ("next") appears at position 1 (position 0 being the start), so the clue word feature of this text unit is {1}. For the text unit "At the end of the presentation the audience is, last of all, reminded to leave the venue in order", the clue word feature is {-1}: although "last" appears in the sentence, it is not among the first 3 words of the word sequence, so no corresponding clue word is found.
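As a minimal sketch, the clue word feature of a text unit can be computed from its word sequence as follows, assuming the text unit is already word-segmented and clue_word_dict is the dictionary built above; FIRST_N = 3 matches the example, the -1 marker follows the description, and the function name is an illustrative assumption.

```python
FIRST_N = 3

def clue_word_feature(words, clue_word_dict):
    """Return the position of the first clue word among the leading words, or -1."""
    for position, word in enumerate(words[:FIRST_N]):
        if word in clue_word_dict:
            return position          # position 0 is the start of the text unit
    return -1                        # no clue word in this text unit

# e.g. one feature per text unit of the text to be segmented:
# features = [clue_word_feature(unit_words, clue_dict) for unit_words in segmented_units]
```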
Based on the method, the same clue word can be described by adopting different word representations, so that the clue word itself can be considered and the clue word context can be considered when the text is segmented, and the accuracy of text segmentation is improved.
In this application, a specific implementation manner of determining a segmentation boundary of a text to be segmented according to a segmentation feature of each text unit is also disclosed, where the manner may be as follows:
and inputting the segmentation characteristics of each text unit into a text segmentation model to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
It should be noted that the text segmentation model is trained with the segmentation features of each text unit in a training text as training samples and with the segmentation boundary annotation information of the training text as sample labels. The segmentation features of each text unit in the training text are the word sequence and clue word features of that text unit. The input of the text segmentation model is the word sequence and clue word features, and the output is an output result indicating whether each text unit is a segmentation boundary.
In this application, the output result indicating whether each text unit is a segmentation boundary can be represented in various ways. In one implementation, it is expressed as the probability that the text unit is a segmentation boundary: when the probability exceeds a preset threshold, the text unit is a segmentation boundary; otherwise it is not. In another implementation, it is expressed as a classification result: when the classification result takes a first value, the text unit is a segmentation boundary, and when it takes a second value, the text unit is not a segmentation boundary.
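For the probability-style output, a small sketch of turning per-unit probabilities into boundary decisions is shown below; the threshold value is an illustrative assumption.

```python
THRESHOLD = 0.5   # assumed preset threshold

def decide_boundaries(probabilities, threshold=THRESHOLD):
    """probabilities: list of P(text unit is a segmentation boundary)."""
    return [p > threshold for p in probabilities]
```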
In the present application, the training text used for training the text segmentation model may be all or part of the training text described in S202, or may be the training text redetermined by the method of determining the training text described in S202, which is not limited in any way.
In the application, the segmentation boundary identification marking information of the training text can be a paragraph division marking of the training text, the marking can be manually marked, or can be obtained by identifying the training text, and the marking can be in any mode, so that the application is not limited in any way.
In this application, a specific implementation manner of a text segmentation model is also disclosed, as shown in fig. 2, fig. 2 is a schematic diagram of a text segmentation model disclosed in the embodiment of the present application, and as can be seen from fig. 2, the text segmentation model includes: a word coding layer, an attention layer, a fusion layer, a sentence coding layer and an output layer.
Based on the text segmentation model shown in fig. 2, the present application further discloses a specific implementation manner of inputting the segmentation feature of each text unit into the text segmentation model to obtain the output result of whether the starting position of each text unit is the segmentation boundary of the text to be segmented, where the specific implementation manner is as follows:
S301: and acquiring a segment length characteristic of each text unit by using the text segmentation model, wherein the segment length characteristic is used for representing segment length information from the last segmentation boundary of each text unit to each text unit.
In the present application, the segment length feature of each text unit is used to represent segment length information from the last segmentation boundary predicted by the text segmentation model to the current text unit, and the segment length information may be represented by the number of text units, the number of words, or the number of words included between the last segmentation boundary and the current text unit.
Since the segment length information is a discrete value and the value range is large, the segment length information can be constrained to be in the range of 0 to 1 by a nonlinear mapping sigmoid function in the application.
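A minimal sketch of this constraint, assuming the segment length is counted in text units since the last predicted boundary and squashed with a sigmoid as described:

```python
import math

def segment_length_feature(units_since_last_boundary):
    """Map a (possibly large) discrete segment length into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-float(units_since_last_boundary)))
```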
It should be noted that the text segmentation model in the present application has a sequential structure: by the time it predicts whether the current text unit is a segmentation boundary, the results for all text units preceding the current one are already available. Therefore, the segment length feature of the current text unit may be obtained before word encoding is performed on its segmentation feature, for example immediately after the processing of the previous text unit is completed, and stored in the text segmentation model for later use by the fusion layer.
Of course, the segment length feature of the current text unit may instead be obtained at any time after the segmentation feature of the current text unit is encoded and before the fusion layer uses it, which is not limited here. However, since the segment length feature is consumed at the fusion layer, if it is obtained before the fusion layer uses it, it must likewise be stored first.
In addition, in the present application, a module may be added to the text segmentation model, where the module is used to obtain the segment length feature of the current text unit, and of course, a module used to obtain the segment length feature of the current text unit may also be added to the word coding layer, the attention layer or the fusion layer, which is not limited in this application.
S302: and carrying out word coding on the segmentation features of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit.
In the application, the word coding layer of the text segmentation model can perform word coding on word sequences in segmentation features of each text unit to obtain word meaning characterization of each text unit; and obtaining the clue word semantic representation of each text unit based on the word semantic representation of each text unit and clue word characteristics in the segmentation characteristics of each text unit; the word semantic representation and the cue word semantic representation serve as the semantic representation.
Specifically, in the present application, the word sequence $W_i=\{w_{i,1},w_{i,2},\dots,w_{i,m}\}$ of a text unit is processed with a word embedding method to obtain the word vectors of the word sequence, and the word semantic representations of the word sequence are then obtained with a bidirectional LSTM structure, where $i$ is the index of the text unit and $m$ is the word index within the $i$-th text unit.
The word semantic representation of the word at time $t$ is the concatenation $h_{i,t}=[\overrightarrow{h}_{i,t};\overleftarrow{h}_{i,m-t}]$, where $\overrightarrow{h}_{i,t}$ denotes the hidden output produced at time $t$ after the forward LSTM reads the current text unit in order, and $\overleftarrow{h}_{i,m-t}$ denotes the hidden output produced at time $m-t$ after the backward LSTM reads the current text unit in reverse order; the two are spliced together as the word semantic representation of the word at time $t$.
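A minimal PyTorch sketch of this word-encoding step: word embeddings followed by a bidirectional LSTM whose forward and backward hidden states are concatenated at every position. The vocabulary size, dimensions and batching are illustrative assumptions, not values from this application.

```python
import torch
import torch.nn as nn

class WordEncoder(nn.Module):
    def __init__(self, vocab_size=30000, embed_dim=128, hidden_dim=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True makes the per-position output the concatenation of
        # the forward and backward hidden states
        self.bilstm = nn.LSTM(embed_dim, hidden_dim,
                              batch_first=True, bidirectional=True)

    def forward(self, word_ids):
        # word_ids: (batch, m) word indices of one text unit's word sequence
        embedded = self.embedding(word_ids)      # (batch, m, embed_dim)
        word_repr, _ = self.bilstm(embedded)     # (batch, m, 2 * hidden_dim)
        return word_repr                         # word semantic representations
```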
In the application, based on the position information of the clue word in each text unit, the word semantic representation at the corresponding position can be extracted from the word semantic representations of the text unit and used as the clue word semantic representation of that text unit; this representation therefore carries information about both the clue word itself and its context.
Specifically, from the word semantic representations of each text unit, the representation at the position indicated by the clue word feature is extracted and used as the clue word semantic representation. If the clue word feature of the current text unit is {-1}, the clue word semantic representation is set to $u_{pad}$, a parameter trained together with the text segmentation model.
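A minimal PyTorch sketch of this extraction step, selecting the per-word representation at the clue word's position and falling back to a trained padding vector $u_{pad}$ when the feature is -1; shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClueWordSelector(nn.Module):
    def __init__(self, repr_dim=256):
        super().__init__()
        # u_pad: trained parameter used when a text unit contains no clue word
        self.u_pad = nn.Parameter(torch.zeros(repr_dim))

    def forward(self, word_repr, clue_positions):
        # word_repr: (batch, m, repr_dim); clue_positions: (batch,), -1 means "none"
        batch_size = word_repr.size(0)
        safe_pos = clue_positions.clamp(min=0)                    # avoid indexing with -1
        gathered = word_repr[torch.arange(batch_size), safe_pos]  # (batch, repr_dim)
        no_clue = (clue_positions < 0).unsqueeze(-1)              # (batch, 1)
        return torch.where(no_clue, self.u_pad.expand_as(gathered), gathered)
```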
S303: and performing attention calculation on the semantic representation of each text unit by using the attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit.
The attention layer is used for compressing semantic representation of the text unit to obtain sentence representation with fixed length. In the application, attention calculation can be performed on the word semantic representation to obtain a first sentence semantic representation of each text unit; and performing attention computation on the clue word semantic representation to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
Specifically, different word representations may be given different attention weights, and the weighted sum yields the sentence representation. The attention mechanism consists of a query, a key and a value: the correlation between the query and the key is computed and used as the attention weight applied to the value, so that the important content among the values receives attention. In this scheme, global vectors $u_w$ and $u_{clue}$ are introduced as queries and are shared by all text units of all texts. They act as simple queries over the words of a text unit, respectively expressing "which words in the current sentence are important?" and "which clue words in the current sentence are important?". In this scheme, the key and the value are identical, namely the representations of the word sequence or of the clue words in the current text unit.
The first sentence semantic representation is computed as follows: the attention weight of the word at time $t$ is obtained from the correlation between its word semantic representation and the global query $u_w$, with $W_a$ and $b_a$ as model training parameters, and the word semantic representations are summed with these weights to give the first sentence semantic representation.
The second sentence semantic representation can be calculated over the clue word semantic representations in the same manner, with $u_{clue}$ as the query.
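A minimal PyTorch sketch of attention with a global trainable query in the spirit of this layer; the additive scoring form with a tanh projection ($W_a$, $b_a$) is an assumed realization, not the exact formula of this application, and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class GlobalQueryAttention(nn.Module):
    def __init__(self, repr_dim=256):
        super().__init__()
        self.proj = nn.Linear(repr_dim, repr_dim)         # W_a, b_a
        self.query = nn.Parameter(torch.randn(repr_dim))  # global query u_w (or u_clue)

    def forward(self, token_repr):
        # token_repr: (batch, m, repr_dim) word or clue-word semantic representations
        keys = torch.tanh(self.proj(token_repr))          # (batch, m, repr_dim)
        scores = keys.matmul(self.query)                  # (batch, m)
        weights = torch.softmax(scores, dim=-1)           # attention weights
        sentence_repr = (weights.unsqueeze(-1) * token_repr).sum(dim=1)
        return sentence_repr                              # (batch, repr_dim)
```

The same module, instantiated once with its own query for the word representations and once for the clue word representations, yields the first and second sentence semantic representations.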
S304: and fusing the semantic representation of the sentence of each text unit and the segment length characteristic of each text unit by utilizing a fusion layer of the text segmentation model to obtain the complete word representation of the sentence of each text unit.
In the present application, the fusion layer provides auxiliary information from the clue word features and the segment length features when the semantic information alone is ambiguous. As the paragraph length grows, the text segmentation model receives a correspondingly stronger incentive to segment, which keeps the paragraph lengths of the whole text relatively uniform; under this paragraph-length constraint, the clue word features guide the model to segment at sentence boundaries that carry clear clue word information, leading to a more reasonable segmentation result.
In the application, when the fusion layer fuses the sentence semantic representation of each text unit with the segment length feature of that text unit, the adopted fusion strategy may be a gate structure, where $W_g$ and $b_g$ are model training parameters and the result is the complete word representation of the sentence of the text unit.
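A minimal PyTorch sketch of such a gate-style fusion; the exact gate formula is not reproduced in the text, so this particular form (a sigmoid gate computed from the concatenated inputs with parameters $W_g$, $b_g$) is an illustrative assumption.

```python
import torch
import torch.nn as nn

class GateFusion(nn.Module):
    def __init__(self, repr_dim=256):
        super().__init__()
        self.gate = nn.Linear(repr_dim + 1, repr_dim)   # W_g, b_g

    def forward(self, sentence_repr, seg_len_feat):
        # sentence_repr: (batch, repr_dim); seg_len_feat: (batch, 1), value in (0, 1)
        g = torch.sigmoid(self.gate(torch.cat([sentence_repr, seg_len_feat], dim=-1)))
        return g * sentence_repr                        # gated "complete" representation
```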
S305: and carrying out sentence coding on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit.
Because the complete word representation of each text unit's sentence obtained in S304 relates only to the current text unit, the present application may use an LSTM to model the correlations between text units and learn the semantic transitions between them, from which the segmentation boundaries are obtained. The complete word representation of the sentence of each text unit is passed through the LSTM structure to obtain the sentence representation of each text unit.
Different LSTM structures may be adopted in different scenarios. For example, in a real-time speech recognition scenario, information about future sentences is not available, so a forward LSTM structure is used to extract deep semantic information; in an offline scenario, such as a question-answering system, a bidirectional LSTM structure may be adopted to obtain a richer sentence representation.
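A minimal PyTorch sketch of this sentence-encoding step: the per-unit complete representations are treated as a sequence over text units and passed through an LSTM, unidirectional for the real-time case and bidirectional for the offline case; dimensions are illustrative assumptions.

```python
import torch.nn as nn

class SentenceEncoder(nn.Module):
    def __init__(self, repr_dim=256, hidden_dim=256, bidirectional=False):
        super().__init__()
        # bidirectional=False suits the real-time scenario (no future context);
        # set bidirectional=True for the offline scenario
        self.lstm = nn.LSTM(repr_dim, hidden_dim,
                            batch_first=True, bidirectional=bidirectional)

    def forward(self, unit_reprs):
        # unit_reprs: (batch, num_text_units, repr_dim) complete representations
        sentence_reprs, _ = self.lstm(unit_reprs)
        return sentence_reprs      # one sentence representation per text unit
```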
S306: and calculating the sentence representation of each text unit and the sentence representation at the last moment by using an output layer of the text segmentation model to obtain an output result of whether each text unit is a segmentation boundary of the text to be segmented.
In the application, the output layer of the text segmentation model can calculate the sentence representation of each text unit and the sentence representation at the last moment through a softmax function to obtain an output result of whether each text unit is the segmentation boundary of the text to be segmented.
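A minimal PyTorch sketch of such an output layer; concatenating the current and previous sentence representations before a linear layer and softmax is an assumed realization, not the exact computation of this application.

```python
import torch
import torch.nn as nn

class BoundaryOutput(nn.Module):
    def __init__(self, repr_dim=256):
        super().__init__()
        self.classifier = nn.Linear(2 * repr_dim, 2)   # two classes: boundary / not boundary

    def forward(self, current_repr, previous_repr):
        # current_repr, previous_repr: (batch, repr_dim) sentence representations
        logits = self.classifier(torch.cat([current_repr, previous_repr], dim=-1))
        probs = torch.softmax(logits, dim=-1)
        return probs[:, 1]        # probability that the current unit is a boundary
```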
The text segmentation apparatus disclosed in the embodiments of the present application will be described below, and the text segmentation apparatus described below and the text segmentation method described above may be referred to correspondingly to each other.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a text segmentation device according to an embodiment of the present application. As shown in fig. 3, the text segmentation apparatus may include:
a segmented text acquisition unit 11 for acquiring a text to be segmented;
a segmentation feature acquisition unit 12, configured to acquire a segmentation feature of each text unit in the text to be segmented;
a segmentation boundary determining unit 13, configured to determine a segmentation boundary of the text to be segmented according to a segmentation feature of each text unit;
A segmentation unit 14, configured to segment the text to be segmented based on a segmentation boundary of the text to be segmented.
Optionally, the segmentation feature acquisition unit includes:
the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristic of each text unit in the text to be segmented, and the word sequence and clue word characteristic of each text unit are used as the segmentation characteristic of each text unit.
Optionally, the word sequence and clue word feature acquiring unit includes:
the word segmentation unit is used for segmenting each text unit to obtain a word sequence of each text unit;
a clue word determining unit configured to determine clue words from the word sequence based on a predetermined clue word set;
the clue word position information acquisition unit is used for acquiring the position information of the clue word in the corresponding text unit;
and the clue word characteristic generating unit is used for generating clue word characteristics of each text unit according to the position information of clue words in each text unit.
Optionally, the segmentation boundary determination unit includes:
the model application unit is used for inputting the segmentation features of each text unit into a text segmentation model to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is trained with the segmentation features of each text unit in a training text as training samples and with the segmentation boundary annotation information of the training text as sample labels.
Optionally, the text segmentation model includes:
a word coding layer, an attention layer, a fusion layer, a sentence coding layer and an output layer.
Optionally, the model application unit includes:
a segment length feature obtaining unit, configured to obtain a segment length feature of each text unit using a text segmentation model, where the segment length feature is used to represent segment length information from a last segmentation boundary of each text unit to each text unit;
the word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculating unit is used for carrying out attention calculation on the semantic representation of each text unit by utilizing the attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing semantic representation of sentences of each text unit and segment length characteristics of each text unit by utilizing a fusion layer of the text segmentation model to obtain complete word representation of the sentences of each text unit;
the sentence coding unit is used for carrying out sentence coding on the complete word representation of the sentence of each text unit by utilizing the sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
And the calculation unit is used for calculating, with the output layer of the text segmentation model, the sentence representation of each text unit together with the sentence representation of the previous time step, to obtain an output result indicating whether each text unit is the segmentation boundary of the text to be segmented.
Optionally, the word encoding unit includes:
the first word coding subunit is used for carrying out word coding on word sequences in the segmentation characteristics of each text unit to obtain word meaning representation of each text unit;
the second word coding subunit is used for obtaining the clue word semantic representation of each text unit based on the word semantic representation of each text unit and clue word features in the segmentation features of each text unit; the word semantic representation and the cue word semantic representation serve as the semantic representation.
Optionally, the attention calculating unit includes:
the first attention computing unit is used for performing attention computation on the word semantic representation to obtain a first sentence semantic representation of each text unit;
the second attention calculating unit is used for carrying out attention calculation on the clue word semantic representation to obtain a second sentence semantic representation of each text unit, and the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representation.
Fig. 4 is a block diagram of a hardware structure of a text segmentation device according to an embodiment of the present application, and referring to fig. 4, the hardware structure of the text segmentation device may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
in the embodiment of the application, the number of the processor 1, the communication interface 2, the memory 3 and the communication bus 4 is at least one, and the processor 1, the communication interface 2 and the memory 3 complete communication with each other through the communication bus 4;
processor 1 may be a central processing unit CPU, or a specific integrated circuit ASIC (Application Specific Integrated Circuit), or one or more integrated circuits configured to implement embodiments of the present invention, etc.;
the memory 3 may comprise a high-speed RAM memory, and may further comprise a non-volatile memory (non-volatile memory) or the like, such as at least one magnetic disk memory;
wherein the memory stores a program, the processor is operable to invoke the program stored in the memory, the program operable to:
acquiring a text to be segmented;
obtaining segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
And segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
The embodiment of the application also provides a storage medium, which may store a program adapted to be executed by a processor, the program being configured to:
acquiring a text to be segmented;
obtaining segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented according to the segmentation characteristics of each text unit;
and segmenting the text to be segmented based on the segmentation boundary of the text to be segmented.
Alternatively, the refinement function and the extension function of the program may be described with reference to the above.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In the present specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, and identical and similar parts between the embodiments are all enough to refer to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (8)

1. A text segmentation method, comprising:
acquiring a text to be segmented;
obtaining segmentation characteristics of each text unit in the text to be segmented;
determining a segmentation boundary of the text to be segmented by using a text segmentation model according to the segmentation characteristics of each text unit; the text segmentation model comprises: a word coding layer, an attention layer, a fusion layer, a sentence coding layer and an output layer;
segmenting the text to be segmented based on the segmentation boundary of the text to be segmented;
the obtaining the segmentation feature of each text unit in the text to be segmented includes:
acquiring word sequences and clue word characteristics of each text unit in the text to be segmented, wherein the word sequences and clue word characteristics of each text unit are used as segmentation characteristics of each text unit;
the process of determining the segmentation boundary of the text to be segmented comprises the following steps:
acquiring a segment length characteristic of each text unit by using a text segmentation model, wherein the segment length characteristic of each text unit is used for representing segment length information from a last segmentation boundary predicted by the text segmentation model to a current text unit;
word coding is carried out on the segmentation features of each text unit by utilizing a word coding layer of the text segmentation model, so that semantic representation of each text unit is obtained;
performing attention calculation on the semantic representation of each text unit by using an attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
utilizing a fusion layer of the text segmentation model to fuse semantic representation of sentences of each text unit and segment length characteristics of each text unit to obtain complete word representation of the sentences of each text unit;
Performing sentence coding on the complete word representation of the sentence of each text unit by utilizing a sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and calculating, with an output layer of the text segmentation model, the sentence representation of each text unit together with the sentence representation of the previous time step, to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented.
2. The method of claim 1, wherein obtaining word sequences and clue word features for each text unit in the text to be segmented comprises:
word segmentation is carried out on each text unit, and a word sequence of each text unit is obtained;
determining clue words from the word sequence based on a predetermined clue word set;
acquiring position information of the clue words in the corresponding text units;
and generating clue word characteristics of each text unit according to the position information of clue words in each text unit.
3. The method of claim 1, wherein determining the segmentation boundary of the text to be segmented using a text segmentation model based on the segmentation characteristics of each text unit comprises:
inputting the segmentation features of each text unit into a text segmentation model to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented; the text segmentation model is trained with the segmentation features of each text unit in a training text as training samples and with the segmentation boundary annotation information of the training text as sample labels.
4. The method of claim 1, wherein the word encoding the segmentation feature of each text unit using the word encoding layer of the text segmentation model results in a semantic representation of each text unit, comprising:
word coding is carried out on word sequences in the segmentation features of each text unit, so that word meaning characterization of each text unit is obtained;
obtaining the clue word semantic representation of each text unit based on the word semantic representation of each text unit and clue word characteristics in the segmentation characteristics of each text unit; the word semantic representation and the cue word semantic representation serve as the semantic representation.
5. The method of claim 4, wherein performing attention computation on the semantic representation of each text unit using the attention layer of the text segmentation model to obtain the sentence semantic representation of each text unit comprises:
performing attention computation on the word semantic representation to obtain a first sentence semantic representation of each text unit;
and performing attention computation on the clue word semantic representation to obtain a second sentence semantic representation of each text unit, wherein the first sentence semantic representation and the second sentence semantic representation are used as the sentence semantic representations.
6. A text segmentation apparatus, comprising:
the segmentation text acquisition unit is used for acquiring a text to be segmented;
the segmentation feature acquisition unit is used for acquiring segmentation features of each text unit in the text to be segmented;
the segmentation boundary determining unit is used for determining the segmentation boundary of the text to be segmented by utilizing a text segmentation model according to the segmentation characteristics of each text unit; the text segmentation model comprises: a word coding layer, an attention layer, a fusion layer, a sentence coding layer and an output layer;
the segmentation unit is used for segmenting the text to be segmented based on the segmentation boundary of the text to be segmented;
wherein the segmentation feature acquisition unit includes:
the word sequence and clue word characteristic acquisition unit is used for acquiring the word sequence and clue word characteristic of each text unit in the text to be segmented, and the word sequence and clue word characteristic of each text unit are used as the segmentation characteristic of each text unit;
the division boundary determination unit includes:
the text segmentation model comprises a segment length feature acquisition unit, a text segmentation unit and a text segmentation unit, wherein the segment length feature acquisition unit is used for acquiring the segment length feature of each text unit by using the text segmentation model, and the segment length feature of each text unit is used for representing the segment length information from the last segmentation boundary predicted by the text segmentation model to the current text unit;
The word coding unit is used for carrying out word coding on the segmentation characteristics of each text unit by utilizing a word coding layer of the text segmentation model to obtain semantic representation of each text unit;
the attention calculating unit is used for carrying out attention calculation on the semantic representation of each text unit by utilizing the attention layer of the text segmentation model to obtain the semantic representation of the sentence of each text unit;
the fusion unit is used for fusing semantic representation of sentences of each text unit and segment length characteristics of each text unit by utilizing a fusion layer of the text segmentation model to obtain complete word representation of the sentences of each text unit;
the sentence coding unit is used for carrying out sentence coding on the complete word representation of the sentence of each text unit by utilizing the sentence coding layer of the text segmentation model to obtain the sentence representation of each text unit;
and the calculation unit is used for calculating the sentence representation of each text unit and the sentence representation at the previous moment by using the output layer of the text segmentation model to obtain an output result indicating whether each text unit is a segmentation boundary of the text to be segmented.
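For illustration only, the Python (PyTorch) sketch below walks a single text unit through the five layers named in claim 6: word coding, attention, fusion with the segment length feature, sentence coding, and the output decision against the sentence representation at the previous moment. The BiGRU/GRUCell choices, the embedding of the segment length feature, all dimensions, and the sigmoid boundary decision are assumptions of this sketch, not details fixed by the patent.

import torch
import torch.nn as nn

class TextSegmentationSketch(nn.Module):
    def __init__(self, vocab_size=10000, dim=128, max_seg_len=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.word_encoder = nn.GRU(dim, dim, batch_first=True,
                                   bidirectional=True)          # word coding layer
        self.attn_score = nn.Linear(2 * dim, 1)                 # attention layer
        self.seg_len_embed = nn.Embedding(max_seg_len, dim)     # segment length feature
        self.fusion = nn.Linear(2 * dim + dim, dim)             # fusion layer
        self.sent_encoder = nn.GRUCell(dim, dim)                # sentence coding layer
        self.output = nn.Linear(2 * dim, 1)                     # output layer

    def forward(self, unit_token_ids, seg_len, prev_sent_rep):
        # unit_token_ids: (1, seq_len) token ids of one text unit
        # seg_len:        (1,) number of units since the last predicted boundary
        # prev_sent_rep:  (1, dim) sentence representation at the previous moment
        word_reps, _ = self.word_encoder(self.embed(unit_token_ids))
        weights = torch.softmax(self.attn_score(word_reps), dim=1)
        sent_semantic = (weights * word_reps).sum(dim=1)         # sentence semantic representation
        fused = torch.tanh(self.fusion(torch.cat(
            [sent_semantic, self.seg_len_embed(seg_len)], dim=-1)))  # fused with segment length
        sent_rep = self.sent_encoder(fused, prev_sent_rep)       # sentence representation of this unit
        boundary_prob = torch.sigmoid(self.output(
            torch.cat([sent_rep, prev_sent_rep], dim=-1)))       # boundary decision for this unit
        return boundary_prob, sent_rep

A driver loop would feed text units one at a time, carry sent_rep forward as the next prev_sent_rep, and reset seg_len to zero whenever boundary_prob exceeds a chosen threshold.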
7. A text segmentation device comprising a memory and a processor;
the memory is used for storing a program;
the processor is configured to execute the program to implement the respective steps of the text segmentation method as set forth in any one of claims 1 to 5.
8. A readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the text segmentation method according to any one of claims 1 to 5.
CN201911398383.1A 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium Active CN111199150B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911398383.1A CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911398383.1A CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium

Publications (2)

Publication Number Publication Date
CN111199150A CN111199150A (en) 2020-05-26
CN111199150B true CN111199150B (en) 2024-04-16

Family

ID=70744535

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911398383.1A Active CN111199150B (en) 2019-12-30 2019-12-30 Text segmentation method, related device and readable storage medium

Country Status (1)

Country Link
CN (1) CN111199150B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101430680B (en) * 2008-12-31 2011-01-19 阿里巴巴集团控股有限公司 Segmentation sequence selection method and system for non-word boundary marking language text
US20140214402A1 (en) * 2013-01-25 2014-07-31 Cisco Technology, Inc. Implementation of unsupervised topic segmentation in a data communications environment
US9734820B2 (en) * 2013-11-14 2017-08-15 Nuance Communications, Inc. System and method for translating real-time speech using segmentation based on conjunction locations

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004157337A (en) * 2002-11-06 2004-06-03 Nippon Telegr & Teleph Corp <Ntt> Method, device and program for topic boundary determination
US9141867B1 (en) * 2012-12-06 2015-09-22 Amazon Technologies, Inc. Determining word segment boundaries
CN107229609A (en) * 2016-03-25 2017-10-03 佳能株式会社 Method and apparatus for splitting text
CN107480143A (en) * 2017-09-12 2017-12-15 山东师范大学 Dialogue topic dividing method and system based on context dependence
WO2019214145A1 (en) * 2018-05-10 2019-11-14 平安科技(深圳)有限公司 Text sentiment analyzing method, apparatus and storage medium
CN109829151A (en) * 2018-11-27 2019-05-31 国网浙江省电力有限公司 A kind of text segmenting method based on layering Di Li Cray model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Liu Yao; Shuai Yuanhua; Gong Xingwei; Huang Yi. Research on a text segmentation method based on domain ontology. Computer Science. 2018, (Issue 01), 128-132. *

Also Published As

Publication number Publication date
CN111199150A (en) 2020-05-26

Similar Documents

Publication Publication Date Title
CN110795543B (en) Unstructured data extraction method, device and storage medium based on deep learning
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN107329949B (en) Semantic matching method and system
WO2023060795A1 (en) Automatic keyword extraction method and apparatus, and device and storage medium
CN114020862B (en) Search type intelligent question-answering system and method for coal mine safety regulations
JP6813591B2 (en) Modeling device, text search device, model creation method, text search method, and program
JP6677419B2 (en) Voice interaction method and apparatus
CN111539197B (en) Text matching method and device, computer system and readable storage medium
CN110704621A (en) Text processing method and device, storage medium and electronic equipment
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN111291177A (en) Information processing method and device and computer storage medium
Chen et al. Automatic key term extraction from spoken course lectures using branching entropy and prosodic/semantic features
CN111859940B (en) Keyword extraction method and device, electronic equipment and storage medium
CN113158687B (en) Semantic disambiguation method and device, storage medium and electronic device
JP2019082931A (en) Retrieval device, similarity calculation method, and program
CN114661881A (en) Event extraction method, device and equipment based on question-answering mode
CN108664464B (en) Method and device for determining semantic relevance
CN110852071A (en) Knowledge point detection method, device, equipment and readable storage medium
CN111199150B (en) Text segmentation method, related device and readable storage medium
CN115033683B (en) Digest generation method, digest generation device, digest generation equipment and storage medium
CN111428487A (en) Model training method, lyric generation method, device, electronic equipment and medium
CN113704466B (en) Text multi-label classification method and device based on iterative network and electronic equipment
CN112580365B (en) Chapter analysis method, electronic equipment and storage device
Das et al. Automatic semantic segmentation and annotation of MOOC lecture videos
CN115129843A (en) Dialog text abstract extraction method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant