CN110674635B - Method and device for dividing text paragraphs - Google Patents

Method and device for dividing text paragraphs

Info

Publication number
CN110674635B
CN110674635B
Authority
CN
China
Prior art keywords
values; calculating; words; natural; similarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910927810.4A
Other languages
Chinese (zh)
Other versions
CN110674635A (en)
Inventor
李敏
吴家鸣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Miaobi Intelligent Technology Co ltd
Original Assignee
Beijing Miaobi Intelligent Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Miaobi Intelligent Technology Co ltd filed Critical Beijing Miaobi Intelligent Technology Co ltd
Priority to CN201910927810.4A priority Critical patent/CN110674635B/en
Publication of CN110674635A publication Critical patent/CN110674635A/en
Application granted granted Critical
Publication of CN110674635B publication Critical patent/CN110674635B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a method and a device for text paragraph division. One embodiment of the method comprises the following steps: calculating the similarity value among the natural sections, calculating the average value of the similarity values, and dividing large sections based on a threshold value; respectively carrying out word characteristic value calculation on the large paragraphs, and calculating entropy of n common words with the largest characteristic values in the large paragraphs; and sliding threshold values based on the average value of the similarity values, respectively calculating the entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division. The embodiment is beneficial to improving the accuracy of determining the threshold value of the paragraph similarity, thereby improving the accuracy of dividing the text paragraphs.

Description

Method and device for dividing text paragraphs
Technical Field
The present application relates to the field of text processing, and in particular, to a method and apparatus for text paragraph segmentation.
Background
With the rapid growth of the information age, information from various channels is growing at a remarkable rate. When processing large amounts of information, one typically needs to group natural segments into larger paragraphs and then classify them.
Traditionally, large paragraphs are divided manually, which is clearly disadvantageous in terms of efficiency and cost. In recent years, the TextTiling algorithm has been widely used to calculate the similarity between natural segments and to gather highly similar natural segments into the same large paragraph. However, when the TextTiling algorithm is used to divide large paragraphs, the threshold value of the similarity cannot be determined accurately, which in turn affects the accuracy of the division.
Therefore, in the conventional text paragraph division, there still remains a problem to be solved.
Disclosure of Invention
The purpose of the application is to provide an improved method and device for dividing text paragraphs, which are used for solving the technical problems of low accuracy of dividing large paragraphs, difficult determination of threshold values and the like.
In a first aspect, the present application provides a method for text paragraph segmentation, the method comprising:
s1, calculating a similarity value between natural sections, calculating an average value of the similarity values, and dividing large sections based on a threshold value;
s2, respectively carrying out word characteristic value calculation on the large paragraphs, and calculating entropy of n common words with the largest characteristic values in the large paragraphs;
s3, sliding threshold values based on average values of the similarity values, respectively calculating entropy of the common words through different threshold values, and taking a division result with minimum entropy as an optimal division.
In some embodiments, the method further comprises, prior to step S1: S0, preprocessing the text to be processed by removing its html tags and then performing word segmentation and stop-word removal on the text to reduce noise interference.
In some embodiments, step S1 specifically further includes: calculating similarity values among the natural sections through a cosine similarity algorithm:
s = A·B / (|A| × |B|)
where s is the similarity, A·B is the inner product of the natural-segment vectors, and |A| × |B| is the product of their vector lengths.
In some embodiments, step S1 of the method may specifically further include: calculating the similarity value between the natural segments through a simhash algorithm: converting the words in each natural segment into hash values through a hash algorithm while calculating the tf-idf values of the words, and using the tf-idf values as weights to compute a weight value for each word, a bit contributing a negative weight value when the corresponding hash bit is 0 and a positive weight value when it is 1; then merging the hash values and weight values of the words in each natural segment, converting each accumulated bit back to 0 or 1 (namely, a sum greater than 0 becomes 1, and a sum less than or equal to 0 becomes 0), and calculating the Hamming distance between the natural segments.
In some embodiments, the term feature value calculation of step S2 of the method specifically includes:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{i,j} = tf_{i,j} · idf_i    (3)
wherein n_{i,j} is the number of occurrences of the i-th word in the j-th paragraph, Σ_k n_{k,j} is the total number of words in the j-th paragraph, |D| is the number of natural segments contained in the divided paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural segments containing the i-th word.
In some embodiments, the step S3 of the method calculates entropy of the common words, specifically including:
p_m = Σ_{i=1}^{n} p_i    (4)
E = -p_m · log p_m    (5)
wherein, in formula (4), n is the total number of common words and p_i is the probability of one of the common words; in formula (5), E is the entropy of the common words and p_m is the combined probability of the common words.
In a second aspect, the present application provides an apparatus for text paragraph segmentation, the apparatus comprising: the similarity calculation module is used for calculating similarity values among the natural sections, calculating an average value of the similarity values, and dividing large sections based on a threshold value; the word processing module is used for respectively carrying out word characteristic value calculation on the large paragraphs and calculating entropy of n common words with the largest characteristic values in the large paragraphs; and the optimal selection module is used for sliding threshold values based on the average value of the similarity values, calculating entropy of the common words respectively through different threshold values, and taking the division result with the minimum entropy as the optimal division.
In some embodiments, the apparatus further comprises: the preprocessing module is used for preprocessing the text to be processed, removing html tags of the text, and then performing word segmentation and stop word removal on the text so as to reduce noise interference.
In some embodiments, the apparatus further comprises: the cosine similarity algorithm module is used for calculating similarity values among the natural sections through a cosine similarity algorithm:
s = A·B / (|A| × |B|)
where s is the similarity, A·B is the inner product of the natural-segment vectors, and |A| × |B| is the product of their vector lengths.
In some embodiments, the apparatus further comprises: the simhash algorithm module, used for calculating similarity values between the natural segments through a simhash algorithm: converting the words in each natural segment into hash values through a hash algorithm while calculating the tf-idf values of the words, and using the tf-idf values as weights to compute a weight value for each word, a bit contributing a negative weight value when the corresponding hash bit is 0 and a positive weight value when it is 1; then merging the hash values and weight values of the words in each natural segment, converting each accumulated bit back to 0 or 1 (namely, a sum greater than 0 becomes 1, and a sum less than or equal to 0 becomes 0), and calculating the Hamming distance between the natural segments.
In some embodiments, the word processing module of the apparatus comprises: the characteristic value calculation module is used for calculating word characteristic values of large paragraphs respectively:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{i,j} = tf_{i,j} · idf_i    (3)
wherein n_{i,j} is the number of occurrences of the i-th word in the j-th paragraph, Σ_k n_{k,j} is the total number of words in the j-th paragraph, |D| is the number of natural segments contained in the divided paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural segments containing the i-th word.
In some embodiments, the word processing module of the apparatus comprises: the entropy calculation module, used for calculating the entropy of the n common words with the largest characteristic values in the large paragraph:
p_m = Σ_{i=1}^{n} p_i    (4)
E = -p_m · log p_m    (5)
wherein, in formula (4), n is the total number of common words and p_i is the probability of one of the common words; in formula (5), E is the entropy of the common words and p_m is the combined probability of the common words.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
According to the method and the device for dividing the text paragraphs, the large paragraphs are divided by calculating the similarity values among the natural paragraphs, the average value of the similarity values and the threshold value, the word characteristic values and the entropy of the common words are calculated for the large paragraphs respectively, threshold value sliding is carried out on the basis of the average value of the similarity values, the entropy of the common words is calculated respectively through different threshold values, and the division result with the minimum entropy is taken as the optimal division. The embodiment is beneficial to improving the accuracy of determining the threshold value of the paragraph similarity, thereby improving the accuracy of dividing the text paragraphs.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings, in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow chart of one embodiment of a method for text paragraph segmentation according to the present application;
FIG. 3 is a flow chart of yet another embodiment of a method for text paragraph segmentation according to the present application;
FIG. 4 is a schematic structural diagram of one embodiment of an apparatus for text paragraph segmentation according to the present application;
fig. 5 is a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The present application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that, in the case of no conflict, the embodiments and features in the embodiments may be combined with each other. The present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
FIG. 1 illustrates an exemplary system architecture 100 to which the methods for text paragraph segmentation of embodiments of the present application may be applied.
As shown in fig. 1, a system architecture 100 may include a data server 101, a network 102, and a primary server 103. Network 102 is the medium used to provide communication links between data server 101 and primary server 103. Network 102 may include various connection types such as wired, wireless communication links, or fiber optic cables, among others.
The main server 103 may be a server providing various services, such as a data processing server processing information uploaded to the data server 101. The data processing server may process the received event information and store the processing results (e.g., element information sets, tags) in association with the event information base.
It should be noted that, the method for text paragraph splitting provided in the embodiments of the present application is generally performed by the main server 103, and accordingly, the apparatus for text paragraph splitting is generally disposed in the main server 103.
The data server and the main server may be hardware or software. In the case of hardware, the system may be implemented as a distributed server cluster formed by a plurality of servers, or as a single server. In the case of software, it may be implemented as a plurality of software or software modules (e.g., software or software modules for providing distributed services) or as a single software or software module.
It should be understood that the numbers of data servers, networks, and main servers in fig. 1 are merely illustrative. There may be any number of data servers, networks, and main servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 is shown for one embodiment of a method for text paragraph segmentation according to the present application. The method comprises the following steps:
step S1, calculating the similarity value among the natural sections, calculating the average value of the similarity values, and dividing the large sections based on a threshold value. The threshold value is a value that moves within a certain range of the average value, that is:
t=a±σ
where t is a threshold, a is an average value, and σ is a movement constant.
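As a non-limiting illustration (not part of the original disclosure; all identifiers here are hypothetical), step S1's boundary placement can be sketched in Python, assuming similarities[i] holds the similarity between natural segment i and segment i+1:

```python
def split_into_large_paragraphs(similarities, threshold):
    """Return the indices of natural segments that start a new large paragraph.

    similarities[i] is the similarity between natural segment i and i+1,
    so a value below the threshold marks a paragraph boundary after segment i.
    """
    boundaries = [0]
    for i, s in enumerate(similarities):
        if s < threshold:
            boundaries.append(i + 1)  # segment i+1 opens a new large paragraph
    return boundaries

sims = [0.8, 0.2, 0.7, 0.1, 0.9]
avg = sum(sims) / len(sims)  # the average value a in t = a ± σ
print(split_into_large_paragraphs(sims, avg))  # → [0, 2, 4]
```

The two low-similarity gaps (0.2 and 0.1) fall below the average 0.54 and become the paragraph boundaries.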
In some optional implementations of the present embodiment, the similarity value between the natural segments is calculated by a cosine similarity algorithm:
s = A·B / (|A| × |B|)
where s is the similarity, A·B is the inner product of the natural-segment vectors, and |A| × |B| is the product of their vector lengths.
In a specific embodiment, cosine similarity measures the similarity between two vectors by the cosine of the angle between them. The cosine of a 0-degree angle is 1, the cosine of any other angle is not greater than 1, and the minimum value is -1. The cosine of the angle between two vectors therefore indicates whether the two vectors point in approximately the same direction: when the two vectors have the same direction, the cosine similarity is 1; when the angle between them is 90 degrees, the cosine similarity is 0; and when they point in diametrically opposite directions, the cosine similarity is -1. The result is thus independent of the lengths of the vectors and depends only on their direction. Cosine similarity is usually used in positive space, where it gives a value between 0 and 1. Note that these bounds apply to vector spaces of any dimension, and cosine similarity is most commonly used in high-dimensional positive spaces. For example, in information retrieval, each term is assigned a different dimension, and a document is represented by a vector whose value in each dimension corresponds to the frequency with which that term appears in the document. Cosine similarity can thus measure how similar two documents, two words, or two paragraphs are in terms of their topics.
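The cosine computation can be sketched as follows (illustrative only; the bag-of-words representation and function name are assumptions, not the patent's implementation):

```python
import math
from collections import Counter

def cosine_similarity(words_a, words_b):
    """Cosine similarity of two natural segments as bag-of-words vectors."""
    a, b = Counter(words_a), Counter(words_b)
    dot = sum(a[w] * b[w] for w in set(a) | set(b))      # A·B, the inner product
    norm_a = math.sqrt(sum(v * v for v in a.values()))   # |A|
    norm_b = math.sqrt(sum(v * v for v in b.values()))   # |B|
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)                       # s = A·B / (|A| × |B|)

print(cosine_similarity(["text", "paragraph"], ["text", "division"]))  # → 0.5
```

Because word counts are non-negative, the result here stays in the 0 to 1 range described above.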
In some optional implementations of this embodiment, the similarity value between the natural segments is calculated by a simhash algorithm: converting the words in each natural segment into hash values through a hash algorithm while calculating the tf-idf values of the words, and using the tf-idf values as weights to compute a weight value for each word, a bit contributing a negative weight value when the corresponding hash bit is 0 and a positive weight value when it is 1; then merging the hash values and weight values of the words in each natural segment, converting each accumulated bit back to 0 or 1 (namely, a sum greater than 0 becomes 1, and a sum less than or equal to 0 becomes 0), and calculating the Hamming distance between the natural segments.
In a particular embodiment, a hash algorithm maps a binary value of arbitrary length to a shorter, fixed-length binary value called a hash value. A hash value is a unique and extremely compact representation of a piece of data: if a piece of plaintext is hashed and even a single letter of it is changed, the new hash will be a different value. Finding two different inputs that hash to the same value is computationally infeasible, so the hash value of a piece of data can verify its integrity; hash algorithms are therefore commonly used for fast lookup and in encryption algorithms.
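A minimal simhash sketch of the scheme described above (illustrative only; MD5 serves merely as a stand-in hash function, and the helper names are hypothetical):

```python
import hashlib

def simhash(weighted_words, bits=64):
    """Simhash fingerprint of one natural segment.

    weighted_words maps each word to its tf-idf weight; bit i of a word's
    hash contributes +weight when set and -weight when clear, and each
    accumulated bit is converted back to 1 if its sum is > 0, else 0.
    """
    v = [0.0] * bits
    for word, weight in weighted_words.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            v[i] += weight if (h >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if v[i] > 0)

def hamming_distance(x, y):
    """Number of differing bits between two simhash fingerprints."""
    return bin(x ^ y).count("1")

a = simhash({"text": 2.0, "paragraph": 1.5})
b = simhash({"text": 2.0, "division": 1.0})
print(hamming_distance(a, b))  # small distance suggests similar segments
```

Natural segments whose fingerprints have a small Hamming distance can then be gathered into the same large paragraph.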
And S2, respectively carrying out word characteristic value calculation on the large paragraphs, and calculating entropy of n common words with the largest characteristic values in the large paragraphs.
In this embodiment, the term feature value calculation specifically includes:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{i,j} = tf_{i,j} · idf_i    (3)
wherein n_{i,j} is the number of occurrences of the i-th word in the j-th paragraph, Σ_k n_{k,j} is the total number of words in the j-th paragraph, |D| is the number of natural segments contained in the divided paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural segments containing the i-th word.
In particular embodiments, tf-idf is a statistical method used to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases proportionally with the number of times it appears in the document, but decreases inversely with the frequency with which it appears in the corpus. Various forms of tf-idf weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. The main idea of tf-idf is: if a word or phrase appears at a high frequency in one article and rarely appears in other articles, the word or phrase is considered to have good category discrimination and is suitable for calculating word feature values.
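Formulas (1)-(3) can be sketched as follows (illustrative only; for simplicity each token list stands in for one paragraph, and the function name is hypothetical):

```python
import math
from collections import Counter

def tfidf_per_paragraph(paragraphs):
    """tf-idf of every word in every paragraph.

    paragraphs is a list of token lists; |D| is taken as the number of
    paragraphs, a stand-in for counting natural segments as in (1)-(3).
    """
    doc_count = len(paragraphs)                     # |D|
    df = Counter()                                  # |{ j : t_i ∈ d_j }|
    for words in paragraphs:
        df.update(set(words))
    scores = []
    for words in paragraphs:
        counts, total = Counter(words), len(words)  # n_{i,j}, Σ_k n_{k,j}
        scores.append({w: (c / total) * math.log(doc_count / df[w])
                       for w, c in counts.items()})
    return scores

scores = tfidf_per_paragraph([["text", "text", "paragraph"],
                              ["paragraph", "division"]])
print(scores[0]["text"])  # positive: "text" is distinctive for paragraph 0
```

A word appearing in every paragraph (here "paragraph") gets idf = log(1) = 0 and so carries no discriminative weight.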
In this embodiment, the calculation of entropy of the common word specifically includes:
p_m = Σ_{i=1}^{n} p_i    (4)
E = -p_m · log p_m    (5)
wherein, in formula (4), n is the total number of common words and p_i is the probability of one of the common words; in formula (5), E is the entropy of the common words and p_m is the combined probability of the common words.
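Formulas (4) and (5) amount to the following sketch (illustrative only; the function name is hypothetical):

```python
import math

def common_word_entropy(probabilities):
    """Entropy of the common words per formulas (4) and (5):
    p_m sums the probabilities of the n common words, E = -p_m·log(p_m)."""
    p_m = sum(probabilities)        # formula (4)
    if p_m <= 0:
        return 0.0                  # no common-word mass, define E = 0
    return -p_m * math.log(p_m)     # formula (5)

print(common_word_entropy([0.10, 0.20, 0.05]))
```

A division whose common words concentrate probability mass (p_m near 1) yields entropy near 0, which is why the minimum-entropy division is taken as optimal.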
And S3, sliding threshold values based on the average value of the similarity values, respectively calculating entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
In a specific embodiment, the sliding may be performed within ±20% of the threshold value with 3% of the threshold value as a step length, so that the threshold value after each sliding is used for dividing the large paragraphs again.
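The sliding search can be sketched as follows (illustrative only; the entropy callback and the step handling are assumptions, not the patent's implementation):

```python
def best_threshold(avg, entropy_of_division, step=0.03, span=0.20):
    """Slide the threshold around the average similarity and keep the
    value whose division gives the minimum common-word entropy.

    entropy_of_division(t) is assumed to re-divide the paragraphs with
    threshold t and return the entropy of the resulting common words.
    int(span / step) steps each way covers roughly the ±20% range.
    """
    best_t, best_e = None, float("inf")
    n_steps = int(span / step)
    for k in range(-n_steps, n_steps + 1):
        t = avg * (1 + k * step)      # threshold after sliding k steps
        e = entropy_of_division(t)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

# toy entropy surface whose minimum lies at the average itself
t, e = best_threshold(0.5, lambda x: (x - 0.5) ** 2)
print(t)  # → 0.5
```

Each candidate threshold triggers a fresh division (steps S1 and S2), and the division with minimum entropy wins.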
According to the method provided by the embodiment of the application, the large paragraphs are divided by calculating the similarity values among the natural paragraphs, the average value of the similarity values and the threshold value, the word characteristic values and the entropy of the common words are calculated for the large paragraphs respectively, meanwhile, threshold value sliding is carried out based on the average value of the similarity values, the entropy of the common words is calculated respectively through different threshold values, and the division result with the minimum entropy is taken as the optimal division. The embodiment is beneficial to improving the accuracy of determining the threshold value of the paragraph similarity, thereby improving the accuracy of dividing the text paragraphs.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a method for text paragraph segmentation according to the present application is shown. The method comprises the following steps:
s0, preprocessing the text to be processed, removing html tags of the text, and then performing word segmentation and stop word removal on the text.
In this embodiment, if the text is obtained from the internet, the text will have an html tag, which will affect extraction of the text abstract, and the text is preprocessed to remove the html tag, so that a subsequent abstract algorithm is convenient to obtain the text abstract.
In this embodiment, word segmentation of the text serves as the data basis of the text abstract. The text can be segmented with a dictionary-based word segmentation algorithm, a statistics-based machine learning algorithm, a combined word segmentation algorithm, or the like, of which dictionary-based word segmentation is the most widely applied and the fastest. Researchers have long been optimizing string-matching methods, for example maximum-length settings, string storage and lookup, and word-list organization such as trie index trees and hash indexes. For statistics-based machine learning, algorithms such as HMM, CRF, SVM, and deep learning are commonly used for text word segmentation; the Stanford and HanLP word segmentation tools, for example, are based on the CRF algorithm. Taking CRF as an example, the basic idea is to label and train on Chinese characters; it considers not only the occurrence frequency of words but also their context, so it has good learning ability and therefore performs well on the recognition of ambiguous words and unregistered words. Common word segmenters combine a machine learning algorithm with a dictionary, which improves word segmentation accuracy on one hand and domain adaptability on the other.
In this embodiment, stop words are words that are automatically filtered out before or after processing text (or natural language data) in order to save storage space and improve search efficiency in information retrieval. Stop words are entered manually rather than generated automatically, and together they form a stop-word list; however, no single stop-word list applies to all tools, and some tools even explicitly avoid using stop words in order to support phrase searching. Stop words are used very widely and can be seen everywhere on the Internet. For example, a word such as "Web" appears on almost every website, so a search engine cannot return truly relevant results for it; it does little to narrow the search scope and also reduces search efficiency. Stop words also include modal particles, adverbs, prepositions, conjunctions, and the like, which generally have no clear meaning of their own and only take effect when placed in a complete sentence, such as the common word "in".
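Step S0 as a whole can be sketched as follows (illustrative only; the regular expressions and stop-word list are assumptions, and real Chinese text would need a proper word segmenter such as the CRF- or dictionary-based tools mentioned above):

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in"}  # illustrative stop-word list

def preprocess(text):
    """Strip html tags, tokenize, and drop stop words.

    Word-character tokenization stands in here for a real Chinese
    word segmenter; the html handling is a simple tag-removal regex.
    """
    text = re.sub(r"<[^>]+>", " ", text)        # remove html tags
    tokens = re.findall(r"\w+", text.lower())   # naive word segmentation
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("<p>The division of text paragraphs</p>"))
# → ['division', 'text', 'paragraphs']
```

The cleaned token lists then feed the similarity calculation of step S1.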
Step S1, calculating the similarity value among the natural sections, calculating the average value of the similarity values, and dividing the large sections based on a threshold value.
In this embodiment, the step S1 is substantially identical to the step S1 in the embodiment corresponding to fig. 2, and will not be described herein.
And S2, respectively carrying out word characteristic value calculation on the large paragraphs, and calculating entropy of n common words with the largest characteristic values in the large paragraphs.
In this embodiment, the step S2 is substantially identical to the step S2 in the embodiment corresponding to fig. 2, and will not be described herein.
And S3, sliding threshold values based on the average value of the similarity values, respectively calculating entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
In this embodiment, the step S3 is substantially identical to the step S3 in the embodiment corresponding to fig. 2, and will not be described here again.
As can be seen from fig. 3, the flow 300 of the method for text paragraph segmentation in this embodiment highlights the earlier text processing steps compared to the corresponding embodiment of fig. 2. Therefore, the scheme described in the embodiment can reduce a large amount of noise generated during text division, accurately extract each key word, and is beneficial to improving paragraph division efficiency.
With further reference to fig. 4, as an implementation of the method shown in the foregoing figures, the present application provides an embodiment of an apparatus for text paragraph segmentation, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus is particularly applicable to various electronic devices.
As shown in fig. 4, the apparatus 400 for text paragraph division of the present embodiment includes: the similarity calculation module 401 is configured to calculate a similarity value between the natural segments, calculate an average value of the similarity values, and perform paragraph division based on a threshold value; the word processing module 402 is configured to perform word feature value calculation on the large paragraph, and calculate entropy of n common words with the largest feature values in the large paragraph; the optimal selection module 403 is configured to perform threshold value sliding based on the average value of the similarity values, calculate entropy of the common word according to different threshold values, and take the division result with the minimum entropy as the optimal division.
In some optional implementations of this embodiment, the apparatus 400 may further include: the preprocessing module (not shown in the figure) is used for preprocessing the text to be processed, removing html tags of the text, and then performing word segmentation processing and stop word removal on the text.
In some optional implementations of this embodiment, the apparatus 400 may further include: a cosine similarity algorithm module (not shown in the figure) configured to calculate a similarity value between the natural segments by the cosine similarity algorithm:
s = A·B / (|A| × |B|)
where s is the similarity, A·B is the inner product of the natural-segment vectors, and |A| × |B| is the product of their vector lengths.
In some optional implementations of this embodiment, the apparatus 400 may further include: a simhash algorithm module (not shown in the figure) configured to calculate the similarity value between the natural segments by the simhash algorithm: converting the words in each natural segment into hash values through a hash algorithm while calculating the tf-idf values of the words, and using the tf-idf values as weights to compute a weight value for each word, a bit contributing a negative weight value when the corresponding hash bit is 0 and a positive weight value when it is 1; then merging the hash values and weight values of the words in each natural segment, converting each accumulated bit back to 0 or 1 (namely, a sum greater than 0 becomes 1, and a sum less than or equal to 0 becomes 0), and calculating the Hamming distance between the natural segments.
In some optional implementations of the present embodiment, the word processing module 402 may include: a feature value calculation module (not shown in the figure) configured to perform word feature value calculation on the large paragraphs, respectively:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)
idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)
tfidf_{i,j} = tf_{i,j} · idf_i    (3)
wherein n_{i,j} is the number of occurrences of the i-th word in the j-th paragraph, Σ_k n_{k,j} is the total number of words in the j-th paragraph, |D| is the number of natural segments contained in the divided paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural segments containing the i-th word.
In some optional implementations of the present embodiment, the word processing module 402 may include: an entropy calculation module (not shown in the figure) configured to calculate the entropy of the n common words with the largest eigenvalues in the large paragraph:
p_m = Σ_{i=1}^{n} p_i    (4)
E = -p_m · log p_m    (5)
wherein, in formula (4), n is the total number of common words and p_i is the probability of one of the common words; in formula (5), E is the entropy of the common words and p_m is the combined probability of the common words.
According to the device provided by the embodiment of the application, the large paragraphs are divided by calculating the similarity values among the natural paragraphs, the average value of the similarity values and the threshold value, the word characteristic values and the entropy of the common words are calculated for the large paragraphs respectively, meanwhile, threshold value sliding is carried out based on the average value of the similarity values, the entropy of the common words is calculated respectively through different threshold values, and the division result with the minimum entropy is taken as the optimal division. The embodiment is beneficial to improving the accuracy of determining the threshold value of the paragraph similarity, thereby improving the accuracy of dividing the text paragraphs.
Referring now to FIG. 5, a schematic diagram of a computer system 500 suitable for use in implementing the electronic device of an embodiment of the present application is shown. The electronic device shown in fig. 5 is only an example and should not impose any limitation on the functionality and scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output portion 507 including a Liquid Crystal Display (LCD) or the like, a speaker or the like; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. The drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 510 as needed so that a computer program read therefrom is mounted into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flowcharts. In such an embodiment, the computer program may be downloaded and installed from a network via the communication portion 509, and/or installed from the removable media 511. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium described in the present application may be a computer readable signal medium or a computer readable storage medium, or any combination of the two. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer-readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In the present application, however, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, with computer-readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in one or more programming languages, including object oriented programming languages such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable storage medium that may be included in the electronic device described in the above embodiments; or may exist alone without being incorporated into the electronic device. The computer-readable storage medium carries one or more programs that, when executed by the electronic device, cause the electronic device to: calculating the similarity value among the natural sections, calculating the average value of the similarity values, and dividing large sections based on a threshold value; respectively carrying out word characteristic value calculation on the large paragraphs, and calculating entropy of n common words with the largest characteristic values in the large paragraphs; and sliding threshold values based on the average value of the similarity values, respectively calculating the entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
The foregoing description is only of the preferred embodiments of the present application and is presented as an explanation of the principles of the technology employed. Persons skilled in the art will appreciate that the scope of the invention referred to in this application is not limited to the specific combinations of features described above, and is intended to cover other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the invention, for example, embodiments formed by interchanging the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (9)

1. A method for text paragraph segmentation, the method comprising the steps of:
s1: calculating the similarity value among the natural sections, calculating the average value of the similarity values, and dividing large sections based on a threshold value;
s2: respectively carrying out word characteristic value calculation on the large paragraphs, and calculating entropy of n common words with the largest characteristic values in the large paragraphs;
s3: sliding threshold values based on the average value of the similarity values, respectively calculating entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division;
the method for calculating the entropy of the common words specifically comprises the following steps:
p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

wherein in formula (4), n is the total number of words having a common meaning and p_i is the probability of one of the common words; in formula (5), E is the entropy of the common words and p_m is the probability of the common words.
2. The method for text paragraph segmentation according to claim 1, characterized in that the step S1 of the method is preceded by the further step of:
s0: preprocessing a text to be processed, removing html tags of the text, and then performing word segmentation and stop word removal on the text.
3. The method for text paragraph segmentation according to claim 1, wherein the step S1 specifically further comprises:
calculating similarity values among the natural sections through a cosine similarity algorithm:
s = (A · B) / (|A| × |B|)

where s is the similarity, A · B is the vector inner product between the natural sections, and |A| × |B| is the product of the vector lengths of the natural sections.
4. The method for text paragraph segmentation according to claim 1, wherein the step S1 specifically further comprises:
calculating a similarity value between the natural sections through a simhash algorithm:
converting the words in each natural section into hash values through a hash algorithm, calculating the tf-idf values of the words at the same time, and using the tf-idf values as the weight values of the words;
and merging the hash value and the weight value of the words in each natural segment, and then calculating the Hamming distance between the natural segments.
5. The method for text paragraph segmentation according to claim 1, wherein the word characteristic value calculation in step S2 specifically comprises:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

wherein n_{i,j} is the number of occurrences of the i-th word in the j-th paragraph, Σ_k n_{k,j} is the total number of words in the j-th paragraph, |D| is the number of natural paragraphs contained in the divided paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural paragraphs containing the i-th word.
6. An apparatus for text paragraph segmentation, the apparatus comprising:
the similarity calculation module is used for calculating similarity values among the natural sections, calculating an average value of the similarity values and dividing large sections based on a threshold value;
the word processing module is used for respectively carrying out word characteristic value calculation on the large paragraphs and calculating entropy of n common words with the largest characteristic values in the large paragraphs;
the optimal selection module is used for sliding threshold values based on the average value of the similarity values, calculating entropy of the common words through different threshold values respectively, and taking the division result with the minimum entropy as optimal division;
the word processing module comprises an entropy calculation module, is used for calculating entropy of common words, and specifically comprises the following steps:
p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

wherein in formula (4), n is the total number of words having a common meaning and p_i is the probability of one of the common words; in formula (5), E is the entropy of the common words and p_m is the probability of the common words.
7. The apparatus for text paragraph segmentation as defined in claim 6 wherein the apparatus further comprises:
the preprocessing module is used for preprocessing a text to be processed, removing html tags of the text, and then performing word segmentation and stop word removal on the text;
the cosine similarity algorithm module is used for calculating similarity values among the natural sections through a cosine similarity algorithm:
s = (A · B) / (|A| × |B|)

where s is the similarity, A · B is the vector inner product between the natural segments, and |A| × |B| is the product of the vector lengths of the natural segments;
the simhash algorithm module is used for calculating similarity values among the natural segments through a simhash algorithm: converting words in each natural section into hash values through a hash algorithm, simultaneously calculating tf-idf values of the words, and calculating weight values of the words based on the tf-idf values as weight values; and merging the hash value and the weight value of the words in each natural segment, and then calculating the Hamming distance between the natural segments.
8. The apparatus for text paragraph segmentation of claim 6 wherein the word processing module further comprises:
the characteristic value calculation module is used for calculating word characteristic values of the large paragraphs respectively:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

wherein n_{i,j} is the number of occurrences of the i-th word in the j-th paragraph, Σ_k n_{k,j} is the total number of words in the j-th paragraph, |D| is the number of natural paragraphs contained in the divided paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural paragraphs containing the i-th word.
9. A computer readable storage medium, on which a computer program is stored, characterized in that the program, when being executed by a processor, implements the method according to any of claims 1-5.
CN201910927810.4A 2019-09-27 2019-09-27 Method and device for dividing text paragraphs Active CN110674635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910927810.4A CN110674635B (en) 2019-09-27 2019-09-27 Method and device for dividing text paragraphs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910927810.4A CN110674635B (en) 2019-09-27 2019-09-27 Method and device for dividing text paragraphs

Publications (2)

Publication Number Publication Date
CN110674635A CN110674635A (en) 2020-01-10
CN110674635B true CN110674635B (en) 2023-04-25

Family

ID=69079885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910927810.4A Active CN110674635B (en) 2019-09-27 2019-09-27 Method and device for dividing text paragraphs

Country Status (1)

Country Link
CN (1) CN110674635B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268581B (en) * 2021-07-20 2021-10-08 北京世纪好未来教育科技有限公司 Topic generation method and device
CN113673255B (en) * 2021-08-25 2023-06-30 北京市律典通科技有限公司 Text function area splitting method and device, computer equipment and storage medium
CN117688138B (en) * 2024-02-02 2024-04-09 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004145790A (en) * 2002-10-28 2004-05-20 Advanced Telecommunication Research Institute International Segmentation method of document and computer program therefor
CN102004724A (en) * 2010-12-23 2011-04-06 哈尔滨工业大学 Document paragraph segmenting method
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN104063502A (en) * 2014-07-08 2014-09-24 中南大学 WSDL semi-structured document similarity analyzing and classifying method based on semantic model
WO2018121145A1 (en) * 2016-12-30 2018-07-05 北京国双科技有限公司 Method and device for vectorizing paragraph
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN108874921A (en) * 2018-05-30 2018-11-23 广州杰赛科技股份有限公司 Extract method, apparatus, terminal device and the storage medium of text feature word
CN109753647A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The partitioning method and device of paragraph
CN109993216A (en) * 2019-03-11 2019-07-09 深兰科技(上海)有限公司 A kind of file classification method and its equipment based on K arest neighbors KNN

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
AU2000276398A1 (en) * 2000-09-30 2002-04-15 Intel Corporation (A Corporation Of Delaware) A method and apparatus for determining text passage similarity

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004145790A (en) * 2002-10-28 2004-05-20 Advanced Telecommunication Research Institute International Segmentation method of document and computer program therefor
CN102004724A (en) * 2010-12-23 2011-04-06 哈尔滨工业大学 Document paragraph segmenting method
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN104063502A (en) * 2014-07-08 2014-09-24 中南大学 WSDL semi-structured document similarity analyzing and classifying method based on semantic model
WO2018121145A1 (en) * 2016-12-30 2018-07-05 北京国双科技有限公司 Method and device for vectorizing paragraph
CN109753647A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The partitioning method and device of paragraph
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN108874921A (en) * 2018-05-30 2018-11-23 广州杰赛科技股份有限公司 Extract method, apparatus, terminal device and the storage medium of text feature word
CN109993216A (en) * 2019-03-11 2019-07-09 深兰科技(上海)有限公司 A kind of file classification method and its equipment based on K arest neighbors KNN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
A Logical Paragraph Division Method Based on Semantic Features and Its Application; Zhu Zhenfang et al.; Computer Science (No. 12); full text *
Research on Topic Area Division in Automatic Summarization Based on Cluster Analysis; Wei Guiying et al.; China Management Informationization (No. 09); full text *
Answer Recognition Based on Paragraph Division of Forum Topics; Wang Baoxun et al.; Acta Automatica Sinica (No. 01); full text *
Research on Text Segmentation Based on Domain Ontology; Liu Yao et al.; Computer Science (No. 01); full text *

Also Published As

Publication number Publication date
CN110674635A (en) 2020-01-10

Similar Documents

Publication Publication Date Title
US11151177B2 (en) Search method and apparatus based on artificial intelligence
US10795939B2 (en) Query method and apparatus
CN107168954B (en) Text keyword generation method and device, electronic equipment and readable storage medium
JP6161679B2 (en) Search engine and method for realizing the same
CN106960030B (en) Information pushing method and device based on artificial intelligence
CN110674635B (en) Method and device for dividing text paragraphs
CN108170650B (en) Text comparison method and text comparison device
CN109241277B (en) Text vector weighting method and system based on news keywords
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN107885717B (en) Keyword extraction method and device
CN111737997A (en) Text similarity determination method, text similarity determination equipment and storage medium
CN112988753B (en) Data searching method and device
CN110874532A (en) Method and device for extracting keywords of feedback information
US20180137098A1 (en) Methods and systems for providing universal portability in machine learning
CN114880447A (en) Information retrieval method, device, equipment and storage medium
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN111651675B (en) UCL-based user interest topic mining method and device
CN113268560A (en) Method and device for text matching
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
CN111859079B (en) Information searching method, device, computer equipment and storage medium
CN110390011B (en) Data classification method and device
CN111191011B (en) Text label searching and matching method, device, equipment and storage medium
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN110705287B (en) Method and system for generating text abstract
CN113239149B (en) Entity processing method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant