CN110674635A - Method and device for text paragraph division - Google Patents

Publication number: CN110674635A (application CN201910927810.4A; granted as CN110674635B)
Authority: CN (China)
Original language: Chinese (zh)
Prior art keywords: values, calculating, word, words, similarity
Inventors: 李敏, 吴家鸣
Assignee (original and current): Beijing Miaobi Intelligent Technology Co Ltd
Legal status: Granted; active (the legal status is an assumption by Google Patents, not a legal conclusion)

Abstract

The application discloses a method and a device for text paragraph division. One embodiment of the method comprises: calculating similarity values between all natural paragraphs, computing the average of those values, and dividing the text into large paragraphs based on a threshold; calculating word feature values for each large paragraph and computing the entropy of the n common words with the largest feature values; and sliding the threshold around the average similarity value, recalculating the common-word entropy at each threshold, and taking the division with the minimum entropy as the optimal division. This embodiment helps improve the accuracy with which the paragraph-similarity threshold is determined, and thereby the accuracy of text paragraph division.

Description

Method and device for text paragraph division
Technical Field
The present application relates to the field of text processing, and in particular, to a method and an apparatus for text paragraph segmentation.
Background
As the information age grows explosively, information from all kinds of channels accumulates at an alarming rate. When handling large amounts of information, people often need to group natural paragraphs into larger paragraphs before classifying them.
Traditionally, large paragraphs are divided manually, which is clearly inefficient and costly. In recent years, the TextTiling algorithm has been widely used to calculate the similarity of natural paragraphs and gather highly similar ones into the same large paragraph. However, when dividing large paragraphs with TextTiling, the similarity threshold is hard to determine accurately, which in turn limits the accuracy of the division.
Text paragraph division therefore still leaves problems to be solved.
Disclosure of Invention
The application aims to provide an improved method and device for text paragraph division that address the low accuracy of large-paragraph division and the difficulty of determining the similarity threshold.
In a first aspect, the present application provides a method for text paragraph segmentation, the method comprising:
S1, calculating similarity values between the natural paragraphs, computing the average of those values, and dividing the text into large paragraphs based on a threshold;
S2, calculating word feature values for each large paragraph, and computing the entropy of the n common words with the largest feature values in the large paragraphs;
and S3, sliding the threshold based on the average of the similarity values, computing the common-word entropy at each threshold, and taking the division with the minimum entropy as the optimal division.
In some embodiments, the method further comprises, before step S1: S0, preprocessing the text to be processed by removing its HTML tags, then performing word segmentation and removing stop words to reduce noise.
In some embodiments, step S1 specifically further includes: calculating similarity values between the natural segments by a cosine similarity algorithm:
s = (A · B) / (|A| · |B|)

where s is the similarity, A · B is the inner product of the two natural-paragraph vectors, and |A| and |B| are their vector lengths.
In some embodiments, step S1 of the method may further include: calculating similarity values between the natural paragraphs by a simhash algorithm: the words of each natural paragraph are converted into hash values with a hash algorithm while each word's tf-idf value is computed as its weight; for each bit of a word's hash, the weight counts as negative when the bit is 0 and positive when the bit is 1; the signed weights of all words in a natural paragraph are summed bit by bit, the result is binarized (a bit sum greater than 0 becomes 1, otherwise 0), and the Hamming distance between the resulting fingerprints of the natural paragraphs is calculated.
In some embodiments, the word feature value calculation in step S2 of the method specifically includes:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

where n_{i,j} is the number of occurrences of the i-th word in the j-th large paragraph, Σ_k n_{k,j} is the total number of words in the j-th large paragraph, |D| is the number of natural paragraphs contained in the divided large paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural paragraphs containing the i-th word.
In some embodiments, calculating the entropy of the common words in step S3 of the method specifically includes:

p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

where, in equation (4), n is the total number of common words and p_i is the probability of the i-th common word; in equation (5), E is the entropy of the common words and p_m is their combined probability.
In a second aspect, the present application provides an apparatus for text paragraph division, the apparatus comprising: a similarity calculation module configured to calculate similarity values between the natural paragraphs, compute the average of those values, and divide the text into large paragraphs based on a threshold; a word processing module configured to calculate word feature values for each large paragraph and compute the entropy of the n common words with the largest feature values; and an optimal selection module configured to slide the threshold based on the average similarity value, compute the common-word entropy at each threshold, and take the division with the minimum entropy as the optimal division.
In some embodiments, the apparatus further comprises: the preprocessing module is used for preprocessing the text to be processed, removing html tags of the text, and then performing word segmentation processing and stop word removal on the text so as to reduce noise interference.
In some embodiments, the apparatus further comprises: the cosine similarity algorithm module is used for calculating similarity values among the natural segments through a cosine similarity algorithm:
s = (A · B) / (|A| · |B|)

where s is the similarity, A · B is the inner product of the two natural-paragraph vectors, and |A| and |B| are their vector lengths.
In some embodiments, the apparatus further comprises: a simhash algorithm module configured to calculate similarity values between the natural paragraphs by a simhash algorithm: the words of each natural paragraph are converted into hash values with a hash algorithm while each word's tf-idf value is computed as its weight; for each bit of a word's hash, the weight counts as negative when the bit is 0 and positive when the bit is 1; the signed weights of all words in a natural paragraph are summed bit by bit, the result is binarized (a bit sum greater than 0 becomes 1, otherwise 0), and the Hamming distance between the resulting fingerprints of the natural paragraphs is calculated.
In some embodiments, the word processing module of the apparatus comprises: a feature value calculation module configured to calculate the word feature values of each large paragraph:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

where n_{i,j} is the number of occurrences of the i-th word in the j-th large paragraph, Σ_k n_{k,j} is the total number of words in the j-th large paragraph, |D| is the number of natural paragraphs contained in the divided large paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural paragraphs containing the i-th word.
In some embodiments, the word processing module of the apparatus comprises: an entropy calculation module configured to calculate the entropy of the n common words with the largest feature values in the large paragraph:

p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

where, in equation (4), n is the total number of common words and p_i is the probability of the i-th common word; in equation (5), E is the entropy of the common words and p_m is their combined probability.
In a third aspect, the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method as described in any of the implementations of the first aspect.
According to the method and device for text paragraph division, large paragraphs are divided using the similarity values between natural paragraphs, the average of those values, and a threshold; word feature values and the entropy of the common words are then calculated for the large paragraphs; the threshold is slid around the average similarity value, the common-word entropy is recalculated at each threshold, and the division with the minimum entropy is taken as the optimal division. This helps improve the accuracy with which the paragraph-similarity threshold is determined, and thereby the accuracy of text paragraph division.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram for one embodiment of a method for text paragraph segmentation according to the present application;
FIG. 3 is a flow diagram of yet another embodiment of a method for text paragraph segmentation according to the present application;
FIG. 4 is a block diagram of one embodiment of an apparatus for text paragraph segmentation according to the present application;
FIG. 5 is a schematic block diagram of a computer system suitable for use in implementing an electronic device according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Fig. 1 illustrates an exemplary system architecture 100 to which the method for text paragraph segmentation of embodiments of the present application may be applied.
As shown in FIG. 1, system architecture 100 may include a data server 101, a network 102, and a host server 103. Network 102 serves as a medium for providing a communication link between data server 101 and host server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The main server 103 may be a server that provides various services, for example a data processing server that processes information uploaded by the data server 101. The data processing server can process received event information and store the processing results (such as element information sets and labels) in an event information base.
It should be noted that the method for text paragraph segmentation provided in the embodiment of the present application is generally performed by the host server 103, and accordingly, the apparatus for text paragraph segmentation is generally disposed in the host server 103.
The data server and the main server may each be hardware or software. As hardware, each can be implemented as a distributed cluster of multiple servers or as a single server. As software, each can be implemented as multiple software modules (for example, modules for providing distributed services) or as a single software module.
It should be understood that the numbers of data servers, networks, and main servers in fig. 1 are merely illustrative; there may be any number of each, as the implementation requires.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method applied to text paragraph segmentation in accordance with the present application is illustrated. The method comprises the following steps:
and step S1, calculating similarity values among the natural segments, then calculating the average value of the similarity values, and then performing large paragraph division based on the threshold value. The threshold value is a value that moves within a certain range of the average value, that is:
t=a±σ
where t is the threshold, a is the average, and σ is the moving constant.
In some optional implementation manners of this embodiment, the similarity value between the natural segments is calculated by a cosine similarity algorithm:
s = (A · B) / (|A| · |B|)

where s is the similarity, A · B is the inner product of the two natural-paragraph vectors, and |A| and |B| are their vector lengths.
In a specific embodiment, cosine similarity measures the similarity between two vectors as the cosine of the angle between them. The cosine of a 0-degree angle is 1, every other angle has a cosine of at most 1, and the minimum value is -1, so the cosine of the angle between two vectors indicates whether they point in roughly the same direction. When the two vectors have the same direction, the cosine similarity is 1; when their angle is 90 degrees, it is 0; when they point in completely opposite directions, it is -1. The result depends only on the directions of the vectors, not on their lengths. These bounds hold in vector spaces of any dimension, and cosine similarity is most often used in high-dimensional positive spaces, where its values lie between 0 and 1. In information retrieval, for example, each term is assigned its own dimension and a document is represented by a vector whose value in each dimension is the frequency of that term in the document; cosine similarity then gives the similarity of two documents, two words, or two paragraphs with respect to their subject matter.
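For illustration, the cosine similarity above can be sketched in Python over simple term-frequency vectors (the function name and the list-based vector representation are illustrative choices, not prescribed by the patent):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length term-frequency vectors."""
    dot = sum(x * y for x, y in zip(a, b))         # inner product A . B
    norm_a = math.sqrt(sum(x * x for x in a))      # vector length |A|
    norm_b = math.sqrt(sum(y * y for y in b))      # vector length |B|
    if norm_a == 0 or norm_b == 0:
        return 0.0                                 # convention for empty paragraphs
    return dot / (norm_a * norm_b)
```

Because term frequencies are non-negative, the value for two natural paragraphs always falls in [0, 1], matching the positive-space remark above.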
In some optional implementations of this embodiment, the similarity values between the natural paragraphs are calculated by a simhash algorithm: the words of each natural paragraph are converted into hash values with a hash algorithm while each word's tf-idf value is computed as its weight; for each bit of a word's hash, the weight counts as negative when the bit is 0 and positive when the bit is 1; the signed weights of all words in a natural paragraph are summed bit by bit, the result is binarized (a bit sum greater than 0 becomes 1, otherwise 0), and the Hamming distance between the resulting fingerprints of the natural paragraphs is calculated.
In a particular embodiment, a hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length, called the hash value. A hash value is a compact, practically unique representation of a piece of data: if a piece of plaintext is hashed and even a single letter of it is altered, the subsequent hash will produce a different value. Since it is computationally infeasible to find two different inputs that hash to the same value, a hash value can verify the integrity of the data, and hash algorithms are therefore typically used for fast lookup and in cryptographic algorithms.
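A minimal sketch of the simhash procedure described above; the choice of MD5 as the per-word hash, the 64-bit fingerprint width, and the dict of word-to-tf-idf-weight are illustrative assumptions, since the patent does not fix a particular hash function:

```python
import hashlib

def simhash(weighted_words, bits=64):
    """weighted_words: dict of word -> tf-idf weight.
    Each bit of a word's hash votes +weight (bit 1) or -weight (bit 0);
    the summed votes are binarized into a fingerprint."""
    votes = [0.0] * bits
    for word, weight in weighted_words.items():
        h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
        for i in range(bits):
            if (h >> i) & 1:
                votes[i] += weight   # bit 1 -> positive weight
            else:
                votes[i] -= weight   # bit 0 -> negative weight
    # binarize: a sum greater than 0 becomes 1, otherwise 0
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")
```

Similar natural paragraphs then yield fingerprints with a small Hamming distance.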
Step S2, performing word feature value calculation on the large paragraphs, and calculating the entropy of the n common words with the largest feature values in the large paragraphs.
In this embodiment, the word feature value calculation specifically includes:
tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

where n_{i,j} is the number of occurrences of the i-th word in the j-th large paragraph, Σ_k n_{k,j} is the total number of words in the j-th large paragraph, |D| is the number of natural paragraphs contained in the divided large paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural paragraphs containing the i-th word.
In particular embodiments, tf-idf is a statistical method for evaluating how important a word is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Search engines often apply some form of tf-idf weighting to measure or rank the relevance between a document and a user query. The main idea of tf-idf is: if a word or phrase appears frequently in one article but rarely in others, it is considered to have good discriminating power and is therefore suitable for computing word feature values.
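Equations (1) to (3) can be sketched as follows. For simplicity this sketch treats each large paragraph as one document when computing |D| and the document frequency; that is an illustrative simplification, since the patent defines these counts over natural paragraphs:

```python
import math
from collections import Counter

def tfidf_scores(paragraphs):
    """paragraphs: list of token lists, one per large paragraph.
    Returns, per paragraph, a dict word -> tf-idf following (1)-(3)."""
    n_docs = len(paragraphs)                  # |D| (simplified: large paragraphs)
    df = Counter()                            # |{ j : t_i in d_j }|
    for tokens in paragraphs:
        df.update(set(tokens))
    scores = []
    for tokens in paragraphs:
        counts = Counter(tokens)              # n_{i,j}
        total = len(tokens)                   # sum_k n_{k,j}
        scores.append({w: (c / total) * math.log(n_docs / df[w])
                       for w, c in counts.items()})
    return scores
```

Taking the n words with the largest scores per paragraph then gives the common words used in step S3.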
In this embodiment, calculating the entropy of the common word specifically includes:
p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

where, in equation (4), n is the total number of common words and p_i is the probability of the i-th common word; in equation (5), E is the entropy of the common words and p_m is their combined probability.
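Equations (4) and (5) amount to only a few lines; the handling of an empty or zero-probability list below is an added convention, not specified in the text:

```python
import math

def common_word_entropy(probs):
    """probs: probabilities p_i of the n common words.
    Sums them into p_m (equation 4) and returns E = -p_m * log(p_m)
    (equation 5)."""
    p_m = sum(probs)                 # equation (4)
    if p_m <= 0.0:
        return 0.0                   # convention: zero mass -> zero entropy
    return -p_m * math.log(p_m)      # equation (5)
```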
And step S3, performing threshold value sliding based on the average value of the similarity values, respectively calculating the entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
In a specific embodiment, the threshold can be slid within ±20% of its initial value in steps of 3%, and the large paragraphs are re-divided with the threshold obtained after each step.
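The sliding search of step S3 might look like the following sketch, where `split` (step S1's division at a given threshold) and `entropy` (step S2's common-word entropy of a division) are hypothetical callables standing in for the earlier steps:

```python
def best_division(similarities, split, entropy):
    """Slide the threshold within roughly +/-20% of the average similarity
    in 3% steps (the range and step quoted in the text), re-divide at each
    threshold, and keep the division with the smallest common-word entropy."""
    avg = sum(similarities) / len(similarities)
    best_div, best_e = None, float("inf")
    for pct in range(-20, 21, 3):            # -20%, -17%, ..., +19%
        t = avg * (1 + pct / 100)            # t = a +/- sigma
        division = split(t)
        e = entropy(division)
        if e < best_e:
            best_div, best_e = division, e
    return best_div, best_e
```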
In the method provided by the above embodiment of the present application, large paragraphs are divided using the similarity values between natural paragraphs, the average of those values, and a threshold; word feature values and the entropy of the common words are calculated for the large paragraphs; the threshold is slid around the average similarity value, the common-word entropy is recalculated at each threshold, and the division with the minimum entropy is taken as the optimal division. This helps improve the accuracy with which the paragraph-similarity threshold is determined, and thereby the accuracy of text paragraph division.
With further reference to FIG. 3, a flow 300 of yet another embodiment of a method for text paragraph segmentation in accordance with the present application is illustrated. The method comprises the following steps:
and step S0, preprocessing the text to be processed, removing the html tag of the text, and then performing word segmentation processing and stop word removal on the text.
In this embodiment, text obtained from the Internet may contain HTML tags, which would interfere with extracting the text abstract; the text is therefore preprocessed to remove the HTML tags so that the abstract can be obtained conveniently by a subsequent summarization algorithm.
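A rough sketch of the tag-removal step; the regular-expression approach is an illustrative shortcut, and a production pipeline would prefer a real HTML parser:

```python
import re

def strip_html(text):
    """Drop anything that looks like an HTML tag, then collapse runs of
    whitespace left behind by the removed tags."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", no_tags).strip()
```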
In this embodiment, the text is segmented into words as the data basis of the text abstract. Word segmentation can be based on a dictionary algorithm, a statistics-based machine learning algorithm, a combined algorithm, and so on; dictionary-based segmentation is the most widely used and the fastest. Researchers have long optimized string-matching methods, for example by setting a maximum match length, tuning string storage and lookup, and organizing the word list with TRIE index trees, hash indexes, and the like. Statistics-based machine learning segmentation commonly uses algorithms such as HMM, CRF, SVM, and deep learning; the Stanford and HanLP segmentation tools, for example, are based on CRF. Taking CRF as an example, the basic idea is to train on labeled Chinese characters, considering not only word frequency but also context, which gives good learning ability and therefore good results on ambiguous and out-of-vocabulary words. Common segmenters combine a machine learning algorithm with a dictionary, which improves both segmentation accuracy and domain adaptability.
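The dictionary-based segmentation mentioned above can be illustrated with forward maximum matching; the dictionary, the default maximum word length, and the single-character fallback are all illustrative choices, and real segmenters (for example the CRF-based tools named above) are far more sophisticated:

```python
def max_match(text, dictionary, max_len=4):
    """Forward maximum matching: at each position take the longest
    dictionary entry (up to max_len characters), falling back to a
    single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens
```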
In the present embodiment, stop words are words or phrases that are automatically filtered out before or after processing text (or natural language data) in order to save storage space and improve search efficiency in information retrieval. Stop words are entered manually rather than generated automatically, and together they form a stop word list; there is, however, no definitive stop word list that suits every tool, and some tools even deliberately avoid stop word lists in order to support phrase searching. Stop words are everywhere on the Internet: a word such as "web" appears on nearly every website, so for such a word a search engine cannot guarantee truly relevant results, can hardly narrow the search scope, and loses efficiency. Stop words also include modal particles, adverbs, prepositions, and conjunctions, which usually have no definite meaning of their own and only play a role within a complete sentence, such as the common words "at" and "in".
And step S1, calculating similarity values among the natural segments, then calculating the average value of the similarity values, and then performing large paragraph division based on the threshold value.
In this embodiment, step S1 is substantially the same as step S1 in the corresponding embodiment of fig. 2, and is not repeated here.
Step S2, performing word feature value calculation on the large paragraphs, and calculating the entropy of the n common words with the largest feature values in the large paragraphs.
In this embodiment, step S2 is substantially the same as step S2 in the corresponding embodiment of fig. 2, and is not repeated here.
And step S3, performing threshold value sliding based on the average value of the similarity values, respectively calculating the entropy of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
In this embodiment, step S3 is substantially the same as step S3 in the corresponding embodiment of fig. 2, and is not repeated here.
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the flow 300 of the method for text paragraph division in this embodiment highlights the preceding text preprocessing step. The scheme described in this embodiment can therefore remove much of the noise that would otherwise arise during text division, extract each key word accurately, and help improve the efficiency of paragraph division.
With further reference to fig. 4, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for text paragraph segmentation, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus may be applied to various electronic devices in particular.
As shown in fig. 4, the apparatus 400 for text paragraph segmentation of the present embodiment includes: the similarity calculation module 401 is configured to calculate similarity values between the natural segments, calculate an average value of the similarity values, and perform large segment division based on a threshold value; the word processing module 402 is configured to perform word feature value calculation on the large paragraphs, and calculate entropies of n common words with the largest feature values in the large paragraphs; and an optimal selection module 403, configured to perform threshold sliding based on the average value of the similarity values, calculate entropies of the common words respectively through different thresholds, and take the division result with the smallest entropy as the optimal division.
In some optional implementations of this embodiment, the apparatus 400 may further include: and the preprocessing module (not shown in the figure) is used for preprocessing the text to be processed, removing the html tag of the text, and then performing word segmentation processing and stop word removal on the text.
In some optional implementations of this embodiment, the apparatus 400 may further include: a cosine similarity algorithm module (not shown in the figure) configured to calculate similarity values between the natural segments by a cosine similarity algorithm:
s = (A · B) / (|A| · |B|)

where s is the similarity, A · B is the inner product of the two natural-paragraph vectors, and |A| and |B| are their vector lengths.
In some optional implementations of this embodiment, the apparatus 400 may further include: a simhash algorithm module (not shown in the figure) configured to calculate similarity values between the natural paragraphs by a simhash algorithm: the words of each natural paragraph are converted into hash values with a hash algorithm while each word's tf-idf value is computed as its weight; for each bit of a word's hash, the weight counts as negative when the bit is 0 and positive when the bit is 1; the signed weights of all words in a natural paragraph are summed bit by bit, the result is binarized (a bit sum greater than 0 becomes 1, otherwise 0), and the Hamming distance between the resulting fingerprints of the natural paragraphs is calculated.
In some optional implementations of this embodiment, the word processing module 402 may include: a feature value calculation module (not shown in the figure) configured to calculate the word feature values of each large paragraph:

tf_{i,j} = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

where n_{i,j} is the number of occurrences of the i-th word in the j-th large paragraph, Σ_k n_{k,j} is the total number of words in the j-th large paragraph, |D| is the number of natural paragraphs contained in the divided large paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural paragraphs containing the i-th word.
In some optional implementations of this embodiment, the word processing module 402 may include: an entropy calculation module (not shown in the figure) configured to calculate the entropy of the n common words with the largest feature values in the large paragraph:

p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

where, in equation (4), n is the total number of common words and p_i is the probability of the i-th common word; in equation (5), E is the entropy of the common words and p_m is their combined probability.
The device provided by the above embodiment of the present application divides large paragraphs using the similarity values between natural paragraphs, the average of those values, and a threshold; calculates word feature values and the entropy of the common words for the large paragraphs; slides the threshold around the average similarity value, recalculates the common-word entropy at each threshold, and takes the division with the minimum entropy as the optimal division. This helps improve the accuracy with which the paragraph-similarity threshold is determined, and thereby the accuracy of text paragraph division.
Referring now to FIG. 5, shown is a block diagram of a computer system 500 suitable for use in implementing the electronic device of an embodiment of the present application. The electronic device shown in fig. 5 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a Liquid Crystal Display (LCD) and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card, a modem, or the like. The communication section 509 performs communication processing via a network such as the internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like, is mounted on the drive 510 as necessary, so that a computer program read therefrom is installed into the storage section 508 as necessary.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511. The computer program performs the above-described functions defined in the method of the present application when executed by the Central Processing Unit (CPU) 501.
It should be noted that the computer readable medium described herein can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: calculating similarity values among all natural segments, then calculating an average value of the similarity values, and then dividing large segments based on a threshold value; respectively calculating word characteristic values of the large paragraphs, and calculating the entropies of the n common words with the maximum characteristic values in the large paragraphs; and performing threshold value sliding on the basis of the average value of the similarity values, respectively calculating the entropies of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for text paragraph segmentation, the method comprising the steps of:
s1: calculating similarity values among all natural segments, then calculating an average value of the similarity values, and then dividing large segments based on a threshold value;
s2: respectively calculating word characteristic values of the large paragraphs, and calculating the entropies of the n common words with the maximum characteristic values in the large paragraphs;
s3: and performing threshold value sliding on the basis of the average value of the similarity values, respectively calculating the entropies of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
2. The method for text paragraph segmentation according to claim 1, wherein, before step S1, the method further comprises:
s0: preprocessing a text to be processed, removing an html tag of the text, and then performing word segmentation processing and stop word removal on the text.
3. The method for text paragraph segmentation according to claim 1, wherein the step S1 further includes:
calculating similarity values between the natural segments by a cosine similarity algorithm:

s = (A · B) / (|A| · |B|)

wherein s is the similarity, A · B is the vector inner product between the natural segments, and |A| · |B| is the product of the vector lengths of the natural segments.
4. The method for text paragraph segmentation according to claim 1, wherein the step S1 further includes:
calculating similarity values among the natural segments by a simhash algorithm:
converting the words in each natural segment into hash values by a hash algorithm, simultaneously calculating the tf-idf values of the words, and taking the tf-idf values as the weight values of the words;
and combining the hash value and the weight value of the words in each natural segment, and calculating the Hamming distance between the natural segments.
5. The method for text paragraph segmentation according to claim 1, wherein the word feature value calculation of step S2 specifically includes:
tf_i = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

wherein n_{i,j} is the number of occurrences of the ith word in the jth large paragraph, Σ_k n_{k,j} is the total number of words in the jth large paragraph, |D| is the number of natural segments contained in the divided large paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural segments containing the ith word.
6. The method for text paragraph segmentation according to claim 1, wherein the calculating entropy of the common words in step S3 specifically includes:
p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

wherein, in formula (4), n is the total number of common words and p_i is the probability of the ith common word; in formula (5), E is the entropy of the common words and p_m is the total probability of the common words.
7. An apparatus for text paragraph segmentation, the apparatus comprising:
the similarity calculation module is used for calculating similarity values among all natural segments, then calculating the average value of the similarity values, and then dividing large segments based on a threshold value;
the word processing module is used for respectively calculating word characteristic values of the large paragraphs and calculating the entropies of the n common words with the maximum characteristic values in the large paragraphs;
and the optimal selection module is arranged for performing threshold value sliding on the basis of the average value of the similarity values, respectively calculating the entropies of the common words through different threshold values, and taking the division result with the minimum entropy as the optimal division.
8. The apparatus for text paragraph segmentation as claimed in claim 7, wherein the apparatus further comprises:
and the preprocessing module is used for preprocessing the text to be processed, removing the html tag of the text, and then performing word segmentation processing and stop word removal on the text.
The cosine similarity algorithm module is used for calculating similarity values between the natural segments by a cosine similarity algorithm:

s = (A · B) / (|A| · |B|)

wherein s is the similarity, A · B is the vector inner product between the natural segments, and |A| · |B| is the product of the vector lengths of the natural segments.
The simhash algorithm module is used for calculating similarity values between the natural segments by a simhash algorithm: converting the words in each natural segment into hash values by a hash algorithm, simultaneously calculating the tf-idf values of the words, and taking the tf-idf values as the weight values of the words; and combining the hash values and the weight values of the words in each natural segment, and calculating the Hamming distance between the natural segments.
9. The apparatus of claim 7, wherein the word processing module further comprises:
and the characteristic value calculation module is used for respectively calculating the word characteristic values of the large paragraphs:
tf_i = n_{i,j} / Σ_k n_{k,j}    (1)

idf_i = log( |D| / |{ j : t_i ∈ d_j }| )    (2)

tfidf_i = tf_i · idf_i    (3)

wherein n_{i,j} is the number of occurrences of the ith word in the jth large paragraph, Σ_k n_{k,j} is the total number of words in the jth large paragraph, |D| is the number of natural segments contained in the divided large paragraphs, and |{ j : t_i ∈ d_j }| is the number of natural segments containing the ith word.
An entropy calculation module configured to calculate the entropy of the n common words with the largest feature values in the large paragraph:
p_m = Σ_{i=1}^{n} p_i    (4)

E = -p_m · log p_m    (5)

wherein, in formula (4), n is the total number of common words and p_i is the probability of the ith common word; in formula (5), E is the entropy of the common words and p_m is the total probability of the common words.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-6.
CN201910927810.4A 2019-09-27 2019-09-27 Method and device for dividing text paragraphs Active CN110674635B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910927810.4A CN110674635B (en) 2019-09-27 2019-09-27 Method and device for dividing text paragraphs

Publications (2)

Publication Number Publication Date
CN110674635A true CN110674635A (en) 2020-01-10
CN110674635B CN110674635B (en) 2023-04-25

Family

ID=69079885

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910927810.4A Active CN110674635B (en) 2019-09-27 2019-09-27 Method and device for dividing text paragraphs

Country Status (1)

Country Link
CN (1) CN110674635B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113268581A (en) * 2021-07-20 2021-08-17 北京世纪好未来教育科技有限公司 Topic generation method and device
CN113673255A (en) * 2021-08-25 2021-11-19 北京市律典通科技有限公司 Text function region splitting method and device, computer equipment and storage medium
CN117688138A (en) * 2024-02-02 2024-03-12 中船凌久高科(武汉)有限公司 Long text similarity comparison method based on paragraph division

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2004145790A (en) * 2002-10-28 2004-05-20 Advanced Telecommunication Research Institute International Segmentation method of document and computer program therefor
US20100332219A1 (en) * 2000-09-30 2010-12-30 Weiquan Liu Method and apparatus for determining text passage similarity
CN102004724A (en) * 2010-12-23 2011-04-06 哈尔滨工业大学 Document paragraph segmenting method
CN103927302A (en) * 2013-01-10 2014-07-16 阿里巴巴集团控股有限公司 Text classification method and system
CN104063502A (en) * 2014-07-08 2014-09-24 中南大学 WSDL semi-structured document similarity analyzing and classifying method based on semantic model
WO2018121145A1 (en) * 2016-12-30 2018-07-05 北京国双科技有限公司 Method and device for vectorizing paragraph
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN108874921A (en) * 2018-05-30 2018-11-23 广州杰赛科技股份有限公司 Extract method, apparatus, terminal device and the storage medium of text feature word
CN109753647A (en) * 2017-11-07 2019-05-14 北京国双科技有限公司 The partitioning method and device of paragraph
CN109993216A (en) * 2019-03-11 2019-07-09 深兰科技(上海)有限公司 A kind of file classification method and its equipment based on K arest neighbors KNN

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
LIU, Yao et al.: "Research on Text Segmentation Method Based on Domain Ontology", Computer Science *
ZHU, Zhenfang et al.: "A Logical Paragraph Division Method Based on Semantic Features and Its Application", Computer Science *
WANG, Baoxun et al.: "Answer Recognition Based on Paragraph Division of Forum Topics", Acta Automatica Sinica *
WEI, Guiying et al.: "Research on Topic Region Division in Automatic Summarization Based on Cluster Analysis", China Management Informationization *

Also Published As

Publication number Publication date
CN110674635B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
US11151177B2 (en) Search method and apparatus based on artificial intelligence
US20190065507A1 (en) Method and apparatus for information processing
CN107168954B (en) Text keyword generation method and device, electronic equipment and readable storage medium
CN108573045B (en) Comparison matrix similarity retrieval method based on multi-order fingerprints
US8983826B2 (en) Method and system for extracting shadow entities from emails
CN110674635B (en) Method and device for dividing text paragraphs
CN112395385B (en) Text generation method and device based on artificial intelligence, computer equipment and medium
CN108170650B (en) Text comparison method and text comparison device
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
US9298757B1 (en) Determining similarity of linguistic objects
CN108614897B (en) Content diversification searching method for natural language
WO2021146694A1 (en) Systems and methods for mapping a term to a vector representation in a semantic space
CN112988753B (en) Data searching method and device
CN110874532A (en) Method and device for extracting keywords of feedback information
CN111325033B (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN113609847B (en) Information extraction method, device, electronic equipment and storage medium
WO2018213783A1 (en) Computerized methods of data compression and analysis
CN111859079B (en) Information searching method, device, computer equipment and storage medium
CN112528653B (en) Short text entity recognition method and system
CN111191011B (en) Text label searching and matching method, device, equipment and storage medium
CN113408660A (en) Book clustering method, device, equipment and storage medium
CN111814486A (en) Enterprise client tag generation method, system and device based on semantic analysis
CN111985212A (en) Text keyword recognition method and device, computer equipment and readable storage medium
CN114742058B (en) Named entity extraction method, named entity extraction device, computer equipment and storage medium
CN110705287B (en) Method and system for generating text abstract

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant