CN110895654A - Segmentation method, segmentation system and non-transitory computer readable medium - Google Patents

Segmentation method, segmentation system and non-transitory computer readable medium

Info

Publication number: CN110895654A
Authority: CN (China)
Prior art keywords: caption, paragraph, sentence, segmentation, sentences
Legal status: Granted; Active
Application number: CN201910105172.8A
Other languages: Chinese (zh)
Other versions: CN110895654B (en)
Inventors: 蓝国诚, 詹诗涵
Current Assignee: Delta Electronics Inc
Original Assignee: Delta Electronics Inc
Application filed by Delta Electronics Inc
Priority to SG10201905236WA
Publication of CN110895654A
Application granted; publication of CN110895654B

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G06F16/43Querying
    • G06F16/435Filtering based on additional data, e.g. user or group profiles
    • G06F16/437Administration of user profiles, e.g. generation, initialisation, adaptation, distribution

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Image Analysis (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present disclosure relates to a segmentation method, a segmentation system, and a non-transitory computer readable medium. The segmentation method includes the following steps: receiving subtitle information, where the subtitle information includes a plurality of caption sentences; selecting caption sentences according to a set value and grouping the selected caption sentences into a first paragraph; performing a common segmentation vocabulary determination on a first caption sentence, where the first caption sentence is one of the caption sentences; and generating a second paragraph or merging the first caption sentence into the first paragraph according to the result of the common segmentation vocabulary determination.

Description

Segmentation method, segmentation system and non-transitory computer readable medium
Technical Field
The present disclosure relates to a segmentation method, a segmentation system and a non-transitory computer readable medium, and more particularly, to a segmentation method, a segmentation system and a non-transitory computer readable medium for subtitles.
Background
An online learning platform is a network service that stores learning materials on a server so that users can connect to the server through the Internet and browse the materials at any time. Existing online learning platforms provide learning materials of various types, including videos, audio, presentations, documents, and forums.
Because the amount of learning material stored on an online learning platform is huge, the text of the material needs to be automatically segmented and paragraph keywords need to be established for the user's convenience. Therefore, how to process a learning video according to the differences between its contents, so as to segment similar topics within the video and label them with keywords, is a problem to be solved in the field.
Disclosure of Invention
A first aspect of the present disclosure provides a segmentation method. The segmentation method includes the following steps: receiving subtitle information, where the subtitle information includes a plurality of caption sentences; selecting caption sentences according to a set value and grouping the selected caption sentences into a first paragraph; performing a common segmentation vocabulary determination on a first caption sentence, where the first caption sentence is one of the caption sentences; and generating a second paragraph or merging the first caption sentence into the first paragraph according to the result of the common segmentation vocabulary determination.
A second aspect of the present disclosure provides a segmentation system, which includes a storage unit and a processor. The storage unit is configured to store the subtitle information, the segmentation result, the annotation corresponding to the first paragraph, and the annotation corresponding to the second paragraph. The processor is electrically connected to the storage unit and is configured to receive the subtitle information, which contains a plurality of caption sentences. The processor includes a segmentation unit, a common word detection unit, and a paragraph generation unit. The segmentation unit is configured to select caption sentences in a specific order by using a set value and group the selected caption sentences into a first paragraph. The common word detection unit is electrically connected to the segmentation unit and is configured to perform a common segmentation vocabulary determination on a first caption sentence, where the first caption sentence is one of the plurality of caption sentences. The paragraph generation unit is electrically connected to the common word detection unit and is configured to generate a second paragraph or merge the first caption sentence into the first paragraph according to the result of the common segmentation vocabulary determination.
A third aspect of the present application provides a non-transitory computer readable medium containing at least one program of instructions executed by a processor to perform a segmentation method, which includes the following steps: receiving subtitle information, where the subtitle information includes a plurality of caption sentences; selecting caption sentences according to a set value and grouping the selected caption sentences into a first paragraph; performing a common segmentation vocabulary determination on a first caption sentence, where the first caption sentence is one of the caption sentences; and generating a second paragraph or merging the first caption sentence into the first paragraph according to the result of the common segmentation vocabulary determination.
The segmentation method, segmentation system, and non-transitory computer readable medium of the present disclosure mainly solve the problem that marking video segments manually consumes a great deal of labor and time. The method first calculates the keywords corresponding to each caption sentence, performs the common segmentation vocabulary determination on the caption sentences, and then generates a second paragraph or merges the first caption sentence into the first paragraph according to the result of the determination to produce a segmentation result, thereby segmenting similar topics within a learning video and labeling them with keywords.
Drawings
In order to make the aforementioned and other objects, features, advantages and embodiments of the present disclosure more comprehensible, the following description is made with reference to the accompanying drawings:
FIG. 1 is a schematic diagram of a segmentation system according to some embodiments of the present application;
FIG. 2 is a flowchart of a segmentation method according to some embodiments of the present application;
FIG. 3 is a flowchart of step S240 according to some embodiments of the present application;
FIG. 4 is a flowchart of step S241 according to some embodiments of the present application; and
FIG. 5 is a flowchart of step S242 according to some embodiments of the present application.
[Description of reference numerals]
100: segmentation system
110: storage unit
130: processor
DB1: common segmentation vocabulary database
DB2: course database
131: keyword extraction unit
132: segmentation unit
133: common word detection unit
134: paragraph generation unit
135: annotation generation unit
200: segmentation method
S210 to S250, S241 to S242, S2411 to S2413, S2421 to S2423: steps
Detailed Description
Reference will now be made in detail to the present embodiments of the application, examples of which are illustrated in the accompanying drawings. It should be understood, however, that these implementation details are not intended to limit the application; in some embodiments of the present disclosure, such practical details are unnecessary. In addition, for simplicity, some conventional structures and elements are shown in the drawings in a simple schematic manner.
When an element is referred to as being "connected" or "coupled", it can mean "electrically connected" or "electrically coupled". "Connected" or "coupled" may also be used to indicate that two or more elements engage or interact with each other. Moreover, although terms such as "first", "second", etc. may be used herein to describe various elements, these terms are used merely to distinguish one element or operation from another described in similar technical terms. Unless the context clearly dictates otherwise, the terms neither specifically refer to nor imply an order or sequence, nor are they intended to limit the present disclosure.
Please refer to FIG. 1, which is a schematic diagram of a segmentation system 100 according to some embodiments of the present application. As shown in FIG. 1, the segmentation system 100 includes a storage unit 110 and a processor 130. The storage unit 110 is electrically connected to the processor 130 and is used to store the subtitle information, the segmentation result, the common segmentation vocabulary database DB1, the course database DB2, the annotation corresponding to the first paragraph, and the annotation corresponding to the second paragraph.
As described above, the processor 130 includes the keyword extracting unit 131, the segmenting unit 132, the common word detecting unit 133, the paragraph generating unit 134, and the annotation generating unit 135. The segmentation unit 132 is electrically connected to the keyword extraction unit 131 and the common word detection unit 133, the paragraph generation unit 134 is electrically connected to the common word detection unit 133 and the annotation generation unit 135, and the common word detection unit 133 is electrically connected to the annotation generation unit 135.
In various embodiments of the present disclosure, the storage unit 110 can be implemented as a memory, a hard disk, a portable storage device, a memory card, etc. The processor 130 can be implemented as an integrated circuit such as a micro control unit (microcontroller), a microprocessor, a digital signal processor, an application specific integrated circuit (ASIC), a logic circuit, or other similar components or a combination thereof.
Please refer to FIG. 2, which is a flowchart of a segmentation method 200 according to some embodiments of the present application. In an embodiment, the segmentation method 200 shown in FIG. 2 can be applied to the segmentation system 100 of FIG. 1, and the processor 130 segments the subtitle information according to the following steps of the segmentation method 200 to generate a segmentation result and an annotation corresponding to each paragraph. As shown in FIG. 2, the segmentation method 200 first performs step S210 to receive the subtitle information. In one embodiment, the subtitle information includes a plurality of caption sentences. For example, the subtitle information may be the subtitle file of a video: the subtitle file divides the content of the video into a plurality of caption sentences according to the playback time, and the caption sentences are likewise sorted by playback time.
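By way of illustration only, the sketch below parses an SRT-style subtitle file into an ordered list of caption sentences. The SRT format, the CaptionSentence fields, and the function name are assumptions made for this example and are not specified by the disclosure.

```python
import re
from dataclasses import dataclass

@dataclass
class CaptionSentence:
    index: int   # playback order of the caption sentence
    start: str   # start timestamp, e.g. "00:01:02,000"
    end: str     # end timestamp
    text: str    # the caption sentence itself

def parse_srt(path: str) -> list[CaptionSentence]:
    """Split an SRT subtitle file into caption sentences sorted by playback time."""
    with open(path, encoding="utf-8") as f:
        blocks = re.split(r"\n\s*\n", f.read().strip())
    sentences = []
    for block in blocks:
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, end = (t.strip() for t in lines[1].split("-->"))
        sentences.append(CaptionSentence(int(lines[0]), start, end, " ".join(lines[2:])))
    return sorted(sentences, key=lambda s: s.index)
```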
Next, the segmentation method 200 performs step S220 to select caption sentences according to a set value and group the selected caption sentences into the current paragraph. In an embodiment, the set value can be any positive integer; taking a set value of 3 as an example, this step selects 3 caption sentences, in playback order, to form the current paragraph. For example, if there are N caption sentences in total, the 1st to 3rd caption sentences can be selected to form the current paragraph.
Next, the segmentation method 200 performs step S230 to perform the common segmentation vocabulary determination on the current caption sentence. In one embodiment, the common segmentation vocabulary is stored in the common segmentation vocabulary database DB1, and the common word detection unit 133 detects whether a common segmentation vocabulary is present. Common segmentation vocabulary can be divided into common beginning vocabulary and common ending vocabulary. For example, common beginning vocabulary can be phrases such as "next" or "let's begin describing", and common ending vocabulary can be phrases such as "the description above ends here" or "that is all for today's section". This step detects whether a common segmentation vocabulary is present and, if so, its type (common beginning vocabulary or common ending vocabulary).
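A minimal sketch of this detection follows; the two phrase lists stand in for the contents of the common segmentation vocabulary database DB1, and the phrases themselves are assumptions made for this example.

```python
# Illustrative stand-ins for the common segmentation vocabulary database DB1.
BEGIN_WORDS = ["next", "let's begin describing"]
END_WORDS = ["the description above ends here", "that is all for today's section"]

def classify_common_word(text: str) -> str | None:
    """Return 'begin', 'end', or None depending on which type of common
    segmentation vocabulary (if any) appears in the caption sentence."""
    lowered = text.lower()
    if any(w in lowered for w in BEGIN_WORDS):
        return "begin"
    if any(w in lowered for w in END_WORDS):
        return "end"
    return None
```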
Next, the segmentation method 200 performs step S240 to generate the next paragraph or merge the current caption sentence into the current paragraph according to the result of the common segmentation vocabulary determination. In one embodiment, whether a new paragraph is generated or the currently processed caption sentence is merged into the current paragraph is decided according to the detection result of the aforementioned common word detection unit 133. For example, if the current paragraph consists of the 1st to 3rd caption sentences, the currently processed caption sentence may be the 4th caption sentence, and according to the determination result the 4th caption sentence is either merged into the current paragraph or used as the start of a new paragraph.
As mentioned above, after the current caption sentence is merged into the current paragraph in step S240, the common segmentation vocabulary determination is performed on the next caption sentence, so the determination of step S230 is performed again. For example, if the 4th caption sentence is merged into the current paragraph, the common segmentation vocabulary determination is performed on the 5th caption sentence. If instead the next paragraph is generated in step S240, caption sentences are again selected in the specific order according to the set value and added to the next paragraph, so the operation of step S220 is performed again. For example, if the 4th caption sentence is assigned to the next paragraph, the 5th, 6th, and 7th caption sentences are selected and added to the next paragraph. The segmentation operation is repeated in this way until all caption sentences have been segmented, and a segmentation result is finally generated.
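Putting steps S220 to S240 together, the sketch below shows one possible top-level flow. The helper names handle_common_word and merge_or_split are hypothetical and are sketched after the corresponding steps below; classify_common_word is the detection sketch above, and the final remainder rule reflects the handling of leftover caption sentences described later in this description.

```python
SET_VALUE = 3  # the set value of step S220 (any positive integer)

def segment(sentences: list[str], threshold: float) -> list[list[str]]:
    """Sketch of the S220-S240 loop: group caption sentences into paragraphs."""
    paragraphs = [sentences[:SET_VALUE]]  # S220: form the current paragraph
    i = SET_VALUE                         # index of the caption sentence under test
    while i < len(sentences):
        if len(sentences) - i < SET_VALUE:
            # Fewer caption sentences remain than the set value: merge them
            # directly into the current paragraph without further calculation.
            paragraphs[-1].extend(sentences[i:])
            break
        kind = classify_common_word(sentences[i])  # S230
        if kind is not None:                       # S241: split on a common word
            i = handle_common_word(paragraphs, sentences, i, kind)
        else:                                      # S242: decide by similarity
            i = merge_or_split(paragraphs, sentences, i, threshold)
    return paragraphs
```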
Step S240 further includes steps S241 to S242; please refer to FIG. 3, which is a flowchart of step S240 according to some embodiments of the present disclosure. As shown in FIG. 3, the segmentation method 200 performs step S241: if the current caption sentence is associated with a common segmentation vocabulary, segmentation processing is performed to generate the next paragraph, and caption sentences are selected in the specific order according to the set value and added to the next paragraph. Step S241 further includes steps S2411 to S2413; please refer to FIG. 4, which is a flowchart of step S241 according to some embodiments of the present disclosure. As shown in FIG. 4, the segmentation method 200 performs step S2411 to determine, according to the determination result of step S230, whether the current caption sentence is associated with a beginning segmentation vocabulary or an ending segmentation vocabulary.
In light of the above, the segmentation method 200 further performs step S2412: if the current caption sentence is associated with a beginning segmentation vocabulary, the current caption sentence is used as the starting sentence of the next paragraph. For example, if the 4th caption sentence contains the word "next" according to the foregoing determination result, the 4th caption sentence is used as the starting sentence of the next paragraph.
In light of the above, the segmentation method 200 further performs step S2413: if the current caption sentence is associated with an ending segmentation vocabulary, the current caption sentence is used as the final sentence of the current paragraph. For example, if the 4th caption sentence is detected to contain the phrase "the description above ends here", the 4th caption sentence is used as the final sentence of the current paragraph. After step S241 is completed, caption sentences are selected in the specific order according to the set value and added to the next paragraph, so the operation of step S220 is performed again; the details are not repeated here.
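Continuing the loop sketch above, the hypothetical helper handle_common_word illustrates steps S2411 to S2413; it assumes the 'begin'/'end' labels of classify_common_word and the SET_VALUE constant from the earlier sketches.

```python
def handle_common_word(paragraphs, sentences, i, kind):
    """S241 (sketch): split at a common beginning or ending vocabulary."""
    if kind == "begin":
        # S2412: the caption sentence becomes the starting sentence
        # of the next paragraph.
        paragraphs.append([sentences[i]])
    else:
        # S2413: the caption sentence becomes the final sentence of the
        # current paragraph, and the next paragraph starts empty.
        paragraphs[-1].append(sentences[i])
        paragraphs.append([])
    # Back to S220: select the next set-value caption sentences, in order,
    # and add them to the next paragraph.
    selected = sentences[i + 1 : i + 1 + SET_VALUE]
    paragraphs[-1].extend(selected)
    return i + 1 + len(selected)
```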
Next, the segmentation method 200 further performs step S242: if the current caption sentence is not associated with a common segmentation vocabulary, a similarity value is calculated between the current caption sentence and the current paragraph, and if the current caption sentence is similar to the current paragraph, the current caption sentence is merged into the current paragraph. Step S242 further includes steps S2421 to S2423; please refer to FIG. 5, which is a flowchart of step S242 according to some embodiments of the present disclosure. As shown in FIG. 5, the segmentation method 200 performs step S2421 to compare whether a difference value between at least one feature corresponding to the current caption sentence and at least one feature corresponding to the current paragraph is greater than a threshold value.
In another embodiment, the method further includes extracting a plurality of keywords from the caption sentences, and the extracted keywords serve as the at least one feature corresponding to the current caption sentence. The keywords corresponding to the caption sentences are calculated using the TF-IDF (Term Frequency-Inverse Document Frequency) statistical method. TF-IDF evaluates the importance of a word to a document in a database: the importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears across the database. In this embodiment, the TF-IDF statistical method calculates the keywords of the current caption sentence. A similarity value is then calculated between the at least one feature (keywords) of the current caption sentence and the at least one feature (keywords) of the current paragraph; the higher the calculated similarity value, the closer the current caption sentence is to the content of the current paragraph.
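By way of illustration, the sketch below computes TF-IDF keywords for one caption sentence and a similarity value between two keyword lists. The smoothed IDF formula and the Jaccard overlap used as the similarity measure are assumptions made for this example; the disclosure names TF-IDF but does not fix a particular similarity measure.

```python
import math
from collections import Counter

def tfidf_keywords(sentence_tokens: list[str],
                   corpus: list[list[str]], top_k: int = 5) -> list[str]:
    """Rank the words of one caption sentence by TF-IDF against a corpus.

    TF rises with the number of times a word appears in the sentence;
    IDF falls with the number of corpus documents containing the word.
    """
    tf = Counter(sentence_tokens)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log((1 + len(corpus)) / (1 + df)) + 1  # smoothed IDF
        scores[word] = (count / len(sentence_tokens)) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

def similarity(a: list[str], b: list[str]) -> float:
    """Jaccard overlap between two keyword lists (one simple choice)."""
    if not a or not b:
        return 0.0
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)
```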
In light of the above, the segmentation method 200 further performs step S2422: if the difference value is smaller than the threshold value, the current caption sentence is merged into the current paragraph. In an embodiment, the threshold value filters the similarity value: when the similarity value is not less than the threshold value (that is, when the difference value is smaller than the threshold value), the current caption sentence is similar to the content of the current paragraph, so the current caption sentence can be merged into the current paragraph. For example, if the similarity value between the 4th caption sentence and the current paragraph is not less than the threshold value, the content of the 4th caption sentence is similar to that of the current paragraph, so the 4th caption sentence can be added to the current paragraph.
In light of the above, the segmentation method 200 further performs step S2423: if the difference value is not less than the threshold value, the current caption sentence is used as the starting sentence of the next paragraph, and caption sentences are selected in the specific order according to the set value and added to the next paragraph. That is, when the similarity value is smaller than the threshold value, the current caption sentence differs from the content of the current paragraph, so the current caption sentence is determined to be the starting sentence of the next paragraph. For example, if the similarity value between the 4th caption sentence and the current paragraph is smaller than the threshold value, the content of the 4th caption sentence differs from that of the current paragraph, so the 4th caption sentence is used as the starting sentence of the next paragraph. After step S2423 is completed, caption sentences are selected in the specific order according to the set value and added to the next paragraph, so the operation of step S220 is performed again; the details are not repeated here.
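Continuing the earlier sketches, the hypothetical helper merge_or_split illustrates steps S2421 to S2423, phrased in terms of the similarity value (a similarity value not less than the threshold corresponds to a difference value smaller than the threshold). Using all caption sentences as the TF-IDF corpus is an assumption made for this example.

```python
def merge_or_split(paragraphs, sentences, i, threshold):
    """S242 (sketch): merge by similarity or start the next paragraph."""
    corpus = [s.split() for s in sentences]  # illustrative TF-IDF corpus
    sent_kw = tfidf_keywords(sentences[i].split(), corpus)  # S2421
    para_kw = [w for s in paragraphs[-1]
               for w in tfidf_keywords(s.split(), corpus)]
    if similarity(sent_kw, para_kw) >= threshold:
        # S2422: contents are similar -> merge into the current paragraph.
        paragraphs[-1].append(sentences[i])
        return i + 1
    # S2423: contents differ -> the caption sentence starts the next
    # paragraph, then the next set-value caption sentences are added to it.
    paragraphs.append([sentences[i]])
    selected = sentences[i + 1 : i + 1 + SET_VALUE]
    paragraphs[-1].extend(selected)
    return i + 1 + len(selected)
```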
From the above segmentation operation, after the segmentation calculation of one caption sentence is completed, the segmentation calculation of the next caption sentence is executed, and so on until all caption sentences have been processed. If the number of remaining caption sentences is less than the set value, the segmentation calculation may be skipped for them and the remaining caption sentences merged directly into the current paragraph. For example, if 2 caption sentences remain, which is less than the set value of 3, the remaining 2 caption sentences can be merged into the current paragraph.
After the above segmentation steps are performed, the segmentation method 200 performs step S250 to generate the annotations corresponding to the paragraphs. For example, if the caption sentences have been divided into 3 paragraphs after processing, the annotations of the 3 paragraphs are calculated separately; an annotation may be generated from the keywords corresponding to the caption sentences in the paragraph. Finally, the divided paragraphs and their corresponding annotations are stored in the course database DB2 of the storage unit 110. In addition, if the difference value is smaller than the threshold value, the current caption sentence is similar to the current paragraph, so the keywords of the caption sentence can be taken as at least one feature corresponding to the current paragraph; if the difference value is not less than the threshold value, the current caption sentence is not similar to the current paragraph, so the keywords of the caption sentence can be taken as at least one feature corresponding to the next paragraph.
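To round out step S250, the sketch below derives an annotation for each paragraph from the keywords of its caption sentences; the dict standing in for the course database DB2 and the frequency-based ranking are assumptions made for this example.

```python
from collections import Counter

def annotate(paragraphs: list[list[str]],
             corpus: list[list[str]], top_k: int = 3) -> dict:
    """S250 (sketch): generate an annotation for each divided paragraph."""
    course_db = {}  # illustrative stand-in for the course database DB2
    for idx, paragraph in enumerate(paragraphs):
        # Pool the TF-IDF keywords of the caption sentences in the paragraph
        # and keep the most frequent ones as the paragraph's annotation.
        counts = Counter(w for s in paragraph
                         for w in tfidf_keywords(s.split(), corpus))
        course_db[idx] = {"sentences": paragraph,
                          "annotation": [w for w, _ in counts.most_common(top_k)]}
    return course_db
```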
The embodiments of the present application mainly solve the problem that marking video paragraphs manually consumes a great deal of labor and time. The keywords corresponding to each caption sentence are first calculated, the common segmentation vocabulary determination is performed on the caption sentences, and the next paragraph is generated or the current caption sentence is merged into the current paragraph according to the determination result, producing a segmentation result and thereby segmenting similar topics within a learning video and labeling them with keywords.
Additionally, the above illustration includes exemplary steps in a sequential order, but the steps need not be performed in the order shown. Performing these steps in a different order is within the contemplation of the present disclosure. Steps may be added, substituted, reordered, and/or omitted as appropriate within the spirit and scope of the embodiments of the present disclosure.
Although the present disclosure has been described with reference to the above embodiments, it should be understood that various changes and modifications can be made by one skilled in the art without departing from the spirit and scope of the disclosure, and therefore, the scope of the disclosure should be determined by that of the appended claims.

Claims (17)

1. A segmentation method, comprising:
receiving subtitle information; wherein the subtitle information comprises a plurality of caption sentences;
selecting the plurality of caption sentences according to a set value, and dividing the selected caption sentences into a first paragraph;
performing common segmentation vocabulary judgment on a first caption sentence; wherein the first caption sentence is one of the plurality of caption sentences; and
and generating a second paragraph or merging the first caption sentence into the first paragraph according to a judgment result of the common segmentation vocabulary judgment.
2. The segmentation method according to claim 1, wherein the common segmentation vocabulary determination is performed for a second caption sentence after the first caption sentence is merged into the first paragraph; wherein the second caption sentence follows the first caption sentence according to a specific sequence.
3. The segmentation method of claim 1, wherein when the second paragraph is generated, the caption sentences are selected according to a specific sequence by using the setting value, and the selected caption sentences are added to the second paragraph.
4. The segmentation method according to claim 1, wherein the generating the second paragraph or incorporating the first caption sentence into the first paragraph according to the judgment result of the common segmented vocabulary judgment further comprises:
if the first caption sentence is associated with the common segmentation vocabulary, performing segmentation processing to generate a second paragraph, selecting the plurality of caption sentences according to a specific sequence by using the set value, and adding the selected caption sentences into the second paragraph; and
and if the first caption sentence is not associated with the common segmented vocabulary, performing similarity value calculation on the first caption sentence and the first paragraph, and if the first caption sentence is similar to the first paragraph, merging the first caption sentence into the first paragraph.
5. The segmentation method of claim 4, wherein the segmentation process comprises:
determining whether the first caption sentence is associated with one of a beginning segmentation vocabulary and an ending segmentation vocabulary according to the judgment result;
if the first caption sentence is associated with the beginning segmentation vocabulary, taking the first caption sentence as the starting sentence of the second paragraph; and
and if the first caption sentence is associated with the ending segmented word, taking the first caption sentence as the ending sentence of the first paragraph.
6. The segmentation method of claim 4, wherein the similarity value calculation comprises:
comparing whether a difference value between at least one characteristic corresponding to the first caption sentence and at least one characteristic corresponding to the first paragraph is larger than a threshold value;
if the difference value is smaller than the threshold value, merging the first caption sentence into the first paragraph; and
if the difference value is not less than the threshold value, the first caption sentence is used as the initial sentence of the second paragraph, and the plurality of caption sentences are selected according to the specific sequence by using the set value, so that the selected caption sentences are divided into the second paragraph.
7. The segmentation method of claim 6, wherein a plurality of keywords are extracted from the plurality of caption sentences, the plurality of keywords being at least one feature corresponding to the first caption sentence.
8. The segmentation method of claim 7, wherein the at least one feature corresponding to the first paragraph is generated from the keywords extracted from the caption sentences in the first paragraph.
9. A segmentation system, comprising:
the storage unit is used for storing subtitle information, a segmentation result, a common segmentation vocabulary database, an annotation corresponding to a first paragraph and an annotation corresponding to a second paragraph; and
a processor electrically connected to the memory unit for receiving the caption information; wherein the caption information comprises a plurality of caption sentences, and the processor comprises:
a segmentation unit for selecting the plurality of caption sentences by using a set value and dividing the selected caption sentences into a first paragraph;
a common word detection unit electrically connected to the segmentation unit for performing a common segmentation vocabulary judgment for a first caption sentence; wherein the first caption sentence is one of the plurality of caption sentences; and
and the paragraph generation unit is electrically connected with the common word detection unit and used for generating a second paragraph or merging the first caption sentence into the first paragraph according to a judgment result of the common segmented vocabulary judgment.
10. The segmentation system of claim 9, wherein the common word detection unit is further configured to perform the common segmentation vocabulary determination for a second caption sentence after the first caption sentence is merged into the first paragraph;
wherein the second caption sentence follows the first caption sentence according to a specific sequence.
11. The segmentation system of claim 9, wherein the segmentation unit is further configured to select the plurality of caption sentences according to a specific order by using the setting value when the second paragraph is generated, and add the selected caption sentences to the second paragraph.
12. The segmentation system as claimed in claim 9, wherein the paragraph generation unit is further configured to perform the following steps according to the determination result:
if the first caption sentence is associated with the common segmentation vocabulary, performing segmentation processing to generate a second paragraph, selecting the caption sentences according to a specific sequence by using the set value, and adding the selected caption sentences into the second paragraph; and
and if the first caption sentence is not associated with the common segmented vocabulary, performing similarity value calculation on the first caption sentence and the first paragraph, and if the first caption sentence is similar to the first paragraph, merging the first caption sentence into the first paragraph.
13. The segmentation system of claim 12, wherein the segmentation process comprises:
determining whether the first caption sentence is associated with one of a beginning segmentation vocabulary and an ending segmentation vocabulary according to the judgment result;
if the first caption sentence is associated with the beginning segmentation vocabulary, taking the first caption sentence as the starting sentence of the second paragraph; and
and if the first caption sentence is associated with the ending segmented word, taking the first caption sentence as the ending sentence of the first paragraph.
14. The segmentation system of claim 12, wherein the similarity value calculation comprises:
comparing whether a difference value between at least one characteristic corresponding to the first caption sentence and at least one characteristic corresponding to the first paragraph is larger than a threshold value;
if the difference value is smaller than the threshold value, merging the first caption sentence into the first paragraph; and
if the difference value is not less than the threshold value, the first caption sentence is used as the initial sentence of the second paragraph, and the plurality of caption sentences are selected according to the specific sequence by using the set value, so that the selected caption sentences are divided into the second paragraph.
15. The segmentation system of claim 14, further comprising:
and the keyword extraction unit is electrically connected with the segmentation unit and used for extracting a plurality of keywords from the plurality of caption sentences, wherein the keywords are at least one characteristic corresponding to the first caption sentence.
16. The segmentation system of claim 15, wherein the at least one feature corresponding to the first paragraph is generated from the keywords extracted from the caption sentences in the first paragraph.
17. A non-transitory computer readable medium containing at least one program of instructions for execution by a processor to perform a segmentation method, comprising:
receiving subtitle information; wherein, the caption information comprises a plurality of caption sentences;
selecting the plurality of caption sentences according to a set value, and dividing the selected caption sentences into a first paragraph;
performing common segmentation vocabulary judgment on a first caption sentence; wherein the first caption sentence is one of the plurality of caption sentences; and
and generating a second paragraph or merging the first caption sentence into the first paragraph according to a judgment result of the common segmentation vocabulary judgment.
CN201910105172.8A 2018-09-07 2019-02-01 Segmentation method, segmentation system and non-transitory computer readable medium Active CN110895654B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
SG10201905236WA SG10201905236WA (en) 2018-09-07 2019-06-10 Segmentation method, segmentation system and non-transitory computer-readable medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US201862728082P 2018-09-07 2018-09-07
US62/728,082 2018-09-07

Publications (2)

Publication Number Publication Date
CN110895654A (en) 2020-03-20
CN110895654B (en) 2024-07-02

Family

ID=69745778

Family Applications (5)

Application Number Title Priority Date Filing Date
CN201910105172.8A Active CN110895654B (en) 2018-09-07 2019-02-01 Segmentation method, segmentation system and non-transitory computer readable medium
CN201910104946.5A Active CN110891202B (en) 2018-09-07 2019-02-01 Segmentation method, segmentation system and non-transitory computer readable medium
CN201910105173.2A Pending CN110889034A (en) 2018-09-07 2019-02-01 Data analysis method and data analysis system
CN201910104937.6A Active CN110888896B (en) 2018-09-07 2019-02-01 Data searching method and data searching system thereof
CN201910266133.6A Pending CN110888994A (en) 2018-09-07 2019-04-03 Multimedia data recommendation system and multimedia data recommendation method

Family Applications After (4)

Application Number Title Priority Date Filing Date
CN201910104946.5A Active CN110891202B (en) 2018-09-07 2019-02-01 Segmentation method, segmentation system and non-transitory computer readable medium
CN201910105173.2A Pending CN110889034A (en) 2018-09-07 2019-02-01 Data analysis method and data analysis system
CN201910104937.6A Active CN110888896B (en) 2018-09-07 2019-02-01 Data searching method and data searching system thereof
CN201910266133.6A Pending CN110888994A (en) 2018-09-07 2019-04-03 Multimedia data recommendation system and multimedia data recommendation method

Country Status (4)

Country Link
JP (3) JP6829740B2 (en)
CN (5) CN110895654B (en)
SG (5) SG10201905236WA (en)
TW (5) TWI709905B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI756703B (en) * 2020-06-03 2022-03-01 南開科技大學 Digital learning system and method thereof
CN114595854A (en) * 2020-11-19 2022-06-07 英业达科技有限公司 Method for tracking and predicting product quality based on social information
CN117351794B (en) * 2023-10-13 2024-06-04 浙江上国教育科技有限公司 Online course management system based on cloud platform

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19980014531A (en) * 1996-08-13 1998-05-25 김광수 How to Learn Foreign Dictation Dictation Using Caption Video CD Playback Device
CN102937972A (en) * 2012-10-15 2013-02-20 上海外教社信息技术有限公司 Audiovisual subtitle making system and method
CN105047203A (en) * 2015-05-25 2015-11-11 腾讯科技(深圳)有限公司 Audio processing method, device and terminal
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system

Family Cites Families (53)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07311539A (en) * 1994-05-17 1995-11-28 Hitachi Ltd Teaching material edition supporting system
JP2002041823A (en) * 2000-07-27 2002-02-08 Nippon Telegr & Teleph Corp <Ntt> Information distributing device, information receiving device and information distributing system
JP3685733B2 (en) * 2001-04-11 2005-08-24 株式会社ジェイ・フィット Multimedia data search apparatus, multimedia data search method, and multimedia data search program
JP2002341735A (en) * 2001-05-16 2002-11-29 Alice Factory:Kk Broadband digital learning system
CN1432932A (en) * 2002-01-16 2003-07-30 陈雯瑄 English examination and score estimation method and system
TW200411462A (en) * 2002-12-20 2004-07-01 Hsiao-Lien Wang A method for matching information exchange on network
WO2004090752A1 (en) * 2003-04-14 2004-10-21 Koninklijke Philips Electronics N.V. Method and apparatus for summarizing a music video using content analysis
JP4471737B2 (en) * 2003-10-06 2010-06-02 日本電信電話株式会社 Grouping condition determining device and method, keyword expansion device and method using the same, content search system, content information providing system and method, and program
JP4426894B2 (en) * 2004-04-15 2010-03-03 株式会社日立製作所 Document search method, document search program, and document search apparatus for executing the same
JP2005321662A (en) * 2004-05-10 2005-11-17 Fuji Xerox Co Ltd Learning support system and method
JP2006003670A (en) * 2004-06-18 2006-01-05 Hitachi Ltd Educational content providing system
US20080176202A1 (en) * 2005-03-31 2008-07-24 Koninklijke Philips Electronics, N.V. Augmenting Lectures Based on Prior Exams
US9058406B2 (en) * 2005-09-14 2015-06-16 Millennial Media, Inc. Management of multiple advertising inventories using a monetization platform
WO2008023470A1 (en) * 2006-08-21 2008-02-28 Kyoto University Sentence search method, sentence search engine, computer program, recording medium, and document storage
TW200825900A (en) * 2006-12-13 2008-06-16 Inst Information Industry System and method for generating wiki by sectional time of handout and recording medium thereof
JP5010292B2 (en) * 2007-01-18 2012-08-29 株式会社東芝 Video attribute information output device, video summarization device, program, and video attribute information output method
JP5158766B2 (en) * 2007-10-23 2013-03-06 シャープ株式会社 Content selection device, television, content selection program, and storage medium
TW200923860A (en) * 2007-11-19 2009-06-01 Univ Nat Taiwan Science Tech Interactive learning system
CN101382937B (en) * 2008-07-01 2011-03-30 深圳先进技术研究院 Multimedia resource processing method based on speech recognition and on-line teaching system thereof
US8140544B2 (en) * 2008-09-03 2012-03-20 International Business Machines Corporation Interactive digital video library
CN101453649B (en) * 2008-12-30 2011-01-05 浙江大学 Key frame extracting method for compression domain video stream
JP5366632B2 (en) * 2009-04-21 2013-12-11 エヌ・ティ・ティ・コミュニケーションズ株式会社 Search support keyword presentation device, method and program
JP5493515B2 (en) * 2009-07-03 2014-05-14 富士通株式会社 Portable terminal device, information search method, and information search program
EP2524362A1 (en) * 2010-01-15 2012-11-21 Apollo Group, Inc. Dynamically recommending learning content
JP2012038239A (en) * 2010-08-11 2012-02-23 Sony Corp Information processing equipment, information processing method and program
US8839110B2 (en) * 2011-02-16 2014-09-16 Apple Inc. Rate conform operation for a media-editing application
CN102222227B (en) * 2011-04-25 2013-07-31 中国华录集团有限公司 Video identification based system for extracting film images
CN102348049B (en) * 2011-09-16 2013-09-18 央视国际网络有限公司 Method and device for detecting position of cut point of video segment
CN102509007A (en) * 2011-11-01 2012-06-20 北京瑞信在线***技术有限公司 Method, system and device for multimedia teaching evaluation and multimedia teaching system
JP5216922B1 (en) * 2012-01-06 2013-06-19 Flens株式会社 Learning support server, learning support system, and learning support program
US9846696B2 (en) * 2012-02-29 2017-12-19 Telefonaktiebolaget Lm Ericsson (Publ) Apparatus and methods for indexing multimedia content
US20130263166A1 (en) * 2012-03-27 2013-10-03 Bluefin Labs, Inc. Social Networking System Targeted Message Synchronization
US9058385B2 (en) * 2012-06-26 2015-06-16 Aol Inc. Systems and methods for identifying electronic content using video graphs
TWI513286B (en) * 2012-08-28 2015-12-11 Ind Tech Res Inst Method and system for continuous video replay
WO2014100893A1 (en) * 2012-12-28 2014-07-03 Jérémie Salvatore De Villiers System and method for the automated customization of audio and video media
JP6205767B2 (en) * 2013-03-13 2017-10-04 カシオ計算機株式会社 Learning support device, learning support method, learning support program, learning support system, and server device
TWI549498B (en) * 2013-06-24 2016-09-11 wu-xiong Chen Variable audio and video playback method
CN104572716A (en) * 2013-10-18 2015-04-29 英业达科技有限公司 System and method for playing video files
KR101537370B1 (en) * 2013-11-06 2015-07-16 주식회사 시스트란인터내셔널 System for grasping speech meaning of recording audio data based on keyword spotting, and indexing method and method thereof using the system
US20150206441A1 (en) * 2014-01-18 2015-07-23 Invent.ly LLC Personalized online learning management system and method
CN104123332B (en) * 2014-01-24 2018-11-09 腾讯科技(深圳)有限公司 The display methods and device of search result
US9892194B2 (en) * 2014-04-04 2018-02-13 Fujitsu Limited Topic identification in lecture videos
US20150293928A1 (en) * 2014-04-14 2015-10-15 David Mo Chen Systems and Methods for Generating Personalized Video Playlists
US20160239155A1 (en) * 2015-02-18 2016-08-18 Google Inc. Adaptive media
JP6334431B2 (en) * 2015-02-18 2018-05-30 株式会社日立製作所 Data analysis apparatus, data analysis method, and data analysis program
CN104978961B (en) * 2015-05-25 2019-10-15 广州酷狗计算机科技有限公司 A kind of audio-frequency processing method, device and terminal
TWI571756B (en) * 2015-12-11 2017-02-21 財團法人工業技術研究院 Methods and systems for analyzing reading log and documents corresponding thereof
CN105978800A (en) * 2016-07-04 2016-09-28 广东小天才科技有限公司 Method, system and server for pushing questions to mobile terminal
CN106202453B (en) * 2016-07-13 2020-08-04 网易(杭州)网络有限公司 Multimedia resource recommendation method and device
CN106231399A (en) * 2016-08-01 2016-12-14 乐视控股(北京)有限公司 Methods of video segmentation, equipment and system
CN108122437A (en) * 2016-11-28 2018-06-05 北大方正集团有限公司 Adaptive learning method and device
CN107256262B (en) * 2017-06-13 2020-04-14 西安电子科技大学 Image retrieval method based on object detection
CN107623860A (en) * 2017-08-09 2018-01-23 北京奇艺世纪科技有限公司 Multi-medium data dividing method and device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR19980014531A (en) * 1996-08-13 1998-05-25 김광수 How to Learn Foreign Dictation Dictation Using Caption Video CD Playback Device
CN102937972A (en) * 2012-10-15 2013-02-20 上海外教社信息技术有限公司 Audiovisual subtitle making system and method
CN105047203A (en) * 2015-05-25 2015-11-11 腾讯科技(深圳)有限公司 Audio processing method, device and terminal
CN106331893A (en) * 2016-08-31 2017-01-11 科大讯飞股份有限公司 Real-time subtitle display method and system

Also Published As

Publication number Publication date
CN110891202B (en) 2022-03-25
TWI709905B (en) 2020-11-11
TW202011231A (en) 2020-03-16
TWI725375B (en) 2021-04-21
TW202011232A (en) 2020-03-16
SG10201906347QA (en) 2020-04-29
SG10201905236WA (en) 2020-04-29
SG10201905532QA (en) 2020-04-29
TWI699663B (en) 2020-07-21
TWI700597B (en) 2020-08-01
JP2020042771A (en) 2020-03-19
TW202011749A (en) 2020-03-16
TWI696386B (en) 2020-06-11
CN110895654B (en) 2024-07-02
JP6829740B2 (en) 2021-02-10
JP2020042770A (en) 2020-03-19
CN110891202A (en) 2020-03-17
JP2020042777A (en) 2020-03-19
TW202011222A (en) 2020-03-16
CN110889034A (en) 2020-03-17
SG10201907250TA (en) 2020-04-29
CN110888896A (en) 2020-03-17
CN110888896B (en) 2023-09-05
CN110888994A (en) 2020-03-17
SG10201905523TA (en) 2020-04-29
TW202011221A (en) 2020-03-16

Similar Documents

Publication Title
CN107436922B (en) Text label generation method and device
CN102483743B (en) Detecting writing systems and languages
CN110020424B (en) Contract information extraction method and device and text information extraction method
US20120278705A1 (en) System and Method for Automatically Extracting Metadata from Unstructured Electronic Documents
CN109582833B (en) Abnormal text detection method and device
JP6335898B2 (en) Information classification based on product recognition
CN110895654B (en) Segmentation method, segmentation system and non-transitory computer readable medium
US9098487B2 (en) Categorization based on word distance
US20180081861A1 (en) Smart document building using natural language processing
CN107357824B (en) Information processing method, service platform and computer storage medium
US20110276523A1 (en) Measuring document similarity by inferring evolution of documents through reuse of passage sequences
US20130323690A1 (en) Providing an uninterrupted reading experience
CN106610990A (en) Emotional tendency analysis method and apparatus
CN114116973A (en) Multi-document text duplicate checking method, electronic equipment and storage medium
CN114048740B (en) Sensitive word detection method and device and computer readable storage medium
CN112699671B (en) Language labeling method, device, computer equipment and storage medium
CN111046627B (en) Chinese character display method and system
WO2024139834A1 (en) Search word determining method and apparatus, computer device, and storage medium
CN116029280A (en) Method, device, computing equipment and storage medium for extracting key information of document
CN109977423B (en) Method and device for processing word, electronic equipment and readable storage medium
US20180307669A1 (en) Information processing apparatus
CN111651987B (en) Identity discrimination method and device, computer readable storage medium and electronic equipment
CN114169331A (en) Address resolution method, device, computer equipment and storage medium
CN112686055B (en) Semantic recognition method and device, electronic equipment and storage medium
CN115455179B (en) Sensitive vocabulary detection method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant