CN114445744A

CN114445744A - Education video automatic positioning method, device and storage medium

Info

Publication number: CN114445744A
Application number: CN202210068391.5A
Authority: CN
Inventors: 孙箫宇; 于丹; 王澈; 王宇; 张宾
Original assignee: Dalian Neusoft Education Technology Group Co ltd
Current assignee: Dalian Neusoft Education Technology Group Co ltd
Priority date: 2022-01-20
Filing date: 2022-01-20
Publication date: 2022-05-06

Abstract

The invention provides an educational video automatic positioning method, device and storage medium. The method comprises the following steps: acquiring uploaded education videos and education courseware; extracting video stream features of the educational video based on a depth network model to generate a sequence of video key frames; extracting a text sequence in each key frame image based on an OCR method to generate a key frame text sequence; extracting the structural information of the education courseware and outputting the content of each page of courseware in a text form; and automatically positioning the video explanation position corresponding to each page of education courseware by adopting a courseware-video positioning algorithm for the courseware text sequence and the key frame text sequence. The invention realizes the automatic positioning from the educational courseware to the video, on one hand, the invention can provide the accurate and intelligent video positioning function for the resource manager of the educational resource platform, and reduce the manual marking cost; on the other hand, the blindness and randomness of the retrieval process can be reduced when the user uses the system, the flexibility and the learning efficiency of on-line learning are improved, and therefore the user experience is improved.

Description

Education video automatic positioning method, device and storage medium

Technical Field

The invention relates to the technical field of computer vision, in particular to an automatic positioning method and device for an education video and a storage medium.

Background

In the hybrid teaching, multimedia courseware, course video, academic lectures and the like are more and more common in the educational teaching, and the educational resources are widely applied to an educational resource management platform. Generally, after the course is finished, the instructor uploads the recorded education video and the matched education courseware to the education resource management platform, so that the learner can independently watch and learn after the course. However, the existing platform mainly has the following defects in the aspect of educational resource management:

(1) the video is long in time, and the contained knowledge points are many, so that the targeted review is inconvenient to be performed aiming at certain difficult knowledge. In general, learners are basically in a passively accepted learning state when learning a video class, and places with higher cognitive load and higher knowledge difficulty need to repeatedly review videos. The video belongs to a dynamic medium with strong control, and learners are difficult to accurately position the learning key points through video pictures. The overlong video increases the blindness of the video positioning process, and is not convenient for fragmentation review.

(2) The resources of the education video and the matched courseware are mutually independent, and the education video and the matched courseware have no mapping relation. When a learner reviews courseware, the learner wants to review the explanation content of the courseware in the video because the content of the courseware is not understood, most of the existing platforms do not provide the function of skipping from the courseware to the video, and the learner can only manually search the corresponding position of the video, so that the learning efficiency and the user experience are greatly reduced.

(3) The manual video labeling is time-consuming and labor-consuming, and labeling errors are easy to generate. Some platforms provide the skip function from courseware to video, but mostly manually mark the mapping relation between each page of courseware and the video timestamp, and along with the continuous generation of massive education resources, manual video marking consumes a large amount of labor cost, and the process is tedious and marking errors are easily generated.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides an automatic positioning method, device and storage medium for an education video, aiming at quickly jumping to a corresponding video explanation position through the content of an education courseware.

The technical means adopted by the invention are as follows:

an educational video automatic positioning method comprises the following steps:

s1, obtaining uploaded education resources, wherein the education resources comprise education videos and education courseware, the education videos are curriculum videos recorded in the teaching, academic lectures, academic conferences and scientific research and training processes, the education videos comprise courseware slide playing contents, and the education courseware matched with the explanation contents in the education videos;

s2, extracting the video stream characteristics of the education video based on the depth network model so as to generate a video key frame sequence, wherein the video key frame sequence comprises non-repeated video images;

s3, extracting a text sequence in each key frame image based on an optical character recognition method to generate a key frame text sequence;

s4, extracting the structural information of the education courseware, outputting the content of each page of courseware in a text form, and generating a courseware text sequence;

and S5, automatically positioning the video explanation position corresponding to each page of education courseware by adopting a courseware-video positioning algorithm to the courseware text sequence and the key frame text sequence.

Further, S2, extracting the video stream features of the education video based on the depth network model to generate a video key frame sequence, including:

s201, framing the education video to generate a framed image with a timestamp;

s202, extracting image characteristics of the frame images through a convolutional neural network;

s203, defining the first frame image as a key frame;

s204, calculating the similarity of the features of the adjacent frame images based on the image features, taking the next frame image as a key frame when the similarity is smaller than a preset threshold value, and taking the time point of the key frame corresponding to the video as the time stamp of the key frame; otherwise, moving to the next image to continue calculating the similarity;

and S205, outputting the key frame image sequence number frame _ id and the key frame timestamp frame _ ts.

Further, the convolutional neural network is one of VGG, GoogleNet, ResNet, densneet, MobileNet and ShuffleNet.

Further, S3, extracting the text sequence in each key frame image based on the optical character recognition method, including:

s301, preprocessing the key frame image;

s302, performing character detection on the preprocessed key frame image, and returning the position coordinates of the line where the text is in a rectangular frame form;

s303, carrying out character recognition on the basis of character detection, and converting the rectangular frame area into a text;

s304, sorting all recognized texts in ascending order according to the coordinates of the upper left corner of the positions of all recognized texts, and expressing the text sequence of the ith key frame image as frame _ kw [ i]＝[kw₁,kw₂,...,kw_n]Where i ═ 1, 2., K }, K denotes the number of keyframes, n denotes the number of texts recognized by the current keyframe image, kw denotes the number of texts recognized by the current keyframe image_nThe nth text representing the previous key frame image recognition.

Further, character detection is carried out on the preprocessed key frame image, and the character detection is carried out by one method of fast R-CNN, FCN, RRPN, TextBox, CTPN and SegLink.

Further, S4, extracting the structural information of the educational courseware, and outputting the content of each page of courseware in text form, including:

s401, reading an education courseware document, calling a document analyzer to analyze the education courseware document, and returning the types and text position coordinates of all objects contained in each courseware, wherein the types of the objects comprise text types, images, tables and curves, and the text position coordinates are expressed by text object area rectangular frame coordinates;

s402, performing character recognition on a non-text object in the education courseware by adopting an optical character recognition method, and simultaneously, directly reading text contents of a text object in the education courseware;

s403, sorting all recognized texts in ascending order according to the coordinates of the upper left corner of the positions of the recognized texts, and expressing the text sequence of the jth page of courseware as ppt _ kw [ j]＝[kw₁,kw₂,...,kw_m]Wherein j is {1, 2., P }, P denotes the total number of courseware pages, m denotes the number of texts recognized by the courseware pages, kw denotes the number of texts recognized by the courseware pages_mThe mth text representing the courseware identification for that page.

Further, S5, automatically locating the video explanation position corresponding to each educational course by using a course-video locating algorithm for the course text sequence and the key frame text sequence, including:

s501, obtaining a key frame label list corresponding to the text sequence, wherein the text sequence is frame _ kw [ i [ [ i ]]＝[kw₁,kw₂,...,kw_n]Where i ═ 1, 2., K }, K denotes the number of key frames, n denotes the number of texts recognized by the current key frame image, kw denotes the number of texts recognized by the current key frame image_nThe nth text representing the identification of the previous key frame image, and the key frame reference number list corresponding to the text sequence is frame _ numlist [ i []＝ [1,2,...,n]；

S502, for each text sequence frame _ kw [ i ] of the key frame image]Traversing all education course text sequences ppt _ kw [ j ]]J ═ 1,2,. said, P }, byThe texts are subjected to character string fuzzy matching one by one, if the similarity is greater than a set threshold value, a key frame image sequence number frame _ id [ i ] is output]The serial number of the corresponding element is not output, otherwise, 0 is output, and the number list of the courseware labels of the jth page of the key frame i is obtained as

Wherein ppt _ kw [ j [ ]]＝[kw₁,kw₂,...,kw_m]The text sequence of the courseware on the jth page, j ═ 1, 2.. multidot.p }, P represents the total number of courseware pages, m represents the number of texts recognized on the courseware page, kw represents the number of texts on the page_mThe mth text representing the courseware identification of the page;

s503, list of courseware labels

After-treatment is carried out when

When two or more than 0 appear in the sequence, only one 0 in the sequence is reserved to obtain a new courseware label list

For each frame _ numlist [ i]Go through all new

Calculating the similarity between the two, and returning the courseware corresponding to the maximum similarity as the courseware label matched with the key frame i;

s504, traversing each key frame image, repeatedly executing S503 to obtain courseware pages matched with all the key frame images, sorting the key frames in an ascending order according to the courseware pages, and returning courseware labels matched with all the key frame images;

s505, post-processing courseware labels matched with all the key frame images, and finding the longest increasing subsequence of the ordered key frame sequence number list;

s506, after sequencing according to the longest ascending subsequence, only keeping the key frame with the minimum sequence number in each group as the key frame image matched with the courseware of each page, and using the time stamp of the key frame image as the video starting time for positioning the courseware of the page.

The invention also discloses an automatic positioning device for the education video, which comprises the following components:

the education resource uploading module is used for acquiring uploaded education resources, wherein the education resources comprise education videos and education courseware, the education videos are course videos recorded in the teaching and teaching process, the academic lecture process, the academic meeting process and the research and training process, the education videos comprise courseware slide playing contents, and the education courseware is courseware matched with the explanation contents in the education videos;

a video key frame generation module for extracting video stream features of the educational video based on a depth network model to generate a sequence of video key frames comprising non-repeating video images;

the key frame character recognition module is used for extracting a text sequence in each key frame image based on an optical character recognition method and generating a key frame text sequence;

the courseware structured extraction module is used for extracting the structured information of the education courseware, outputting the content of each page of courseware in a text form and generating a courseware text sequence;

and the courseware-video positioning module is used for automatically positioning the video explanation positions corresponding to each page of education courseware by adopting a courseware-video positioning algorithm on the courseware text sequence and the key frame text sequence.

The invention also discloses a storage medium which comprises a stored program, wherein when the program runs, the automatic positioning method of the education video is executed.

Compared with the prior art, the invention has the following advantages:

1. the automatic positioning method for the education video can replace manual labeling to finish accurate automatic positioning from the education courseware to the education video, effectively improves the flexibility of online learning, and improves the use efficiency of users.

2. The invention provides a video key frame generation technology, which greatly reduces the number of images converted from videos, relieves the calculation pressure of subsequent modules and improves the overall operation efficiency. Meanwhile, the generated key frame images have low similarity and strong representativeness and can be used as a video key content overview.

3. The invention provides a key frame character recognition and courseware structuralized extraction method, which is an important basis for the accuracy of subsequent courseware-video positioning. According to the invention, matching granularity is refined to the comparison between text contents through a character recognition technology and a structured extraction method, rather than simply measuring the pixel-level similarity between images, so that the problem of matching failure caused by larger image difference due to the difference of image resolution, video recording angle, video background and the like between a key frame and a courseware when the key frame and the courseware are matched can be effectively avoided.

4. The invention provides a courseware-video positioning algorithm, and in the process of matching each page of courseware with a video key frame, the accurate matching from courseware to video is realized by considering the influence of multiple factors such as text position relation, OCR fault tolerance, courseware playing animation effect and the like. Specifically, 1) text information and character position information are fully fused in the text position relationship, the matching method of the text information and character position information is more in line with human experience, and matching accuracy is improved by acquiring multi-dimensional information through multi-information fusion; 2) the OCR fault tolerance takes the difficulty difference of the video key frame and the courseware on an OCR task into consideration, the matching standard is relaxed through a character string fuzzy matching mode, and the matching rate of key frame character recognition is improved; 3) the method can match the time point of playing the courseware in the video at the beginning for each courseware, and better meets the user requirements of the actual use scene.

5. The invention provides a key frame LIS post-processing method, introduces the idea of dynamic optimization to improve the influence of an error matching result, and increases the fault tolerance and stability of the algorithm.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to these drawings without creative efforts.

Fig. 1 is a flow chart of an educational video automatic positioning method of the present invention.

Fig. 2 is a courseware-video positioning flowchart of the present invention.

Fig. 3a is example key frame 1 in an embodiment.

FIG. 3b is a key frame OCR diagram corresponding to key frame example 1 in an embodiment.

Fig. 3c is key frame example 2 in the example.

FIG. 3d is a key frame OCR diagram corresponding to key frame example 2 in an embodiment.

Fig. 3e is example key frame 3 in an embodiment.

FIG. 3f is a key frame OCR diagram corresponding to key frame example 3 in an embodiment.

FIG. 4a shows an example of ppt-type courseware resources 1 in the embodiment.

FIG. 4b shows an example of a ppt-type courseware resource 2 in the embodiment.

Fig. 5a is an example of pdf type courseware resources 1 in an embodiment.

Fig. 5b is an example of pdf type courseware resources 2 in an embodiment.

Fig. 6 is an example of a text positional relationship in the embodiment.

FIG. 7a is an example of a key frame in an exemplary OCR fault tolerance example of an embodiment.

Figure 7b is an example of a courseware corresponding to a key frame in an example of OCR fault tolerance in an embodiment.

Fig. 8 is an example of the animation effect of playing the courseware in the embodiment.

Fig. 9 is an example of a key frame 116 in an embodiment.

Fig. 10 is an example of courseware page number 2 in the embodiment.

Fig. 11 is an example of courseware page number 5 in the embodiment.

Fig. 12a is visualization example 1 in the embodiment.

Fig. 12b is visualization example 2 in the embodiment.

Fig. 12c is visualization example 3 in the embodiment.

Fig. 12d is example 4 of visualization in an embodiment.

Fig. 13 is a schematic structural diagram of an automatic positioning device for educational video according to the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood by those skilled in the art, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As shown in fig. 1, the present invention provides an automatic positioning method for education video, comprising the following steps:

s1, obtaining uploaded education resources, wherein the education resources comprise education videos and education courseware, the education videos comprise courseware slide playing contents, and the education courseware is matched with the explanation contents in the education videos.

Specifically, the education video mainly refers to a course video recorded in the course of teaching lectures, academic conferences, scientific research training and the like. Particularly, the video comprises the contents of the courseware slide show, and the video format can be MPEG/AVI/MP4/MKV/FLV, etc.; the education courseware refers to courseware matched with the video interpretation process, and the courseware format can be ppt/pdf and the like.

S2, extracting the video stream characteristics of the education video based on the depth network model so as to generate a video key frame sequence, wherein the video key frame sequence comprises non-repeated video images.

Specifically, the step of extracting the characteristics of the video stream from the input educational video through a depth network model, generating video key frames in a group of image sequence modes which are as non-repetitive as possible, and presenting a quick overview of the video mainly comprises the following steps:

s201, video framing. The input educational video is framed and converted into a time-stamped framed image.

And S202, extracting the characteristics of the frame images. And extracting the feature vector of each frame image by adopting a convolutional neural network. The convolutional neural network may be VGG, GoogleNet, ResNet, densneet, MobileNet, ShuffleNet, etc.

And S203, detecting key frames. Calculating the similarity of the characteristics of the adjacent frame images, and when the similarity is smaller than a preset threshold value, taking the image as a key frame, and taking the time point of the corresponding video as a time stamp of the key frame; otherwise, moving to the next image and continuing to calculate the similarity. Specifically, the first frame image is defined as a key frame.

And S204, outputting the key frame. And outputting the video key frame image sequence number (frame _ id) and the key frame timestamp (frame _ ts) obtained in the step.

And S3, extracting a text sequence in each key frame image based on an optical character recognition method, and generating a key frame text sequence.

Specifically, this step employs Optical Character Recognition (OCR) techniques to extract and return the text sequence in each keyframe image. The method mainly comprises the following steps:

s301, image preprocessing. A common preprocessing method corrects the imaging problem of the key frame image, and includes: graying, distortion correction, blur removal, image enhancement, light correction and the like.

And S302, character detection. And detecting a text area existing in the key frame image, and returning the position coordinates of the line of the text in a rectangular frame form. The word detection algorithm may be fast R-CNN, FCN (full relational Networks), rrpn (rotation Region probable Networks), TextBoxes, CTPN (Connectionist Text probable Networks), seglinks, etc.

And S303, character recognition. The goal of word recognition is to convert text line regions to text based on word detection. The character recognition algorithm can be based on traditional methods, such as template matching, sliding window and the like; or the more popular deep learning-based methods such as CRNN, Seq2Seq, etc.

And S304, outputting the key frame text sequence. Sorting all recognized texts in ascending order according to the coordinates at the upper left corner of the positions of all recognized texts, and then representing the text sequence of the ith key frame image as frame _ kw [ i]＝ [kw₁,kw₂,...,kw_n]. Where, i ═ 1, 2., K }, K denotes the number of key frames, n denotes the number of recognized texts of the image, and kw denotes the number of recognized texts of the image_nThe nth text representing the image recognition.

And S4, extracting the structural information of the education courseware, outputting the content of each page of courseware in a text form, and generating a courseware text sequence.

Specifically, the method is used for extracting courseware structured information and outputting courseware content of each page in a text form, and mainly comprises the following steps:

s401, reading and analyzing the document. And reading the educational courseware document, calling a document analyzer to analyze document objects, and returning the type and position coordinates of all the objects contained in each courseware. The object type includes text type, image, table, curve, etc., and the text position coordinate is expressed by the object area rectangular frame coordinate.

S402, courseware text recognition. Performing character recognition on non-text objects in courseware by adopting an OCR technology; for text type objects, the text content is read directly.

And S403, outputting a courseware text sequence. Sorting all recognized texts in ascending order according to the coordinates of the upper left corner of the positions of all recognized texts, and expressing the text sequence of the courseware of the jth page as ppt _ kw [ j ] j]＝[kw₁,kw₂,...,kw_m]. Wherein, j ═ 1, 2., P }, P denotes the total number of courseware pages, m denotes the number of texts recognized by the courseware pages, kw denotes the number of texts recognized on the pages_mThe mth text representing the courseware identification for that page.

Specifically, in the step, the video explanation position corresponding to each page of courseware is automatically positioned by matching courseware content with the video key frame text, so as to complete accurate and rapid positioning from courseware to video, and a flow chart is shown in fig. 2. In particular, the step considers the influence of multiple factors such as text position relation, OCR fault tolerance, courseware playing animation effect and the like, provides a courseware-video positioning algorithm and realizes accurate matching from courseware to video. Specifically, the method comprises the following steps:

(1) considering text position relation

In the process of matching the key frames with the courseware, when the same text intersection exists between one key frame and a plurality of pages of courseware, in addition to the similarity between texts, the position relation between texts is also considered so as to improve the matching accuracy.

(2) Considering OCR fault tolerance

Because the key frame image is intercepted from the video frame, the key frame image is influenced by factors such as a video lens, a recording angle, a teacher position and the like, the image resolution and definition are not high, the key frame OCR task is more difficult, the character recognition result is inferior to the courseware recognition result, and the key frame image and the courseware recognition result cannot be matched. In order to solve the problems, the invention considers OCR fault tolerance and adopts a character string fuzzy matching method to calculate the similarity of character strings, namely when the similarity of two character strings is larger than a set threshold, the two character strings are considered to be matched with each other, so as to avoid the problem that the key frame and courseware can not be matched due to inaccurate OCR recognition.

(3) Animation effect in courseware playing is considered

Since there may be animation effects (e.g. line-by-line output effects) during the course of playing courseware, the key frame captured by the video during recording may only show a part of the contents of the courseware, but it is desirable to match the time point of playing the courseware in the video for each courseware. Therefore, in courseware-to-video positioning, the animation effect of courseware playing in the video key frame should be considered.

In consideration of the three situations, the invention provides a courseware-video positioning algorithm to realize the page-by-page mapping of courseware to video key frames, which mainly comprises the following steps:

and S501, marking the video key frame (considering the text position relation).

According to the result returned from the step of S3 key frame character recognition, the frame _ kw [ i ] of the text sequence of the ith key frame image is obtained]＝[kw₁,kw₂,...,kw_n]. In order to consider the text position relation, the text sequence frame _ kw [ i]The corresponding key frame index list is denoted as frame _ numlist [ i]＝ [1,2,...,n]. Wherein kw is_nThe nth text representing the image recognition, n representing the number of texts recognized by the image.

S502, courseware labeling (considering OCR fault tolerance).

Obtaining a text sequence ppt _ kw [ j ] of the courseware on the jth page according to a result returned by the courseware structured extraction module]＝[kw₁,kw₂,...,kw_m]. Text sequence frame _ kw [ i ] for each key frame image]Go through all ppt _ kw [ j ]]J ═ 1, 2.. multidot.p }, where OCR fault tolerance is considered, string fuzzy matching is done on a text-by-text basis, and if the similarity is greater than a set threshold, frame _ id [ i ] is output]And the sequence number of the corresponding element, otherwise, 0 is output. Based on the above method, the list of labels for the keyframe i, page j courseware can be expressed as

And S503, post-processing courseware labels (considering courseware playing animation effect).

In order to consider the animation effect of courseware playing, post-processing is carried out on courseware labels, namely when the courseware labels are processed

When two or more consecutive 0 s appear in the sequence, only one 0 in the sequence is retained. This is because

The appearance of 0 in the sequence represents that the courseware content does not appear in the key frame, but when the courseware animation effect exists in the video showing process (such as line-by-line output), the key frame may just display the content of the beginning part of the courseware, and at the moment, the similarity between the key frame and the complete courseware content is made, so that the similarity is caused

Many 0 s are output and the match fails. Therefore, only the courseware label post-processing method is reserved

One of two or more than 0 in the sequence can effectively solve the problem that each courseware is matched with the video key frame and the time point of playing the courseware at the beginning. Post-processed courseware mark we still use

Indicates, for each frame _ numlist [ i ]]Go through all

And calculating the similarity between the two frames, and returning the maximum similarity as the courseware matched with the key frame i.

And S504, matching the whole key frame image with the courseware page number.

And traversing each key frame image, repeating S503 to obtain the courseware page number matched with all the key frame images, grouping and sequencing the key frames in an ascending order according to the courseware page number, and returning to the courseware page numbers matched with all the key frame images.

And S505, key frame LIS post-processing.

And (4) post-processing the returned result of the step four, and finding the Longest Incremental Subsequence (LIS) of the sorted key frame sequence number list, wherein the purpose is to eliminate the influence of a wrong matching result so as to increase the fault tolerance and stability of the algorithm.

And S506, outputting courseware-video positioning information.

And after sequencing according to the longest ascending subsequence, only keeping the key frame with the minimum sequence number in each group as a key frame image matched with each courseware, wherein the time stamp of the key frame image is used as the video starting time for positioning the courseware.

The scheme and effect of the invention are further illustrated by the following specific application examples:

the automatic positioning method for the education video provided by the embodiment comprises the following steps:

and S1, acquiring the uploaded educational resources, wherein the educational resources comprise educational videos and educational courseware.

And S3, extracting a text sequence in each key frame image based on an optical character recognition method, and generating a key frame text sequence. In this embodiment, the recognition results of the key frame character recognition module on the key frame image and the key frame OCR are shown in fig. 3a to 3f, and fig. 3b, 3d and 3f respectively show the recognition effect of the image including the code block, the table and the picture.

And S4, extracting the structural information of the education courseware, outputting the content of each page of courseware in a text form, and generating a courseware text sequence. In this embodiment, the structured extraction process of courseware resources by the courseware structured extraction module is shown in fig. 4-5. Fig. 4a and 4b show the structured extraction result of the courseware with the courseware resource type of ppt. Fig. 5a and 5b show the result of structured extraction of the courseware with the type of courseware resource being pdf. Firstly, analyzing a document and identifying the type of each object in a courseware; secondly, when the object is a Text type (Text), the content of the Text is directly read and output; when the object is non-text type, the object is subjected to secondary OCR recognition, and the result after character recognition is output.

And S5, automatically positioning the video explanation position corresponding to each page of education courseware by adopting a courseware-video positioning algorithm to the courseware text sequence and the key frame text sequence. The embodiment can illustrate the influence of the text position relationship, the OCR fault tolerance and the courseware playing animation effect on the positioning result in the process of matching courseware and video key frames:

(1) considering text position relation

As in fig. 6, the same text of courseware P1, courseware P2, and key frame K1 are all [ "HBase data model", "physical view" ], but it is clear that courseware P2 should match key frame K1. Therefore, when the courseware-video positioning is carried out, the similarity between texts and the position relation between the texts are considered.

(2) Considering OCR fault tolerance

OCR character recognition was performed on the key frame K1 and the courseware P1 in fig. 7a and 7b, respectively, and the results are shown in table 1. By contrast, in the key frame character recognition, the character strings "betweeen", "empty or not", "fuzzy matching", "LIKE" are misrecognized as "beiweeen", "empty in hundreds", "template matching", "LKE". In the embodiment, considering OCR fault tolerance, a fuzzy matching method process of fuzzy wuzzy character strings is adopted, and a threshold score _ cutoff is set to 70, that is, when the similarity of two character strings is greater than 70%, the two character strings are considered to be matched.

(3) Animation effect in courseware playing is considered

As shown in fig. 8, since there is animation effect output line by line during the course showing process, the key K1-key frame K3 are three key frames in the video sequence. And we would like courseware P1 to match to the point in time when the video began teaching the courseware page, i.e. courseware P1 matches key frame K1.

TABLE 1 OCR character recognition results

Next, the specific steps of the courseware-video positioning algorithm in this embodiment are explained:

and S501, marking video key frames.

Taking fig. 9 as an example of a video key frame image 116, according to the module three-key-frame character recognition module, the key frame image text sequence frame _ kw [116] ═ HBase data model, physical view, 00095800] can be obtained, and the corresponding key frame index list frame _ numlist [116] ═ 1,2,3 ].

And S502, courseware marking.

With reference to fig. 10 and fig. 11 as examples of courseware pages 2 and 5 (hereinafter referred to as "courseware 2" and "courseware 5"), respectively, a text sequence ppt _ kw [2 ] of courseware 2 can be obtained according to the module-four courseware structured extraction module]The HBase data model, related concepts, concept view, physical view]Mixing ppt _ kw [2 ]]Each text in the sequence is associated with a frame _ kw [116]The similarity is calculated for each text in the list, and the corresponding label is found. E.g. ppt _ kw [2 ]]Medium "HBase data model" is matched to frame _ kw [116]The "HBase data model" in (1) is marked as 1; ppt _ kw [2 ]]Middle "related concepts" did not match to frame _ kw [116]Middle element, the number is 0; ppt _ kw [2 ]]Middle "physical View" matching to frame _ kw [116]The "physical view" in (1) is numbered 3. Thus, a list of labels for courseware 2 relative to key frames 116 is obtained

Courseware 5 text sequence ppt _ kw [5]The HBase data model, physical view, row key, timestamp.]Mixing ppt _ kw [5 ]]Each text in the sequence is associated with a frame _ kw [116]Calculates similarity for each text in the text, and obtains courseware 5 relative to key frame 116List of reference numerals

And S503, post-processing courseware labels.

When two or more 0's appear in the courseware label list, only one 0 is kept. List of class 2 labels relative to key frames 116 after class label post-processing

List of labels for courseware 5 relative to keyframes 116

And calculating the similarity between the key frame label list and the courseware label list, and returning the courseware page number corresponding to the maximum similarity as the courseware page number matched with the key frame. In this embodiment, the similarity calculation formula of the list _ A, list _ B is defined as shown in formula (1).

Where, edge _ distance represents the edit distance between two lists, and len represents the length of the list. In summary, the similarity between the keyframe image 116 and the courseware page 2 is calculated

Similarity of keyframe image 116 to courseware page 5

Thus, the key-frame image 116 matches to courseware page number 5.

And S504, matching the whole key frame image with the courseware page number.

And traversing each key frame image, repeating S503, and returning courseware page numbers matched with all the key frame images, as shown in Table 2. The keyframe sequence numbers are sorted in ascending order by courseware page number, with the results shown in Table 3.

And S505, LIS post-processing.

A key frame sequence number list is obtained according to table 3:

frame _ list ═ 2,20,31,45,65,116,8,175,197,201,206,200,316], found frame _ list longest ascending subsequence LIS _ frame _ list ═ 2,20,31,45,65,116,175,197,201,206,316], where elements 8 and 200 in the frame _ list are eliminated, with the results after LIS post-processing as shown in table 4.

And S506, outputting courseware-video positioning information.

Only the minimum key frame number in each group in table 4 is reserved as the key frame image corresponding to the courseware of the page. According to the corresponding relationship between the key frame image sequence number and the key frame timestamp output by the video key frame generation module, a courseware-video positioning table is obtained in the lesson, as shown in table 5. Each row in the table represents the key frame image sequence number corresponding to the courseware page number, and the located video timestamp.

In addition, the embodiment also provides a step of displaying the positioning result on a visual interface. 12 a-12 d show a visualization interface presentation with the right rectangle representing the courseware page number clicked and the left being the video location of the jump. Specifically, the courseware shown in fig. 12a has animation effects output line by line in the playing process, and the method can accurately map to the time point when the video starts to explain the courseware; as shown in fig. 12b, when the resources played in the left video go in and out of the courseware resources on the right side, the text recognition technology and the structured extraction method adopted by the invention are applied to compare the texts, so that the video key frames corresponding to the courseware can be accurately matched; fig. 12c and 12d show that the method of the present invention can obtain correct matching results when the courseware contents are in plain text form and in band-diagram form.

The invention also discloses an automatic positioning device for education videos, which comprises the following components as shown in fig. 13:

the education resource uploading module is used for acquiring uploaded education resources, the education resources comprise education videos and education courseware, the education videos are course videos recorded in the teaching and teaching process, the academic lecture process, the academic meeting process and the scientific research training process, the education videos comprise courseware slide playing contents, and the education courseware is courseware matched with the explanation contents in the education videos;

For the embodiments of the present invention, the description is simple because it corresponds to the above embodiments, and for the related similarities, please refer to the description in the above embodiments, and the detailed description is omitted here.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a hardware form, and can also be realized in a software functional unit form.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. An automatic positioning method for education videos is characterized by comprising the following steps:

s2, extracting the video stream characteristics of the education video based on the depth network model so as to generate a video key frame sequence, wherein the video key frame sequence comprises non-repetitive video images;

2. The method for automatically positioning educational video according to claim 1, wherein the step S2 of extracting the video stream features of the educational video based on the depth network model to generate the video key frame sequence comprises:

s201, framing the education video to generate a framed image with a timestamp;

s203, defining the first frame image as a key frame;

3. The automatic positioning method for educational video according to claim 2, wherein the convolutional neural network is one of VGG, GoogleNet, ResNet, densneet, MobileNet and ShuffleNet.

4. The automatic positioning method for educational video according to claim 1, wherein S3, extracting text sequence in each key frame image based on optical character recognition method, comprises:

s301, preprocessing the key frame image;

S304. sorting all recognized texts in ascending order according to the coordinates of the upper left corner of the positions of all recognized texts, and then expressing the text sequence of the ith key frame image as frame _ kw [ i]＝[kw₁，kw₂，...，kw_n]Where f ═ 1, 2., K }, K denotes the number of keyframes, n denotes the number of texts recognized by the current keyframe image, kw denotes the number of texts recognized by the current keyframe image_nThe nth text representing the previous key frame image recognition.

5. The method of claim 4, wherein the text detection is performed on the preprocessed key frame image, and comprises text detection by one of FasterR-CNN, FCN, RRPN, TextBox, CTPN, and SegLink.

6. The automatic positioning method for education videos as claimed in claim 1, wherein the step of S4 extracting the structural information of the education courseware and outputting the content of each page of courseware in text form comprises:

s401, reading an education courseware document, calling a document analyzer to analyze the education courseware document, and returning the types and text position coordinates of all objects contained in each course page, wherein the types of the objects comprise text types, images, tables and curves, and the text position coordinates are expressed by text object area rectangular frame coordinates;

s403, sorting all recognized texts in ascending order according to the coordinates of the upper left corner of the positions of the texts, and expressing the text sequence of the courseware of the jth page as ppt _ kw [ j ] j]＝[kw₁，kw₂，...，kw_m]Wherein j is {1, 2., P }, P denotes the total number of courseware pages, m denotes the number of texts recognized by the courseware pages, kw denotes the number of texts recognized by the courseware pages_mThe mth text representing the courseware identification for that page.

7. The automatic positioning method for educational video according to claim 1, wherein the step S5 of automatically positioning the video explanation position corresponding to each educational course page by using a course-video positioning algorithm for the course text sequence and the key frame text sequence comprises:

s501, obtaining a key frame label list corresponding to the text sequence, wherein the text sequence is frame _ kw [ i [ [ i ]]＝[kw₁，kw₂，...，kw_n]Where i ═ 1, 2., K }, K denotes the number of keyframes, n denotes the number of texts recognized by the current keyframe image, kw denotes the number of texts recognized by the current keyframe image_nThe nth text representing the identification of the previous key frame image, and the key frame reference number list corresponding to the text sequence is frame _ numlist [ i []＝[1，2，...，n]；

S502, for each text sequence frame _ kw [ i ] of the key frame image]Traversing all education courseware text sequences ppt _ kw [ j ]]J ═ 1, 2.. multidot.p }, making character string fuzzy matching one by one through text, if the similarity is greater than the set threshold value, outputting key frame image sequence number frame _ id [ i [, i ], then outputting key frame image sequence number frame _ id [ i ],]the serial number of the corresponding element is not output, otherwise, 0 is output, and the number list of the courseware labels of the jth page of the key frame i is obtained as

Wherein ppt _ kw [ j [ ]]＝[kw₁，kw₂，...，kw_m]The text sequence of the courseware on the jth page, j ═ 1, 2.. multidot.p }, P represents the total number of courseware pages, m represents the number of texts recognized on the courseware page, kw represents the number of texts on the page_mThe mth text representing the courseware identification of the page;

s503, list of courseware labels

After-treatment is carried out when

For each frame _ numlist [ i]Go through all new

s504, traversing each key frame image, repeatedly executing S503 to obtain the number of courseware pages matched with all the key frame images, sorting the key frames in an ascending order according to the number of the courseware pages, and returning courseware labels matched with all the key frame images;

8. An educational video automatic positioning apparatus, comprising:

the education resource uploading module is used for acquiring uploaded education resources, the education resources comprise education videos and education courseware, the education videos are course videos recorded in the teaching, academic lectures, academic conferences and scientific research training processes, the education videos comprise courseware slide playing contents, and the education courseware are courseware matched with the explanation contents in the education videos;

9. A storage medium comprising a stored program, wherein the program when executed performs an educational video automatic positioning method according to any one of claims 1 to 7.