CN108268539A

CN108268539A - Video matching system based on text analysis

Info

Publication number: CN108268539A
Application number: CN201611266235.0A
Authority: CN
Inventors: 李菁菁; 黎哲明; 蔡鸿明; 姜丽红; 步丰林
Original assignee: Shanghai Jiaotong University
Current assignee: Shanghai Jiaotong University
Priority date: 2016-12-31
Filing date: 2016-12-31
Publication date: 2018-07-10

Abstract

A video matching system based on text analysis, comprising a caption analysis module, an index module and a search module, wherein: the caption analysis module extracts the text content of a subtitle file and the times at which that text appears in the video, segments the text using jieba (结巴) word segmentation, and applies the TF-IDF algorithm to the segmented text to obtain the subtitle keywords together with the start and end times at which each subtitle keyword appears in the video; the index module uses a hash indexing method to create or update a video index from the subtitle keywords and their start and end times; and the search module compares the search keyword entered by a user with the subtitle keywords in the video index and returns the list of videos with the highest similarity. The invention automates the process of building an index from subtitles, ensures the accuracy of search results, and helps the user quickly locate the time interval in a video that corresponds to a search keyword.

Description

Video matching system based on text analysis

Technical field

The present invention relates to a technology in the field of video retrieval, and specifically to a video matching system based on text analysis.

Background art

Online courses are widely used as an educational medium in the current Internet environment, and more and more people acquire knowledge through online education. Existing instructional videos are short in duration but large in number. Existing video retrieval technology is based on course descriptions or video annotations, but a course description cannot fully reflect the knowledge points that appear in a course video, so the description may not match the content. Automatic video annotation requires extracting key frames from the video and recognizing their content, but for instructional videos the recognition performs poorly and accuracy is low. Manual annotation depends on the annotator's familiarity with the course and ability to summarize it, and its results likewise cannot cover all of the knowledge content in a course video.

Summary of the invention

In view of the above deficiencies of the prior art, the present invention proposes a video matching system based on text analysis that can effectively ensure the accuracy of search results.

The present invention is achieved by the following technical solution:

The present invention comprises a caption analysis module, an index module and a search module, wherein: the caption analysis module extracts the text content of a subtitle file and the times at which that text appears in the video, segments the text using jieba word segmentation, and applies the TF-IDF algorithm to the segmented text to obtain the subtitle keywords together with the start and end times at which each subtitle keyword appears in the video; the index module uses a hash indexing method to create or update a video index from the subtitle keywords and their start and end times; and the search module compares the search keyword entered by a user with the subtitle keywords in the video index and returns the list of videos with the highest similarity.

Jieba word segmentation is a powerful Chinese word segmentation component with three segmentation modes: precise mode, full mode and search engine mode. The present invention uses precise mode for text analysis. Jieba performs an efficient word-graph scan based on a prefix dictionary, generating a directed acyclic graph (DAG) of all the ways the Chinese characters in a sentence can form words, and uses dynamic programming to find the maximum-probability path, i.e. the most probable segmentation according to word frequency. For out-of-vocabulary words it uses an HMM model based on the word-forming ability of Chinese characters, decoded with the Viterbi algorithm.
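
The DAG-plus-dynamic-programming step described above can be illustrated with a minimal sketch. It is not jieba's actual code: the dictionary `FREQ` and its counts are hypothetical toy values, the function names are illustrative, and the HMM/Viterbi handling of out-of-vocabulary words is omitted (unknown characters are simply emitted one at a time).

```python
import math

# Hypothetical toy word-frequency dictionary (not jieba's real dictionary).
FREQ = {
    "视频": 50, "检索": 30, "系统": 40, "视频检索": 8,
    "视": 5, "频": 5, "检": 3, "索": 3, "系": 4, "统": 4,
}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """For each start position, list every end position that forms a dictionary word."""
    dag = {}
    n = len(sentence)
    for start in range(n):
        ends = [end for end in range(start, n) if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # unknown character: fall back to a single character
    return dag


def segment(sentence):
    """Pick the maximum-probability path through the DAG by dynamic programming."""
    dag = build_dag(sentence)
    n = len(sentence)
    log_total = math.log(TOTAL)
    # route[i] = (best log-probability of sentence[i:], end index of the first word)
    route = {n: (0.0, 0)}
    for start in range(n - 1, -1, -1):
        route[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1)) - log_total
             + route[end + 1][0], end)
            for end in dag[start]
        )
    words, pos = [], 0
    while pos < n:
        end = route[pos][1]
        words.append(sentence[pos:end + 1])
        pos = end + 1
    return words
```

With these toy frequencies, `segment("视频检索系统")` prefers the three-word split over the longer dictionary word "视频检索" because the product of the word probabilities is higher.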

The TF-IDF algorithm, i.e. the term frequency-inverse document frequency algorithm, computes the TF-IDF value of a keyword from the number of times the keyword occurs in a document and its inverse document frequency, i.e. the weight of the keyword in the document. The more important a keyword is to a document, the larger its TF-IDF value. The present invention uses the TF-IDF algorithm to obtain the keywords in a subtitle file and their keyword values, and determines the start and end times of each keyword in the corresponding video according to their ranking.
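
As an illustration of the weighting just described, the following sketch computes TF-IDF scores over already-segmented documents. The source does not state which TF-IDF variant is used; this sketch assumes the common form tf = count / document length and idf = ln(N / document frequency), and the function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(score, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]
```

A term that appears in every document gets an idf of zero under this variant, so it never ranks as a keyword; a term concentrated in one subtitle file ranks highest for that file.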

The hash indexing method performs a hash operation using the subtitle keyword as the index key, and stores the hash result together with the corresponding row pointer information in a hash table. A hash index lookup locates the entry in a single step, which avoids multiple I/O accesses and improves search efficiency.

The present invention also relates to a matching method based on the above system, comprising the following steps:

Step 1) The caption analysis module reads the subtitle file of a video, analyzes the text content of the subtitles in the current video, and extracts the subtitle keywords;

Step 2) The obtained set of subtitle keywords is sent to the index module, which creates a video index or updates the existing video index by training on the subtitle keywords, and stores the new index in a database;

Step 3) When a user enters a search keyword in the system, the index file is obtained and the similarity between the search keyword entered by the user and the subtitle keywords in the video index is computed to obtain the set of keywords with the highest similarity; the corresponding list of videos and the time information at which the search keyword appears in each video are returned to the user.

Technical effect

Compared with the prior art, the present invention automates the process of building an index for educational videos from the subtitle keywords obtained by caption recognition. A set of word vectors is built by training word2vec, so that search keywords can be matched against subtitle keywords by computing the cosine similarity between words, which effectively ensures the accuracy of search results. When subtitle keywords are extracted, the time interval in which each keyword appears in the video is also obtained, which helps the user quickly locate the time interval in a video that corresponds to a search keyword.

Description of the drawings

Fig. 1 is a structural diagram of the system of the present invention.

Specific embodiment

The experiment of the present invention was deployed on an Alibaba Cloud host with 18 cores and 16 GB of memory. After logging in to the host remotely, the binary version of word2vec was first downloaded and then trained on the Chinese and English Wikipedia corpora, about 3.7 million articles in total. Training took three hours and produced an output file containing all word vectors, 8 GB in size. Meanwhile, each video subtitle file was first split into several small files, and 8 processes were started on the same machine to segment those small files in parallel and write the results to files. This step improved processing speed by 65% compared with single-process processing. Finally, the completed index was stored in the in-memory database Redis, which reduces disk I/O and improved query speed by about 20%.

As shown in Fig. 1, the present embodiment comprises a caption analysis module, an index module and a search module, wherein: the caption analysis module extracts the text content of a subtitle file and the times at which that text appears in the video, segments the text using jieba word segmentation, and applies the TF-IDF algorithm to the segmented text to obtain the subtitle keywords together with the start and end times at which each subtitle keyword appears in the video; the index module uses a hash indexing method to create or update a video index from the subtitle keywords and their start and end times; and the search module compares the search keyword entered by a user with the subtitle keywords in the video index and returns the list of videos with the highest similarity.

The caption analysis module is the basis of the index module. A course video file includes a video file and a subtitle file. The caption analysis module receives the subtitle file of a course video, analyzes the text content of the subtitle file, and extracts the subtitle keywords in the text content and the time points at which the subtitle keywords appear in the course video.

The caption analysis module uses a script to obtain the text content of the subtitle file and the corresponding times at which it appears in the video, then segments the text content using jieba word segmentation, and applies the TF-IDF algorithm to the segmented text to obtain the subtitle keywords and the start and end times at which they appear in the video.
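
The script that extracts text and timing from a subtitle file could look like the sketch below. The source does not name the subtitle format; this sketch assumes SubRip (.srt) cues, and `parse_srt`, `to_seconds` and `keyword_intervals` are hypothetical helper names. Keyword matching here is a plain substring test on each cue, which is a simplification of matching against segmented words.

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:04,500".
TIME_LINE = re.compile(
    r"(\d+):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*(\d+):(\d{2}):(\d{2})[,.](\d{3})"
)


def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


def parse_srt(text):
    """Return a list of (start_seconds, end_seconds, caption_text) tuples."""
    cues = []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            match = TIME_LINE.search(line)
            if match:
                g = match.groups()
                caption = " ".join(l.strip() for l in lines[i + 1:] if l.strip())
                cues.append((to_seconds(*g[:4]), to_seconds(*g[4:]), caption))
                break
    return cues


def keyword_intervals(cues, keywords):
    """Map each keyword to the (start, end) intervals of the cues that contain it."""
    intervals = {kw: [] for kw in keywords}
    for start, end, caption in cues:
        for kw in keywords:
            if kw in caption:
                intervals[kw].append((start, end))
    return intervals
```

The keywords passed to `keyword_intervals` would come from the TF-IDF step; the resulting intervals are what the index module later stores alongside each keyword.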

A storage module is preferably further provided in the device; the storage module stores the video index and the video files.

The index module is mainly used to build the video index. If no video index has yet been established in the current system, a video index is created from the subtitle keywords obtained by the caption analysis module; otherwise the existing video index is updated. The new video index is then stored in the database of the storage module to facilitate video queries.

The index module uses a hash indexing method. It first reads the subtitle keywords of a video file and their set of time intervals, and builds the inverse relationship from subtitle keyword to video and time interval, i.e. it constructs entries consisting of subtitle keyword, video file and time interval. It then computes a hash value for the subtitle keyword and writes the corresponding entry into the hash bucket. Hash collisions are resolved by combining the hash table with linked lists. Using a hash index speeds up the construction and updating of the index. Index updates use an in-place update strategy, i.e. the existing index structure is modified directly. After a new course video has been added to the video library and its subtitle keywords have been extracted, the corresponding subtitle keyword, video file and time interval entries are added directly to the existing index structure. With in-place updating, the hash value directly determines whether a subtitle keyword already exists in the existing video index, and thus whether to append to an existing entry or add a new entry.
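
A minimal sketch of such an index follows, assuming an in-memory table; the class name `KeywordIndex` and its methods are hypothetical. Python lists stand in for the linked-list chains, and the built-in `hash` stands in for whatever hash function the system actually uses.

```python
class KeywordIndex:
    """Hash table with separate chaining: keyword -> [(video_file, start, end), ...]."""

    def __init__(self, n_buckets=64):
        # Each bucket is a chain of (keyword, entries) pairs.
        self.buckets = [[] for _ in range(n_buckets)]

    def _bucket(self, keyword):
        return self.buckets[hash(keyword) % len(self.buckets)]

    def add(self, keyword, video_file, start, end):
        """In-place update: append to the keyword's entries, or create a new chain node."""
        bucket = self._bucket(keyword)
        for key, entries in bucket:
            if key == keyword:  # keyword already indexed: append an entry
                entries.append((video_file, start, end))
                return
        bucket.append((keyword, [(video_file, start, end)]))  # new keyword

    def lookup(self, keyword):
        """Return every (video_file, start, end) entry for the keyword, or an empty list."""
        for key, entries in self._bucket(keyword):
            if key == keyword:
                return entries
        return []
```

Adding a new course video only touches the buckets of its own keywords, which is the point of the in-place strategy: nothing else in the index is rebuilt.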

The search module generates query results for the search keyword entered by the user. Specifically, it matches the search keyword provided by the user against the subtitle keywords in the video index, computes the similarity between each subtitle keyword in the video index and the search keyword of the query, returns the list of videos with the highest similarity, and reads the relevant course videos from the database and returns them to the user. The matching method builds a set of word vectors by training word2vec and computes the cosine similarity between the search keyword and the index keywords to obtain the best match between subtitle keywords and the search keyword. Word2vec constructs a two-layer neural network: the input layer reads in the words within a sliding window and adds their vectors together to form the nodes of the hidden layer. The output layer is a binary tree built by the Huffman tree algorithm, and each node of the hidden layer is connected to the nodes of the binary tree by weighted edges. Given a context and a word w to be predicted, the objective is to maximize the probability of the binary code of the predicted word, and the parameters are solved by gradient descent. The word vector model built by this network performs well in linguistic evaluations, and the relationship between two word vectors is reflected directly in the difference between the two vectors, for example C(king) - C(queen) = C(man) - C(woman).
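
The cosine-similarity matching step can be sketched as follows. The word vectors would come from a trained word2vec model; here `vectors` is simply a dictionary from word to a list of floats, and `best_match` is a hypothetical helper name. Words missing from the vector table are skipped, which is an assumption about how unknown words are handled.

```python
import math


def cosine_similarity(u, v):
    """cos(u, v) = (u . v) / (|u| |v|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_match(query, index_keywords, vectors):
    """Return (keyword, similarity) for the index keyword closest to the query, or None."""
    if query not in vectors:
        return None
    candidates = [kw for kw in index_keywords if kw in vectors]
    if not candidates:
        return None
    best = max(candidates, key=lambda kw: cosine_similarity(vectors[query], vectors[kw]))
    return best, cosine_similarity(vectors[query], vectors[best])
```

The keyword returned by `best_match` is then looked up in the video index to obtain the matching videos and time intervals.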

The storage module includes a file system and a database, which store the video index and all course videos, so that the video index can be obtained for later queries and video query results can be provided during subsequent keyword matching.

When the system is in operation, the caption analysis module reads the subtitle file of a video, analyzes the text content of the subtitles in the current video, extracts the subtitle keywords, and sends the obtained set of subtitle keywords to the index module. The index module creates a video index or updates the existing video index by training on the subtitle keywords, and stores the new index in the database. When a user enters a search keyword in the system, the index file is obtained and the similarity between the search keyword entered by the user and the subtitle keywords in the video index is computed to obtain the set of keywords with the highest similarity; the corresponding list of videos and the time information at which the search keyword appears in each video are returned to the user.

Compared with the prior art, the present invention automates the process of building an index for educational videos from the subtitle keywords obtained by caption recognition. A set of word vectors is built by training word2vec, so that search keywords can be matched against subtitle keywords by computing the cosine similarity between words, which effectively ensures the accuracy of search results. When subtitle keywords are extracted, the time interval in which each keyword appears in the video is also obtained, which helps the user quickly locate the time interval in a video that corresponds to a search keyword. Compared with existing methods, search time is reduced by about 8% and search accuracy is improved by about 6%.

Those skilled in the art may make local adjustments to the above specific implementation in different ways without departing from the principle and purpose of the invention. The scope of protection of the invention is defined by the claims and is not limited by the above specific implementation; every implementation within that scope is bound by the invention.

Claims

1. A video matching system based on text analysis, characterized by comprising a caption analysis module, an index module and a search module, wherein: the caption analysis module extracts the text content of a subtitle file and the times at which that text appears in the video, segments the text using jieba word segmentation, and applies the TF-IDF algorithm to the segmented text to obtain the subtitle keywords together with the start and end times at which each subtitle keyword appears in the video; the index module uses a hash indexing method to create or update a video index from the subtitle keywords and their start and end times; and the search module compares the search keyword entered by a user with the subtitle keywords in the video index and returns the list of videos with the highest similarity.

2. The video matching system based on text analysis according to claim 1, characterized in that the jieba word segmentation includes three segmentation modes, namely precise mode, full mode and search engine mode, wherein precise mode performs an efficient word-graph scan based on a prefix dictionary, generates a directed acyclic graph of all the ways the Chinese characters in a sentence can form words, and uses dynamic programming to find the maximum-probability path, i.e. the most probable segmentation according to word frequency.

3. The video matching system based on text analysis according to claim 2, characterized in that the TF-IDF algorithm, i.e. the term frequency-inverse document frequency algorithm, obtains the TF-IDF value of a keyword from the number of times the keyword occurs in a document and its inverse document frequency, i.e. the weight of the keyword in the document; the TF-IDF algorithm is used to obtain the keywords in the subtitle file and their keyword values, and the start and end times of each keyword in the corresponding video are determined according to their ranking.

4. The video matching system based on text analysis according to claim 1, characterized in that the hash indexing method computes a hash value for each extracted subtitle keyword and stores the hash result together with the corresponding row pointer information in a hash table, thereby establishing the video subtitle keyword index.

5. A matching method based on the system according to any one of the preceding claims, characterized by comprising the following steps:

Step 1) The caption analysis module reads the subtitle file of a video, analyzes the text content of the subtitles in the current video, and extracts the subtitle keywords;

Step 2) The obtained set of subtitle keywords is sent to the index module, which creates a video index or updates the existing video index by training on the subtitle keywords, and stores the new index in a database;

Step 3) When a user enters a search keyword in the system, the index file is obtained and the similarity between the search keyword entered by the user and the subtitle keywords in the video index is computed to obtain the set of keywords with the highest similarity; the corresponding list of videos and the time information at which the search keyword appears in each video are returned to the user.