CN110019817A

CN110019817A - Method, device and electronic equipment for detecting text information in video

Info

Publication number: CN110019817A
Application number: CN201811473997.7A
Authority: CN
Inventors: 曹绍升; 孙晓军; 周俊
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2018-12-04
Filing date: 2018-12-04
Publication date: 2019-07-16

Abstract

This specification discloses a method, a device and electronic equipment for detecting text information in video. The detection method includes: extracting target pictures from a video to be detected, the target pictures including key frames of the video; extracting text information from the target pictures and performing sentence word segmentation on the text information to obtain segmented sentences; performing vector conversion on the segmented sentences to obtain the word vectors of the segmented words; and finally inputting the segmented sentences and the word vectors into a text classification model, which performs semantic recognition and outputs a semantic recognition result indicating whether the text information contains text with preset semantics, thereby detecting violating video text. Because key frames are extracted and the text information is recognized semantically, violating video text is not missed merely because its wording has changed, which improves both the accuracy and the efficiency of violating-video-text detection.

Description

Method, device and electronic equipment for detecting text information in video

Technical field

This specification relates to the field of software technology, and in particular to a method, a device and electronic equipment for detecting text information in video.

Background

With the continuous development of network technology, multimedia resources are growing explosively, and video resources are growing especially rapidly, so supervision of video quality is particularly important. Video quality supervision covers both picture content and video text. Supervision of picture content mainly means intercepting violating picture content, which image recognition technology can already do well. For video text, however, the variability of the text and the uncertainty of where it appears in the video make violations hard to detect. A method for detecting text information in video is therefore needed in order to detect violating video text.

Summary of the invention

The embodiments of this specification provide a method, a device and electronic equipment for detecting text information in video, which are used to detect violating video text and to improve the accuracy of that detection.

In a first aspect, an embodiment of this specification provides a method for detecting text information in video, including:

extracting a target picture from a video to be detected, where the target picture includes a key frame of the video to be detected;

extracting text information from the target picture;

performing sentence word segmentation on the text information to obtain a segmented sentence;

performing vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence;

inputting the segmented sentence and the word vectors into a text classification model, performing semantic recognition with the text classification model, and outputting a semantic recognition result, where the semantic recognition result indicates whether the text information contains text with preset semantics.

In a second aspect, an embodiment of this specification provides a device for detecting text information in video, including:

a picture extraction unit, configured to extract a target picture from a video to be detected, where the target picture includes a key frame of the video to be detected;

a text extraction unit, configured to extract text information from the target picture;

a word segmentation unit, configured to perform sentence word segmentation on the text information to obtain a segmented sentence;

a vector conversion unit, configured to perform vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence;

a recognition unit, configured to input the segmented sentence and the word vectors into a text classification model, perform semantic recognition with the text classification model, and output a semantic recognition result, where the semantic recognition result indicates whether the text information contains text with preset semantics.

In a third aspect, an embodiment of this specification provides a computer-readable storage medium on which a computer program is stored, the program implementing the following steps when executed by a processor:

extracting text information from the target picture;

In a fourth aspect, an embodiment of this specification provides electronic equipment including a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:

extracting text information from the target picture;

The one or more technical solutions in the embodiments of this specification have at least the following technical effects:

The embodiments of this specification provide a method for detecting text information in video. A target picture is extracted from a video to be detected, the target picture including a key frame of the video; text information is extracted from the target picture; sentence word segmentation is performed on the text information to obtain a segmented sentence; vector conversion is performed on the segmented sentence to obtain the word vectors in the sentence; and the segmented sentence and the word vectors are input into a text classification model, which performs semantic recognition and outputs a semantic recognition result indicating whether the text information contains text with preset semantics. Because the text information in the video is recognized semantically, violating video text can be detected even when its wording has merely been changed, which improves the accuracy of violating-video-text detection. Further, the method extracts key frames from the video when detecting its text information. Since the key frames cover the video content that changes significantly, extracting and recognizing text only from the key frames greatly reduces the amount of computation needed for video text recognition and thus improves the efficiency of violating-video-text detection.

Brief description of the drawings

To explain the technical solutions in the embodiments of this specification more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of this specification; a person of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flow diagram of a method for detecting text information in video provided by an embodiment of this specification;

Fig. 2 is a schematic diagram of a device for detecting text information in video provided by an embodiment of this specification;

Fig. 3 is a schematic diagram of electronic equipment provided by an embodiment of this specification.

Detailed description

To make the purposes, technical solutions and advantages of the embodiments of this specification clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of this specification, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in this specification without creative effort fall within the scope of protection of this specification.

The embodiments of this specification provide a method, a device and electronic equipment for detecting text information in video, which are used to detect violating video text and to improve the accuracy of that detection.

The main implementation principle of the technical solutions, their specific implementations, and the beneficial effects they can achieve are explained in detail below with reference to the drawings.

Embodiment

Referring to Fig. 1, an embodiment of this specification provides a method for detecting text information in video, the method including:

S10: extracting a target picture from a video to be detected, where the target picture includes a key frame of the video to be detected;

S12: extracting text information from the target picture;

S14: performing sentence word segmentation on the text information to obtain a segmented sentence;

S16: performing vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence;

S18: inputting the segmented sentence and the word vectors into a text classification model, performing semantic recognition with the text classification model, and outputting a semantic recognition result, where the semantic recognition result indicates whether the text information contains text with preset semantics.
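The flow of steps S10 to S18 can be sketched as a small pipeline. This is only an illustrative sketch: the function `detect_video_text` and its five stage parameters are hypothetical names, and each stage is passed in as a callable so that no particular key-frame, OCR, segmentation, embedding or classification implementation is implied.

```python
from typing import Any, Callable, Iterable, List


def detect_video_text(
    video: Any,
    extract_pictures: Callable[[Any], Iterable[Any]],      # S10
    extract_text: Callable[[Any], str],                    # S12
    segment: Callable[[str], List[str]],                   # S14
    vectorize: Callable[[List[str]], List[List[float]]],   # S16
    classify: Callable[[List[str], List[List[float]]], bool],  # S18
) -> bool:
    """Return True if any target picture contains text with the preset semantics."""
    for picture in extract_pictures(video):
        text = extract_text(picture)
        if not text:
            continue  # a picture without text needs no further processing
        tokens = segment(text)
        vectors = vectorize(tokens)
        # The classifier receives both the segmented sentence and its word vectors.
        if classify(tokens, vectors):
            return True
    return False
```

With trivial stand-in stages (for example, splitting on spaces and flagging a fixed word), the function returns `True` as soon as one picture's text is classified as violating, and `False` when no picture is.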

In a specific implementation, S10 is executed to extract target pictures from the video to be detected. The target pictures are only part of the pictures in the video, which improves the efficiency of detecting violating text in the video. The target pictures may include key frames of the video to be detected, and may also include hot-spot frames and/or random frames of the video to be detected.

A video consists of anywhere from a dozen to thousands of pictures, and each picture in a video is called a frame. A key frame is the frame in which a key action occurs in the movement or change of a character or object; put another way, any frame in a video that defines the start or end point of a smooth transition is a key frame, and a series of key frames defines the motion that the viewer sees. Compared with ordinary frames, a key frame carries a larger amount of change information, so the key information in a video can be obtained quickly and effectively from its key frames. Detecting violating text on extracted key frames reduces repeated computation and avoids missing key information, which improves both the efficiency and the accuracy of violating-text detection in video.

Specifically, key frames of the video to be detected can be extracted as target pictures according to the similarity between the frames of the video. For example, the average similarity between consecutive frames within a period of the video can be calculated first; then the frames in that period whose similarity to the previous frame is less than a preset multiple of the average are identified. If there is no such frame, the middle frame of the period can be extracted as the key frame; if there are such frames, they are extracted as key frames. Among the extracted key frames, frames whose brightness is below a certain threshold, that is, frames that are too dark, are excluded. In this way, the embodiment extracts key frames dynamically according to how sharply the picture changes within a period: the more sharply the picture changes, the more key frames are extracted, even if the period is short; conversely, if the picture is essentially unchanged, only a few key frames are extracted even from a very long video; and if the picture is completely black, no key frame is extracted at all, which improves the quality of the key frames.
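The key-frame rule described above can be sketched as follows. The specification does not fix a similarity measure, so this sketch assumes one (1 minus the normalised mean absolute pixel difference), represents each frame as a flat list of grayscale values in [0, 255], and uses hypothetical parameter names `factor` (the preset multiple) and `min_brightness` (the darkness threshold) with arbitrary default values.

```python
from typing import List, Sequence

Frame = Sequence[float]  # flattened grayscale pixels in [0, 255] (assumed representation)


def similarity(a: Frame, b: Frame) -> float:
    """Assumed measure: 1 minus the normalised mean absolute pixel difference."""
    diff = sum(abs(x - y) for x, y in zip(a, b)) / len(a)
    return 1.0 - diff / 255.0


def brightness(frame: Frame) -> float:
    """Mean pixel intensity of a frame."""
    return sum(frame) / len(frame)


def key_frames(frames: List[Frame], factor: float = 0.9,
               min_brightness: float = 20.0) -> List[int]:
    """Return the indices of the key frames within one period of a video."""
    if not frames:
        return []
    if len(frames) == 1:
        candidates = [0]
    else:
        # Similarity of every frame to the frame before it.
        sims = [similarity(frames[i - 1], frames[i]) for i in range(1, len(frames))]
        mean_sim = sum(sims) / len(sims)
        # Frames whose similarity to the previous frame is below factor * average.
        candidates = [i for i, s in enumerate(sims, start=1) if s < factor * mean_sim]
        if not candidates:
            candidates = [len(frames) // 2]  # no sharp change: take the middle frame
    # Exclude frames that are too dark.
    return [i for i in candidates if brightness(frames[i]) >= min_brightness]
```

Under these assumptions, a period with one abrupt scene change yields the frame at the change, a static period yields only its middle frame, and an all-black period yields no key frame.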

Considering the uncertainty of where violating text appears, this embodiment further extracts hot-spot frames to supplement the target pictures and improve the accuracy of violating-text detection in video. A hot-spot frame is a frame whose comment count is greater than a set threshold. To extract hot-spot frames, the video playback time corresponding to each comment and/or bullet comment can be obtained, and the number of comments and/or bullet comments on each frame is counted as the comment count of that frame. A comment count above the set threshold indicates that users pay more attention to the picture and text content of the frame, and such frames are more likely to contain violating text. Frames whose comment count is greater than the set threshold are therefore extracted as hot-spot frames to supplement the target pictures. In a specific implementation, a preset number of frames, that is, random frames, can also be extracted at random from the video to be detected and used as target pictures, which spreads the target pictures more widely across the video and further improves the accuracy of violating-text detection.
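A minimal sketch of hot-spot and random frame selection is given below. It assumes that each comment or bullet comment is available as a playback time in seconds and that a frame index is obtained by multiplying that time by the frame rate; the function names and the `fps`, `threshold` and `seed` parameters are illustrative only.

```python
import random
from collections import Counter
from typing import Iterable, List, Optional


def hot_spot_frames(comment_times: Iterable[float], fps: float,
                    threshold: int) -> List[int]:
    """Indices of frames whose comment count is greater than the threshold."""
    # Map each comment's playback time (seconds) to the frame shown at that time.
    counts = Counter(int(t * fps) for t in comment_times)
    return sorted(i for i, n in counts.items() if n > threshold)


def random_frames(total_frames: int, count: int,
                  seed: Optional[int] = None) -> List[int]:
    """Pick a preset number of distinct frame indices at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(total_frames), min(count, total_frames)))
```

The indices returned by both functions can be merged with the key-frame indices to form the final set of target pictures.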

After the target pictures are extracted in S10, S12 is executed to extract text information from the target pictures.

Specifically, OCR (Optical Character Recognition) technology can be used to convert the text in a target picture into structured text, thereby extracting the text information and facilitating the subsequent word segmentation and recognition. S12 can perform text extraction on every target picture, but some target pictures may contain no text; no further processing is performed on such target pictures.
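The handling of pictures without text can be sketched as follows. The OCR engine itself is not specified in this document, so it is passed in as a hypothetical callable `ocr` that maps an image to a string; `extract_texts` is likewise an illustrative name.

```python
from typing import Any, Callable, Dict, Iterable, Tuple


def extract_texts(pictures: Iterable[Tuple[int, Any]],
                  ocr: Callable[[Any], str]) -> Dict[int, str]:
    """Run OCR on each (frame_index, image) pair and keep only non-empty results."""
    texts: Dict[int, str] = {}
    for frame_index, image in pictures:
        text = ocr(image).strip()
        if text:  # pictures without any text are dropped from further processing
            texts[frame_index] = text
    return texts
```

Only the frames that appear in the returned dictionary go on to word segmentation in S14.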

In S14, sentence word segmentation is performed on the text information extracted in S12 to obtain segmented sentences. This embodiment does not limit the specific word segmentation algorithm; any existing algorithm can be used, such as a mechanical (dictionary-matching) Chinese word segmentation algorithm, an n-gram based segmentation algorithm, or a segmentation algorithm based on a hidden Markov model.
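As one example of a mechanical segmentation algorithm, forward maximum matching can be sketched as below. The choice of forward maximum matching, the function name, the `max_len` parameter and the tiny vocabulary in the usage note are all assumptions made for illustration; the document does not prescribe any particular algorithm.

```python
from typing import List, Set


def forward_max_match(sentence: str, vocabulary: Set[str],
                      max_len: int = 4) -> List[str]:
    """Segment a sentence by greedily taking the longest dictionary word at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first and shrink until a dictionary word is found;
        # a single character is always accepted so the loop is guaranteed to advance.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens
```

For instance, with the vocabulary `{"视频", "文字", "信息", "检测"}`, the sentence `"视频文字信息检测"` is split into four two-character words, and a character that is not in the vocabulary becomes a token on its own.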

After S14, S16 is executed to perform vector conversion on the segmented sentence and obtain the word vectors of the segmented words. Specifically, the conversion can be performed with a Chinese word vector algorithm such as cw2vec or with a general natural-language word vector algorithm such as word2vec. Preferably, cw2vec is used: it takes a text sequence as input and outputs the semantic vector of each segmented word in the sequence, and it expresses the semantics of Chinese words more accurately.
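Once word vectors have been trained, converting a segmented sentence amounts to a table lookup. The sketch below assumes an embedding table is already available (for example, one produced by cw2vec or word2vec; training is not shown) and maps out-of-vocabulary words to a zero vector, which is an assumption of this sketch rather than something the document specifies.

```python
from typing import Dict, List, Sequence


def to_word_vectors(tokens: Sequence[str],
                    embeddings: Dict[str, List[float]],
                    dim: int) -> List[List[float]]:
    """Look up one vector per segmented word, preserving the word order of the sentence."""
    zero = [0.0] * dim  # assumed fallback for words missing from the table
    return [list(embeddings.get(token, zero)) for token in tokens]
```

The result is a list of rows in sentence order, which is the vector matrix that a text classification model can consume.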

Based on the word vectors obtained in S16 and the segmented sentence obtained in S14, S18 is executed: the segmented sentence and the word vectors are input into the text classification model, which performs semantic recognition and outputs the semantic recognition result. Specifically, the text classification model can arrange the word vectors into a vector matrix according to the structure of the segmented sentence, perform semantic recognition on the vector matrix, and detect whether the matrix contains the preset semantics, that is, the semantics corresponding to violating text, so as to obtain and output the recognition result. Semantic recognition that combines the sentence with the word vectors identifies the meaning expressed by each segmented word and by the sentence more accurately, so violating text is detected more accurately and is not missed because the wording has changed or synonyms have been used.

The text classification model in this embodiment can be a binary text classification model. In a violation detection scenario, if part of the text is in violation, the text is regarded as violating, and the violating text can additionally be marked in the output. The text classification model can specifically be Text-CNN, a text classification model trained on a convolutional neural network, or LSTM+Softmax, a text classification model trained on a long short-term memory network. Preferably, this embodiment uses cw2vec for word vector conversion and Text-CNN for text classification: the more accurate semantic conversion, combined with the strong ability of Text-CNN to detect local information, improves the accuracy of violating-text detection in video.
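The inference step of a Text-CNN style binary classifier can be sketched in plain Python as below: each filter slides over windows of consecutive word vectors, the activations are max-pooled over the sentence, and a sigmoid turns the pooled features into a violation score. This is a simplified illustration, not the model of this document: the function names, the ReLU activation, the sigmoid output and all weights are assumptions, and training is not shown.

```python
import math
from typing import List

Matrix = List[List[float]]  # one row per segmented word, one column per vector dimension


def conv_max_pool(matrix: Matrix, kernel: Matrix, bias: float) -> float:
    """Apply one filter to every window of len(kernel) words, then ReLU and max-pool."""
    height = len(kernel)
    best = 0.0  # ReLU output is never negative; also the result when no window fits
    for start in range(len(matrix) - height + 1):
        total = bias
        for row, kernel_row in zip(matrix[start:start + height], kernel):
            total += sum(x * w for x, w in zip(row, kernel_row))
        best = max(best, total)
    return best


def text_cnn_score(matrix: Matrix, kernels: List[Matrix], biases: List[float],
                   out_weights: List[float], out_bias: float) -> float:
    """Return a score in (0, 1); a higher score means violating text is more likely."""
    features = [conv_max_pool(matrix, k, b) for k, b in zip(kernels, biases)]
    logit = out_bias + sum(f * w for f, w in zip(features, out_weights))
    return 1.0 / (1.0 + math.exp(-logit))
```

Because each filter only looks at a few neighbouring words at a time, a short violating phrase can produce a strong feature wherever it appears in the sentence, which is the local-information property mentioned above.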

In the above embodiment, a target picture is extracted from a video to be detected, the target picture including a key frame of the video; text information is extracted from the target picture; sentence word segmentation is performed on the text information to obtain a segmented sentence; vector conversion is performed on the segmented sentence to obtain the word vectors in the sentence; and the segmented sentence and the word vectors are input into a text classification model, which performs semantic recognition and outputs a semantic recognition result indicating whether the text information contains text with preset semantics. Because the text information in the video is recognized semantically, violating video text can be detected even when its wording has merely been changed, which improves the accuracy of violating-video-text detection. Further, the method extracts key frames from the video when detecting its text information. Since the key frames cover the video content that changes significantly, extracting and recognizing text only from the key frames greatly reduces the amount of computation needed for video text recognition and thus improves the efficiency of violating-video-text detection.

Based on the method for detecting text information in video provided by the above embodiment, this embodiment correspondingly provides a device for detecting text information in video. Referring to Fig. 2, the device includes:

a picture extraction unit 20, configured to extract a target picture from a video to be detected, where the target picture includes a key frame of the video to be detected;

a text extraction unit 22, configured to extract text information from the target picture;

a word segmentation unit 24, configured to perform sentence word segmentation on the text information to obtain a segmented sentence;

a vector conversion unit 26, configured to perform vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence;

a recognition unit 28, configured to input the segmented sentence and the word vectors into a text classification model, perform semantic recognition with the text classification model, and output a semantic recognition result, where the semantic recognition result indicates whether the text information contains text with preset semantics.

As an optional implementation, the picture extraction unit 20 can extract target pictures in any one or more of the following modes:

Mode one: extracting key frames of the video to be detected as the target pictures according to the similarity between the frames of the video to be detected;

Mode two: extracting, from the video to be detected, hot-spot frames whose comment count is greater than a set threshold, and using the hot-spot frames as the target pictures;

Mode three: extracting a preset number of random frames at random from the video to be detected, and using the random frames as the target pictures.

Compared with ordinary frames, a key frame carries a larger amount of change information, so the key information in a video can be obtained quickly and effectively from its key frames. Detecting violating text on extracted key frames reduces repeated computation and avoids missing key information, which improves both the efficiency and the accuracy of violating-text detection in video.

As an optional implementation, the vector conversion unit 26 can perform vector conversion on the segmented sentence with a Chinese word vector algorithm or a general natural-language word vector algorithm to obtain the word vectors of the segmented words in the sentence. Preferably, when the text information is Chinese, the Chinese word vector algorithm cw2vec can be used to obtain semantically more accurate word vectors.

As an optional implementation, the text classification model used by the recognition unit 28 can be a text classification model trained on a convolutional neural network or a text classification model trained on a long short-term memory network. Preferably, cw2vec is used for word vector conversion and Text-CNN for text classification: the more accurate semantic conversion, combined with the strong ability of Text-CNN to detect local information, improves the accuracy of violating-text detection in video.

As for the device in the above embodiment, the specific manner in which each unit performs its operations has been described in detail in the method embodiment and is not elaborated here.

Fig. 3 is a block diagram of electronic equipment 700 for implementing the data query method according to an exemplary embodiment. For example, the electronic equipment 700 may be a computer, a database console, a tablet device, a personal digital assistant, or the like.

Referring to Fig. 3, the electronic equipment 700 may include one or more of the following components: a processing component 702, a memory 704, a power supply component 706, a multimedia component 708, an input/output (I/O) interface 710, and a communication component 712.

The processing component 702 generally controls the overall operation of the electronic equipment 700, such as operations associated with display, data communication, and recording. The processing component 702 may include one or more processors 720 that execute instructions to complete all or part of the steps of the above method. In addition, the processing component 702 may include one or more modules that facilitate interaction between the processing component 702 and other components.

The memory 704 is configured to store various types of data to support the operation of the electronic equipment 700. Examples of such data include instructions for any application or method operating on the electronic equipment 700, contact data, phone book data, messages, pictures, videos, and so on. The memory 704 can be implemented by any type of volatile or non-volatile storage device or a combination of them, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.

The power supply component 706 supplies power to the various components of the electronic equipment 700. The power supply component 706 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic equipment 700.

The I/O interface 710 provides an interface between the processing component 702 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.

The communication component 712 is configured to facilitate wired or wireless communication between the electronic equipment 700 and other devices. The electronic equipment 700 can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or a combination of them. In one exemplary embodiment, the communication component 712 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 712 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented on the basis of radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.

In an exemplary embodiment, the electronic equipment 700 can be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 704 including instructions, where the instructions can be executed by the processor 720 of the electronic equipment 700 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.

A non-transitory computer-readable storage medium is provided: when the instructions in the storage medium are executed by the processor of a mobile terminal, the electronic equipment is enabled to perform a data query method, the method including:

extracting a target picture from a video to be detected, where the target picture includes a key frame of the video to be detected; extracting text information from the target picture; performing sentence word segmentation on the text information to obtain a segmented sentence; performing vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence; and inputting the segmented sentence and the word vectors into a text classification model, performing semantic recognition with the text classification model, and outputting a semantic recognition result, where the semantic recognition result indicates whether the text information contains text with preset semantics.

It should be understood that the present invention is not limited to the precise structure described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A method for detecting text information in video, comprising:

extracting a target picture from a video to be detected, wherein the target picture includes a key frame of the video to be detected;

extracting text information from the target picture;

inputting the segmented sentence and the word vectors into a text classification model, performing semantic recognition with the text classification model, and outputting a semantic recognition result, wherein the semantic recognition result indicates whether the text information contains text with preset semantics.

2. The method according to claim 1, wherein extracting the target picture from the video to be detected comprises:

extracting, from the video to be detected, a hot-spot frame whose comment count is greater than a set threshold, and using the hot-spot frame as the target picture; and/or

extracting a preset number of random frames at random from the video to be detected, and using the random frames as the target picture.

3. The method according to claim 2, wherein performing vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence comprises:

performing vector conversion on the segmented sentence with a Chinese word vector algorithm or a natural-language word vector algorithm to obtain the word vectors of the segmented words in the sentence.

4. The method according to any one of claims 1 to 3, wherein the text classification model is a text classification model trained on a convolutional neural network.

5. A device for detecting text information in video, comprising:

a picture extraction unit, configured to extract a target picture from a video to be detected, wherein the target picture includes a key frame of the video to be detected;

a text extraction unit, configured to extract text information from the target picture;

a vector conversion unit, configured to perform vector conversion on the segmented sentence to obtain the word vectors of the segmented words in the sentence;

a recognition unit, configured to input the segmented sentence and the word vectors into a text classification model, perform semantic recognition with the text classification model, and output a semantic recognition result, wherein the semantic recognition result indicates whether the text information contains text with preset semantics.

6. The device according to claim 5, wherein the picture extraction unit is further configured to:

7. The device according to claim 6, wherein the vector conversion unit is configured to:

8. The device according to any one of claims 5 to 7, wherein the text classification model is a text classification model trained on a convolutional neural network.

9. A computer-readable storage medium on which a computer program is stored, characterized in that the program implements the following steps when executed by a processor:

extracting text information from the target picture;

10. Electronic equipment, characterized by comprising a memory and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs include instructions for performing the following operations:

extracting text information from the target picture;