WO2018028583A1 - Subtitle extraction method and apparatus, and storage medium - Google Patents

Subtitle extraction method and apparatus, and storage medium

Info

Publication number
WO2018028583A1
WO2018028583A1 (application PCT/CN2017/096509)
Authority
WO
WIPO (PCT)
Prior art keywords
subtitle
video
video frame
region
video frames
Prior art date
Application number
PCT/CN2017/096509
Other languages
English (en)
French (fr)
Inventor
王星星
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018028583A1
Priority to US16/201,386 (US11367282B2)

Classifications

    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G06T5/30 Erosion or dilatation, e.g. thinning
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G06T5/94 Dynamic range modification of images or parts thereof based on local image properties, e.g. for local contrast enhancement
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20224 Image subtraction
    • H04N21/4345 Extraction or processing of SI, e.g. extracting service information from an MPEG stream
    • H04N21/44008 Operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H04N21/4402 Reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/4828 End-user interface for program selection for searching program descriptors
    • H04N21/4884 Data services, e.g. news ticker, for displaying subtitles

Definitions

  • the present invention relates to image processing technology, and in particular, to a caption extraction method and device, and a storage medium.
  • subtitles in video files are recorded in various ways, such as embedded subtitles, internal subtitles, and external subtitles.
  • Embedded subtitles are overlaid on the video frames of the video and integrated with them; the size, position, and effect of the subtitle are unchanged regardless of the video format.
  • Internal subtitles package the video file and the subtitle files together into a container with at least two audio tracks and at least two subtitle tracks, and the dubbing and the subtitles can be selected during playback.
  • The subtitle file corresponding to an external subtitle is independent of the video file; when the video is to be played, the video player loads the subtitle file to be used onto the video.
  • Embodiments of the present invention provide a caption extraction method and device, and a storage medium, which are capable of extracting subtitles in various forms from a video.
  • an embodiment of the present invention provides a method for extracting a subtitle, where the method includes:
  • the subtitles are extracted by fusing the color-enhanced contrast extreme value regions of at least two channels.
  • An embodiment of the present invention further provides a storage medium storing an executable program which, when executed by a processor, implements the caption extraction method provided by the embodiments of the present invention.
  • an embodiment of the present invention provides a caption extraction device, where the device includes:
  • a decoding unit configured to decode the video to obtain video frames;
  • a connectivity unit configured to perform a connectivity operation in the subtitle arrangement direction on the pixels in the video frames to obtain connected domains in the video frames;
  • a positioning unit configured to determine the video frames including the same subtitle based on the connected domains in the video frames, and to determine the subtitle area in the video frames including the same subtitle based on the distribution positions of the connected domains in those video frames;
  • an extracting unit configured to construct component trees corresponding to at least two channels of the subtitle region and to extract the contrast extreme value region corresponding to each channel by using the constructed component trees;
  • an enhancement unit configured to perform color enhancement processing on the contrast extreme value regions of the at least two channels to form color-enhanced contrast extreme value regions from which redundant pixels and noise are filtered out;
  • a fusion unit configured to extract the subtitles by fusing the color-enhanced contrast extreme value regions of the at least two channels.
  • an embodiment of the present invention provides a caption extraction device, where the device includes:
  • a processor and a storage medium storing executable instructions for causing the processor to perform the following operations:
  • extracting the subtitles by fusing the color-enhanced contrast extreme value regions of at least two channels.
  • With the embodiments of the present invention, the subtitle area (the image corresponding to the connected domain) can be extracted for any form of subtitle and is not affected by the subtitle form used by the video.
  • At the same time, the contrast extreme value regions extracted from the subtitle area are color-enhanced and fused, which effectively filters out strong interference from illumination and clothing in the background of the subtitle area image, so that the background and the subtitle are better separated, which is beneficial to improving the efficiency and precision of the subsequent text recognition.
  • FIG. 1A is an optional schematic diagram of a pixel relationship in an embodiment of the present invention
  • FIG. 1B is an optional schematic diagram of a pixel relationship in an embodiment of the present invention.
  • FIG. 1C is an optional schematic diagram of a pixel relationship in an embodiment of the present invention.
  • FIG. 1D is an optional schematic diagram of a pixel relationship in an embodiment of the present invention.
  • FIG. 1E is an optional schematic diagram of a pixel relationship in an embodiment of the present invention.
  • FIG. 1F is an optional schematic diagram of a pixel relationship in an embodiment of the present invention.
  • FIG. 2 is a schematic diagram showing an optional hardware structure of a caption extraction apparatus in an embodiment of the present invention
  • FIG. 3 is a schematic diagram of an optional scenario of subtitle extraction in an embodiment of the present invention.
  • FIG. 4A is an optional schematic flowchart of a method for extracting a caption in an embodiment of the present invention.
  • FIG. 4B is an optional schematic flowchart of a method for extracting a caption according to an embodiment of the present invention.
  • FIG. 5A is an optional schematic diagram of a subtitle area in an embodiment of the present invention.
  • FIG. 5B is an optional schematic diagram of a subtitle area in an embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of superimposing subtitle areas in an embodiment of the present invention.
  • FIG. 7 is an optional schematic diagram of forming a region of contrast extreme value in an embodiment of the present invention.
  • FIG. 9A is a schematic diagram of an optional scenario of caption extraction in an embodiment of the present invention.
  • FIG. 9B is a schematic diagram of an optional scenario of subtitle extraction in the embodiment of the present invention.
  • FIG. 10 is a schematic diagram showing an optional functional structure of a caption extraction apparatus in an embodiment of the present invention.
  • The terms "comprising", "including", or any other variant thereof are intended to cover a non-exclusive inclusion, so that a method or device comprising a series of elements includes not only the elements that are explicitly recited but also other elements that are not explicitly listed, or elements inherent to the method or device.
  • An element defined by the phrase "comprising a ..." does not exclude the presence of additional related elements (e.g., a step in the method or a unit in the device) in the method or device that includes the element.
  • Gray value: the gray value of a pixel is an integer; for example, when the gray value of a pixel ranges from 0 to 255, the image is called a 256-gray-level image.
  • Erosion: deletes some pixels of the object boundary and has the effect of shrinking the image.
  • The erosion algorithm scans each pixel of the image with an n x n structuring element and performs an AND operation between the structuring element and the binary image it covers; if all values are 1, the corresponding pixel of the output image is 1, otherwise it is 0. After erosion, the image boundary shrinks inward.
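  • As an illustration only (not part of the original disclosure), the erosion and dilation operations described above can be sketched with OpenCV; the kernel size and binarization threshold are assumptions:

```python
import cv2
import numpy as np

# Load a video frame as a grayscale image and binarize it (threshold value is an assumption).
frame = cv2.imread("frame.png", cv2.IMREAD_GRAYSCALE)
_, binary = cv2.threshold(frame, 128, 255, cv2.THRESH_BINARY)

# n x n structuring element; n = 3 is chosen only for illustration.
kernel = np.ones((3, 3), np.uint8)

# Erosion shrinks object boundaries inward; dilation expands them outward.
eroded = cv2.erode(binary, kernel, iterations=1)
dilated = cv2.dilate(binary, kernel, iterations=1)
```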
  • Adjacency: two pixels are adjacent if they are in contact; a pixel is in contact with the pixels in its neighborhood. Adjacency considers only the spatial relationship of pixels.
  • the neighborhood includes the following types:
  • D neighborhood: as shown in FIG. 1B, the D neighborhood of pixel p(x, y) consists of the pixels on its diagonals: (x+1, y+1), (x+1, y-1), (x-1, y+1), and (x-1, y-1); N_D(p) denotes the D neighborhood of pixel p.
  • 8 neighborhood: the 8 neighborhood of pixel p(x, y) is the union of its 4 neighborhood and its D neighborhood, and N8(p) denotes the 8 neighborhood of pixel p.
  • Connectivity includes the following types:
  • Distribution concentration trend: the range in which the values of an array are concentrated; it is usually determined with the mode method, the median method, or the mean method. The mode method measures the value that occurs most often in the array (the mode), the median method measures the middle value (the median) of the array, and the mean method measures the arithmetic mean of the values in the array.
  • Inter-frame difference method: the gray values of corresponding pixels in adjacent video frames are subtracted. If the ambient brightness does not change much and the gray-level difference of a corresponding pixel is small (does not exceed a threshold), the object represented by that pixel can be considered still; if the gray level of an image area changes a lot (exceeds the threshold), this can be considered to be caused by the motion of an object in the image.
  • The still pixel regions and the moving pixel regions are marked, and the positions of the moving object and the still object in the video frame can be obtained from the marked pixel regions.
  • The frame difference method applies pixel-based temporal differencing and thresholding between two or three adjacent frames of a continuous image sequence to extract the moving object regions in a video frame.
  • A real-time moving-object tracking system can use a three-frame difference to perform moving target detection; this not only improves the speed of moving object detection but also improves the completeness of the detected regions in the video frame.
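  • A minimal sketch of the two-frame difference described above (illustrative only; the gray-level threshold is an assumption):

```python
import cv2

def frame_difference_mask(prev_gray, curr_gray, threshold=25):
    """Mark pixels whose gray-level change between two adjacent frames exceeds a threshold."""
    diff = cv2.absdiff(prev_gray, curr_gray)            # per-pixel gray-level difference
    _, motion_mask = cv2.threshold(diff, threshold, 255, cv2.THRESH_BINARY)
    return motion_mask                                   # 255 marks moving pixels, 0 marks still pixels
```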
  • Scale-Invariant Feature Transform (SIFT) feature matching algorithm: used to detect local features in video frames, that is, feature points of local appearance on an object; these features are independent of the size and rotation of the object in the image.
  • The feature points found by the scale-invariant feature transform feature matching algorithm are prominent points that do not change with factors such as illumination, affine transformation, and noise, for example, corner points, edge points, bright points in dark areas, and dark points in bright areas.
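  • For illustration, SIFT keypoint matching between two subtitle-region images can be sketched with OpenCV (the ratio-test threshold is an assumption, and cv2.SIFT_create assumes OpenCV 4.4 or later):

```python
import cv2

def sift_match_count(img1, img2, ratio=0.75):
    """Count SIFT feature matches between two grayscale images using Lowe's ratio test."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1, None)
    kp2, des2 = sift.detectAndCompute(img2, None)
    if des1 is None or des2 is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
    good = [pair[0] for pair in matches
            if len(pair) == 2 and pair[0].distance < ratio * pair[1].distance]
    return len(good)
```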
  • Contrast Extremal Region (CER): an area of a video frame that has a certain contrast with the surrounding background (exceeding a contrast threshold), so that it can at least be perceived by the human eye.
  • Color-enhanced CER: a CER enhanced with the color information of the PII (Perception-based Illumination Invariant) color space; the color information is used to filter out redundant pixels and noise in the CER, resulting in a color-enhanced CER. The PII color space has visual perception consistency and is insensitive to illumination, so it is closer to the human eye's judgment of color.
  • The color model of the PII color space includes hue H, saturation S, and brightness V.
  • Embodiments of the present invention provide a caption extraction method, a caption extraction device applying the caption extraction method, and a storage medium (in which executable instructions for executing the caption extraction method are stored).
  • The caption extraction device provided by the embodiments of the present invention may be implemented in various forms: for example, as a mobile terminal such as a smart phone, a tablet computer, or a vehicle-mounted terminal; as a fixed terminal such as a desktop computer, a smart TV, or a set-top box; as a similar computing device; or as a server on the network side.
  • FIG. 2 exemplarily shows an optional hardware structure diagram of the caption extraction device 10.
  • The hardware structure shown in FIG. 2 is merely an example and does not constitute a limitation on the device structure.
  • For example, more components than those shown in FIG. 2 may be provided, or some components may be omitted, according to implementation requirements.
  • an optional hardware structure of the caption extraction device 10 includes at least one processor 11, memory 12, at least one network interface 14, and a user interface 13.
  • The various components in the caption extraction device 10 are coupled together by a bus system 15, which is used to implement connection and communication between these components.
  • In addition to the data bus, the bus system 15 includes a power bus, a control bus, and a status signal bus; however, for clarity of description, the various buses are all labeled as the bus system 15 in FIG. 2.
  • The user interface 13 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch panel, or a touch screen.
  • The network interface 14 provides the processor 11 with access to external data, for example, a remotely located memory 12.
  • In an exemplary embodiment, the network interface 14 may perform short-range communication based on Near Field Communication (NFC), Bluetooth, or ZigBee technology, and may also implement communication based on systems such as Code Division Multiple Access (CDMA) or Wideband Code Division Multiple Access (WCDMA) and their evolutions.
  • The memory 12 can be a volatile memory, a non-volatile memory, or both volatile and non-volatile memory.
  • The non-volatile memory can be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferromagnetic Random Access Memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory can be a disk memory or a tape memory.
  • the volatile memory can be a random access memory (RAM) that acts as an external cache.
  • Many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM).
  • the memory 12 described in the embodiments of the present invention is intended to include, but is not limited to, these and any other suitable types of memory.
  • the memory 12 in the embodiment of the present invention is used to store various types of data to support the operation of the caption extraction device 10.
  • Examples of such data include any computer program for operating on the caption extraction device 10, such as the operating system 121 and the application 122, as well as contact data, phonebook data, messages, and pictures.
  • The operating system 121 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks.
  • The application 122 may include various applications, such as a media player and a browser, for implementing various application services.
  • a program implementing the method of the embodiment of the present invention may be included in the application 122.
  • the method disclosed in the foregoing embodiment of the present invention may be applied to the processor 11 or implemented by the processor 11.
  • the processor 11 may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the foregoing method may be completed by an integrated logic circuit of hardware in the processor 11 or an instruction in a form of software.
  • The processor 11 described above may be a general-purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the processor 11 can implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention.
  • a general purpose processor can be a microprocessor or any conventional processor or the like.
  • The steps of the method disclosed in the embodiments of the present invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor.
  • the software module can reside in a storage medium located in memory 12, and processor 11 reads the information in memory 12, in conjunction with its hardware, to perform the steps of the foregoing method.
  • In an exemplary embodiment, the caption extraction device 10 may be implemented by one or more Application Specific Integrated Circuits (ASIC), DSPs, Programmable Logic Devices (PLD), Complex Programmable Logic Devices (CPLD), Field-Programmable Gate Arrays (FPGA), general-purpose processors, controllers, Micro Controller Units (MCU), microprocessors, or other electronic components, to perform the foregoing method.
  • The terminal 30 logs in to the online video application server 40 to run an online video playing application, and requests the data of a video from the video server 50 according to the identifier, such as a serial number (ID), of the video selected by the user in the online video playing application. The video server 50 obtains the video from the video database, extracts the subtitle area from the video and recognizes it as text, and sends the text together with the video to the terminal 30 via the online video application server 40.
  • Alternatively, the video server 50 may recognize the subtitles in text form from the videos in the video database in advance and store them in the video database along with the videos. In this way, when at least two terminals concurrently acquire videos (for example, when terminals request subtitles of different videos, or subtitles of the same video), the subtitles in text form can be delivered in real time, avoiding delay.
  • Certainly, the video server 50 may also send only the data of the video requested by the terminal 30 to the corresponding terminal for playing, and deliver the corresponding subtitle in text form only when the terminal 30 needs the subtitle of the video in text form.
  • The terminal 30, the online video application server 40, and the video server 50 transmit streaming-media video data in a real-time streaming (Realtime Streaming) or progressive streaming (Progressive Streaming) manner.
  • For real-time streaming, a streaming media server is used, or the data is transmitted with a protocol such as the Real Time Streaming Protocol (RTSP); for progressive streaming, an HTTP (HyperText Transfer Protocol) server is typically used, and the streaming-media video data is sent by progressive download. Which transmission method is used depends on the real-time requirements of video playback.
  • Certainly, the terminal 30 can also download all the data of the video to the local device for playback.
  • An optional flow of subtitle extraction that can be applied to FIG. 3 is described with reference to FIG. 4A. As shown in FIG. 4A, the following steps are included:
  • Step 101: The video is decoded to obtain video frames.
  • Step 102 Perform a connection operation of the subtitle arrangement direction on the pixels in the video frame to obtain a connected domain in the video frame.
  • In an optional embodiment, video frames at different time points are sampled according to the duration of the video, for example, video frames at different time points are extracted based on the frame rate of the video. To avoid missing the subtitle in any video frame, the sampling rate used when extracting video frames is greater than the frame rate of the video.
  • Erosion and dilation operations are performed on the pixels in the extracted video frames, and a connectivity operation in the subtitle arrangement direction is performed on the video frames after the erosion and dilation operations. Generally, subtitles are arranged in the video from left to right, so a left-right connectivity operation is performed on the pixels in the video frame, which makes the characters of the subtitle area in the video frame form a connected domain.
  • Of course, if the arrangement direction of the subtitles in the video is known to differ from the conventional direction, the connectivity operation can be performed specifically for the subtitle arrangement direction of the video; see the sketch below.
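  • A minimal sketch of this step, assuming OpenCV and illustrative kernel sizes (the wide horizontal kernel stands in for the left-right connectivity operation; none of the values are from the patent):

```python
import cv2
import numpy as np

def subtitle_connected_domains(gray_frame):
    """Return bounding statistics of horizontally connected components that may hold a subtitle line."""
    _, binary = cv2.threshold(gray_frame, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)

    # Erosion followed by dilation removes small noise (3 x 3 kernel is an assumption).
    kernel = np.ones((3, 3), np.uint8)
    clean = cv2.dilate(cv2.erode(binary, kernel), kernel)

    # A wide, flat kernel joins characters along the left-right subtitle direction.
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (25, 3))
    connected = cv2.morphologyEx(clean, cv2.MORPH_CLOSE, horiz_kernel)

    # Each remaining connected component is a candidate connected domain (text line).
    num, labels, stats, _ = cv2.connectedComponentsWithStats(connected, connectivity=8)
    return stats[1:]  # skip the background row; columns are x, y, width, height, area
```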
  • Step 103 Determine a video frame including the same subtitle based on the connected domain in the video frame.
  • In an optional embodiment, the pixels of the connected domains in adjacent extracted video frames are differenced, for example over the channels of the RGB space or over the channels of the PII space. When the difference is higher than a difference threshold, the pixels of the connected domains in the adjacent video frames differ too much; if the subtitles of the connected domains in adjacent video frames were the same, the pixel difference would necessarily be small (below the difference threshold). Therefore, when the difference is below the difference threshold it is determined that the adjacent extracted video frames include the same subtitle, and when the difference is above the difference threshold it is determined that they include different subtitles.
  • In another optional embodiment, feature points are extracted from the corresponding connected domains with the scale-invariant feature transform (SIFT) feature matching algorithm. Since the extracted feature points do not change with position, scale, or rotation, if the subtitles in adjacent video frames are the same, the feature points extracted from the connected domains of those frames will match; correspondingly, whether the feature points of the connected domains match determines whether adjacent video frames include the same subtitle.
  • The above two ways of determining whether adjacent video frames include the same subtitle may also be used in combination, which further improves the accuracy of identifying video frames including different subtitles. For example, the pixels of the connected domains in adjacent extracted video frames are differenced, and when the difference is below the difference threshold and the SIFT feature points extracted from the connected domains of the adjacent frames match, it is determined that the adjacent video frames include the same subtitle; otherwise, it is determined that they include different subtitles. A sketch of the combined check follows.
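  • A minimal sketch of the combined check, assuming the SIFT match count is computed separately (for example with the sift_match_count sketch above) and illustrative thresholds:

```python
import cv2

def same_subtitle(region1, region2, sift_matches, diff_threshold=10.0, min_matches=12):
    """Combine the pixel-difference criterion with a SIFT match count to decide whether
    two subtitle-region images show the same subtitle."""
    g1 = cv2.cvtColor(region1, cv2.COLOR_BGR2GRAY)
    g2 = cv2.cvtColor(region2, cv2.COLOR_BGR2GRAY)
    g2 = cv2.resize(g2, (g1.shape[1], g1.shape[0]))
    mean_diff = cv2.absdiff(g1, g2).mean()   # average per-pixel difference of the connected domains
    return mean_diff < diff_threshold and sift_matches >= min_matches
```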
  • Step 104 Determine a subtitle area in a video frame including the same subtitle based on a distribution position of the connected domain in the video frame including the same subtitle.
  • In an optional embodiment, a distribution concentration trend characteristic of the distribution positions is determined, for example, with the mode method, the median method, or the mean method.
  • When the mode method is used, based on the number of occurrences of the distribution positions of the connected domains in the video frames including the same subtitle, the region formed by the most frequent distribution position (that is, the region whose edges are at that distribution position) is determined to be the subtitle area.
  • When the median method is used, the region formed by the median distribution position (that is, the region whose edges are at the median position) is determined to be the subtitle area.
  • When the mean method is used, based on the distribution positions of the connected domains in the video frames including the same subtitle, the region formed by the mean of the distribution positions (that is, the region whose edges are at the mean position) is determined to be the subtitle area. A sketch of the mode method follows.
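  • A minimal sketch of the mode method, assuming each frame containing the same subtitle contributes one connected-domain bounding box (x1, y1, x2, y2):

```python
from collections import Counter

def subtitle_area_by_mode(boxes):
    """Return the most frequently occurring edge positions as the subtitle area."""
    most_common = lambda values: Counter(values).most_common(1)[0][0]
    x1 = most_common(b[0] for b in boxes)
    y1 = most_common(b[1] for b in boxes)
    x2 = most_common(b[2] for b in boxes)
    y2 = most_common(b[3] for b in boxes)
    return x1, y1, x2, y2
```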
  • Step 105: Construct component trees corresponding to at least two channels of the subtitle area, and extract the contrast extreme value region corresponding to each channel by using the constructed component trees.
  • In an optional embodiment, for at least two channels of the subtitle area of the video frame, such as a grayscale map, the hue channel of PII, and the saturation channel of PII, a component tree formed by nested nodes is constructed; the nodes of the component tree correspond to the characters of the subtitle area.
  • The contrast between a node and its adjacent background is characterized by the area change rate of the node relative to its adjacent node. Since an extreme value region has at least a contrast with its adjacent background that can be perceived by the human eye, when the area change rate of a node relative to its adjacent node is less than the area change rate threshold, it is determined that the node belongs to a contrast extreme value region of the corresponding channel; see the sketch below.
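  • A minimal sketch of selecting contrast extreme value regions from nested component-tree nodes by their area change rate (the node structure is an assumption; the values 3 and 0.5 follow the parameter discussion later in this description):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """A node of the component tree: a nested extremal region of one channel."""
    area: int
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)

def area_change_rate(node: Node, delta: int = 3) -> float:
    """(S(n_{i+delta}) - S(n_i)) / S(n_i), using the ancestor delta levels above the node."""
    ancestor = node
    for _ in range(delta):
        if ancestor.parent is None:
            break
        ancestor = ancestor.parent
    return (ancestor.area - node.area) / node.area

def extract_cers(nodes: List[Node], rate_threshold: float = 0.5) -> List[Node]:
    """Keep nodes whose area grows slowly toward their ancestors, i.e. high-contrast regions."""
    return [node for node in nodes if area_change_rate(node) < rate_threshold]
```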
  • Step 106: Perform color enhancement processing on the contrast extreme value regions of the at least two channels.
  • In an optional embodiment, for the contrast extreme value region of each channel, the main color of the region is determined, pixels whose similarity to the main color satisfies a preset condition are extracted from the region, and the color-enhanced contrast extreme value region of the corresponding channel is composed of the extracted pixels.
  • For example, the pixels in the subtitle area are sorted by gray value from large to small, and the set of pixels whose gray values rank in a predetermined leading proportion is taken; if the color distance between a pixel in the set and the main color of the set is less than the color distance threshold (the minimum color distance at which the human eye can perceive a difference in color), the color-enhanced contrast extreme value region is formed based on such pixels. A sketch follows.
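  • A minimal sketch of this color enhancement step (the 50% proportion and the color-distance threshold are assumptions consistent with the description, not values from the patent):

```python
import numpy as np

def color_enhance_cer(region_bgr, cer_mask, top_ratio=0.5, dist_threshold=40.0):
    """Keep only CER pixels whose color is close to the region's main color.

    region_bgr: H x W x 3 image of the subtitle area.
    cer_mask:   H x W boolean mask of one contrast extreme value region.
    """
    gray = region_bgr.mean(axis=2)
    ys, xs = np.nonzero(cer_mask)
    colors = region_bgr[ys, xs].astype(np.float32)

    # Main color: mean color of the brightest top_ratio fraction of CER pixels.
    order = np.argsort(gray[ys, xs])[::-1]
    top = colors[order[: max(1, int(len(order) * top_ratio))]]
    main_color = top.mean(axis=0)

    # Keep pixels whose color distance to the main color is below the threshold.
    keep = np.linalg.norm(colors - main_color, axis=1) < dist_threshold
    enhanced = np.zeros_like(cer_mask)
    enhanced[ys[keep], xs[keep]] = True
    return enhanced
```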
  • Step 107: Fuse the color-enhanced contrast extreme value regions of the at least two channels to form a color-enhanced contrast extreme value region from which redundant pixels and noise are filtered out.
  • In an optional embodiment, the fused color-enhanced contrast extreme value regions come from at least two of the following channels: a grayscale map; the hue channel of the perception-based illumination-invariant (PII) color space; and the saturation channel of PII. Since the color-enhanced contrast extreme value regions of the subtitle area image have already filtered out noise and background, strong interference from illumination and clothing in the background of the subtitle area image can be effectively removed, so that the background and the subtitle are better separated, which promotes the efficiency and accuracy of character recognition.
  • Step 108: Perform character recognition on the color-enhanced contrast extreme value region.
  • Since the image of the color-enhanced contrast extreme value region has already filtered out the noise and the background, the difficulty of performing character recognition on the image is significantly reduced, and character recognition can be performed on the subtitle region with related character recognition technology; a sketch with one possible engine follows.
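  • The embodiments do not prescribe a particular recognition engine; as one illustrative possibility, the cleaned subtitle image can be read with Tesseract via the pytesseract wrapper:

```python
import cv2
import pytesseract

def recognize_subtitle(clean_region_path, lang="chi_sim+eng"):
    """Run OCR on the color-enhanced subtitle-region image (the engine choice is an assumption)."""
    image = cv2.imread(clean_region_path, cv2.IMREAD_GRAYSCALE)
    return pytesseract.image_to_string(image, lang=lang).strip()
```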
  • Step 201 The terminal runs an online video playing application.
  • Step 202: The terminal requests video data from the server according to the identifier, such as a serial number (ID), of the video selected by the user in the online video playing application;
  • Step 203 The server acquires a video from the video database based on the video identifier, such as a serial number (ID);
  • Step 204: The server decodes the video to obtain video frames, and performs a left-right connectivity operation on the pixels in the video frames to obtain connected domains in the video frames;
  • Step 205: The server differences the pixels of the connected domains (the connected domains correspond to the text lines of the subtitles) in adjacent extracted video frames; when the difference is less than the difference threshold, the server determines that the adjacent extracted video frames include the same subtitle;
  • Step 206 The server determines, according to the number of occurrences of the distribution position of the connected domain in each video frame including the same subtitle, an area formed by the distribution position with the most occurrences is a subtitle area;
  • Step 207 The server constructs a component tree corresponding to at least two channels of the subtitle area, and uses the constructed component tree to extract a contrast extreme value area corresponding to each channel;
  • Step 208: For the contrast extreme value region of each channel, the server determines the main color of the region, extracts from the region the pixels whose similarity to the main color satisfies a preset condition, and composes the color-enhanced contrast extreme value region of the corresponding channel from the extracted pixels;
  • Step 209 The server combines the color enhanced contrast extreme value regions of at least two channels to form a color enhanced contrast extreme value region for filtering redundant pixels and noise, and performs character recognition on the color enhanced contrast extreme value region to extract characters;
  • Step 210 The server delivers the character to the terminal.
  • Steps 204-206 in this example implement the positioning of subtitles in the video;
  • steps 207-209 implement the extraction of the located subtitles from the complex background.
  • In this example, the server extracts the connected domain corresponding to the subtitle from the video frame, so that the subtitle area (the image corresponding to the connected domain) can be extracted for any form of subtitle, unaffected by the subtitle form used by the video; at the same time, the contrast extreme value regions extracted from the subtitle area are color-enhanced and fused, which effectively filters out strong interference from illumination and clothing in the background of the subtitle area image, so that the background and the subtitle are better separated, which is beneficial to improving the efficiency and precision of the subsequent text recognition.
  • Locating the subtitle position in a video file with a complex background and extracting a clean subtitle image mainly consists of two parts of processing: first, subtitle positioning in the video; second, extracting the located subtitle from the complex background.
  • Video subtitle positioning: video frames at different time points are extracted according to the video duration, morphological erosion and dilation operations are performed on these video frames, and a left-right connected-domain operation is combined with them to obtain the subtitle area of each video frame.
  • The above positioning operation is performed on video frames at different times to obtain the positions of a series of subtitle areas in the video, and the accurate position information of the subtitle area in the video is obtained by taking the mode of these positions.
  • Video subtitle extraction: on the basis of video subtitle positioning, the text of the subtitle area must be separated from the background information. The frame difference method and the SIFT feature matching algorithm are used to determine whether subtitle information at different times in the time domain is the same subtitle; if so, the images of the subtitle area of the same subtitle are superimposed and averaged, which eliminates the interference of some backgrounds such as illumination and clothes. The subtitle is then found by fusing the CERs (contrast extreme value regions) of the subtitle area over multiple channels and performing color filtering on the subtitle area, and finally the color-enhanced CER yields the image of the clean subtitle area.
  • An optional hardware environment for video server 50 in Figure 3 is as follows:
  • Memory 1GB or more
  • Hard disk 120GB or more.
  • An optional software operating environment for video server 50 in Figure 3 is as follows:
  • The video is decoded to obtain video frames; the images are eroded and dilated, and then a left-right connectivity operation is performed to obtain the subtitle target area of each frame.
  • The subtitle area is located by taking N frames at different times of the same video, and the final text line position is obtained by taking the mode of all the (x1, y1) and (x2, y2) coordinates.
  • The original image of a video frame, as well as an optional schematic of positioning the subtitle area in the video frame, are shown in FIG. 5A and FIG. 5B.
  • The frame difference method in the video time domain and SIFT feature matching are used to determine whether the subtitles in two video frames are the same subtitle.
  • An optional flow for judging whether two or more video frames include the same subtitle by using the frame difference method is shown in FIG. 6.
  • The text line image 1 and the text line image 2 are recorded, and two ways are combined to determine whether text line image 1 and text line image 2 are the same subtitle:
  • Method 1: compare the pixel difference values of adjacent text lines, and use the horizontal projection and the vertical projection (for a binary image, the horizontal projection is the number of non-zero pixel values in each row, here 1 or 255, and the vertical projection is the number of non-zero pixel values in each column of the image data) to determine whether text line image 1 and text line image 2 are the same subtitle.
  • Method 2: extract and match the SIFT features of text line image 1 and text line image 2, obtain a similarity from the matching result, and combine the result of the frame difference method with the SIFT feature similarity to comprehensively determine whether they are the same subtitle. If they are the same subtitle, text line image 1 and text line image 2 are superimposed and averaged to form a new text line image; see the sketch below.
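  • A minimal sketch of Method 1's projections and of the superimpose-and-average step (illustrative only; the projection tolerance is an assumption):

```python
import numpy as np

def projections(binary_line):
    """Horizontal projection: non-zero count per row; vertical projection: non-zero count per column."""
    mask = binary_line > 0
    return mask.sum(axis=1), mask.sum(axis=0)

def projections_similar(line1, line2, tolerance=0.1):
    """Compare two text-line images by their horizontal and vertical projections."""
    h1, v1 = projections(line1)
    h2, v2 = projections(line2)
    if h1.shape != h2.shape or v1.shape != v2.shape:
        return False
    total_diff = np.abs(h1 - h2).sum() + np.abs(v1 - v2).sum()
    return total_diff / max(1, line1.size) < tolerance

def superimpose_average(line1, line2):
    """Average two text-line images of the same subtitle to suppress background interference."""
    return ((line1.astype(np.float32) + line2.astype(np.float32)) / 2).astype(np.uint8)
```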
  • a component tree is constructed to extract the CER area.
  • (N, i), (N, i+1), ... is a chain of nodes/extremal regions corresponding to a Chinese character, nested sequentially from bottom to top on the component tree.
  • Let S(N, i), S(N, i+1), ... denote the areas of (N, i), (N, i+1), ..., respectively. Then the area change rate of node (N, i) relative to its ancestor (N, i+Δ) is: R_ΔS(n_i, n_{i+Δ}) = (S(N, i+Δ) - S(N, i)) / S(N, i).
  • The area change rate R_ΔS(n_i, n_{i+Δ}) can be used to measure the contrast between node (N, i) and its adjacent background. Suppose an extremal region in the subtitle area image corresponds to a certain binarization threshold; when the threshold is decreased, the extremal region expands outward or merges with other extremal regions and its area increases, and R_ΔS(n_i, n_{i+Δ}) describes the rate of this area growth. If an extremal region has high contrast with its adjacent background, the area that expands outward is smaller and the area grows more slowly.
  • Accordingly, R_ΔS(n_i, n_{i+Δ}) is inversely related to the contrast between the node and its adjacent background: the larger R_ΔS(n_i, n_{i+Δ}), the lower the contrast.
  • On this basis, the contrast extreme value region (CER) can be defined as a node of the component tree whose area change rate, computed with the parameter Δ, does not exceed an area change rate threshold.
  • The above parameter setting in the embodiment of the present invention is based on the visual perception of the human eye; the principle is that a CER must have at least the minimum contrast that can be perceived by the human eye. Through experiments, the two parameters are set to 3 and 0.5, respectively.
  • The number of CERs extracted from the component tree is much lower than the number of nodes on the original component tree; for example, for an image on the order of a megapixel, the extracted CERs usually number only a few hundred to a few thousand.
  • The color enhancement algorithm mainly includes two steps: 1) estimating the main color of the CER; and 2) extracting from the CER, to form a color-enhanced CER, the pixels whose color is similar to the main color.
  • Noise pixels are usually located at the edge of the CER, so their gray values are small. Therefore, to estimate the main color of a CER (denoted CERc), the pixels contained in CERc are sorted by gray value, and S_med denotes the set of pixels whose gray values rank in the top 50% of CERc; N_pi is the number of pixels in the set S_med, F_pi is the color of pixel p_i, and F_dc is the main color of CERc.
  • Pixel p_i being similar to the main color F_dc can be defined as follows: if the color distance d(F_pi, F_dc) < T_dc (T_dc is a constant), then F_pi is similar in color to F_dc; T_dc describes the minimum degree of similarity between F_pi and F_dc. Based on human perception and experiments, T_dc is set to a fixed constant.
  • The fusion of the CERs extracted from at least two channels involves a color space transformation, which is illustrated below in conjunction with an optional schematic diagram of the color space transformation.
  • Transformation from the RGB color space to the PII color space: let the vector (R_rgb, G_rgb, B_rgb) represent a color in the RGB color space, where each component has a value range of 0 to 1; if the value range of (R_rgb, G_rgb, B_rgb) is not 0 to 1, the values are first linearly normalized to the interval 0 to 1. Then (R_rgb, G_rgb, B_rgb) is transformed as follows:
  • the A matrix used in the transformation is obtained by training;
  • the B matrix used in the transformation is obtained by training.
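  • A heavily simplified sketch of the shape of this transformation (the patent's trained A and B matrices, and any intermediate step of the original disclosure, are not reproduced here; identity placeholders are used purely for illustration):

```python
import numpy as np

# Hypothetical placeholders: the actual A and B matrices are obtained by training
# in the original disclosure and are not reproduced in this sketch.
A = np.eye(3)
B = np.eye(3)

def rgb_to_pii(rgb):
    """Linearly normalize an RGB triple to [0, 1] and apply the two trained linear maps."""
    rgb = np.asarray(rgb, dtype=np.float64)
    if rgb.max() > 1.0:               # e.g. 0..255 input is linearly normalized to 0..1
        rgb = rgb / 255.0
    return B @ (A @ rgb)
```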
  • any form of caption in the video can be extracted for character recognition, which is exemplarily described in combination with an application scenario:
  • The video subtitles are extracted, character recognition is performed, and the subtitles in text form are analyzed to determine the type, attributes, and the like of the video and to analyze the user's preferences, forming the user's preference database.
  • As the number of videos viewed by the user accumulates, new online videos are recommended to the user based on the preferences in the preference database.
  • A content index database of the videos is created from the subtitles in text form, and videos whose content matches a keyword entered by the user are searched according to that keyword, thereby overcoming the defect that the related technology can only search based on the category and name of a video.
  • The share-while-watching function of the video allows the user, while watching the video, to extract and recognize the subtitles of the current video playing interface as text through a one-click recognition function and automatically fill them into the dialog box for sharing comments, thereby improving the fluency and the degree of automation of the sharing operation.
  • An optional logical function structure of the foregoing caption extraction device is described below with reference to the illustrated optional logical function structure diagram of the caption extraction device. It should be pointed out that the illustrated logical function structure is merely exemplary; based on it, those skilled in the art may merge or split the units therein to make various modifications to the logical function structure of the caption extraction device.
  • the caption extraction device includes:
  • the decoding unit 110 is configured to decode the video to obtain a video frame.
  • the connectivity unit 120 is configured to perform a connectivity operation in the subtitle arrangement direction on the pixels in the video frame to obtain connected domains in the video frame;
  • the positioning unit 130 is configured to determine a video frame including the same subtitle based on the connected domain in the video frame, and determine a subtitle area in the video frame including the same subtitle based on a distribution position of the connected domain in the video frame including the same subtitle;
  • the extracting unit 140 is configured to construct a component tree corresponding to at least two channels of the subtitle region, and extract a contrast extreme value region corresponding to each channel by using the constructed component tree;
  • the enhancement unit 150 is configured to perform color enhancement processing on the contrast extreme value regions of the at least two channels to form color-enhanced contrast extreme value regions from which redundant pixels and noise are filtered out;
  • the fusion unit 160 is configured to extract the subtitles by fusing the color-enhanced contrast extreme value regions of the at least two channels.
  • The connectivity unit 120 is further configured to extract video frames at different time points according to the duration of the video, perform erosion and dilation operations on the extracted video frames, and perform a left-right connectivity operation on the video frames after the erosion and dilation operations.
  • In an optional embodiment, the connectivity unit 120 samples video frames at different time points according to the duration of the video, for example, extracting video frames at different time points based on the frame rate of the video; to avoid missing the subtitle in any video frame, the sampling rate used when extracting video frames is greater than the frame rate of the video. Erosion and dilation operations are performed on the pixels in the extracted video frames, and a connectivity operation in the subtitle arrangement direction is performed on the video frames after those operations. Generally, subtitles are arranged in the video from left to right, so a left-right connectivity operation is performed on the pixels in the video frame, which makes the characters of the subtitle area form a connected domain. Of course, if the arrangement direction of the subtitles in the video is known to differ from the conventional direction, the connectivity operation can be performed specifically for that subtitle arrangement direction.
  • The positioning unit 130 is further configured to difference the pixels of the connected domains in adjacent extracted video frames, determine that the adjacent extracted video frames include the same subtitle when the difference is below the difference threshold, and determine that they include different subtitles when the difference is above the difference threshold.
  • In an optional embodiment, the pixels of the connected domains in adjacent video frames are differenced, for example, over the channels of the RGB space or over the channels of the PII space. When the difference is higher than the difference threshold, the pixels of the connected domains in the adjacent video frames differ too much, whereas if the subtitles of the connected domains in adjacent video frames were the same, the pixel difference would necessarily be small (below the difference threshold); therefore, the adjacent extracted video frames are determined to include the same subtitle when the difference is below the threshold, and different subtitles when it is above the threshold.
  • The positioning unit 130 is further configured to extract feature points from the connected domains in adjacent extracted video frames, determine that the adjacent extracted video frames include the same subtitle when the feature points extracted from the connected domains match, and determine that they include different subtitles when they do not match.
  • In an optional embodiment, feature points are extracted from the corresponding connected domains with the scale-invariant feature transform feature matching algorithm. Since the extracted feature points do not change with position, scale, or rotation, if the subtitles in adjacent video frames are the same, the feature points extracted from the connected domains of those frames will match; correspondingly, whether the feature points of the connected domains match determines whether the adjacent video frames include the same subtitle.
  • The above two ways of determining whether adjacent video frames include the same subtitle may also be used in combination, which further improves the accuracy of identifying video frames including different subtitles. For example, the pixels of the connected domains in adjacent extracted video frames are differenced, and when the difference is below the difference threshold and the feature points extracted with the scale-invariant feature transform feature matching algorithm from the connected domains of the adjacent frames match, it is determined that the adjacent video frames include the same subtitle; otherwise, it is determined that they include different subtitles.
  • The positioning unit 130 is further configured to determine the number of occurrences of the distribution positions of the edge regions of the connected domains in the video frames including the same subtitle, and to determine that the region formed by the most frequent distribution position is the subtitle area. The distribution position here refers to the distribution position of the edge region of the connected domain.
  • In an optional embodiment, a distribution concentration trend characteristic of the distribution positions is determined, for example, with the mode method, the median method, or the mean method.
  • When the mode method is used, based on the number of occurrences of the distribution positions of the connected domains in the video frames including the same subtitle, the region formed by the most frequent distribution position (that is, the region whose edges are at that position) is the subtitle area.
  • When the median method is used, the region formed by the median distribution position (that is, the region whose edges are at the median position) is the subtitle area.
  • When the mean method is used, based on the distribution positions of the connected domains in the video frames including the same subtitle, the region formed by the mean of the distribution positions (that is, the region whose edges are at the mean position) is the subtitle area.
  • the enhancement unit 150 is further configured to determine a contrast extremity region for each channel in a manner that constructs a component tree formed by nested nodes corresponding to a subtitle region of the video frame from each of the following channels: Degree map; a tone channel based on perceptual illumination invariant PII; a saturation channel of PII; wherein the nodes of the component tree correspond to characters of the subtitle region.
  • Degree map a tone channel based on perceptual illumination invariant PII
  • a saturation channel of PII wherein the nodes of the component tree correspond to characters of the subtitle region.
  • for example, the pixels in the subtitle area are sorted in descending order of gray value, and the set of pixels whose gray values rank within a predetermined top proportion is taken; if the color distance between a pixel in the set and the main color of the set is less than the color distance threshold (the color distance threshold is the minimum color distance at which the human eye can perceive a difference in color), a color-enhanced contrast extremal region is formed based on such pixels.
  • the enhancement unit 150 is further configured to form the color-enhanced contrast extremal region of each channel in the following manner: determining the main color of the contrast extremal region of each channel; extracting, from the contrast extremal region of each channel, pixels whose degree of similarity to the main color satisfies a preset condition, and composing the color-enhanced contrast extremal region of the corresponding channel based on the extracted pixels.
  • the caption extraction device 10 further includes:
  • the identification unit 170 is configured to perform character recognition on the color-enhanced contrast extremal region;
  • the response unit 180 is configured to perform, on the recognized text, at least one of video search, video recommendation, video tag classification, and subtitle sharing operations.
  • after the video subtitles are extracted, character recognition is performed; the subtitles in text form are analyzed to determine the type, attributes, and the like of the video and to analyze the user's preferences; as the number of videos watched by the user accumulates, a preference database of the user can be built, and newly released videos can be recommended to the user according to the user's preferences.
  • a content index database of videos is created according to the text-form subtitles of the videos, and videos whose content matches a keyword entered by the user are searched for according to that keyword, thereby overcoming the defect that the related technology can only search based on the category and name of a video (a minimal sketch of such a content index is given after this list).
  • the watch-and-share function of the video allows the user, while watching the video, to extract and recognize the subtitles of the current video playing interface as text through a one-click recognition function and to automatically fill them into the dialog box for sharing comments, improving the fluency and automation of the sharing operation.
  • the subtitle region (the image corresponding to the connected domain) can be extracted for any form of subtitle, regardless of the form of subtitles used by the video;
  • the extracted and recognized subtitles can be used for personalized video recommendation, that is, the video attributes are understood by analyzing the video subtitles and recommendations are made according to the video content attributes; the extracted video subtitles can also be used for video content-based search, making it easy for users to find the videos they want.
  • all or part of the steps of the foregoing method embodiments may be completed by hardware related to program instructions; the foregoing program may be stored in a computer-readable storage medium, and when executed, the program performs the steps of the foregoing method embodiments; the foregoing storage medium includes various media that can store program code, such as a mobile storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, or an optical disk.
  • the above-described integrated unit of the present invention may be stored in a computer readable storage medium if it is implemented in the form of a software function module and sold or used as a standalone product.
  • the technical solutions of the embodiments of the present invention, in essence or in the part contributing to the related technology, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the embodiments of the present invention; the foregoing storage medium includes various media that can store program code, such as a mobile storage device, a RAM, a ROM, a magnetic disk, or an optical disk.
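As an illustration of how the recognized text-form subtitles could feed a content index for keyword-based video search, the following minimal Python sketch is given; the class name, the whitespace tokenization, and the data structures are assumptions made for illustration only and are not part of the embodiments:

```python
from collections import defaultdict

class SubtitleIndex:
    """Inverted index from subtitle keywords to video IDs, built from recognized subtitle text."""
    def __init__(self):
        self.index = defaultdict(set)

    def add_video(self, video_id, subtitle_text):
        # Tokenization is simplified to whitespace splitting for illustration
        for token in subtitle_text.split():
            self.index[token].add(video_id)

    def search(self, keyword):
        # Return the videos whose subtitle content matches the keyword
        return sorted(self.index.get(keyword, set()))

# Usage sketch:
# index = SubtitleIndex()
# index.add_video('vid-001', recognized_subtitle_text)
# index.search('keyword')
```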

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

本发明实施例公开一种字幕提取方法及装置、存储介质;方法包括:对视频解码得到视频帧,对视频帧中的像素进行字幕排布方向的连通操作,得到视频帧中的连通域;基于视频帧中的连通域确定包括相同字幕的视频帧,并基于包括相同字幕的视频帧中连通域的分布位置,确定包括相同字幕的视频帧中的字幕区域;针对字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;对至少两个通道的对比度极值区域进行颜色增强处理,形成颜色增强对比度极值区域;融合至少两个通道的颜色增强对比度极值区域。

Description

字幕提取方法及装置、存储介质
相关申请的交叉引用
本申请基于申请号为201610643390.3、申请日为2016年08月08日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本发明涉及图像处理技术,尤其涉及一种字幕提取方法及装置、存储介质。
背景技术
目前，视频文件中字幕的记载方式多样，例如，内嵌式字幕、内挂字幕和外挂字幕等。其中，内嵌式字幕是将字幕覆盖在视频的视频帧上时，与视频的视频帧融为一体，不论视频格式如何进行变化，字幕的大小、位置、效果都是不变的。内挂字幕是将视频文件和字幕文件封装为至少两个音轨和至少两个字幕轨，在播放时可选择配音和字幕。外挂字幕对应的字幕文件与视频文件相互独立，在需要播放视频的时候，由视频播放器调用字幕文件在视频上加载。
针对终端的视频播放器支持各种形式的字幕的情况,难以使用统一的字幕提取方式来实现对所有形式的字幕进行提取和识别,导致在视频播放过程中无法自动提取文本形式的字幕以供用户进行分享或记录。
发明内容
本发明实施例提供一种字幕提取方法及装置、存储介质,能够通过统 一的字幕提取方式从视频中提取各种形式的字幕。
本发明实施例的技术方案是这样实现的:
第一方面,本发明实施例提供一种字幕提取方法,所述方法包括:
对视频解码得到视频帧,对所述视频帧中的像素进行字幕排布方向的连通操作,得到所述视频帧中的连通域;
基于所述视频帧中的连通域确定包括相同字幕的视频帧,并基于所述包括相同字幕的视频帧中连通域的分布位置,确定所述包括相同字幕的视频帧中的字幕区域;
针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
对所述至少两个通道的对比度极值区域进行颜色增强处理,形成颜色增强对比度极值区域;
通过融合至少两个通道的颜色增强对比度极值区域提取出字幕。
本发明实施例还提供一种存储介质,存储有可执行程序,所述可执行程序被处理器执行时,实现本发明实施例提供的上述字幕提取方法。
第二方面,本发明实施例提供一种字幕提取装置,所述装置包括:
解码单元,配置为对视频解码得到视频帧;
连通单元,配置为对所述视频帧中的像素进行字幕排布方向的连通操作,得到所述视频帧中的连通域;
定位单元,配置为基于所述视频帧中的连通域确定包括相同字幕的视频帧,并基于所述包括相同字幕的视频帧中连通域的分布位置,确定所述包括相同字幕的视频帧中的字幕区域;
提取单元,配置为针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
增强单元,配置为对所述融合的至少两个通道的对比度极值区域进行 颜色增强处理,形成滤除冗余像素和噪声的颜色增强对比度极值区域;
融合单元,配置为通过融合至少两个通道的对比度极值区域提取出字幕。
第三方面,本发明实施例提供一种字幕提取装置,所述装置包括:
处理器和存储介质;所述存储介质中存储有可执行指令,所述可执行指令用于引起所述处理器执行以下的操作:
对视频解码得到视频帧,对所述视频帧中的像素进行字幕排布方向的连通操作,得到所述视频帧中的连通域;
基于所述视频帧中的连通域确定包括相同字幕的视频帧,并基于所述包括相同字幕的视频帧中连通域的分布位置,确定所述包括相同字幕的视频帧中的字幕区域;
针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
对所述至少两个通道的对比度极值区域进行颜色增强处理,形成颜色增强对比度极值区域;
通过融合至少两个通道的颜色增强对比度极值区域提取出字幕。
本发明实施例具有以下有益效果:
从视频帧中提取对应字幕的连通域，从而对于任意形式的字幕都能够提取字幕区域（与连通域对应的图像），不受视频使用何种形式的字幕的影响；同时，对从字幕区域提取的对比度极值区域进行颜色增强处理并进行融合，有效消除字幕区域图像中光照、衣物等强干扰背景，以便更好地分离背景与字幕，有利于提升后续文字识别的效率和精度。
附图说明
图1A是本发明实施例中像素关系的一个可选的示意图;
图1B是本发明实施例中像素关系的一个可选的示意图;
图1C是本发明实施例中像素关系的一个可选的示意图;
图1D是本发明实施例中像素关系的一个可选的示意图;
图1E是本发明实施例中像素关系的一个可选的示意图;
图1F是本发明实施例中像素关系的一个可选的示意图;
图2是本发明实施例中字幕提取装置的一个可选的硬件结构示意图;
图3是本发明实施例中字幕提取的一个可选的场景示意图;
图4A是本发明实施例中字幕提取方法的一个可选的流程示意图;
图4B是本发明实施例中字幕提取方法的一个可选的流程示意图;
图5A是本发明实施例中字幕区域的一个可选的示意图;
图5B是本发明实施例中字幕区域的一个可选的示意图;
图6是本发明实施例中字幕区域叠加的一个可选的流程示意图；
图7是本发明实施例中形成对比度极值区域的一个可选的示意图;
图8是本发明实施例中颜色空间转换的一个可选的示意图;
图9A是本发明实施例中字幕提取的一个可选的场景示意图;
图9B是本发明实施例中字幕提取的一个可选的场景示意图;
图10是本发明实施例中字幕提取装置的一个可选的功能结构示意图。
具体实施方式
以下结合附图及实施例,对本发明进行进一步详细说明。应当理解,此处所提供的实施例仅仅用以解释本发明,并不用于限定本发明。另外,以下所提供的实施例是用于实施本发明的部分实施例,而非提供实施本发明的全部实施例,在本领域技术人员不付出创造性劳动的前提下,对以下实施例的技术方案进行重组所得的实施例、以及基于对发明所实施的其他实施例均属于本发明的保护范围。
需要说明的是,在本发明实施例中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的方法或 者装置不仅包括所明确记载的要素,而且还包括没有明确列出的其他要素,或者是还包括为实施方法或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的方法或者装置中还存在另外的相关要素(例如方法中的步骤或者装置中的单元)。
本发明实施例中涉及的名词和术语适用于如下的解释。
1)灰度值:表示像素明暗程度的整数量,例如:像素的灰度取值范围为0-255,就称该图像为256个灰度级的图像。
2)腐蚀(Erode):删除对象边界某些像素,具有收缩图像作用,腐蚀算法使用一个nXn结构元素去扫描图像中的每一个像素,用nXn结构元素与nXn结构元素覆盖的二值图像做“与”操作,如果都为1,图像的该像素为1,否则为0。腐蚀之后,图像边界向内收缩。
3)膨胀(Dilate):添加对象边界某些像素,具有扩大图像作用;膨胀算法使用一个nXn结构元素去扫描图像中的每一个像素。用nXn结构元素与nXn结构元素覆盖的二值图像做“与”操作,如果都为0,图像的该像素为0,否则为1。膨胀之后,图像边界向外扩大。
4)邻接:两个像素接触,则它们是邻接的。一个像素和它的邻域中的像素是接触的。邻接仅考虑像素的空间关系。
邻域包括以下几种类型:
4.1)4邻域:如图1A所示,像素p(x,y)的4邻域是邻接的像素:(x+1,y);(x-1,y);(x,y+1);(x,y-1)。
4.2)D邻域:如图1B所示,像素p(x,y)的D邻域是对角上的像素(x+1,y+1);用ND(p)表示像素p的D邻域:(x+1,y-1);(x-1,y+1);(x-1,y-1)。
4.3)8邻域:如图1C所示,像素p(x,y)的8邻域是:4邻域的像 素+D邻域的像素,用N8(p)表示像素p的8邻域。
5)连通,两个像素连接(1)是邻接的;(2)灰度值(或其他属性)满足某个特定的相似准则(灰度相等或在某个集合中等条件)。
连通包括以下几种类型:
5.1)4连通
如图1D所示,对于具有灰度值V的像素p和q,如果q在集合N4(p)中,则称这两个像素是4连通。
5.2)8连通
如图1E所示,对于具有值V的像素p和q,如果q在集合N8(p)中,则称这两个像素是8连通的。
如图1F所示,对于具有值灰度值V的像素p和q,如果:
I.q在集合N4(p)中,或,
II.q在集合ND(p)中,并且N4(p)与N4(q)的交集为空(没有灰度值V的像素),则像素p和q是m连通的,即4连通和D连通的混合连通。
6)连通区域,彼此连通(上述的任意一种连通方式)的像素形成一个区域,而不连通的点形成不同的区域。这样的一个所有的点彼此连通点构成的集合,称为连通域。
7)数据分布集中趋势特征,也就是数组中的数字集中分布的字段,通常利用众数法、中位数法和均值法等确定分布情况;众数法就是测算数组中重复出现次数最多的数字(众数)的方法,中位数法就是测算数组中中间取值(中位数)的方法,均值法就是测算数组中数字均值的方法。
8)帧间差分法(帧差法),将相邻视频帧对应像素的灰度值相减,在环境亮度变化不大的情况下,如果对应像素灰度相差很小(未超出阈值),可以认为像素代表的对象是静止的;如果图像区域某处的灰度变化很大(超 出阈值),可以认为这是由于图像中对象运动引起的,将这些静止区域和运动的像素区域标记下来,利用这些标记的像素区域,就可以得到运动对象以及静止对象在视频帧中的位置。
示例性地,帧差法是在连续的图像序列中2个或3个相邻帧间采用基于像素的时间差分并且阈值化来提取视频帧中运动对象区域。该运动对象实时跟踪***是采用三帧差分来进行运动目标检测,这种方法不仅能提高运动对象检测的速度,而且提高所检测视频帧的完整性。
9)尺度不变特征转换(SIFT,Scale-Invariant Feature Transform)特征匹配算法,用来侦测视频帧中的局部性特征,也就是对象上的一些局部外观的特征点的特征,这些特征点的特征与对象成像的大小和旋转无关。
在空间尺度中寻找特征点,并提取出特征点的特征描述:位置、尺度和旋转不变量。基于不同视频帧得到的特征点的特征的描述,对特征点进行匹配,可以得到视频帧中是否包括相同的特征点。
尺度不变特征转换特征匹配算法所查找到的特征点是一些十分突出,不会因光照,仿射变换和噪音等因素而变化的点,如角点、边缘点、暗区的亮点及亮区的暗点等。
10)对比度极值区域(CER,Contrasting Extremal Region),视频帧中跟周围的背景有一定对比度(超出对比度阈值)的区域,从而至少能够被人眼感知。
11)颜色增强CER(color-enhanced CER),是采用基于感知的光照不变(PII,Perception-based Illumination Invariant)颜色空间中的颜色信息去增强CER,利用颜色信息滤除CER中的冗余像素或者噪声,从而得到Color-enhanced CER,该颜色空间具有视觉感知一致性,而且对光照不敏感,更接近人眼对颜色的判断。PII颜色空间的颜色模型,包括:色调H,饱和度S和明度V。
本发明实施例提供字幕提取方法、应用字幕提取方法的字幕提取装置以及存储介质(存储介质中存储有用于执行字幕提取方法的可执行指令)。
本发明实施例提供的字幕提取装置可以以各种形式来实施,示例性地,可以实施为智能手机、平板电脑、车载终端等移动终端,也可以实施为台式机电脑、智能电视、机顶盒等形式的固定终端,或者是类似的运算装置,又或者是网络侧的服务器。
图2示例性示出字幕提取装置10一个可选的硬件结构示意图,图2示出的硬件结构仅为示例,并不构成对设备结构的限定。例如,可以根据实施需要设置较图2更多的组件,或者根据实施需要省略设置部分组件。
在图2中,字幕提取装置10的一个可选的硬件结构包括:至少一个处理器11、存储器12、至少一个网络接口14和用户接口13。字幕提取装置10中的各个组件通过总线***15耦合在一起。可理解,总线***15用于实现这些组件之间的连接通信。总线***15除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线***15。
其中,用户接口13可以包括显示器、键盘、鼠标、轨迹球、点击轮、按键、按钮、触感板或者触摸屏等。
网络接口14向处理器11提供外部数据如异地设置的存储器12的访问能力,示例性地,网络接口14可以基于近场通信(NFC,Near Field Communication)技术、蓝牙(Bluetooth)技术、紫蜂(ZigBee)技术进行的近距离通信,另外,还可以实现如基于码分多址(CDMA,Code Division Multiple Access)、宽带码分多址(WCDMA,Wideband Code Division Multiple Access)等通信制式及其演进制式的通信。
可以理解,存储器12可以是易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储 器(ROM,Read Only Memory)、可编程只读存储器(PROM,Programmable Read-Only Memory)、可擦除可编程只读存储器(EPROM,Erasable Programmable Read-Only Memory)、电可擦除可编程只读存储器(EEPROM,Electrically Erasable Programmable Read-Only Memory)、磁性随机存取存储器(FRAM,ferromagnetic random access memory)、快闪存储器(Flash Memory)、磁表面存储器、光盘、或只读光盘(CD-ROM,Compact Disc Read-Only Memory);磁表面存储器可以是磁盘存储器或磁带存储器。易失性存储器可以是随机存取存储器(RAM,Random Access Memory),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(SRAM,Static Random Access Memory)、同步静态随机存取存储器(SSRAM,Synchronous Static Random Access Memory)、动态随机存取存储器(DRAM,Dynamic Random Access Memory)、同步动态随机存取存储器(SDRAM,Synchronous Dynamic Random Access Memory)、双倍数据速率同步动态随机存取存储器(DDRSDRAM,Double Data Rate Synchronous Dynamic Random Access Memory)、增强型同步动态随机存取存储器(ESDRAM,Enhanced Synchronous Dynamic Random Access Memory)、同步连接动态随机存取存储器(SLDRAM,SyncLink Dynamic Random Access Memory)、直接内存总线随机存取存储器(DRRAM,Direct Rambus Random Access Memory)。本发明实施例描述的存储器12旨在包括但不限于这些和任意其它适合类型的存储器。
本发明实施例中的存储器12用于存储各种类型的数据以支持字幕提取装置10的操作。这些数据的示例包括:用于在字幕提取装置10上操作的任何计算机程序,如操作***121和应用程序122;联系人数据;电话簿数据;消息;图片;视频等。其中,操作***121包含各种***程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件 的任务。应用程序122可以包含各种应用程序,例如媒体播放器(Media Player)、浏览器(Browser)等,用于实现各种应用业务。实现本发明实施例方法的程序可以包含在应用程序122中。
上述本发明实施例揭示的方法可以应用于处理器11中,或者由处理器11实现。处理器11可能是一种集成电路芯片,具有信号的处理能力。在实现过程中,上述方法的各步骤可以通过处理器11中的硬件的集成逻辑电路或者软件形式的指令完成。上述的处理器11可以是通用处理器、数字信号处理器(DSP,Digital Signal Processor),或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。处理器11可以实现或者执行本发明实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者任何常规的处理器等。结合本发明实施例所公开的方法的步骤,可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于存储介质中,该存储介质位于存储器12,处理器11读取存储器12中的信息,结合其硬件完成前述方法的步骤。
在示例性实施例中,字幕提取装置10可以被一个或至少两个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、DSP、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)、通用处理器、控制器、微控制器(MCU,Micro Controller Unit)、微处理器(Microprocessor)、或其他电子元件实现,用于执行前述方法。
结合图3示出的字幕提取的一个可选的场景示意图,终端30登录视频应用请求,以在线视频应用服务器40上运行在线视频播放应用,根据用户在在线视频播放应用中选中的视频的标识如序列号(ID)向视频服务器50 请求视频的数据,视频服务器50从视频数据库获取视频,从视频中提取字幕区域并识别为文本形式,由在线视频应用服务器40连同视频下发至终端30。
作为在向终端下发视频时从视频中提取字幕的替代方案,视频服务器50可以预先从视频数据库的视频中识别文本形式的字幕,并连同视频在视频数据库中存储,这样在并发响应至少两个终端获取视频(例如,终端30请求不同视频的字幕,或者请求相同视频的字幕)时,可以实时下发文本形式的字幕以避免延迟。
当然,视频服务器50也可以只将终端30所请求的视频的数据下发至相应终端进行播放,在终端30需要视频中的文本形式的字幕时,才下发相应的文本形式的字幕至终端30。
示例性地,终端30、在线视频应用服务器40与视频服务器50之间以实时流式传输(Realtime Streaming)或顺序流式传输(Progressive Streaming)的方式传输流媒体的视频数据。一般说来,如视频为需要实时播放的,则使用流式传输的媒体服务器,或应用如实时流传输协议(RTSP,Real Time Streaming Protocol)传输。如使用超文本传输协议(HTTP,HyperText Transfer Protocol)服务器,流媒体的视频数据即通过顺序流发送。采用何种传输方式依赖于视频播放的实时性的需求。当然,终端30也可以将视频的全部的数据下载到本地再进行播放。
下面结合图4A示出的可以应用于图3中字幕提取的一个可选的流程示意图进行说明,如图4A所示,包括以下步骤:
步骤101,对视频解码得到视频帧。
步骤102,对视频帧中的像素进行字幕排布方向的连通操作,得到视频帧中的连通域。
在一个实施例中,根据视频的时长采样不同时间点的视频帧,例如, 基于视频的帧速率对应提取不同时间点的视频帧,为了避免遗漏某一视频帧中的字幕,抽取视频时的采样速率大于视频的帧速率。对于所提取的视频帧中的像素进行腐蚀和扩张操作,对于进行腐蚀和扩张操作后的视频帧进行与字幕排布方向的连通操作,通常,字幕在视频中以从左至右的方向排布,因此对视频帧中的像素进行左向和右向的连通操作。使得视频帧中字幕区域的字符能够形成一个连通域。当然,如果预知视频中字幕的排布方向与常规的排布方向不同,可以针对视频的字幕排布方向有针对性地进行连通操作。
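为便于理解上述腐蚀、膨胀与左右连通操作，下面给出一个基于OpenCV的Python示意代码草图；其中的二值化方式、结构元素尺寸等参数均为示例性假设，并非本实施例限定的取值：

```python
import cv2

def horizontal_connected_domains(frame_bgr, kernel_size=3, connect_width=25):
    """对单个视频帧做腐蚀、膨胀及左右（水平）连通操作，返回候选连通域的外接矩形。"""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    # 二值化以突出字幕笔画（此处用OTSU自适应阈值，仅为示例）
    _, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (kernel_size, kernel_size))
    eroded = cv2.erode(binary, kernel)    # 腐蚀：收缩边界、去除细小噪点
    dilated = cv2.dilate(eroded, kernel)  # 膨胀：恢复字符主体
    # 水平方向的长结构元素，使同一行字幕的相邻字符左右连通成一个连通域
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (connect_width, 1))
    connected = cv2.morphologyEx(dilated, cv2.MORPH_CLOSE, h_kernel)
    num, _, stats, _ = cv2.connectedComponentsWithStats(connected, connectivity=8)
    # stats每行为(x, y, w, h, area)，下标0为背景，跳过
    return [tuple(stats[i, :4]) for i in range(1, num)]
```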
步骤103,基于视频帧中的连通域确定包括相同字幕的视频帧。
在一个实施例中，对所提取的相邻的视频帧中连通域（连通域与文本行形式的字幕对应）的像素作差，例如在RGB空间的不同通道的差值，或在PII空间的不同通道的差值；当所述差值高于差值阈值时，说明相邻视频帧中连通域的像素的差异过大，而相邻视频帧中连通区域的字幕如果相同则像素的差值必然很小（低于差值阈值），因此，当所述差值低于差值阈值时，判定所提取的相邻的视频帧包括相同的字幕，当所述差值高于差值阈值时，判定所提取的相邻的视频帧包括不同的字幕。
在一个实施例中,对于所提取的在时间上相邻的视频帧中的连通域,基于尺度不变特征转换特征匹配算法从相应连通域中提取特征点,由于所提取的特征点具有不因位置、尺度和旋转而改变的特点,因此,如果相邻视频帧中的字幕相同,则从相邻视频帧中连通域提取的特征点必然是匹配的,相应地,通过相邻视频帧中连通域的特征点是否匹配,可以判断相邻视频帧是否包括相同的字幕。
另外,上述的两种判断相邻视频帧是否包括相同字幕的方式可以结合使用,从而进一步提升识别包括不同字幕的视频帧的精度。例如,对所提取的相邻的视频帧中连通域的像素作差,当所述差值低于差值阈值,且, 基于尺度不变特征转换特征匹配算法从相应连通域中提取特征点,当所述相邻的视频帧中连通域中提取的特征点匹配时,判定所提取的相邻的视频帧包括相同的字幕;否则,判定所提取的相邻的视频帧包括不同的字幕。
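下面给出一个将帧差法与尺度不变特征转换特征匹配结合起来、判断相邻视频帧中对应连通域是否包含相同字幕的Python示意代码草图；其中的差值阈值、匹配比例等参数均为示例性假设：

```python
import cv2

def same_subtitle(region1, region2, diff_thresh=12.0, match_ratio=0.5):
    """region1、region2：相邻视频帧中对应连通域（文本行）的灰度图，尺寸相同。
    同时满足帧差与SIFT特征匹配两个条件时，判定为相同字幕。"""
    # 1）帧差法：对应像素作差，平均差值需低于差值阈值
    if cv2.absdiff(region1, region2).mean() >= diff_thresh:
        return False
    # 2）SIFT特征匹配：特征点不因位置、尺度和旋转而改变
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(region1, None)
    kp2, des2 = sift.detectAndCompute(region2, None)
    if des1 is None or des2 is None:
        return False
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
    # Lowe比率测试筛选可靠匹配
    good = [m for m in matches if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]
    # 可靠匹配的比例足够高时，认为两个连通域的特征点匹配
    return len(good) >= match_ratio * min(len(kp1), len(kp2))
```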
步骤104,基于包括相同字幕的视频帧中连通域的分布位置,确定包括相同字幕的视频帧中的字幕区域。
在一个实施例中,对于包括相同字幕的视频帧中连通域的分布位置(这里的分布位置是指连通域的边缘区域的分布位置),确定相应分布位置的分布集中趋势特征,例如,基于众数法、中位数法或均值法确定分布位置的分布趋势特征。
以众数法为例,基于包括相同字幕的各视频帧中的连通域的分布位置的出现次数,确定出现次数最多的分布位置形成的区域(也就是该分布位置为边缘位置的区域)为字幕区域。以中位数法为例,基于包括相同字幕的各视频帧中的连通域的分布位置,确定中间取值的分布位置形成的区域(也就是该分布位置为边缘位置的区域)为字幕区域。再以均值法为例,基于包括相同字幕的各视频帧中的连通域的分布位置,确定分布位置的均值形成的区域(也就是分布位置的均值为边缘位置的区域)为字幕区域。
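以众数法为例，下面的Python示意代码根据包括相同字幕的各视频帧中连通域上下边缘的分布位置的出现次数确定字幕区域的边界，仅作示意：

```python
from collections import Counter

def locate_subtitle_region(edge_positions):
    """edge_positions：包括相同字幕的各视频帧中连通域边缘区域的分布位置，
    形如[(y_top, y_bottom), ...]；返回出现次数最多的上、下边界。"""
    top_mode = Counter(top for top, _ in edge_positions).most_common(1)[0][0]
    bottom_mode = Counter(bottom for _, bottom in edge_positions).most_common(1)[0][0]
    return top_mode, bottom_mode

# 用法示例：多帧定位结果 -> 最终字幕区域的上下边界
# top, bottom = locate_subtitle_region([(652, 700), (652, 701), (652, 700)])
```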
步骤105,针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域。
在一个实施例中,从至少两个通道如灰度图、PII的色调通道;PII的饱和度通道对视频帧的字幕区域对应构造由嵌套的节点形成的组件树,组件树的节点与字幕区域的字符对应;节点与邻接背景的对比度采用节点相对于邻接节点的面积变化率来表征,由于极值区域与邻接背景至少有能被人眼感知到的对比度,因此,当节点相对于邻接节点的面积变化率小于面积变化率阈值时,则确定节点属于相应通道的对比度极值区域。
步骤106,对至少两个通道的对比度极值区域进行颜色增强处理。
在一个实施例中,对于每个通道的对比度极值区域,确定对比度极值区域的主要颜色,从对比度极值区域中提取出跟主要颜色相似程度满足预设条件的像素,基于所提取的像素组成相应通道的颜色增强对比度极值区域。
例如,对于任一通道的字幕区域,将字幕区域中的像素按照灰度值的大小从大到小排序,取灰度值排在前预定比例的像素集合,若集合中像素与集合的主要颜色的颜色距离小于颜色距离阈值(颜色距离阈值是人眼所能感知到颜色的区别时的最小颜色距离),则基于该像素形成颜色增强对比度极值区域。
步骤107,融合至少两个通道的颜色增强对比度极值区域,形成滤除冗余像素和噪声的颜色增强对比度极值区域。
通过颜色增强处理并进行融合,能够实现对字幕区域的噪点去除,并分离字幕区域中的字符与背景的效果。
如前所述，示例性地，从以下的至少两个通道对颜色增强对比度极值区域进行融合：灰度图；基于感知的光照不变PII的色调通道；PII的饱和度通道。由于字幕区域的图像形成的颜色增强对比度极值区域已经滤除噪点和背景，因此能够有效消除字幕区域图像中光照、衣物等强干扰背景，以便更好地分离背景与字幕，提升字符识别的效率和精度。
步骤108,对颜色增强对比度极值区域进行字符识别。
由于颜色增强对比度极值区域的图像已经滤除噪点和背景,因此对图像进行字符识别的难度将显著降低,可以使用相关的字符识别技术对字幕区域进行字符识别。
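作为示意，得到滤除噪点和背景的颜色增强对比度极值区域图像之后，可以交由通用的字符识别引擎处理；下面是一个假设使用pytesseract（Tesseract）的Python草图，引擎与语言参数均为示例性选择：

```python
import pytesseract
from PIL import Image

def recognize_subtitle(clean_region_path):
    """clean_region_path：融合颜色增强对比度极值区域后得到的字幕图像路径。"""
    image = Image.open(clean_region_path)
    # lang取决于字幕语言，此处以简体中文为例
    return pytesseract.image_to_string(image, lang='chi_sim').strip()
```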
下面结合图4B示出的可以应用于终端和服务器一个可选的流程示意图进行说明,如图4B所示,包括以下步骤:
步骤201,终端运行在线视频播放应用;
步骤202,终端根据用户在在线视频播放应用中选中的视频标识如序列号(ID)向服务器请求视频数据;
步骤203,服务器基于所述视频标识如序列号(ID)从视频数据库中获取视频;
步骤204,服务器对视频解码得到视频帧,对视频帧中的像素进行左向和右向的连通操作,得到视频帧中的连通域;
步骤205,服务器对所提取的相邻的视频帧中连通域(连通域与文本行形式的字幕对应)的像素作差;当所述差值小于差值阈值时,判定所提取的相邻的视频帧包括相同的字幕;
步骤206,服务器基于包括相同字幕的各视频帧中的连通域的分布位置的出现次数,确定出现次数最多的分布位置形成的区域为字幕区域;
步骤207,服务器针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
步骤208,服务器对于每个通道的对比度极值区域,确定对比度极值区域的主要颜色,从对比度极值区域中提取出跟主要颜色相似程度满足预设条件的像素,基于所提取的像素组成相应通道的颜色增强对比度极值区域;
步骤209,服务器融合至少两个通道的颜色增强对比度极值区域,形成滤除冗余像素和噪声的颜色增强对比度极值区域,对颜色增强对比度极值区域进行字符识别,提取出字符;
步骤210,服务器将所述字符下发至终端。
需要说明的是,本示例中步骤204~206为视频中字幕定位的过程,步骤207~209为将定位到的字幕从复杂的背景中提取出来的过程。
本发明实施例所述字幕提取方法，服务器从视频帧中提取对应字幕的连通域，从而对于任意形式的字幕都能够提取字幕区域（与连通域对应的图像），不受视频使用何种形式的字幕的影响；同时，对从字幕区域提取的对比度极值区域进行颜色增强处理并进行融合，有效消除字幕区域图像中光照、衣物等强干扰背景，以便更好地分离背景与字幕，有利于提升后续文字识别的效率和精度。
再结合字幕提取的一个示例进行说明。
从复杂背景的视频文件中定位字幕位置,以及提取出干净的字幕图像。主要包括两个部分的处理:首先进行是视频中字幕定位,其次是将定位到的字幕从复杂的背景中提取出来。
视频字幕定位:根据视频时长提取不同时间点出的视频帧,对这些视频帧做形态学上的腐蚀、膨胀操作,同时结合左右连通域操作得到该视频帧的字幕区域,对不同时刻的视频帧执行上述的定位的操作,得到视频中一系列的字幕区域的位置,通过众数法,获取准确的视频中字幕区域的准确位置信息。
视频字幕提取:在视频字幕定位完成的基础上,需要将字幕区域的文字与背景信息进行分离,通过帧差法以及SIFT特征匹配算法来区分时域上的字幕信息是否为同一字幕,若为同一字幕,则对同一字幕的字幕区域的图像进行叠加,求均值,以此来消除部分光照、衣服等复杂背景的干扰,另外对均值字幕区域进行颜色过滤,通过融合多通道上字幕区域的CER(对比度极值区域)来寻找字幕。最后通过color-enhanced CER来得到最后干净的字幕区域的图像。
图3中视频服务器50的一个可选的硬件环境如下:
CPU:Genuine Intel(R)@1.73GHz或以上;
内存:1GB或以上;
硬盘:120GB以上。
图3中视频服务器50的一个可选的软件运行环境如下:
操作***:64bit的tlinux 1.2以上版本
数据库:redis以及mysql
对视频服务器50使用上述硬件环境以及软件环境进行字幕提取的处理过程进行说明。
一、定位字幕区域
对视频进行解码得到视频帧，对其图像进行腐蚀、膨胀操作，再进行左右连通操作得到每个帧的字幕目标区域。通过对同一视频不同时刻取N帧图像进行字幕区域定位，最后对所有的(x1,y1)、(x2,y2)坐标取众数，得到最终的文本行高度。视频帧的原始图像，以及对视频帧中的字幕区域定位之后的一个可选的示意图如图5A和图5B所示。
在定位视频帧中的字幕区域,也就是字幕的文本行上下边界之后,为消除文字分离过程中强光照、衣物等事物的干扰,使用视频时域上的帧差法以及SIFT特征匹配来区分两个视频帧中的字幕是否为同一字幕。
利用帧差法判断两个视频帧中是否包括相同字幕的一个可选的流程示意图如图6所示,对于从视频中连续提取的视频帧中的字幕区域,记为文本行图像1和文本行图像2,结合采用两种方式判断文本行图像1和文本行图像2是否为相同的字幕:
方式1)通过比较相邻文本行的像素差值,根据水平投影和垂直投影(一般是对二值图像而用的,水平方向的投影就是每行的非零像素值的个数,在这里就是1或者255,垂直投影就是每列图像数据中非零像素值的个数)来判断文本行图像1和文本行图像2是否为相同的字幕。
方式2)提取文本行1和文本行图像2的SIFT特征进行匹配,根据匹配的结果得到相似度,综合帧差法以及SIFT特征的相似度两者的结果,来综合判断是否为同一字幕,如果是相同的字幕则叠加文本行图像1和文本行图像2并求均值形成新的文本行图像。
后续提取的视频帧中的字幕区域与新的文本行图像重复进行上述的判 断处理,以继续叠加相同的文本行图像,直至提取的视频帧中的文本行图像发生变化,针对新的文本行图像继续进行叠加处理。
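下面给出对判断为相同字幕的文本行图像进行叠加并求均值、以削弱光照与衣物等复杂背景干扰的Python示意代码草图，其中的数据组织方式为示例性假设：

```python
import numpy as np

class SubtitleAccumulator:
    """对同一字幕的文本行图像累加求均值；字幕发生变化时重新开始累加。"""
    def __init__(self):
        self.acc = None
        self.count = 0

    def add(self, textline, is_same_subtitle):
        """textline：当前帧的文本行灰度图；is_same_subtitle：与上一帧是否为相同字幕。"""
        if self.acc is None or not is_same_subtitle:
            self.acc = textline.astype(np.float32)   # 新字幕：重置累加器
            self.count = 1
        else:
            self.acc += textline.astype(np.float32)
            self.count += 1
        # 返回当前均值图像，光照、衣物等随帧变化的背景被逐帧平均削弱
        return (self.acc / self.count).astype(np.uint8)
```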
二、字幕提取
针对每个字幕形成的文本行图像(字幕区域),构造组件树,提取CER区域。
组件树构造流程图
参见图7示出的组件树的一个可选的结构示意图，(N,i),(N,i+1),…是一串对应汉字“官”（用S表示）的节点/极值区域，且在组件树上从下往上依次嵌套。令S(N,i),S(N,i+1),…分别表示(N,i),(N,i+1),…的面积，则节点(N,i)与其祖先节点(N,i+Δ)的面积变化率为：

$R_{\Delta S}(n_i, n_{i+\Delta}) = \dfrac{S(n_{i+\Delta}) - S(n_i)}{S(n_i)}$

面积变化率$R_{\Delta S}(n_i, n_{i+\Delta})$可以用来度量节点(N,i)与其邻接背景的对比度。假设字幕区域的图像中某极值区域对应的二值化阈值为level，当减小阈值的时候，该极值区域会往外扩张或者与其他极值区域合并，面积会增大，而$R_{\Delta S}(n_i, n_{i+\Delta})$用于描述面积增长速率。极值区域与其邻接背景对比度高，则其往外扩张的面积就会越小，面积增长速率也会越慢。所以$R_{\Delta S}(n_i, n_{i+\Delta})$反比于节点n与其邻接背景的对比度，$R_{\Delta S}(n_i, n_{i+\Delta})$越大，对比度越低。基于面积变化率，对比度极值区域CER可以定义如下。

如果$R_{\Delta S}(n_i, n_{i+\Delta}) \le \theta$（$\theta$为常数），则节点$n_i$就是一个CER。

CER的定义虽然非常简单，但是却有着非常清晰的物理含义：它是一类特殊的极值区域，这些极值区域与它们的邻接背景至少有能被人眼感知到的对比度。CER提取条件的严格与否取决于参数Δ和$\theta$。例如，如果固定参数Δ，$\theta$越大，则对CER的对比度要求越低，即可以提取出对比度更低的极值区域，所以提取出来的CER的数量就会越多。在实际的自然场景图像中，确实会遇到一些文字区域对比度很低的情况，为了能处理这些情况，Δ和$\theta$需要设置得较为保守，即对CER的最低对比度要求很低。本发明实施例中上述的参数设定是基于人眼的视觉感知，原则是要求CER的最低对比度能被人眼感知到，通过实验，Δ和$\theta$分别被设置为3和0.5。通常情况下，从组件树上提取出来的CER的数量会远低于原始组件树上节点的个数，例如对一张百万像素数量级的图像，提取出来的CER通常只有几百到几千个。
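按照上述定义，下面给出一个在单个灰度通道上、用阈值递减方式近似提取CER的Python示意代码草图；真实实现通常基于完整的组件树结构，此处仅以简化的阈值序列说明面积变化率准则，Δ与θ按上文分别取3和0.5，其余参数为示例性假设：

```python
import cv2
import numpy as np

def extract_cer(gray, delta=3, theta=0.5):
    """在单个灰度通道上近似提取对比度极值区域（CER）：
    对阈值为t时的每个极值区域，找到阈值降为t-delta时包含它的父区域，
    若面积变化率 (S(t-delta) - S(t)) / S(t) <= theta，则该区域是一个CER。"""
    cers = []
    for t in range(250, delta, -5):                    # 阈值步长取5仅为加速示例
        _, bin_t = cv2.threshold(gray, t, 255, cv2.THRESH_BINARY)
        _, bin_td = cv2.threshold(gray, t - delta, 255, cv2.THRESH_BINARY)
        n_t, lab_t, stats_t, _ = cv2.connectedComponentsWithStats(bin_t)
        _, lab_td, stats_td, _ = cv2.connectedComponentsWithStats(bin_td)
        for i in range(1, n_t):
            area_t = stats_t[i, cv2.CC_STAT_AREA]
            if area_t < 10:                            # 过滤过小区域（示例性阈值）
                continue
            ys, xs = np.nonzero(lab_t == i)            # 该极值区域内的任一像素
            parent = lab_td[ys[0], xs[0]]              # 阈值更低时包含它的父区域
            area_td = stats_td[parent, cv2.CC_STAT_AREA]
            if (area_td - area_t) / float(area_t) <= theta:
                cers.append(tuple(stats_t[i, :4]))     # 记录CER外接矩形(x, y, w, h)
    return cers
```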
一般视频字幕噪点较多,背景和文字融合的情况常常出现,因此还需要针对字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的CER,融合至少两个通道的CER,对融合的至少两个通道的CER进颜色增强处理,尽可能滤除CER中的冗余像素或者噪声。增强算法主要包含两个步骤:1)估计CER的主要颜色;2)从CER中提取出颜色跟主要颜色相近的像素组成Color-enhanced CER。
一般来说，噪声像素位于CER的边缘位置，所以其灰度值较小。所以，为了估计某个CER(记为CER_c)的主要颜色，可以将CER_c中包含的像素按照其灰度值的大小从大到小排序，令S_med表示CER_c中灰度值排在前50%的像素集合，N_pi为集合S_med中像素的个数，F_pi为像素p_i的颜色，F_dc为CER_c的主要颜色，则F_dc可以计算为：

$F_{dc} = \dfrac{1}{N_{pi}} \sum_{p_i \in S_{med}} F_{p_i}$

像素p_i与主要颜色F_dc相近可定义为：如果颜色距离$d(F_{p_i}, F_{dc}) < T_{dc}$（T_dc为常数），则称F_pi与F_dc颜色相近。此处T_dc描述F_pi和F_dc的最低相似程度，其取值基于人眼感知和实验确定。
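下面的Python示意代码按上述方式估计CER的主要颜色并滤除与主要颜色距离较大的冗余像素，得到Color-enhanced CER；其中颜色距离阈值T_dc的数值为示例性假设：

```python
import numpy as np

def color_enhance_cer(cer_colors, cer_grays, t_dc=30.0):
    """cer_colors：(N, 3)数组，CER内各像素的颜色（如PII空间中的向量）；
    cer_grays：(N,)数组，对应像素的灰度值；返回组成Color-enhanced CER的像素下标。"""
    # 1）按灰度值从大到小排序，取前50%的像素集合S_med估计主要颜色F_dc
    order = np.argsort(-cer_grays)
    s_med = order[: max(1, len(order) // 2)]
    f_dc = cer_colors[s_med].mean(axis=0)
    # 2）颜色距离小于阈值T_dc的像素保留，其余作为噪声或冗余像素滤除
    dist = np.linalg.norm(cer_colors - f_dc, axis=1)
    return np.nonzero(dist < t_dc)[0]
```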
为了使得图像中更多的文字满足极值区域的定义,在多通道(灰度图, PII的H通道,PII的S通道)上面提取CER,最后融合提取的CER区域,最终达到分离文字和复杂背景的目的。
对从至少两个通道提取的CER进行融合时涉及到颜色空间变化,下面结合图8示出的颜色空间变化的一个可选的示意图进行说明。
下面是RGB颜色空间到PII颜色空间的变换。令向量(R_rgb, G_rgb, B_rgb)表示RGB颜色空间中某个颜色，(R_rgb, G_rgb, B_rgb)的取值范围为0到1，如果(R_rgb, G_rgb, B_rgb)的取值范围不在0到1，则应先线性规整到0到1的区间。接着对(R_rgb, G_rgb, B_rgb)作如下变换：首先对每个分量C（此处C代表R_rgb、G_rgb、B_rgb中的任一分量）作逐分量的非线性（伽马）校正，得到线性化的分量(R', G', B')，即公式(3)；最后对(R', G', B')作如下线性变换（公式(4)）：

$(X, Y, Z)^{T} = M \cdot (R', G', B')^{T}$

其中M为从RGB颜色空间到CIE XYZ颜色空间的3×3转换矩阵，从而得到(R_rgb, G_rgb, B_rgb)在CIE XYZ颜色空间中的值(X, Y, Z)。

接下来令$\mathbf{f} = (X, Y, Z)^{T}$表示CIE XYZ空间中的三刺激值，再令$\varphi(\mathbf{f})$表示从CIE XYZ空间到PII颜色空间的变换方程，则$\varphi$的推导过程可以概括如下：当颜色被投影到某些特定的基向量上的时候，对颜色添加光照的效果等同于对每个颜色通道乘以一个标量系数。此处，用矩阵B表示对特定基的线性变换，光照对颜色的影响可以被写为如下形式：

$B\tilde{\mathbf{f}} = D(B\mathbf{f})$

此处D为仅仅与光照相关的对角阵。任意两个颜色$\mathbf{f}_1$、$\mathbf{f}_2$在PII空间中的视觉距离应该定义为

$d(\mathbf{f}_1, \mathbf{f}_2) = \lVert \varphi(\mathbf{f}_1) - \varphi(\mathbf{f}_2) \rVert$

此处符号||·||表示欧氏距离。要求该视觉距离不随光照（即不随对角阵D）变化，经过推导，可以证明$\varphi$必须有如下形式（公式(6)）：

$\varphi(\mathbf{f}) = A\ln(B\mathbf{f})$

其中ln按分量取对数，矩阵A与矩阵B均为训练得到。
综上,给定RGB颜色空间中任意的颜色向量,(Rrgb,Grgb,Brgb),先通过公式(3)和(4)将其变换到CIE XYZ颜色空间,再通过公式(6)将其变换到PII颜色空间即可。
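按照上述流程，可以写出如下Python示意代码，将RGB颜色变换到PII颜色空间；其中逐分量的非线性校正采用常见的sRGB规范、RGB到CIE XYZ的矩阵取sRGB/D65标准矩阵作为示例，矩阵A与矩阵B仅以单位阵占位，实际应替换为训练得到的矩阵：

```python
import numpy as np

# sRGB(D65)到CIE XYZ的标准转换矩阵（示例性假设，专利原文采用的矩阵可能不同）
M_XYZ = np.array([[0.4124, 0.3576, 0.1805],
                  [0.2126, 0.7152, 0.0722],
                  [0.0193, 0.1192, 0.9505]])

# 矩阵A、B应由训练得到，此处仅以单位阵占位示意
A_PII = np.eye(3)
B_PII = np.eye(3)

def rgb_to_pii(rgb):
    """rgb：取值范围[0,1]的(R, G, B)向量，返回PII空间中的颜色向量。"""
    rgb = np.asarray(rgb, dtype=np.float64)
    # 公式(3)：逐分量非线性（伽马）校正，此处以sRGB规范为例
    linear = np.where(rgb <= 0.04045, rgb / 12.92, ((rgb + 0.055) / 1.055) ** 2.4)
    # 公式(4)：线性变换到CIE XYZ
    f = M_XYZ @ linear
    # 公式(6)：phi(f) = A * ln(B * f)
    return A_PII @ np.log(B_PII @ f + 1e-12)   # 加入小常数避免log(0)
```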
可以看出基于上述的字幕提取方案可对视频中任意形式的字幕进行提取从而进行字符识别,示例性地结合应用场景进行说明:
参见图9A示出的字幕提取的应用场景1)
例如,视频字幕提取完以后,进而进行字符识别,基于对文本形式的字幕进行分析以确定视频的类型、属性等,分析出用户的偏好,随着用户观看视频的数量的累积,可以建立用户的偏好数据库,根据用户的偏好向 用户推荐新上线的视频。
再例如,根据视频的文本形式的字幕建立视频的内容索引数据库,根据用户输入的关键字搜索内容与关键字匹配的视频,克服相关技术仅能够基于视频的类别以及名称进行搜索的缺陷。
参见图9B示出的字幕提取的应用场景2)
视频的边看边分享功能,用户在观看视频时通过一键识别功能,对当前视频播放界面的字幕进行提取并识别为文本形式,并自动填充到分享感言的对话框中,提升分享操作的流畅度和自动化程度。
对前述字幕提取装置的一个可选的逻辑功能结构进行说明，参见图10示出的字幕提取装置的一个可选的逻辑功能结构示意图，需要指出的是，图10示出的字幕提取装置的逻辑功能结构仅仅是示例性的，基于图10示出的逻辑功能结构，本领域技术人员可以对其中的单元进行进一步合并或者拆分，从而对字幕提取装置的逻辑功能结构进行各种变形。
在图10中,字幕提取装置包括:
解码单元110,配置为对视频解码得到视频帧;
连通单元120,配置为对视频帧中的像素进行字幕排布方向的连通操作,得到视频帧中的连通域;
定位单元130,配置为基于视频帧中的连通域确定包括相同字幕的视频帧,并基于包括相同字幕的视频帧中连通域的分布位置,确定包括相同字幕的视频帧中的字幕区域;
提取单元140,配置为针对字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
增强单元150,配置为对融合的至少两个通道的对比度极值区域进行颜色增强处理,形成滤除冗余像素和噪声的颜色增强对比度极值区域;
融合单元160,配置为通过融合至少两个通道的对比度极值区域提取出 字幕。
在一个实施例中,连通单元120,还配置为根据视频的时长提取不同时间点的视频帧,对所提取的视频帧进行腐蚀和扩张操作;对进行腐蚀和扩张操作后的视频帧进行左向和右向的连通操作。
例如,连通单元120根据视频的时长采样不同时间点的视频帧,例如,基于视频的帧速率对应提取不同时间点的视频帧,为了避免遗漏某一视频帧中的字幕,抽取视频时的采样速率大于视频的帧速率。对于所提取的视频帧中的像素进行腐蚀和扩张操作,对于进行腐蚀和扩张操作后的视频帧进行与字幕排布方向的连通操作,通常,字幕在视频中以从左至右的方向排布,因此对视频帧中的像素进行左向和右向的连通操作。使得视频帧中字幕区域的字符能够形成一个连通域。当然,如果预知视频中字幕的排布方向与常规的排布方向不同,可以针对视频的字幕排布方向有针对性地进行连通操作。
在一个实施例中,定位单元130,还配置为对所提取的相邻的视频帧中连通域的像素作差,当所述差值低于差值阈值时,判定所提取的相邻的视频帧包括相同的字幕,当所述差值高于差值阈值时,判定所提取的相邻的视频帧包括不同的字幕。
例如，对所提取的相邻的视频帧中连通域（连通域与文本行形式的字幕对应）的像素作差，例如在RGB空间的不同通道的差值，或在PII空间的不同通道的差值；当所述差值高于差值阈值时，说明相邻视频帧中连通域的像素的差异过大，而相邻视频帧中连通区域的字幕如果相同则像素的差值必然很小（低于差值阈值），因此，当所述差值低于差值阈值时，判定所提取的相邻的视频帧包括相同的字幕，当所述差值高于差值阈值时，判定所提取的相邻的视频帧包括不同的字幕。
在一个实施例中,定位单元130,还配置为对所提取的相邻的视频帧中连通域提取特征点,当相邻的视频帧中连通域中提取的特征点匹配时,判定所提取的相邻的视频帧包括相同的字幕,当不匹配时,判定所提取的相邻的视频帧包括不同的字幕。
例如，对于所提取的在时间上相邻的视频帧中的连通域，基于尺度不变特征转换特征匹配算法从相应连通域中提取特征点，由于所提取的特征点具有不因位置、尺度和旋转而改变的特点，因此，如果相邻视频帧中的字幕相同，则从相邻视频帧中连通域提取的特征点必然是匹配的，相应地，通过相邻视频帧中连通域的特征点是否匹配，可以判断相邻视频帧是否包括相同的字幕。
另外,上述的两种判断相邻视频帧是否包括相同字幕的方式可以结合使用,从而进一步提升识别包括不同字幕的视频帧的精度。例如,对所提取的相邻的视频帧中连通域的像素作差,当所述差值低于差值阈值时,且,基于尺度不变特征转换特征匹配算法从相应连通域中提取特征点,当相邻的视频帧中连通域中提取的特征点匹配时,判定所提取的相邻的视频帧包括相同的字幕;否则,判定所提取的相邻的视频帧包括不同的字幕。
在一个实施例中,定位单元130,还配置为确定包括相同字幕的各视频帧中连通域的边缘区域的分布位置的出现次数,并确定出现次数最多的分布位置形成的区域为字幕区域。
例如,对于包括相同字幕的视频帧中连通域的分布位置(这里的分布位置是指连通域的边缘区域的分布位置),确定相应分布位置的分布集中趋势特征,例如,基于众数法、中位数法或均值法确定分布位置的分布趋势特征。
以众数法为例,基于包括相同字幕的各视频帧中的连通域的分布位置的出现次数,确定出现次数最多的分布位置形成的区域(也就是该分布位 置为边缘位置的区域)为字幕区域。以中位数法为例,基于包括相同字幕的各视频帧中的连通域的分布位置,确定中间取值的分布位置形成的区域(也就是该分布位置为边缘位置的区域)为字幕区域。再以均值法为例,基于包括相同字幕的各视频帧中的连通域的分布位置,确定分布位置的均值形成的区域(也就是分布位置的均值为边缘位置的区域)为字幕区域。
在一个实施例中，增强单元150，还配置为采用以下方式确定每个通道的对比度极值区域：从以下每个通道对视频帧的字幕区域对应构造由嵌套的节点形成的组件树：灰度图；基于感知的光照不变PII的色调通道；PII的饱和度通道；其中，组件树的节点与字幕区域的字符对应。当节点相对于邻接节点的面积变化率小于面积变化率阈值时，则确定节点属于相应通道的对比度极值区域。
例如，将字幕区域中的像素按照灰度值的大小从大到小排序，取灰度值排在前预定比例的像素集合，若集合中像素与集合的主要颜色的颜色距离小于颜色距离阈值（颜色距离阈值是人眼所能感知到颜色的区别时的最小颜色距离），则基于该像素形成颜色增强对比度极值区域。通过颜色增强处理，能够实现对字幕区域的噪点去除，并分离字幕区域中的字符与背景的效果。
在一个实施例中,对于每个通道的对比度极值区域,增强单元150,还配置为采用以下方式形成相应通道的颜色增强对比度极值区域:确定每个通道的对比度极值区域的主要颜色;从每个通道的对比度极值区域中提取出跟主要颜色相似程度满足预设条件的像素,基于所提取的像素组成相应通道的颜色增强对比度极值区域。
在一个实施例中,参见图10,字幕提取装置10还包括:
识别单元170,配置为对颜色增强对比度极值区域进行字符识别;
响应单元180,配置为对所识别出的文本响应视频搜索、视频推荐、视频标记分类和字幕分享至少之一的操作。
例如,视频字幕提取完以后,进而进行字符识别,基于对文本形式的字幕进行分析以确定视频的类型、属性等,分析出用户的偏好,随着用户观看视频的数量的累积,可以建立用户的偏好数据库,根据用户的偏好向用户推荐新上线的视频。
再例如,根据视频的文本形式的字幕建立视频的内容索引数据库,根据用户输入的关键字搜索内容与关键字匹配的视频,克服相关技术仅能够基于视频的类别以及名称进行搜索的缺陷。
又例如,视频的边看边分享功能,用户在观看视频时通过一键识别功能,对当前视频播放界面的字幕进行提取并识别为文本形式,并自动填充到分享感言的对话框中,提升分享操作的流畅度和自动化程度。
综上,本发明实施例具有以下有益效果:
从视频帧中提取对应字幕的连通域,由于是从加载有字幕的视频帧层面进行包括字幕的潜在区域(连通区域)的提取,因此对于任意形式的字幕都能够提取字幕区域(与连通域对应的图像),不受视频使用何种形式的字幕的影响;
从至少两个通道利用颜色增强的方式对从字幕区域提取的对比度极值区域进行调整,有效滤除字幕区域的图像中的噪点和背景,降低后续从字幕区域识别字符的难度,有利于提升后续字符识别的效率和精度;
通过提取视频字幕,方便后面对字幕进行识别,识别的字幕信息会用于做视频个性化推荐,即是通过分析视频字幕来了解视频属性,根据视频内容属性进行推荐;另外提取出来的视频字幕可以用于基于视频内容的搜索,方便用户寻找自己想要的视频。
本领域普通技术人员可以理解:实现上述方法实施例的全部或部分步 骤可以通过程序指令相关的硬件来完成,前述的程序可以存储于一计算机可读取存储介质中,该程序在执行时,执行包括上述方法实施例的步骤;而前述的存储介质包括:移动存储设备、随机存取存储器(RAM,Random Access Memory)、只读存储器(ROM,Read-Only Memory)、磁碟或者光盘等各种可以存储程序代码的介质。
或者,本发明上述集成的单元如果以软件功能模块的形式实现并作为独立的产品销售或使用时,也可以存储在一个计算机可读取存储介质中。基于这样的理解,本发明实施例的技术方案本质上或者说对相关技术做出贡献的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机、服务器、或者网络设备等)执行本发明各个实施例所述方法的全部或部分。而前述的存储介质包括:移动存储设备、RAM、ROM、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。因此,本发明的保护范围应以所述权利要求的保护范围为准。

Claims (18)

  1. 一种字幕提取方法,包括:
    对视频解码得到视频帧,对所述视频帧中的像素进行字幕排布方向的连通操作,得到所述视频帧中的连通域;
    基于所述视频帧中的连通域确定包括相同字幕的视频帧,并基于所述包括相同字幕的视频帧中连通域的分布位置,确定所述包括相同字幕的视频帧中的字幕区域;
    针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
    对所述至少两个通道的对比度极值区域进行颜色增强处理,形成颜色增强对比度极值区域;
    通过融合至少两个通道的颜色增强对比度极值区域提取出字幕。
  2. 如权利要求1所述的方法,其中,所述对所述视频帧中的像素进行字幕排布方向的连通操作,包括:
    根据所述视频的时长提取不同时间点的视频帧,对所提取的视频帧进行腐蚀和扩张操作;
    对进行腐蚀和扩张操作后的视频帧进行左向和右向的连通操作。
  3. 如权利要求1所述的方法,其中,所述基于所述视频帧中的连通域确定包括相同字幕的视频帧,包括:
    对所提取的相邻的视频帧中连通域的像素作差;
    当所述差值低于差值阈值时,判定所提取的相邻的视频帧包括相同的字幕。
  4. 如权利要求1所述的方法,其中,所述基于所述视频帧中的连通域确定包括相同字幕的视频帧,包括:
    对所提取的相邻的视频帧中连通域提取特征点;
    当所述相邻的视频帧中连通域中提取的特征点匹配时,判定所提取的相邻的视频帧包括相同的字幕。
  5. 如权利要求1所述的方法,其中,所述基于所述包括相同字幕的视频帧中的连通域的分布位置,确定所述包括相同字幕的视频帧中的字幕区域,包括:
    在所述包括相同字幕的各视频帧中,确定连通域的边缘区域的不同分布位置分别出现的次数,并确定出现次数最多的所述分布位置形成的区域为所述字幕区域。
  6. 如权利要求1所述的方法,其中,所述针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域,包括:
    从以下通道对所述视频帧的字幕区域对应构造由嵌套的节点形成的组件树:
    灰度图;基于感知的光照不变PII的色调通道;PII的饱和度通道;其中,所述组件树的节点与所述字幕区域的字符对应;
    当所述节点的面积变化率相对于邻接节点的面积变化率的小于面积变化率阈值时,确定所述节点属于相应通道的对比度极值区域。
  7. 如权利要求1所述的方法,其中,所述对所述至少两个通道的对比度极值区域进行颜色增强处理,形成颜色增强对比度极值区域,包括:
    确定每个通道的对比度极值区域的主要颜色;
    从每个通道的对比度极值区域中提取出跟所述主要颜色相似程度满足预设条件的像素,基于所提取的像素组成相应通道的颜色增强对比度极值区域。
  8. 如权利要求1所述的方法,其中,所述方法还包括:
    对所述融合的颜色增强对比度极值区域进行文本识别;
    对所识别出的文本响应视频搜索、视频推荐、视频标记分类和字幕分享至少之一的操作。
  9. 一种字幕提取装置,包括:
    解码单元,配置为对视频解码得到视频帧;
    连通单元,配置为对所述视频帧中的像素进行字幕排布方向的连通操作,得到所述视频帧中的连通域;
    定位单元,配置为基于所述视频帧中的连通域确定包括相同字幕的视频帧,并基于所述包括相同字幕的视频帧中连通域的分布位置,确定所述包括相同字幕的视频帧中的字幕区域;
    提取单元,配置为针对所述字幕区域的至少两个通道对应构造组件树,利用所构造的组件树提取对应每个通道的对比度极值区域;
    增强单元,配置为对所述融合的至少两个通道的对比度极值区域进行颜色增强处理,形成滤除冗余像素和噪声的颜色增强对比度极值区域;
    融合单元,配置为通过融合至少两个通道的对比度极值区域提取出字幕。
  10. 如权利要求9所述的装置,其中,
    所述连通单元,还配置为根据所述视频的时长提取不同时间点的视频帧,对所提取的视频帧进行腐蚀和扩张操作;对进行腐蚀和扩张操作后的视频帧进行左向和右向的连通操作。
  11. 如权利要求9所述的装置,其中,
    所述定位单元,还配置为对所提取的相邻的视频帧中连通域的像素作差;当所述差值低于差值阈值,判定所提取的相邻的视频帧包括相同的字幕。
  12. 如权利要求9所述的装置,其中,
    所述定位单元,还配置为对所提取的相邻的视频帧中连通域提取特征 点;当所述相邻的视频帧中连通域中提取的特征点匹配时,判定所提取的相邻的视频帧包括相同的字幕。
  13. 如权利要求9所述的装置,其中,
    所述定位单元,还配置为在所述包括相同字幕的各视频帧中,确定连通域的边缘区域的不同分布位置分别出现的次数,并确定出现次数最多的所述分布位置形成的区域为所述字幕区域。
  14. 如权利要求9所述的装置,其中,
    所述增强单元,还配置为从以下通道对所述视频帧的字幕区域对应构造由嵌套的节点形成的组件树:
    灰度图;基于感知的光照不变PII的色调通道;PII的饱和度通道;其中,所述组件树的节点与所述字幕区域的字符对应;
    当所述节点的面积变化率相对于邻接节点的面积变化率的小于面积变化率阈值时,确定所述节点属于相应通道的对比度极值区域。
  15. 如权利要求9所述的装置,其中,
    所述增强单元,还配置为确定每个通道的对比度极值区域的主要颜色;从每个通道的对比度极值区域中提取出跟所述主要颜色相似程度满足预设条件的像素,基于所提取的像素组成相应通道的颜色增强对比度极值区域。
  16. 如权利要求9所述的装置,其中,所述装置还包括:
    识别单元,配置为对所述融合的颜色增强对比度极值区域进行文本识别;
    响应单元,配置为对所识别出的文本响应视频搜索、视频推荐、视频标记分类和字幕分享至少之一的操作。
  17. 一种字幕提取装置,所述装置包括:
    处理器和存储介质;所述存储介质中存储有可执行指令,所述可执行指令用于引起所述处理器执行时,实现权利要求1至8任一项所述的字幕 提取方法。
  18. 一种存储介质,存储有可执行程序,所述可执行程序被处理器执行时,实现权利要求1至8任一项所述的字幕提取方法。
PCT/CN2017/096509 2016-08-08 2017-08-08 字幕提取方法及装置、存储介质 WO2018028583A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/201,386 US11367282B2 (en) 2016-08-08 2018-11-27 Subtitle extraction method and device, storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610643390.3 2016-08-08
CN201610643390.3A CN106254933B (zh) 2016-08-08 2016-08-08 字幕提取方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/201,386 Continuation US11367282B2 (en) 2016-08-08 2018-11-27 Subtitle extraction method and device, storage medium

Publications (1)

Publication Number Publication Date
WO2018028583A1 true WO2018028583A1 (zh) 2018-02-15

Family

ID=58078200

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/096509 WO2018028583A1 (zh) 2016-08-08 2017-08-08 字幕提取方法及装置、存储介质

Country Status (3)

Country Link
US (1) US11367282B2 (zh)
CN (1) CN106254933B (zh)
WO (1) WO2018028583A1 (zh)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619257A (zh) * 2018-06-20 2019-12-27 北京搜狗科技发展有限公司 一种文字区域确定方法和装置
WO2020139923A1 (en) * 2018-12-27 2020-07-02 Facebook, Inc. Systems and methods for automated video classification
CN112270317A (zh) * 2020-10-16 2021-01-26 西安工程大学 一种基于深度学习和帧差法的传统数字水表读数识别方法
US10922548B1 (en) 2018-12-27 2021-02-16 Facebook, Inc. Systems and methods for automated video classification
US10956746B1 (en) 2018-12-27 2021-03-23 Facebook, Inc. Systems and methods for automated video classification
CN112749696A (zh) * 2020-09-01 2021-05-04 腾讯科技(深圳)有限公司 一种文本检测方法及装置
CN113052169A (zh) * 2021-03-15 2021-06-29 北京小米移动软件有限公司 视频字幕识别方法、装置、介质及电子设备
US11138440B1 (en) 2018-12-27 2021-10-05 Facebook, Inc. Systems and methods for automated video classification
CN114092925A (zh) * 2020-08-05 2022-02-25 武汉Tcl集团工业研究院有限公司 一种视频字幕检测方法、装置、终端设备及存储介质
CN114666649A (zh) * 2022-03-31 2022-06-24 北京奇艺世纪科技有限公司 字幕被裁视频的识别方法、装置、电子设备及存储介质
CN115550714A (zh) * 2021-06-30 2022-12-30 花瓣云科技有限公司 字幕显示方法及相关设备

Families Citing this family (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11004139B2 (en) 2014-03-31 2021-05-11 Monticello Enterprises LLC System and method for providing simplified in store purchases and in-app purchases using a use-interface-based payment API
US10511580B2 (en) 2014-03-31 2019-12-17 Monticello Enterprises LLC System and method for providing a social media shopping experience
US11080777B2 (en) 2014-03-31 2021-08-03 Monticello Enterprises LLC System and method for providing a social media shopping experience
US12008629B2 (en) 2014-03-31 2024-06-11 Monticello Enterprises LLC System and method for providing a social media shopping experience
CN106254933B (zh) * 2016-08-08 2020-02-18 腾讯科技(深圳)有限公司 字幕提取方法及装置
CN109309844B (zh) * 2017-07-26 2022-02-22 腾讯科技(深圳)有限公司 视频台词处理方法、视频客户端及服务器
CN107424137B (zh) * 2017-08-01 2020-06-19 深信服科技股份有限公司 一种文本增强方法及装置、计算机装置、可读存储介质
CN107454479A (zh) * 2017-08-22 2017-12-08 无锡天脉聚源传媒科技有限公司 一种多媒体数据的处理方法及装置
CN107862315B (zh) * 2017-11-02 2019-09-17 腾讯科技(深圳)有限公司 字幕提取方法、视频搜索方法、字幕分享方法及装置
US11120300B1 (en) 2018-06-07 2021-09-14 Objectvideo Labs, Llc Event detector training
CN110620946B (zh) * 2018-06-20 2022-03-18 阿里巴巴(中国)有限公司 字幕显示方法及装置
CN110942420B (zh) * 2018-09-21 2023-09-15 阿里巴巴(中国)有限公司 一种图像字幕的消除方法及装置
CN109214999B (zh) * 2018-09-21 2021-01-22 阿里巴巴(中国)有限公司 一种视频字幕的消除方法及装置
CN109391842B (zh) * 2018-11-16 2021-01-26 维沃移动通信有限公司 一种配音方法、移动终端
CN110147467A (zh) 2019-04-11 2019-08-20 北京达佳互联信息技术有限公司 一种文本描述的生成方法、装置、移动终端及存储介质
CN111080554B (zh) * 2019-12-20 2023-08-04 成都极米科技股份有限公司 一种投影内容中字幕区域增强方法、装置及可读存储介质
CN111193965B (zh) * 2020-01-15 2022-09-06 北京奇艺世纪科技有限公司 一种视频播放方法、视频处理方法及装置
CN111294646B (zh) * 2020-02-17 2022-08-30 腾讯科技(深圳)有限公司 一种视频处理方法、装置、设备及存储介质
CN112749297B (zh) * 2020-03-03 2023-07-21 腾讯科技(深圳)有限公司 视频推荐方法、装置、计算机设备和计算机可读存储介质
CN111508003B (zh) * 2020-04-20 2020-12-11 北京理工大学 一种红外小目标检测跟踪及识别方法
CN111539427B (zh) * 2020-04-29 2023-07-21 深圳市优优品牌传播有限公司 一种视频字幕的提取方法及***
US11335108B2 (en) * 2020-08-10 2022-05-17 Marlabs Incorporated System and method to recognise characters from an image
CN112381091A (zh) * 2020-11-23 2021-02-19 北京达佳互联信息技术有限公司 视频内容识别方法、装置、电子设备及存储介质
CN112925905B (zh) * 2021-01-28 2024-02-27 北京达佳互联信息技术有限公司 提取视频字幕的方法、装置、电子设备和存储介质
CN112954455B (zh) * 2021-02-22 2023-01-20 北京奇艺世纪科技有限公司 一种字幕跟踪方法、装置及电子设备
CN113869310A (zh) * 2021-09-27 2021-12-31 北京达佳互联信息技术有限公司 对话框检测方法、装置、电子设备及存储介质
CN113709563B (zh) * 2021-10-27 2022-03-08 北京金山云网络技术有限公司 视频封面选取方法、装置、存储介质以及电子设备
CN114615520B (zh) * 2022-03-08 2024-01-02 北京达佳互联信息技术有限公司 字幕定位方法、装置、计算机设备及介质
CN114697761B (zh) * 2022-04-07 2024-02-13 脸萌有限公司 一种处理方法、装置、终端设备及介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060045346A1 (en) * 2004-08-26 2006-03-02 Hui Zhou Method and apparatus for locating and extracting captions in a digital image
CN101510260A (zh) * 2008-02-14 2009-08-19 富士通株式会社 字幕存在时间确定装置和方法
CN102332096A (zh) * 2011-10-17 2012-01-25 中国科学院自动化研究所 一种视频字幕文本提取和识别的方法
CN102750540A (zh) * 2012-06-12 2012-10-24 大连理工大学 基于形态滤波增强的最稳定极值区视频文本检测方法
CN102915438A (zh) * 2012-08-21 2013-02-06 北京捷成世纪科技股份有限公司 一种视频字幕的提取方法及装置
CN106254933A (zh) * 2016-08-08 2016-12-21 腾讯科技(深圳)有限公司 字幕提取方法及装置

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5703655A (en) * 1995-03-24 1997-12-30 U S West Technologies, Inc. Video programming retrieval using extracted closed caption data which has been partitioned and stored to facilitate a search and retrieval process
US6731788B1 (en) * 1999-01-28 2004-05-04 Koninklijke Philips Electronics N.V. Symbol Classification with shape features applied to neural network
US20020067428A1 (en) * 2000-12-01 2002-06-06 Thomsen Paul M. System and method for selecting symbols on a television display
GB0712880D0 (en) * 2007-07-03 2007-08-08 Skype Ltd Instant messaging communication system and method
JP4620163B2 (ja) * 2009-06-30 2011-01-26 株式会社東芝 静止字幕検出装置、静止字幕を含む画像を表示する映像機器、および静止字幕を含んだ画像の処理方法
US20120206567A1 (en) * 2010-09-13 2012-08-16 Trident Microsystems (Far East) Ltd. Subtitle detection system and method to television video

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060045346A1 (en) * 2004-08-26 2006-03-02 Hui Zhou Method and apparatus for locating and extracting captions in a digital image
CN101510260A (zh) * 2008-02-14 2009-08-19 富士通株式会社 字幕存在时间确定装置和方法
CN102332096A (zh) * 2011-10-17 2012-01-25 中国科学院自动化研究所 一种视频字幕文本提取和识别的方法
CN102750540A (zh) * 2012-06-12 2012-10-24 大连理工大学 基于形态滤波增强的最稳定极值区视频文本检测方法
CN102915438A (zh) * 2012-08-21 2013-02-06 北京捷成世纪科技股份有限公司 一种视频字幕的提取方法及装置
CN106254933A (zh) * 2016-08-08 2016-12-21 腾讯科技(深圳)有限公司 字幕提取方法及装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN, LEI: "Robust Text Detection in Natural Scene Images", ELECTRONIC TECHNOLOGY & INFORMATION SCIENCE, 15 October 2015 (2015-10-15), ISSN: 1674-022X *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110619257A (zh) * 2018-06-20 2019-12-27 北京搜狗科技发展有限公司 一种文字区域确定方法和装置
CN110619257B (zh) * 2018-06-20 2023-11-07 北京搜狗科技发展有限公司 一种文字区域确定方法和装置
US11017237B1 (en) 2018-12-27 2021-05-25 Facebook, Inc. Systems and methods for automated video classification
US10922548B1 (en) 2018-12-27 2021-02-16 Facebook, Inc. Systems and methods for automated video classification
US10956746B1 (en) 2018-12-27 2021-03-23 Facebook, Inc. Systems and methods for automated video classification
US11138440B1 (en) 2018-12-27 2021-10-05 Facebook, Inc. Systems and methods for automated video classification
WO2020139923A1 (en) * 2018-12-27 2020-07-02 Facebook, Inc. Systems and methods for automated video classification
CN114092925A (zh) * 2020-08-05 2022-02-25 武汉Tcl集团工业研究院有限公司 一种视频字幕检测方法、装置、终端设备及存储介质
CN112749696A (zh) * 2020-09-01 2021-05-04 腾讯科技(深圳)有限公司 一种文本检测方法及装置
CN112270317A (zh) * 2020-10-16 2021-01-26 西安工程大学 一种基于深度学习和帧差法的传统数字水表读数识别方法
CN112270317B (zh) * 2020-10-16 2024-06-07 西安工程大学 一种基于深度学习和帧差法的传统数字水表读数识别方法
CN113052169A (zh) * 2021-03-15 2021-06-29 北京小米移动软件有限公司 视频字幕识别方法、装置、介质及电子设备
CN115550714A (zh) * 2021-06-30 2022-12-30 花瓣云科技有限公司 字幕显示方法及相关设备
CN114666649A (zh) * 2022-03-31 2022-06-24 北京奇艺世纪科技有限公司 字幕被裁视频的识别方法、装置、电子设备及存储介质
CN114666649B (zh) * 2022-03-31 2024-03-01 北京奇艺世纪科技有限公司 字幕被裁视频的识别方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
US20190114486A1 (en) 2019-04-18
US11367282B2 (en) 2022-06-21
CN106254933A (zh) 2016-12-21
CN106254933B (zh) 2020-02-18

Similar Documents

Publication Publication Date Title
WO2018028583A1 (zh) 字幕提取方法及装置、存储介质
US8744195B2 (en) Object detection metadata
Tan et al. Mirror detection with the visual chirality cue
US10573039B2 (en) Techniques for incorporating a text-containing image into a digital image
US20200322684A1 (en) Video recommendation method and apparatus
WO2017219900A1 (zh) 一种视频检测方法,服务器及存储介质
KR101469398B1 (ko) 텍스트 기반 3d 증강 현실
TWI253860B (en) Method for generating a slide show of an image
JP5775225B2 (ja) マルチレイヤ連結成分をヒストグラムと共に用いるテキスト検出
WO2020259510A1 (zh) 信息植入区域的检测方法、装置、电子设备及存储介质
CN108875744B (zh) 基于矩形框坐标变换的多方向文本行检测方法
CN103699532A (zh) 图像颜色检索方法和***
JP2003030672A (ja) 帳票認識装置、方法、プログラムおよび記憶媒体
CN111460355A (zh) 一种页面解析方法和装置
JP6203188B2 (ja) 類似画像検索装置
US9866894B2 (en) Method for annotating an object in a multimedia asset
CN112000024A (zh) 用于控制家电设备的方法及装置、设备
US20130182943A1 (en) Systems and methods for depth map generation
CN113312949A (zh) 视频数据处理方法、视频数据处理装置和电子设备
JP2011049866A (ja) 画像表示装置
KR20230162010A (ko) 이미지들 및 비디오로부터 반사 특징들을 제거하기 위한 실시간 기계 학습-기반 프라이버시 필터
CN115035530A (zh) 图像处理方法、图像文本获得方法、装置及电子设备
Ram et al. Video Analysis and Repackaging for Distance Education
KR101911613B1 (ko) 뉴스 인터뷰 영상의 오버레이 텍스트 기반 인물 인덱싱 방법 및 장치
Novozámský et al. Extended IMD2020: a large‐scale annotated dataset tailored for detecting manipulated images

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17838713

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 17838713

Country of ref document: EP

Kind code of ref document: A1