CN107145888A - Video caption real time translating method - Google Patents
Video caption real time translating method
- Publication number: CN107145888A
- Application number: CN201710345936.1A
- Authority
- CN
- China
- Prior art keywords
- text
- MSER
- text region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/635—Overlay text, e.g. embedded captions in a TV program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/158—Segmentation of character regions using character size, text spacings or pitch estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The present invention provides a video caption real-time translation method, including: performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images; extracting MSER regions from the original image and the multiple single-channel images based on the MSER algorithm; computing the local contrast between each MSER region and its background area and, according to each local contrast, deciding whether to filter out the corresponding MSER region; determining the boundary key points of each MSER region; using the boundary key points as classification features, classifying the MSER regions remaining after filtering with a trained SVM to obtain text regions; grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, and classifying the text regions on each line according to the distance between every two adjacent text regions on the same line; and performing real-time video caption translation based on the classified text regions.
Description
Technical field
The invention belongs to the field of image processing, and in particular relates to a video caption real-time translation method.
Background art
In recent years, text detection and recognition in natural scene images has become a very popular research topic in computer vision, pattern recognition and document analysis. Researchers have proposed a large number of new ideas and methods for extracting text information from natural scene images. However, when translating video captions at present, the time complexity of extracting text information from images is high, so real-time translation of video captions cannot be achieved.
Summary of the invention
The present invention provides a video caption real-time translation method, to solve the problem that real-time translation of video captions cannot currently be achieved because the time complexity of extracting text information from images is high.
According to a first aspect of the embodiments of the present invention, there is provided a video caption real-time translation method, including:
performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images;
extracting the MSER regions of the original image and of the multiple single-channel images based on the maximally stable extremal region (MSER) algorithm;
introducing a local-contrast text feature, computing the local contrast between each MSER region and its background area and, according to each local contrast, deciding whether to filter out the corresponding MSER region;
introducing a boundary-key-point text feature, determining the boundary key points of each MSER region;
using the boundary key points as classification features, classifying the MSER regions remaining after filtering with a trained support vector machine (SVM) to obtain text regions;
grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, and classifying the text regions on each line according to the distance between every two adjacent text regions on the same line;
performing real-time video caption translation based on the classified text regions.
In an optional implementation, before performing multi-channel extraction on the original image captured from the video to obtain multiple single-channel images, the method further includes: preprocessing the original image, including sharpening and blurring.
In another optional implementation, performing multi-channel extraction on the original image captured from the video to obtain multiple single-channel images includes: extracting the six channels R, G, B, H, S and V from the original image and from the preprocessed original image, thereby obtaining multiple single-channel images.
In another optional implementation, computing the local contrast between each MSER region and its background area and deciding, according to each local contrast, whether to filter out the corresponding MSER region includes:
computing the local contrast lc between each MSER region and its background according to formula (3), in which n denotes the number of pixels in the MSER region, k denotes the number of pixels in the corresponding background area, R_i, G_i and B_i denote the values of the red, green and blue channels of the image containing the MSER region, i denotes the i-th pixel of the MSER region, and j denotes the j-th pixel of the background area;
and, for each MSER region, filtering out that region if its local contrast is below a first preset threshold.
In another optional implementation, determining the boundary key points of each MSER region includes:
for each MSER region, setting the gray value of the pixels detected as MSER pixels to 255 and the gray value of all other pixels to 0;
traversing each pixel of the MSER region in turn, and determining a pixel to be a contour point if its gray value is 255 and the gray value of at least one of its neighboring pixels is 0;
after all contour points of at least one MSER region have been obtained, compressing the contour points with the Douglas-Peucker algorithm to remove redundant points, thereby obtaining the boundary key points of the corresponding MSER region.
In another optional implementation, the aspect ratio, area-to-perimeter ratio, convex-hull area ratio and stroke-width area ratio of each MSER region remaining after filtering are also used as classification features when classifying those regions with the trained SVM.
In another optional implementation, during SVM training the ratio of positive to negative samples is kept at 1:3, where the positive samples are the letters and Arabic numerals of the target translation language, and the negative samples are non-text regions identified and labeled manually among the MSER regions extracted from the original image and the multiple single-channel images.
In another optional implementation, grouping the text regions into text lines according to the distance between every two vertically adjacent text regions includes:
computing the distance d_v between every two vertically adjacent text regions according to formula (4), in which b_1 denotes the Y-axis coordinate of the bottom of the upper text region, t_2 denotes the Y-axis coordinate of the top of the lower text region, and h_2 denotes the height of the lower text region;
and, for every two vertically adjacent text regions, assigning them to the same text line if the distance d_v between them is greater than a second preset threshold, and to different text lines otherwise.
In another optional implementation, classifying the text regions on one text line according to the distance between every two adjacent text regions on that line includes:
computing the distance d_h between every two adjacent text regions on the line according to formula (5), in which d̄ denotes the average of all character pitches on the line and Δd denotes the X-axis distance between the adjacent letters of the two adjacent text regions;
and, for every two adjacent text regions on the line, assigning them to the same class if the distance d_h between them is greater than a third preset threshold, and to different classes otherwise.
In another optional implementation, when capturing the original image from the video, video frames are captured frame by frame, and the lower two-thirds of the captured frame is used as the original image.
The beneficial effects of the invention are as follows:
1. Before text recognition, the present invention first introduces multiple single-channel images, effectively exploiting the color information of the original image and providing richer basic data for text region extraction. It then introduces local contrast to threshold-filter the MSER regions extracted from the original image and the single-channel images, which improves the accuracy of text region extraction; since local-contrast filtering runs in linear time, the filtering is fast and lays a foundation for real-time caption translation. Boundary key points are introduced as SVM classification features, so non-text interference in the MSER regions can be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to rotation and scaling. After the local-contrast threshold filtering, the remaining MSER regions are screened by the SVM classifier, further improving the accuracy of text region extraction. For the text regions obtained after screening, the present invention applies a two-tier classification algorithm, first grouping text lines in the vertical direction and then classifying the text regions on each line in the horizontal direction, which greatly reduces time complexity and increases the speed of word recognition. The present invention can therefore achieve accurate real-time translation of video captions.
2. By sharpening the original image, the present invention strengthens the contrast between text regions and their surrounding background, which benefits text detection; by blurring the original image, text regions under complex backgrounds stand out more, which also benefits text detection.
3. By keeping the ratio of positive to negative samples at 1:3 when training the SVM, an optimal screening effect can be obtained, further improving the accuracy of text region extraction.
4. When capturing the original image from the video, the present invention first captures video frames frame by frame and then uses a sub-region of each frame as the original image, which improves recognition accuracy and reduces detection time.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the video caption real-time translation method of the present invention;
Fig. 2 is a schematic diagram of the Laplacian operator templates of the present invention;
Fig. 3 is a schematic diagram of boundary key points;
Fig. 4 is a schematic diagram illustrating the text line constraint parameters;
Fig. 5 is a statistical chart comparing character pitch with word spacing.
Detailed description of the embodiments
In order that those skilled in the art may better understand the technical solutions in the embodiments of the present invention, and that the above objects, features and advantages of the embodiments may become more apparent, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
In the description of the invention, unless otherwise specified and limited, it should be noted that the term "connection" should be understood broadly: it may be, for example, a mechanical connection, an electrical connection, or a connection between the interiors of two elements; it may be a direct connection, or an indirect connection through an intermediary. For those of ordinary skill in the art, the specific meaning of the above term can be understood according to the specific situation.
Referring to Fig. 1, which is a flowchart of an embodiment of the video caption real-time translation method of the present invention, the method may include:
Step S101: performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images.
In this embodiment, video resources fall into two kinds: local videos that can be played offline, and online videos that require a network connection. For local video, corresponding software containing an offline translation database can be provided to the user; after the user connects the local video to the software, the software recognizes the caption text according to the method of this patent, translates the recognized text with the offline translation database, and returns the translation result to the local video for display. For online video, corresponding software can likewise be provided, or a Web server can be built to offer an online Web service; after the user links the online video to the Web server, the server recognizes the caption text according to the method of this patent, translates the recognized text, and returns the translation result to the online video for display.
In order to realize real-time translation, video frames can be captured frame by frame, and, to improve recognition accuracy and reduce detection time, a sub-region of the captured frame can be used as the original image, for example the bottom two-thirds of the frame.
After the original image is captured from the video, it can first be preprocessed by sharpening and blurring. Sharpening can be performed according to formula (1):
g(x, y) = f(x, y) + c[∇²f(x, y)]   (1)
where g(x, y) and f(x, y) denote the sharpened image and the original image respectively, and the value of c depends on the Laplacian template used: if the template shown in Fig. 2(a) or Fig. 2(b) is used, c = -1; if the two templates shown in Fig. 2(c) are used, c = 1. The sharpened image strengthens the contrast between text regions and their surrounding background, which benefits text detection.
When blurring the original image, Gaussian filtering can be applied according to formula (2), using the Gaussian function:
f(x) = (1 / (√(2π)·σ)) · exp(-(x - μ)² / (2σ²))   (2)
where μ denotes the mean of a random variable obeying the normal distribution and σ² denotes the variance of the random variable x. By blurring the original image, text regions under complex backgrounds stand out more, which benefits text detection.
After preprocessing, the six channels R (red), G (green), B (blue), H (hue), S (saturation) and V (value) can be extracted from the original image and from the preprocessed original image, thereby obtaining multiple single-channel images. By performing multi-channel image extraction, the present invention makes effective use of color information and provides richer basic data for text region extraction, so that the captions translated on the basis of these text regions are more accurate.
Step S102: extracting the MSER regions of the original image and the multiple single-channel images based on the MSER algorithm.
In this embodiment, in order to speed up extraction, the parameters of the MSER algorithm are set as follows: the threshold step is 5, the minimum MSER area is 80, and the maximum MSER area is 14400. Since the MSER algorithm is an image extraction method well known in the art, its specific extraction process is not repeated here.
Step S103: introducing a local-contrast text feature, computing the local contrast between each MSER region and its background area and, according to each local contrast, deciding whether to filter out the corresponding MSER region.
In this embodiment, not all MSER regions extracted in step S102 are text regions. The applicant found that, for text to be recognizable, it must have a certain contrast with its background, and that the contrast of a text region with its background differs from that of a non-text region: the former is higher than the latter. Based on this characteristic, the present invention introduces the local-contrast feature to filter out non-text regions. First, the local contrast lc between each MSER region and its background area can be computed using formula (3), in which n denotes the number of pixels in the MSER region, k denotes the number of pixels in the corresponding background area, R_i, G_i and B_i denote the values of the red, green and blue channels of the image containing the MSER region, i denotes the i-th pixel of the MSER region, and j denotes the j-th pixel of the background area.
Then, according to the local contrast of each MSER region, it can be decided whether to filter out that region: if its local contrast is below a first preset threshold, the region is filtered out; otherwise it is kept. The applicant found that the local contrast lc of non-text regions is generally below 0.35, i.e. the first preset threshold can be 0.35. Although the multi-channel extraction of step S101 provides rich data for obtaining more text regions, it also introduces more non-text regions; by filtering the MSER regions with local contrast, part of the non-text interference can be excluded, improving the accuracy of text region extraction. Moreover, filtering out non-text regions by local contrast runs in linear time, so the filtering is fast and provides a basis for real-time video caption translation.
Step S104: introducing a boundary-key-point text feature and determining the boundary key points of each MSER region.
In this embodiment, when determining the boundary key points, each MSER region is first binarized: for each MSER region, the gray value of the pixels detected as MSER pixels is set to 255 and the gray value of all other pixels to 0. Then each pixel of the MSER region is traversed in turn; if its gray value is 255 and the gray value of at least one of its neighboring pixels is 0, the pixel is determined to be a contour point. Specifically, the pixels of the MSER region can be traversed from top to bottom and from left to right; if the gray value of a pixel is p(x, y) = 255 and the value 0 occurs among the gray value p(x+1, y) of its right neighbor, p(x-1, y) of its left neighbor, p(x, y+1) of its upper neighbor and p(x, y-1) of its lower neighbor, the pixel is determined to be a contour point, where x and y denote the X-axis and Y-axis coordinates of the pixel.
After all contour points of at least one MSER region have been obtained, the contour points are compressed with the Douglas-Peucker algorithm to remove redundant points, yielding the boundary key points of the corresponding MSER region, as shown in Fig. 3. The compression may be applied each time the contour points of one MSER region are obtained (the boundary key points being the contour points remaining after redundant points are removed), each time the contour points of a preset number of MSER regions are obtained, or after the contour points of all MSER regions have been obtained. The applicant found that the number k of boundary key points of an English letter generally lies between 5 and 16, i.e. the preset number range is 5 to 16; when translating English, an MSER region whose number of boundary key points is less than 5 or greater than 16 can be determined to be a non-text region.
Step S105: using the boundary key points as classification features, classifying the MSER regions remaining after filtering with the trained SVM to obtain text regions.
In this embodiment, after the threshold filtering of step S103, the present invention uses, in addition to the boundary key points, the aspect ratio (w/h), the area-to-perimeter ratio (√a/p), the convex-hull area ratio (a_c/a) and the stroke-width area ratio (w_s/a) of each remaining MSER region as classification features to obtain the text regions, where w denotes the width of the MSER region, h its height, p its perimeter, a its area, a_c the area of its convex hull (the convex hull is a generic term in image processing and is not elaborated here), and w_s the stroke width of the image. In order to obtain the best screening effect, the training parameters can be set as follows: the kernel function uses the radial basis function (RBF), and the number of iterations is 100. By classifying the MSER regions remaining after filtering with the trained SVM classifier, the accuracy of text region extraction can be improved. In addition, in order to achieve the best classification effect, the ratio of positive to negative samples during SVM training is kept at 1:3, where the positive samples are the letters of the target translation language (for example, when the target language is English, 'a'-'z' and 'A'-'Z') and the Arabic numerals ('0'-'9'), and the negative samples are non-text regions identified and labeled manually among the MSER regions extracted from the original image and the single-channel images in step S102; an optimal screening effect can thus be obtained, further improving the accuracy of text region extraction.
In the set of contour pixels of a region, connecting a subset of the points in a certain order can restore the region to the greatest extent with the fewest pixels; this minimal subset is what the present invention calls the boundary key points. Since rotation and scaling of the image do not affect its boundary key points, using boundary key points as classification features allows non-text interference in the MSER regions to be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to image rotation and scale changes.
Step S106: grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, and classifying the text regions on each line according to the distance between every two adjacent text regions on the same line.
In this embodiment, with reference to Fig. 4 and Fig. 5, when grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, the distance d_v between every two vertically adjacent text regions can first be computed according to formula (4), in which b_1 denotes the Y-axis coordinate of the bottom of the upper text region, t_2 the Y-axis coordinate of the top of the lower text region, and h_2 the height of the lower text region, as shown in Fig. 4. Then, for every two vertically adjacent text regions, if the distance d_v between them is greater than a second preset threshold, they are assigned to the same text line; otherwise they are assigned to different text lines. The applicant found that when d_v is greater than 0.62 the two adjacent text regions lie on the same text line, so the second preset threshold can be 0.62.
In addition, when distinguishing the words on one text line according to the distance between every two adjacent text regions on that line, the distance d_h between every two adjacent text regions on the line can first be computed according to formula (5), in which d̄ denotes the average of all character pitches on the line and Δd denotes the X-axis distance between the adjacent letters of the two adjacent text regions, i.e. the interval between the two letters. Then, for every two adjacent text regions on the line, if the distance d_h between them is greater than a third preset threshold, the adjacent letters of the two regions belong to the same word and the two regions are assigned to the same class; otherwise the adjacent letters belong to different words and the two regions are assigned to different classes. When text regions are obtained, different letters of the same word may be split into different text regions, and the spacing between words differs markedly from the spacing between letters within a word, as shown in Fig. 5; the present invention therefore distinguishes the words on each text line after the text lines have been distinguished, which improves word recognition accuracy. By using this two-tier classification algorithm of vertical text line grouping followed by horizontal classification within each line, the time complexity is greatly reduced (the time complexity of comparable algorithms is O(n²), while that of the present invention is O(n·log₂n)), increasing the speed of word recognition and providing a basis for real-time video caption translation. The applicant found that when the distance d_h between two adjacent text regions is greater than 2.33, their adjacent letters belong to the same word, so the third preset threshold can be 2.33.
Step S107: performing real-time video caption translation based on the sorted text regions.
In the present embodiment, after the sorted text regions are obtained, text recognition can be performed using the open-source framework Tesseract; at the same time, for unified management of the system, Tesseract needs to be integrated with the OpenCV image-processing runtime library. After the text is recognized, it can be delivered in the form of a character string to the interface provided by Google Translate to obtain the translation result, which is finally displayed to the user, thereby achieving real-time video caption translation.
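The recognition-and-translation step can be wired as a small pipeline. In this sketch `recognize` and `translate` are injected callables (in the embodiment they would wrap Tesseract and the Google Translate interface, respectively); the function name and signature are assumptions for illustration only:

```python
def translate_captions(text_regions, recognize, translate):
    """Step S107 sketch: OCR each sorted text region, then translate the
    recognized strings and return the results for display to the user."""
    recognized = [recognize(region) for region in text_regions]
    # Skip regions where recognition produced no text before translating.
    return [translate(text) for text in recognized if text]
```

Keeping the OCR engine and the translation backend behind these two callables also makes the pipeline easy to test without network access.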
As can be seen from the above embodiment, before recognizing text the present invention first introduces multiple single-channel images, effectively exploiting the color information of the original image to provide richer basic data for text region extraction. It then introduces the local-contrast text feature to threshold-filter the MSER regions extracted from the original image and the multiple single-channel images, which improves the accuracy of text region extraction; since local-contrast filtering runs in linear time, the filtering is fast and provides a basis for real-time caption translation. It further introduces boundary key points as the SVM classification-screening feature, so that non-text MSER regions can be excluded even when the image is rotated or scaled, making text region extraction robust to image rotation and scaling. After the local-contrast threshold filtering has been applied to the MSER regions, the remaining regions are screened by the trained SVM classifier, which further improves the accuracy of text region extraction. For the text regions obtained after screening, the present invention adopts the two-level text sorting algorithm of classifying text lines in the vertical direction and classifying the text regions of each text line in the horizontal direction, which greatly reduces the time complexity and increases the speed of word recognition, laying the foundation for real-time video caption translation. The present invention can therefore translate video captions accurately and in real time.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include such departures from the present disclosure as come within common knowledge or customary technical means in the art. The specification and embodiments are to be considered exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the invention is not limited to the precise constructions described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A real-time video caption translation method, characterized by comprising:
performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images;
extracting, based on the maximally stable extremal region (MSER) algorithm, the MSER regions of the original image and of the multiple single-channel images respectively;
introducing a local-contrast text feature, calculating the local contrast between each MSER region and its background region, and determining, according to each local contrast, whether to filter out the corresponding MSER region;
introducing a boundary-key-point text feature and determining the boundary key points of each MSER region;
taking the boundary key points as a classification-screening feature, performing classification screening on each MSER region remaining after the filtering by means of a trained support vector machine (SVM) to obtain text regions;
distinguishing text lines among the text regions according to the distance between every two vertically adjacent text regions, and classifying the text regions of each text line according to the distance between every two adjacent text regions on that line;
performing real-time video caption translation based on the sorted text regions.
2. The real-time video caption translation method according to claim 1, characterized in that before performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images, the method further comprises: preprocessing the original image, including sharpening and blurring.
3. The real-time video caption translation method according to claim 2, characterized in that performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images comprises: performing image extraction of the six channels R, G, B, H, S and V on the original image and on the preprocessed original image respectively, thereby obtaining the multiple single-channel images.
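A minimal sketch of the six-channel split using only the standard library (the nested-list image format and the function name are assumptions of this sketch; an implementation would normally use OpenCV's `cv2.split` and `cv2.cvtColor` on array images):

```python
import colorsys

def six_channels(rgb_image):
    """Split an RGB image (nested lists of (r, g, b) tuples with values
    in [0, 1]) into six single-channel images: R, G, B, H, S and V."""
    channels = {name: [] for name in "RGBHSV"}
    for row in rgb_image:
        row_out = {name: [] for name in "RGBHSV"}
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)  # stdlib RGB -> HSV
            for name, value in zip("RGBHSV", (r, g, b, h, s, v)):
                row_out[name].append(value)
        for name in channels:
            channels[name].append(row_out[name])
    return channels
```

Each returned channel has the same height and width as the input, so the MSER algorithm can be applied to it exactly as to a grayscale image.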
4. The real-time video caption translation method according to claim 1, characterized in that calculating the local contrast between each MSER region and its background region, and determining according to each local contrast whether to filter out the corresponding MSER region, comprises:
calculating the local contrast lc between each MSER region and its background according to the following formula:
$$lc = \frac{\left| \sum_{i=1}^{n} (R_i + G_i + B_i) - \sum_{j=1}^{k} (R_j + G_j + B_j) \right|}{\max\left( \sum_{i=1}^{n} (R_i + G_i + B_i),\ \sum_{j=1}^{k} (R_j + G_j + B_j) \right)}$$
where n represents the number of pixels in the corresponding MSER region, k represents the number of pixels in the corresponding background region, R, G and B represent the values of the red, green and blue channels of the image in which the MSER region lies, i indexes the i-th pixel of the corresponding MSER region, and j indexes the j-th pixel of the corresponding background region;
for each MSER region, if the local contrast of the MSER region is less than a first predetermined threshold, the MSER region is filtered out.
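For illustration, the formula of this claim can be computed directly from pixel lists; the function name and the data format are assumptions of this sketch:

```python
def local_contrast(region_pixels, background_pixels):
    """Local contrast lc between an MSER region and its background:
    |sum(R+G+B) over the region - sum(R+G+B) over the background|,
    divided by the larger of the two sums. Pixels are (R, G, B) tuples."""
    s_region = sum(r + g + b for r, g, b in region_pixels)
    s_background = sum(r + g + b for r, g, b in background_pixels)
    return abs(s_region - s_background) / max(s_region, s_background)
```

A region whose lc falls below the first predetermined threshold would then be filtered out; note that lc always lies in [0, 1] by construction.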
5. The real-time video caption translation method according to claim 1, characterized in that determining the boundary key points of each MSER region comprises:
for each MSER region, setting the gray value of the pixels detected as MSER pixels in the MSER region to 255 and the gray value of the other pixels to 0;
traversing each pixel of the MSER region in turn, and if the gray value of a pixel is 255 and the gray value of at least one of its neighboring pixels is 0, determining that the pixel is a contour point;
after all contour points of at least one MSER region have been obtained, compressing the contour points using the Douglas-Peucker algorithm to remove redundant points and obtain the boundary key points of the corresponding MSER region.
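The Douglas-Peucker compression named in this claim can be sketched as a short recursive function (in practice OpenCV's `cv2.approxPolyDP` provides the same operation); this pure-Python version is for illustration only:

```python
def rdp(points, eps):
    """Douglas-Peucker: compress a polyline to its key points, keeping a
    point only if its perpendicular distance from the chord between the
    endpoints exceeds eps."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0
    dmax, idx = 0.0, 0
    for i, (x, y) in enumerate(points[1:-1], 1):
        d = abs(dy * (x - x1) - dx * (y - y1)) / norm  # distance to chord
        if d > dmax:
            dmax, idx = d, i
    if dmax > eps:
        # Keep the farthest point and recurse on the two halves.
        return rdp(points[:idx + 1], eps)[:-1] + rdp(points[idx:], eps)
    return [points[0], points[-1]]
```

Collinear contour points collapse to the two endpoints, while genuine corners survive, which is exactly the "redundant point" removal the claim describes.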
6. The real-time video caption translation method according to claim 1, characterized in that the aspect ratio, the area-to-perimeter ratio, the convex hull area ratio and the stroke width area of each MSER region remaining after the filtering are also used as classification-screening features when performing classification screening on each remaining MSER region by means of the trained SVM.
7. The real-time video caption translation method according to claim 6, characterized in that during SVM training the quantity ratio of positive samples to negative samples is controlled at 1:3, wherein the positive samples are the letters and Arabic numerals corresponding to the target translation language, and the negative samples are the non-text regions identified and labeled manually among the MSER regions extracted from the original image and from the multiple single-channel images respectively.
8. The real-time video caption translation method according to claim 1, characterized in that distinguishing text lines among the text regions according to the distance between every two vertically adjacent text regions comprises:
calculating the distance d_v between every two vertically adjacent text regions according to the following formula:
$$d_v = \frac{b_1 - t_2}{h_2}$$
where b_1 represents the Y-axis coordinate of the bottom of the upper of two vertically adjacent text regions, t_2 represents the Y-axis coordinate of the top of the lower of the two adjacent text regions, and h_2 represents the height of the lower adjacent text region;
for every two vertically adjacent text regions, if the distance d_v between them is greater than a second predetermined threshold, the two adjacent text regions are classified into the same text line; otherwise, they are classified into different text lines.
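As a sketch of this claim's test (the tuple format and the default threshold value are assumptions; the second predetermined threshold is not fixed in the claim):

```python
def same_text_line(upper, lower, second_threshold=0.0):
    """upper, lower: (top_y, bottom_y) of two vertically adjacent text
    regions in image coordinates (Y grows downward).
    d_v = (b1 - t2) / h2; regions whose d_v exceeds the threshold are
    placed on the same text line."""
    b1 = upper[1]                  # bottom of the upper region
    t2, b2 = lower                 # top and bottom of the lower region
    d_v = (b1 - t2) / (b2 - t2)    # h2 = b2 - t2
    return d_v > second_threshold
```

With a threshold of 0, vertically overlapping regions (b1 below t2 in image coordinates) land on the same line, and vertically separated regions do not.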
9. The real-time video caption translation method according to claim 1, characterized in that classifying the text regions of each text line according to the distance between every two adjacent text regions on that line comprises:
calculating the distance d_h between every two adjacent text regions of the same text line according to the following formula:
$$d_h = \frac{\bar{w}}{\Delta d}$$
where $\bar{w}$ represents the average of all character pitches of the text line, and $\Delta d$ represents the distance difference in the X-axis direction between the adjacent letters of the two adjacent text regions;
for every two adjacent text regions of the same text line, if the distance d_h between them is greater than a third predetermined threshold, the two adjacent text regions are classified into the same class; otherwise, they are classified into different classes.
10. The real-time video caption translation method according to claim 1, characterized in that when capturing the original image from the video, video frames are captured frame by frame, and the lower two-thirds region of the captured video frame is taken as the original image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710345936.1A CN107145888A (en) | 2017-05-17 | 2017-05-17 | Video caption real time translating method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107145888A true CN107145888A (en) | 2017-09-08 |
Family
ID=59778166
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710345936.1A Pending CN107145888A (en) | 2017-05-17 | 2017-05-17 | Video caption real time translating method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145888A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102054271A (en) * | 2009-11-02 | 2011-05-11 | 富士通株式会社 | Text line detection method and device |
CN102542268A (en) * | 2011-12-29 | 2012-07-04 | 中国科学院自动化研究所 | Method for detecting and positioning text area in video |
CN102750540A (en) * | 2012-06-12 | 2012-10-24 | 大连理工大学 | Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method |
CN103310439A (en) * | 2013-05-09 | 2013-09-18 | 浙江大学 | Method for detecting maximally stable extremal region of image based on scale space |
WO2015105755A1 (en) * | 2014-01-08 | 2015-07-16 | Qualcomm Incorporated | Processing text images with shadows |
CN105825216A (en) * | 2016-03-17 | 2016-08-03 | 中国科学院信息工程研究所 | Method of locating text in complex background image |
CN105868758A (en) * | 2015-01-21 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Method and device for detecting text area in image and electronic device |
US9576196B1 (en) * | 2014-08-20 | 2017-02-21 | Amazon Technologies, Inc. | Leveraging image context for improved glyph classification |
2017-05-17: Application filed (CN201710345936.1A), published as CN107145888A; status: active, Pending
Non-Patent Citations (1)
Title |
---|
He Dongjian (何东健): "Digital Image Processing" (《数字图像处理》), 28 February 2015 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108520254A (en) * | 2018-03-01 | 2018-09-11 | 腾讯科技(深圳)有限公司 | A kind of Method for text detection, device and relevant device based on formatted image |
CN109284751A (en) * | 2018-10-31 | 2019-01-29 | 河南科技大学 | The non-textual filtering method of text location based on spectrum analysis and SVM |
CN113287319A (en) * | 2019-01-09 | 2021-08-20 | 奈飞公司 | Optimizing encoding operations in generating buffer-constrained versions of media subtitles |
CN113287319B (en) * | 2019-01-09 | 2024-05-14 | 奈飞公司 | Method and apparatus for optimizing encoding operations |
CN111797632A (en) * | 2019-04-04 | 2020-10-20 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
CN111797632B (en) * | 2019-04-04 | 2023-10-27 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107451607B (en) | A kind of personal identification method of the typical character based on deep learning | |
CN112818862B (en) | Face tampering detection method and system based on multi-source clues and mixed attention | |
CN104408449B (en) | Intelligent mobile terminal scene literal processing method | |
CN100565559C (en) | Image text location method and device based on connected component and support vector machine | |
CN110008832A (en) | Based on deep learning character image automatic division method, information data processing terminal | |
CN107145888A (en) | Video caption real time translating method | |
CN107256558A (en) | The cervical cell image automatic segmentation method and system of a kind of unsupervised formula | |
CN106651872A (en) | Prewitt operator-based pavement crack recognition method and system | |
CN106875546A (en) | A kind of recognition methods of VAT invoice | |
CN104778238B (en) | The analysis method and device of a kind of saliency | |
JPH0728940A (en) | Image segmentation for document processing and classification of image element | |
CN111461122B (en) | Certificate information detection and extraction method | |
CN110298376A (en) | A kind of bank money image classification method based on improvement B-CNN | |
CN108615058A (en) | A kind of method, apparatus of character recognition, equipment and readable storage medium storing program for executing | |
CN107085726A (en) | Oracle bone rubbing individual character localization method based on multi-method denoising and connected component analysis | |
CN114005123A (en) | System and method for digitally reconstructing layout of print form text | |
CN109544564A (en) | A kind of medical image segmentation method | |
CN112069900A (en) | Bill character recognition method and system based on convolutional neural network | |
CN107633229A (en) | Method for detecting human face and device based on convolutional neural networks | |
CN112434699A (en) | Automatic extraction and intelligent scoring system for handwritten Chinese characters or components and strokes | |
CN106611174A (en) | OCR recognition method for unusual fonts | |
CN110956167A (en) | Classification discrimination and strengthened separation method based on positioning characters | |
CN106339984A (en) | Distributed image super-resolution method based on K-means driven convolutional neural network | |
CN113673541B (en) | Image sample generation method for target detection and application | |
CN110929746A (en) | Electronic file title positioning, extracting and classifying method based on deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |