CN107145888A - Video caption real time translating method - Google Patents

Video caption real time translating method

Info

Publication number
CN107145888A
CN107145888A
Authority
CN
China
Prior art keywords
text
MSER
text region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710345936.1A
Other languages
Chinese (zh)
Inventor
代劲 (Dai Jin)
王族 (Wang Zu)
宋娟 (Song Juan)
张鹏 (Zhang Peng)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications
Priority to CN201710345936.1A
Publication of CN107145888A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/62: Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635: Overlay text, e.g. embedded captions in a TV program
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/58: Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/158: Segmentation of character regions using character size, text spacings or pitch estimation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Image Analysis (AREA)

Abstract

The present invention provides a method for the real-time translation of video captions, including: performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images; extracting the MSER regions of the original image and of each single-channel image based on the MSER algorithm; calculating the local contrast between each MSER region and its background region and determining, from each local contrast, whether to filter out the corresponding MSER region; determining the boundary key points of each MSER region; using the boundary key points as a classification feature, screening the MSER regions that survive filtering with a trained SVM to obtain text regions; grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions, then classifying the regions within each line according to the distance between each pair of adjacent regions on the same line; and performing real-time caption translation based on the classified text regions.

Description

Video caption real time translating method
Technical field
The invention belongs to the field of image processing, and in particular relates to a method for the real-time translation of video captions.
Background art
In recent years, text detection and recognition in natural scene images has become a very popular research topic in computer vision, pattern recognition and document analysis. Researchers have proposed a large number of new ideas and methods for extracting text information from natural scene images. However, because the time complexity of extracting text information from images is high, real-time translation of video captions has so far not been achievable.
Summary of the invention
The present invention provides a method for the real-time translation of video captions, to solve the current problem that the high time complexity of extracting text information from images prevents real-time caption translation.
According to a first aspect of embodiments of the present invention, there is provided a method for the real-time translation of video captions, including:
performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images;
extracting the MSER regions of the original image and of each single-channel image based on the maximally stable extremal region (MSER) algorithm;
introducing a local-contrast text feature: calculating the local contrast between each MSER region and its background region, and determining from each local contrast whether to filter out the corresponding MSER region;
introducing a boundary-key-point text feature: determining the boundary key points of each MSER region;
using the boundary key points as a classification feature, screening the MSER regions that survive filtering with a trained support vector machine (SVM) to obtain text regions;
grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions, and classifying the text regions within each line according to the distance between each pair of adjacent regions on the same line;
performing real-time caption translation based on the classified text regions.
In an optional implementation, before performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images, the method further includes: pre-processing the original image, including sharpening and blurring.
In another optional implementation, performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images includes: extracting the six channels R, G, B, H, S and V from the original image and from the pre-processed original image respectively, thereby obtaining the multiple single-channel images.
In another optional implementation, calculating the local contrast between each MSER region and its background region, and determining from each local contrast whether to filter out the corresponding MSER region, includes:
calculating the local contrast lc between each MSER region and its background according to the following formula:
lc = |Σ_{i=1..n}(R_i + G_i + B_i) - Σ_{j=1..k}(R_j + G_j + B_j)| / max(Σ_{i=1..n}(R_i + G_i + B_i), Σ_{j=1..k}(R_j + G_j + B_j))
where n is the number of pixels in the corresponding MSER region, k is the number of pixels in the corresponding background region, R, G and B are the values of the red, green and blue channels of the image containing the region, i indexes the pixels of the MSER region, and j indexes the pixels of the background region;
and, for each MSER region, filtering the region out if its local contrast is below a first preset threshold.
In another optional implementation, determining the boundary key points of each MSER region includes:
for each MSER region, setting the grey value of the pixels detected as belonging to the MSER to 255 and the grey value of all other pixels to 0;
traversing each pixel of the region; if a pixel has grey value 255 and at least one of its neighbouring pixels has grey value 0, determining that the pixel is a contour point;
after all contour points of at least one MSER region have been obtained, compressing the contour points with the Douglas-Peucker algorithm to remove redundant points, yielding the boundary key points of the corresponding MSER region.
In another optional implementation, the aspect ratio, area-to-perimeter ratio, convex hull area ratio and stroke-width area ratio of each MSER region that survives filtering are also used as classification features when screening the surviving regions with the trained SVM.
In another optional implementation, during SVM training the ratio of positive to negative samples is kept at 1:3, where the positive samples are the letters and Arabic numerals of the target translation language, and the negative samples are non-text regions identified and labelled manually among the MSER regions extracted from the original image and the multiple single-channel images.
In another optional implementation, grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions includes:
calculating the distance d_v between each pair of vertically adjacent text regions according to the following formula:
d_v = (b_1 - t_2) / h_2
where b_1 is the Y-axis coordinate of the bottom of the upper of the two adjacent text regions, t_2 is the Y-axis coordinate of the top of the lower region, and h_2 is the height of the lower region;
for each pair of vertically adjacent text regions, if the distance d_v between them exceeds a second preset threshold, assigning the two regions to the same text line, and otherwise assigning them to different text lines.
In another optional implementation, classifying the text regions within each text line according to the distance between each pair of adjacent regions on the line includes:
calculating the distance d_h between each pair of adjacent text regions on one text line according to the following formula:
d_h = w̄ / Δd
where w̄ is the average of all character pitches of the line and Δd is the distance along the X axis between the adjacent letters of the two adjacent text regions;
for each pair of adjacent text regions on the line, if the distance d_h between them exceeds a third preset threshold, assigning the two regions to the same class, and otherwise assigning them to different classes.
In another optional implementation, when capturing the original image from the video, video frames are captured frame by frame, and the bottom two-thirds of the captured frame is used as the original image.
The beneficial effects of the invention are as follows:
1. Before text recognition, the invention first introduces multiple single-channel images, making effective use of the colour information of the original image and providing richer basic data for text region extraction. It then introduces local contrast to threshold-filter the MSER regions extracted from the original image and the single-channel images, which improves the accuracy of text region extraction; and because local-contrast filtering runs in linear time, the filtering is fast and lays a foundation for real-time caption translation. Boundary key points are introduced as SVM screening features, so that non-text interference among the MSER regions can be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to rotation and scaling. After the local-contrast threshold filtering, the surviving MSER regions are screened by the SVM classifier, further improving the accuracy of text region extraction. For the text regions obtained after screening, the invention applies a two-tier text classification algorithm, first grouping text lines in the vertical direction and then classifying the regions within each line in the horizontal direction, which greatly reduces time complexity and speeds up word recognition, providing the basis for real-time caption translation. The invention can therefore achieve accurate real-time translation of video captions.
2. By sharpening the original image, the invention strengthens the contrast between text regions and their surrounding background, which benefits text detection; by blurring the original image, it makes text regions in complex backgrounds stand out more, which further benefits text detection.
3. By keeping the ratio of positive to negative samples at 1:3 when training the SVM, the invention obtains the best screening effect and further improves the accuracy of text region acquisition.
4. When capturing the original image from the video, the invention first captures video frames frame by frame and then uses only a sub-region of each frame as the original image, which improves recognition accuracy and reduces detection time.
Brief description of the drawings
Fig. 1 is a flow chart of one embodiment of the video caption real-time translation method of the present invention;
Fig. 2 is a schematic diagram of the Laplacian operator templates used by the present invention;
Fig. 3 is a schematic diagram of boundary key points;
Fig. 4 is a schematic diagram illustrating the text-line constraint parameters;
Fig. 5 is a statistical chart comparing character pitch with word interval.
Detailed description of the embodiments
To help those skilled in the art better understand the technical solutions in the embodiments of the present invention, and to make the above objects, features and advantages of the embodiments clearer, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
In the description of the invention, unless otherwise specified and limited, it should be noted that the term "connection" is to be understood broadly: it may be a mechanical connection, an electrical connection, or a connection between the interiors of two elements; it may be a direct connection or an indirect connection through an intermediary. For those of ordinary skill in the art, the specific meaning of the term can be understood according to the context.
Referring to Fig. 1, a flow chart of one embodiment of the video caption real-time translation method of the present invention, the method may include the following steps.
Step S101: performing multi-channel extraction on the original image captured from the video to obtain multiple single-channel images.
In this embodiment, video resources fall into two kinds: local videos that can be played offline, and online videos that require a network connection. For local videos, corresponding software containing an offline translation database can be provided to the user; after the user connects the local video with the software, the software recognises the caption text in the local video according to the method of this patent, automatically translates the recognised text using the offline translation database, and returns the translation result to the local video for display. For online videos, corresponding software can likewise be provided, or a Web server can be built to offer an online Web service; after the user links the online video with the Web server, the server recognises the caption text in the online video according to the method of this patent, translates the recognised text, and returns the translation result to the online video for display.
To achieve real-time translation, video frames can be captured frame by frame when capturing the original image from the video; and to improve recognition accuracy and reduce detection time, only a sub-region of the captured frame, for example the bottom two-thirds of the picture, can be used as the original image.
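For illustration only, a minimal sketch of this capture step, not part of the original disclosure, might look as follows in Python with OpenCV; the video path is a placeholder:

```python
import cv2

cap = cv2.VideoCapture("example.mp4")   # placeholder path to a local video
ok, frame = cap.read()                  # capture one frame
if ok:
    h = frame.shape[0]
    original_image = frame[h // 3:, :]  # keep the bottom two-thirds
cap.release()
```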
After the original image has been captured from the video, it can first be pre-processed by sharpening and blurring. Sharpening can be performed according to formula (1):
g(x, y) = f(x, y) + c[∇²f(x, y)]   (1)
where g(x, y) and f(x, y) denote the sharpened image and the original image respectively, and the value of c depends on the template used for sharpening: if the template shown in Fig. 2(a) or Fig. 2(b) is used, c = -1; if either of the two templates shown in Fig. 2(c) is used, c = 1. The sharpened image strengthens the contrast between text regions and their surrounding background, which benefits text detection.
When blurring the original image, Gaussian filtering can be applied according to formula (2):
f(x) = (1 / (√(2π) σ)) · exp(-(x - μ)² / (2σ²))   (2)
where f(x) is the Gaussian function used for the blurring operation, μ is the mean of the random variable following the normal distribution, and σ² is the variance of the random variable x. By blurring the original image, the invention makes text regions in complex backgrounds stand out more, which further benefits text detection.
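As an illustrative aside, the two pre-processing operations could be sketched as follows with OpenCV; the 5x5 Gaussian kernel size is an assumption, not a value given in the patent:

```python
import cv2
import numpy as np

def preprocess(img):
    # Sharpen per formula (1): g = f + c * Laplacian(f); with OpenCV's
    # default Laplacian kernel (centre weight -4) this corresponds to c = -1.
    lap = cv2.Laplacian(img, cv2.CV_64F)
    sharpened = np.clip(img.astype(np.float64) - lap, 0, 255).astype(np.uint8)
    # Blur per formula (2) with a Gaussian filter (assumed 5x5 kernel).
    blurred = cv2.GaussianBlur(img, (5, 5), 0)
    return sharpened, blurred
```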
After pre-processing, the six channels R (red), G (green), B (blue), H (hue), S (saturation) and V (value) can be extracted from the original image and from the pre-processed images, yielding multiple single-channel images. By extracting multiple channels, the invention makes effective use of colour information and provides richer basic data for text region extraction, so that the captions translated on the basis of these text regions are more accurate.
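A minimal sketch of the six-channel extraction, assuming OpenCV's BGR image layout, might be:

```python
import cv2

def six_channels(bgr):
    # R, G, B planes from the colour image, and H, S, V planes from its
    # HSV conversion: six single-channel images in total.
    b, g, r = cv2.split(bgr)
    h, s, v = cv2.split(cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV))
    return [r, g, b, h, s, v]
```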
Step S102: extracting the MSER regions of the original image and of each single-channel image based on the MSER algorithm.
In this embodiment, to speed up extraction, the parameters of the MSER algorithm are set as follows: the threshold step is 5, the minimum MSER area is 80, and the maximum MSER area is 14400. Since the MSER algorithm is a well-known image extraction method in the art, its specific extraction process is not repeated here.
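For illustration, the stated parameter settings map directly onto OpenCV's MSER implementation; this sketch is an assumption about the toolchain, not part of the disclosure:

```python
import cv2

# Parameters from the description: threshold step (delta) 5,
# minimum region area 80, maximum region area 14400.
mser = cv2.MSER_create(5, 80, 14400)

def extract_mser_regions(gray):
    # detectRegions returns the pixel lists of the stable regions
    # together with their bounding boxes.
    regions, boxes = mser.detectRegions(gray)
    return regions, boxes
```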
Step S103: introducing the local-contrast text feature, calculating the local contrast between each MSER region and its background region, and determining from each local contrast whether to filter out the corresponding MSER region.
In this embodiment, not all MSER regions extracted in step S102 are text regions. The applicant has found that, for text to be recognisable, it must have a certain contrast with its background, and that the contrast of a text region with its background differs from that of a non-text region with its background, the former being higher than the latter. Based on this characteristic, the invention introduces the local-contrast feature to filter out non-text regions. First, the local contrast lc between each MSER region and its background region can be calculated using formula (3):
lc = |Σ_{i=1..n}(R_i + G_i + B_i) - Σ_{j=1..k}(R_j + G_j + B_j)| / max(Σ_{i=1..n}(R_i + G_i + B_i), Σ_{j=1..k}(R_j + G_j + B_j))   (3)
where n is the number of pixels in the corresponding MSER region, k is the number of pixels in the corresponding background region, R, G and B are the values of the red, green and blue channels of the image containing the region, i indexes the pixels of the MSER region, and j indexes the pixels of the background region.
Then, whether to filter out each MSER region can be decided from the magnitude of its local contrast: for each MSER region, if its local contrast is below a first preset threshold, the region is filtered out; otherwise it is retained. The applicant has found that the local contrast lc of non-text regions is generally below 0.35, i.e. the first preset threshold can be 0.35. Although the multi-channel extraction of step S101 provides a rich data basis for obtaining more text regions, it also introduces more non-text regions; by filtering the MSER regions with local contrast, the invention excludes part of the non-text interference among them and improves the accuracy of text region extraction. Moreover, filtering out non-text regions by local contrast runs in linear time, so the filtering time is short, which provides a basis for real-time caption translation.
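A direct transcription of formula (3) and the 0.35 threshold test, as a sketch:

```python
import numpy as np

def local_contrast(region_rgb, background_rgb):
    # Formula (3): absolute difference of the summed R+G+B values of the
    # region and its background, normalised by the larger of the two sums.
    # region_rgb: (n, 3) array of the MSER pixels' RGB values;
    # background_rgb: (k, 3) array of the background pixels' RGB values.
    s_region = float(np.sum(region_rgb))
    s_background = float(np.sum(background_rgb))
    return abs(s_region - s_background) / max(s_region, s_background)

# A candidate region is filtered out when local_contrast(...) < 0.35,
# the first preset threshold given in the description.
```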
Step S104: introducing the boundary-key-point text feature and determining the boundary key points of each MSER region.
In this embodiment, the boundary key points of each MSER region are determined as follows. First, each MSER region is binarised: for each region, the grey value of the pixels detected as belonging to the MSER is set to 255 and the grey value of all other pixels is set to 0. Then each pixel of the region is traversed; if a pixel has grey value 255 and at least one of its neighbouring pixels has grey value 0, the pixel is determined to be a contour point. The traversal can proceed from left to right and top to bottom: if a pixel has grey value p(x, y) = 255 and the value 0 occurs among the grey values of its right neighbour p(x+1, y), left neighbour p(x-1, y), upper neighbour p(x, y+1) and lower neighbour p(x, y-1), the pixel is determined to be a contour point, where x and y denote the pixel's X-axis and Y-axis coordinates.
After all contour points of at least one MSER region have been obtained, the contour points are compressed with the Douglas-Peucker algorithm to remove redundant points, yielding the boundary key points of the corresponding region, as shown in Fig. 3. The compression can be run each time the contour points of one MSER region are obtained (yielding that region's boundary key points, i.e. the contour points remaining after the redundant points are removed), each time the contour points of a preset number of MSER regions are obtained, or once after the contour points of all MSER regions have been obtained. The applicant has found that the number k of boundary key points of an English letter generally lies between 5 and 16, i.e. the preset range is 5 to 16; when translating English, an MSER region whose boundary key point count is below 5 or above 16 can be determined to be a non-text region.
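As a sketch of the contour tracing and compression, assuming OpenCV's contour utilities stand in for the per-pixel traversal; the tolerance `epsilon` is an assumed value:

```python
import cv2

def boundary_key_points(mask, epsilon=2.0):
    # mask: binary image of one MSER region (255 inside, 0 elsewhere).
    # Trace the outer contour, then compress it with the Douglas-Peucker
    # algorithm (cv2.approxPolyDP) to remove redundant points.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    return [cv2.approxPolyDP(c, epsilon, True) for c in contours]

# Per the description, when translating English a candidate whose key-point
# count falls outside the range 5..16 can be rejected as non-text.
```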
Step S105: using the boundary key points as a classification feature, screening the surviving MSER regions with the trained SVM to obtain text regions.
In this embodiment, after the threshold filtering of step S103, the invention uses not only the boundary key points as classification features but also the aspect ratio (w/h), the area-to-perimeter ratio (√a / p), the convex hull area ratio (a_c / a) and the stroke-width area ratio (w_s / a) of each surviving MSER region, so as to obtain the text regions, where w is the width of the MSER region, h its height, √a the square root of its area, p its perimeter, a_c the area of its convex hull (a common concept in image processing, not elaborated here), a its area, and w_s the stroke width of the image. For the best screening effect, the training parameters can be set as follows: the kernel function is the radial basis function (RBF), and the number of iterations is 100. By classifying the surviving MSER regions with the trained SVM classifier, the accuracy of text region acquisition can be improved. In addition, to let the SVM achieve the best classification effect, the ratio of positive to negative samples during training is kept at 1:3, where the positive samples are the letters of the target translation language (for example, when the target language is English, 'a'-'z' and 'A'-'Z') and the Arabic numerals ('0'-'9'), and the negative samples are non-text regions identified and labelled manually among the MSER regions extracted in step S102 from the original image and the single-channel images. This yields the best screening effect and further improves the accuracy of text region acquisition.
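For illustration, a training sketch with scikit-learn standing in for the SVM described here; the feature ordering is an assumption:

```python
import numpy as np
from sklearn.svm import SVC

def train_text_classifier(pos_features, neg_features):
    # Assumed feature vector per candidate region:
    # [key-point count, w/h, sqrt(a)/p, a_c/a, w_s/a].
    # The description keeps positives to negatives at 1:3 and uses an
    # RBF kernel with 100 training iterations.
    X = np.vstack([pos_features, neg_features])
    y = np.hstack([np.ones(len(pos_features)), np.zeros(len(neg_features))])
    clf = SVC(kernel="rbf", max_iter=100)
    clf.fit(X, y)
    return clf
```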
Among the set of contour pixels of a region, connecting a subset of those points in a certain order can reproduce the region with a minimum number of pixels; that minimal set is what the present invention calls the boundary key points. Because rotation and scaling of the image do not affect its boundary key points, using them as classification features allows non-text interference among the MSER regions to be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to image rotation, changes of size, and the like.
Step S106: grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions, and classifying the text regions within each line according to the distance between each pair of adjacent regions on the same line.
In this embodiment, with reference to Fig. 4 and Fig. 5, when grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions, the distance d_v between each pair of vertically adjacent text regions is first calculated according to formula (4):
d_v = (b_1 - t_2) / h_2   (4)
where b_1 is the Y-axis coordinate of the bottom of the upper of the two adjacent text regions, t_2 is the Y-axis coordinate of the top of the lower region, and h_2 is the height of the lower region, as shown in Fig. 4. Then, for each pair of vertically adjacent text regions, if the distance d_v between them exceeds the second preset threshold, the two regions are assigned to the same text line; otherwise they are assigned to different text lines. The applicant has found that two vertically adjacent text regions belong to the same text line when d_v exceeds 0.62, i.e. the second preset threshold can be 0.62.
In addition, when distinguishing the words on one text line according to the distance between each pair of adjacent text regions, the distance d_h between each pair of adjacent regions on the line is first calculated according to formula (5):
d_h = w̄ / Δd   (5)
where w̄ is the average of all character pitches of the line and Δd is the distance along the X axis between the adjacent letters of the two adjacent text regions, i.e. the interval between the two letters. Then, for each pair of adjacent text regions on the line, if the distance d_h between them exceeds the third preset threshold, the adjacent letters of the two regions belong to the same word and the two regions are assigned to the same class; otherwise the adjacent letters belong to different words and the regions are assigned to different classes. Because different letters of one word may fall into different text regions when the text regions are obtained, and because the interval between words differs markedly from the interval between letters inside a word, as shown in Fig. 5, distinguishing the words on each text line after the text lines have been separated improves word recognition accuracy. This two-tier classification algorithm, text-line grouping in the vertical direction followed by word classification within each line in the horizontal direction, greatly reduces the time complexity (that of comparable algorithms is O(n²), while that of the invention is O(n log₂ n)) and speeds up word recognition, providing the basis for real-time caption translation. The applicant has found that the adjacent letters of two adjacent text regions belong to the same word when d_h exceeds 2.33, so the third preset threshold can be 2.33.
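The two grouping tests of formulas (4) and (5), with the thresholds found by the applicant, reduce to two one-line predicates; a sketch:

```python
def same_text_line(b1, t2, h2, threshold=0.62):
    # Formula (4): d_v = (b1 - t2) / h2. A value above the second preset
    # threshold (0.62) puts two vertically adjacent regions on one line.
    return (b1 - t2) / h2 > threshold

def same_word(mean_pitch, delta_d, threshold=2.33):
    # Formula (5): d_h = mean pitch / gap between adjacent letters. A value
    # above the third preset threshold (2.33) means the letters share a word.
    return mean_pitch / delta_d > threshold
```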
Step S107: performing real-time caption translation based on the classified text regions.
In this embodiment, once the classified text regions have been obtained, text recognition can be performed with the open-source framework Tesseract; for unified system management, Tesseract needs to be integrated with the OpenCV image-processing runtime library. After the text has been recognised, it can be delivered as a string to the interface provided by Google Translate; the translation result is obtained and finally displayed to the user, thereby achieving real-time caption translation.
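A final sketch of the recognition and hand-off step; `translate` is a hypothetical callback around whatever translation interface is available (the patent names the one provided by Google Translate):

```python
import pytesseract

def recognise_and_translate(line_image, translate):
    # Recognise one grouped text line with the open-source Tesseract engine,
    # then hand the recognised string to the translation callback.
    text = pytesseract.image_to_string(line_image, lang="eng").strip()
    return translate(text) if text else ""
```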
As can be seen from the above embodiment, before text recognition the present invention first introduces multiple single-channel images, making effective use of the colour information of the original image and providing richer basic data for text region extraction; it then introduces the local-contrast text feature to threshold-filter the MSER regions extracted from the original image and the single-channel images, improving the accuracy of text region extraction, and since local-contrast filtering runs in linear time, the filtering is fast and provides a basis for real-time caption translation. Boundary key points are introduced as SVM screening features, so that non-text interference among the MSER regions can be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to rotation and scaling. After the local-contrast threshold filtering, the surviving MSER regions are screened by the trained SVM classifier, further improving the accuracy of text region extraction. For the text regions obtained after screening, the two-tier classification algorithm, vertical text-line grouping followed by horizontal word classification within each line, greatly reduces time complexity and speeds up word recognition, providing the basis for real-time caption translation. The present invention can therefore achieve accurate real-time translation of video captions.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practising the invention disclosed herein. The application is intended to cover any modifications, uses or adaptations of the invention that follow its general principles and include common knowledge or customary techniques in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the invention is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.

Claims (10)

1. A video caption real-time translation method, characterised by comprising:
performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images;
extracting the MSER regions of the original image and of each single-channel image based on the maximally stable extremal region (MSER) algorithm;
introducing a local-contrast text feature: calculating the local contrast between each MSER region and its background region, and determining from each local contrast whether to filter out the corresponding MSER region;
introducing a boundary-key-point text feature: determining the boundary key points of each MSER region;
using the boundary key points as a classification feature, screening the MSER regions that survive filtering with a trained support vector machine (SVM) to obtain text regions;
grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions, and classifying the text regions within each line according to the distance between each pair of adjacent regions on the same line;
performing real-time caption translation based on the classified text regions.
2. The video caption real-time translation method according to claim 1, characterised in that, before performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images, the method further comprises: pre-processing the original image, including sharpening and blurring.
3. The video caption real-time translation method according to claim 2, characterised in that performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images comprises: extracting the six channels R, G, B, H, S and V from the original image and from the pre-processed original image respectively, thereby obtaining the multiple single-channel images.
4. The video caption real-time translation method according to claim 1, characterised in that calculating the local contrast between each MSER region and its background region, and determining from each local contrast whether to filter out the corresponding MSER region, comprises:
calculating the local contrast lc between each MSER region and its background according to the following formula:
lc = |Σ_{i=1..n}(R_i + G_i + B_i) - Σ_{j=1..k}(R_j + G_j + B_j)| / max(Σ_{i=1..n}(R_i + G_i + B_i), Σ_{j=1..k}(R_j + G_j + B_j))
where n is the number of pixels in the corresponding MSER region, k is the number of pixels in the corresponding background region, R, G and B are the values of the red, green and blue channels of the image containing the region, i indexes the pixels of the MSER region, and j indexes the pixels of the background region;
and, for each MSER region, filtering the region out if its local contrast is below a first preset threshold.
5. The video caption real-time translation method according to claim 1, characterised in that determining the boundary key points of each MSER region comprises:
for each MSER region, setting the grey value of the pixels detected as belonging to the MSER to 255 and the grey value of all other pixels to 0;
traversing each pixel of the region; if a pixel has grey value 255 and at least one of its neighbouring pixels has grey value 0, determining that the pixel is a contour point;
after all contour points of at least one MSER region have been obtained, compressing the contour points with the Douglas-Peucker algorithm to remove redundant points, yielding the boundary key points of the corresponding MSER region.
6. The video caption real-time translation method according to claim 1, characterised in that the aspect ratio, area-to-perimeter ratio, convex hull area ratio and stroke-width area ratio of each MSER region that survives filtering are also used as classification features when screening the surviving regions with the trained SVM.
7. The video caption real-time translation method according to claim 6, characterised in that during SVM training the ratio of positive to negative samples is kept at 1:3, wherein the positive samples are the letters and Arabic numerals of the target translation language, and the negative samples are non-text regions identified and labelled manually among the MSER regions extracted from the original image and the multiple single-channel images.
8. The video caption real-time translation method according to claim 1, characterised in that grouping the text regions into text lines according to the distance between each pair of vertically adjacent text regions comprises:
calculating the distance d_v between each pair of vertically adjacent text regions according to the following formula:
d_v = (b_1 - t_2) / h_2
where b_1 is the Y-axis coordinate of the bottom of the upper of the two adjacent text regions, t_2 is the Y-axis coordinate of the top of the lower region, and h_2 is the height of the lower region;
for each pair of vertically adjacent text regions, if the distance d_v between them exceeds a second preset threshold, assigning the two regions to the same text line, and otherwise assigning them to different text lines.
9. The video caption real-time translation method according to claim 1, characterised in that classifying the text regions within each text line according to the distance between each pair of adjacent regions on the line comprises:
calculating the distance d_h between each pair of adjacent text regions on one text line according to the following formula:
d_h = w̄ / Δd
where w̄ is the average of all character pitches of the line and Δd is the distance along the X axis between the adjacent letters of the two adjacent text regions;
for each pair of adjacent text regions on the line, if the distance d_h between them exceeds a third preset threshold, assigning the two regions to the same class, and otherwise assigning them to different classes.
10. The video caption real-time translation method according to claim 1, characterised in that, when capturing the original image from the video, video frames are captured frame by frame, and the bottom two-thirds of the captured frame is used as the original image.
CN201710345936.1A 2017-05-17 2017-05-17 Video caption real time translating method Pending CN107145888A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710345936.1A CN107145888A (en) 2017-05-17 2017-05-17 Video caption real time translating method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710345936.1A CN107145888A (en) 2017-05-17 2017-05-17 Video caption real time translating method

Publications (1)

Publication Number Publication Date
CN107145888A true CN107145888A (en) 2017-09-08

Family

ID=59778166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710345936.1A Pending CN107145888A (en) 2017-05-17 2017-05-17 Video caption real time translating method

Country Status (1)

Country Link
CN (1) CN107145888A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520254A (en) * 2018-03-01 2018-09-11 腾讯科技(深圳)有限公司 A kind of Method for text detection, device and relevant device based on formatted image
CN109284751A (en) * 2018-10-31 2019-01-29 河南科技大学 The non-textual filtering method of text location based on spectrum analysis and SVM
CN111797632A (en) * 2019-04-04 2020-10-20 北京猎户星空科技有限公司 Information processing method and device and electronic equipment
CN113287319A (en) * 2019-01-09 2021-08-20 奈飞公司 Optimizing encoding operations in generating buffer-constrained versions of media subtitles

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054271A (en) * 2009-11-02 2011-05-11 富士通株式会社 Text line detection method and device
CN102542268A (en) * 2011-12-29 2012-07-04 中国科学院自动化研究所 Method for detecting and positioning text area in video
CN102750540A (en) * 2012-06-12 2012-10-24 大连理工大学 Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method
CN103310439A (en) * 2013-05-09 2013-09-18 浙江大学 Method for detecting maximally stable extremal region of image based on scale space
WO2015105755A1 (en) * 2014-01-08 2015-07-16 Qualcomm Incorporated Processing text images with shadows
CN105825216A (en) * 2016-03-17 2016-08-03 中国科学院信息工程研究所 Method of locating text in complex background image
CN105868758A (en) * 2015-01-21 2016-08-17 阿里巴巴集团控股有限公司 Method and device for detecting text area in image and electronic device
US9576196B1 (en) * 2014-08-20 2017-02-21 Amazon Technologies, Inc. Leveraging image context for improved glyph classification

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102054271A (en) * 2009-11-02 2011-05-11 富士通株式会社 Text line detection method and device
CN102542268A (en) * 2011-12-29 2012-07-04 中国科学院自动化研究所 Method for detecting and positioning text area in video
CN102750540A (en) * 2012-06-12 2012-10-24 大连理工大学 Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method
CN103310439A (en) * 2013-05-09 2013-09-18 浙江大学 Method for detecting maximally stable extremal region of image based on scale space
WO2015105755A1 (en) * 2014-01-08 2015-07-16 Qualcomm Incorporated Processing text images with shadows
US9576196B1 (en) * 2014-08-20 2017-02-21 Amazon Technologies, Inc. Leveraging image context for improved glyph classification
CN105868758A (en) * 2015-01-21 2016-08-17 阿里巴巴集团控股有限公司 Method and device for detecting text area in image and electronic device
CN105825216A (en) * 2016-03-17 2016-08-03 中国科学院信息工程研究所 Method of locating text in complex background image

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何东健 (He Dongjian): "Digital Image Processing" (《数字图像处理》), 28 February 2015 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520254A (en) * 2018-03-01 2018-09-11 腾讯科技(深圳)有限公司 A kind of Method for text detection, device and relevant device based on formatted image
CN109284751A (en) * 2018-10-31 2019-01-29 河南科技大学 The non-textual filtering method of text location based on spectrum analysis and SVM
CN113287319A (en) * 2019-01-09 2021-08-20 奈飞公司 Optimizing encoding operations in generating buffer-constrained versions of media subtitles
CN113287319B (en) * 2019-01-09 2024-05-14 奈飞公司 Method and apparatus for optimizing encoding operations
CN111797632A (en) * 2019-04-04 2020-10-20 北京猎户星空科技有限公司 Information processing method and device and electronic equipment
CN111797632B (en) * 2019-04-04 2023-10-27 北京猎户星空科技有限公司 Information processing method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN107451607B (en) A kind of personal identification method of the typical character based on deep learning
CN112818862B (en) Face tampering detection method and system based on multi-source clues and mixed attention
CN104408449B (en) Intelligent mobile terminal scene literal processing method
CN100565559C (en) Image text location method and device based on connected component and support vector machine
CN110008832A (en) Based on deep learning character image automatic division method, information data processing terminal
CN107145888A (en) Video caption real time translating method
CN107256558A (en) The cervical cell image automatic segmentation method and system of a kind of unsupervised formula
CN106651872A (en) Prewitt operator-based pavement crack recognition method and system
CN106875546A (en) A kind of recognition methods of VAT invoice
CN104778238B (en) The analysis method and device of a kind of saliency
JPH0728940A (en) Image segmentation for document processing and classification of image element
CN111461122B (en) Certificate information detection and extraction method
CN110298376A (en) A kind of bank money image classification method based on improvement B-CNN
CN108615058A (en) A kind of method, apparatus of character recognition, equipment and readable storage medium storing program for executing
CN107085726A (en) Oracle bone rubbing individual character localization method based on multi-method denoising and connected component analysis
CN114005123A (en) System and method for digitally reconstructing layout of print form text
CN109544564A (en) A kind of medical image segmentation method
CN112069900A (en) Bill character recognition method and system based on convolutional neural network
CN107633229A (en) Method for detecting human face and device based on convolutional neural networks
CN112434699A (en) Automatic extraction and intelligent scoring system for handwritten Chinese characters or components and strokes
CN106611174A (en) OCR recognition method for unusual fonts
CN110956167A (en) Classification discrimination and strengthened separation method based on positioning characters
CN106339984A (en) Distributed image super-resolution method based on K-means driven convolutional neural network
CN113673541B (en) Image sample generation method for target detection and application
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20170908