CN107145888A - Video caption real time translating method - Google Patents
Video caption real time translating method
- Publication number: CN107145888A
- Application number: CN201710345936.1A
- Authority
- CN
- China
- Prior art keywords
- text
- MSER
- text region
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/60—Type of objects
- G06V20/62—Text, e.g. of license plates, overlay texts or captions on TV images
- G06V20/635—Overlay text, e.g. embedded captions in a TV program
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/58—Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/158—Segmentation of character regions using character size, text spacings or pitch estimation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The present invention provides a video caption real-time translation method, including: performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images; extracting MSER regions from the original image and the multiple single-channel images based on the MSER algorithm; computing the local contrast between each MSER region and its background area and, according to each local contrast, deciding whether to filter out the corresponding MSER region; determining the boundary key points of each MSER region; using the boundary key points as classification features, classifying the MSER regions remaining after filtering with a trained SVM to obtain text regions; grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, and classifying the text regions on each line according to the distance between every two adjacent text regions on the same line; and performing real-time video caption translation based on the classified text regions.
Description
Technical field
The invention belongs to the field of image processing, and in particular relates to a video caption real-time translation method.
Background art
In recent years, text detection and recognition in natural scene images has become a very popular research topic in computer vision, pattern recognition and document analysis. Researchers have proposed a large number of new ideas and methods for extracting text information from natural scene images. However, when translating video captions at present, the time complexity of extracting text information from images is high, so real-time translation of video captions cannot be achieved.
Summary of the invention
The present invention provides a video caption real-time translation method, to solve the problem that real-time translation of video captions cannot currently be achieved because the time complexity of extracting text information from images is high.
According to a first aspect of the embodiments of the present invention, there is provided a video caption real-time translation method, including:
performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images;
extracting the MSER regions of the original image and of the multiple single-channel images based on the maximally stable extremal region (MSER) algorithm;
introducing a local-contrast text feature, computing the local contrast between each MSER region and its background area and, according to each local contrast, deciding whether to filter out the corresponding MSER region;
introducing a boundary-key-point text feature, determining the boundary key points of each MSER region;
using the boundary key points as classification features, classifying the MSER regions remaining after filtering with a trained support vector machine (SVM) to obtain text regions;
grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, and classifying the text regions on each line according to the distance between every two adjacent text regions on the same line;
performing real-time video caption translation based on the classified text regions.
In an optional implementation, before performing multi-channel extraction on the original image captured from the video to obtain multiple single-channel images, the method further includes: preprocessing the original image, including sharpening and blurring.
In another optional implementation, performing multi-channel extraction on the original image captured from the video to obtain multiple single-channel images includes: extracting the six channels R, G, B, H, S and V from the original image and from the preprocessed original image, thereby obtaining multiple single-channel images.
In another optional implementation, computing the local contrast between each MSER region and its background area and deciding, according to each local contrast, whether to filter out the corresponding MSER region includes:
computing the local contrast lc between each MSER region and its background according to formula (3), in which n denotes the number of pixels in the MSER region, k denotes the number of pixels in the corresponding background area, R_i, G_i and B_i denote the values of the red, green and blue channels of the image containing the MSER region, i denotes the i-th pixel of the MSER region, and j denotes the j-th pixel of the background area;
and, for each MSER region, filtering out that region if its local contrast is below a first preset threshold.
In another optional implementation, determining the boundary key points of each MSER region includes:
for each MSER region, setting the gray value of the pixels detected as MSER pixels to 255 and the gray value of all other pixels to 0;
traversing each pixel of the MSER region in turn, and determining a pixel to be a contour point if its gray value is 255 and the gray value of at least one of its neighboring pixels is 0;
after all contour points of at least one MSER region have been obtained, compressing the contour points with the Douglas-Peucker algorithm to remove redundant points, thereby obtaining the boundary key points of the corresponding MSER region.
In another optional implementation, the aspect ratio, area-to-perimeter ratio, convex-hull area ratio and stroke-width area ratio of each MSER region remaining after filtering are also used as classification features when classifying those regions with the trained SVM.
In another optional implementation, during SVM training the ratio of positive to negative samples is kept at 1:3, where the positive samples are the letters and Arabic numerals of the target translation language, and the negative samples are non-text regions identified and labeled manually among the MSER regions extracted from the original image and the multiple single-channel images.
In another optional implementation, grouping the text regions into text lines according to the distance between every two vertically adjacent text regions includes:
computing the distance d_v between every two vertically adjacent text regions according to formula (4), in which b_1 denotes the Y-axis coordinate of the bottom of the upper text region, t_2 denotes the Y-axis coordinate of the top of the lower text region, and h_2 denotes the height of the lower text region;
and, for every two vertically adjacent text regions, assigning them to the same text line if the distance d_v between them is greater than a second preset threshold, and to different text lines otherwise.
In another optional implementation, classifying the text regions on one text line according to the distance between every two adjacent text regions on that line includes:
computing the distance d_h between every two adjacent text regions on the line according to formula (5), in which d̄ denotes the average of all character pitches on the line and Δd denotes the X-axis distance between the adjacent letters of the two adjacent text regions;
and, for every two adjacent text regions on the line, assigning them to the same class if the distance d_h between them is greater than a third preset threshold, and to different classes otherwise.
In another optional implementation, when capturing the original image from the video, video frames are captured frame by frame, and the lower two-thirds of the captured frame is used as the original image.
The beneficial effects of the invention are as follows:
1. Before text recognition, the present invention first introduces multiple single-channel images, effectively exploiting the color information of the original image and providing richer basic data for text region extraction. It then introduces local contrast to threshold-filter the MSER regions extracted from the original image and the single-channel images, which improves the accuracy of text region extraction; since local-contrast filtering runs in linear time, the filtering is fast and lays a foundation for real-time caption translation. Boundary key points are introduced as SVM classification features, so non-text interference in the MSER regions can be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to rotation and scaling. After the local-contrast threshold filtering, the remaining MSER regions are screened by the SVM classifier, further improving the accuracy of text region extraction. For the text regions obtained after screening, the present invention applies a two-tier classification algorithm, first grouping text lines in the vertical direction and then classifying the text regions on each line in the horizontal direction, which greatly reduces time complexity and increases the speed of word recognition. The present invention can therefore achieve accurate real-time translation of video captions.
2. By sharpening the original image, the present invention strengthens the contrast between text regions and their surrounding background, which benefits text detection; by blurring the original image, text regions under complex backgrounds stand out more, which also benefits text detection.
3. By keeping the ratio of positive to negative samples at 1:3 when training the SVM, an optimal screening effect can be obtained, further improving the accuracy of text region extraction.
4. When capturing the original image from the video, the present invention first captures video frames frame by frame and then uses a sub-region of each frame as the original image, which improves recognition accuracy and reduces detection time.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the video caption real-time translation method of the present invention;
Fig. 2 is a schematic diagram of the Laplacian operator templates of the present invention;
Fig. 3 is a schematic diagram of boundary key points;
Fig. 4 is a schematic diagram illustrating the text line constraint parameters;
Fig. 5 is a statistical chart comparing character pitch with word spacing.
Detailed description of the embodiments
In order that those skilled in the art may better understand the technical solutions in the embodiments of the present invention, and that the above objects, features and advantages of the embodiments may become more apparent, the technical solutions in the embodiments are described in further detail below with reference to the accompanying drawings.
In the description of the invention, unless otherwise specified and limited, it should be noted that the term "connection" should be understood broadly: it may be, for example, a mechanical connection, an electrical connection, or a connection between the interiors of two elements; it may be a direct connection, or an indirect connection through an intermediary. For those of ordinary skill in the art, the specific meaning of the above term can be understood according to the specific situation.
Referring to Fig. 1, which is a flowchart of an embodiment of the video caption real-time translation method of the present invention, the method may include:
Step S101: performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images.
In this embodiment, video resources fall into two kinds: local videos that can be played offline, and online videos that require a network connection. For local video, corresponding software containing an offline translation database can be provided to the user; after the user connects the local video to the software, the software recognizes the caption text according to the method of this patent, translates the recognized text with the offline translation database, and returns the translation result to the local video for display. For online video, corresponding software can likewise be provided, or a Web server can be built to offer an online Web service; after the user links the online video to the Web server, the server recognizes the caption text according to the method of this patent, translates the recognized text, and returns the translation result to the online video for display.
In order to realize real-time translation, video frames can be captured frame by frame, and, to improve recognition accuracy and reduce detection time, a sub-region of the captured frame can be used as the original image, for example the bottom two-thirds of the frame.
After the original image is captured from the video, it can first be preprocessed by sharpening and blurring. Sharpening can be performed according to formula (1):
g(x, y) = f(x, y) + c[∇²f(x, y)]   (1)
where g(x, y) and f(x, y) denote the sharpened image and the original image respectively, and the value of c depends on the Laplacian template used: if the template shown in Fig. 2(a) or Fig. 2(b) is used, c = -1; if the two templates shown in Fig. 2(c) are used, c = 1. The sharpened image strengthens the contrast between text regions and their surrounding background, which benefits text detection.
When blurring the original image, Gaussian filtering can be applied according to formula (2), using the Gaussian function:
f(x) = (1 / (√(2π)·σ)) · exp(-(x - μ)² / (2σ²))   (2)
where μ denotes the mean of a random variable obeying the normal distribution and σ² denotes the variance of the random variable x. By blurring the original image, text regions under complex backgrounds stand out more, which benefits text detection.
After preprocessing, the six channels R (red), G (green), B (blue), H (hue), S (saturation) and V (value) can be extracted from the original image and from the preprocessed original image, thereby obtaining multiple single-channel images. By performing multi-channel image extraction, the present invention makes effective use of color information and provides richer basic data for text region extraction, so that the captions translated on the basis of these text regions are more accurate.
Step S102: extracting the MSER regions of the original image and the multiple single-channel images based on the MSER algorithm.
In this embodiment, in order to speed up extraction, the parameters of the MSER algorithm are set as follows: the threshold step is 5, the minimum MSER area is 80, and the maximum MSER area is 14400. Since the MSER algorithm is an image extraction method well known in the art, its specific extraction process is not repeated here.
Step S103: introducing a local-contrast text feature, computing the local contrast between each MSER region and its background area and, according to each local contrast, deciding whether to filter out the corresponding MSER region.
In this embodiment, not all MSER regions extracted in step S102 are text regions. The applicant found that, for text to be recognizable, it must have a certain contrast with its background, and that the contrast of a text region with its background differs from that of a non-text region: the former is higher than the latter. Based on this characteristic, the present invention introduces the local-contrast feature to filter out non-text regions. First, the local contrast lc between each MSER region and its background area can be computed using formula (3), in which n denotes the number of pixels in the MSER region, k denotes the number of pixels in the corresponding background area, R_i, G_i and B_i denote the values of the red, green and blue channels of the image containing the MSER region, i denotes the i-th pixel of the MSER region, and j denotes the j-th pixel of the background area.
Then, according to the local contrast of each MSER region, it can be decided whether to filter out that region: if its local contrast is below a first preset threshold, the region is filtered out; otherwise it is kept. The applicant found that the local contrast lc of non-text regions is generally below 0.35, i.e. the first preset threshold can be 0.35. Although the multi-channel extraction of step S101 provides rich data for obtaining more text regions, it also introduces more non-text regions; by filtering the MSER regions with local contrast, part of the non-text interference can be excluded, improving the accuracy of text region extraction. Moreover, filtering out non-text regions by local contrast runs in linear time, so the filtering is fast and provides a basis for real-time video caption translation.
Step S104: introducing a boundary-key-point text feature and determining the boundary key points of each MSER region.
In this embodiment, when determining the boundary key points, each MSER region is first binarized: for each MSER region, the gray value of the pixels detected as MSER pixels is set to 255 and the gray value of all other pixels to 0. Then each pixel of the MSER region is traversed in turn; if its gray value is 255 and the gray value of at least one of its neighboring pixels is 0, the pixel is determined to be a contour point. Specifically, the pixels of the MSER region can be traversed from top to bottom and from left to right; if the gray value of a pixel is p(x, y) = 255 and the value 0 occurs among the gray value p(x+1, y) of its right neighbor, p(x-1, y) of its left neighbor, p(x, y+1) of its upper neighbor and p(x, y-1) of its lower neighbor, the pixel is determined to be a contour point, where x and y denote the X-axis and Y-axis coordinates of the pixel.
After all contour points of at least one MSER region have been obtained, the contour points are compressed with the Douglas-Peucker algorithm to remove redundant points, yielding the boundary key points of the corresponding MSER region, as shown in Fig. 3. The compression may be applied each time the contour points of one MSER region are obtained (the boundary key points being the contour points remaining after redundant points are removed), each time the contour points of a preset number of MSER regions are obtained, or after the contour points of all MSER regions have been obtained. The applicant found that the number k of boundary key points of an English letter generally lies between 5 and 16, i.e. the preset number range is 5 to 16; when translating English, an MSER region whose number of boundary key points is less than 5 or greater than 16 can be determined to be a non-text region.
Step S105: using the boundary key points as classification features, classifying the MSER regions remaining after filtering with the trained SVM to obtain text regions.
In this embodiment, after the threshold filtering of step S103, the present invention uses, in addition to the boundary key points, the aspect ratio (w/h), the area-to-perimeter ratio (√a/p), the convex-hull area ratio (a_c/a) and the stroke-width area ratio (w_s/a) of each remaining MSER region as classification features to obtain the text regions, where w denotes the width of the MSER region, h its height, p its perimeter, a its area, a_c the area of its convex hull (the convex hull is a generic term in image processing and is not elaborated here), and w_s the stroke width of the image. In order to obtain the best screening effect, the training parameters can be set as follows: the kernel function uses the radial basis function (RBF), and the number of iterations is 100. By classifying the MSER regions remaining after filtering with the trained SVM classifier, the accuracy of text region extraction can be improved. In addition, in order to achieve the best classification effect, the ratio of positive to negative samples during SVM training is kept at 1:3, where the positive samples are the letters of the target translation language (for example, when the target language is English, 'a'-'z' and 'A'-'Z') and the Arabic numerals ('0'-'9'), and the negative samples are non-text regions identified and labeled manually among the MSER regions extracted from the original image and the single-channel images in step S102; an optimal screening effect can thus be obtained, further improving the accuracy of text region extraction.
In the set of contour pixels of a region, connecting a subset of the points in a certain order can restore the region to the greatest extent with the fewest pixels; this minimal subset is what the present invention calls the boundary key points. Since rotation and scaling of the image do not affect its boundary key points, using boundary key points as classification features allows non-text interference in the MSER regions to be excluded even when the image is rotated or scaled, improving the robustness of text region extraction to image rotation and scale changes.
Step S106: grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, and classifying the text regions on each line according to the distance between every two adjacent text regions on the same line.
In this embodiment, with reference to Fig. 4 and Fig. 5, when grouping the text regions into text lines according to the distance between every two vertically adjacent text regions, the distance d_v between every two vertically adjacent text regions can first be computed according to formula (4), in which b_1 denotes the Y-axis coordinate of the bottom of the upper text region, t_2 the Y-axis coordinate of the top of the lower text region, and h_2 the height of the lower text region, as shown in Fig. 4. Then, for every two vertically adjacent text regions, if the distance d_v between them is greater than a second preset threshold, they are assigned to the same text line; otherwise they are assigned to different text lines. The applicant found that when d_v is greater than 0.62 the two adjacent text regions lie on the same text line, so the second preset threshold can be 0.62.
In addition, when distinguishing the words on one text line according to the distance between every two adjacent text regions on that line, the distance d_h between every two adjacent text regions on the line can first be computed according to formula (5), in which d̄ denotes the average of all character pitches on the line and Δd denotes the X-axis distance between the adjacent letters of the two adjacent text regions, i.e. the interval between the two letters. Then, for every two adjacent text regions on the line, if the distance d_h between them is greater than a third preset threshold, the adjacent letters of the two regions belong to the same word and the two regions are assigned to the same class; otherwise the adjacent letters belong to different words and the two regions are assigned to different classes. When text regions are obtained, different letters of the same word may be split into different text regions, and the spacing between words differs markedly from the spacing between letters within a word, as shown in Fig. 5; the present invention therefore distinguishes the words on each text line after the text lines have been distinguished, which improves word recognition accuracy. By using this two-tier classification algorithm of vertical text line grouping followed by horizontal classification within each line, the time complexity is greatly reduced (the time complexity of comparable algorithms is O(n²), while that of the present invention is O(n·log₂n)), increasing the speed of word recognition and providing a basis for real-time video caption translation. The applicant found that when the distance d_h between two adjacent text regions is greater than 2.33, their adjacent letters belong to the same word, so the third preset threshold can be 2.33.
Step S107: performing real-time video caption translation based on the sorted text regions.
In the present embodiment, after the sorted text regions are obtained, text recognition can be performed using the open-source framework Tesseract; at the same time, for unified management of the system, Tesseract needs to be integrated with the OpenCV image-processing runtime library. After the text is recognized, it can be delivered in the form of a character string to the interface provided by Google Translate to obtain the translation result, which is finally displayed to the user, thereby achieving real-time video caption translation.
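The recognition-and-translation step can be wired as a small pipeline. In this sketch `recognize` and `translate` are injected callables (in the embodiment they would wrap Tesseract and the Google Translate interface, respectively); the function name and signature are assumptions for illustration only:

```python
def translate_captions(text_regions, recognize, translate):
    """Step S107 sketch: OCR each sorted text region, then translate the
    recognized strings and return the results for display to the user."""
    recognized = [recognize(region) for region in text_regions]
    # Skip regions where recognition produced no text before translating.
    return [translate(text) for text in recognized if text]
```

Keeping the OCR engine and the translation backend behind these two callables also makes the pipeline easy to test without network access.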
As can be seen from the above embodiment, before recognizing text the present invention first introduces multiple single-channel images, effectively exploiting the color information of the original image to provide richer basic data for text region extraction. It then introduces the local-contrast text feature to threshold-filter the MSER regions extracted from the original image and the multiple single-channel images, which improves the accuracy of text region extraction; since local-contrast filtering runs in linear time, the filtering is fast and provides a basis for real-time caption translation. It further introduces boundary key points as the SVM classification-screening feature, so that non-text MSER regions can be excluded even when the image is rotated or scaled, making text region extraction robust to image rotation and scaling. After the local-contrast threshold filtering has been applied to the MSER regions, the remaining regions are screened by the trained SVM classifier, which further improves the accuracy of text region extraction. For the text regions obtained after screening, the present invention adopts the two-level text sorting algorithm of classifying text lines in the vertical direction and classifying the text regions of each text line in the horizontal direction, which greatly reduces the time complexity and increases the speed of word recognition, laying the foundation for real-time video caption translation. The present invention can therefore translate video captions accurately and in real time.
Other embodiments of the invention will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention that follow its general principles and include such departures from the present disclosure as come within common knowledge or customary technical means in the art. The specification and embodiments are to be considered exemplary only, with the true scope and spirit of the invention indicated by the following claims.
It should be understood that the invention is not limited to the precise constructions described above and illustrated in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the invention is limited only by the appended claims.
Claims (10)
1. A real-time video caption translation method, characterized by comprising:
performing multi-channel extraction on an original image captured from a video to obtain multiple single-channel images;
extracting, based on the maximally stable extremal region (MSER) algorithm, the MSER regions of the original image and of the multiple single-channel images respectively;
introducing a local-contrast text feature, calculating the local contrast between each MSER region and its background region, and determining, according to each local contrast, whether to filter out the corresponding MSER region;
introducing a boundary-key-point text feature and determining the boundary key points of each MSER region;
taking the boundary key points as a classification-screening feature, performing classification screening on each MSER region remaining after the filtering by means of a trained support vector machine (SVM) to obtain text regions;
distinguishing text lines among the text regions according to the distance between every two vertically adjacent text regions, and classifying the text regions of each text line according to the distance between every two adjacent text regions on that line;
performing real-time video caption translation based on the sorted text regions.
2. The real-time video caption translation method according to claim 1, characterized in that before performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images, the method further comprises: preprocessing the original image, including sharpening and blurring.
3. The real-time video caption translation method according to claim 2, characterized in that performing multi-channel extraction on the original image captured from the video to obtain the multiple single-channel images comprises: performing image extraction of the six channels R, G, B, H, S and V on the original image and on the preprocessed original image respectively, thereby obtaining the multiple single-channel images.
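A minimal sketch of the six-channel split using only the standard library (the nested-list image format and the function name are assumptions of this sketch; an implementation would normally use OpenCV's `cv2.split` and `cv2.cvtColor` on array images):

```python
import colorsys

def six_channels(rgb_image):
    """Split an RGB image (nested lists of (r, g, b) tuples with values
    in [0, 1]) into six single-channel images: R, G, B, H, S and V."""
    channels = {name: [] for name in "RGBHSV"}
    for row in rgb_image:
        row_out = {name: [] for name in "RGBHSV"}
        for r, g, b in row:
            h, s, v = colorsys.rgb_to_hsv(r, g, b)  # stdlib RGB -> HSV
            for name, value in zip("RGBHSV", (r, g, b, h, s, v)):
                row_out[name].append(value)
        for name in channels:
            channels[name].append(row_out[name])
    return channels
```

Each returned channel has the same height and width as the input, so the MSER algorithm can be applied to it exactly as to a grayscale image.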
4. The real-time video caption translation method according to claim 1, characterized in that calculating the local contrast between each MSER region and its background region, and determining according to each local contrast whether to filter out the corresponding MSER region, comprises:
calculating the local contrast lc between each MSER region and its background according to the following formula:
$$lc = \frac{\left| \sum_{i=1}^{n} (R_i + G_i + B_i) - \sum_{j=1}^{k} (R_j + G_j + B_j) \right|}{\max\left( \sum_{i=1}^{n} (R_i + G_i + B_i),\ \sum_{j=1}^{k} (R_j + G_j + B_j) \right)}$$
where n represents the number of pixels in the corresponding MSER region, k represents the number of pixels in the corresponding background region, R, G and B represent the values of the red, green and blue channels of the image in which the MSER region lies, i indexes the i-th pixel of the corresponding MSER region, and j indexes the j-th pixel of the corresponding background region;
for each MSER region, if the local contrast of the MSER region is less than a first predetermined threshold, the MSER region is filtered out.
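For illustration, the formula of this claim can be computed directly from pixel lists; the function name and the data format are assumptions of this sketch:

```python
def local_contrast(region_pixels, background_pixels):
    """Local contrast lc between an MSER region and its background:
    |sum(R+G+B) over the region - sum(R+G+B) over the background|,
    divided by the larger of the two sums. Pixels are (R, G, B) tuples."""
    s_region = sum(r + g + b for r, g, b in region_pixels)
    s_background = sum(r + g + b for r, g, b in background_pixels)
    return abs(s_region - s_background) / max(s_region, s_background)
```

A region whose lc falls below the first predetermined threshold would then be filtered out; note that lc always lies in [0, 1] by construction.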
5. The real-time video caption translation method according to claim 1, characterized in that determining the boundary key points of each MSER region comprises:
for each MSER region, setting the gray value of the pixels detected as MSER pixels in the MSER region to 255 and the gray value of the other pixels to 0;
traversing each pixel of the MSER region in turn, and if the gray value of a pixel is 255 and the gray value of at least one of its neighboring pixels is 0, determining that the pixel is a contour point;
after all contour points of at least one MSER region have been obtained, compressing the contour points using the Douglas-Peucker algorithm to remove redundant points and obtain the boundary key points of the corresponding MSER region.
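The Douglas-Peucker compression named in this claim can be sketched as a short recursive function (in practice OpenCV's `cv2.approxPolyDP` provides the same operation); this pure-Python version is for illustration only:

```python
def rdp(points, eps):
    """Douglas-Peucker: compress a polyline to its key points, keeping a
    point only if its perpendicular distance from the chord between the
    endpoints exceeds eps."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]
    dx, dy = x2 - x1, y2 - y1
    norm = (dx * dx + dy * dy) ** 0.5 or 1.0
    dmax, idx = 0.0, 0
    for i, (x, y) in enumerate(points[1:-1], 1):
        d = abs(dy * (x - x1) - dx * (y - y1)) / norm  # distance to chord
        if d > dmax:
            dmax, idx = d, i
    if dmax > eps:
        # Keep the farthest point and recurse on the two halves.
        return rdp(points[:idx + 1], eps)[:-1] + rdp(points[idx:], eps)
    return [points[0], points[-1]]
```

Collinear contour points collapse to the two endpoints, while genuine corners survive, which is exactly the "redundant point" removal the claim describes.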
6. The real-time video caption translation method according to claim 1, characterized in that the aspect ratio, the area-to-perimeter ratio, the convex hull area ratio and the stroke width area of each MSER region remaining after the filtering are also used as classification-screening features when performing classification screening on each remaining MSER region by means of the trained SVM.
7. The real-time video caption translation method according to claim 6, characterized in that during SVM training the quantity ratio of positive samples to negative samples is controlled at 1:3, wherein the positive samples are the letters and Arabic numerals corresponding to the target translation language, and the negative samples are the non-text regions identified and labeled manually among the MSER regions extracted from the original image and from the multiple single-channel images respectively.
8. The real-time video caption translation method according to claim 1, characterized in that distinguishing text lines among the text regions according to the distance between every two vertically adjacent text regions comprises:
calculating the distance d_v between every two vertically adjacent text regions according to the following formula:
$$d_v = \frac{b_1 - t_2}{h_2}$$
where b_1 represents the Y-axis coordinate of the bottom of the upper of two vertically adjacent text regions, t_2 represents the Y-axis coordinate of the top of the lower of the two adjacent text regions, and h_2 represents the height of the lower adjacent text region;
for every two vertically adjacent text regions, if the distance d_v between them is greater than a second predetermined threshold, the two adjacent text regions are classified into the same text line; otherwise, they are classified into different text lines.
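As a sketch of this claim's test (the tuple format and the default threshold value are assumptions; the second predetermined threshold is not fixed in the claim):

```python
def same_text_line(upper, lower, second_threshold=0.0):
    """upper, lower: (top_y, bottom_y) of two vertically adjacent text
    regions in image coordinates (Y grows downward).
    d_v = (b1 - t2) / h2; regions whose d_v exceeds the threshold are
    placed on the same text line."""
    b1 = upper[1]                  # bottom of the upper region
    t2, b2 = lower                 # top and bottom of the lower region
    d_v = (b1 - t2) / (b2 - t2)    # h2 = b2 - t2
    return d_v > second_threshold
```

With a threshold of 0, vertically overlapping regions (b1 below t2 in image coordinates) land on the same line, and vertically separated regions do not.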
9. The real-time video caption translation method according to claim 1, characterized in that classifying the text regions of each text line according to the distance between every two adjacent text regions on that line comprises:
calculating the distance d_h between every two adjacent text regions of the same text line according to the following formula:
$$d_h = \frac{\bar{w}}{\Delta d}$$
where $\bar{w}$ represents the average of all character pitches of the text line, and $\Delta d$ represents the distance difference in the X-axis direction between the adjacent letters of the two adjacent text regions;
for every two adjacent text regions of the same text line, if the distance d_h between them is greater than a third predetermined threshold, the two adjacent text regions are classified into the same class; otherwise, they are classified into different classes.
10. The real-time video caption translation method according to claim 1, characterized in that when capturing the original image from the video, video frames are captured frame by frame, and the lower two-thirds region of the captured video frame is taken as the original image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710345936.1A CN107145888A (en) | 2017-05-17 | 2017-05-17 | Video caption real time translating method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107145888A true CN107145888A (en) | 2017-09-08 |
Family
ID=59778166
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710345936.1A Pending CN107145888A (en) | 2017-05-17 | 2017-05-17 | Video caption real time translating method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107145888A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102054271A (en) * | 2009-11-02 | 2011-05-11 | 富士通株式会社 | Text line detection method and device |
CN102542268A (en) * | 2011-12-29 | 2012-07-04 | 中国科学院自动化研究所 | Method for detecting and positioning text area in video |
CN102750540A (en) * | 2012-06-12 | 2012-10-24 | 大连理工大学 | Morphological filtering enhancement-based maximally stable extremal region (MSER) video text detection method |
CN103310439A (en) * | 2013-05-09 | 2013-09-18 | 浙江大学 | Method for detecting maximally stable extremal region of image based on scale space |
WO2015105755A1 (en) * | 2014-01-08 | 2015-07-16 | Qualcomm Incorporated | Processing text images with shadows |
CN105825216A (en) * | 2016-03-17 | 2016-08-03 | 中国科学院信息工程研究所 | Method of locating text in complex background image |
CN105868758A (en) * | 2015-01-21 | 2016-08-17 | 阿里巴巴集团控股有限公司 | Method and device for detecting text area in image and electronic device |
US9576196B1 (en) * | 2014-08-20 | 2017-02-21 | Amazon Technologies, Inc. | Leveraging image context for improved glyph classification |
2017-05-17: Application filed (CN201710345936.1A), published as CN107145888A; status: active, Pending
Non-Patent Citations (1)
Title |
---|
He Dongjian (何东健): "Digital Image Processing" (《数字图像处理》), 28 February 2015 *
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108520254A (en) * | 2018-03-01 | 2018-09-11 | 腾讯科技(深圳)有限公司 | A kind of Method for text detection, device and relevant device based on formatted image |
CN109284751A (en) * | 2018-10-31 | 2019-01-29 | 河南科技大学 | The non-textual filtering method of text location based on spectrum analysis and SVM |
CN113287319A (en) * | 2019-01-09 | 2021-08-20 | 奈飞公司 | Optimizing encoding operations in generating buffer-constrained versions of media subtitles |
CN113287319B (en) * | 2019-01-09 | 2024-05-14 | 奈飞公司 | Method and apparatus for optimizing encoding operations |
CN111797632A (en) * | 2019-04-04 | 2020-10-20 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
CN111797632B (en) * | 2019-04-04 | 2023-10-27 | 北京猎户星空科技有限公司 | Information processing method and device and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107451607B (en) | A kind of personal identification method of the typical character based on deep learning | |
CN112818862B (en) | Face tampering detection method and system based on multi-source clues and mixed attention | |
CN104408449B (en) | Intelligent mobile terminal scene literal processing method | |
CN100565559C (en) | Image text location method and device based on connected component and support vector machine | |
CN110008832A (en) | Based on deep learning character image automatic division method, information data processing terminal | |
CN107145888A (en) | Video caption real time translating method | |
CN107256558A (en) | The cervical cell image automatic segmentation method and system of a kind of unsupervised formula | |
CN106651872A (en) | Prewitt operator-based pavement crack recognition method and system | |
CN106875546A (en) | A kind of recognition methods of VAT invoice | |
CN104778238B (en) | The analysis method and device of a kind of saliency | |
JPH0728940A (en) | Image segmentation for document processing and classification of image element | |
CN111461122B (en) | Certificate information detection and extraction method | |
CN110298376A (en) | A kind of bank money image classification method based on improvement B-CNN | |
CN108615058A (en) | A kind of method, apparatus of character recognition, equipment and readable storage medium storing program for executing | |
CN107085726A (en) | Oracle bone rubbing individual character localization method based on multi-method denoising and connected component analysis | |
CN114005123A (en) | System and method for digitally reconstructing layout of print form text | |
CN109544564A (en) | A kind of medical image segmentation method | |
CN112069900A (en) | Bill character recognition method and system based on convolutional neural network | |
CN107633229A (en) | Method for detecting human face and device based on convolutional neural networks | |
CN112434699A (en) | Automatic extraction and intelligent scoring system for handwritten Chinese characters or components and strokes | |
CN106611174A (en) | OCR recognition method for unusual fonts | |
CN110956167A (en) | Classification discrimination and strengthened separation method based on positioning characters | |
CN106339984A (en) | Distributed image super-resolution method based on K-means driven convolutional neural network | |
CN113673541B (en) | Image sample generation method for target detection and application | |
CN110929746A (en) | Electronic file title positioning, extracting and classifying method based on deep neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20170908 |