CN113657375B - Bottled object text detection method based on 3D point cloud

Publication number: CN113657375B
Authority: CN (China)
Legal status: Active
Application number: CN202110769157.0A
Other languages: Chinese (zh)
Other versions: CN113657375A
Inventors: 赵凡 (Zhao Fan), 李海宁 (Li Haining), 闻治泉 (Wen Zhiquan), 景翠宁 (Jing Cuining)
Current Assignee: Xi'an University of Technology
Original Assignee: Xi'an University of Technology
Application filed by Xi'an University of Technology; priority to CN202110769157.0A
Publication of CN113657375A and CN113657375B; application granted

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques
    • G06F18/232Non-hierarchical techniques
    • G06F18/2321Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering


Abstract

The invention discloses a bottled object text detection method based on a 3D point cloud. Three-dimensional reconstruction is performed on an acquired image sequence of a bottled product to generate 3D point cloud data of the curved-surface bottled product. To improve the expressive power of the text features, RGB color features and SWT stroke-width features, which are highly discriminative on bottled products, are fused with the 3D point cloud spatial coordinate features. To apply existing image segmentation techniques to the 3D point cloud, the 3D point cloud data are mapped onto a pseudo image by a graph-drawing technique, and text instance segmentation is then performed on the pseudo image with a U-Net network. The method can accurately detect the curved text on bottled objects commonly found in pharmacies, supermarkets, cosmetic shops and the like, and experimental results demonstrate its accuracy in detecting text on curved-surface bottled products.

Description

Bottled object text detection method based on 3D point cloud
Technical Field
The invention belongs to the technical field of image processing, and relates to a bottled object text detection method based on a 3D point cloud.
Background
With the development of deep learning theory and computer vision technology, text detection in natural scenes is widely applied to automatic navigation, product identification, language translation and the like. Existing scene text detection methods can, to a certain extent, accurately detect text of arbitrary direction, size and shape in natural scenes, but they perform poorly on bottled products commonly found in pharmacies, supermarkets, cosmetic shops and the like. Because existing scene text detection methods cannot accurately detect curved text using only the 2D information of an image, a bottled object text detection method based on a 3D point cloud is needed so that the curved text on bottled objects can be detected effectively.
Disclosure of Invention
The invention aims to provide a bottled object text detection method based on a 3D point cloud, which solves the problem that existing scene text detection methods cannot accurately detect the text on curved bottled objects using only the 2D information of an image, and improves the performance of text detection algorithms on curved bottled products.
The technical scheme adopted by the invention is that the bottled object text detection method based on the 3D point cloud specifically comprises the following steps:
Step 1, define a variable N_obj for the total number of bottled objects, a pseudo-image set variable Pimg and a bottled-object counter variable n_obj; initialize Pimg to the empty set, Pimg = NULL, and set n_obj to 1, i.e. n_obj = 1;
Step 2, for the n_obj-th bottled object obj_{n_obj}, perform multi-view image acquisition to obtain a curved-surface scene image sequence Img = {I_1, ..., I_k, ..., I_K}, where K is the number of acquired multi-view images;
Step 3, apply a 3D point cloud generation method (OpenMVG + PMVS) to the curved-surface scene image sequence Img to generate 3D point cloud data PS_1 = {p_1, ..., p_{N_1}}, where N_1 is the number of 3D points in PS_1; at the same time obtain the projection relation matrix H_k between PS_1 and each image I_k of Img, forming a projection relation matrix set HS = {H_1, ..., H_k, ..., H_K};
Step 4, down-sample the point cloud data PS_1 to obtain sampled point cloud data PS_2 = {p_1, ..., p_{n2}, ..., p_{N_2}} and label samples in PS_2, where N_2 is the number of 3D points in PS_2; the spatial position feature of a point p_{n2} is sp_{n2} = (x_{n2}, y_{n2}, z_{n2}), where x_{n2}, y_{n2} and z_{n2} are the x, y and z coordinate values of the 3D point p_{n2};
Step 5, randomly extract an image I_k from Img and obtain the 2D point set PI_k = {pi_1, ..., pi_{n2}, ..., pi_{N_2}} corresponding to the point cloud data PS_2 in the 2D image I_k according to the projection relation matrix H_k;
Step 6, compute, for each point pi_{n2} of the 2D point set PI_k in image I_k, the RGB color feature rgb_{n2} = (R_{n2}, G_{n2}, B_{n2}) and the stroke width feature sw_{n2}, where R_{n2}, G_{n2} and B_{n2} are the R, G and B channel values of the point pi_{n2};
Step 7, fuse the spatial position feature sp_{n2} of the 3D point p_{n2} with the RGB color feature rgb_{n2} and the stroke width feature sw_{n2} of the 2D point pi_{n2}, generating the fused feature f_{n2} = (x_{n2}, y_{n2}, z_{n2}, R_{n2}, G_{n2}, B_{n2}, sw_{n2}) of the point p_{n2};
Step 8, call the library function spring_layout() of the Networkx package in the Python programming language to map the point cloud data PS_2, i.e. map the points of PS_2 onto a 2D grid pseudo image PImg_{n_obj}, in which the feature of each pixel is the fused feature f_{n2} of the corresponding 3D point; append PImg_{n_obj} to the pseudo-image set Pimg, i.e. Pimg = Pimg ∪ {PImg_{n_obj}};
Step 9, judge whether n_obj is greater than or equal to N_obj; if n_obj ≥ N_obj, go to step 10; otherwise set n_obj = n_obj + 1 and return to step 2;
Step 10, take the pseudo-image set Pimg as input and train a multi-scale U-Net network to obtain the MSUnet network model M_MSUnet;
Step 11, input a bottled object obj', execute step 2 to collect K' multi-view images of obj' and obtain a curved-surface scene image sequence Img' = {I'_1, ..., I'_{k'}, ..., I'_{K'}}; execute step 3, i.e. apply the OpenMVG + PMVS method to Img', to generate the 3D point cloud data PS'_1 of obj' and the projection relation matrix set HS' between PS'_1 and Img', HS' = {H'_1, ..., H'_{k'}, ..., H'_{K'}};
Step 12, take PS'_1, Img' and HS' as input and execute steps 4-8 to obtain a pseudo image PImg';
Step 13, send the pseudo image PImg' into the MSUnet network model M_MSUnet and output all text instance classification results CL = {cl_1, ..., cl_c, ..., cl_C} and text instance classification scores Score = {sc_1, ..., sc_c, ..., sc_C}, where C is the total number of text instances, cl_c = {q_1, ..., q_{nc}, ..., q_{Nc}} with Nc the number of 3D points in class c of CL and q_{nc} the nc-th 3D point of cl_c, and sc_c holds the classification score of each 3D point of cl_c;
Step 14, refine CL according to a refinement-adjustment mechanism to obtain the adjusted point cloud classification result CL' = {cl'_1, ..., cl'_c, ..., cl'_C}, where Nc' is the number of 3D points in class c of CL';
Step 15, define an image number counter k' and initialize k' = 1;
Step 16, according to H'_{k'} in the projection relation matrix set HS', compute the 2D classification point set CP = {cp_1, ..., cp_c, ..., cp_C} corresponding to CL' in the image I'_{k'}, where the 2D point set cp_c is computed as cp_c = H'_{k'} × cl'_c;
Step 17, execute a text filling algorithm: perform text filling on the 2D classification point set CP in the image I'_{k'} to obtain the text instance classification result of the image I'_{k'}, and at the same time output the set Poly_{k'} of all text instance circumscribing polygon boxes;
Step 18, judge whether k' is less than K'; if k' < K', set k' = k' + 1 and return to step 16; otherwise, end the procedure.
The invention is also characterized in that:
The specific process of the step 4 is as follows:
The specific process of the down-sampling in step 4 is as follows: open the point cloud processing software CloudCompare v2.6.3, click the Open file button on the toolbar and load the 3D point cloud data PS_1; click the Delete button on the toolbar and manually remove the irrelevant background points that do not belong to the bottled object from the 3D point cloud data PS_1, obtaining the point cloud data of interest PS' = {p_1, ..., p_{N'_1}}, where N'_1 is the number of 3D points in PS'; click the Clean button on the toolbar, set the filter parameters mean distance and nSigma in its pull-down menu and perform the SOR filtering operation; click the Subsample button on the toolbar, set the spatial sampling distance parameter space and the number of sampling points N_2 in its pull-down menu and perform the point cloud down-sampling operation to obtain the point cloud data PS_2 = {p_1, ..., p_{n2}, ..., p_{N_2}}, where the point p_{n2} is the n2-th sampling point, 1 ≤ n2 ≤ N_2, and its spatial feature is sp_{n2} = (x_{n2}, y_{n2}, z_{n2});
In step 4, the specific process of sample labeling of the 3D point cloud data PS_2 is as follows: click the Segment button on the toolbar of the point cloud processing software CloudCompare v2.6.3 and, in order from top to bottom and from left to right, manually box-select the point cloud of each text instance in the point cloud data PS_2 with the mouse; click the Add constant SF button on the toolbar and add a label value label to the box-selected text instance point cloud data; after all text instances in the point cloud data PS_2 have been box-selected and labeled, click the Merge multiple clouds button on the toolbar and merge all box-selected text instance point cloud data in PS_2, together with the non-text point cloud data on the bottled object, into the labeled point cloud data PS_LA = {PS_0, PS_1, ..., PS_l, ..., PS_L}, where PS_0 is the non-text point cloud data on the bottled object, PS_l is the l-th text instance point cloud data, L is the total number of text instances in the 3D point cloud data PS_2, and the label value of PS_l is label = l.
The specific process of step 5 is as follows: randomly extract an image I_k from the image sequence Img and, according to the matrix H_k corresponding to image I_k in the projection relation matrix set HS, compute the 2D point set PI_k = {pi_1, ..., pi_{n2}, ..., pi_{N_2}} corresponding to the point cloud data PS_2 in the image I_k. The calculation formula is

d · (u_{n2}, v_{n2}, 1)^T = H_k · (x_{n2}, y_{n2}, z_{n2}, 1)^T

where (u_{n2}, v_{n2}) are the pixel coordinates of the 2D point pi_{n2} and d is the distance from the 3D point p_{n2} to the camera.
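As an illustration of the projection in step 5, the following sketch assumes H_k is the 3 × 4 projection matrix returned by the reconstruction stage and that the homogeneous coordinate is divided out to obtain pixel coordinates; the array and function names are illustrative only:

import numpy as np

def project_points(ps2_xyz, H_k):
    """Project N2 x 3 world points into image I_k with the 3 x 4 matrix H_k."""
    n2 = ps2_xyz.shape[0]
    homog = np.hstack([ps2_xyz, np.ones((n2, 1))])   # N2 x 4 homogeneous points
    proj = (H_k @ homog.T).T                         # N2 x 3, rows are (d*u, d*v, d)
    d = proj[:, 2:3]                                 # distance of each 3D point to the camera
    uv = proj[:, :2] / d                             # pixel coordinates (u, v)
    return uv, d.ravel()

# usage sketch: PI_k, depth = project_points(PS2[:, :3], HS[k])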
The specific steps of the step 6 are as follows:
Step 6.1, for each 2D point pi_{n2} in image I_k, call the image library function of the PIL package in the Python programming language to extract the R, G, B channel values at pixel (u_{n2}, v_{n2}) as the RGB color feature rgb_{n2} = (R_{n2}, G_{n2}, B_{n2}) of the point, where u_{n2} and v_{n2} are the abscissa and ordinate of the 2D point pi_{n2};
Step 6.2, call the stroke width transform library function swttransform() of the SWTloc package in the Python programming language on the image I_k to obtain the stroke width values of all pixels in I_k; the stroke width value at the 2D point pi_{n2} is its SWT stroke width feature sw_{n2};
The specific process of step 7 is as follows: for any point p_{n2} in the 3D point cloud data PS_2, concatenate its coordinate feature sp_{n2} = (x_{n2}, y_{n2}, z_{n2}), RGB color feature rgb_{n2} = (R_{n2}, G_{n2}, B_{n2}) and SWT stroke width feature sw_{n2} column-wise to obtain the fused feature f_{n2} = (x_{n2}, y_{n2}, z_{n2}, R_{n2}, G_{n2}, B_{n2}, sw_{n2}) of the point p_{n2};
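The column-wise serial fusion of step 7 amounts to stacking the 3D coordinates, RGB values and stroke widths into one 7-dimensional vector per point; a minimal sketch with assumed array shapes:

import numpy as np

# assumed shapes: sp (N2, 3) spatial coordinates, rgb (N2, 3) colour values, sw (N2, 1) stroke widths
def fuse_features(sp, rgb, sw):
    """Serially fuse the per-point features into an N2 x 7 matrix (x, y, z, R, G, B, sw)."""
    return np.hstack([sp, rgb, sw])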
The specific steps of the step 8 are as follows:
Step 8.1, take the 3D point cloud data PS_2 as input, set the number of clusters NL and the number of clustering iterations IT, call the clustering library function KMeans() of the Scikit-learn package in the Python programming language, and perform initial clustering of the point cloud data PS_2 to obtain NL initial cluster centers Cen' = {ce'_1, ..., ce'_nl, ..., ce'_NL} and the distance matrix Dist from each point to each cluster center;
Step 8.2, take PS_2, Dist, N_2 and NL as input and refine the initial clustering result with a graph_cut algorithm, obtaining the divided cluster center coordinate set Cen = {ce_1, ..., ce_nl, ..., ce_NL} and the point set Cint_nl = {ci_1, ..., ci_knl, ..., ci_Knl} within the nl-th class, where ce_nl is the center coordinate of the nl-th class, ci_knl is the knl-th point of the nl-th class and Knl is the number of points in Cint_nl;
Step 8.3, call the distance function pdist() of the Scipy package in the Python programming language to compute the Euclidean distances {Dis_1, ..., Dis_{NL×NL}} between the cluster center coordinates in Cen, and call the squareform() function of the Scipy package to convert {Dis_1, ..., Dis_{NL×NL}} into matrix form, obtaining the cluster-center distance matrix Dcc_{NL×NL};
Step 8.4, take Dcc_{NL×NL} as input, construct an undirected graph G_c and perform first-level graph drawing on G_c to generate a first-level 2D grid map Grid_4 of size Wg × Wg, Grid_4 = {g_1, ..., g_nl, ..., g_NL}, where g_nl = (gx_nl, gy_nl) is the nl-th grid point of Grid_4 and gx_nl and gy_nl are its abscissa and ordinate in the 2D grid map Grid_4;
Step 8.5, call the distance function pdist() of the Scipy package in the Python programming language to compute the Euclidean distances between the points of the point set Cint_nl within the nl-th class, and call the squareform() function of the Scipy package to convert them into matrix form, obtaining the intra-cluster point distance matrix Dcc'_nl;
Step 8.6, take Dcc'_nl as input and, following the method of step 8.4, generate a second-level 2D grid map Grid'_nl of size Wg × Wg, Grid'_nl = {g'_1, ..., g'_inl, ..., g'_Inl}, where g'_inl is the inl-th grid point of the nl-th class and Inl is the number of points in the nl-th class;
Step 8.7, call the OpenCV library function cv2.resize() to enlarge each grid point g_nl of Grid_4 into a block Block_nl of size Wg × Wg, and assign the enlarged Grid_4 to Grid_5; Grid_5 consists of Wg × Wg blocks, each of size Wg × Wg;
Step 8.8, embed the second-level 2D grid map Grid'_nl of the nl-th class into the corresponding block Block_nl of Grid_5 in order; the resulting Grid_5 is the 2D pseudo image PImg_{n_obj} of the n_obj-th bottled object, i.e. PImg_{n_obj} = Grid_5.
The specific steps of step 10 are as follows:
Step 10.1, designing MSUnet a network structure;
Step 10.2, define the loss function of the MSUnet network multi-classification task:

L = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_ic · log(p_ic)

where N is the number of training samples, C is the number of classes, y_ic is the sample class indicator (y_ic = 1 if the class of the i-th sample is c, otherwise y_ic = 0) and p_ic is the probability that the i-th sample is predicted to be of class c;
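The multi-class cross-entropy above can be evaluated directly; a small sketch, assuming p is an N × C matrix of predicted class probabilities and y holds the integer class label of each sample:

import numpy as np

def multiclass_cross_entropy(p, y, eps=1e-12):
    """L = -(1/N) * sum_i sum_c y_ic * log(p_ic), with one-hot y built from integer labels."""
    n, c = p.shape
    y_onehot = np.eye(c)[y]                  # y_ic = 1 iff sample i belongs to class c
    return -np.sum(y_onehot * np.log(p + eps)) / n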
Step 10.3, label the MSUnet network samples: the label of each non-blank pixel of the 2D grid pseudo image PImg_{n_obj} is the class label value label of the corresponding point p_{n2} in the point cloud data PS_2, and the label value of a pixel at a blank position is label = 0;
Step 10.4, training MSUnet the network.
The specific steps of step 15 are as follows:
Step 15.1, define an adjusted text instance classification result CL' and a text instance classification result counter c; initialize CL' to the empty set, CL' = NULL, and initialize c to 1, c = 1;
Step 15.2, take the c-th classification result cl_c = {x_1, ..., x_i, ..., x_NP} out of the text instance classification result CL = {cl_1, ..., cl_c, ..., cl_C}, where x_i is the i-th point in cl_c and NP is the total number of points in cl_c;
Step 15.3, for each point x_i in cl_c, compute the distance from x_i to the other points in cl_c:
d_ij = ||x_i - x_j||_2, x_i ∈ cl_c, x_j ∈ cl_c, i ≠ j
Step 15.4, set the hyper-parameter km, select the km smallest distances from x_i to the other points to form a set D_i, and compute the mean d_i of all elements of D_i; all the d_i form the set {d_1, ..., d_i, ..., d_NP};
Step 15.5, compute the mean value mean and the standard deviation stddev of {d_1, ..., d_i, ..., d_NP};
Step 15.6, set the hyper-parameter λ and compute the threshold:
thre = mean + λ × stddev
Step 15.7, judge whether x_i is an outlier: if d_i > thre, x_i is an outlier and is removed from the set cl_c; after all outliers have been removed, assign cl_c to cl'_c;
Step 15.8, judge whether c is greater than C; if c > C, keep CL'; otherwise set c = c + 1, add cl'_c to CL', i.e. CL' = CL' + cl'_c, and return to step 15.2.
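A sketch of the refinement mechanism of steps 15.2 to 15.7 for a single class cl_c, assuming cl_c is an NP × 3 array of 3D points; km and lam stand for the hyper-parameters km and λ:

import numpy as np
from scipy.spatial.distance import cdist

def refine_class(cl_c, km=200, lam=1.5):
    """Remove outliers whose mean distance to their km nearest neighbours exceeds mean + lam * stddev."""
    dists = cdist(cl_c, cl_c)                  # d_ij = ||x_i - x_j||_2
    np.fill_diagonal(dists, np.inf)            # exclude i == j
    nearest = np.sort(dists, axis=1)[:, :km]   # km smallest distances per point
    d = nearest.mean(axis=1)                   # d_i
    thre = d.mean() + lam * d.std()            # thre = mean + lambda * stddev
    return cl_c[d <= thre]                     # keep inliers only, giving cl'_c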
The specific steps of step 17 are as follows:
Step 17.1, input the image I'_{k'}, the classification point set CP = {cp_1, ..., cp_c, ..., cp_C} and the text score set Score = {sc_1, ..., sc_c, ..., sc_C} of the points corresponding to CP, and define a threshold variable T, where cp_c = {cq_1, ..., cq_{nz_c}, ..., cq_{NZ_c}}, cq_{nz_c} is the nz_c-th point of cp_c, sc_c holds the score of each point of cp_c and NZ_c is the number of points in cp_c;
Step 17.2, filter out the points of CP with a low text score according to the threshold T: if the score of a point cq_{nz_c} is less than T, delete the point cq_{nz_c} from cp_c; assign the CP obtained after filtering out all points with a low text score to the variable TR, TR = {tr_1, ..., tr_c, ..., tr_C}, where tr_c = {tq_1, ..., tq_{ntr_c}, ..., tq_{NTR_c}}, tq_{ntr_c} is the ntr_c-th point of tr_c and NTR_c is the number of points in tr_c;
Step 17.3, compute the center point of each class in TR to form the center point set TC, TC = {cen_1, ..., cen_c, ..., cen_C}, where cen_c = mean(tr_c) and mean() is the mean function;
Step 17.4, define the set Poly_{k'} of text polygon bounding boxes in the image I'_{k'} and initialize it to the empty set, Poly_{k'} = NULL; define an image B_k of the same size as the image I'_{k'} with all its pixels assigned the value 0; initialize the text instance class counter c to 1, c = 1;
Step 17.5, with cen_c as the initial seed point, call the flood fill library function floodFill() of OpenCV to fill tr_c and obtain the filled point set trfill_c, and assign the value 1 to the pixels of B_k corresponding to the points of trfill_c;
Step 17.6, call the OpenCV library function cv2.morphologyEx() to apply 5 opening operations to B_k to obtain the image Mopen, and call the OpenCV library function cv2.dilate() to apply 10 dilation operations to the image Mopen to obtain the image Mdilate;
Step 17.7, call the OpenCV library function cv2.connectedComponentsWithStats() to obtain the connected region ConectedRegion of the image Mdilate, call the OpenCV library function cv2.findContours() to obtain the contour ContourPS of the connected region ConectedRegion, and call the OpenCV library function cv2.convexHull() to obtain the convex hull vertex set of the contour ContourPS, i.e. the polygon vertex set CNT_c = {pt_1, ..., pt_npl, ..., pt_NPL} of the c-th text instance, which constitutes the circumscribing polygon box pbox_c of the c-th text instance in image I'_{k'}; NPL is the number of vertices in CNT_c, and each vertex of CNT_c is drawn in the image I'_{k'};
Step 17.8, judge whether c ≤ C; if c ≤ C, set c = c + 1, add pbox_c to Poly_{k'}, i.e. Poly_{k'} = Poly_{k'} ∪ {pbox_c}, and return to step 17.5; otherwise go to step 17.9;
Step 17.9, output the set Poly_{k'} of all text polygon bounding boxes in the image I'_{k'}, Poly_{k'} = {pbox_1, ..., pbox_c, ..., pbox_C}.
The beneficial effects of the invention are as follows: existing scene text detection methods can, to a certain extent, accurately detect text of arbitrary direction, size and shape in natural scenes, but they detect the curved text on bottled objects commonly found in pharmacies, supermarkets, cosmetic shops and the like poorly; by introducing 3D point cloud information and fusing color and stroke width features with the spatial coordinate features, the method of the invention detects such curved text accurately.
Drawings
FIG. 1 is a schematic flow chart of a bottled object text detection method based on a 3D point cloud;
FIG. 2 is a schematic diagram of a 2D pseudo image generation flow based on drawing in a bottled object text detection method based on 3D point cloud;
FIG. 3 is a schematic diagram of a process of embedding a secondary grid pattern into a primary grid pattern in the drawing of the method for detecting the characters of the bottled object based on the 3D point cloud;
Fig. 4 is a schematic diagram of a network structure of MSUnet in a method for detecting characters of a bottled object based on a 3D point cloud according to the present invention;
FIG. 5 is a schematic diagram of a MSUnet network training process in a 3D point cloud-based bottled object text detection method according to the present invention;
FIG. 6 is a schematic diagram of a 3D point cloud refinement adjustment flow in a bottled object text detection method based on a 3D point cloud;
FIG. 7 is a schematic flow chart of a text filling algorithm in a bottled object text detection method based on 3D point cloud;
FIG. 8 is an image of a bottled product in an embodiment of a 3D point cloud based bottled object text detection method of the present invention;
FIG. 9 is an image of another bottled product in an embodiment of a 3D point cloud based bottled object text detection method of the present invention;
FIG. 10 is a diagram showing the results of text detection in an image of the bottled object of FIG. 8 using the method of the present invention, with white boxes being text boxes;
Fig. 11 shows a display of the result of text detection in an image of the bottled object of fig. 9 using the method of the present invention, with white boxes being text boxes.
Detailed Description
The invention will be described in detail below with reference to the drawings and the detailed description.
The invention discloses a bottled object text detection method based on a 3D point cloud, which specifically comprises the following steps:
Step 1, define a variable N_obj for the total number of bottled objects, a pseudo-image set variable Pimg and a bottled-object counter variable n_obj; initialize Pimg to the empty set, Pimg = NULL, and set n_obj to 1, i.e. n_obj = 1;
Step 2, for the n_obj-th bottled object obj_{n_obj}, perform multi-view image acquisition to obtain a curved-surface scene image sequence Img = {I_1, ..., I_k, ..., I_K}, where K is the number of acquired multi-view images; K = 90 in the embodiment of the invention;
Step 3, apply to the curved-surface scene image sequence Img the 3D point cloud generation method OpenMVG + PMVS proposed by Moulon P., Monasse P. et al in the paper "OpenMVG: Open multiple view geometry" at the 2016 International Workshop on Reproducible Research in Pattern Recognition (IWRRPR), generating 3D point cloud data PS_1 = {p_1, ..., p_{N_1}}, where N_1 is the number of 3D points in PS_1; at the same time obtain the projection relation matrix H_k between PS_1 and each image I_k of Img, forming the projection relation matrix set HS = {H_1, ..., H_k, ..., H_K};
The specific process of the down-sampling in step 4 is as follows: open the point cloud processing software CloudCompare v2.6.3, click the Open file button on the toolbar and load the 3D point cloud data PS_1; click the Delete button on the toolbar and manually remove the irrelevant background points that do not belong to the bottled object from the 3D point cloud data PS_1, obtaining the point cloud data of interest PS' = {p_1, ..., p_{N'_1}}, where N'_1 is the number of 3D points in PS'; click the Clean button on the toolbar, set the filter parameters mean distance and nSigma in its pull-down menu and perform the SOR filtering operation, with mean distance = 8 and nSigma = 1.5 in the embodiment of the invention; click the Subsample button on the toolbar, set the spatial sampling distance parameter space and the number of sampling points N_2 in its pull-down menu and perform the point cloud down-sampling operation to obtain the point cloud data PS_2 = {p_1, ..., p_{n2}, ..., p_{N_2}}, where the point p_{n2} is the n2-th sampling point, 1 ≤ n2 ≤ N_2, and its spatial feature is sp_{n2} = (x_{n2}, y_{n2}, z_{n2}); space = 1.585 and N_2 = 8192 in the embodiment of the invention;
In step 4, the specific process of sample labeling of the 3D point cloud data PS_2 is as follows: click the Segment button on the toolbar of the point cloud processing software CloudCompare v2.6.3 and, in order from top to bottom and from left to right, manually box-select the point cloud of each text instance in the point cloud data PS_2 with the mouse; click the Add constant SF button on the toolbar and add a label value label to the box-selected text instance point cloud data; after all text instances in the point cloud data PS_2 have been box-selected and labeled, click the Merge multiple clouds button on the toolbar and merge all box-selected text instance point cloud data in PS_2, together with the non-text point cloud data on the bottled object, into the labeled point cloud data PS_LA = {PS_0, PS_1, ..., PS_l, ..., PS_L}, where PS_0 is the non-text point cloud data on the bottled object, PS_l is the l-th text instance point cloud data, L is the total number of text instances in the 3D point cloud data PS_2, and the label value of PS_l is label = l;
The specific process of step 5 is as follows: randomly extract an image I_k (1 ≤ k ≤ K) from the image sequence Img and, according to the matrix H_k corresponding to image I_k in the projection relation matrix set HS, compute the 2D point set PI_k = {pi_1, ..., pi_{n2}, ..., pi_{N_2}} corresponding to the point cloud data PS_2 in the image I_k. The calculation formula is

d · (u_{n2}, v_{n2}, 1)^T = H_k · (x_{n2}, y_{n2}, z_{n2}, 1)^T

where (u_{n2}, v_{n2}) are the pixel coordinates of the 2D point pi_{n2} and d is the distance from the 3D point p_{n2} to the camera;
The specific steps of the step 6 are as follows:
Step 6.1, for each 2D point pi_{n2} in image I_k, call the image library function of the PIL package in the Python programming language to extract the R, G, B channel values at pixel (u_{n2}, v_{n2}) as the RGB color feature rgb_{n2} = (R_{n2}, G_{n2}, B_{n2}) of the point, where u_{n2} and v_{n2} are the abscissa and ordinate of the 2D point pi_{n2};
Step 6.2, obtain the stroke width values of all pixels in image I_k using the stroke-width-transform-based text detection method proposed by B. Epshtein, E. Ofek et al in the paper "Detecting text in natural scenes with stroke width transform" at the 2010 Computer Society Conference on Computer Vision and Pattern Recognition (CSCCVPR); the stroke width value at the 2D point pi_{n2} is its SWT stroke width feature sw_{n2};
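A sketch of the per-point feature extraction of step 6, using PIL for the colour values; the exact SWTloc call is not reproduced here, so swt_map is assumed to be a per-pixel stroke-width array already obtained from the stroke width transform, and the function and variable names are illustrative:

import numpy as np
from PIL import Image

def point_features(image_path, pi_k, swt_map):
    """Return (N2, 3) RGB features and (N2, 1) stroke-width features for the 2D points PI_k.

    pi_k    : (N2, 2) array of (u, v) pixel coordinates in image I_k
    swt_map : (H, W) array of per-pixel stroke widths from the stroke width transform
    """
    img = Image.open(image_path).convert("RGB")
    rgb = np.array([img.getpixel((int(u), int(v))) for u, v in pi_k], dtype=np.float32)
    sw = np.array([[swt_map[int(v), int(u)]] for u, v in pi_k], dtype=np.float32)
    return rgb, sw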
The specific process of step 7 is as follows: for any point p_{n2} in the 3D point cloud data PS_2, concatenate its coordinate feature sp_{n2} = (x_{n2}, y_{n2}, z_{n2}), RGB color feature rgb_{n2} = (R_{n2}, G_{n2}, B_{n2}) and SWT stroke width feature sw_{n2} column-wise to obtain the fused feature f_{n2} = (x_{n2}, y_{n2}, z_{n2}, R_{n2}, G_{n2}, B_{n2}, sw_{n2}) of the point p_{n2};
The drawing process of the step 8 is shown in fig. 2, and the specific implementation steps are as follows:
Step 8.1, take the 3D point cloud data PS_2 as input, set the number of clusters NL and the number of clustering iterations IT, call the clustering library function KMeans() of the Scikit-learn package in the Python programming language, and perform initial clustering of the point cloud data PS_2 to obtain NL initial cluster centers Cen' = {ce'_1, ..., ce'_nl, ..., ce'_NL} and the distance matrix Dist from each point to each cluster center; NL = 128 and IT = 100 in the embodiment of the invention;
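A sketch of the initial clustering of step 8.1 with Scikit-learn, using the embodiment values NL = 128 and IT = 100 and assuming the clustering is run on the point coordinates; KMeans.transform() yields the point-to-center distance matrix Dist:

from sklearn.cluster import KMeans

def initial_clustering(ps2_xyz, NL=128, IT=100):
    """K-means clustering of the sampled point cloud; returns centers Cen' and the distance matrix Dist."""
    km = KMeans(n_clusters=NL, max_iter=IT, n_init=10, random_state=0).fit(ps2_xyz)
    cen = km.cluster_centers_        # NL initial cluster centers Cen'
    dist = km.transform(ps2_xyz)     # N2 x NL distances from each point to each cluster center
    return cen, dist, km.labels_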
Step 8.2, take PS_2, Dist, N_2 and NL as input and refine the initial clustering result with a graph_cut algorithm, obtaining the divided cluster center coordinate set Cen = {ce_1, ..., ce_nl, ..., ce_NL} and the point set Cint_nl = {ci_1, ..., ci_knl, ..., ci_Knl} within the nl-th class, where ce_nl is the center coordinate of the nl-th class, ci_knl is the knl-th point of the nl-th class, Knl is the number of points in Cint_nl, 1 ≤ nl ≤ NL and knl ≤ Knl;
Step 8.3, call the distance function pdist() of the Scipy package in the Python programming language to compute the Euclidean distances {Dis_1, ..., Dis_{NL×NL}} between the cluster center coordinates in Cen, and call the squareform() function of the Scipy package to convert {Dis_1, ..., Dis_{NL×NL}} into matrix form, obtaining the cluster-center distance matrix Dcc_{NL×NL};
Step 8.4, take Dcc_{NL×NL} as input, construct an undirected graph G_c and perform first-level graph drawing on G_c to generate a first-level 2D grid map Grid_4 of size Wg × Wg, Grid_4 = {g_1, ..., g_nl, ..., g_NL}, where g_nl = (gx_nl, gy_nl) is the nl-th grid point of Grid_4 and gx_nl and gy_nl are its abscissa and ordinate in the 2D grid map Grid_4;
Step 8.4.1, take Dcc_{NL×NL} as input and call the graph construction function from_numpy_matrix() of the Networkx library in the Python programming language to construct an undirected graph G_c = (V, E), where V represents the cluster center coordinate set Cen and E represents the distance matrix Dcc_{NL×NL} between the vertices;
Step 8.4.2, take the graph G_c as input and call the graph drawing function spring_layout() of the Networkx library in the Python programming language to obtain the 2D grid map Grid_1 of the graph G_c, Grid_1 = {a_1, ..., a_nl, ..., a_NL}, where a_nl = (ax_nl, ay_nl) is the nl-th grid point of Grid_1 and ax_nl and ay_nl are its abscissa and ordinate in the 2D grid map Grid_1;
Step 8.4.3, scale the 2D grid map Grid_1 by the scale factor Scale into the grid map Grid_2 of size Wg × Wg, Grid_2 = {b_1, ..., b_nl, ..., b_NL}, where b_nl = (bx_nl, by_nl) is the nl-th grid point of Grid_2 and bx_nl and by_nl are its abscissa and ordinate in the 2D grid map Grid_2. The scale factor Scale is computed as follows: compute the Euclidean distance Disg_ij between every pair of grid points a_i and a_j of Grid_1 and let DisG = {Disg_ij} be the Euclidean distance matrix between the grid points of Grid_1; then scale1 = 1 / min(DisG), scalex = (Wg - 2) / (max(Grid_1.x) - min(Grid_1.x)), scaley = (Wg - 2) / (max(Grid_1.y) - min(Grid_1.y)) and Scale = min(scale1, scalex, scaley), where max() and min() are the maximum and minimum functions; Wg = 16 in the embodiment of the invention;
Step 8.4.4, round the coordinates of the grid points of the 2D grid map Grid_2 to obtain the grid map Grid_3 with integer coordinates, i.e. Grid_3 = {(int(bx_nl), int(by_nl))}, where int() is the rounding function;
Step 8.4.5, adjust the positions of the grid points of Grid_3 whose coordinates coincide to generate the final grid map Grid_4; specifically, if two grid points have coincident coordinates, take one of them out and place it on an unassigned neighbouring grid point;
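A sketch of the first-level graph drawing of steps 8.4.1 to 8.4.5 with Networkx, assuming dcc is the NL × NL cluster-center distance matrix; from_numpy_array() is the name used by recent Networkx versions (older versions call it from_numpy_matrix()), and coinciding cells after rounding are pushed to the next free cell in row-major order:

import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist

def draw_level1_grid(dcc, wg=16, seed=0):
    """Map NL cluster centers onto a wg x wg integer grid via a spring layout."""
    g = nx.from_numpy_array(dcc)                  # undirected graph weighted by center distances
    pos = nx.spring_layout(g, seed=seed)          # Grid_1: 2D layout coordinate per node
    xy = np.array([pos[i] for i in range(len(pos))])
    scale1 = 1.0 / pdist(xy).min()
    scalex = (wg - 2) / (xy[:, 0].max() - xy[:, 0].min())
    scaley = (wg - 2) / (xy[:, 1].max() - xy[:, 1].min())
    xy = xy * min(scale1, scalex, scaley)         # Grid_2: scaled layout
    xy = xy - xy.min(axis=0)                      # shift into the positive quadrant
    grid = np.floor(xy).astype(int)               # Grid_3: integer coordinates
    taken = set()
    for i, (gx, gy) in enumerate(grid):           # Grid_4: resolve coinciding cells
        while (gx, gy) in taken:                  # occupied: advance to the next cell, row-major
            gx = (gx + 1) % wg
            if gx == 0:
                gy = (gy + 1) % wg
        grid[i] = gx, gy
        taken.add((gx, gy))
    return grid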
Step 8.5, call the distance function pdist() of the Scipy package in the Python programming language to compute the Euclidean distances between the points of the point set Cint_nl within the nl-th class, and call the squareform() function of the Scipy package to convert them into matrix form, obtaining the intra-cluster point distance matrix Dcc'_nl;
Step 8.6, take Dcc'_nl as input and, following the method of step 8.4, generate a second-level 2D grid map Grid'_nl of size Wg × Wg, Grid'_nl = {g'_1, ..., g'_inl, ..., g'_Inl}, where g'_inl is the inl-th grid point of the nl-th class and Inl is the number of points in the nl-th class;
Step 8.7, call the OpenCV library function cv2.resize() to enlarge each grid point g_nl of Grid_4 into a block Block_nl of size Wg × Wg, and assign the enlarged Grid_4 to Grid_5; Grid_5 consists of Wg × Wg blocks, each of size Wg × Wg;
Step 8.8, embed the second-level 2D grid map Grid'_nl of the nl-th class into the corresponding block Block_nl of Grid_5 in order; the resulting Grid_5 is the 2D pseudo image PImg_{n_obj} of the n_obj-th bottled object, i.e. PImg_{n_obj} = Grid_5;
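A sketch of the two-level embedding of steps 8.7 and 8.8, assuming level1[nl] holds the (gx, gy) cell of cluster nl in Grid_4, level2[nl] the intra-cluster grid cells, and feats[nl] the fused 7-dimensional features of the cluster's points, so that with Wg = 16 the pseudo image has size 256 × 256 × 7:

import numpy as np

def build_pseudo_image(level1, level2, feats, wg=16, dim=7):
    """Embed each cluster's second-level grid into its block of Grid_5 to form the 2D pseudo image."""
    pimg = np.zeros((wg * wg, wg * wg, dim), dtype=np.float32)
    for nl, (gx, gy) in enumerate(level1):             # block position of cluster nl in Grid_4
        for (sx, sy), f in zip(level2[nl], feats[nl]):  # cell of each point inside the block
            pimg[gx * wg + sx, gy * wg + sy, :] = f     # pixel feature = fused feature of the 3D point
    return pimg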
Step 9, judge whether n_obj is greater than or equal to N_obj; if n_obj ≥ N_obj, go to step 10; otherwise set n_obj = n_obj + 1 and return to step 2;
The specific steps for training using a Multi-scale U-Net network (MSUnet) in step 10 are as follows:
Step 10.1, design the MSUnet network structure, as shown in Fig. 4: the MSUnet network structure has 18 layers in total, comprising 1 input layer, 5 convolution layers, 6 concatenation layers, 2 max pooling layers, 2 up-sampling layers, 1 fully connected layer and 1 output layer; the specific connection order of the MSUnet network structure is: input layer - convolution layer 1 - concatenation layer 1 - max pooling layer 1 - convolution layer 2 - concatenation layer 2 - max pooling layer 2 - fully connected layer - up-sampling layer 1 - concatenation layer 3 - convolution layer 3 - concatenation layer 4 - up-sampling layer 2 - concatenation layer 5 - convolution layer 4 - concatenation layer 6 - convolution layer 5 - output layer; the categories and numbers of the network layers of the MSUnet network structure are listed in Table 1, where S is the category of a network layer and n is the number of layers of that category;
Table 1. Categories and numbers of network layers in the MSUnet network structure
The input layer is the 2D grid pseudo image PImg_{n_obj} generated in step 8; the pseudo image is of size 256 × 256 × 7;
Convolution layer 1 extracts features from the pseudo image in parallel with convolution kernels of size 1 × 1 and 3 × 3 and outputs two feature maps of size 256 × 256 × 64;
Concatenation layer 1 splices the two feature maps from convolution layer 1 together and outputs a feature map of size 256 × 256 × 128;
Max pooling layer 1 spatially down-samples the features from concatenation layer 1 and outputs a feature map of size 16 × 16 × 128;
Convolution layer 2 extracts features from max pooling layer 1 in parallel with convolution kernels of size 1 × 1 and 3 × 3 and outputs two feature maps of size 16 × 16 × 128;
Concatenation layer 2 splices the two feature maps from convolution layer 2 together and outputs a feature map of size 16 × 16 × 256;
Max pooling layer 2 spatially down-samples the features from concatenation layer 2 and outputs a feature map of size 1 × 1 × 256;
In the implementation, the activation function used by the fully connected layer and by each convolution layer is the rectified linear unit (Rectified Linear Unit, ReLU); ReLU() is a piecewise linear function that sets all negative values to 0 and leaves positive values unchanged, and is an open-source activation function commonly used in artificial neural networks;
The input of the fully connected layer is the output feature of max pooling layer 2, and it outputs a feature map of size 1 × 1 × 256;
Up-sampling layer 1 linearly interpolates the feature map from the fully connected layer and outputs a feature map of size 16 × 16 × 256;
Concatenation layer 3 splices the feature map from concatenation layer 2 with the feature map of up-sampling layer 1 and outputs a feature map of size 16 × 16 × 512;
Convolution layer 3 extracts features from concatenation layer 3 in parallel with convolution kernels of size 1 × 1 and 3 × 3 and outputs two feature maps of size 16 × 16 × 128;
Concatenation layer 4 splices the two feature maps from convolution layer 3 together and outputs a feature map of size 16 × 16 × 256;
Up-sampling layer 2 linearly interpolates the feature map from concatenation layer 4 and outputs a feature map of size 256 × 256 × 256;
Concatenation layer 5 splices the feature map from concatenation layer 1 with the feature map of up-sampling layer 2 and outputs a feature map of size 256 × 256 × 384;
Convolution layer 4 extracts features from concatenation layer 5 in parallel with convolution kernels of size 1 × 1 and 3 × 3 and outputs two feature maps of size 256 × 256 × 64;
Concatenation layer 6 splices the two feature maps from convolution layer 4 together and outputs a feature map of size 256 × 256 × 128;
Convolution layer 5 performs a convolution with a kernel of size 1 × 1 and outputs a feature map of size 256 × 256 × 50;
The output layer activates the feature map from convolution layer 5 with a softmax activation function and outputs a feature map of size 256 × 256 × 50;
The feature map sizes of the input and output of each network layer of the MSUnet network structure are listed in Table 2:
Table 2. Network layer parameters of MSUnet
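A sketch of the basic MSUnet building block described above, i.e. parallel 1 × 1 and 3 × 3 convolutions followed by channel-wise concatenation, written in PyTorch under the assumption of ReLU activation and 'same' padding; the full 18-layer network is not reproduced:

import torch
import torch.nn as nn

class MultiScaleConv(nn.Module):
    """Parallel 1x1 and 3x3 convolutions whose outputs are concatenated along the channel axis."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.conv3 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return torch.cat([self.act(self.conv1(x)), self.act(self.conv3(x))], dim=1)

# e.g. convolution layer 1 + concatenation layer 1: 7 input channels -> two 64-channel maps -> 128 channels
block1 = MultiScaleConv(7, 64)
out = block1(torch.randn(1, 7, 256, 256))   # out.shape == (1, 128, 256, 256)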
Step 10.2, define the loss function of the MSUnet network multi-classification task as

L = -(1/N) Σ_{i=1}^{N} Σ_{c=1}^{C} y_ic · log(p_ic)

where N is the number of training samples, C is the number of classes, y_ic is the sample class indicator (y_ic = 1 if the class of the i-th sample is c, otherwise y_ic = 0) and p_ic is the probability that the i-th sample is predicted to be of class c;
Step 10.3, label the MSUnet network samples: the label of each non-blank pixel of the 2D grid pseudo image PImg_{n_obj} is the class label value label (1 ≤ label ≤ C) of the corresponding point p_{n2} in the point cloud data PS_2, and the label value of a pixel at a blank position is label = 0;
Step 10.4, MSUnet network training is shown in FIG. 5, and the specific steps are as follows:
Step 10.4.1, inputting a pseudo-image set Pimg;
Step 10.4.2, set the MSUnet network model training parameters: the learning rate variable lr, the total number of training iterations variable epoch, the batch size variable batch and the training iteration counter variable step; the specific settings used in the implementation are listed in Table 3:
Table 3. MSUnet network model training parameter settings

Parameter | Description                                                  | Value
lr        | learning rate                                                | 0.0001
display   | number of iterations between displays of the loss function  | 20
batch     | size of each data batch                                      | 4
epoch     | total number of training iterations                          | 200
step      | initial value of the training iteration counter              | 1
Step 10.4.3, randomly extract batch pseudo images from the pseudo image set Pimg each time and feed them into the MSUnet network for training;
Step 10.4.4, compute the absolute difference Dif of the loss function L between two successive iterations of MSUnet network training; if (Dif < Th_1) | (step > epoch), where Th_1 = 0.002 in the embodiment of the invention, the model has converged, so save the MSUnet network model M_MSUnet and end the training; otherwise set step = step + 1, use the Adam optimizer proposed by Diederik P. Kingma, Jimmy Ba et al in the paper "Adam: A method for stochastic optimization" at the 2015 International Conference on Learning Representations (ICLR) to back-propagate and correct the weight coefficients of each network layer of the training model, and return to step 10.4.3;
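A sketch of the training loop of step 10.4 with the Adam optimizer, using the Table 3 values (lr = 0.0001, epoch = 200, Th_1 = 0.002) and assuming a PyTorch model and a DataLoader over the pseudo-image set already exist; msunet, loader and the tensor shapes are illustrative assumptions, and the network is assumed to return raw logits:

import torch
import torch.nn as nn

def train_msunet(msunet, loader, lr=1e-4, max_steps=200, th1=0.002, device="cpu"):
    """Train until the loss change between iterations falls below th1 or the step budget is used up."""
    msunet.to(device).train()
    optimizer = torch.optim.Adam(msunet.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()              # multi-class cross entropy on raw logits
    prev_loss = None
    for step in range(1, max_steps + 1):
        for pimg, labels in loader:                # pimg: (batch, 7, 256, 256), labels: (batch, 256, 256)
            optimizer.zero_grad()
            loss = criterion(msunet(pimg.to(device)), labels.to(device))
            loss.backward()                        # back-propagate and correct the layer weights
            optimizer.step()
        if prev_loss is not None and abs(prev_loss - loss.item()) < th1:
            break                                  # converged: Dif < Th_1
        prev_loss = loss.item()
    torch.save(msunet.state_dict(), "M_MSUnet.pth")   # save the trained model M_MSUnet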
step 11, defining an image number counter k ', and initializing k' =1;
Step 12, input a bottled object obj' that is common in daily life scenes, perform the multi-view image acquisition of step 2 on obj' to obtain a curved-surface scene image sequence Img' = {I'_1, ..., I'_{k'}, ..., I'_{K'}}, and perform the OpenMVG + PMVS method of step 3 on Img' to generate the 3D point cloud data PS'_1 and the projection relation matrix set HS' = {H'_1, ..., H'_{k'}, ..., H'_{K'}}; K' = 90 in the embodiment of the invention;
Step 13, execute steps 4-8 on PS'_1 to obtain a pseudo image PImg';
Step 14, send PImg' into the MSUnet network model M_MSUnet and output all text instance classification results CL = {cl_1, ..., cl_c, ..., cl_C} and text instance classification scores Score = {sc_1, ..., sc_c, ..., sc_C}, where C is the total number of text instances, cl_c = {q_1, ..., q_{nc}, ..., q_{Nc}} with Nc the number of points in class c of CL and q_{nc} the nc-th point of cl_c, and sc_c holds the classification score of each point of cl_c;
The refinement and adjustment flow of the point cloud segmentation result in the step 15 is shown in fig. 6, and the specific steps are as follows:
Step 15.1, define an adjusted text instance classification result CL' and a text instance classification result counter c; initialize CL' to the empty set, CL' = NULL, and initialize c to 1, c = 1;
Step 15.2, take the c-th classification result cl_c = {x_1, ..., x_i, ..., x_NP} out of the text instance classification result CL = {cl_1, ..., cl_c, ..., cl_C}, where x_i is the i-th point in cl_c and NP is the total number of points in cl_c;
Step 15.3, for each point x_i in cl_c, compute its distance to the other points, d_ij = ||x_i - x_j||_2, x_i ∈ cl_c, x_j ∈ cl_c, i ≠ j;
Step 15.4, set the hyper-parameter km, select the km smallest distances from x_i to the other points in cl_c to form a set D_i, and compute the mean d_i of all elements of D_i; all the d_i form the set {d_1, ..., d_i, ..., d_NP}; km = 200 in the embodiment of the invention;
Step 15.5, compute the mean value mean and the standard deviation stddev of {d_1, ..., d_i, ..., d_NP};
Step 15.6, set the hyper-parameter λ and compute the threshold thre = mean + λ × stddev; λ = 1.5 in the embodiment of the invention;
Step 15.7, judge whether x_i is an outlier: if d_i > thre, x_i is an outlier and is removed from the set cl_c; after all outliers have been removed, assign cl_c to cl'_c;
Step 15.8, judge whether c is greater than C; if c > C, keep CL'; otherwise set c = c + 1, add cl'_c to CL', i.e. CL' = CL' + cl'_c, and return to step 15.2;
Step 16, according to H'_{k'} in the projection relation matrix set HS', compute the classification point set CP = {cp_1, ..., cp_c, ..., cp_C} corresponding to CL' in the image I'_{k'}, where the point set cp_c is computed as cp_c = H'_{k'} × cl'_c;
the text filling algorithm flow in step 17 is shown in fig. 7, and the specific steps are as follows:
Step 17.1, input the image I'_{k'}, the classification point set CP = {cp_1, ..., cp_c, ..., cp_C} and the text score set Score = {sc_1, ..., sc_c, ..., sc_C} of the points corresponding to CP, and define a threshold variable T, where cp_c = {cq_1, ..., cq_{nz_c}, ..., cq_{NZ_c}}, cq_{nz_c} is the nz_c-th point of cp_c, sc_c holds the score of each point of cp_c and NZ_c is the number of points in cp_c;
Step 17.2, filter out the points of CP with a low text score according to the threshold T: if the score of a point cq_{nz_c} is less than T, delete the point cq_{nz_c} from cp_c; assign the CP obtained after filtering out all points with a low text score to the variable TR, TR = {tr_1, ..., tr_c, ..., tr_C}, where tr_c = {tq_1, ..., tq_{ntr_c}, ..., tq_{NTR_c}}, tq_{ntr_c} is the ntr_c-th point of tr_c and NTR_c is the number of points in tr_c; T = 0.75 in the embodiment of the invention;
Step 17.3, compute the center point of each class in TR to form the center point set TC, TC = {cen_1, ..., cen_c, ..., cen_C}, where cen_c = mean(tr_c) and mean() is the mean function;
Step 17.4, define the set Poly_{k'} of text polygon bounding boxes in the image I'_{k'} and initialize it to the empty set, Poly_{k'} = NULL; define an image B_k of the same size as the image I'_{k'} with all its pixels assigned the value 0; initialize the text instance class counter c to 1, c = 1;
Step 17.5, with cen_c as the initial seed point, call the flood fill library function floodFill() of OpenCV to fill tr_c and obtain the filled point set trfill_c, and assign the value 1 to the pixels of B_k corresponding to the points of trfill_c;
Step 17.6, call the OpenCV library function cv2.morphologyEx() to apply 5 opening operations to B_k to obtain the image Mopen, and call the OpenCV library function cv2.dilate() to apply 10 dilation operations to the image Mopen to obtain the image Mdilate;
Step 17.7, call the OpenCV library function cv2.connectedComponentsWithStats() to obtain the connected region ConectedRegion of the image Mdilate, call the OpenCV library function cv2.findContours() to obtain the contour ContourPS of the connected region ConectedRegion, and call the OpenCV library function cv2.convexHull() to obtain the convex hull vertex set of the contour ContourPS, i.e. the polygon vertex set CNT_c = {pt_1, ..., pt_npl, ..., pt_NPL} of the c-th text instance, which constitutes the circumscribing polygon box pbox_c of the c-th text instance in image I'_{k'}; NPL is the number of vertices in CNT_c, and each vertex of CNT_c is drawn in the image I'_{k'};
Step 17.8, judge whether c ≤ C; if c ≤ C, set c = c + 1, add pbox_c to Poly_{k'}, i.e. Poly_{k'} = Poly_{k'} ∪ {pbox_c}, and return to step 17.5; otherwise go to step 17.9;
Step 17.9, output the set Poly_{k'} of all text polygon bounding boxes in the image I'_{k'}, Poly_{k'} = {pbox_1, ..., pbox_c, ..., pbox_C};
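A sketch of steps 17.6 and 17.7 with OpenCV, assuming the flood-filled binary image B_k of step 17.5 is available as the uint8 array bk (pixels of the filled instance set to 1); the kernel size and OpenCV flags are illustrative choices:

import cv2
import numpy as np

def instance_polygon(bk, open_iter=5, dilate_iter=10):
    """Morphologically clean the filled instance mask B_k and return its convex-hull polygon CNT_c."""
    kernel = np.ones((3, 3), np.uint8)
    mopen = cv2.morphologyEx(bk, cv2.MORPH_OPEN, kernel, iterations=open_iter)   # 5 opening passes
    mdilate = cv2.dilate(mopen, kernel, iterations=dilate_iter)                  # 10 dilation passes
    n_regions, labels, stats, _ = cv2.connectedComponentsWithStats(mdilate)      # connected region statistics
    contours, _ = cv2.findContours(mdilate, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    hull = cv2.convexHull(max(contours, key=cv2.contourArea))                    # convex hull of the region contour
    return hull.reshape(-1, 2)                                                   # polygon vertex set CNT_c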
Step 18, judge whether k' is less than K'; if k' < K', set k' = k' + 1 and return to step 16; otherwise, end the procedure.
The bottled object text detection method based on a 3D point cloud disclosed by the invention addresses the problem that existing scene text detection methods cannot accurately detect curved text using only the 2D information of an image, and introduces 3D point cloud information into the text detection algorithm; 3D information is used for curved text detection for the first time. On the basis of the 3D point cloud spatial coordinate features, the colour features and stroke width features that are highly discriminative on bottled products are fused to generate fused features with strong discriminability. The 3D point cloud data are mapped onto a 2D grid map with a graph drawing technique, the channel features of the grid points are represented by the fused features of the point cloud, and a pseudo image is generated on which target segmentation can be performed with existing image segmentation algorithms. Experimental results show that the curved text on bottled products can be detected accurately with the method.
Examples
In the embodiment of the invention, curved text positioning effect test is carried out on the curved bottled object images which are common in life, and subjective and objective evaluation is carried out on the test results respectively.
The subjective effect diagram of text positioning in the embodiment of the invention is shown in fig. 10 and 11:
Inputting any common bottled object in life, and using the method of the invention to detect and test the text instance on the bottled object. Fig. 8 shows an image of a bottled object, and fig. 10 shows a display of the result of text detection of the bottled object in fig. 8 in the image using the method of the present invention, with white boxes being text boxes; fig. 9 shows an image of another bottled object, and fig. 11 shows a display of the result of text detection of the bottled object in fig. 9 in the image using the method of the present invention, with white boxes being text boxes.
As can be seen from the text detection results of Fig. 10 and Fig. 11, for the curved text instances "pleasant and alive" and "milk" in Fig. 8, the method of the invention detects the text boundaries accurately, and the detected boundaries are smooth and close to the text content, in line with human visual perception of text boundaries.
In the embodiment of the invention, 50 common bottled objects in life are collected, characters on the collected bottled objects are detected and tested by adopting the method, and the detection result is objectively evaluated by adopting the following indexes:
① Precision (P). The precision is the ratio of the number of correctly detected targets to the total number of detected targets.
② Recall (R). The recall is the ratio of the number of correctly detected targets to the total number of labelled truth boxes.
③ Harmonic mean (F-measure, F). The harmonic mean is a weighted average of recall and precision, so the F-measure is a comprehensive measure of the performance of a detection algorithm; the higher its value, the better the performance of the algorithm. Its calculation expression is

F = 2 × P × R / (P + R)
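A short sketch of the three evaluation indices, assuming tp is the number of correctly detected targets, det the total number of detections and gt the total number of labelled truth boxes:

def evaluate(tp, det, gt):
    """Precision, recall and their harmonic mean (F-measure)."""
    p = tp / det                 # precision: correct detections / all detections
    r = tp / gt                  # recall: correct detections / all labelled truth boxes
    f = 2 * p * r / (p + r)      # harmonic mean
    return p, r, f

# example: evaluate(tp=85, det=100, gt=110) ≈ (0.85, 0.77, 0.81)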
The text detection performance on the bottled objects is shown in Table 4:
Table 4. Detection performance of text instances on bottled objects

Test object    | Precision | Recall | Harmonic mean
Bottled object | 85.9%     | 77.5%  | 81.5%
As can be seen from Table 4, the average precision, recall and harmonic mean of text instance detection on the 50 collected bottled objects with the method of the invention are 85.9%, 77.5% and 81.5%, respectively; the objective evaluation results of Table 4 demonstrate the effectiveness of the method of the invention for text instances on bottled objects.
The subjective and objective results show that the method can detect the curved text instances on bottled objects well, and the detection results demonstrate its effectiveness in detecting text instances of arbitrary shape, size and direction.

Claims (9)

1. A bottled object text detection method based on 3D point cloud is characterized in that: the method specifically comprises the following steps:
Step 1, define a variable N_obj for the total number of bottled objects, a pseudo-image set variable Pimg and a bottled-object counter variable n_obj; initialize Pimg to the empty set, Pimg = NULL, and set n_obj to 1, i.e. n_obj = 1;
Step 2, for the n_obj-th bottled object obj_{n_obj}, perform multi-view image acquisition to obtain a curved-surface scene image sequence Img = {I_1, ..., I_k, ..., I_K}, where K is the number of acquired multi-view images;
Step 3, apply a 3D point cloud generation method OpenMVG + PMVS to the curved-surface scene image sequence Img to generate 3D point cloud data PS_1 = {p_1, ..., p_{N_1}}, where N_1 is the number of 3D points in PS_1; at the same time obtain the projection relation matrix H_k between PS_1 and each image I_k of Img, forming a projection relation matrix set HS = {H_1, ..., H_k, ..., H_K};
Step 4, down-sample the point cloud data PS_1 to obtain sampled point cloud data PS_2 = {p_1, ..., p_{n2}, ..., p_{N_2}} and label samples in PS_2, where N_2 is the number of 3D points in PS_2; the spatial position feature of a point p_{n2} is sp_{n2} = (x_{n2}, y_{n2}, z_{n2}), where x_{n2}, y_{n2} and z_{n2} are the x, y and z coordinate values of the 3D point p_{n2};
Step 5, randomly extract an image I_k from Img and obtain the 2D point set PI_k = {pi_1, ..., pi_{n2}, ..., pi_{N_2}} corresponding to the point cloud data PS_2 in the 2D image I_k according to the projection relation matrix H_k;
Step 6, compute, for each point pi_{n2} of the 2D point set PI_k in image I_k, the RGB color feature rgb_{n2} = (R_{n2}, G_{n2}, B_{n2}) and the stroke width feature sw_{n2}, where R_{n2}, G_{n2} and B_{n2} are the R, G and B channel values of the point pi_{n2};
Step 7, fuse the spatial position feature sp_{n2} of the 3D point p_{n2} with the RGB color feature rgb_{n2} and the stroke width feature sw_{n2} of the 2D point pi_{n2}, generating the fused feature f_{n2} = (x_{n2}, y_{n2}, z_{n2}, R_{n2}, G_{n2}, B_{n2}, sw_{n2}) of the point p_{n2};
Step 8, call the library function spring_layout() of the Networkx package in the Python programming language to map the point cloud data PS_2, i.e. map the points of PS_2 onto a 2D grid pseudo image PImg_{n_obj}, in which the feature of each pixel is the fused feature f_{n2} of the corresponding 3D point; append PImg_{n_obj} to the pseudo-image set Pimg, i.e. Pimg = Pimg ∪ {PImg_{n_obj}};
Step 9, judge whether n_obj is greater than or equal to N_obj; if n_obj ≥ N_obj, go to step 10; otherwise set n_obj = n_obj + 1 and return to step 2;
Step 10, take the pseudo-image set Pimg as input and train a multi-scale U-Net network to obtain the MSUnet network model M_MSUnet;
Step 11, input a bottled object obj', execute step 2 to collect K' multi-view images of obj' and obtain a curved-surface scene image sequence Img' = {I'_1, ..., I'_{k'}, ..., I'_{K'}}; execute step 3, i.e. apply the OpenMVG + PMVS method to Img', to generate the 3D point cloud data PS'_1 of obj' and the projection relation matrix set HS' between PS'_1 and Img', HS' = {H'_1, ..., H'_{k'}, ..., H'_{K'}};
Step 12, take PS'_1, Img' and HS' as input and execute steps 4-8 to obtain a pseudo image PImg';
Step 13, send the pseudo image PImg' into the MSUnet network model M_MSUnet and output all text instance classification results CL = {cl_1, ..., cl_c, ..., cl_C} and text instance classification scores Score = {sc_1, ..., sc_c, ..., sc_C}, where C is the total number of text instances, cl_c = {q_1, ..., q_{nc}, ..., q_{Nc}} with Nc the number of 3D points in class c of CL and q_{nc} the nc-th 3D point of cl_c, and sc_c holds the classification score of each 3D point of cl_c;
Step 14, refine CL according to a refinement-adjustment mechanism to obtain the adjusted point cloud classification result CL' = {cl'_1, ..., cl'_c, ..., cl'_C}, where Nc' is the number of 3D points in class c of CL';
Step 15, define an image number counter k' and initialize k' = 1;
Step 16, according to H'_{k'} in the projection relation matrix set HS', compute the 2D classification point set CP = {cp_1, ..., cp_c, ..., cp_C} corresponding to CL' in the image I'_{k'}, where the 2D point set cp_c is computed as cp_c = H'_{k'} × cl'_c;
Step 17, execute a text filling algorithm: perform text filling on the 2D classification point set CP in the image I'_{k'} to obtain the text instance classification result of the image I'_{k'}, and at the same time output the set Poly_{k'} of all text instance circumscribing polygon boxes;
Step 18, judge whether k' is less than K'; if k' < K', set k' = k' + 1 and return to step 16; otherwise, end the procedure.
2. The bottled object text detection method based on 3D point cloud as claimed in claim 1, wherein the method is characterized in that: the specific process of the step 4 is as follows:
The specific process of the down-sampling in step 4 is as follows: open the point cloud processing software CloudCompare v2.6.3, click the Open file button on the toolbar and load the 3D point cloud data PS_1; click the Delete button on the toolbar and manually remove the irrelevant background points that do not belong to the bottled object from the 3D point cloud data PS_1, obtaining the point cloud data of interest PS' = {p_1, ..., p_{N'_1}}, where N'_1 is the number of 3D points in PS'; click the Clean button on the toolbar, set the filter parameters mean distance and nSigma in its pull-down menu and perform the SOR filtering operation; click the Subsample button on the toolbar, set the spatial sampling distance parameter space and the number of sampling points N_2 in its pull-down menu and perform the point cloud down-sampling operation to obtain the point cloud data PS_2 = {p_1, ..., p_{n2}, ..., p_{N_2}}, where the point p_{n2} is the n2-th sampling point, 1 ≤ n2 ≤ N_2, and its spatial feature is sp_{n2} = (x_{n2}, y_{n2}, z_{n2});
In step 4, the specific process of sample labeling of the 3D point cloud data PS_2 is as follows: click the Segment button on the toolbar of the point cloud processing software CloudCompare v2.6.3 and, in order from top to bottom and from left to right, manually box-select the point cloud of each text instance in the point cloud data PS_2 with the mouse; click the Add constant SF button on the toolbar and add a label value label to the box-selected text instance point cloud data; after all text instances in the point cloud data PS_2 have been box-selected and labeled, click the Merge multiple clouds button on the toolbar and merge all box-selected text instance point cloud data in PS_2, together with the non-text point cloud data on the bottled object, into the labeled point cloud data PS_LA = {PS_0, PS_1, ..., PS_l, ..., PS_L}, where PS_0 is the non-text point cloud data on the bottled object, PS_l is the l-th text instance point cloud data, L is the total number of text instances in the 3D point cloud data PS_2, and the label value of PS_l is label = l.
3. The bottled object text detection method based on the 3D point cloud according to claim 2, wherein the method is characterized in that the specific process of step 5 is as follows: randomly extracting an image I_k from the image sequence Img, and calculating the 2D point set Q_k = {q_1, …, q_{n2}, …, q_{N2}} corresponding to the point cloud data PS_2 in the image I_k according to H_k in the projection relation matrix set HS corresponding to the image I_k, with the specific calculation formula:
q_{n2} = (1/d) × H_k × p_{n2}
wherein d represents the distance from the 3D point p_{n2} to the camera.
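A small NumPy sketch of this projection, assuming H_k is a 3x4 projection matrix applied to homogeneous 3D coordinates and d is the depth term produced by the projection; the function and variable names are illustrative.

import numpy as np

def project_point_cloud(H_k, pts_3d):
    """Project an (N, 3) point cloud into image I_k with a 3x4 matrix H_k.
    Returns the (N, 2) pixel coordinates and the per-point depth d."""
    homo = np.hstack([pts_3d, np.ones((pts_3d.shape[0], 1))])   # (N, 4) homogeneous points
    proj = homo @ H_k.T                                         # (N, 3)
    d = proj[:, 2:3]                                            # depth / distance term
    return proj[:, :2] / d, d.ravel()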
4. The bottled object text detection method based on 3D point cloud as claimed in claim 3, wherein: the specific steps of the step 6 are as follows:
Step 6.1, for each 2D point q_{n2} in the image I_k, invoking the image access function of the PIL package in the Python programming language to extract the R, G, B channel values at point q_{n2} as the RGB color feature f_rgb_{n2} = (r_{n2}, g_{n2}, b_{n2}) of that point, wherein u_{n2} and v_{n2} respectively represent the abscissa and the ordinate of the 2D point q_{n2};
Step 6.2, calling the stroke width transform library function swttransform() of the SWTloc package in the Python programming language on the image I_k to obtain the stroke width values of all pixels in I_k; the stroke width value at the 2D point q_{n2} is taken as the SWT stroke width feature f_swt_{n2} of that point;
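A hedged Python sketch of steps 6.1-6.2. The claim names PIL and SWTloc's swttransform(); since the exact calls are not fully legible in this text, the sketch assumes the stroke width map has already been computed into an (H, W) array swt_map, and uses PIL's getpixel() as the image-access function.

import numpy as np
from PIL import Image

def sample_point_features(image_path, uv, swt_map):
    """For each projected 2D point (u, v) in image I_k, read the RGB colour
    feature and look up the SWT stroke width feature from swt_map."""
    img = Image.open(image_path).convert("RGB")
    rgb_feats, swt_feats = [], []
    for u, v in uv:
        x, y = int(round(u)), int(round(v))
        r, g, b = img.getpixel((x, y))        # RGB colour feature of the point
        rgb_feats.append((r, g, b))
        swt_feats.append(swt_map[y, x])       # SWT stroke width feature
    return np.array(rgb_feats), np.array(swt_feats)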
5. The bottled object text detection method based on the 3D point cloud according to claim 4, characterized in that the specific process of step 7 is as follows: for any point p_{n2} in the 3D point cloud data PS_2, its coordinate feature f_xyz_{n2}, RGB color feature f_rgb_{n2} and SWT stroke width feature f_swt_{n2} are serially fused by columns to obtain the fused feature f_{n2} = [f_xyz_{n2}, f_rgb_{n2}, f_swt_{n2}] of the point p_{n2}.
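The column-wise (serial) fusion of step 7 is a plain concatenation; a one-line NumPy sketch with illustrative names:

import numpy as np

def fuse_features(xyz, rgb, swt):
    """Concatenate per-point features into [x y z | r g b | swt], shape (N2, 7)."""
    return np.hstack([xyz, rgb, np.asarray(swt).reshape(-1, 1)])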
6. The bottled object text detection method based on 3D point cloud as claimed in claim 5, wherein the method is characterized in that: the specific steps of the step 8 are as follows:
Step 8.1, taking the 3D point cloud data PS_2 as input, setting the number of clusters NL and the number of clustering iterations IT, calling the clustering library function KMeans() of the Scikit-learn package in the Python programming language, and initially clustering PS_2 to obtain NL initial cluster centers Cen' = {ce'_1, …, ce'_nl, …, ce'_NL} and the distance matrix Dist from each point to the cluster center points;
Step 8.2, taking PS_2, Dist, N_2 and NL as input, adopting the graph_cut algorithm to refine the initial clustering result, and obtaining the refined cluster center point coordinate set Cen = {ce_1, …, ce_nl, …, ce_NL} and the point set Cint_nl of the nl-th class, wherein ce_nl denotes the center point coordinates of the nl-th class, the knl-th element of Cint_nl is the knl-th point of the nl-th class, and Knl represents the number of points in Cint_nl;
Step 8.3, calling the distance function pdist() of the Scipy package in the Python programming language to calculate the Euclidean distances {Dis_1, …, Dis_{NL×NL}} between the cluster center point coordinates in Cen, and calling the squareform() function of the Scipy package to convert {Dis_1, …, Dis_{NL×NL}} into matrix form, obtaining the cluster center point distance matrix Dcc_{NL×NL};
Step 8.4, taking Dcc_{NL×NL} as input, constructing an undirected graph G_c and performing first-level graph drawing on G_c to generate a first-level 2D grid map Grid_4 of size Wg×Wg, Grid_4 = {g_1, …, g_nl, …, g_NL}, wherein g_nl = (gx_nl, gy_nl) represents the nl-th grid point in Grid_4, and gx_nl and gy_nl respectively represent the abscissa and the ordinate of the nl-th grid point in the 2D grid map Grid_4;
Step 8.5, calling the distance function pdist() of the Scipy package in the Python programming language to calculate the Euclidean distances {Dis'_1, …, Dis'_{Knl×Knl}} between the points of the point set Cint_nl of the nl-th class, and calling the squareform() function of the Scipy package to convert them into matrix form, obtaining the within-cluster point distance matrix Dcc'_{Knl×Knl};
Step 8.6, taking Dcc'_{Knl×Knl} as input, generating a second-level 2D grid map of size Wg×Wg for the nl-th class according to the method of step 8.4, wherein the inl-th grid point corresponds to the inl-th point of the nl-th class, and Inl represents the number of points in the nl-th class;
Step 8.7, calling the OpenCV library function cv2.resize() to enlarge each grid point in Grid_4 into a block of size Wg×Wg, and assigning the enlarged Grid_4 to Grid_5; Grid_5 thus consists of Wg×Wg blocks, each block being of size Wg×Wg;
Step 8.8, embedding the second-level 2D grid map of the nl-th class into the corresponding block of Grid_5 in sequence, and defining the resulting Grid_5 as the 2D pseudo image of the nobj-th bottled object.
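A condensed Python sketch of the two-level grid mapping in step 8. It assumes NetworkX's force-directed layout as the "graph drawing" step and a simple snap-to-grid; the feature channels stored at each cell and all parameter choices are illustrative simplifications, not the patent's exact procedure, and every cluster is assumed to hold at least two points.

import numpy as np
import networkx as nx
from scipy.spatial.distance import pdist, squareform

def grid_layout(dist_matrix, Wg):
    """Lay out the nodes of a complete weighted graph on a Wg x Wg grid
    (spring_layout stands in for the graph drawing of steps 8.4/8.6)."""
    G = nx.from_numpy_array(dist_matrix)
    pos = nx.spring_layout(G, seed=0)
    xy = np.array([pos[i] for i in range(len(pos))])
    xy = (xy - xy.min(axis=0)) / (np.ptp(xy, axis=0) + 1e-9)    # normalise to [0, 1]
    return np.clip((xy * (Wg - 1)).round().astype(int), 0, Wg - 1)

def build_pseudo_image(centers, clusters, feats, Wg):
    """First level: cluster centres (NL x 3) on a coarse Wg x Wg grid;
    second level: each cluster's points on a fine Wg x Wg grid inside its block.
    `clusters` is a list of point-index lists, `feats` an (N2, 7) feature matrix."""
    pseudo = np.zeros((Wg * Wg, Wg * Wg, feats.shape[1]))
    coarse = grid_layout(squareform(pdist(centers)), Wg)              # steps 8.3/8.4
    for nl, members in enumerate(clusters):
        fine = grid_layout(squareform(pdist(feats[members, :3])), Wg) # steps 8.5/8.6
        gx, gy = coarse[nl]
        for (fx, fy), idx in zip(fine, members):                      # steps 8.7/8.8
            pseudo[gx * Wg + fx, gy * Wg + fy] = feats[idx]
    return pseudo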
7. The bottled object text detection method based on the 3D point cloud, according to claim 6, is characterized in that: the specific steps of the step 10 are as follows:
Step 10.1, designing the MSUnet network structure;
Step 10.2, defining the loss function of the MSUnet network multi-classification task:
Loss = -(1/N) × Σ_{i=1}^{N} Σ_{c=1}^{C} y_ic × log(p_ic)
wherein N represents the number of training samples, C represents the number of categories, y_ic is the sample class indicator (if the class of the i-th sample is c then y_ic = 1, otherwise y_ic = 0), and p_ic represents the probability that the i-th sample is predicted to be of class c;
Step 10.3, labeling the MSUnet network training samples: in the 2D grid pseudo image, the label of each pixel at a non-blank position is the category label value label of the corresponding point p_{n2} in the point cloud data PS_2, and the pixel label value of a blank position is label = 0;
Step 10.4, training the MSUnet network.
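The reconstructed formula above is the standard multi-class cross-entropy. A NumPy sketch, with p as an (N, C) matrix of predicted probabilities and y as the one-hot labels (names are illustrative):

import numpy as np

def multiclass_ce_loss(p, y, eps=1e-12):
    """Loss = -(1/N) * sum_i sum_c y_ic * log(p_ic); eps guards against log(0)."""
    return float(-np.mean(np.sum(y * np.log(p + eps), axis=1)))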
8. The bottled object text detection method based on the 3D point cloud according to claim 7, characterized in that the specific steps of step 15 are as follows:
Step 15.1, defining the adjusted text instance classification result CL' and a text instance classification result counter c; initializing CL' to the empty set, CL' = NULL, and initializing c = 1;
Step 15.2, taking out the c-th classification result cl_c = {x_1, …, x_i, …, x_NP} from the text instance classification result CL = {cl_1, …, cl_c, …, cl_C}, wherein x_i is the i-th point in cl_c and NP is the total number of points in cl_c;
Step 15.3, for each point x_i in cl_c, calculating the distance from point x_i to the other points in cl_c:
d_ij = ||x_i - x_j||_2, x_i ∈ cl_c, x_j ∈ cl_c, i ≠ j
Step 15.4, setting the hyperparameter km, selecting the km smallest distances between x_i and the other points, and taking the mean value d_i of these km distances; all d_i constitute the set {d_1, …, d_i, …, d_NP};
Step 15.5, calculating the mean mean and the standard deviation stddev of {d_1, …, d_i, …, d_NP}:
mean = (1/NP) × Σ_{i=1}^{NP} d_i, stddev = sqrt((1/NP) × Σ_{i=1}^{NP} (d_i - mean)^2)
Step 15.6, setting the hyperparameter λ and calculating the threshold:
thre = mean + λ × stddev
Step 15.7, judging whether x_i is an outlier: if d_i > thre, x_i is an outlier and is eliminated from the set cl_c; after all outliers have been eliminated, assigning cl_c to cl'_c;
Step 15.8, judging whether c is greater than C: if c > C, preserving CL'; otherwise, setting c = c + 1, adding cl'_c to CL', i.e. CL' = CL' + cl'_c, and returning to step 15.2.
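Steps 15.2-15.7 can be sketched in a few lines of NumPy/SciPy; km and λ are hyperparameters in the claim, so the default values used here are placeholders only.

import numpy as np
from scipy.spatial.distance import cdist

def refine_instance(points, km=5, lam=2.0):
    """Drop outliers from one text-instance point set (steps 15.2-15.7)."""
    pts = np.asarray(points, dtype=float)
    D = cdist(pts, pts)                                  # d_ij = ||x_i - x_j||_2
    np.fill_diagonal(D, np.inf)                          # ignore self-distance
    k = min(km, max(len(pts) - 1, 1))                    # guard for tiny instances
    d = np.sort(D, axis=1)[:, :k].mean(axis=1)           # mean of km nearest distances
    thre = d.mean() + lam * d.std()                      # thre = mean + lambda * stddev
    return pts[d <= thre]                                # keep inliers only

def refine_all(CL, km=5, lam=2.0):
    """Step 15.8: apply the refinement to every class, giving CL'."""
    return [refine_instance(cl, km, lam) for cl in CL]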
9. The bottled object text detection method based on the 3D point cloud, according to claim 8, is characterized in that: the specific steps of the step 17 are as follows:
Step 17.1, inputting the image I'_k', the classification point set CP = {cp_1, …, cp_c, …, cp_C} and the text score set Score = {sc_1, …, sc_c, …, sc_C} of the points corresponding to CP, and defining a threshold variable T, wherein the nz_c-th element of cp_c is the nz_c-th point of cp_c, the nz_c-th element of sc_c is the score of that point, and NZ_c represents the number of points in cp_c;
Step 17.2, filtering out the points with low text scores in CP according to the threshold T, namely: if the score of a point in cp_c is smaller than T, deleting that point from cp_c; assigning the CP from which all low-score points have been filtered out to the variable TR, TR = {tr_1, …, tr_c, …, tr_C}, wherein the ntr_c-th element of tr_c is the ntr_c-th point of tr_c, and NTR_c represents the number of points in tr_c;
Step 17.3, calculating the center point of each category in TR to form the center point set TC = {cen_1, …, cen_c, …, cen_C}, wherein cen_c = mean(tr_c) and mean() denotes the mean function;
Step 17.4, defining the set of text polygon bounding frames Poly_k' in the image I'_k' and initializing it to be empty, Poly_k' = NULL; defining an image B_k of the same size as I'_k' with all pixels assigned the value 0; and initializing the text instance class counter c = 1;
Step 17.5, using cen_c as the initial seed point, calling the flood filling library function floodFill() in OpenCV to fill tr_c and obtain the filled point set trfill_c, and assigning the value 1 to the pixels in B_k corresponding to the points in trfill_c;
Step 17.6, calling the OpenCV library function cv2.morphologyEx() to perform 5 iterations of the opening operation on B_k to obtain the image Mopen, and calling the OpenCV library function cv2.dilate() to perform 10 iterations of dilation on Mopen to obtain the image Mdilate;
Step 17.7, obtaining the connected region ConectedRegion of the image Mdilate with the OpenCV library function cv2.connectedComponentsWithStats(), obtaining the contour ContourPS of the connected region ConectedRegion with the OpenCV library function cv2.findContours(), and obtaining the convex hull vertex set of the contour ContourPS with the OpenCV library function cv2.convexHull(), namely the polygon vertex set CNT_c of the c-th text instance, which constitutes the polygonal bounding frame poly_c of the c-th text instance in the image I'_k', wherein NPL represents the number of vertices in CNT_c; each vertex in CNT_c is drawn in the image I'_k';
Step 17.8, judging whether c is less than or equal to C: if c ≤ C, setting c = c + 1, adding poly_c to Poly_k', i.e. Poly_k' = Poly_k' + poly_c, and returning to step 17.5; otherwise, entering step 17.9;
Step 17.9, outputting the set Poly_k' of all text polygon bounding frames on the image I'_k', Poly_k' = {poly_1, …, poly_c, …, poly_C}.
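A hedged OpenCV sketch of the text filling in steps 17.4-17.9. The claim names floodFill, morphologyEx, dilate, connectedComponentsWithStats, findContours and convexHull; the flood-fill tolerance, the kernel size and the way the filled region is merged with the projected points are assumptions of this sketch, and it keeps only the largest contour per instance instead of calling connectedComponentsWithStats explicitly.

import cv2
import numpy as np

def fill_text_instances(image, TR, TC, tol=8):
    """Grow a dense mask per text instance, clean it with open/dilate,
    and return one convex polygon (vertex array) per instance."""
    h, w = image.shape[:2]
    kernel = np.ones((3, 3), np.uint8)
    polys = []
    for tr_c, cen_c in zip(TR, TC):
        mask = np.zeros((h + 2, w + 2), np.uint8)
        seed = tuple(int(round(v)) for v in cen_c)                   # instance centre
        flags = 4 | cv2.FLOODFILL_MASK_ONLY | (255 << 8)
        cv2.floodFill(image.copy(), mask, seed, 0,
                      (tol,) * 3, (tol,) * 3, flags)                 # step 17.5
        B = mask[1:-1, 1:-1]
        pts = np.round(np.asarray(tr_c)).astype(int)
        B[pts[:, 1].clip(0, h - 1), pts[:, 0].clip(0, w - 1)] = 255  # add projected points
        Mopen = cv2.morphologyEx(B, cv2.MORPH_OPEN, kernel, iterations=5)   # step 17.6
        Mdilate = cv2.dilate(Mopen, kernel, iterations=10)
        contours, _ = cv2.findContours(Mdilate, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)      # step 17.7
        if contours:
            cnt = max(contours, key=cv2.contourArea)
            polys.append(cv2.convexHull(cnt).reshape(-1, 2))         # CNT_c vertices
    return polys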
CN202110769157.0A 2021-07-07 2021-07-07 Bottled object text detection method based on 3D point cloud Active CN113657375B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769157.0A CN113657375B (en) 2021-07-07 2021-07-07 Bottled object text detection method based on 3D point cloud

Publications (2)

Publication Number Publication Date
CN113657375A CN113657375A (en) 2021-11-16
CN113657375B CN113657375B (en) 2024-04-19

Family

ID=78489167

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769157.0A Active CN113657375B (en) 2021-07-07 2021-07-07 Bottled object text detection method based on 3D point cloud

Country Status (1)

Country Link
CN (1) CN113657375B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114782464B (en) * 2022-04-07 2023-04-07 中国人民解放军国防科技大学 Reflection chromatography laser radar image segmentation method based on local enhancement of target region

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2009301411A (en) * 2008-06-16 2009-12-24 Kobe Steel Ltd Image processing method and image processing device for sampling embossed characters
CN104598885A (en) * 2015-01-23 2015-05-06 西安理工大学 Method for detecting and locating text sign in street view image
CN110287960A (en) * 2019-07-02 2019-09-27 中国科学院信息工程研究所 The detection recognition method of curve text in natural scene image
CN113033247A (en) * 2019-12-09 2021-06-25 Oppo广东移动通信有限公司 Image identification method and device and computer readable storage medium
CN112070082A (en) * 2020-08-24 2020-12-11 西安理工大学 Curve character positioning method based on instance perception component merging network
CN113052835A (en) * 2021-04-20 2021-06-29 江苏迅捷装具科技有限公司 Medicine box detection method and detection system based on three-dimensional point cloud and image data fusion
CN113033543A (en) * 2021-04-27 2021-06-25 中国平安人寿保险股份有限公司 Curved text recognition method, device, equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"A Straightforward and Efficient Instance-Aware Curved Text Detector";Fan Zhao;《sensors》;20210310;全文 *
"一种直接高效的自然场景汉字逼近定位方法";赵凡;《计算机工程与应用》;20201231;第57卷(第6期);全文 *

Also Published As

Publication number Publication date
CN113657375A (en) 2021-11-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant