CN109977896A - Intelligent supermarket vending system - Google Patents

Intelligent supermarket vending system

Info

Publication number
CN109977896A
Authority
CN
China
Prior art keywords
layer
image
output
input
region
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910263910.1A
Other languages
Chinese (zh)
Inventor
刘昱昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201910263910.1A
Publication of CN109977896A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 30/00 Commerce
    • G06Q 30/06 Buying, selling or leasing transactions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/107 Static hand or arm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Strategic Management (AREA)
  • Development Economics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses an intelligent supermarket vending system. It addresses the problem that scanning commodities one by one at traditional checkout is excessively time-consuming: the commodity entry process is moved forward to the moment the shopper picks up the goods, so the time spent scanning items at checkout is eliminated, checkout speed is greatly increased, and the customer's shopping experience is improved. The invention uses pattern-recognition algorithms to recognize and count the shopper's actions while selecting goods: when the customer picks up or puts back a commodity, the commodity image is recognized to determine the commodity type; face recognition is performed on the customer, with whole-body image recognition used to establish the customer's identity when face recognition is unreliable; and abnormal customer behavior is recognized to determine whether theft is occurring. The system thus realizes automatic tallying without degrading the customer's shopping experience. The invention concerns only the customer's shopping and checkout process and does not change the supermarket's existing organizational structure, so it can be integrated seamlessly with an existing supermarket.

Description

Intelligent supermarket vending system
Technical field
The present invention relates to the fields of computer-vision monitoring, target detection, target tracking and pattern recognition, and specifically to detecting, tracking and recognizing the actions of individuals in front of supermarket shelves on the basis of surveillance cameras.
Background art
Under the traditional supermarket model, checkout is performed by scanning commodities one by one. This mode easily causes congestion: large numbers of shoppers queue at the cashier counters, and the overall checkout throughput is limited by the space available for cashier desks and by the number of cashiers, so it cannot be increased substantially. Because of these limitations of the traditional cash-register mode, checkout congestion cannot be avoided. Existing self-scanning checkout, in which customers scan commodities themselves, can reduce the scanning time, but commodities must still be inspected manually at the exit, so congestion still occurs. Analysing the cause of the congestion, the most time-consuming step is entering the commodities. If the commodity entry process is moved forward to the moment the shopper picks up the goods, the most time-consuming step is performed in advance and in parallel, so checkout speed is greatly increased and the customer's shopping experience is improved.
The system proposed by the invention uses surveillance cameras to recognize and count while the shopper is selecting goods: the quantity taken is incremented and decremented by recognizing the customer's actions of picking up and putting back commodities; the commodity type is obtained by recognizing the commodity when the customer picks it up or puts it back; and theft is detected by recognizing abnormal customer behavior. Automatic tallying is therefore achieved without degrading the shopping experience during goods selection. The invention concerns only the customer's shopping and checkout process and does not change the supermarket's existing organizational structure, so it can be integrated seamlessly with an existing supermarket.
Summary of the invention
The technical problem to be solved by the present invention is to overcome the slow checkout of the traditional supermarket model by proposing an intelligent supermarket vending system in which recognition of customer shopping behavior and recognition of commodities are both performed with surveillance cameras.
The technical solution adopted by the present invention to solve the technical problems is:
An intelligent supermarket vending system takes as input the video images captured by surveillance cameras fixed in the supermarket and on the shelves. It comprises an image preprocessing module, a target detection module, a shopping action recognition module, a product recognition module, an individual recognition module and a recognition result processing module. The image preprocessing module preprocesses the video images captured by the surveillance cameras: it first removes the noise that the input images may contain, then applies illumination compensation to the denoised images, then applies image enhancement to the illumination-compensated images, and finally passes the enhanced data to the target detection module. The target detection module performs target detection on the received images, detecting in the current video image the whole-body regions, face regions, hand regions and product regions; hand regions and product regions are sent to the shopping action recognition module, body regions and face regions are sent to the individual recognition module, and product regions are passed to the product recognition module. The shopping action recognition module performs static action recognition on the received hand-region information, finds the frame in which a grasp begins, and keeps recognizing until the action of putting down the article is found as the end frame; the resulting video is then classified with a dynamic action recognition classifier as taking out an article, putting back an article, taking out and putting back again, taking out an article without putting it back, or suspicious theft. The recognition result is sent to the recognition result processing module, and videos containing only a grasping action or only a putting-down action are also sent to the recognition result processing module. The product recognition module recognizes the video of the received product region, identifies which product is being moved, and sends the result to the recognition result processing module; products can be added to or removed from the product recognition module at any time. The individual recognition module recognizes the received face region and body region and, combining the two, identifies which individual in the supermarket the current person is, then sends the result to the recognition result processing module. The recognition result processing module integrates the received results: the customer ID passed by the individual recognition module determines which customer the current shopping information belongs to, the result passed by the product recognition module determines which product the shopping action concerns, and the result passed by the shopping action recognition module determines whether the shopping action modifies the shopping cart, thereby obtaining the current customer's shopping list. Suspicious theft recognized by the shopping action recognition module triggers an alarm.
The image preprocessing module works as follows. In the initialization phase the module does nothing. In the detection phase: first, the surveillance image captured by the camera is denoised to obtain the denoised image; second, illumination compensation is applied to the denoised image to obtain the illumination-compensated image; third, image enhancement is applied to the illumination-compensated image, and the enhanced data are passed to the target detection module.
Denoising of the surveillance image captured by the camera is performed as follows. Let the captured surveillance image be X_src. Since X_src is a colour RGB image it has three components X_src-R, X_src-G, X_src-B, and each component X_src' is processed separately. A 3 x 3 window is used: for each pixel X_src'(i, j), the nine pixel values of the 3 x 3 window centred on that point, [X_src'(i-1, j-1), X_src'(i-1, j), X_src'(i-1, j+1), X_src'(i, j-1), X_src'(i, j), X_src'(i, j+1), X_src'(i+1, j-1), X_src'(i+1, j), X_src'(i+1, j+1)], are sorted in descending order and the median is taken as the pixel value of the denoised image X_src'' at (i, j), i.e. assigned to X_src''(i, j). For boundary points of X_src' some pixels of the 3 x 3 window do not exist; in that case the median is computed over only the pixels that fall inside the image, and if an even number of pixels lies in the window the average of the two middle values is taken as the denoised pixel value and assigned to X_src''(i, j). The new image matrix X_src'' is then the denoised image of the current RGB component. After the three components X_src-R, X_src-G, X_src-B have been denoised in this way, the resulting components X_src-R'', X_src-G'', X_src-B'' are recombined into a new colour image X_den, which is the denoised image.
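A minimal sketch of this per-channel 3 x 3 median filter, assuming a NumPy H x W x 3 image; the function and variable names are illustrative, not from the patent.

```python
import numpy as np

def median_denoise(x_src: np.ndarray) -> np.ndarray:
    """3x3 median filter applied independently to each RGB channel.

    At image boundaries only the pixels that actually fall inside the image
    are used; with an even count, the two middle values are averaged,
    matching the rule described above.
    """
    h, w, _ = x_src.shape
    x_den = np.empty_like(x_src)
    for c in range(3):
        chan = x_src[:, :, c].astype(np.float32)
        for i in range(h):
            for j in range(w):
                window = np.sort(chan[max(i - 1, 0):i + 2,
                                      max(j - 1, 0):j + 2], axis=None)
                n = window.size
                if n % 2 == 1:
                    med = window[n // 2]
                else:
                    med = (window[n // 2 - 1] + window[n // 2]) / 2.0
                x_den[i, j, c] = med
    return x_den
```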
Illumination compensation of the denoised surveillance image is performed as follows. Let the denoised image be X_den; since X_den is a colour RGB image it has three components, and each component X_den' is compensated separately; the compensated components X_cpst' are then recombined into the colour RGB image X_cpst, which is X_den after illumination compensation. For each component X_den' the steps are: first, if X_den' has m rows and n columns, construct X_densum and Num_den as m x n matrices with all elements 0, set the window size l = Fix(sqrt(min(m, n))) and the step s = Fix(sqrt(l)), where min(m, n) is the minimum of m and n, Fix() takes the integer part, sqrt() is the square root, and l = 1 if l < 1. Second, with the top-left coordinate of X_den' at (1, 1) and starting from (1, 1), determine every candidate frame according to the window size l and step s; each candidate frame is the region [(a, b), (a+l, b+l)]. For the part of X_den' inside the candidate frame, perform histogram equalization to obtain the equalized sub-image X_den'' of the region [(a, b), (a+l, b+l)]; then for every element of X_densum in that region compute X_densum(a+i, b+j) = X_densum(a+i, b+j) + X_den''(i, j), where i and j are integers with 1 <= i <= l and 1 <= j <= l, and add 1 to every element of Num_den in the region. Finally compute X_cpst'(i, j) = X_densum(i, j) / Num_den(i, j) at every point of X_den', which is the illumination compensation of the current component X_den'.
The candidate frames are determined from the window size l and step s as follows (a code sketch follows the loop):
Let the surveillance image have m rows and n columns; (a, b) is the top-left coordinate of the selected region and (a+l, b+l) its bottom-right coordinate, the region being written [(a, b), (a+l, b+l)]; the initial value of (a, b) is (1, 1).
While a + l <= m:
    b = 1;
    While b + l <= n:
        the selected region [(a, b), (a+l, b+l)] is a candidate frame;
        b = b + s;
    end of inner loop;
    a = a + s;
end of outer loop.
Every region [(a, b), (a+l, b+l)] selected in this process is a candidate frame.
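A minimal sketch of the sliding-window illumination compensation built from these candidate frames, for one channel; cv2.equalizeHist stands in for the per-window equalization described in the next paragraph, and the identifiers are illustrative.

```python
import numpy as np
import cv2

def illumination_compensate(chan: np.ndarray) -> np.ndarray:
    """Sliding-window histogram equalization, averaged where windows overlap."""
    m, n = chan.shape
    l = max(int(np.sqrt(min(m, n))), 1)
    s = max(int(np.sqrt(l)), 1)
    acc = np.zeros((m, n), dtype=np.float64)
    num = np.zeros((m, n), dtype=np.float64)
    a = 0
    while a + l <= m:
        b = 0
        while b + l <= n:
            patch = chan[a:a + l, b:b + l].astype(np.uint8)
            acc[a:a + l, b:b + l] += cv2.equalizeHist(patch)  # per-window equalization
            num[a:a + l, b:b + l] += 1
            b += s
        a += s
    out = chan.astype(np.float64).copy()
    covered = num > 0
    out[covered] = acc[covered] / num[covered]   # average over overlapping windows
    return out.astype(np.uint8)
```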
Histogram equalization of the image matrix of X_den' inside a candidate frame is performed as follows. Let the candidate frame be the region [(a, b), (a+l, b+l)] and X_den'' be the image information of X_den' inside that region. First, construct the vector I, where I(i) is the number of pixels of X_den'' whose value equals i, 0 <= i <= 255. Second, compute the vector I'(i) = Fix(255 x (I(0) + I(1) + ... + I(i)) / (l x l)), i.e. the cumulative histogram normalized by the number of pixels and scaled to 255. Third, for every point (i, j) of X_den'' with pixel value X_den''(i, j), set X_den''(i, j) = I'(X_den''(i, j)). When all pixel values of X_den'' have been recomputed, the histogram equalization is finished and X_den'' holds the equalized result.
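A minimal sketch of this equalization step on a single 8-bit patch; the cumulative-histogram mapping is the standard one implied above, and the names are illustrative.

```python
import numpy as np

def hist_equalize_patch(patch: np.ndarray) -> np.ndarray:
    """Histogram-equalize one l x l 8-bit patch, as described in the text."""
    counts = np.bincount(patch.ravel(), minlength=256)              # vector I
    cdf = np.cumsum(counts)                                          # running sums of I
    mapping = np.floor(255.0 * cdf / patch.size).astype(np.uint8)    # vector I'
    return mapping[patch]                                            # X''(i,j) = I'(X''(i,j))
```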
Image enhancement of the illumination-compensated image is performed as follows. Let the illumination-compensated image be X_cpst with RGB channels X_cpstR, X_cpstG, X_cpstB, and let X_enh be the image obtained after enhancing X_cpst. The steps are: first, for each of X_cpstR, X_cpstG, X_cpstB compute the image blurred at the specified scale (denoted LX_cpstR, LX_cpstG, LX_cpstB). Second, construct matrices LX_enhR, LX_enhG, LX_enhB with the same dimensions as X_cpstR; for the R channel compute LX_enhR(i, j) = log(X_cpstR(i, j)) - LX_cpstR(i, j) for every point (i, j) of the image matrix, and obtain LX_enhG and LX_enhB for the G and B channels with the same algorithm. Third, for the R channel compute the mean MeanR and the standard deviation VarR of all values of LX_enhR, compute MinR = MeanR - 2 x VarR and MaxR = MeanR + 2 x VarR, then compute X_enhR(i, j) = Fix((LX_enhR(i, j) - MinR) / (MaxR - MinR) x 255), where Fix() takes the integer part, values < 0 are set to 0 and values > 255 are set to 255. X_enhG and X_enhB are obtained for the G and B channels with the same algorithm, and the three channels X_enhR, X_enhG, X_enhB are recombined into the colour image X_enh.
The image blurred at the specified scale is computed for each of X_cpstR, X_cpstG, X_cpstB as follows; the R channel X_cpstR is described, and the G and B channels use the same algorithm. First, define the Gaussian function G(x, y, σ) = k x exp(-(x² + y²)/σ²), where σ is the scale parameter and k = 1 / ∫∫ G(x, y) dx dy. Then for every point X_cpstR(i, j) compute the convolution of X_cpstR with G(x, y, σ) at (i, j) and take the integer part with Fix(), setting values < 0 to 0 and values > 255 to 255; for points closer to the image boundary than the scale σ, only the part of the convolution where X_cpstR and G(x, y, σ) overlap is computed. The results are the blurred channels LX_cpstR, LX_cpstG and LX_cpstB.
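A minimal sketch of this single-scale, Retinex-style enhancement for one channel; cv2.GaussianBlur stands in for the convolution with G(x, y, σ), the usual form log(X) - log(X blurred) is assumed where the original blur formula is garbled, and the names and default scale are illustrative.

```python
import numpy as np
import cv2

def enhance_channel(chan: np.ndarray, sigma: float = 80.0) -> np.ndarray:
    """Single-scale Retinex-style stretch of one channel, per the text."""
    chan = chan.astype(np.float64) + 1.0               # avoid log(0)
    blurred = cv2.GaussianBlur(chan, (0, 0), sigma)    # image blurred at scale sigma
    lx = np.log(chan) - np.log(blurred)                # LX_enh for this channel
    mean, std = lx.mean(), lx.std()
    lo, hi = mean - 2.0 * std, mean + 2.0 * std        # MinR / MaxR
    out = (lx - lo) / (hi - lo + 1e-6) * 255.0
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_image(x_cpst: np.ndarray) -> np.ndarray:
    """Apply the same enhancement to each RGB channel and recombine."""
    return np.dstack([enhance_channel(x_cpst[:, :, c]) for c in range(3)])
```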
In the target detection module, during initialization the target detection algorithm is parameter-initialized with images in which the body regions, face regions, hand regions and product regions have been annotated. During detection, the module receives the images passed by the image preprocessing module and processes them: each frame is passed through the target detection algorithm to obtain the body regions, face regions, hand regions and product regions of the current image; the hand regions and product regions are then sent to the shopping action recognition module, the body regions and face regions to the individual recognition module, and the product regions to the product recognition module.
Parameter initialization of the target detection algorithm with images in which body regions, face regions, hand regions and product regions have been annotated proceeds as follows: first, construct the feature extraction depth network; second, construct the region proposal network; third, for each image X in the database used to construct the feature extraction depth network and each manually annotated region of that image, pass the image X and the region through the ROI layer, whose output has dimension 7 x 7 x 512; fourth, construct the coordinate refinement network.
The feature extraction depth network is constructed as follows. The network is a deep-learning network with the structure below (all convolutional layers use kernel size 3 and stride (1, 1) with ReLU activation; all pooling layers are max-pooling layers with pool size 2 and stride (2, 2)):
Layer 1: convolutional, input 768 x 1024 x 3, output 768 x 1024 x 64, channels = 64.
Layer 2: convolutional, input 768 x 1024 x 64, output 768 x 1024 x 64, channels = 64.
Layer 3: pooling, input = layer 1 output concatenated with layer 2 output along the third dimension, output 384 x 512 x 128.
Layer 4: convolutional, input 384 x 512 x 128, output 384 x 512 x 128, channels = 128.
Layer 5: convolutional, input 384 x 512 x 128, output 384 x 512 x 128, channels = 128.
Layer 6: pooling, input = layer 4 output concatenated with layer 5 output along the third dimension, output 192 x 256 x 256.
Layer 7: convolutional, input 192 x 256 x 256, output 192 x 256 x 256, channels = 256.
Layer 8: convolutional, input 192 x 256 x 256, output 192 x 256 x 256, channels = 256.
Layer 9: convolutional, input 192 x 256 x 256, output 192 x 256 x 256, channels = 256.
Layer 10: pooling, input = layer 7 output concatenated with layer 9 output along the third dimension, output 96 x 128 x 512.
Layer 11: convolutional, input 96 x 128 x 512, output 96 x 128 x 512, channels = 512.
Layer 12: convolutional, input 96 x 128 x 512, output 96 x 128 x 512, channels = 512.
Layer 13: convolutional, input 96 x 128 x 512, output 96 x 128 x 512, channels = 512.
Layer 14: pooling, input = layer 11 output concatenated with layer 13 output along the third dimension, output 48 x 64 x 1024.
Layer 15: convolutional, input 48 x 64 x 1024, output 48 x 64 x 512, channels = 512.
Layer 16: convolutional, input 48 x 64 x 512, output 48 x 64 x 512, channels = 512.
Layer 17: convolutional, input 48 x 64 x 512, output 48 x 64 x 512, channels = 512.
Layer 18: pooling, input = layer 15 output concatenated with layer 17 output along the third dimension, output 48 x 64 x 1024.
Layer 19: convolutional, input 48 x 64 x 1024, output 48 x 64 x 256, channels = 256.
Layer 20: pooling, input 48 x 64 x 256, output 24 x 32 x 256.
Layer 21: convolutional, input 24 x 32 x 256, output 24 x 32 x 256, channels = 256.
Layer 22: pooling, input 24 x 32 x 256, output 12 x 16 x 256.
Layer 23: convolutional, input 12 x 16 x 256, output 12 x 16 x 128, channels = 128.
Layer 24: pooling, input 12 x 16 x 128, output 6 x 8 x 128.
Layer 25: fully connected; the input 6 x 8 x 128 data are first flattened into a 6144-dimensional vector and fed to the fully connected layer; output length 768; ReLU activation.
Layer 26: fully connected, input length 768, output length 96, ReLU activation.
Layer 27: fully connected, input length 96, output length 2, soft-max activation.
Let this deep network be Fconv27; for a colour image X its feature map set is written Fconv27(X). The evaluation function of the network is the cross-entropy loss computed between Fconv27(X) and y, minimized during training, where y is the class corresponding to the input. The database consists of images collected in the real world containing pedestrians and non-pedestrians; every image is a 768 x 1024 colour image and is assigned to one of two classes according to whether it contains a pedestrian; the number of iterations is 2000. After training, layers 1 to 17 are taken as the feature extraction depth network Fconv, and Fconv(X) denotes the output of this network for a colour image X.
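A minimal PyTorch sketch of the repeating block used throughout this network: two 3 x 3 convolutions whose outputs are concatenated along the channel axis and then max-pooled. It is an illustration under the stated layer sizes, not the patent's training code; the class and variable names are assumptions.

```python
import torch
import torch.nn as nn

class ConcatPoolBlock(nn.Module):
    """Two 3x3 convolutions, channel concatenation of their outputs, then 2x2 max-pool,
    mirroring layers 1-3 of the feature extraction network described above."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a = self.relu(self.conv1(x))
        b = self.relu(self.conv2(a))
        return self.pool(torch.cat([a, b], dim=1))   # concatenate channels, then pool

# A 768 x 1024 RGB input becomes 384 x 512 with 128 channels, as in layers 1-3.
block = ConcatPoolBlock(3, 64)
out = block(torch.randn(1, 3, 768, 1024))   # out.shape == (1, 128, 384, 512)
```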
The region proposal network is constructed as follows. It receives the set Fconv(X) of 512 feature maps of size 48 x 64 produced by the feature extraction network Fconv. First, a convolutional layer produces Conv1(Fconv(X)); its parameters are kernel size 1, stride (1, 1), input 48 x 64 x 512, output 48 x 64 x 512, channels = 512. Conv1(Fconv(X)) is then fed separately to two convolutional layers Conv2-1 and Conv2-2. Conv2-1: input 48 x 64 x 512, output 48 x 64 x 18, channels = 18; its output Conv2-1(Conv1(Fconv(X))) is passed through a softmax activation to obtain softmax(Conv2-1(Conv1(Fconv(X)))). Conv2-2: input 48 x 64 x 512, output 48 x 64 x 36, channels = 36. The network has two loss functions: the first, loss1, is the softmax error of Wshad-cls(X) ⊙ (Conv2-1(Conv1(Fconv(X))) - Wcls(X)); the second, loss2, is the smooth L1 error of Wshad-reg(X) ⊙ (Conv2-2(Conv1(Fconv(X))) - Wreg(X)). The loss function of the region proposal network is loss1 / sum(Wcls(X)) + loss2 / sum(Wcls(X)), where sum() sums all elements of a matrix, and it is minimized. Wcls(X) and Wreg(X) are the positive/negative sample information corresponding to the database image X, ⊙ denotes element-wise multiplication, and Wshad-cls(X) and Wshad-reg(X) are masks that select for training the parts whose weight is 1, so as to avoid an excessive imbalance between positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at every iteration, and the algorithm is iterated 1000 times.
The database used to construct the feature extraction depth network is prepared as follows. For each image in the database: first, every body region, face region, hand region and product region is annotated manually; if a region's centre coordinate in the input image is (a_bas, b_bas), its half-height (distance from the centre to the top and bottom edges) is l_bas and its half-width (distance from the centre to the left and right edges) is w_bas, then its corresponding position on Conv1 is the region with centre (Fix(a_bas/16), Fix(b_bas/16)), half-height Fix(l_bas/16) and half-width Fix(w_bas/16), where Fix() takes the integer part. Second, positive and negative samples are generated at random.
Random generation of positive and negative samples: first, construct 9 region boxes; second, for each image Xtr in the database, let Wcls be of dimension 48 x 64 x 18 and Wreg of dimension 48 x 64 x 36 with all initial values 0, and fill Wcls and Wreg.
The 9 region boxes are: Ro1(x, y) = (x, y, 64, 64), Ro2(x, y) = (x, y, 45, 90), Ro3(x, y) = (x, y, 90, 45), Ro4(x, y) = (x, y, 128, 128), Ro5(x, y) = (x, y, 90, 180), Ro6(x, y) = (x, y, 180, 90), Ro7(x, y) = (x, y, 256, 256), Ro8(x, y) = (x, y, 360, 180), Ro9(x, y) = (x, y, 180, 360). For each region box Roi(x, y), (x, y) is the centre coordinate of the current box, the third element is the pixel distance from the centre to the top and bottom edges, the fourth element is the pixel distance from the centre to the left and right edges, and i runs from 1 to 9.
Wcls and Wreg are filled as follows:
For each manually annotated body region, if its centre coordinate in the input image is (a_bas, b_bas), its half-height is l_bas and its half-width is w_bas, then its corresponding position on Conv1 is the region with centre (Fix(a_bas/16), Fix(b_bas/16)), half-height Fix(l_bas/16) and half-width Fix(w_bas/16).
For every point (xCtr, yCtr) in the rectangle whose top-left corner is (Fix(a_bas/16) - Fix(l_bas/16), Fix(b_bas/16) - Fix(w_bas/16)) and whose bottom-right corner is (Fix(a_bas/16) + Fix(l_bas/16), Fix(b_bas/16) + Fix(w_bas/16)):
    For i from 1 to 9:
        the mapping range of the point (xCtr, yCtr) in the database image is the 16 x 16 rectangle with top-left corner (16(xCtr - 1) + 1, 16(yCtr - 1) + 1) and bottom-right corner (16 xCtr, 16 yCtr); for every point (xOtr, yOtr) in that rectangle:
            compute the overlap rate of the region Roi(xOtr, yOtr) with the currently annotated rectangle;
        select the point (xIoUMax, yIoUMax) with the highest overlap rate in the current 16 x 16 rectangle. If the overlap rate > 0.7, then Wcls(xCtr, yCtr, 2i-1) = 1 and Wcls(xCtr, yCtr, 2i) = 0 (the point is a positive sample), Wreg(xCtr, yCtr, 4i-3) = (xOtr - 16 xCtr + 8)/8, Wreg(xCtr, yCtr, 4i-2) = (yOtr - 16 yCtr + 8)/8, Wreg(xCtr, yCtr, 4i-1) = Down1(l_bas / third element of Roi), Wreg(xCtr, yCtr, 4i) = Down1(w_bas / fourth element of Roi), where Down1() clips values greater than 1 to 1. If the overlap rate < 0.3, then Wcls(xCtr, yCtr, 2i-1) = 0 and Wcls(xCtr, yCtr, 2i) = 1. Otherwise Wcls(xCtr, yCtr, 2i-1) = -1 and Wcls(xCtr, yCtr, 2i) = -1.
If the currently annotated body region has no Roi(xOtr, yOtr) with overlap rate > 0.6, the Roi(xOtr, yOtr) with the highest overlap rate is used to assign Wcls and Wreg, with the same assignment rule as for overlap rate > 0.7.
The overlap rate between the region Roi(xOtr, yOtr) and the manually annotated rectangle is computed as follows. Let the annotated body rectangle have centre (a_bas, b_bas) in the input image, half-height l_bas and half-width w_bas, and let the third and fourth elements of Roi(xOtr, yOtr) be lOtr and wOtr. If |xOtr - a_bas| <= lOtr + l_bas - 1 and |yOtr - b_bas| <= wOtr + w_bas - 1, an overlapping region exists with area = (lOtr + l_bas - 1 - |xOtr - a_bas|) x (wOtr + w_bas - 1 - |yOtr - b_bas|); otherwise the overlapping area = 0. The total area = (2 lOtr - 1) x (2 wOtr - 1) + (2 l_bas - 1) x (2 w_bas - 1) - overlapping area, giving overlap rate = overlapping area / total area, where | | denotes absolute value.
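A minimal sketch of this overlap-rate (intersection-over-union style) computation with boxes given as (centre_x, centre_y, half_height, half_width); the helper name is illustrative.

```python
def overlap_rate(box_a, box_b):
    """Overlap rate for boxes (cx, cy, half_h, half_w), per the formulas above."""
    ax, ay, al, aw = box_a
    bx, by, bl, bw = box_b
    if abs(ax - bx) <= al + bl - 1 and abs(ay - by) <= aw + bw - 1:
        inter = (al + bl - 1 - abs(ax - bx)) * (aw + bw - 1 - abs(ay - by))
    else:
        inter = 0
    union = (2 * al - 1) * (2 * aw - 1) + (2 * bl - 1) * (2 * bw - 1) - inter
    return inter / union
```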
Wshad-cls(X) and Wshad-reg(X) are constructed as follows. For an image X with positive/negative sample information Wcls(X) and Wreg(X): first, construct Wshad-cls(X) with the same dimensions as Wcls(X) and Wshad-reg(X) with the same dimensions as Wreg(X). Second, record all positive samples: for i = 1 to 9, if Wcls(X)(a, b, 2i-1) = 1 then set Wshad-cls(X)(a, b, 2i-1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i-3) = 1, Wshad-reg(X)(a, b, 4i-2) = 1, Wshad-reg(X)(a, b, 4i-1) = 1, Wshad-reg(X)(a, b, 4i) = 1; in total sum(Wshad-cls(X)) positive samples are selected, where sum() sums all elements of a matrix, and if sum(Wshad-cls(X)) > 256 then 256 positive samples are kept at random. Third, select negative samples at random: choose (a, b, i) at random and, if Wcls(X)(a, b, 2i) = 1, set Wshad-cls(X)(a, b, 2i-1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i-3) = 1, Wshad-reg(X)(a, b, 4i-2) = 1, Wshad-reg(X)(a, b, 4i-1) = 1, Wshad-reg(X)(a, b, 4i) = 1; the number of negative samples to choose is 256 - sum(Wshad-cls(X)); if there are not enough negative samples and 20 consecutive random draws of (a, b, i) all fail to yield one, the procedure terminates.
The ROI layer takes as input an image X and a region given on the Conv1 grid by its centre (a, b), half-height l and half-width w. Its method is: the output Fconv(X) of the feature extraction depth network Fconv for the image X has dimension 48 x 64 x 512; for each of the 512 matrices V of size 48 x 64, extract the sub-matrix of V bounded by the top-left corner (a - l, b - w) and the bottom-right corner (a + l, b + w). The output roi(X) for this matrix has dimension 7 x 7; the extracted sub-matrix is divided into a 7 x 7 grid of cells of equal step:
For iROI = 1 to 7:
    For jROI = 1 to 7:
        construct the cell of the extracted sub-matrix corresponding to position (iROI, jROI);
        roi(X)(iROI, jROI) = the maximum value in that cell.
When all 512 matrices of size 48 x 64 have been processed, the outputs are stacked to give the 7 x 7 x 512 output, written ROI(X, (a, b, l, w)), for image X and region frame (a, b, l, w).
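A minimal sketch of this ROI max-pooling over a 48 x 64 x 512 feature map, with the region given as centre plus half-sizes on the feature-map grid; the even 7 x 7 binning is an assumption where the original step formula is garbled, and the region is assumed to span at least 7 x 7 cells.

```python
import numpy as np

def roi_pool(fmap: np.ndarray, a: int, b: int, l: int, w: int, out: int = 7) -> np.ndarray:
    """Max-pool the region (centre (a, b), half sizes l, w) of an H x W x C map to out x out x C."""
    h, wid, c = fmap.shape
    region = fmap[max(a - l, 0):min(a + l + 1, h),
                  max(b - w, 0):min(b + w + 1, wid), :]
    rows = np.linspace(0, region.shape[0], out + 1).astype(int)   # row bin edges
    cols = np.linspace(0, region.shape[1], out + 1).astype(int)   # column bin edges
    pooled = np.zeros((out, out, c), dtype=fmap.dtype)
    for i in range(out):
        for j in range(out):
            cell = region[rows[i]:rows[i + 1], cols[j]:cols[j + 1], :]
            pooled[i, j, :] = cell.max(axis=(0, 1))                # max over the cell
    return pooled

# e.g. pooled = roi_pool(np.random.rand(48, 64, 512), a=20, b=30, l=8, w=10)
```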
The coordinate refinement network is built as follows. First, the database is extended: for each image X in the database and each manually annotated region with centre (a, b), half-height l and half-width w, its ROI is ROI(X, (a, b, l, w)). If the current region is a body region then BClass = [1, 0, 0, 0, 0] and BBox = [0, 0, 0, 0]; if it is a face region then BClass = [0, 1, 0, 0, 0] and BBox = [0, 0, 0, 0]; if it is a hand region then BClass = [0, 0, 1, 0, 0] and BBox = [0, 0, 0, 0]; if it is a product region then BClass = [0, 0, 0, 1, 0] and BBox = [0, 0, 0, 0]. Random numbers a_rand, b_rand, l_rand, w_rand with values between -1 and 1 are generated and used to perturb the region, giving a new region whose BBox = [a_rand, b_rand, l_rand, w_rand]; if the overlap rate of the new region with the annotated region is > 0.7 its BClass is that of the current region, if the overlap rate is < 0.3 its BClass = [0, 0, 0, 0, 1], and if neither holds no assignment is made. Each region generates at most 10 positive sample regions; if Num1 positive sample regions are generated, Num1 + 1 negative sample regions are generated, and if there are not enough negative sample regions the ranges of a_rand, b_rand, l_rand, w_rand are enlarged until enough negative samples are found. Second, the coordinate refinement network itself is built: for each image X in the database and each manually annotated region, its 7 x 7 x 512 ROI is flattened into a 25088-dimensional vector and passed through two fully connected layers Fc2 to obtain Fc2(ROI); Fc2(ROI) is then passed through a classification layer FClass and a box fine-tuning layer FBBox, giving the outputs FClass(Fc2(ROI)) and FBBox(Fc2(ROI)). FClass is a fully connected layer with input length 512 and output length 5; FBBox is a fully connected layer with input length 512 and output length 4. The network has two loss functions: loss1 is the softmax error of FClass(Fc2(ROI)) - BClass, and loss2 is the Euclidean distance error of FBBox(Fc2(ROI)) - BBox; the total loss of the refinement network is loss1 + loss2. The training procedure is: first iterate 1000 times to converge loss2, then iterate 1000 times to converge the total loss.
The two fully connected layers Fc2 have the structure: layer 1, fully connected, input length 25088, output length 4096, ReLU activation; layer 2, fully connected, input length 4096, output length 512, ReLU activation.
Target detection on each frame with the target detection algorithm, yielding the body regions, face regions, hand regions and product regions of the current image, proceeds as follows:
Step 1: divide the input image X_cpst into sub-images of dimension 768 x 1024.
Step 2: for each sub-image Xs:
Step 2.1: transform it with the feature extraction depth network Fconv constructed at initialization to obtain the set of 512 feature maps Fconv(Xs).
Step 2.2: apply the first layer Conv1 of the region proposal network to Fconv(Xs), then the second-layer branches Conv2-1 with softmax activation and Conv2-2, obtaining softmax(Conv2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1(Fconv(Xs))); all preliminary candidate regions are then obtained from these outputs.
Step 2.3: for all preliminary candidate regions of all sub-images of the current frame:
Step 2.3.1: rank them by their candidate-region scores and keep the 50 highest-scoring preliminary candidate regions as candidate regions.
Step 2.3.2: adjust all out-of-bounds candidate regions in the candidate set, then weed out overlapping boxes, obtaining the final candidate regions.
Step 2.3.3: feed the sub-image Xs and each final candidate region to the ROI layer to obtain the corresponding ROI output. If the current final candidate region is (aBB, bBB, lBB, wBB), compute FBBox(Fc2(ROI)) to obtain the four outputs (OutBB(1), OutBB(2), OutBB(3), OutBB(4)) and hence the updated coordinates (aBB + 8 x OutBB(1), bBB + 8 x OutBB(2), lBB + 8 x OutBB(3), wBB + 8 x OutBB(4)); then compute the output of FClass(Fc2(ROI)): if the first element is largest the current region is a body region, if the second is largest a face region, if the third is largest a hand region, if the fourth is largest a product region, and if the fifth is largest the region is a negative sample region and the final candidate region is deleted.
Step 3: update the coordinates of the refined final candidate regions of all sub-images: if the coordinates of the current candidate region are (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding sub-image is (Sa_sub, Sb_sub), the updated coordinates are (TLx + Sa_sub - 1, TLy + Sb_sub - 1, RBx, RBy).
The input image X_cpst is divided into sub-images of dimension 768 x 1024 as follows; a code sketch follows the loop. The step sizes are 384 and 512. Let the image have m rows and n columns, and let (a_sub, b_sub) be the top-left coordinate of the selected region, with initial value (1, 1).
While a_sub < m:
    b_sub = 1;
    While b_sub < n:
        the selected region is [(a_sub, b_sub), (a_sub + 767, b_sub + 1023)]; copy the image information of X_cpst in this region into a new sub-image, attaching the top-left coordinate (a_sub, b_sub) as location information; if the selected region extends beyond X_cpst, the RGB values of the out-of-range pixels are set to 0;
        b_sub = b_sub + 512;
    end of inner loop;
    a_sub = a_sub + 384;
end of outer loop.
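A minimal sketch of this overlapping tiling into 768 x 1024 sub-images with steps 384 and 512, zero-padding where a tile runs past the image; names are illustrative.

```python
import numpy as np

def split_into_tiles(img: np.ndarray, tile_h: int = 768, tile_w: int = 1024,
                     step_h: int = 384, step_w: int = 512):
    """Yield (tile, (top, left)) pairs covering img with 50% overlap."""
    m, n, c = img.shape
    a = 0
    while a < m:
        b = 0
        while b < n:
            tile = np.zeros((tile_h, tile_w, c), dtype=img.dtype)  # out-of-range pixels stay 0
            patch = img[a:a + tile_h, b:b + tile_w, :]
            tile[:patch.shape[0], :patch.shape[1], :] = patch
            yield tile, (a, b)
            b += step_w
        a += step_h
```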
All preliminary candidate regions are obtained from the outputs as follows. softmax(Conv2-1(Conv1(Fconv(Xs)))) has dimension 48 x 64 x 18 and Conv2-2(Conv1(Fconv(Xs))) has dimension 48 x 64 x 36. For any point (x, y) of the 48 x 64 grid, let II be the 18-dimensional vector softmax(Conv2-1(Conv1(Fconv(Xs))))(x, y) and IIII the 36-dimensional vector Conv2-2(Conv1(Fconv(Xs)))(x, y). For i from 1 to 9, if II(2i-1) > II(2i), and with lOtr and wOtr the third and fourth elements of Roi, a preliminary candidate region [II(2i-1), (8 x IIII(4i-3) + x, 8 x IIII(4i-2) + y, lOtr x IIII(4i-1), wOtr x IIII(4i))] is produced, where the first element II(2i-1) is the score of the current candidate region and the second element states that the centre of the candidate region is (8 x IIII(4i-3) + x, 8 x IIII(4i-2) + y) and that the half-height and half-width of the candidate box are lOtr x IIII(4i-1) and wOtr x IIII(4i).
Out-of-bounds candidate regions in the candidate set are adjusted as follows. Let the surveillance image have m rows and n columns. For each candidate region with centre (a_ch, b_ch), half-height l_ch and half-width w_ch: if a_ch + l_ch > m, the box is clipped to the image in the row direction, i.e. a'_ch = (a_ch - l_ch + m)/2 and l'_ch = (m - a_ch + l_ch)/2, and a_ch = a'_ch, l_ch = l'_ch are updated; if b_ch + w_ch > n, the box is clipped likewise in the column direction, i.e. b'_ch = (b_ch - w_ch + n)/2 and w'_ch = (n - b_ch + w_ch)/2, and b_ch = b'_ch, w_ch = w'_ch are updated.
Overlapping boxes are weeded out of the candidate regions as follows (a code sketch follows the overlap-rate definition below):
While the candidate set is not empty:
    take the candidate region i_out with the highest score out of the candidate set;
    compute the overlap rate between i_out and every candidate region i_c remaining in the set; if the overlap rate > 0.7, delete i_c from the candidate set;
    put i_out into the output candidate set.
When the candidate set is empty, the output candidate set contains the candidate regions obtained after the overlapping boxes have been weeded out.
The overlap rate between the candidate region i_out and a candidate region i_c in the set is computed as follows. Let i_c have centre (a_ic, b_ic) with half-height l_ic and half-width w_ic, and let i_out have centre (a_iout, b_iout) with half-height l_iout and half-width w_iout. If |a_ic - a_iout| <= l_ic + l_iout - 1 and |b_ic - b_iout| <= w_ic + w_iout - 1, an overlapping region exists with area = (l_ic + l_iout - 1 - |a_ic - a_iout|) x (w_ic + w_iout - 1 - |b_ic - b_iout|); otherwise the overlapping area = 0. The total area = (2 l_ic - 1) x (2 w_ic - 1) + (2 l_iout - 1) x (2 w_iout - 1) - overlapping area, and the overlap rate = overlapping area / total area.
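A minimal sketch of this greedy overlap suppression (non-maximum suppression), reusing the overlap_rate helper sketched earlier; candidates are assumed to be (score, (cx, cy, half_h, half_w)) tuples and the names are illustrative.

```python
def suppress_overlaps(candidates, threshold=0.7):
    """Greedy suppression as described: keep the best-scoring box,
    drop boxes overlapping it by more than the threshold, repeat."""
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [c for c in remaining
                     if overlap_rate(best[1], c[1]) <= threshold]
    return kept
```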
The shopping action recognition module works as follows. At initialization, the static action recognition classifier is first initialized with standard hand-action images so that it can recognize the grasping and putting-down actions of the hand; the dynamic action recognition classifier is then initialized with hand-action videos so that it can recognize taking out an article, putting back an article, taking out and putting back again, taking out an article without putting it back, and suspicious theft. During detection: first, every received hand region is recognized with the static action recognition classifier; the recognition method is: let the input image be Handp1; the output StaticN(Handp1) is a 3-element vector, and the action is recognized as grasping if the first element is largest, as putting down if the second element is largest, and as other if the third element is largest. Second, after a grasping action is recognized, target tracking is applied to the region of the current grasping action; when the static action recognition classifier applied to the tracking box of the next frame recognizes a putting-down action, target tracking ends and the video from the frame in which the grasping action was recognized to the frame in which the putting-down action was recognized is obtained, giving a continuous hand-action video that is marked as a complete video. If tracking is lost during tracking, the video from the frame in which the grasping action was recognized to the frame before tracking was lost is obtained, giving a video of only the grasping action, which is marked as a grasp-only video. If a putting-down action is recognized but does not lie in an image obtained by target tracking, the grasping action belonging to it has been lost; in that case the hand region of the current image is taken as the end of a video, target tracking is run backwards from the current frame until tracking is lost, and the frame after the lost frame is taken as the start frame of the video, which is marked as a put-down-only video. Third, the complete videos obtained in the second step are recognized with the dynamic action recognition classifier; the recognition method is: let the input be Handv1; the output DynamicN(Handv1) is a 5-element vector, and the result is taking out an article if the first element is largest, putting back an article if the second is largest, taking out and putting back again if the third is largest, having taken out an article without putting it back if the fourth is largest, and a suspicious theft action if the fifth is largest. The recognition result is then sent to the recognition result processing module; grasp-only videos and put-down-only videos are sent to the recognition result processing module, and complete videos and grasp-only videos are sent to the product recognition module and the individual recognition module.
The static action recognition classifier is initialized with standard hand-action images as follows. First, the video data are prepared: a large number of videos of people shopping in supermarkets is chosen, containing the actions of taking out an article, putting back an article, taking out and putting back, taking out an article without putting it back, and suspicious theft. Each video clip is cut manually, with the frame in which the hand touches the commodity as the start frame and the frame in which the hand leaves the commodity as the end frame; the hand region of every frame is then extracted with the target detection module, every hand-region frame is scaled to a 256 x 256 colour image, and the scaled video is put into the hand-action video set and labelled as one of taking out an article, putting back an article, taking out and putting back, taking out an article without putting it back, or a suspicious theft action. For every video whose class is taking out an article, putting back an article, taking out and putting back, or taking out an article without putting it back, the first frame of the video is put into the hand-action image set and labelled as a grasping action, the last frame is put into the hand-action image set and labelled as a putting-down action, and one frame chosen at random from the video, excluding the first and last frames, is put into the hand-action image set and labelled as other. This yields the hand-action video set and the hand-action image set. Second, the static action recognition classifier StaticN is constructed. Third, StaticN is initialized: the input is the hand-action image set constructed in the first step; let each input image be Handp, the output StaticN(Handp), and the class y_Handp, with grasping: y_Handp = [1, 0, 0], putting down: y_Handp = [0, 1, 0], other: y_Handp = [0, 0, 1]. The evaluation function of the network is the cross-entropy loss of (StaticN(Handp) - y_Handp), minimized, with 2000 iterations.
The static action recognition classifier StaticN is constructed with the following network structure (all convolutional layers use kernel size 3 and stride (1, 1) with ReLU activation; all pooling layers are max-pooling layers with pool size 2 and stride (2, 2)):
Layer 1: convolutional, input 256 x 256 x 3, output 256 x 256 x 64, channels = 64.
Layer 2: convolutional, input 256 x 256 x 64, output 256 x 256 x 64, channels = 64.
Layer 3: pooling, input = layer 1 output concatenated with layer 2 output along the third dimension, output 128 x 128 x 128.
Layer 4: convolutional, input 128 x 128 x 128, output 128 x 128 x 128, channels = 128.
Layer 5: convolutional, input 128 x 128 x 128, output 128 x 128 x 128, channels = 128.
Layer 6: pooling, input = layer 4 output concatenated with layer 5 output along the third dimension, output 64 x 64 x 256.
Layer 7: convolutional, input 64 x 64 x 256, output 64 x 64 x 256, channels = 256.
Layer 8: convolutional, input 64 x 64 x 256, output 64 x 64 x 256, channels = 256.
Layer 9: convolutional, input 64 x 64 x 256, output 64 x 64 x 256, channels = 256.
Layer 10: pooling, input = layer 7 output concatenated with layer 9 output along the third dimension, output 32 x 32 x 512.
Layer 11: convolutional, input 32 x 32 x 512, output 32 x 32 x 512, channels = 512.
Layer 12: convolutional, input 32 x 32 x 512, output 32 x 32 x 512, channels = 512.
Layer 13: convolutional, input 32 x 32 x 512, output 32 x 32 x 512, channels = 512.
Layer 14: pooling, input = layer 11 output concatenated with layer 13 output along the third dimension, output 16 x 16 x 1024.
Layer 15: convolutional, input 16 x 16 x 1024, output 16 x 16 x 512, channels = 512.
Layer 16: convolutional, input 16 x 16 x 512, output 16 x 16 x 512, channels = 512.
Layer 17: convolutional, input 16 x 16 x 512, output 16 x 16 x 512, channels = 512.
Layer 18: pooling, input = layer 15 output concatenated with layer 17 output along the third dimension, output 8 x 8 x 1024.
Layer 19: convolutional, input 8 x 8 x 1024, output 8 x 8 x 256, channels = 256.
Layer 20: pooling, input 8 x 8 x 256, output 4 x 4 x 256.
Layer 21: convolutional, input 4 x 4 x 256, output 4 x 4 x 128, channels = 128.
Layer 22: pooling, input 4 x 4 x 128, output 2 x 2 x 128.
Layer 23: fully connected; the input 2 x 2 x 128 data are first flattened into a 512-dimensional vector and fed to the fully connected layer; output length 128; ReLU activation.
Layer 24: fully connected, input length 128, output length 32, ReLU activation.
Layer 25: fully connected, input length 32, output length 3, soft-max activation.
The dynamic action recognition classifier is initialized with hand-action videos as follows. First, the data set is constructed: from each video of the hand-action video set constructed in the first step of the static-classifier initialization, 10 frames are extracted uniformly and used as the input. Second, the dynamic action recognition classifier DynamicN is constructed. Third, DynamicN is initialized: the input is the set of 10-frame sequences extracted from each video; let the 10 input frames be Handv, the output DynamicN(Handv) and the class y_Handv, with taking out an article: y_Handv = [1, 0, 0, 0, 0], putting back an article: y_Handv = [0, 1, 0, 0, 0], taking out and putting back again: y_Handv = [0, 0, 1, 0, 0], having taken out an article without putting it back: y_Handv = [0, 0, 0, 1, 0], and a suspicious theft action: y_Handv = [0, 0, 0, 0, 1]. The evaluation function of the network is the cross-entropy loss of (DynamicN(Handv) - y_Handv), minimized, with 2000 iterations.
The 10 frames are extracted uniformly as follows. For a video of Nf frames, the 1st frame of the video is extracted as the 1st frame of the extracted set and the last frame of the video as the 10th frame of the extracted set; the i-th frame of the extracted set, for i = 2 to 9, is the frame of the video at the corresponding evenly spaced position between the first and last frames, i.e. frame 1 + Fix((i - 1) x (Nf - 1) / 9), where Fix() takes the integer part.
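A minimal sketch of this uniform 10-frame sampling; the interior-index formula is the even spacing implied by the text, and the names are illustrative.

```python
def sample_frames(video_frames, k: int = 10):
    """Pick k frames evenly from a list of frames, always keeping the first and last."""
    nf = len(video_frames)
    idx = [round(i * (nf - 1) / (k - 1)) for i in range(k)]
    return [video_frames[i] for i in idx]
```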
The dynamic action recognition classifier DynamicN is constructed with the following network structure (all convolutional layers use kernel size 3 and stride (1, 1) with ReLU activation; all pooling layers are max-pooling layers with pool size 2 and stride (2, 2)):
Layer 1: convolutional, input 256 x 256 x 30 (the 10 input frames stacked along the channel dimension), output 256 x 256 x 512, channels = 512.
Layer 2: convolutional, input 256 x 256 x 512, output 256 x 256 x 128, channels = 128.
Layer 3: pooling, input 256 x 256 x 128, output 128 x 128 x 128.
Layer 4: convolutional, input 128 x 128 x 128, output 128 x 128 x 128, channels = 128.
Layer 5: convolutional, input 128 x 128 x 128, output 128 x 128 x 128, channels = 128.
Layer 6: pooling, input = layer 4 output concatenated with layer 5 output along the third dimension, output 64 x 64 x 256.
Layer 7: convolutional, input 64 x 64 x 256, output 64 x 64 x 256, channels = 256.
Layer 8: convolutional, input 64 x 64 x 256, output 64 x 64 x 256, channels = 256.
Layer 9: convolutional, input 64 x 64 x 256, output 64 x 64 x 256, channels = 256.
Layer 10: pooling, input = layer 7 output concatenated with layer 9 output along the third dimension, output 32 x 32 x 512.
Layer 11: convolutional, input 32 x 32 x 512, output 32 x 32 x 512, channels = 512.
Layer 12: convolutional, input 32 x 32 x 512, output 32 x 32 x 512, channels = 512.
Layer 13: convolutional, input 32 x 32 x 512, output 32 x 32 x 512, channels = 512.
Layer 14: pooling, input = layer 11 output concatenated with layer 13 output along the third dimension, output 16 x 16 x 1024.
Layer 15: convolutional, input 16 x 16 x 1024, output 16 x 16 x 512, channels = 512.
Layer 16: convolutional, input 16 x 16 x 512, output 16 x 16 x 512, channels = 512.
Layer 17: convolutional, input 16 x 16 x 512, output 16 x 16 x 512, channels = 512.
Layer 18: pooling, input = layer 15 output concatenated with layer 17 output along the third dimension, output 8 x 8 x 1024.
Layer 19: convolutional, input 8 x 8 x 1024, output 8 x 8 x 256, channels = 256.
Layer 20: pooling, input 8 x 8 x 256, output 4 x 4 x 256.
Layer 21: convolutional, input 4 x 4 x 256, output 4 x 4 x 128, channels = 128.
Layer 22: pooling, input 4 x 4 x 128, output 2 x 2 x 128.
Layer 23: fully connected; the input 2 x 2 x 128 data are first flattened into a 512-dimensional vector and fed to the fully connected layer; output length 128; ReLU activation.
Layer 24: fully connected, input length 128, output length 32, ReLU activation.
Layer 25: fully connected, input length 32, output length 5 (one element per action class), soft-max activation.
The target tracking of the region corresponding to the current grasp motion, performed after a grasp motion is recognized, proceeds as follows. Let Hgrab be the image of the currently recognized grasp motion; the current tracking area is the region corresponding to image Hgrab. First step: extract the ORB feature ORB_Hgrab of image Hgrab. Second step: for the image corresponding to every hand region in the frame following Hgrab, compute its ORB feature to obtain an ORB feature set, and delete any ORB feature already chosen by another tracking box. Third step: compare ORB_Hgrab with each member of the ORB feature set by Hamming distance, and select the ORB feature with the smallest Hamming distance to ORB_Hgrab as the chosen feature. If the similarity between the chosen ORB feature and ORB_Hgrab is > 0.85, where similarity = 1 - (Hamming distance between the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hgrab in the next frame; otherwise (similarity < 0.85) the tracking is considered lost.
The ORB feature: the method for extracting ORB features from an image is mature and has an implementation in the OpenCV computer vision library. Extracting the ORB features of a picture takes the current image as input and outputs several strings of identical length, each of which represents one ORB feature.
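The matching step above can be sketched with OpenCV as follows. Treating an ORB feature as a set of 256-bit descriptors, averaging the match distances, and the helper names are illustrative assumptions; the 0.85 similarity threshold follows the text.

import cv2
import numpy as np

orb = cv2.ORB_create()

def orb_feature(hand_image):
    """Extract ORB descriptors for one hand-region image."""
    _, descriptors = orb.detectAndCompute(hand_image, None)
    return descriptors  # (num_keypoints, 32) uint8, or None if nothing was found

def orb_similarity(desc_a, desc_b):
    """similarity = 1 - Hamming distance / feature length (here averaged over matched pairs)."""
    if desc_a is None or desc_b is None:
        return 0.0
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(desc_a, desc_b)
    if not matches:
        return 0.0
    mean_hamming = np.mean([m.distance for m in matches])
    return 1.0 - mean_hamming / 256.0  # a 32-byte ORB descriptor has 256 bits

def next_tracking_box(grab_desc, candidate_regions):
    """candidate_regions: list of (region, descriptors) for hand regions in the adjacent frame."""
    best_region, best_sim = None, -1.0
    for region, desc in candidate_regions:
        sim = orb_similarity(grab_desc, desc)
        if sim > best_sim:
            best_region, best_sim = region, sim
    return best_region if best_sim > 0.85 else None  # None signals that tracking is lost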
The described backward tracking takes the hand region corresponding to the present image as the end of the video and, using the target tracking method, tracks backward frame by frame starting from the present frame until the tracking is lost. Method: let Hdown be the image of the currently recognized put-down motion; the current tracking region is the region corresponding to image Hdown.
If the tracking is not lost:
First step: extract the ORB feature ORB_Hdown of image Hdown; since this feature was already computed while tracking the region of the grasp motion (after the grasp motion was recognized, as described above), it does not need to be computed again here;
Second step: for the image corresponding to every hand region in the frame preceding image Hdown, compute its ORB feature to obtain an ORB feature set, and delete any ORB feature already chosen by another tracking box;
Third step: compare ORB_Hdown with each member of the ORB feature set by Hamming distance, and select the ORB feature with the smallest Hamming distance to ORB_Hdown as the chosen feature. If the similarity between the chosen ORB feature and ORB_Hdown is > 0.85, where similarity = 1 - (Hamming distance between the two ORB features / ORB feature length), then the hand region corresponding to the chosen ORB feature is the tracking box of image Hdown in that preceding frame; otherwise (similarity < 0.85) the tracking is lost and the algorithm ends.
The product identification module works as follows. During initialization, the product identification classifier is first initialized with the product image set of all angles, and a product list is generated from the product images. When the product list changes: if a product is removed, its images are deleted from the product image set of all angles and its entry is removed from the product list; if a product is added, the product images of all angles of that product are added to the product image set of all angles, the name of the newly added product is appended at the end of the product list, and the product identification classifier is then updated with the new product image set of all angles and the new product list. During detection, the first step is as follows: given the complete video and the grasp-only video passed from the shopping action recognition module, take the position obtained by the target detection module for the first frame of the current video, search the input video frames from that first frame toward earlier frames for a frame in which that region is not blocked, and finally feed the image of the region corresponding to that frame to the product identification classifier to obtain the recognition result of the current product. Recognition method: let the input image be Goods1 and the output GoodsN(Goods1) be a vector; if the i_goods-th position of that vector is the largest, the current recognition result is the i_goods-th product in the product list. The recognition result is sent to the recognition result processing module;
The initialization of the product identification classifier with the product image set of all angles, together with the generation of the product list from the product images, proceeds as follows. First step: construct the data set and the product list; the data set consists of the images of each angle of every product, and the product list list_Goods is a vector each entry of which corresponds to a product name. Second step: construct the product identification classifier GoodsN. Third step: initialize the constructed product identification classifier GoodsN; its input is the product image set of all angles. If the input image is Goods, the output is GoodsN(Goods) and the class label is y_Goods, a vector whose length equals the number of products in the product list; y_Goods is defined so that if image Goods shows the i_Goods-th product, the i_Goods-th position of y_Goods is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (GoodsN(Goods) - y_Goods); the convergence direction is minimization, and the number of iterations is 2000.
The construction product identification classifier GoodsN, two groups of GoodsN1 and GoodsN2 of network layer structure, wherein The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64 ×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels= 256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 × 128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1, 1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size Kernel_size=2, step-length stride=(2,2).The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are 2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are 1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full 
articulamentum, input vector length is 1024, output vector length is len(list_Goods), and the activation function is the soft-max activation function; len(list_Goods) denotes the length of the product list. For any input Goods2, GoodsN(Goods2) = GoodsN2(GoodsN1(Goods2)).
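This composition and the 2000-iteration cross-entropy initialization can be sketched in PyTorch as follows; GoodsN1 and GoodsN2 are assumed to be nn.Module implementations of the layer stacks listed above, the data loader and learning rate are illustrative, and the text's final soft-max is folded into CrossEntropyLoss.

import torch
import torch.nn as nn

class GoodsN(nn.Module):
    """GoodsN(x) = GoodsN2(GoodsN1(x)) as defined above."""
    def __init__(self, goods_n1: nn.Module, goods_n2: nn.Module):
        super().__init__()
        self.goods_n1 = goods_n1   # convolution/pooling stack ending in 4 x 4 x 128
        self.goods_n2 = goods_n2   # fully connected stack ending in len(list_goods) scores

    def forward(self, x):
        return self.goods_n2(self.goods_n1(x))

def initialize_goods_n(model, loader, iterations=2000, lr=1e-3):
    """Minimize the cross-entropy between the network output and the product label."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()   # soft-max + cross-entropy in one step
    done = 0
    while done < iterations:
        for images, labels in loader:   # labels are integer product indices i_goods
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
            done += 1
            if done >= iterations:
                break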
The updating of the product identification classifier with the new product image set of all angles and the new product list proceeds as follows. First step: modify the network structure. For the newly constructed product identification classifier GoodsN', the structure of GoodsN1' is unchanged and identical to the GoodsN1 structure used at initialization; the first and second layers of the GoodsN2' structure also remain unchanged, while the output vector length of its third layer becomes the length of the updated product list. Second step: initialize the newly constructed product identification classifier GoodsN'. Its input is the new product image set of all angles; if the input image is Goods3, the output is GoodsN'(Goods3) = GoodsN2'(GoodsN1(Goods3)) and the class label is y_Goods3, a vector whose length equals the number of entries in the updated product list, defined so that if the image shows the i_Goods-th product its i_Goods-th position is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (GoodsN'(Goods3) - y_Goods3); the convergence direction is minimization, the parameter values of GoodsN1 remain unchanged during this initialization, and the number of iterations is 500.
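The update amounts to a small transfer-learning step: freeze GoodsN1, reuse the first two fully connected layers of GoodsN2, and replace only the final layer so that its output length matches the new product list, then fine-tune for 500 iterations. A sketch under the assumption that GoodsN2 is an nn.Sequential of (fc1, relu, fc2, relu, fc3, softmax):

import torch.nn as nn

def build_goods_n2_prime(old_goods_n2: nn.Sequential, new_len: int) -> nn.Sequential:
    """Keep the first two fully connected layers, resize the third to the new list length."""
    kept = list(old_goods_n2.children())[:4]        # fc1, relu, fc2, relu (assumed layout)
    return nn.Sequential(*kept, nn.Linear(1024, new_len), nn.Softmax(dim=1))

def freeze_goods_n1(goods_n1: nn.Module):
    """GoodsN1's parameters stay fixed during the 500-iteration re-initialization."""
    for p in goods_n1.parameters():
        p.requires_grad = False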
Searching backward from the first frame of the current video for a frame in which the region obtained by the target detection module is not blocked proceeds as follows. Let the position obtained by the target detection module for the first frame of the current video be (a_goods, b_goods, l_goods, w_goods), let the first frame of the current video be the i_crgs-th frame, and let the frame being processed be i_cr = i_crgs. First step: let Task_icr be the set of all detection regions obtained by the target detection module for the i_cr-th frame. Second step: for every region frame (a_task, b_task, l_task, w_task) in Task_icr, compute its distance d_gt = (a_task - a_goods)^2 + (b_task - b_goods)^2 - (l_task + l_goods)^2 - (w_task + w_goods)^2. If no distance < 0 exists, the region (a_goods, b_goods, l_goods, w_goods) in the i_cr-th frame is the detected frame in which the region is not blocked, and the algorithm ends. Otherwise, if some distance < 0 exists, record d(i_cr) = the minimum distance in the distance list d and set i_cr = i_cr - 1; if i_cr > 0 the algorithm jumps back to the first step, and if i_cr <= 0 the record with the largest value in the distance list d is selected, and the region (a_goods, b_goods, l_goods, w_goods) in the frame corresponding to that record is taken as the detected frame in which the region is not blocked; the algorithm then ends.
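A sketch of this backward search; boxes are (a, b, l, w) tuples with centre (a, b), half-height l and half-width w, detections_per_frame is an illustrative stand-in for the per-frame output of the target detection module, and, as in the text, a negative distance is read as the tracked region being blocked.

def box_distance(goods_box, task_box):
    """d_gt as defined above; d_gt < 0 means the detection box overlaps the tracked box."""
    a_g, b_g, l_g, w_g = goods_box
    a_t, b_t, l_t, w_t = task_box
    return (a_t - a_g) ** 2 + (b_t - b_g) ** 2 - (l_t + l_g) ** 2 - (w_t + w_g) ** 2

def find_unblocked_frame(goods_box, detections_per_frame, i_crgs):
    """Walk backwards from frame i_crgs; return the first frame with no overlap,
    or the frame whose worst overlap is mildest if every frame is blocked."""
    distance_list = {}
    i_cr = i_crgs
    while i_cr > 0:
        dists = [box_distance(goods_box, d) for d in detections_per_frame[i_cr]]
        if all(d >= 0 for d in dists):
            return i_cr                       # region not blocked in this frame
        distance_list[i_cr] = min(dists)      # record the (negative) minimum distance
        i_cr -= 1
    return max(distance_list, key=distance_list.get) if distance_list else i_crgs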
The individual identification module works as follows. During initialization, the face feature extractor FaceN is first initialized with the face image set of all angles and μface is computed; then the body feature extractor BodyN is initialized with the body images of all angles and μbody is computed. During detection, when a user enters the supermarket, the target detection module obtains the current human body region Body1 and the face image Face1 within that region; the body feature BodyN(Body1) and the face feature FaceN(Face1) are then extracted with the body feature extractor BodyN and the face feature extractor FaceN respectively, BodyN(Body1) is stored in the BodyFtu set, FaceN(Face1) is stored in the FaceFtu set, and the ID information of the current customer is saved. The ID information can be the user's supermarket account or a non-repeating number randomly assigned when the user enters the supermarket, and it is used to distinguish different customers; whenever a customer enters the supermarket, his or her body feature and face feature are extracted. When a user moves a product in the supermarket, the complete video and the grasp-only video passed from the shopping action recognition module are used to find the corresponding human body region and face region, and the face feature extractor FaceN and the body feature extractor BodyN are used for face recognition or body recognition to obtain the ID of the customer corresponding to the video passed from the shopping action recognition module.
Initializing the face feature extractor FaceN with the face image set of all angles and computing μface proceeds as follows. First step: the chosen face image sets of all angles constitute the face data set. Second step: construct the face feature extractor FaceN and initialize it with the face data set. Third step:
For each person i_Peop in the face data set, obtain the set FaceSet(i_Peop) of all face images in the face data set that belong to i_Peop:
For each face image Face(j_iPeop) in FaceSet(i_Peop):
compute the face feature FaceN(Face(j_iPeop));
Take the average of all face features in the current face image set FaceSet(i_Peop) as the centre center(FaceN(Face(j_iPeop))) of the current face images, and let the distances between every face feature in FaceSet(i_Peop) and this centre constitute the distance set corresponding to i_Peop.
After obtaining the corresponding distance set for everyone in the face data set and arranging the distances from small to large, let the length of the distance set be n_diset; ⌊·⌋ denotes the integer part.
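A sketch of this statistic is given below; the per-person centres and distance sets follow the text, but how the distance sets are pooled and which position of the sorted distances defines μface are given in the original by a formula not reproduced above, so the pooled 95th-percentile position used here is only a stand-in assumption.

import numpy as np

def compute_mu(feature_extractor, images_by_person):
    """images_by_person: {person_id: [image, ...]}; feature_extractor(img) -> feature vector."""
    all_distances = []
    for images in images_by_person.values():
        feats = np.stack([feature_extractor(img) for img in images])
        center = feats.mean(axis=0)                       # centre of this person's features
        all_distances.extend(np.linalg.norm(feats - center, axis=1))
    all_distances = np.sort(np.asarray(all_distances))    # arranged from small to large
    n_diset = len(all_distances)
    return all_distances[int(0.95 * (n_diset - 1))]       # stand-in for the original index formula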
The construction face characteristic extractor FaceN is simultaneously initialized using face data set, if human face data collection By NfacesetIndividual is constituted, and network layer structure FaceN25 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, is exported and is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer 256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated 128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512; Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer: Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 ×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector Length 
is N_faceset, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size = 3, stride = (1,1), with the relu activation function; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2,2). Its initialization procedure is as follows: for each face image face4, the output is FaceN25(face4) and the class label is y_face, a vector of length N_faceset; y_face is defined so that if face face4 belongs to the i_face4-th person in the face image set, its i_face4-th position is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (FaceN25(face4) - y_face); the convergence direction is minimization and the number of iterations is 2000. After the iterations, the face feature extractor FaceN consists of layers 1 through 24 of the FaceN25 network.
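Keeping layers 1 through 24 as the feature extractor can be sketched as follows, assuming FaceN25 was built as an nn.Sequential whose final element is the classification layer; the same construction applies to the body network BodyN25 described below.

import torch.nn as nn

def to_feature_extractor(full_network: nn.Sequential) -> nn.Sequential:
    # Drop the final soft-max classification layer; the remaining stack outputs a 512-d feature.
    return nn.Sequential(*list(full_network.children())[:-1])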
Initializing the body feature extractor BodyN with the body images of all angles and computing μbody proceeds as follows. First step: the chosen body image sets of all angles constitute the body data set. Second step: construct the body feature extractor BodyN and initialize it with the body data set. Third step:
For each person i_Peop1 in the body data set, obtain the set BodySet(i_Peop1) of all body images in the body data set that belong to i_Peop1:
For each body image Body(j_iPeop1) in BodySet(i_Peop1):
compute the body feature BodyN(Body(j_iPeop1));
Take the average of all body features in the current body image set BodySet(i_Peop1) as the centre center(BodyN(Body(j_iPeop1))) of the current body images, and let the distances between every body feature in BodySet(i_Peop1) and this centre constitute the distance set corresponding to i_Peop1.
After obtaining the corresponding distance set for everyone in the body data set and arranging the distances from small to large, let the length of the distance set be n_diset1; ⌊·⌋ denotes the integer part.
Construction characteristics of human body's extractor BodyN and user's volumetric data set is initialized, if somatic data collection By NbodysetIndividual is constituted, and network layer structure BodyN25 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, is exported and is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, export as 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are defeated with third layer 256 × 256 × 64 are connected in third dimension out, and exporting is 128 × 128 × 128;4th layer: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer, inputting is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, the 4th layer of input are defeated 128 × 128 × 128 are connected in third dimension with layer 5 128 × 128 × 128 out, and exporting is 64 × 64 × 256;The Seven layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;8th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: convolutional layer, it is defeated Entering is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, inputting is Seven layers of output 64 × 64 × 256 are connected in third dimension with the 9th layer 64 × 64 × 256, and exporting is 32 × 32 × 512; Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Two layers: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;13rd layer: Convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 32 × 32 × 512 with the 13rd layer 32 × 32 × 512, defeated It is out 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, and exporting is 16 × 16 × 512, port number Channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 16 × 16 × 512 and the 17th layer 16 × 16 × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, defeated It is out 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 ×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 512, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 512, and output vector is long Degree is 512, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 512, output vector Length 
is N_bodyset, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size = 3, stride = (1,1), with the relu activation function; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2,2). Its initialization procedure is as follows: for each body image body4, the output is BodyN25(body4) and the class label is y_body, a vector of length N_bodyset; y_body is defined so that if body image body4 belongs to the i_body4-th person in the body image set, its i_body4-th position is 1 and all other positions are 0. The evaluation function of the network computes the cross-entropy loss of (BodyN25(body4) - y_body); the convergence direction is minimization and the number of iterations is 2000. After the iterations, the body feature extractor BodyN consists of layers 1 through 24 of the BodyN25 network.
Using the complete video and the grasp-only video passed from the shopping action recognition module, the corresponding human body region and face region are searched out, and face recognition or body recognition is performed with the face feature extractor FaceN and the body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently passed from the shopping action recognition module. The process is as follows: according to the video passed from the shopping action recognition module, the corresponding human body region and face region are searched for starting from the first frame of the video, until the algorithm ends or the last frame of the video has been processed:
Extract the body feature BodyN(Body2) and the face feature FaceN(Face2) from the corresponding human body region image Body2 and face region image Face2, using the body feature extractor BodyN and the face feature extractor FaceN respectively;
Then the face identification information is used first: compute the Euclidean distance d_Face between FaceN(Face2) and every face feature in the FaceFtu set, and select the feature in the FaceFtu set with the smallest Euclidean distance; let this feature be FaceN(Face3). If d_Face < μface, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), that customer's ID is the ID corresponding to the video action passed from the shopping action recognition module, and the current identification process ends;
If d_Face >= μface, the current individual cannot be identified by face recognition alone; in that case compute the Euclidean distance d_Body between BodyN(Body2) and every body feature in the BodyFtu set, and select the feature in the BodyFtu set with the smallest Euclidean distance; let this feature be BodyN(Face3). If d_Body + d_Face < μface + μbody, the current body image is identified as belonging to the customer of the body image corresponding to BodyN(Face3), and that customer's ID is the ID corresponding to the video action passed from the shopping action recognition module (this two-stage matching is sketched in code after the following paragraph).
If no ID corresponding to the video action has been found after all frames of the video have been processed, then, to avoid billing errors caused by misidentifying the shopping subject, the video passed from the current shopping action recognition module is not processed further.
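The two-stage matching above can be sketched as follows; the feature stores are assumed to be lists of (customer_id, feature) pairs, and all helper names are illustrative.

import numpy as np

def nearest(feature, store):
    """store: list of (customer_id, feature); returns (customer_id, Euclidean distance)."""
    dists = [np.linalg.norm(feature - f) for _, f in store]
    i = int(np.argmin(dists))
    return store[i][0], dists[i]

def identify(face_feat, body_feat, face_store, body_store, mu_face, mu_body):
    cid_face, d_face = nearest(face_feat, face_store)
    if d_face < mu_face:
        return cid_face                              # face recognition alone suffices
    cid_body, d_body = nearest(body_feat, body_store)
    if d_body + d_face < mu_face + mu_body:
        return cid_body                              # combined face + body criterion
    return None                                      # no ID established for this frame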
Searching for the corresponding human body region and face region starting from the first frame of the video passed from the shopping action recognition module proceeds as follows. The video passed from the shopping action recognition module is processed from its first frame. Suppose the i_fRg-th frame is currently being processed, the position that the target detection module obtained for this frame of the video is (a_ifRg, b_ifRg, l_ifRg, w_ifRg), the set of human body regions that the target detection module obtained for this frame is BodyFrameSet_ifRg, and the set of face regions is FaceFrameSet_ifRg. For each human body region (a_BFSifRg, b_BFSifRg, l_BFSifRg, w_BFSifRg) in BodyFrameSet_ifRg, compute its distance d_gbt = (a_BFSifRg - a_ifRg)^2 + (b_BFSifRg - b_ifRg)^2 - (l_BFSifRg - l_ifRg)^2 - (w_BFSifRg - w_ifRg)^2, and select the human body region with the smallest distance among all human body regions as the human body region corresponding to the current video; let the position of the chosen human body region be (a_BFS1, b_BFS1, l_BFS1, w_BFS1). For each face region (a_FFSifRg, b_FFSifRg, l_FFSifRg, w_FFSifRg) in the face region set FaceFrameSet_ifRg, compute its distance d_gft = (a_BFS1 - a_FFSifRg)^2 + (b_BFS1 - b_FFSifRg)^2 - (l_BFS1 - l_FFSifRg)^2 - (w_BFS1 - w_FFSifRg)^2, and select the face region with the smallest distance among all face regions as the face region corresponding to the current video.
The recognition result processing module does nothing during initialization. During recognition, it integrates the received recognition results to generate the shopping list of each customer: first, the customer ID passed from the individual identification module determines the customer to whom the current shopping information belongs, so the shopping list to be modified is the one numbered ID; then the recognition result passed from the product identification module determines the product involved in the current shopping action, denoted GoodA; finally, the recognition result passed from the shopping action recognition module determines the current shopping action and whether the shopping list is modified. If the action is recognized as taking out an article, product GoodA is added to shopping list ID with quantity 1; if it is recognized as putting back an article, product GoodA on shopping list ID is reduced by quantity 1; if it is recognized as "taken out and put back" or "taken out but not put back", the shopping list is not changed; if the recognition result is "suspicious stealing", an alarm signal and the location information corresponding to the current video are sent to supermarket monitoring.
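A sketch of these bookkeeping rules; the cart structure ({customer ID: {product: count}}) and the alert callback are illustrative, while the action labels and their effects follow the text.

def process_recognition(carts, customer_id, product, action, alert):
    cart = carts.setdefault(customer_id, {})
    if action == "take out article":
        cart[product] = cart.get(product, 0) + 1        # quantity added is 1
    elif action == "put back article":
        cart[product] = cart.get(product, 0) - 1        # quantity reduced is 1
    elif action == "suspicious stealing":
        alert(customer_id)   # alarm signal plus the location of the current video
    # "taken out and put back" and "taken out but not put back" leave the list unchanged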
The invention has the following advantages. By moving the recording of goods forward to the moment the shopper picks them up, the most time-consuming part of the shopping process is absorbed into the shopping itself; the time spent scanning items at checkout is removed, checkout speed is greatly improved, and the customer's shopping experience is improved. The invention uses pattern recognition algorithms to identify and count the goods a shopper selects while picking them, identifies the picture of a commodity when the customer picks it up or puts it back to obtain its type, performs face recognition on the customer and uses body image recognition to obtain the customer's identity when face recognition is unsatisfactory, and recognizes abnormal customer behaviour to determine whether theft has occurred. The system can realize automatic tallying without reducing the customer's shopping experience. The invention does not require changing the supermarket's original organizational structure for the customer's shopping and checkout process, and therefore interfaces seamlessly with the existing organization of the supermarket.
Detailed description of the invention
Fig. 1 is functional flow diagram of the invention
Fig. 2 is whole functional module of the invention and its correlation block diagram
Specific embodiment
The present invention will be further described below with reference to the drawings.
For this supermarket intelligent vending system, the functional flow diagram is shown in Figure 1, and the relationships between its modules are shown in Figure 2.
Three specific embodiments are provided below to explain the detailed process of the supermarket intelligent vending system of the present invention. Embodiment 1:
This embodiment realizes the parameter initialization process of the supermarket intelligent vending system.
1. Image pre-processing module: in the initialization phase, this module does not work;
2. Human body target detection module: during initialization, images with manually calibrated human body image regions, face regions, hand regions and product regions are used to perform parameter initialization of the target detection algorithm.
Using the images with manually calibrated human body image regions, face regions, hand regions and product regions to initialize the parameters of the target detection algorithm involves the following steps. First step: construct the feature extraction depth network. Second step: construct the region selection network. Third step: for each image X in the database used to construct the feature extraction depth network and each manually calibrated human region corresponding to it, pass them through the ROI layer, whose input is the image X and the region and whose output has dimension 7 × 7 × 512. Fourth step: build the coordinate refining network.
The construction feature extracts depth network, which is deep learning network structure, network structure are as follows: first Layer: convolutional layer, inputting is 768 × 1024 × 3, and exporting is 768 × 1024 × 64, port number channels=64;The second layer: volume Lamination, inputting is 768 × 1024 × 64, and exporting is 768 × 1024 × 64, port number channels=64;Third layer: Chi Hua Layer, input first layer output 768 × 1024 × 64 are connected in third dimension with third layer output 768 × 1024 × 64, Output is 384 × 512 × 128;4th layer: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, is led to Road number channels=128;Layer 5: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, channel Number channels=128;Layer 6: pond layer, the 4th layer of output 384 × 512 × 128 of input and layer 5 384 × 512 × 128 are connected in third dimension, and exporting is 192 × 256 × 256;Layer 7: convolutional layer, input as 192 × 256 × 256, exporting is 192 × 256 × 256, port number channels=256;8th layer: convolutional layer, input as 192 × 256 × 256, exporting is 192 × 256 × 256, port number channels=256;9th layer: convolutional layer, input as 192 × 256 × 256, exporting is 192 × 256 × 256, port number channels=256;Tenth layer: pond layer inputs as layer 7 output 192 × 256 × 256 are connected in third dimension with the 9th layer 192 × 256 × 256, and exporting is 96 × 128 × 512;11st Layer: convolutional layer, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;Floor 12: Convolutional layer, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;13rd layer: volume Lamination, inputting is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;14th layer: Chi Hua Layer inputs and is connected in third dimension for eleventh floor output 96 × 128 × 512 with the 13rd layer 96 × 128 × 512, Output is 48 × 64 × 1024;15th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64 × 512, channel Number channels=512;16th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number Channels=512;17th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number Channels=512;18th layer: pond layer, input for the 15th layer export 48 × 64 × 512 and the 17th layer 48 × 64 × 512 are connected in third dimension, and exporting is 48 × 64 × 1024;19th layer: convolutional layer, input as 48 × 64 × 1024, exporting is 48 × 64 × 256, port number channels=256;20th layer: pond layer, inputting is 48 × 64 × 256, Output is 24 × 62 × 256;Second eleventh floor: convolutional layer, inputting is 24 × 32 × 1024, and exporting is 24 × 32 × 256, channel Number channels=256;Second Floor 12: pond layer, inputting is 24 × 32 × 256, and exporting is 12 × 16 × 256;20th Three layers: convolutional layer, inputting is 12 × 16 × 256, and exporting is 12 × 16 × 128, port number channels=128;24th Layer: pond layer, inputting is 12 × 16 × 128, and exporting is 6 × 8 × 128;25th layer: full articulamentum, first by the 6 of input The data of × 8 × 128 dimensions are launched into the vector of 6144 dimensions, then input into full articulamentum, and output vector length is 768, Activation primitive is relu activation primitive;26th layer: full articulamentum, input vector length are 768, and output vector length is 96, activation primitive is relu activation primitive;27th layer: full 
articulamentum, input vector length is 96, output vector length is 2, and the activation function is the soft-max activation function; the parameters of all convolutional layers are convolution kernel size = 3, stride = (1,1), with the relu activation function; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2,2). Let this depth network be Fconv27; for a colour image X, the feature map set obtained through the depth network is denoted Fconv27(X). The evaluation function of the network computes the cross-entropy loss of (Fconv27(X) - y), the convergence direction is minimization, and y is the class label corresponding to the input. The database consists of images collected in the real world that contain passers-by and non-passers-by; every image is a colour image of dimension 768 × 1024, the images are divided into two classes according to whether they contain a pedestrian, and the number of iterations is 2000. After training, layers 1 through 17 are taken as the feature extraction depth network Fconv, and the output obtained for a colour image X through this network is denoted Fconv(X).
The structure realm selects network, receives Fconv depth network and extracts 512 48 × 64 feature set of graphs Fconv (X), then the first step obtains Conv by convolutional layer1(Fconv (X)), the parameter of the convolutional layer are as follows: convolution kernel Size=1 kernel, step-length stride=(1,1), inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number Channels=512;Then by Conv1(Fconv (X)) is separately input to two convolutional layer (Conv2-1And Conv2-2), Conv2-1Structure are as follows: inputting is 48 × 64 × 512, and exporting is 48 × 64 × 18, and port number channels=18, the layer obtains Output be Conv2-1(Conv1(Fconv (X))), then softmax is obtained using activation primitive softmax to the output (Conv2-1(Conv1(Fconv(X))));Conv2-2Structure are as follows: inputting is 48 × 64 × 512, and exporting is 48 × 64 × 36, Port number channels=36;There are two the loss functions of the network: first error function loss1 is to Wshad-cls⊙ (Conv2-1(Conv1(Fconv(X)))-Wcls(X)) softmax error is calculated, second error function loss2 is to Wshad-reg (X)⊙(Conv2-1(Conv1(Fconv(X)))-Wreg(X)) smooth L1 error, the loss function of regional choice network are calculated =loss1/sum (Wcls(X))+loss2/sum(Wcls(X)), the sum of sum () representing matrix all elements, convergence direction are It is minimized, Wcls(X) and WregIt (X) is respectively the corresponding positive and negative sample information of database images X, ⊙ representing matrix is according to correspondence Position is multiplied, Wshad-cls(X) and Wshad-regIt (X) is mask, it acts as selection Wshad(X) part that weight is 1 in is trained, To avoiding positive and negative sample size gap excessive, when each iteration, regenerates Wshad-cls(X) and Wshad-reg(X), algorithm iteration 1000 times.
The construction feature extracts database used in depth network, for each image in database, Step 1: each human body image-region, face facial area, hand region and product area are manually demarcated, if it schemes in input The centre coordinate of picture is (abas_tr, bbas_tr), centre coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate It is w in the distance of lateral distance left and right side framebas_tr, then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth is Indicate round numbers part;The Two steps: positive negative sample is generated at random.
The positive negative sample of generation at random, method are as follows: the first step constructs 9 regional frames, second step, for data The each image X in librarytrIf WclsFor 48 × 64 × 18 dimensions, WregFor 48 × 64 × 36 dimensions, all initial values are 0, right WclsAnd WregIt is filled.
Described 9 regional frames of construction, this 9 regional frames are respectively as follows: Ro1(xRo, yRo)=(xRo, yRo, 64,64), Ro2 (xRo, yRo)=(xRo, yRo, 45,90), Ro3(xRo, yRo)=(xRo, yRo, 90,45), Ro4(xRo, yRo)=(xRo, yRo, 128, 128), Ro5(xRo, yRo)=(xRo, yRo, 90,180), Ro6(xRo, yRo)=(xRo, yRo, 180,90), Ro7(xRo, yRo)= (xRo, yRo, 256,256), Ro8(xRo, yRo)=(xRo, yRo, 360,180), Ro9(xRo, yRo)=(xRo, yRo, 180,360), it is right In each region unit, Roi(xRo, yRo) indicate for ith zone frame, the centre coordinate (x of current region frameRo, yRo), the Three indicate pixel distance of the central point apart from upper and lower side frame, and the 4th indicates pixel distance of the central point apart from left and right side frame, i Value from 1 to 9.
It is described to WclsAnd WregIt is filled, method are as follows:
For the body compartments that each is manually demarcated, if it is (a in the centre coordinate of input picturebas_tr, bbas_tr), Centre coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate is in the distance of lateral distance left and right side frame wbas_tr, then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth is
For the upper left cornerThe lower right corner CoordinateEach point in the section surrounded (xctr, yCtr):
For i value from 1 to 9:
For point (xCtr, yctr), it is upper left angle point (16 (x in the mapping range of database imagesCtr- 1)+1,16 (yctr- 1)+1) bottom right angle point (16xctr, 16yCtr) 16 × 16 sections that are surrounded, for each point (x in the sectionOtr, yOtr):
Calculate (xOtr, yotr) corresponding to region Roi(xOtr, yOtr) with current manual calibration section coincidence factor;
Select the point (x_IoUMax, y_IoUMax) with the highest coincidence rate in the current 16 × 16 section. If the coincidence rate > 0.7, then W_cls(x_Ctr, y_Ctr, 2i-1) = 1 and W_cls(x_Ctr, y_Ctr, 2i) = 0, the point is a positive sample, and W_reg(x_Ctr, y_Ctr, 4i-3) = (x_Otr - 16x_Ctr + 8)/8, W_reg(x_Ctr, y_Ctr, 4i-2) = (y_Otr - 16y_Ctr + 8)/8, W_reg(x_Ctr, y_Ctr, 4i-1) = Down1(l_bas_tr / third position of Ro_i), W_reg(x_Ctr, y_Ctr, 4i) = Down1(w_bas_tr / fourth position of Ro_i), where Down1() clips a value greater than 1 to 1. If the coincidence rate < 0.3, then W_cls(x_Ctr, y_Ctr, 2i-1) = 0 and W_cls(x_Ctr, y_Ctr, 2i) = 1; otherwise W_cls(x_Ctr, y_Ctr, 2i-1) = -1 and W_cls(x_Ctr, y_Ctr, 2i) = -1.
If no Ro_i(x_Otr, y_Otr) has a coincidence rate > 0.6 with the currently manually calibrated human region, then the Ro_i(x_Otr, y_Otr) with the highest coincidence rate is selected to assign values to W_cls and W_reg, with the same assignment method as in the coincidence rate > 0.7 case.
Computing the coincidence rate between the region Ro_i(x_Otr, y_Otr) corresponding to (x_Otr, y_Otr) and the manually calibrated section proceeds as follows. Let the manually calibrated body section have centre coordinate (a_bas_tr, b_bas_tr) in the input image, distance l_bas_tr from the centre to the top and bottom frame edges, and distance w_bas_tr from the centre to the left and right frame edges, and let the third position of Ro_i(x_Otr, y_Otr) be l_Otr and its fourth position be w_Otr. If |x_Otr - a_bas_tr| <= l_Otr + l_bas_tr - 1 and |y_Otr - b_bas_tr| <= w_Otr + w_bas_tr - 1, an overlapping region exists and overlapping region = (l_Otr + l_bas_tr - 1 - |x_Otr - a_bas_tr|) × (w_Otr + w_bas_tr - 1 - |y_Otr - b_bas_tr|); otherwise overlapping region = 0. Compute whole region = (2l_Otr - 1) × (2w_Otr - 1) + (2l_bas_tr - 1) × (2w_bas_tr - 1) - overlapping region, so that coincidence rate = overlapping region / whole region; |·| denotes the absolute value.
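The coincidence rate can be sketched as follows; boxes are (centre x, centre y, half-height l, half-width w) in pixels, and reading the text's (2a_bas_tr - 1) as a typo for (2l_bas_tr - 1) in the union term is an assumption.

def coincidence_rate(anchor, calibrated):
    """anchor: Ro_i(x_Otr, y_Otr); calibrated: the manually calibrated body section."""
    x_o, y_o, l_o, w_o = anchor
    a_b, b_b, l_b, w_b = calibrated
    if abs(x_o - a_b) <= l_o + l_b - 1 and abs(y_o - b_b) <= w_o + w_b - 1:
        overlap = (l_o + l_b - 1 - abs(x_o - a_b)) * (w_o + w_b - 1 - abs(y_o - b_b))
    else:
        overlap = 0
    whole = (2 * l_o - 1) * (2 * w_o - 1) + (2 * l_b - 1) * (2 * w_b - 1) - overlap
    return overlap / whole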
The Wshad-cls(X) and Wshad-reg(X), building method are as follows: for image X, corresponding positive negative sample Information is Wcls(X) and Wreg(X), the first step constructs Wshad-cls(X) with and Wshad-reg(X), Wshad-cls(X) and Wcls(X) dimension It is identical, Wshad-reg(X) and Wreg(X) dimension is identical;Second step records the information of all positive samples, for i=1 to 9, if Wcls (X) (a, b, 2i-1)=1, then Wshad-cls(X) (a, b, 2i-1)=1, Wshad-cls(X) (a, b, 2i)=1, Wshad-reg(X) (a, B, 4i-3)=1, Wshad-reg(X) (a, b, 4i-2)=1, Wshad-reg(X) (a, b, 4i-1)=1, Wshad-reg(X) (a, b, 4i)= 1, positive sample has selected altogether sum (Wshad-cls(X)) a, sum () indicates to sum to all elements of matrix, if sum (Wshad-cls(X)) 256 > retain 256 positive samples at random;Third step randomly chooses negative sample, randomly chooses (a, b, i), if Wcls(X) (a, b, 2i-1)=1, then Wshad-cls(X) (a, b, 2i-1)=1, Wshad-cls(X) (a, b, 2i)=1, Wshad-reg(X) (a, b, 4i-3)=1, Wshad-reg(X) (a, b, 4i-2)=1, Wshad-reg(X) (a, b, 4i-1)=1, Wshad-reg(X) (a, b, 4i)=1, if the negative sample quantity chosen is 256-Sum (Wshad-cls(X)) a, although negative sample lazy weight 256- sum(Wshad-cls(X)) a but be all unable to get negative sample in 20 generation random numbers (a, b, i), then algorithm terminates.
The ROI layer takes as input the image X and a region. Its method is as follows: for image X, the output Fconv(X) obtained through the feature extraction depth network Fconv has dimension 48 × 64 × 512; for each 48 × 64 matrix V_ROI_I of this output (512 matrices in total), extract the sub-region of V_ROI_I bounded by the mapped upper-left and lower-right corner coordinates (⌊·⌋ denotes the integer part); the output roi_I(X) has dimension 7 × 7, and the corresponding step lengths are then used as follows:
For i_ROI = 1 to 7:
For j_ROI = 1 to 7:
construct the corresponding section;
roi_I(X)(i_ROI, j_ROI) = the value of the maximum point in the section.
When all 512 of the 48 × 64 matrices have been processed, their outputs are spliced to obtain the 7 × 7 × 512 output, which represents image X within the range of the regional frame ROI.
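A sketch of this ROI pooling over the 48 × 64 × 512 feature map; the exact section and step-length formulas of the original rely on expressions that did not survive reproduction, so the even 7 × 7 partition below is an approximation.

import numpy as np

def roi_pool(feature_map, box, out_size=7):
    """feature_map: (48, 64, 512); box: (r0, c0, r1, c1) inclusive bounds on the 48 x 64 grid."""
    r0, c0, r1, c1 = box
    region = feature_map[r0:r1 + 1, c0:c1 + 1, :]
    h, w, channels = region.shape
    out = np.zeros((out_size, out_size, channels), dtype=feature_map.dtype)
    for i in range(out_size):
        for j in range(out_size):
            rs = (i * h) // out_size
            re = max(((i + 1) * h) // out_size, rs + 1)     # each section covers >= 1 row
            cs = (j * w) // out_size
            ce = max(((j + 1) * w) // out_size, cs + 1)     # each section covers >= 1 column
            out[i, j, :] = region[rs:re, cs:ce, :].max(axis=(0, 1))   # max point per channel
    return out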
The building coordinate refines network, method are as follows: the first step, extending database: extended method is for data Each image X and the corresponding each region manually demarcated in libraryIts is corresponding ROI isIf current interval be human body image-region if BClass=[1,0,0, 0,0], [0,0,0,0] BBox=, the BClass=[0,1,0,0,0] if current interval is people's face facial area, BBox=[0, 0,0,0], BClass=[0,0,1,0,0], BBox=[0,0,0,0], if current interval is if current interval is hand region Product area then [0,0,0,1,0] BClass=, BBox=[0,0,0,0];It is random to generate value random number between -1 to 1 arand, brand, lrand, wrand, to obtain new section Indicate round numbers part, the BBox=[a in the sectionrand, brand, lrand, wrand], if new section withThe then BClass=current region of coincidence factor > 0.7 BClass, if new section withCoincidence factor < 0.3, then BClass=[0,0,0,0, 1], the two is not satisfied, then not assignment.Each section at most generates 10 positive sample regions, if generating Num1A positive sample area Domain then generates Num1+ 1 negative sample region, if the inadequate Num in negative sample region1+ 1, then expand arand, brand, lrand, wrand Range, until finding enough negative sample numbers.Second step, building coordinate refine network: for every in database One image X and the corresponding each human region manually demarcatedIts corresponding ROI isThe ROI of 7 × 7 × 512 dimensions will be launched into 25088 dimensional vectors, then passed through Cross two full articulamentum Fc2, obtain output Fc2(ROI), then by Fc2(ROI) micro- by classification layer FClass and section respectively Layer FBBox is adjusted, output FClass (Fc is obtained2And FBBox (Fc (ROI))2(ROI)), classification layer FClass is full articulamentum, Input vector length is 512, and output vector length is 5, and it is full articulamentum that layer FBBox is finely tuned in section, and input vector length is 512, output vector length is 4;There are two the loss functions of the network: first error function loss1 is to FClass (Fc2 (ROI))-BClass calculates softmax error, and second error function loss2 is to (FBBox (Fc2(ROI))-BBox) meter Euclidean distance error is calculated, then whole loss function=loss1+loss2 of the refining network, algorithm iteration process are as follows: change first 1000 convergence error function loss2 of generation, then 1000 convergence whole loss functions of iteration.
The full articulamentum Fc of described two2, structure are as follows: first layer: full articulamentum, input vector length is 25088, defeated Outgoing vector length is 4096, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length is 4096, defeated Outgoing vector length is 512, and activation primitive is relu activation primitive.
3. Shopping action recognition module: during initialization, the static action recognition classifier is first initialized with standard hand action images, so that it can recognize the grasping and putting-down actions of a hand; the dynamic action recognition classifier is then initialized with hand action videos, so that it can recognize taking out an article, putting back an article, taking out and putting back, taking out an article without putting it back, or suspicious stealing.
Initializing the static action recognition classifier with standard hand action images proceeds as follows. First step: arrange the video data. First, choose a large number of videos of people shopping in supermarkets; these videos include the actions of taking out an article, putting back an article, taking out and putting back, taking out an article without putting it back, and suspicious stealing. Each video segment is intercepted manually, taking the frame in which the hand touches the commodity as the start frame and the frame in which the hand leaves the commodity as the end frame; then, for each frame of a video, the target detection module extracts its hand region, each frame image of the hand region is scaled to a 256 × 256 colour image, the scaled video is put into the hand action video set, and the video is labelled as one of the actions taking out an article, putting back an article, taking out and putting back, taking out an article without putting it back, or suspicious stealing. For each video whose class is taking out an article, putting back an article, taking out and putting back, or taking out an article without putting it back, the first frame of the video is put into the hand action image set labelled as a grasp action, the last frame of the video is put into the hand action image set labelled as a put-down action, and one frame other than the first and last frames is taken from the video at random and put into the hand action image set labelled as other. This yields the hand action video set and the hand action image set. Second step: construct the static action recognition classifier StaticN. Third step: initialize the static action recognition classifier StaticN; its input is the hand action image set constructed in the first step. If the image input each time is Handp, the output is StaticN(Handp) and the class label is y_Handp, defined as grasp: y_Handp = [1,0,0]; put down: y_Handp = [0,1,0]; other: y_Handp = [0,0,1]. The evaluation function of the network computes the cross-entropy loss of (StaticN(Handp) - y_Handp); the convergence direction is minimization and the number of iterations is 2000.
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer inputs and is 256 × 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 × 256 × 64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;The Four layers: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;5th Layer: convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: Pond layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128, Output is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256 It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is 32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th Layer: convolutional layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: Chi Hua Layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;Second Floor 12: pond layer, inputting is 4 × 4 × 128, export as 2 × 2 × 128;23rd layer: the data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, so by full articulamentum first After input into full articulamentum, output vector length is 128, and activation primitive is relu activation primitive;24th layer: full connection Layer, input vector length are 128, and output vector length is 32, and activation primitive is relu activation primitive;25th layer: Quan Lian Layer is connect, input vector length is 32, and output vector length is 3, and activation primitive is soft-max activation primitive;All convolutional layers Parameter is size=3 convolution kernel kernel, and step-length stride=(1,1), activation primitive is relu activation 
primitive;All pond layers It is maximum pond layer, parameter is pond section size kernel_size=2, step-length stride=(2,2).
Initializing the dynamic action recognition classifier with hand action videos proceeds as follows. First step: construct the data set; from each video in the hand action video set constructed in the first step of the static action recognition classifier initialization, uniformly extract 10 frame images as input. Second step: construct the dynamic action recognition classifier DynamicN. Third step: initialize the dynamic action recognition classifier DynamicN; its input is the set of 10-frame extractions produced by the first step for each video. If the 10 frame images input each time are Handv, the output is DynamicN(Handv) and the class label is y_Handv, defined as taking out an article: y_Handv = [1,0,0,0,0]; putting back an article: y_Handv = [0,1,0,0,0]; taking out and putting back: y_Handv = [0,0,1,0,0]; taking out an article without putting it back: y_Handv = [0,0,0,1,0]; suspicious stealing: y_Handv = [0,0,0,0,1]. The evaluation function of the network computes the cross-entropy loss of (DynamicN(Handv) - y_Handv); the convergence direction is minimization and the number of iterations is 2000.
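The 256 × 256 × 30 input of DynamicN is naturally read as the 10 sampled RGB frames concatenated along the channel axis; that reading, and the resizing step, are assumptions of this short sketch.

import cv2
import numpy as np

def stack_frames(frames):
    """frames: list of 10 colour images -> one array of shape (256, 256, 30)."""
    resized = [cv2.resize(f, (256, 256)) for f in frames]   # each frame becomes 256 x 256 x 3
    return np.concatenate(resized, axis=2).astype(np.float32)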
The uniform extraction of 10 frames is done as follows: for a video of Nf frames, the 1st frame of the video is extracted as the 1st frame of the extracted set and the last frame of the video is extracted as the 10th frame of the extracted set; for ickt = 2 to 9, the ickt-th frame of the extracted set is taken from the corresponding evenly spaced position of the video, where ⌊·⌋ denotes taking the integer part.
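A short Python sketch of this uniform 10-frame extraction; the index formula used for the eight intermediate frames is an assumed even spacing standing in for the formula referred to above.

def sample_ten_frames(frames):
    # frames: list of images of one hand-action video, length Nf.
    # Keep the 1st and the last frame; the intermediate positions below are assumed.
    nf = len(frames)
    idx = [0] + [(i * (nf - 1)) // 9 for i in range(1, 9)] + [nf - 1]
    return [frames[i] for i in idx]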
The dynamic action recognition classifier DynamicN is constructed with the following network structure:
Layer 1: convolutional layer, input 256×256×30, output 256×256×512, channels=512;
Layer 2: convolutional layer, input 256×256×512, output 256×256×128, channels=128;
Layer 3: pooling layer, input 256×256×128, output 128×128×128;
Layers 4 to 24: identical to layers 4 to 24 of the static action recognition classifier StaticN described above;
Layer 25: fully connected layer, input vector length 32, output vector length 5 (one position per action class of yHandv), activation function soft-max.
All convolutional layers use kernel size kernel_size=3 and stride=(1,1) with the ReLU activation function; all pooling layers are max-pooling layers with pooling window size kernel_size=2 and stride=(2,2).
4. Product identification module: during initialization, the product identification classifier is first initialized using the product image sets taken from all angles, and a product list is generated from the product images.
The product identification classifier is first initialized with the product image sets of all angles, and a product list is generated from the product images, as follows. First step, construct the data set and the product list: the data set consists of the images of every product from all angles; the product list listGoods is a vector in which each position corresponds to one product name. Second step, construct the product identification classifier GoodsN. Third step, initialize GoodsN: its input is the product image set of all angles; if the input image is Goods, the output is GoodsN(Goods) and the class label is yGoods, a vector whose length equals the number of products in the product list. yGoods is represented as follows: if image Goods shows the product at position iGoods of the list, then position iGoods of yGoods is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed on (GoodsN(Goods), yGoods); the convergence direction is minimization and the number of iterations is 2000.
The product identification classifier GoodsN is constructed from two sub-networks, GoodsN1 and GoodsN2. The network structure of GoodsN1 is:
Layer 1: convolutional layer, input 256×256×3, output 256×256×64, channels=64;
Layer 2: convolutional layer, input 256×256×64, output 256×256×128, channels=128;
Layer 3: pooling layer, input 256×256×128, output 128×128×128;
Layers 4 to 21: identical to layers 4 to 21 of the static action recognition classifier StaticN described above, ending with the convolutional layer whose output is 4×4×128 with channels=128.
All convolutional layers use kernel size kernel_size=3 and stride=(1,1) with the ReLU activation function; all pooling layers are max-pooling layers with pooling window size kernel_size=2 and stride=(2,2).
The network structure of GoodsN2 is: its input is 4×4×128, which is first flattened into a 2048-dimensional vector and fed to the first layer. Layer 1: fully connected layer, input vector length 2048, output vector length 1024, activation function ReLU; Layer 2: fully connected layer, input vector length 1024, output vector length 1024, activation function ReLU; Layer 3: fully connected layer, input vector length 1024, output vector length len(listGoods), activation function soft-max, where len(listGoods) denotes the length of the product list. For any input Goods2, GoodsN(Goods2) = GoodsN2(GoodsN1(Goods2)).
5. Individual identification module: during initialization, the face feature extractor FaceN is first initialized with the face image sets of all angles and μface is computed; then the human body feature extractor BodyN is initialized with the human body images of all angles and μbody is computed.
The face feature extractor FaceN is initialized with the face image sets of all angles and μface is computed as follows. First step, choose the face image sets of all angles to form the face data set. Second step, construct the face feature extractor FaceN and initialize it with the face data set. Third step:
For every person iPeop in the face data set, take the set FaceSet(iPeop) of all face images in the face data set that belong to iPeop:
For each face image Face(jiPeop) in FaceSet(iPeop):
Compute the face feature FaceN(Face(jiPeop));
Take the average of all face features in the current face image set FaceSet(iPeop) as the center of the current face images, center(FaceN(Face(jiPeop))); compute the distance between every face feature in FaceSet(iPeop) and this center; these distances form the distance set corresponding to iPeop.
For every person in the face data set, obtain the corresponding distance set and sort it in ascending order; let the length of the distance set be ndiset; μface is then obtained from the sorted distance sets, where ⌊·⌋ in the corresponding formula denotes taking the integer part.
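A Python sketch of the per-person center and distance-set computation described above; the final rule that turns the sorted distance sets into μface is not reproduced in the text, so the quantile used below is only an assumed stand-in.

import numpy as np

def distance_sets(features_by_person):
    # features_by_person: {person_id: list of FaceN feature vectors of that person}
    sets = {}
    for pid, feats in features_by_person.items():
        feats = np.asarray(feats, dtype=np.float64)
        center = feats.mean(axis=0)                                   # centre of the person's features
        sets[pid] = np.sort(np.linalg.norm(feats - center, axis=1))   # ascending distance set
    return sets

def mu_face_from(sets, q=0.9):
    # Assumed stand-in for the elided selection formula: a high quantile of the pooled distances.
    pooled = np.sort(np.concatenate(list(sets.values())))
    return float(pooled[int(q * (len(pooled) - 1))])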
The face feature extractor FaceN is constructed and initialized with the face data set as follows. Let the face data set contain Nfaceset individuals. The network FaceN25 has the following structure:
Layers 1 to 22: identical to layers 1 to 22 of the static action recognition classifier StaticN described above;
Layer 23: fully connected layer; the 2×2×128 input is first flattened into a 512-dimensional vector and then fed to the fully connected layer, output vector length 512, activation function ReLU;
Layer 24: fully connected layer, input vector length 512, output vector length 512, activation function ReLU;
Layer 25: fully connected layer, input vector length 512, output vector length Nfaceset, activation function soft-max.
All convolutional layers use kernel size kernel_size=3 and stride=(1,1) with the ReLU activation function; all pooling layers are max-pooling layers with pooling window size kernel_size=2 and stride=(2,2). The initialization procedure is: for each face image face4, the output is FaceN25(face4) and the class label is yface, a vector of length Nfaceset; if face4 belongs to the iface4-th person in the face image set, position iface4 of yface is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed on (FaceN25(face4), yface); the convergence direction is minimization and the number of iterations is 2000. After the iterations, the face feature extractor FaceN is the FaceN25 network from layer 1 to layer 24.
The human body feature extractor BodyN is initialized with the human body images of all angles and μbody is computed as follows. First step, choose the human body image sets of all angles to form the body data set. Second step, construct the human body feature extractor BodyN and initialize it with the body data set. Third step:
For every person iPeop1 in the body data set, take the set BodySet(iPeop1) of all human body images in the body data set that belong to iPeop1:
For each human body image Body(jiPeop1) in BodySet(iPeop1):
Compute the human body feature BodyN(Body(jiPeop1));
Take the average of all human body features in the current human body image set BodySet(iPeop1) as the center of the current human body images, center(BodyN(Body(jiPeop1))); compute the distance between every human body feature in BodySet(iPeop1) and this center; these distances form the distance set corresponding to iPeop1.
For every person in the body data set, obtain the corresponding distance set and sort it in ascending order; let the length of the distance set be ndiset1; μbody is then obtained from the sorted distance sets, where ⌊·⌋ in the corresponding formula denotes taking the integer part.
The human body feature extractor BodyN is constructed and initialized with the body data set as follows. Let the body data set contain Nbodyset individuals. The network BodyN25 has the same layer structure as FaceN25 described above (layers 1 to 22 as in StaticN, followed by two fully connected layers of output length 512 with ReLU activation), except that the output vector length of layer 25 is Nbodyset with soft-max activation. All convolutional layers use kernel size kernel_size=3 and stride=(1,1) with the ReLU activation function; all pooling layers are max-pooling layers with pooling window size kernel_size=2 and stride=(2,2). The initialization procedure is: for each human body image body4, the output is BodyN25(body4) and the class label is ybody, a vector of length Nbodyset; if body4 belongs to the ibody4-th person in the human body image set, position ibody4 of ybody is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed on (BodyN25(body4), ybody); the convergence direction is minimization and the number of iterations is 2000. After the iterations, the human body feature extractor BodyN is the BodyN25 network from layer 1 to layer 24.
6. recognition result processing module does not work in initialization.
Embodiment 2:
The present embodiment realizes a kind of detection process of supermarket's intelligence vending system.
1. Image preprocessing module, during detection: first step, apply mean denoising to the monitoring image taken by the monitoring camera to obtain the denoised monitoring image; second step, apply illumination compensation to the denoised monitoring image to obtain the illumination-compensated image; third step, apply image enhancement to the illumination-compensated image and pass the enhanced data to the target detection module.
The mean denoising of the monitoring image taken by the monitoring camera is done as follows. Let the monitoring image taken by the camera be Xsrc. Because Xsrc is a color RGB image, it has three components Xsrc-R, Xsrc-G, Xsrc-B. Each component Xsrc′ is processed separately: first set a 3×3 window; for each pixel Xsrc′(i, j) of Xsrc′, sort the pixel values of the 3×3 matrix centered on that point, [Xsrc′(i-1, j-1), Xsrc′(i-1, j), Xsrc′(i-1, j+1), Xsrc′(i, j-1), Xsrc′(i, j), Xsrc′(i, j+1), Xsrc′(i+1, j-1), Xsrc′(i+1, j), Xsrc′(i+1, j+1)], from large to small and take the value that comes in the middle as the value of pixel (i, j) of the denoised image Xsrc″, assigning it to Xsrc″(i, j). For boundary points of Xsrc′, some pixels of the 3×3 window do not exist; in that case only the median of the pixels that do fall inside the window is computed, and if the window contains an even number of points, the average of the two middle values is taken as the denoised pixel value and assigned to Xsrc″(i, j). The new image matrix Xsrc″ is then the denoised image matrix of the current RGB component. After the three components Xsrc-R, Xsrc-G, Xsrc-B have been denoised separately, the resulting components Xsrc-R″, Xsrc-G″, Xsrc-B″ are combined into a new color image XDen, which is the image obtained after denoising.
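A Python/NumPy sketch of the per-channel 3×3 denoising described above (taking the middle value of the window, or the average of the two middle values when only an even number of pixels exists at the border); the function name is illustrative.

import numpy as np

def denoise(img):
    # img: H x W x 3 RGB image.
    out = np.empty_like(img)
    h, w, ch = img.shape
    for c in range(ch):
        plane = img[:, :, c]
        for i in range(h):
            for j in range(w):
                win = plane[max(i - 1, 0):i + 2, max(j - 1, 0):j + 2]
                out[i, j, c] = np.median(win)   # middle value; np.median averages the two middle values for even counts
    return out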
The illumination compensation of the denoised monitoring image is done as follows. Let the denoised monitoring image be XDen. Because XDen is a color RGB image, it has three RGB components; illumination compensation is applied to each component XDen′ separately, and the resulting components Xcpst′ are then combined into the color RGB image Xcpst, which is XDen after illumination compensation. The steps for applying illumination compensation to one component XDen′ are: first step, let XDen′ have m rows and n columns; construct XDensum and NumDen as matrices of the same m rows and n columns with initial value 0; the window size l and the step length s are computed from m and n by the given formulas, where the function min(m, n) takes the minimum of m and n, ⌊·⌋ denotes taking the integer part, sqrt(l) denotes the square root of l, and l = 1 if l < 1. Second step, let the top-left coordinate of XDen be (1,1); starting from (1,1), determine each candidate frame according to the window size l and step length s; a candidate frame is the region delimited by [(a, b), (a+l, b+l)]. For the image matrix of XDen′ inside the candidate frame region, perform histogram equalization to obtain the equalized image matrix XDen″ of the candidate region [(a, b), (a+l, b+l)]; then, for each element of XDensum in the corresponding region [(a, b), (a+l, b+l)], compute XDensum(a+iXsum, b+jXsum) = XDensum(a+iXsum, b+jXsum) + XDen″(iXsum, jXsum), where (iXsum, jXsum) are integers with 1 ≤ iXsum ≤ l and 1 ≤ jXsum ≤ l, and add 1 to each element of NumDen in the corresponding region [(a, b), (a+l, b+l)]. Finally, compute Xcpst(iXsumNum, jXsumNum) = XDensum(iXsumNum, jXsumNum) / NumDen(iXsumNum, jXsumNum) for every point (iXsumNum, jXsumNum) of XDen; the resulting Xcpst is the illumination compensation of the current component XDen′.
Each candidate frame is determined according to the window size l and step length s as follows:
If the monitoring image has m rows and n columns, (a, b) is the top-left coordinate of the selected region and (a+l, b+l) is its bottom-right coordinate, so the region is denoted [(a, b), (a+l, b+l)]; the initial value of (a, b) is (1, 1);
While a + l ≤ m:
b = 1;
While b + l ≤ n:
the selected region is [(a, b), (a+l, b+l)];
b = b + s;
the inner loop ends;
a = a + s;
the outer loop ends;
In the above process, every selected region [(a, b), (a+l, b+l)] is a candidate frame.
The histogram equalization of the image matrix of XDen′ inside a candidate frame region is done as follows. Let the candidate frame region be the area delimited by [(a, b), (a+l, b+l)] and let XDen″ be the image information of XDen′ in the region [(a, b), (a+l, b+l)]. First step, construct the vector I, where I(iI) is the number of pixels of XDen″ whose value equals iI, 0 ≤ iI ≤ 255. Second step, compute the mapping vector I′ from I according to the corresponding formula. Third step, for each point (iXDen, jXDen) of XDen″ with pixel value XDen″(iXDen, jXDen), compute XDen″(iXDen, jXDen) = I′(XDen″(iXDen, jXDen)). The histogram equalization process ends once the values of all pixels of the image XDen″ have been recomputed; the result of the histogram equalization is then stored in XDen″.
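A Python sketch of the sliding-window illumination compensation built from the candidate frames and the per-window histogram equalization described above; since the exact mapping formula for I′ is not reproduced in the text, the standard cumulative-histogram mapping is used here as an assumption.

import numpy as np

def equalize_window(win):
    # Histogram equalization of one l x l window; the mapping is an assumed stand-in for I'.
    hist = np.bincount(win.ravel().astype(np.int64), minlength=256)
    cdf = np.cumsum(hist)
    lut = np.floor(255.0 * cdf / win.size).astype(np.uint8)
    return lut[win.astype(np.int64)]

def compensate_channel(ch, l, s):
    # Slide an l x l window with step s, equalize each window, accumulate the results
    # (XDensum) and the visit counts (NumDen), then divide point by point.
    m, n = ch.shape
    acc = np.zeros((m, n))
    num = np.zeros((m, n))
    a = 0
    while a + l <= m:
        b = 0
        while b + l <= n:
            acc[a:a + l, b:b + l] += equalize_window(ch[a:a + l, b:b + l])
            num[a:a + l, b:b + l] += 1
            b += s
        a += s
    num[num == 0] = 1
    return acc / num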
The image enhancement of the illumination-compensated image is done as follows. Let the illumination-compensated image be Xcpst with RGB channels XcpstR, XcpstG, XcpstB, and let Xenh be the image obtained from Xcpst after image enhancement. The enhancement steps are: first step, for all channels XcpstR, XcpstG, XcpstB of Xcpst, compute the image blurred at the specified scale. Second step, construct matrices LXenhR, LXenhG, LXenhB with the same dimensions as XcpstR; for the R channel of the RGB channels of Xcpst, compute LXenhR(i, j) = log(XcpstR(i, j)) − LXcpstR(i, j), where (i, j) ranges over all points of the image matrix; obtain LXenhG and LXenhB for the G and B channels using the same algorithm as for the R channel. Third step, for the R channel of Xcpst, compute the mean MeanR and the mean square deviation VarR of all values of LXenhR (note that this is the mean square deviation), compute MinR = MeanR − 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute XenhR(i, j) = Fix((LXenhR(i, j) − MinR)/(MaxR − MinR) × 255), where Fix denotes taking the integer part, values < 0 are set to 0 and values > 255 are set to 255; obtain XenhG and XenhB for the G and B channels using the same algorithm as for the R channel, and combine XenhR, XenhG, XenhB of the RGB channels into one color image Xenh.
The computation, for all channels XcpstR, XcpstG, XcpstB of Xcpst, of the image blurred at the specified scale is done as follows for the R channel XcpstR. First step, define the Gaussian function G(x, y, σ) = k × exp(−(x² + y²)/σ²), where σ is the scale parameter and k = 1/∫∫G(x, y)dxdy; then, for each point XcpstR(i, j) of XcpstR, compute LXcpstR according to the corresponding formula, where ⊗ denotes the convolution operation; for points whose distance to the boundary is smaller than the scale σ, only the convolution of XcpstR with the part of G(x, y, σ) that falls inside the image is computed; Fix() denotes taking the integer part, values < 0 are set to 0 and values > 255 are set to 255. For the G and B channels of the RGB channels, the same algorithm as for the R channel is used to update XcpstG and XcpstB.
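A Python sketch of the per-channel enhancement of the two paragraphs above; it reads the blurred term LXcpst as the logarithm of a Gaussian-blurred copy of the channel, which is one plausible reading of the elided formula, and the scale value sigma is an assumption.

import cv2
import numpy as np

def enhance_channel(ch, sigma=80.0):
    # ch: one channel of Xcpst.
    ch = ch.astype(np.float64) + 1.0                 # avoid log(0)
    blurred = cv2.GaussianBlur(ch, (0, 0), sigma)    # blur at the specified scale
    lx = np.log(ch) - np.log(blurred)                # LXenh for this channel (assumed reading)
    mean, var = lx.mean(), lx.std()                  # mean and mean square deviation
    lo, hi = mean - 2 * var, mean + 2 * var          # MinR / MaxR
    out = np.fix((lx - lo) / (hi - lo) * 255.0)
    return np.clip(out, 0, 255).astype(np.uint8)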
2. Target detection module: during detection it receives the image passed from the image preprocessing module and processes it; target detection is applied to each frame with the target detection algorithm to obtain the human body image regions, face regions, hand regions and product regions of the current image; the hand regions and product regions are then sent to the shopping action recognition module, the human body image regions and face regions are sent to the individual identification module, and the product regions are passed to the product identification module;
Described carries out target detection using algorithm of target detection to each frame image, obtains the human body image of present image Region, face facial area, hand region and product area, the steps include:
The first step, by input picture XcpstIt is divided into the subgraph of 768 × 1024 dimensions;
Second step, for each subgraph Xs:
2.1st step is converted using the feature extraction depth network Fconv constructed in initialization, obtains 512 spies Levy subgraph set Fconv (Xs);
2.2nd step, to Fconv (Xs) using area selection network in first layer Conv1, second layer Conv2-1+softmax Activation primitive and Conv2-2Into transformation, output softmax (Conv is respectively obtained2-1(Conv1(Fconv(Xs)))) and Conv2-2 (Conv1(Fconv(Xs))), all preliminary candidate sections in the section are then obtained according to output valve;
2.3rd step, for all preliminary candidate sections of all subgraphs of current frame image:
Step 2.3.1, rank the preliminary candidate sections by the score of the current candidate region and keep the 50 preliminary candidate sections with the highest scores as candidate sections;
Step 2.3.2, adjust all out-of-bounds candidate sections in the candidate section set, then weed out the overlapping frames in the candidate sections to obtain the final candidate sections;
Step 2.3.3, input the sub-image Xs and each final candidate section into the ROI layer to obtain the corresponding ROI output. If the current final candidate section is (aBB(1), bBB(2), lBB(3), wBB(4)), compute FBBox(Fc2(ROI)) to obtain the four outputs (OutBB(1), OutBB(2), OutBB(3), OutBB(4)) and hence the updated coordinates (aBB(1)+8×OutBB(1), bBB(2)+8×OutBB(2), lBB(3)+8×OutBB(3), wBB(4)+8×OutBB(4)); then compute the output of FClass(Fc2(ROI)): if the first position of the output is the largest, the current section is a human body image region; if the second position is the largest, it is a face region; if the third position is the largest, it is a hand region; if the fourth position is the largest, it is a product region; and if the fifth position is the largest, the current section is a negative sample region and the final candidate section is deleted. Third step, update the coordinates of the refined final candidate sections of all sub-images; the update method is: if the coordinates of the current candidate region are (TLx, TLy, RBx, RBy) and the top-left coordinate of the corresponding sub-image is (Seasub, Sebsub), the updated coordinates are (TLx+Seasub−1, TLy+Sebsub−1, RBx, RBy).
The division of the input image Xcpst into sub-images of dimension 768×1024 is done as follows (a tiling sketch in Python follows this loop). Let the segmentation step lengths be 384 and 512, let the image have m rows and n columns, and let (asub, bsub) be the top-left coordinate of the selected region, with initial value (1, 1):
While asub < m:
bsub = 1;
While bsub < n:
the selected region is [(asub, bsub), (asub+768, bsub+1024)]; copy the information of the image region of the input image Xcpst corresponding to this section into a new sub-image and attach the top-left coordinate (asub, bsub) as its location information; if the selected region extends beyond the input image Xcpst, the RGB pixel values of the pixels outside the range are set to 0;
bsub = bsub + 512;
the inner loop ends;
asub = asub + 384;
the outer loop ends;
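A Python sketch of the tiling loop above: 768×1024 sub-images taken with steps 384 and 512, zero-filled where they extend beyond the input image, each carrying its top-left coordinate; the names are illustrative.

import numpy as np

def split_into_subimages(img, sub_h=768, sub_w=1024, step_h=384, step_w=512):
    # Returns a list of (top_left, sub_image) pairs.
    m, n = img.shape[:2]
    subs = []
    a = 0
    while a < m:
        b = 0
        while b < n:
            tile = np.zeros((sub_h, sub_w) + img.shape[2:], dtype=img.dtype)
            part = img[a:a + sub_h, b:b + sub_w]
            tile[:part.shape[0], :part.shape[1]] = part      # pixels outside the input stay 0
            subs.append(((a + 1, b + 1), tile))               # 1-based top-left coordinate, as in the text
            b += step_w
        a += step_h
    return subs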
The preliminary candidate sections of a sub-image are obtained from the output values as follows. Step one: the output of softmax(Conv2-1(Conv1(Fconv(Xs)))) is 48×64×18 and the output of Conv2-2(Conv1(Fconv(Xs))) is 48×64×36. For any point (x, y) of the 48×64 space, softmax(Conv2-1(Conv1(Fconv(Xs))))(x, y) is an 18-dimensional vector II and Conv2-2(Conv1(Fconv(Xs)))(x, y) is a 36-dimensional vector IIII. For i from 1 to 9, if II(2i−1) > II(2i), let lOtr be the third position of Roi(xOtr, yOtr) and wOtr the fourth position of Roi(xOtr, yOtr); the preliminary candidate section is then [II(2i−1), (8×IIII(4i−3)+x, 8×IIII(4i−2)+y, lOtr×IIII(4i−1), wOtr×IIII(4i))], where the first element II(2i−1) is the score of the current candidate region and the second element indicates that the center point of the current candidate section is (8×IIII(4i−3)+x, 8×IIII(4i−2)+y) and that the half-length and half-width of the candidate frame are lOtr×IIII(4i−1) and wOtr×IIII(4i) respectively.
The adjustment of all out-of-bounds candidate sections in the candidate section set is done as follows. Let the monitoring image have m rows and n columns. For each candidate section with center point [(ach, bch)] and half-length and half-width lch and wch: if ach + lch > m, compute the adjusted values a′ch and l′ch by the corresponding formulas and then update ach = a′ch, lch = l′ch; if bch + wch > n, compute the adjusted values b′ch and w′ch by the corresponding formulas and then update bch = b′ch, wch = w′ch.
The weeding-out of the overlapping frames in the candidate sections proceeds as follows:
While the candidate section set is not empty:
take the candidate section iout with the largest score out of the candidate section set;
compute the coincidence factor of candidate section iout with every candidate section ic in the candidate section set; if the coincidence factor > 0.7, delete candidate section ic from the candidate section set;
put candidate section iout into the output candidate section set;
When the candidate section set is empty, the candidate sections contained in the output candidate section set form the candidate section set obtained after weeding out the overlapping frames.
The coincidence factor of candidate section iout with a candidate section ic of the candidate section set is computed as follows. Let candidate section ic have a coordinate section centered on the point [(aic, bic)] with half-length and half-width lic and wic, and let candidate section iout have a coordinate section centered on the point [(aiout, biout)] with half-length and half-width liout and wiout. Compute xA = max(aic, aiout); yA = max(bic, biout); xB = min(lic, liout), yB = min(wic, wiout). If |aic − aiout| ≤ lic + liout − 1 and |bic − biout| ≤ wic + wiout − 1, an overlapping region exists and overlapping region = (lic + liout − 1 − |aic − aiout|) × (wic + wiout − 1 − |bic − biout|); otherwise overlapping region = 0. Compute whole region = (2lic − 1) × (2wic − 1) + (2liout − 1) × (2wiout − 1) − overlapping region, and hence coincidence factor = overlapping region / whole region.
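A Python sketch of the coincidence factor and of the weeding-out procedure above (candidate sections given as centre point plus half-length and half-width); function names are illustrative.

def coincidence(box_a, box_b):
    # Boxes are (a, b, l, w): centre point (a, b), half-length l, half-width w.
    a1, b1, l1, w1 = box_a
    a2, b2, l2, w2 = box_b
    if abs(a1 - a2) <= l1 + l2 - 1 and abs(b1 - b2) <= w1 + w2 - 1:
        overlap = (l1 + l2 - 1 - abs(a1 - a2)) * (w1 + w2 - 1 - abs(b1 - b2))
    else:
        overlap = 0.0
    whole = (2 * l1 - 1) * (2 * w1 - 1) + (2 * l2 - 1) * (2 * w2 - 1) - overlap
    return overlap / whole

def weed_out_overlaps(candidates, threshold=0.7):
    # candidates: list of (score, box). Keep the highest-scoring section, drop every
    # remaining section whose coincidence with it exceeds the threshold, and repeat.
    kept = []
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [c for c in remaining if coincidence(best[1], c[1]) <= threshold]
    return kept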
3. Shopping action recognition module, during detection: first step, each received hand region is identified with the static action recognition classifier; the recognition method is: let the image input each time be Handp1; the output StaticN(Handp1) is a 3-dimensional vector; if its first position is the largest the action is recognized as grasping, if the second position is the largest it is recognized as putting down, and if the third position is the largest it is recognized as other. Second step, after a grasp action is recognized, target tracking is performed on the region corresponding to the current grasp action; when the recognition result of the static action recognition classifier for the next-frame tracking box of the current hand region is a put-down action, target tracking ends, and the video running from the frame where the grasp action was recognized to the frame where the put-down action was recognized is marked as a complete video, giving a continuous video of the hand action. If tracking is lost during tracking, the video running from the frame where the grasp action was recognized to the image before tracking was lost is taken as a video containing only a grasp action and is marked as a grasp-only video. If a put-down action is recognized whose region is not among the images obtained by target tracking, the grasp action belonging to this put-down action has been lost; the hand region corresponding to the current image is then taken as the end of a video, tracking is carried out backwards from the current frame with the target tracking method until tracking is lost, the frame following the lost frame is taken as the start frame of the video, and the video is marked as a put-down-only video. Third step, the complete videos obtained in the second step are identified with the dynamic action recognition classifier; the recognition method is: let the images input each time be Handv1; the output DynamicN(Handv1) is a 5-dimensional vector; if its first position is the largest the action is recognized as taking out a product, if the second position is the largest as putting back an article, if the third position is the largest as taking out and putting back again, if the fourth position is the largest as having taken out an article without putting it back, and if the fifth position is the largest as a suspicious stealing action. The recognition result is then sent to the recognition result processing module, the grasp-only and put-down-only videos are sent to the recognition result processing module, and the complete videos and grasp-only videos are sent to the product identification module and the individual identification module.
After a grasp action is recognized, target tracking of the region corresponding to the current grasp action is performed as follows. Let the image of the currently recognized grasp action be Hgrab; the current tracking region is the region corresponding to image Hgrab. First step, extract the ORB feature ORBHgrab of image Hgrab. Second step, compute the ORB features of the images corresponding to all hand regions in the frame following Hgrab to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes. Third step, compare ORBHgrab with each value of the ORB feature set by Hamming distance and choose the ORB feature with the smallest Hamming distance to ORBHgrab; if the similarity between the chosen ORB feature and ORBHgrab is > 0.85, where similarity = 1 − (Hamming distance of the two ORB features / ORB feature length), the hand region corresponding to the chosen ORB feature is the tracking box of image Hgrab in the next frame; otherwise, if the similarity < 0.85, tracking is lost.
Regarding the ORB feature: methods for extracting ORB features from an image are relatively mature and an implementation is available in the OpenCV computer vision library. Extracting the ORB features of a picture takes the current image as input and outputs several character strings of identical length, each of which represents one ORB feature.
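A Python/OpenCV sketch in the spirit of the tracking step above; since OpenCV returns several descriptors per region, the sketch averages the Hamming distances of the matched descriptor pairs, which is an adaptation rather than the patent's exact single-feature comparison.

import cv2

def orb_descriptors(img):
    # Extract ORB descriptors with OpenCV, as noted above.
    orb = cv2.ORB_create()
    _, desc = orb.detectAndCompute(img, None)
    return desc

def pick_tracking_box(track_desc, candidate_descs, thr=0.85):
    # Choose the candidate hand region closest in Hamming distance and accept it only if
    # similarity = 1 - distance / feature length exceeds thr.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    best_idx, best_sim = None, -1.0
    for idx, desc in enumerate(candidate_descs):
        if track_desc is None or desc is None:
            continue
        matches = matcher.match(track_desc, desc)
        if not matches:
            continue
        mean_dist = sum(m.distance for m in matches) / len(matches)
        sim = 1.0 - mean_dist / (track_desc.shape[1] * 8)    # descriptor length in bits
        if sim > best_sim:
            best_idx, best_sim = idx, sim
    return (best_idx, best_sim) if best_sim > thr else (None, best_sim)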
Taking the hand region corresponding to the current image as the end of a video and tracking backwards from the current frame with the target tracking method until tracking is lost is done as follows. Let the image of the currently recognized put-down action be Hdown; the current tracking region is the region corresponding to image Hdown.
While tracking is not lost:
First step, extract the ORB feature ORBHdown of image Hdown; since this feature was already computed while target tracking of the grasp-action region was carried out after a grasp action was recognized, as described above, it does not need to be computed again here;
Second step, compute the ORB features of the images corresponding to all hand regions in the frame preceding image Hdown to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes;
Third step, compare ORBHdown with each value of the ORB feature set by Hamming distance and choose the ORB feature with the smallest Hamming distance to ORBHdown; if the similarity between the chosen ORB feature and ORBHdown is > 0.85, where similarity = 1 − (Hamming distance of the two ORB features / ORB feature length), the hand region corresponding to the chosen ORB feature is the tracking box of image Hdown in the preceding frame; otherwise, if the similarity < 0.85, tracking is lost and the algorithm ends.
4. Product identification module, during detection: first step, for the complete videos and grasp-only videos passed from the shopping action recognition module, take the position obtained by the target detection module for the first frame of the current video and scan the input video images backwards, starting from the first frame of the current video, for that position until a frame is found in which the region is not blocked; the image of the region corresponding to that frame is finally used as the input of the product identification classifier, giving the recognition result of the current product. The recognition method is: let the image input each time be Goods1; the output GoodsN(Goods1) is a vector; if the igoods-th position of the vector is the largest, the current recognition result is the product at position igoods of the product list; the recognition result is sent to the recognition result processing module;
The backward scan of the input video images for the position obtained by the target detection module for the first frame of the current video, to find a frame in which the region is not blocked, is done as follows. Let the position obtained by the target detection module for the first frame of the current video be (agoods, bgoods, lgoods, wgoods), let the first frame of the current video be frame icrgs, and let the frame being processed be icr = icrgs. First step, obtain from the target detection module all detection regions Taskicr of frame icr. Second step, for each region frame (atask, btask, ltask, wtask) of Taskicr, compute its distance dgt = (atask − agoods)² + (btask − bgoods)² − (ltask + lgoods)² − (wtask + wgoods)². If no distance < 0 exists, the region corresponding to (agoods, bgoods, lgoods, wgoods) in frame icr is the detected frame in which the region is not blocked, and the algorithm ends. Otherwise, if a distance < 0 exists, record d(icr) = the minimum distance in the distance list d and set icr = icr − 1; if icr > 0 the algorithm jumps back to the first step, and if icr ≤ 0 the frame belonging to the record with the largest value in the distance list d is chosen, its region (agoods, bgoods, lgoods, wgoods) is taken as the detected frame in which the region is not blocked, and the algorithm ends.
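A Python sketch of this backward scan for a frame in which the product region is not blocked; names are illustrative.

def find_unblocked_frame(region, detections_per_frame, first_frame):
    # region: (a, b, l, w) obtained for the first frame of the video;
    # detections_per_frame[i] is the list of regions detected in frame i.
    a_g, b_g, l_g, w_g = region
    best_frame, best_d = first_frame, float("-inf")
    for i in range(first_frame, -1, -1):
        d_min = float("inf")
        for (a, b, l, w) in detections_per_frame[i]:
            d = (a - a_g) ** 2 + (b - b_g) ** 2 - (l + l_g) ** 2 - (w + w_g) ** 2
            d_min = min(d_min, d)
        if d_min >= 0:                      # no negative distance: the region is not blocked
            return i
        if d_min > best_d:                  # otherwise remember the least-blocked frame
            best_frame, best_d = i, d_min
    return best_frame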
5. Individual identification module, during detection: when a customer enters the supermarket, the target detection module obtains the image of the current human body region Body1 and the image Face1 of the face within the human body region; the human body feature extractor BodyN and the face feature extractor FaceN are then used to extract the human body feature BodyN(Body1) and the face feature FaceN(Face1); BodyN(Body1) is saved in the BodyFtu set, FaceN(Face1) is saved in the FaceFtu set, and the ID information of the current customer is saved. The ID information can be the customer's account in the supermarket or a non-repeating number randomly assigned when the customer enters the supermarket; it is used to distinguish different customers, and whenever a customer enters the supermarket their human body feature and face feature are extracted. When a customer moves a product in the supermarket, the corresponding human body region and face region are searched for according to the complete videos and grasp-only videos passed from the shopping action recognition module, and face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN is used to obtain the ID of the customer corresponding to the video currently passed by the shopping action recognition module.
The search for the corresponding human body region and face region according to the complete videos and grasp-only videos passed from the shopping action recognition module, and the use of face recognition or human body recognition with the face feature extractor FaceN and the human body feature extractor BodyN to obtain the ID of the customer corresponding to the video currently passed by the shopping action recognition module, proceed as follows: according to the video passed from the shopping action recognition module, the corresponding human body region and face region are looked for starting from the first frame of the video, until the algorithm ends or the last frame of the video has been processed:
Use the human body feature extractor BodyN and the face feature extractor FaceN on the corresponding human body region image Body2 and face region image Face2 to extract the human body feature BodyN(Body2) and the face feature FaceN(Face2);
Then face recognition is used first: compare FaceN(Face2) with all face features in the FaceFtu set by Euclidean distance dFace and select the feature of the FaceFtu set with the smallest Euclidean distance; let this feature be FaceN(Face3). If dFace < μface, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), whose ID is the ID corresponding to the video action passed by the shopping action recognition module, and the current identification process ends;
If dFace ≥ μface, the current individual cannot be identified by face recognition alone; then compare BodyN(Body2) with all human body features in the BodyFtu set by Euclidean distance dBody and select the feature of the BodyFtu set with the smallest Euclidean distance; let this feature be BodyN(Face3). If dBody + dFace < μface + μbody, the current human body image is identified as belonging to the customer of the human body image corresponding to BodyN(Face3), whose ID is the ID corresponding to the video action passed by the shopping action recognition module.
If the ID corresponding to the video action is still not found after all frames of the video have been processed, then, to avoid misidentifying the shopping subject and producing incorrect billing, the video currently passed by the shopping action recognition module is not processed further.
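A Python sketch of the identification rule above (face first, then the combined face-plus-body criterion); for brevity it assumes one stored feature per customer, whereas the FaceFtu and BodyFtu sets may hold several.

import numpy as np

def identify_customer(face_feat, body_feat, face_db, body_db, mu_face, mu_body):
    # face_db / body_db: {customer_id: stored feature vector}. Returns a customer ID or None.
    def nearest(feat, db):
        ids = list(db.keys())
        dists = [np.linalg.norm(feat - db[i]) for i in ids]
        k = int(np.argmin(dists))
        return ids[k], dists[k]

    face_id, d_face = nearest(face_feat, face_db)
    if d_face < mu_face:                      # face recognition alone is enough
        return face_id
    body_id, d_body = nearest(body_feat, body_db)
    if d_body + d_face < mu_face + mu_body:   # combined face + body criterion
        return body_id
    return None                               # undecided for this frame; try the next frame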
The search for the corresponding human body region and face region starting from the first frame of the video passed by the shopping action recognition module proceeds as follows. The video passed from the shopping action recognition module is processed from its first frame. Suppose frame ifRg is currently being processed; let the position obtained by the target detection module for the video in this frame be (aifRg, bifRg, lifRg, wifRg), let the set of human body regions obtained by the target detection module for this frame be BodyFrameSetifRg, and let the set of face regions be FaceFrameSetifRg. For each human body region (aBFSifRg, bBFSifRg, lBFSifRg, wBFSifRg) of BodyFrameSetifRg, compute its distance dgbt = (aBFSifRg − aifRg)² + (bBFSifRg − bifRg)² − (lBFSifRg − lifRg)² − (wBFSifRg − wifRg)²; the human body region with the smallest distance among all human body regions is the human body region corresponding to the current video. Let the position of the chosen human body region be (aBFS1, bBFS1, lBFS1, wBFS1); for each face region (aFFSifRg, bFFSifRg, lFFSifRg, wFFSifRg) of the face region set FaceFrameSetifRg, compute its distance dgft = (aBFS1 − aFFSifRg)² + (bBFS1 − bFFSifRg)² − (lBFS1 − lFFSifRg)² − (wBFS1 − wFFSifRg)²; the face region with the smallest distance among all face regions is the face region corresponding to the current video.
6. Recognition result processing module: during detection it integrates the received recognition results to generate the shopping list of each customer. First, the customer corresponding to the current shopping information is determined from the customer ID passed by the individual identification module, so the shopping list to be modified is the one numbered ID; then the product corresponding to the current customer's shopping action is determined from the recognition result passed by the product identification module, say product GoodA; then whether the current shopping action modifies the shopping cart is determined from the recognition result passed by the shopping action recognition module: if the action is recognized as taking out an article, product GoodA is added to shopping list ID with quantity 1; if it is recognized as putting back an article, product GoodA on shopping list ID is reduced by quantity 1; if it is recognized as "taken out and put back again" or "taken out an article without putting it back", the shopping list is not changed; and if the recognition result is "suspicious stealing", an alarm signal and the location information corresponding to the current video are sent to the supermarket monitoring.
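A Python sketch of this shopping-list bookkeeping; the data structure, action labels and function names are illustrative only.

def process_recognition(shopping_lists, customer_id, product, action, alarm):
    # shopping_lists: {customer_id: {product_name: quantity}};
    # alarm is a callback used for the "suspicious stealing" case.
    cart = shopping_lists.setdefault(customer_id, {})
    if action == "take_out":                                   # taking out an article: add 1
        cart[product] = cart.get(product, 0) + 1
    elif action == "put_back":                                 # putting back an article: subtract 1
        cart[product] = cart.get(product, 0) - 1
    elif action in ("take_out_and_put_back", "taken_out_not_put_back"):
        pass                                                   # the shopping list is not changed
    elif action == "suspicious_stealing":
        alarm(customer_id)                                     # alert supermarket monitoring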
Embodiment 3:
The present embodiment realizes a kind of process of the upgrading products list of supermarket's intelligence vending system.
1. This process uses only the product identification module. When the product list is changed: if a product is deleted, its images are deleted from the product image sets of all angles and its position in the product list is removed; if a product is added, the product images of all angles of the current product are put into the product image sets of all angles and the name of the newly added product is appended at the end of the product list; the product identification classifier is then updated with the new product image sets of all angles and the new product list.
The update of the product identification classifier with the new product image sets of all angles and the new product list is done as follows. First step, modify the network structure: for the newly constructed product identification classifier GoodsN′, the structure of GoodsN1′ is unchanged and identical to the GoodsN1 network structure at initialization; the structures of the first and second layers of the GoodsN2′ network remain unchanged, and the output vector length of the third layer becomes the length of the updated product list. Second step, initialize the newly constructed product identification classifier GoodsN′: its input is the new product image sets of all angles; if the input image is Goods3, the output is GoodsN′(Goods3) = GoodsN2′(GoodsN1(Goods3)) and the class label is yGoods3, a vector whose length equals the length of the updated product list; if image Goods3 shows the product at position iGoods of the list, position iGoods of yGoods3 is 1 and all other positions are 0. The evaluation function of the network is the cross-entropy loss computed on (GoodsN′(Goods3), yGoods3); the convergence direction is minimization, the parameter values in GoodsN1 remain unchanged during the initialization, and the number of iterations is 500.
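A minimal PyTorch sketch of this update under stated assumptions: GoodsN2′ is rebuilt with the new output length while GoodsN1 stays frozen; the layer sizes follow the description above, and everything else (optimizer, learning rate) is assumed.

import torch
import torch.nn as nn

def build_goodsn2_prime(new_list_len):
    # GoodsN2': same first two fully connected layers, new output length.
    return nn.Sequential(
        nn.Flatten(),                      # 4 x 4 x 128 -> 2048
        nn.Linear(2048, 1024), nn.ReLU(),
        nn.Linear(1024, 1024), nn.ReLU(),
        nn.Linear(1024, new_list_len))     # soft-max is applied inside the loss below

def prepare_update(goodsn1, new_list_len, lr=0.01):
    goodsn2_prime = build_goodsn2_prime(new_list_len)
    for p in goodsn1.parameters():         # parameter values in GoodsN1 remain unchanged
        p.requires_grad = False
    optimizer = torch.optim.SGD(goodsn2_prime.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()        # cross-entropy evaluation function
    return goodsn2_prime, optimizer, loss_fn   # run 500 training iterations with these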

Claims (7)

1. A supermarket intelligent vending system, characterized in that it takes as input the video images taken by the monitoring cameras fixed in the supermarket and on the shelves; it is composed of the following 6 functional modules: an image preprocessing module, a target detection module, a shopping action recognition module, a product identification module, an individual identification module and a recognition result processing module; the respective implementations of these 6 functional modules are as follows:
The image preprocessing module preprocesses the video images taken by the monitoring cameras: it first denoises the noise that may be contained in the input image, then applies illumination compensation to the denoised image, then applies image enhancement to the illumination-compensated image, and finally passes the enhanced data to the target detection module;
The target detection module performs target detection on the received image and detects the whole human body regions, face regions, hand regions and product regions in the current video image; the hand regions and product regions are then sent to the shopping action recognition module, the human body image regions and face regions are sent to the individual identification module, and the product regions are passed to the product identification module;
The shopping action recognition module performs static action recognition on the received hand region information, finds the start frame of a grasping video, then keeps identifying the action until the action of putting down the article is found as the end frame, and then identifies the video with the dynamic action recognition classifier to recognize whether the current action is taking out an article, putting back an article, taking out and putting back, having taken out an article without putting it back, or suspicious stealing; the recognition result is then sent to the recognition result processing module, and the grasp-only and put-down-only videos are sent to the recognition result processing module;
The product identification module identifies the videos of the received product regions, recognizes which product is currently being moved, and sends the recognition result to the recognition result processing module; the product identification module can also add or delete a product at any time;
The individual identification module identifies the received face regions and human body regions and, combining the face region and human body region information, identifies which individual in the entire supermarket the current individual is; the recognition result is then sent to the recognition result processing module;
The recognition result processing module integrates the received recognition results: it determines the customer corresponding to the current shopping information from the customer ID passed by the individual identification module, determines the product corresponding to the current customer's shopping action from the recognition result passed by the product identification module, and determines from the recognition result passed by the shopping action recognition module whether the current shopping action modifies the shopping cart, thereby obtaining the shopping list of the current customer; it raises an alarm for suspicious stealing behavior identified by the shopping action recognition module.
2. The supermarket intelligent vending system according to claim 1, characterized in that the image preprocessing module is implemented as follows:
In the initialization phase the module does not work; during detection: first step, apply mean denoising to the monitoring image taken by the monitoring camera to obtain the denoised monitoring image; second step, apply illumination compensation to the denoised monitoring image to obtain the illumination-compensated image; third step, apply image enhancement to the illumination-compensated image and pass the enhanced data to the target detection module;
The mean denoising of the monitoring image taken by the monitoring camera is done as follows. Let the monitoring image taken by the camera be Xsrc. Because Xsrc is a color RGB image, it has three components Xsrc-R, Xsrc-G, Xsrc-B. Each component Xsrc′ is processed separately: first set a 3×3 window; for each pixel Xsrc′(i, j) of Xsrc′, sort the pixel values of the 3×3 matrix centered on that point, [Xsrc′(i-1,j-1), Xsrc′(i-1,j), Xsrc′(i-1,j+1), Xsrc′(i,j-1), Xsrc′(i,j), Xsrc′(i,j+1), Xsrc′(i+1,j-1), Xsrc′(i+1,j), Xsrc′(i+1,j+1)], from large to small and take the value that comes in the middle as the value of pixel (i, j) of the denoised image Xsrc″, assigning it to Xsrc″(i,j). For boundary points of Xsrc′, some pixels of the 3×3 window do not exist; in that case only the median of the pixels that do fall inside the window is computed, and if the window contains an even number of points, the average of the two middle values is taken as the denoised pixel value and assigned to Xsrc″(i,j). The new image matrix Xsrc″ is then the denoised image matrix of the current RGB component. After the three components Xsrc-R, Xsrc-G, Xsrc-B have been denoised separately, the resulting components Xsrc-R″, Xsrc-G″, Xsrc-B″ are combined into a new color image XDen, which is the image obtained after denoising;
The illumination compensation of the denoised monitoring image is performed as follows: let the denoised monitoring image be XDen; since XDen is a colour RGB image it has three RGB components, and illumination compensation is carried out on each component XDen′ separately; the resulting components Xcpst′ are then combined into the colour RGB image Xcpst, which is XDen after illumination compensation. The steps for each component XDen′ are: first, let XDen′ have m rows and n columns; construct XDensum and NumDen as m-row, n-column matrices with all elements initialised to 0, and determine the window size l and the step length s (the window size is derived from min(m, n), where min(m, n) denotes the minimum of m and n and ⌊·⌋ denotes the integer part, and the step length is derived from sqrt(l), the square root of l; if l < 1 then l = 1). Second, with the top-left coordinate of XDen taken as (1, 1), candidate frames are determined from the window size l and the step length s starting at (1, 1); each candidate frame is the region bounded by [(a, b), (a + l, b + l)]. For the image matrix of XDen′ inside the candidate-frame region, histogram equalisation is performed, giving the equalised matrix XDen″ of the candidate region [(a, b), (a + l, b + l)]; then, for each element of XDensum in the region [(a, b), (a + l, b + l)], XDensum(a + iXsum, b + jXsum) = XDensum(a + iXsum, b + jXsum) + XDen″(iXsum, jXsum) is computed, where (iXsum, jXsum) are integers with 1 ≤ iXsum ≤ l and 1 ≤ jXsum ≤ l, and each element of NumDen in the region [(a, b), (a + l, b + l)] is incremented by 1. Finally, Xcpst′(iXsumNum, jXsumNum) = XDensum(iXsumNum, jXsumNum) / NumDen(iXsumNum, jXsumNum) is computed for every point (iXsumNum, jXsumNum) of XDen, and the resulting Xcpst′ is the present component XDen′ after illumination compensation;
The determination of each candidate frame from the window size l and the step length s proceeds as follows (a sketch of this sliding-window loop is given after the list):
Let the monitoring image have m rows and n columns; (a, b) is the top-left coordinate of the selected region and (a + l, b + l) is its bottom-right coordinate; the region is denoted [(a, b), (a + l, b + l)], and the initial value of (a, b) is (1, 1);
While a + l ≤ m:
b = 1;
While b + l ≤ n:
the selected region is [(a, b), (a + l, b + l)];
b = b + s;
the inner loop ends;
a = a + s;
the outer loop ends;
Every region [(a, b), (a + l, b + l)] selected in the above process is a candidate frame;
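A compact sketch (illustrative Python) of the candidate-frame enumeration just described:

```python
def candidate_frames(m, n, l, s):
    """Enumerate the [(a, b), (a + l, b + l)] windows described above."""
    frames = []
    a = 1
    while a + l <= m:           # outer loop over rows
        b = 1
        while b + l <= n:       # inner loop over columns
            frames.append(((a, b), (a + l, b + l)))
            b += s
        a += s
    return frames
```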
The histogram equalisation of the image matrix of XDen′ inside a candidate-frame region is performed as follows: let the candidate-frame region be the area bounded by [(a, b), (a + l, b + l)] and let XDen″ be the image information of XDen′ inside the region [(a, b), (a + l, b + l)]. First, construct the vector I, where I(iI) is the number of pixels in XDen″ whose value equals iI, 0 ≤ iI ≤ 255. Second, compute from I the mapping vector I′ (the cumulative distribution of I rescaled to the range 0–255). Third, for each point (iXDen, jXDen) of XDen″ with pixel value XDen″(iXDen, jXDen), compute XDen″(iXDen, jXDen) = I′(XDen″(iXDen, jXDen)); once all pixel values of XDen″ have been computed and replaced, the histogram equalisation ends, and the values held in XDen″ are the result of the histogram equalisation;
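A minimal per-window histogram-equalisation sketch consistent with the steps above (the exact normalisation formula for I′ is not reproduced in the text, so the standard cumulative-distribution mapping scaled to 0–255 is assumed; names are illustrative):

```python
import numpy as np

def equalize_window(x_win):
    """Histogram-equalise one window of 8-bit pixel values."""
    hist = np.bincount(x_win.ravel().astype(np.int64), minlength=256)  # vector I
    cdf = np.cumsum(hist)
    # Assumed mapping I': cumulative distribution rescaled to 0..255.
    lut = np.round(cdf * 255.0 / cdf[-1]).astype(np.uint8)
    return lut[x_win.astype(np.int64)]
```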
The image enhancement of the illumination-compensated image is performed as follows: let the image after illumination compensation be Xcpst, with RGB channels XcpstR, XcpstG and XcpstB; the image obtained from Xcpst after image enhancement is Xenh. The enhancement steps are: first, for each of the channels XcpstR, XcpstG and XcpstB of Xcpst, compute the image blurred at the specified scale; second, construct LXenhR, LXenhG and LXenhB as matrices of the same dimensions as XcpstR, and for the R channel of the RGB channels of Xcpst compute LXenhR(i, j) = log(XcpstR(i, j)) − LXcpstR(i, j), where (i, j) ranges over all points of the image matrix; the G and B channels of the RGB channels are processed with the same algorithm as the R channel to obtain LXenhG and LXenhB. Third, for the R channel of the RGB channels compute the mean MeanR and the mean-square deviation VarR of all values in LXenhR (note: the mean-square deviation), compute MinR = MeanR − 2 × VarR and MaxR = MeanR + 2 × VarR, and then compute XenhR(i, j) = Fix((LXenhR(i, j) − MinR)/(MaxR − MinR) × 255), where Fix denotes taking the integer part; values < 0 are set to 0 and values > 255 are set to 255. The G and B channels of the RGB channels are processed with the same algorithm as the R channel to obtain XenhG and XenhB, and the channels XenhR, XenhG and XenhB belonging to the RGB channels are combined into a colour image Xenh;
The computation, for each of the channels XcpstR, XcpstG and XcpstB of Xcpst, of the image blurred at the specified scale is performed as follows (for the R channel XcpstR of the RGB channels): first, define the Gaussian function G(x, y, σ) = k × exp(−(x² + y²)/σ²), where σ is the scale parameter and k = 1/∫∫G(x, y)dxdy; then for each point XcpstR(i, j) the blurred value LXcpstR(i, j) is computed from the convolution of XcpstR with G(x, y, σ), where ⊛ denotes the convolution operation; for points whose distance to the boundary is smaller than the scale σ, only the convolution of XcpstR with the corresponding part of G(x, y, σ) is computed; Fix() denotes taking the integer part, values < 0 are set to 0 and values > 255 are set to 255. The G and B channels of the RGB channels are processed with the same algorithm as the R channel to obtain their blurred images;
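A sketch of the single-scale enhancement described in the two preceding paragraphs (Gaussian blur, log-difference, then clipping at mean ± 2 × deviation). Whether the blurred image is log-transformed before the difference is not fully reproduced in the text, so the standard single-scale Retinex form is assumed; scipy's Gaussian filter stands in for the truncated boundary convolution:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def enhance_channel(x, sigma):
    """Single-scale enhancement of one channel, following the text above."""
    x = x.astype(np.float64) + 1.0                 # avoid log(0)
    blurred = gaussian_filter(x, sigma=sigma)      # LXcpst: blur at scale sigma
    lx = np.log(x) - np.log(blurred)               # LXenh = log(X) - log(blurred X)
    mean, dev = lx.mean(), lx.std()
    lo, hi = mean - 2.0 * dev, mean + 2.0 * dev
    out = np.floor((lx - lo) / (hi - lo) * 255.0)  # Fix(): integer part
    return np.clip(out, 0, 255).astype(np.uint8)

def enhance_rgb(x_cpst, sigma=80.0):
    """Apply the per-channel enhancement to an H x W x 3 image; sigma is illustrative."""
    return np.stack([enhance_channel(x_cpst[:, :, c], sigma) for c in range(3)], axis=2)
```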
3. A supermarket intelligent vending system according to claim 1, characterised in that the target detection module is implemented as follows:
During initialisation, images with manually calibrated human-body image regions, face regions, hand regions and product regions are used to initialise the parameters of the target detection algorithm. In the detection process, the module receives the images transmitted by the image pre-processing module and processes them: target detection is applied to each frame with the target detection algorithm to obtain the human-body image region, face region, hand region and product region of the current image; the hand region and product region are then sent to the shopping action recognition module, the human-body image region and face region are sent to the individual identification module, and the product region is sent to the product identification module;
The parameter initialisation of the target detection algorithm using images with calibrated human-body image regions, face regions, hand regions and product regions comprises: first step, construct the feature extraction depth network; second step, construct the region selection network; third step, for each image X in the database used when constructing the feature extraction depth network and each corresponding manually calibrated human region, pass them through the ROI layer, whose input is the image X and the region and whose output is of dimension 7 × 7 × 512; fourth step, construct the coordinate refining network;
The construction feature extracts depth network, which is deep learning network structure, network structure are as follows: first layer: Convolutional layer, inputting is 768 × 1024 × 3, and exporting is 768 × 1024 × 64, port number channels=64;The second layer: convolution Layer, inputting is 768 × 1024 × 64, and exporting is 768 × 1024 × 64, port number channels=64;Third layer: pond layer, Input first layer output 768 × 1024 × 64 is connected in third dimension with third layer output 768 × 1024 × 64, exports It is 384 × 512 × 128;4th layer: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, port number Channels=128;Layer 5: convolutional layer, inputting is 384 × 512 × 128, and exporting is 384 × 512 × 128, port number Channels=128;Layer 6: pond layer inputs the 4th layer of output 384 × 512 × 128 and layer 5 384 × 512 × 128 It is connected in third dimension, exporting is 192 × 256 × 256;Layer 7: convolutional layer, inputting is 192 × 256 × 256, defeated It is out 192 × 256 × 256, port number channels=256;8th layer: convolutional layer, inputting is 192 × 256 × 256, output It is 192 × 256 × 256, port number channels=256;9th layer: convolutional layer, inputting is 192 × 256 × 256, exports and is 192 × 256 × 256, port number channels=256;Tenth layer: pond layer inputs as layer 7 output 192 × 256 × 256 It is connected in third dimension with the 9th layer 192 × 256 × 256, exporting is 96 × 128 × 512;Eleventh floor: convolutional layer, Input is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;Floor 12: convolutional layer, it is defeated Entering is 96 × 128 × 512, and exporting is 96 × 128 × 512, port number channels=512;13rd layer: convolutional layer, input It is 96 × 128 × 512, exporting is 96 × 128 × 512, port number channels=512;14th layer: pond layer inputs and is Eleventh floor output 96 × 128 × 512 is connected in third dimension with the 13rd layer 96 × 128 × 512, export as 48 × 64×1024;15th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64 × 512, port number channels =512;16th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number channels= 512;17th layer: convolutional layer, inputting is 48 × 64 × 512, and exporting is 48 × 64 × 512, port number channels=512; 18th layer: pond layer inputs and exports 48 × 64 × 512 and the 17th layer 48 × 64 × 512 in third dimension for the 15th layer It is connected on degree, exporting is 48 × 64 × 1024;19th layer: convolutional layer, inputting is 48 × 64 × 1024, and exporting is 48 × 64 × 256, port number channels=256;20th layer: pond layer, inputting is 48 × 64 × 256, export as 24 × 62 × 256;Second eleventh floor: convolutional layer, inputting is 24 × 32 × 1024, and exporting is 24 × 32 × 256, port number channels= 256;Second Floor 12: pond layer, inputting is 24 × 32 × 256, and exporting is 12 × 16 × 256;23rd layer: convolutional layer, Input is 12 × 16 × 256, and exporting is 12 × 16 × 128, port number channels=128;24th layer: pond layer, it is defeated Entering is 12 × 16 × 128, and exporting is 6 × 8 × 128;25th layer: full articulamentum, first by 6 × 8 × 128 dimensions of input Data be launched into the vectors of 6144 dimensions, then input into full articulamentum, output vector length is 768, and activation primitive is Relu activation primitive;26th layer: full articulamentum, input vector length are 768, and output vector length is 96, activation primitive For relu activation primitive;27th 
layer: fully connected layer, input vector length 96, output vector length 2, activation function soft-max. The parameters of all convolutional layers are convolution kernel size kernel size = 3, step length stride = (1, 1), activation function relu; all pooling layers are max-pooling layers with pooling section size kernel_size = 2 and step length stride = (2, 2). Denote this depth network by Fconv27; for a colour image X, the feature map set obtained through the depth network is denoted Fconv27(X). The evaluation function of the network is the cross-entropy loss computed on (Fconv27(X) − y), the convergence direction is minimisation, and y is the class corresponding to the input. The database contains naturally collected images of passers-by and non-passers-by; each image is a colour image of 768 × 1024 dimensions, and the images are divided into two classes according to whether they contain a pedestrian; the number of iterations is 2000. After training, the first to the seventeenth layers are taken as the feature extraction depth network Fconv, and Fconv(X) denotes the output obtained by passing a colour image X through this network;
The construction of the region selection network: it receives the set Fconv(X) of 512 feature maps of size 48 × 64 produced by the feature extraction depth network Fconv. First, a convolutional layer gives Conv1(Fconv(X)); the parameters of this convolutional layer are convolution kernel size kernel size = 1, step length stride = (1, 1), input 48 × 64 × 512, output 48 × 64 × 512, channel number channels = 512. Conv1(Fconv(X)) is then fed separately into two convolutional layers Conv2-1 and Conv2-2. The structure of Conv2-1 is: input 48 × 64 × 512, output 48 × 64 × 18, channel number channels = 18; the output of this layer is Conv2-1(Conv1(Fconv(X))), to which the activation function softmax is applied to obtain softmax(Conv2-1(Conv1(Fconv(X)))). The structure of Conv2-2 is: input 48 × 64 × 512, output 48 × 64 × 36, channel number channels = 36. The network has two loss functions: the first error function loss1 computes the softmax error of Wshad-cls(X) ⊙ (Conv2-1(Conv1(Fconv(X))) − Wcls(X)); the second error function loss2 computes the smooth L1 error of Wshad-reg(X) ⊙ (Conv2-2(Conv1(Fconv(X))) − Wreg(X)). The loss function of the region selection network = loss1/sum(Wcls(X)) + loss2/sum(Wcls(X)), where sum() denotes the sum of all elements of a matrix; the convergence direction is minimisation. Wcls(X) and Wreg(X) are the positive/negative sample information corresponding to database image X, and ⊙ denotes element-wise multiplication of matrices. Wshad-cls(X) and Wshad-reg(X) are masks whose role is to select the entries with weight 1 for training, so as to prevent too large a gap between the numbers of positive and negative samples; Wshad-cls(X) and Wshad-reg(X) are regenerated at each iteration; the algorithm iterates 1000 times;
The construction feature extracts database used in depth network, for each image in database, first Step: each human body image-region, face facial area, hand region and product area are manually demarcated, if it is in input picture Centre coordinate is (abas_tr,bbas_tr), centre coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate is in cross It is w to the distance apart from left and right side framebas_tr, then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth is Indicate round numbers part;The Two steps: positive negative sample is generated at random;
The positive negative sample of generation at random, method are as follows: the first step constructs 9 regional frames, second step, for database Each image XtrIf WclsFor 48 × 64 × 18 dimensions, WregFor 48 × 64 × 36 dimensions, all initial values are 0, to Wcls And WregIt is filled;
The construction of the 9 region frames: the 9 region frames are Ro1(xRo, yRo) = (xRo, yRo, 64, 64), Ro2(xRo, yRo) = (xRo, yRo, 45, 90), Ro3(xRo, yRo) = (xRo, yRo, 90, 45), Ro4(xRo, yRo) = (xRo, yRo, 128, 128), Ro5(xRo, yRo) = (xRo, yRo, 90, 180), Ro6(xRo, yRo) = (xRo, yRo, 180, 90), Ro7(xRo, yRo) = (xRo, yRo, 256, 256), Ro8(xRo, yRo) = (xRo, yRo, 360, 180), Ro9(xRo, yRo) = (xRo, yRo, 180, 360). For each region frame, Roi(xRo, yRo) denotes the ith region frame with centre coordinate (xRo, yRo); the third position denotes the pixel distance from the centre point to the upper and lower frame edges, and the fourth position denotes the pixel distance from the centre point to the left and right frame edges; i takes values from 1 to 9;
It is described to WclsAnd WregIt is filled, method are as follows:
For the body compartments that each is manually demarcated, if it is (a in the centre coordinate of input picturebas_tr,bbas_tr), center Coordinate is l in the distance of fore-and-aft distance upper and lower side framebas_tr, centre coordinate is w in the distance of lateral distance left and right side framebas_tr, Then it corresponds to Conv1Position be that center coordinate isHalf is a length ofHalf-breadth For
For the upper left cornerBottom right angular coordinateEach point in the section surrounded (xCtr,yCtr):
For i value from 1 to 9:
For point (xCtr,yCtr), it is upper left angle point (16 (x in the mapping range of database imagesCtr-1)+1,16(yCtr-1)+ 1) bottom right angle point (16xCtr,16yCtr) 16 × 16 sections that are surrounded, for each point (x in the sectionOtr,yOtr):
Calculate (xOtr,yOtr) corresponding to region Roi(xOtr,yOtr) with current manual calibration section coincidence factor;
Select the point (xIoUMax, yIoUMax) with the highest coincidence ratio in the current 16 × 16 section. If the coincidence ratio > 0.7, then Wcls(xCtr, yCtr, 2i-1) = 1 and Wcls(xCtr, yCtr, 2i) = 0, i.e. the point is a positive sample, and Wreg(xCtr, yCtr, 4i-3) = (xOtr − 16xCtr + 8)/8, Wreg(xCtr, yCtr, 4i-2) = (yOtr − 16yCtr + 8)/8, Wreg(xCtr, yCtr, 4i-1) = Down1(lbas_tr / third position of Roi), Wreg(xCtr, yCtr, 4i) = Down1(wbas_tr / fourth position of Roi), where Down1() means that a value greater than 1 is taken as 1. If the coincidence ratio < 0.3, then Wcls(xCtr, yCtr, 2i-1) = 0 and Wcls(xCtr, yCtr, 2i) = 1. Otherwise Wcls(xCtr, yCtr, 2i-1) = −1 and Wcls(xCtr, yCtr, 2i) = −1;
If the human region of current manual's calibration does not have the Ro of coincidence factor > 0.6i(xOtr,yOtr), then select the highest Ro of coincidence factori (xOtr,yOtr) to WclsAnd WregAssignment, assignment method are identical as the assignment method of coincidence factor > 0.7;
The computation of the coincidence ratio between the region Roi(xOtr, yOtr) corresponding to (xOtr, yOtr) and the manually calibrated section is as follows: let the manually calibrated body section have centre coordinate (abas_tr, bbas_tr) in the input image, with distance lbas_tr from the centre to the upper and lower frame edges and distance wbas_tr from the centre to the left and right frame edges; let the third position of Roi(xOtr, yOtr) be lOtr and its fourth position be wOtr. If |xOtr − abas_tr| ≤ lOtr + lbas_tr − 1 and |yOtr − bbas_tr| ≤ wOtr + wbas_tr − 1, the two regions overlap and the overlap region = (lOtr + lbas_tr − 1 − |xOtr − abas_tr|) × (wOtr + wbas_tr − 1 − |yOtr − bbas_tr|); otherwise the overlap region = 0. The whole region = (2lOtr − 1) × (2wOtr − 1) + (2lbas_tr − 1) × (2wbas_tr − 1) − overlap region. The coincidence ratio = overlap region / whole region, where | | denotes the absolute value;
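A sketch of this coincidence-ratio computation for two boxes given in the (centre row, centre column, half-height, half-width) form used throughout the claim:

```python
def coincidence_ratio(box_a, box_b):
    """Overlap / union for boxes in (a, b, l, w) centre/half-size form, as above."""
    a1, b1, l1, w1 = box_a
    a2, b2, l2, w2 = box_b
    if abs(a1 - a2) <= l1 + l2 - 1 and abs(b1 - b2) <= w1 + w2 - 1:
        overlap = (l1 + l2 - 1 - abs(a1 - a2)) * (w1 + w2 - 1 - abs(b1 - b2))
    else:
        overlap = 0
    whole = (2 * l1 - 1) * (2 * w1 - 1) + (2 * l2 - 1) * (2 * w2 - 1) - overlap
    return overlap / whole
```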
The construction of Wshad-cls(X) and Wshad-reg(X): for an image X with corresponding positive/negative sample information Wcls(X) and Wreg(X), the first step constructs Wshad-cls(X) and Wshad-reg(X), where Wshad-cls(X) has the same dimensions as Wcls(X) and Wshad-reg(X) has the same dimensions as Wreg(X). The second step records the information of all positive samples: for i = 1 to 9, if Wcls(X)(a, b, 2i-1) = 1, then Wshad-cls(X)(a, b, 2i-1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i-3) = 1, Wshad-reg(X)(a, b, 4i-2) = 1, Wshad-reg(X)(a, b, 4i-1) = 1 and Wshad-reg(X)(a, b, 4i) = 1; in total sum(Wshad-cls(X)) positive samples are selected, where sum() denotes the sum of all elements of a matrix; if sum(Wshad-cls(X)) > 256, 256 positive samples are retained at random. The third step selects negative samples at random: a triple (a, b, i) is chosen at random, and if Wcls(X)(a, b, 2i) = 1, then Wshad-cls(X)(a, b, 2i-1) = 1, Wshad-cls(X)(a, b, 2i) = 1, Wshad-reg(X)(a, b, 4i-3) = 1, Wshad-reg(X)(a, b, 4i-2) = 1, Wshad-reg(X)(a, b, 4i-1) = 1 and Wshad-reg(X)(a, b, 4i) = 1; the number of negative samples to be chosen is 256 − sum(Wshad-cls(X)); if fewer negative samples than this are available and 20 consecutive random draws of (a, b, i) fail to yield a further negative sample, the algorithm terminates;
The ROI layer takes as input an image X and a region; its method is as follows: the output Fconv(X) obtained by passing the image X through the feature extraction depth network Fconv has dimension 48 × 64 × 512; for each of the 512 matrices VROI_I of size 48 × 64, the part of VROI_I bounded by the upper-left and lower-right corners obtained by mapping the region onto the feature map is extracted, ⌊·⌋ denoting the integer part; the output roiI(X) has dimension 7 × 7, and the corresponding step lengths over the extracted part are determined accordingly (see the sketch after the loop below):
For iROI=1: to 7:
For jROI=1 to 7:
Construct section
roiI(X)(iROI,jROIThe value of maximum point in)=section;
When 512 48 × 64 matrix whole after treatments, output splicing is obtained into the output of 7 × 7 × 512 dimensionsParameter is indicated for image X, in regional frame ROI in range;
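A sketch of the ROI layer just described: the region is mapped onto the 48 × 64 feature map by the factor 16 and each of the 7 × 7 output cells is max-pooled; the exact rounding of the bin boundaries is not reproduced above, so an even split is assumed:

```python
import numpy as np

def roi_pool(fconv, region, out_size=7, stride=16):
    """Max-pool the part of the 48 x 64 x C feature map covered by `region`.

    fconv: feature map of shape (48, 64, C); region: (a, b, l, w) in image
    coordinates, centre row/col and half-height/half-width (assumed convention).
    """
    a, b, l, w = region
    top = max(int(np.ceil((a - l) / stride)), 0)
    left = max(int(np.ceil((b - w) / stride)), 0)
    bottom = min(int(np.floor((a + l) / stride)), fconv.shape[0] - 1)
    right = min(int(np.floor((b + w) / stride)), fconv.shape[1] - 1)
    rows = np.array_split(np.arange(top, bottom + 1), out_size)
    cols = np.array_split(np.arange(left, right + 1), out_size)
    out = np.zeros((out_size, out_size, fconv.shape[2]))
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            if r.size and c.size:
                out[i, j] = fconv[np.ix_(r, c)].max(axis=(0, 1))
    return out
```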
The building coordinate refines network, method are as follows: the first step, extending database: extended method is in database Each image X and the corresponding each region manually demarcatedIts corresponding ROI isThe BClass=[1,0,0,0,0] if current interval is human body image-region, BBox=[0,0,0,0], the BClass=[0,1,0,0,0] if current interval is people's face facial area, BBox=[0,0,0, 0], BClass=[0,0,1,0,0], BBox=[0,0,0,0] if current interval is hand region, if current interval is product Region then [0,0,0,1,0] BClass=, BBox=[0,0,0,0];It is random to generate value random number a between -1 to 1rand, brand,lrand,wrand, to obtain new section Indicate round numbers part, the BBox=[a in the sectionrand,brand,lrand, wrand], if new section withCoincidence factor > 0.7 item BClass=current region BClass, if new section withCoincidence factor < 0.3, then [0,0,0,0,1] BClass=, The two is not satisfied, then not assignment;Each section at most generates 10 positive sample regions, if generating Num1A positive sample region, Then generate Num1+ 1 negative sample region, if the inadequate Num in negative sample region1+ 1, then expand arand,brand,lrand,wrandModel It encloses, until finding enough negative sample numbers;Second step, building coordinate refine network: for each in database Image X and the corresponding each human region manually demarcatedIts corresponding ROI isThe ROI of 7 × 7 × 512 dimensions will be launched into 25088 dimensional vectors, then passed through Cross two full articulamentum Fc2, obtain output Fc2(ROI), then by Fc2(ROI) micro- by classification layer FClass and section respectively Layer FBBox is adjusted, output FClass (Fc is obtained2And FBBox (Fc (ROI))2(ROI)), classification layer FClass is full articulamentum, Input vector length is 512, and output vector length is 5, and it is full articulamentum that layer FBBox is finely tuned in section, and input vector length is 512, output vector length is 4;There are two the loss functions of the network: first error function loss1 is to FClass (Fc2 (ROI))-BClass calculates softmax error, and second error function loss2 is to (FBBox (Fc2(ROI))-BBox) meter Euclidean distance error is calculated, then whole loss function=loss1+loss2 of the refining network, algorithm iteration process are as follows: change first 1000 convergence error function loss2 of generation, then 1000 convergence whole loss functions of iteration;
The full articulamentum Fc of described two2, structure are as follows: first layer: full articulamentum, input vector length be 25088, export to Measuring length is 4096, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length be 4096, export to Measuring length is 512, and activation primitive is relu activation primitive;
Described carries out target detection using algorithm of target detection to each frame image, obtains the human body image area of present image Domain, face facial area, hand region and product area, the steps include:
The first step, by input picture XcpstIt is divided into the subgraph of 768 × 1024 dimensions;
Second step, for each subgraph Xs:
2.1st step is converted using the feature extraction depth network Fconv constructed in initialization, obtains 512 feature Set of graphs Fconv (Xs);
2.2nd step, to Fconv (Xs) using area selection network in first layer Conv1, second layer Conv2-1+ softmax activation Function and Conv2-2Into transformation, output softmax (Conv is respectively obtained2-1(Conv1(Fconv(Xs)))) and Conv2-2(Conv1 (Fconv(Xs))), all preliminary candidate sections in the section are then obtained according to output valve;
2.3rd step, for all preliminary candidate sections of all subgraphs of current frame image:
2.3.1 step, is chosen according to the score size in its current candidate region, chooses maximum 50 preliminary candidate sections As candidate region;
2.3.2 step adjusts candidate section of crossing the border all in candidate section set, then weeds out and is overlapped in candidate section Frame, to obtain final candidate section;
Step 2.3.3: input the subgraph Xs and each final candidate section into the ROI layer to obtain the corresponding ROI output. If the current final candidate section is (aBB(1), bBB(2), lBB(3), wBB(4)), compute FBBox(Fc2(ROI)) to obtain four outputs (OutBB(1), OutBB(2), OutBB(3), OutBB(4)), giving the updated coordinates (aBB(1) + 8 × OutBB(1), bBB(2) + 8 × OutBB(2), lBB(3) + 8 × OutBB(3), wBB(4) + 8 × OutBB(4)). Then compute the output of FClass(Fc2(ROI)): if the first position of the output is the largest, the current section is a human-body image region; if the second position is the largest, it is a face region; if the third position is the largest, it is a hand region; if the fourth position is the largest, it is a product region; if the fifth position is the largest, the current section is a negative-sample region and the final candidate section is deleted;
Third step, the coordinate in the final candidate section after updating the refining of all subgraphs, the method for update is to set current candidate area The coordinate in domain is (TLx, TLy, RBx, RBy), and the top left co-ordinate of corresponding subgraph is (Seasub,Sebsub), it is updated Coordinate is (TLx+Seasub-1,TLy+Sebsub-1,RBx,RBy);
It is described by input picture XcpstIt is divided into the subgraph of 768 × 1024 dimensions, the steps include: the step-length for setting segmentation as 384 Hes 512, if window size is m row n column, (asub,bsub) be selected region top left co-ordinate, the initial value of (a, b) be (1, 1);Work as asubWhen < m:
bsub=1;
Work as bsubWhen < n:
Selected region is [(asub,bsub),(asub+384,bsub+ 512)], by input picture XcpstFigure corresponding to the upper section It is copied to as the information in region in new subgraph, and is attached to top left co-ordinate (asub,bsub) it is used as location information;
If selection area exceeds input picture XcpstSection then will exceed the corresponding equal assignment of rgb pixel value of pixel in range It is 0;
bsub=bsub+512;
Interior loop terminates;
asub=asub+384;
Outer loop terminates;
The derivation of all preliminary candidate sections from the output values proceeds as follows: the output of softmax(Conv2-1(Conv1(Fconv(Xs)))) is 48 × 64 × 18 and the output of Conv2-2(Conv1(Fconv(Xs))) is 48 × 64 × 36; for any point (x, y) of the 48 × 64 space, softmax(Conv2-1(Conv1(Fconv(Xs))))(x, y) is an 18-dimensional vector II and Conv2-2(Conv1(Fconv(Xs)))(x, y) is a 36-dimensional vector IIII. For i from 1 to 9, with lOtr the third position of Roi(xOtr, yOtr) and wOtr its fourth position, if II(2i-1) > II(2i), then a preliminary candidate section [II(2i-1), (8 × IIII(4i-3) + x, 8 × IIII(4i-2) + y, lOtr × IIII(4i-1), wOtr × IIII(4i))] is generated, where the first position II(2i-1) is the score of the current candidate region and the second position indicates that the centre point of the current candidate section is (8 × IIII(4i-3) + x, 8 × IIII(4i-2) + y) and that the half-height and half-width of the candidate frame are lOtr × IIII(4i-1) and wOtr × IIII(4i) respectively;
All candidate sections of crossing the border, method in the candidate section set of the adjustment are as follows: it sets monitoring image and is arranged as m row n, it is right In each candidate section, if its [(ach,bch)], the long half-breadth of the half of candidate frame is respectively lchAnd wchIf ach+lch> m, thenThen its a is updatedch=a 'ch, lch=l 'ch; If bch+wch> n, thenThen its b is updatedch =b 'ch, wch=w 'ch
Described weeds out the frame being overlapped in candidate section, the steps include:
If candidate section set is not sky:
The maximum candidate section i of score is taken out from the set of candidate sectionout:
Calculate the coincidence ratio between candidate section iout and every candidate section ic in the candidate section set; if the coincidence ratio > 0.7, delete candidate section ic from the candidate section set;
By candidate section ioutIt is put into the candidate section set of output;
When candidate section set is empty, exporting candidate section contained in candidate section set is to weed out in candidate section Obtained candidate section set after the frame of overlapping;
The calculating candidate section ioutWith candidate section set each of candidate section icCoincidence factor, method are as follows: If candidate section icCoordinate section centered on point [(aic,bic)], the long half-breadth of the half of candidate frame is respectively licAnd wic, candidate regions Between icCoordinate section centered on point [(aiout,bicout)], the long half-breadth of the half of candidate frame is respectively lioutAnd wiout;Calculate xA= max(aic,aiout);YA=max (bic,biout);XB=min (lic,liout), yB=min (wic,wiout);If meeting | aic- aiout|≤lic+liout- 1 and | bic-biout|≤wic+wiout- 1, illustrate that there are overlapping region, overlapping regions=(lic+liout- 1-|aic-aiout|)×(wic+wiout-1-|bic-biout|), otherwise overlapping region=0;Calculate whole region=(2lic-1)× (2wic-1)+(2liout-1)×(2wiout- 1)-overlapping region;To obtain coincidence factor=overlapping region/whole region.
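A sketch of the two procedures above, the greedy removal of overlapping candidate frames and the coincidence ratio it relies on (same formula as the earlier sketch; candidates are (score, box) pairs with boxes in centre/half-size form):

```python
def overlap(box1, box2):
    """Coincidence ratio of (a, b, l, w) centre/half-size boxes."""
    (a1, b1, l1, w1), (a2, b2, l2, w2) = box1, box2
    if abs(a1 - a2) <= l1 + l2 - 1 and abs(b1 - b2) <= w1 + w2 - 1:
        inter = (l1 + l2 - 1 - abs(a1 - a2)) * (w1 + w2 - 1 - abs(b1 - b2))
    else:
        inter = 0
    return inter / ((2*l1 - 1)*(2*w1 - 1) + (2*l2 - 1)*(2*w2 - 1) - inter)

def remove_overlapping(candidates, threshold=0.7):
    """Greedy suppression: repeatedly keep the highest-scoring candidate and drop
    every remaining candidate whose coincidence ratio with it exceeds the threshold."""
    remaining = sorted(candidates, key=lambda c: c[0], reverse=True)  # (score, box)
    kept = []
    while remaining:
        score, box = remaining.pop(0)
        kept.append((score, box))
        remaining = [(s, b) for s, b in remaining if overlap(box, b) <= threshold]
    return kept
```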
4. A supermarket intelligent vending system according to claim 1, characterised in that the shopping action recognition module is implemented as follows:
During initialisation, the static action recognition classifier is first initialised with standard hand-motion images, so that it can recognise the grasping and putting-down actions of the hand; the dynamic action recognition classifier is then initialised with hand-motion videos, so that it can recognise taking out an item, putting an item back, taking out and putting back again, taking out an item without putting it back, and suspicious theft. In the detection process: first, every received hand-region information is identified with the static action recognition classifier; the recognition method is: let the image input each time be Handp1; the output StaticN(Handp1) is a 3-position vector; if the first position is the largest the action is identified as grasping, if the second position is the largest it is identified as putting down, and if the third position is the largest it is identified as other. Second, after a grasping action is recognised, target tracking is carried out on the region corresponding to the current grasping action; when the recognition result of the static action recognition classifier for the tracking box of the current hand region in the next frame is a putting-down action, the target tracking ends, and the video from the frame in which the grasping action was recognised to the frame in which the putting-down action was recognised is marked as a complete video, giving a continuous video of the hand motion. If tracking is lost during tracking, the video from the frame in which the grasping action was recognised to the image before the loss is marked as a grasp-only video. When a putting-down action is recognised but the action does not lie in an image obtained by target tracking, which means that its grasping action was lost, the hand region of the current image is taken as the end of a video and tracking is carried out backwards from the current frame with the target tracking method until tracking is lost; the frame following the lost frame is taken as the start frame of the video, and the video is marked as a putdown-only video. Third, the complete videos obtained in the second step are identified with the dynamic action recognition classifier; the recognition method is: let the input be Handv1; the output DynamicN(Handv1) is a 5-position vector; if the first position is the largest the video is identified as taking out an item, if the second position is the largest as putting an item back, if the third position is the largest as taking out and putting back again, if the fourth position is the largest as having taken out an item without putting it back, and if the fifth position is the largest as a suspicious theft action. The recognition result is then sent to the recognition result processing module; the grasp-only videos and the putdown-only videos are sent to the recognition result processing module, and the complete videos and the grasp-only videos are sent to the product identification module and the individual identification module;
The hand motion image using standard initializes static action recognition classifier, method are as follows: first Step arranges video data: firstly, choosing the video that a large amount of people does shopping in supermarket, these videos include extract product, put back to Article takes out and puts back to again, taken out article and do not put back to movement with suspicious stealing;Manually each section of video clip is cut It takes, commodity is encountered as start frame using manpower, commodity are left as end frame using manpower, target then is used for each frame of video Detection module extracts its hand region, the color image for being then 256 × 256 by each frame image scaling of hand region, will Scaling rear video is put into hand motion video collection, and the video is marked to put back to, again to take out article, putting back to article, taking-up Article is taken out one of not put back to the movement of suspicious stealing;It is taking-up article for classification, puts back to article, takes out and puts It returns, taken out each video that article is not put back to, the first frame of the video is put into the merging of hand motion image set and is labeled as The last frame of the video is put into hand motion image set and merged labeled as putting down movement by grasp motion, removes the from the video It takes a frame to be put into hand motion image set outside one frame and last needle at random to merge labeled as other;To obtain hand motion view Frequency set and hand motion image collection;Second step constructs static action recognition classifier StaticN;Third step, it is dynamic to static state Make recognition classifier StaticN to be initialized, the hand motion image collection constructed by the first step is inputted, if each time The image of input is Handp, is exported as StaticN (Handp), classification yHandp, yHandpRepresentation method are as follows: grasp: yHandp=[1,0,0], puts down: yHandp=[0,1,0], other: yHandp=[0,0,1], the evaluation function of the network are pair (StaticN(Handp)-yHandp) its cross entropy loss function is calculated, convergence direction is to be minimized, the number of iterations 2000 It is secondary;
The construction static state action recognition classifier StaticN, network structure are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, exporting is 256 × 256 × 64, port number channels=64;The second layer: convolutional layer, input as 256 × 256 × 64, exporting is 256 × 256 × 64, port number channels=64;Third layer: pond layer, input first layer output 256 × 256 × 64 are connected in third dimension with third layer output 256 × 256 × 64, and exporting is 128 × 128 × 128;4th layer: Convolutional layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: Chi Hua Layer inputs the 4th layer of output 128 × 128 × 128 and is connected in third dimension with layer 5 128 × 128 × 128, exports It is 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;9th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number Channels=256;Tenth layer: pond layer, input for layer 7 output 64 × 64 × 256 and the 9th layer 64 × 64 × 256 It is connected in third dimension, exporting is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, exports and is 32 × 32 × 512, port number channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, export as 32 × 32 × 512, port number channels=512;14th layer: pond layer inputs as eleventh floor output 32 × 32 × 512 and the 13rd Layer 32 × 32 × 512 is connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, input as 16 × 16 × 512, exporting is 16 × 16 × 512, port number channels=512;18th layer: pond layer is inputted and is exported for the 15th layer 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th Layer: convolutional layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: Chi Hua Layer, inputting is 8 × 8 × 256, and exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128;Second Floor 12: pond layer, inputting is 4 × 4 × 128, export as 2 × 2 × 128;23rd layer: the data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, so by full articulamentum first After input into full articulamentum, output vector length is 128, and activation primitive is relu activation primitive;24th layer: full connection Layer, input vector length are 128, and output vector length is 32, and activation primitive is relu activation primitive;25th layer: Quan Lian Layer is connect, input vector length is 32, and output vector length is 3, and activation primitive is soft-max activation primitive;All convolutional layers Parameter is size=3 convolution kernel kernel, and step-length stride=(1,1), activation primitive is relu activation 
primitive;All pond layers It is maximum pond layer, parameter is pond section size kernel_size=2, step-length stride=(2,2);
Described initializes dynamic action recognition classifier using hand motion video, method are as follows: the first step, construction Data acquisition system: the first step when hand motion image using standard initializes static action recognition classifier The hand motion video collection constructed uniformly extracts 10 frame images, as input;Second step, construction dynamic action identification classification Device DynamicN;Third step initializes dynamic action recognition classifier DynamicN, and input is the first step to each The set that 10 frame images of a video extraction are constituted exports if the 10 frame images inputted each time are Handv as DynamicN (Handv), classification yHandv, yHandvRepresentation method are as follows: take out article: yHandvArticle is put back to in=[1,0,0,0,0]: yHandvIt takes out and puts back to again in=[0,1,0,0,0]: yHandvIt has taken out article and has not put back to in=[0,0,1,0,0]: yHandv=[0,0, 0,1,0] and the movement of suspicious stealing: yHandv=[0,0,0,0,1], the evaluation function of the network are to (DynamicN (Handv)-yHandv) its cross entropy loss function is calculated, convergence direction is to be minimized, and the number of iterations is 2000 times;
The uniform extraction of 10 frame images proceeds as follows: for a section of video of Nf frames, the 1st frame of the video is extracted as the 1st frame of the extracted set and the last frame of the video is extracted as the 10th frame of the extracted set; for ickt = 2 to 9, the ickt-th frame of the extracted set is the frame of the video whose index is obtained by spacing the indices evenly between the first and the last frame, where ⌊·⌋ denotes the integer part;
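A sketch of the uniform 10-frame sampling; the exact index formula is not reproduced above, so linear spacing between the first and last frame is assumed:

```python
def sample_10_frames(num_frames):
    """1-based indices of 10 frames spread uniformly over a video of num_frames frames."""
    if num_frames < 10:
        raise ValueError("video shorter than 10 frames")
    return [round(k * (num_frames - 1) / 9) + 1 for k in range(10)]
```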
The construction dynamic action recognition classifier DynamicN, network structure are as follows:
First layer: convolutional layer, inputting is 256 × 256 × 30, and exporting is 256 × 256 × 512, port number channels=512;
The second layer: convolutional layer, inputting is 256 × 256 × 512, and exporting is 256 × 256 × 128, port number channels= 128;
Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: convolutional layer, input It is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolutional layer inputs and is 128 × 128 × 128, exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, input the 4th Layer output 128 × 128 × 128 is connected in third dimension with layer 5 128 × 128 × 128, export as 64 × 64 × 256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Eight layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;9th layer: volume Lamination, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond layer, it is defeated Enter and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, exporting is 32 × 32 ×512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels= 512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512; 13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number channels=512;Tenth Four layers: pond layer inputs and exports 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 in third dimension for eleventh floor It is connected, exporting is 16 × 16 × 1024;15th layer: convolutional layer, inputting is 16 × 16 × 1024, export as 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, Port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, and exporting is 16 × 16 × 512, channel Number channels=512;18th layer: pond layer inputs and exports 16 × 16 × 512 and the 17th layer 16 × 16 for the 15th layer × 512 are connected in third dimension, and exporting is 8 × 8 × 1024;19th layer: convolutional layer, inputting is 8 × 8 × 1024, Output is 8 × 8 × 256, port number channels=256;20th layer: pond layer, inputting is 8 × 8 × 256, export as 4 × 4×256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, and exporting is 4 × 4 × 128, port number channels=128; Second Floor 12: pond layer, inputting is 4 × 4 × 128, and exporting is 2 × 2 × 128;23rd layer: full articulamentum first will The data of 2 × 2 × 128 dimensions of input are launched into the vector of 512 dimensions, then input into full articulamentum, output vector length It is 128, activation primitive is relu activation primitive;24th layer: full articulamentum, input vector length are 128, and output vector is long Degree is 32, and activation primitive is relu activation primitive;25th layer: full articulamentum, input vector length are 32, and output vector is long Degree is 3, and activation primitive is soft-max activation primitive;The parameter of all convolutional layers is size=3 convolution kernel kernel, step-length Stride=(1,1), activation primitive are relu activation primitive;All pond layers are maximum pond layer, parameter Chi Huaqu Between size kernel_size=2, step-length stride=(2,2);
The target tracking of the region corresponding to the current grasping action, after a grasping action has been recognised, proceeds as follows: let the image of the currently recognised grasping action be Hgrab; the current tracking area is the region corresponding to the image Hgrab. First, extract the ORB feature ORBHgrab of the image Hgrab. Second, for the images corresponding to all hand regions in the frame following Hgrab, compute their ORB features to obtain an ORB feature set, and delete the ORB features already chosen by other tracking boxes. Third, compare ORBHgrab with each value of the ORB feature set by Hamming distance and select the ORB feature with the smallest Hamming distance to ORBHgrab as the chosen ORB feature; if the similarity between the chosen ORB feature and ORBHgrab is > 0.85, where similarity = (1 − Hamming distance of the two ORB features / ORB feature length), the hand region corresponding to the chosen ORB feature is the tracking box of the image Hgrab in the next frame; otherwise the tracking is lost;
The ORB feature: the extraction of ORB features from an image is a mature technique and is implemented in the OpenCV computer vision library; extracting the ORB features of a picture takes the current image as input and outputs several character strings of equal length, each group representing one ORB feature;
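A sketch of the ORB-and-Hamming-distance matching used for the tracking steps above, based on OpenCV's ORB implementation mentioned in the text; the 0.85 similarity threshold follows the text, the candidate interface is an assumption:

```python
import cv2
import numpy as np

orb = cv2.ORB_create()

def orb_descriptor(image):
    """Extract ORB descriptors from a hand-region image (may return None)."""
    _, desc = orb.detectAndCompute(image, None)
    return desc

def best_match(desc_track, candidates, threshold=0.85):
    """Pick the candidate whose descriptors are closest in Hamming distance.

    candidates: list of (region_id, descriptors). Returns region_id, or None
    when the best similarity does not exceed the threshold (tracking lost).
    """
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING)
    best_id, best_sim = None, -1.0
    for region_id, desc in candidates:
        if desc is None or desc_track is None:
            continue
        matches = matcher.match(desc_track, desc)
        if not matches:
            continue
        # similarity = 1 - Hamming distance / descriptor bit length (256 bits per ORB descriptor)
        sim = np.mean([1.0 - m.distance / 256.0 for m in matches])
        if sim > best_sim:
            best_id, best_sim = region_id, sim
    return best_id if best_sim > threshold else None
```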
Described terminates using the corresponding hand region of present image as video, using method for tracking target since present frame forward It is tracked, until tracking is lost, method are as follows: set the image for putting down movement currently recognized as Hdown, current tracking area Domain is region corresponding to image Hdown;
If not tracking loss:
The first step extracts the ORB feature ORBHdown of the image Hdown; since this feature has already been computed in the procedure described above for target tracking after a grasping action is recognised, it does not need to be computed again here;
Second step calculates its ORB feature for the corresponding image of all hand regions in the former frame of image Hdown to obtain To ORB characteristic set, and delete the ORB feature chosen by other tracking box;
Third step, by ORBHdownIts Hamming distance compared with each value of ORB characteristic set, selection and ORBHdownThe Chinese of feature The smallest ORB feature of prescribed distance is the ORB feature chosen, if the ORB feature and ORB chosenHdownSimilarity > 0.85 of feature, Similarity=(Hamming distance/ORB characteristic lengths of two ORB features of 1-), then the corresponding hand region of ORB feature chosen is i.e. It is image Hdown in the tracking box of next frame, if otherwise similarity < 0.85 shows that tracking is lost, algorithm terminates.
5. A supermarket intelligent vending system according to claim 1, characterised in that the product identification module is implemented as follows:
During initialisation, the product identification classifier is first initialised with the product image set of all angles, and a product list is generated for the products. When the product list changes: if a product is deleted, its images are deleted from the all-angle product image set and the corresponding position in the product list is deleted; if a product is added, the product images of all angles of the current product are put into the all-angle product image set, the name of the newly added product is appended to the end of the product list, and the product identification classifier is then updated with the new all-angle product image set and the new product list. In the detection process: first, according to the complete video and the grasp-only video transmitted by the shopping action recognition module, and starting from the position obtained by the target detection module for the region corresponding to the first frame of the current video, the input video images of that position are examined backwards from the first frame of the current video to find a frame in which the region is not occluded; the image of the region in that frame is then used as the input of the product identification classifier for recognition, giving the recognition result of the current product. The recognition method is: let the image input each time be Goods1; the output GoodsN(Goods1) is a vector; if the igoods-th position of the vector is the largest, the current recognition result is the product at the igoods-th position of the product list; the recognition result is sent to the recognition result processing module;
Described first initializes product identification classifier using the product image set of all angles, and to product figure As generating product list, method are as follows: the first step, construct data acquisition system and product list: the data acquisition system is each angle of product The image of degree, product list listGOods is a vector, and each of vector corresponds to a product name;Second step, construction Product identification classifier GoodsN;Third step initializes construction product identification classifier GoodsN, and input is each The product image set of angle exports if input picture is Goods as GoodsN (Goods), classification yGoods, yGoodsFor One group of vector, length are equal to the number of product in product list, yGoodsRepresentation method are as follows: if image Goods be i-thGoodsPosition Product, then yGoodsI-thGoodsPosition is 1, other positions are 0;The evaluation function of the network is to (GoodsN (Goods)- yGoods) its cross entropy loss function is calculated, convergence direction is to be minimized, and the number of iterations is 2000 times;
The construction product identification classifier GoodsN, two groups of GoodsN1 and GoodsN2 of network layer structure, wherein The network structure of GoodsN1 are as follows: first layer: convolutional layer, inputting is 256 × 256 × 3, and exporting is 256 × 256 × 64, port number Channels=64;The second layer: convolutional layer, inputting is 256 × 256 × 64, and exporting is 256 × 256 × 128, port number Channels=128;Third layer: pond layer, inputting is 256 × 256 × 128, and exporting is 128 × 128 × 128;4th layer: volume Lamination, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 5: convolution Layer, inputting is 128 × 128 × 128, and exporting is 128 × 128 × 128, port number channels=128;Layer 6: pond layer, It inputs the 4th layer of output 128 × 128 × 128 to be connected in third dimension with layer 5 128 × 128 × 128, exporting is 64 ×64×256;Layer 7: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels= 256;8th layer: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;The Nine layers: convolutional layer, inputting is 64 × 64 × 256, and exporting is 64 × 64 × 256, port number channels=256;Tenth layer: pond Change layer, inputs and be connected in third dimension for layer 7 output 64 × 64 × 256 with the 9th layer 64 × 64 × 256, export It is 32 × 32 × 512;Eleventh floor: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;Floor 12: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;13rd layer: convolutional layer, inputting is 32 × 32 × 512, and exporting is 32 × 32 × 512, port number Channels=512;14th layer: pond layer, input for eleventh floor export 32 × 32 × 512 and the 13rd layer 32 × 32 × 512 are connected in third dimension, and exporting is 16 × 16 × 1024;15th layer: convolutional layer, input as 16 × 16 × 1024, exporting is 16 × 16 × 512, port number channels=512;16th layer: convolutional layer, inputting is 16 × 16 × 512, Output is 16 × 16 × 512, port number channels=512;17th layer: convolutional layer, inputting is 16 × 16 × 512, output It is 16 × 16 × 512, port number channels=512;18th layer: pond layer, input for the 15th layer output 16 × 16 × 512 are connected in third dimension with the 17th layer 16 × 16 × 512, and exporting is 8 × 8 × 1024;19th layer: convolution Layer, inputting is 8 × 8 × 1024, and exporting is 8 × 8 × 256, port number channels=256;20th layer: pond layer, input It is 8 × 8 × 256, exporting is 4 × 4 × 256;Second eleventh floor: convolutional layer, inputting is 4 × 4 × 256, export as 4 × 4 × 128, port number channels=128;The parameter of all convolutional layers be size=3 convolution kernel kernel, step-length stride=(1, 1), activation primitive is relu activation primitive;All pond layers are maximum pond layer, and parameter is pond section size Kernel_size=2, step-length stride=(2,2);The network structure of GoodsN2 are as follows: inputting is 4 × 4 × 128, first will be defeated The data entered are launched into the vector of 2048 dimensions, then input into first layer;First layer: full articulamentum, input vector length are 2048, output vector length is 1024, and activation primitive is relu activation primitive;The second layer: full articulamentum, input vector length are 1024, output vector length is 1024, and activation primitive is relu activation primitive;Third layer: full 
articulamentum, input vector length are 1024, output vector length is len (listGoods), activation primitive is soft-max activation primitive;len(listGoods) indicate The length of product list;For any input Goods2, GoodsN (Goods2)=GoodsN2 (GoodsN1 (Goods2));
The product image set and new product list upgrading products recognition classifier with new all angles, method Are as follows: the first step modifies network structure: for the network structure of product identification the classifier GoodsN ', GoodsN1 ' of neotectonics Constant, identical as GoodsN1 network structure when initialization, the first layer and second layer structure of GoodsN2 ' network structure are protected Hold constant, the output vector length of third layer becomes the length of updated product list;Second step, for the product of neotectonics Recognition classifier GoodsN ' is initialized: its product image set inputted as new all angles, if input picture is Goods3 exports as GoodsN ' (Goods3)=GoodsN2 ' (GoodsN1 (Goods3)), classification yGoods3, yGoodsFor One group of vector, length are equal to the number of updated product list, yGoodsRepresentation method are as follows: if image Goods be i-thGoods The product of position, then yGoodsI-thGoodsPosition is 1, other positions are 0;The evaluation function of the network is to (GoodsN (Goods)- yGoods) its cross entropy loss function is calculated, convergence direction is to be minimized, during initialization the parameter value in GoodsN1 It remains unchanged, the number of iterations is 500 times;
The search, starting from the position obtained by the target detection module for the first frame of the current video, of the video images of that position backwards from the first frame for a frame in which the region is not occluded, proceeds as follows: let the position obtained by the target detection module for the first frame of the current video be (agoods, bgoods, lgoods, wgoods), and let the first frame of the current video be the icrgs-th frame; for the frame under processing icr = icrgs: first, all detection regions obtained by the target detection module for the icr-th frame form Taskicr; second, for each region frame (atask, btask, ltask, wtask) in Taskicr, compute its distance dgt = (atask − agoods)² + (btask − bgoods)² − (ltask + lgoods)² − (wtask + wgoods)². If no distance < 0 exists, the region corresponding to (agoods, bgoods, lgoods, wgoods) in the icr-th frame is the detected frame in which the region is not occluded, and the algorithm ends. Otherwise, if a distance < 0 exists, record d(icr) = the minimum distance in the distance list d and set icr = icr − 1; if icr > 0 the algorithm jumps to the first step; if icr ≤ 0, the record with the maximum value in the distance list d is selected, and the region corresponding to (agoods, bgoods, lgoods, wgoods) in the frame of that record is taken as the detected frame in which the region is least occluded; the algorithm ends.
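A sketch of this backward search for an unoccluded frame; frame_regions, standing for the per-frame detection results of the target detection module, is a hypothetical interface:

```python
def find_unoccluded_frame(first_frame_idx, goods_box, frame_regions):
    """Search backwards from the video's first frame for a frame in which the
    product region is not overlapped by any other detected region.

    goods_box: (a, b, l, w); frame_regions: hypothetical mapping from frame
    index to the list of (a, b, l, w) regions detected in that frame.
    """
    a_g, b_g, l_g, w_g = goods_box
    d = {}                                  # the distance list d of the text
    i = first_frame_idx
    while i > 0:
        dists = [(a - a_g) ** 2 + (b - b_g) ** 2 - (l + l_g) ** 2 - (w + w_g) ** 2
                 for (a, b, l, w) in frame_regions.get(i, [])]
        if not any(dist < 0 for dist in dists):
            return i                        # nothing close enough to occlude the product
        d[i] = min(dists)
        i -= 1
    # no unoccluded frame found: take the frame whose nearest occluder is farthest away
    return max(d, key=d.get)
```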
6. A supermarket intelligent vending system according to claim 1, characterised in that the individual identification module is implemented as follows:
During initialisation, the face feature extractor FaceN is first initialised with the face image set of all angles and μface is computed; then the human-body feature extractor BodyN is initialised with the human-body images of all angles and μbody is computed. In the detection process, when a user enters the supermarket, the current human-body region Body1 and the face image Face1 within the body region are obtained by the target detection module; the human-body feature BodyN(Body1) and the face feature FaceN(Face1) are then extracted with the human-body feature extractor BodyN and the face feature extractor FaceN respectively; BodyN(Body1) is saved in the BodyFtu set, FaceN(Face1) is saved in the FaceFtu set, and the ID information of the current customer is saved. The ID information is the customer's supermarket account or an unrepeated number randomly assigned when the customer enters the supermarket, and is used to distinguish different customers; whenever a customer enters the supermarket, his or her human-body feature and face feature are extracted in this way. When a user moves a product in the supermarket, the corresponding human-body region and face region are found according to the complete video and the grasp-only video transmitted by the shopping action recognition module, and face recognition or human-body recognition is carried out with the face feature extractor FaceN and the human-body feature extractor BodyN, giving the ID of the customer corresponding to the video currently transmitted by the shopping action recognition module;
The face feature extractor FaceN is initialized with the all-angle face image set and μ_face is computed as follows: Step 1, the all-angle face image set is chosen to form the face dataset; Step 2, the face feature extractor FaceN is constructed and initialized with the face dataset; Step 3:
For each person i_Peop in the face dataset, obtain the set FaceSet(i_Peop) of all face images in the face dataset that belong to i_Peop:
For each face image Face(j_iPeop) in FaceSet(i_Peop):
compute the face feature FaceN(Face(j_iPeop));
compute the average of all face features in the current face image set FaceSet(i_Peop) as the center of the current face images, center(FaceN(Face(j_iPeop)));
compute the distance between every face feature in the current face image set FaceSet(i_Peop) and the center of the current face images, center(FaceN(Face(j_iPeop))); these distances constitute the distance set corresponding to i_Peop.
For every person in the face dataset the corresponding distance set is obtained; after the distance set is sorted in ascending order, if the length of the distance set is n_diset, then μ_face is the value at position ⌊…⌋ of the sorted distance set, where ⌊·⌋ denotes taking the integer part. A sketch of this computation is given below.
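The threshold μ_face follows directly from the loop above: per-person feature centers, distances of each feature to its own center, and a fixed position in the sorted list of all distances. A minimal NumPy sketch follows; extract and faces_by_person are hypothetical names, and because the exact position in the sorted list is not legible in the claim text, it is left as the parameter position_fraction.

```python
import numpy as np

def compute_mu(extract, faces_by_person, position_fraction):
    distances = []
    for person, images in faces_by_person.items():
        feats = np.stack([extract(img) for img in images])    # FaceN(Face(j_iPeop))
        center = feats.mean(axis=0)                           # per-person feature center
        distances.extend(np.linalg.norm(feats - center, axis=1))
    distances.sort()                                          # ascending order
    n_diset = len(distances)
    idx = int(position_fraction * n_diset)                    # integer part of the position
    return distances[min(idx, n_diset - 1)]
```

The same routine applies to μ_body below, with the body feature extractor and the body image sets substituted.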
The face feature extractor FaceN is constructed and initialized with the face dataset as follows. Let the face dataset contain N_faceset persons. The network layer structure FaceN25 is:
Layer 1: convolutional layer, input 256×256×3, output 256×256×64, channels = 64;
Layer 2: convolutional layer, input 256×256×64, output 256×256×64, channels = 64;
Layer 3: pooling layer, input is the layer-1 output 256×256×64 concatenated with the layer-2 output 256×256×64 along the third (channel) dimension, output 128×128×128;
Layer 4: convolutional layer, input 128×128×128, output 128×128×128, channels = 128;
Layer 5: convolutional layer, input 128×128×128, output 128×128×128, channels = 128;
Layer 6: pooling layer, input is the layer-4 output 128×128×128 concatenated with the layer-5 output 128×128×128 along the third dimension, output 64×64×256;
Layer 7: convolutional layer, input 64×64×256, output 64×64×256, channels = 256;
Layer 8: convolutional layer, input 64×64×256, output 64×64×256, channels = 256;
Layer 9: convolutional layer, input 64×64×256, output 64×64×256, channels = 256;
Layer 10: pooling layer, input is the layer-7 output 64×64×256 concatenated with the layer-9 output 64×64×256 along the third dimension, output 32×32×512;
Layer 11: convolutional layer, input 32×32×512, output 32×32×512, channels = 512;
Layer 12: convolutional layer, input 32×32×512, output 32×32×512, channels = 512;
Layer 13: convolutional layer, input 32×32×512, output 32×32×512, channels = 512;
Layer 14: pooling layer, input is the layer-11 output 32×32×512 concatenated with the layer-13 output 32×32×512 along the third dimension, output 16×16×1024;
Layer 15: convolutional layer, input 16×16×1024, output 16×16×512, channels = 512;
Layer 16: convolutional layer, input 16×16×512, output 16×16×512, channels = 512;
Layer 17: convolutional layer, input 16×16×512, output 16×16×512, channels = 512;
Layer 18: pooling layer, input is the layer-15 output 16×16×512 concatenated with the layer-17 output 16×16×512 along the third dimension, output 8×8×1024;
Layer 19: convolutional layer, input 8×8×1024, output 8×8×256, channels = 256;
Layer 20: pooling layer, input 8×8×256, output 4×4×256;
Layer 21: convolutional layer, input 4×4×256, output 4×4×128, channels = 128;
Layer 22: pooling layer, input 4×4×128, output 2×2×128;
Layer 23: fully connected layer; the 2×2×128 input is first flattened into a 512-dimensional vector and then fed to the fully connected layer; output vector length 512, activation function relu;
Layer 24: fully connected layer, input vector length 512, output vector length 512, activation function relu;
Layer 25: fully connected layer, input vector length 512, output vector length N_faceset, activation function soft-max.
All convolutional layers use convolution kernel size = 3, stride = (1, 1), and relu activation; all pooling layers are max-pooling layers with pooling window size kernel_size = 2 and stride = (2, 2). Initialization procedure: for each face image face4, the output is FaceN25(face4) and the class label is y_face, a vector of length N_faceset encoded as follows: if face4 belongs to the i_face4-th person in the face image set, then the i_face4-th entry of y_face is 1 and all other entries are 0; the evaluation function of the network is the cross-entropy loss computed between FaceN25(face4) and y_face, the convergence direction is minimization, and the number of iterations is 2000. After iteration, the face feature extractor FaceN is the FaceN25 network from layer 1 through layer 24.
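The layer list above is essentially a VGG-style stack in which each pooling layer takes the channel-wise concatenation of two earlier convolution outputs. A minimal PyTorch sketch of that structure follows (the BodyN25 network below is identical apart from the number of output classes); it assumes "same" padding so the spatial sizes match the claim, and channels-first tensors, so the claim's third-dimension concatenation becomes dim=1. The class name, the helper conv3x3, and the returned (probabilities, feature) pair are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv3x3(c_in, c_out):
    # kernel size 3, stride (1, 1), padding 1 keeps the spatial size, as the claim's dimensions require
    return nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1)

class FaceN25(nn.Module):
    def __init__(self, n_classes):
        super().__init__()
        self.c1, self.c2 = conv3x3(3, 64), conv3x3(64, 64)
        self.c4, self.c5 = conv3x3(128, 128), conv3x3(128, 128)
        self.c7, self.c8, self.c9 = conv3x3(256, 256), conv3x3(256, 256), conv3x3(256, 256)
        self.c11, self.c12, self.c13 = conv3x3(512, 512), conv3x3(512, 512), conv3x3(512, 512)
        self.c15, self.c16, self.c17 = conv3x3(1024, 512), conv3x3(512, 512), conv3x3(512, 512)
        self.c19, self.c21 = conv3x3(1024, 256), conv3x3(256, 128)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.fc23 = nn.Linear(2 * 2 * 128, 512)
        self.fc24 = nn.Linear(512, 512)
        self.fc25 = nn.Linear(512, n_classes)

    def forward(self, x):                                   # x: N x 3 x 256 x 256
        a = F.relu(self.c1(x)); b = F.relu(self.c2(a))
        x = self.pool(torch.cat([a, b], dim=1))             # layer 3: 128 x 128 x 128
        a = F.relu(self.c4(x)); b = F.relu(self.c5(a))
        x = self.pool(torch.cat([a, b], dim=1))             # layer 6: 64 x 64 x 256
        a = F.relu(self.c7(x)); b = F.relu(self.c8(a)); c = F.relu(self.c9(b))
        x = self.pool(torch.cat([a, c], dim=1))             # layer 10: 32 x 32 x 512
        a = F.relu(self.c11(x)); b = F.relu(self.c12(a)); c = F.relu(self.c13(b))
        x = self.pool(torch.cat([a, c], dim=1))             # layer 14: 16 x 16 x 1024
        a = F.relu(self.c15(x)); b = F.relu(self.c16(a)); c = F.relu(self.c17(b))
        x = self.pool(torch.cat([a, c], dim=1))             # layer 18: 8 x 8 x 1024
        x = self.pool(F.relu(self.c19(x)))                  # layers 19-20: 4 x 4 x 256
        x = self.pool(F.relu(self.c21(x)))                  # layers 21-22: 2 x 2 x 128
        x = x.flatten(1)                                    # 512-dimensional vector
        feat = F.relu(self.fc24(F.relu(self.fc23(x))))      # layers 23-24: the FaceN feature
        return F.softmax(self.fc25(feat), dim=1), feat      # layer 25 output plus the feature
```

In this reading, the feature extractor FaceN corresponds to the 512-dimensional feat value produced by layers 1 through 24, while the soft-max output of layer 25 is used only during initialization.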
The human-body feature extractor BodyN is initialized with the all-angle human-body images and μ_body is computed as follows: Step 1, the all-angle human-body image set is chosen to form the body dataset; Step 2, the human-body feature extractor BodyN is constructed and initialized with the body dataset; Step 3:
For each person i_Peop1 in the body dataset, obtain the set BodySet(i_Peop1) of all human-body images in the body dataset that belong to i_Peop1:
For each human-body image Body(j_iPeop1) in BodySet(i_Peop1):
compute the body feature BodyN(Body(j_iPeop1));
compute the average of all body features in the current body image set BodySet(i_Peop1) as the center of the current body images, center(BodyN(Body(j_iPeop1)));
compute the distance between every body feature in the current body image set BodySet(i_Peop1) and the center of the current body images, center(BodyN(Body(j_iPeop1))); these distances constitute the distance set corresponding to i_Peop1.
For every person in the body dataset the corresponding distance set is obtained; after the distance set is sorted in ascending order, if the length of the distance set is n_diset1, then μ_body is the value at position ⌊…⌋ of the sorted distance set, where ⌊·⌋ denotes taking the integer part;
The human-body feature extractor BodyN is constructed and initialized with the body dataset as follows. Let the body dataset contain N_bodyset persons. The network layer structure BodyN25 is identical, layer by layer, to the FaceN25 structure described above (the same convolutional, pooling, and fully connected layers, with convolution kernel size = 3, stride = (1, 1), relu activations, and max-pooling with kernel_size = 2 and stride = (2, 2)), except that the output vector length of the 25th (soft-max) layer is N_bodyset. Initialization procedure: for each human-body image body4, the output is BodyN25(body4) and the class label is y_body, a vector of length N_bodyset encoded as follows: if body4 belongs to the i_body4-th person in the human-body image set, then the i_body4-th entry of y_body is 1 and all other entries are 0; the evaluation function of the network is the cross-entropy loss computed between BodyN25(body4) and y_body, the convergence direction is minimization, and the number of iterations is 2000. After iteration, the human-body feature extractor BodyN is the BodyN25 network from layer 1 through layer 24;
According to the complete video and the grasping-action-only video passed from the shopping action recognition module, the corresponding human-body region and face region are located, and face recognition or body recognition with the face feature extractor FaceN and the human-body feature extractor BodyN is used to obtain the ID of the customer corresponding to the video currently passed from the shopping action recognition module. The process is as follows: for the video passed from the shopping action recognition module, the corresponding human-body region and face region are searched for starting from the first frame of the video, until the algorithm terminates or the last frame of the video has been processed:
The corresponding human-body region image Body2 and face region image Face2 are fed to the human-body feature extractor BodyN and the face feature extractor FaceN respectively to extract the body feature BodyN(Body2) and the face feature FaceN(Face2);
Face identification information is used first: the Euclidean distance d_Face between FaceN(Face2) and every face feature in the FaceFtu set is computed, and the feature in the FaceFtu set with the smallest Euclidean distance is selected; let this feature be FaceN(Face3). If d_Face < μ_face, the current face image is identified as belonging to the customer of the face image corresponding to FaceN(Face3), that customer's ID is taken as the ID corresponding to the video action passed from the shopping action recognition module, and the current identification process terminates;
If d_Face >= μ_face, the current individual cannot be identified by the face recognition method alone. The Euclidean distance d_Body between BodyN(Body2) and every body feature in the BodyFtu set is then computed, and the feature in the BodyFtu set with the smallest Euclidean distance is selected; let this feature be BodyN(Body3). If d_Body + d_Face < μ_face + μ_body, the current body image is identified as belonging to the customer of the body image corresponding to BodyN(Body3), and that customer's ID is the ID corresponding to the video action passed from the shopping action recognition module;
If no ID corresponding to the video action has been found after all frames of the video have been processed, then, to avoid mis-identifying the purchaser and producing an incorrect billing entry, the video currently passed from the shopping action recognition module is not processed further. A sketch of this matching procedure is given below;
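A minimal sketch of this face-first, body-fallback matching, assuming NumPy arrays for the FaceFtu and BodyFtu sets stored row-wise with a parallel list of customer IDs (a storage layout chosen here for illustration; the claim only specifies the two feature sets, the saved IDs, and the thresholds μ_face and μ_body):

```python
import numpy as np

def identify_customer(face_feat, body_feat, face_ftu, body_ftu, ids, mu_face, mu_body):
    d_face = np.linalg.norm(face_ftu - face_feat, axis=1)    # Euclidean distances d_Face
    k = int(np.argmin(d_face))
    if d_face[k] < mu_face:                                   # face recognition succeeds
        return ids[k]
    d_body = np.linalg.norm(body_ftu - body_feat, axis=1)    # Euclidean distances d_Body
    m = int(np.argmin(d_body))
    if d_body[m] + d_face[k] < mu_face + mu_body:             # combined face + body criterion
        return ids[m]
    return None                                               # no match: try the next frame
```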
For the video passed from the shopping action recognition module, the corresponding human-body region and face region are searched for starting from the first frame, as follows: the video passed from the shopping action recognition module is processed from its first frame. Suppose the i_fRg-th frame is currently being processed, the position obtained by the target detection module for this frame of the video is (a_ifRg, b_ifRg, l_ifRg, w_ifRg), the set of human-body regions obtained by the target detection module for this frame is BodyFrameSet_ifRg, and the set of face regions is FaceFrameSet_ifRg. For each human-body region (a_BFSifRg, b_BFSifRg, l_BFSifRg, w_BFSifRg) in BodyFrameSet_ifRg, compute its distance d_gbt = (a_BFSifRg - a_ifRg)^2 + (b_BFSifRg - b_ifRg)^2 - (l_BFSifRg - l_ifRg)^2 - (w_BFSifRg - w_ifRg)^2; the human-body region with the smallest distance among all human-body regions is taken as the human-body region corresponding to the current video. Let the position of the chosen human-body region be (a_BFS1, b_BFS1, l_BFS1, w_BFS1). For each face region (a_FFSifRg, b_FFSifRg, l_FFSifRg, w_FFSifRg) in FaceFrameSet_ifRg, compute its distance d_gft = (a_BFS1 - a_FFSifRg)^2 + (b_BFS1 - b_FFSifRg)^2 - (l_BFS1 - l_FFSifRg)^2 - (w_BFS1 - w_FFSifRg)^2; the face region with the smallest distance among all face regions is taken as the face region corresponding to the current video.
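A minimal sketch of this region association, assuming (a, b, l, w) tuples and the illustrative helper name signed_box_distance; note that, unlike the occlusion test earlier, the size terms here are differences rather than sums, exactly as written in d_gbt and d_gft:

```python
def signed_box_distance(p, q):
    # (position difference)^2 minus (size difference)^2, as in d_gbt / d_gft above
    a1, b1, l1, w1 = p
    a2, b2, l2, w2 = q
    return (a1 - a2) ** 2 + (b1 - b2) ** 2 - (l1 - l2) ** 2 - (w1 - w2) ** 2

def associate_regions(action_box, body_regions, face_regions):
    # Nearest body region to the action position, then nearest face region to that body region.
    body = min(body_regions, key=lambda r: signed_box_distance(r, action_box))
    face = min(face_regions, key=lambda r: signed_box_distance(body, r))
    return body, face
```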
7. The supermarket intelligent vending system according to claim 1, characterised in that the recognition result processing module is implemented as follows:
It does not act during initialization. During identification, the received recognition results are integrated to generate the shopping list corresponding to each customer: first, the customer ID passed from the individual identification module determines the customer to whom the current shopping information belongs, so the shopping list to be modified is the one numbered ID; then the recognition result passed from the product identification module determines the product involved in the current customer's shopping action, denoted GoodA; finally, the recognition result passed from the shopping action recognition module determines the current shopping action and whether the shopping list is modified. If the action is identified as taking out an item, product GoodA is added to shopping list ID with a quantity increase of 1; if the action is identified as putting an item back, product GoodA is reduced on shopping list ID with a quantity decrease of 1; if the action is identified as "taken out and put back" or as "item taken out but not put back", the shopping list is not changed; if the recognition result is "suspected theft", an alarm signal together with the location information corresponding to the current video is sent to the supermarket monitoring.
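A minimal sketch of this integration step, assuming illustrative action labels and an alarm callback; the patent fixes only the behavior (increase by one, decrease by one, leave unchanged, or raise an alarm with the video's location):

```python
from collections import defaultdict

# customer ID -> {product name: quantity}
shopping_lists = defaultdict(lambda: defaultdict(int))

def process_result(customer_id, product, action, location, alarm=print):
    items = shopping_lists[customer_id]
    if action == "take_out":                  # item taken from the shelf: add one
        items[product] += 1
    elif action == "put_back":                # item returned to the shelf: remove one
        items[product] = max(items[product] - 1, 0)
    elif action == "suspected_theft":         # forward an alarm with the camera location
        alarm(f"suspected theft near {location}")
    # other results ("taken out and put back", etc.) leave the shopping list unchanged
```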
CN201910263910.1A 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system Withdrawn CN109977896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910263910.1A CN109977896A (en) 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910263910.1A CN109977896A (en) 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system

Publications (1)

Publication Number Publication Date
CN109977896A true CN109977896A (en) 2019-07-05

Family

ID=67082544

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910263910.1A Withdrawn CN109977896A (en) 2019-04-03 2019-04-03 A kind of supermarket's intelligence vending system

Country Status (1)

Country Link
CN (1) CN109977896A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110379108A (en) * 2019-08-19 2019-10-25 铂纳思(东莞)高新科技投资有限公司 A kind of method and its system of unmanned shop anti-thefting monitoring
WO2021047232A1 (en) * 2019-09-11 2021-03-18 苏宁易购集团股份有限公司 Interaction behavior recognition method, apparatus, computer device, and storage medium
CN110674712A (en) * 2019-09-11 2020-01-10 苏宁云计算有限公司 Interactive behavior recognition method and device, computer equipment and storage medium
CN110619308A (en) * 2019-09-18 2019-12-27 名创优品(横琴)企业管理有限公司 Aisle sundry detection method, device, system and equipment
CN110796051A (en) * 2019-10-19 2020-02-14 北京工业大学 Real-time access behavior detection method and system based on container scene
CN110796051B (en) * 2019-10-19 2024-04-26 北京工业大学 Real-time access behavior detection method and system based on container scene
CN111582202A (en) * 2020-05-13 2020-08-25 上海海事大学 Intelligent course system
CN111582202B (en) * 2020-05-13 2023-10-17 上海海事大学 Intelligent net class system
CN111723741A (en) * 2020-06-19 2020-09-29 江苏濠汉信息技术有限公司 Temporary fence movement detection alarm system based on visual analysis
CN113408501A (en) * 2021-08-19 2021-09-17 北京宝隆泓瑞科技有限公司 Oil field park detection method and system based on computer vision
CN113901895A (en) * 2021-09-18 2022-01-07 武汉未来幻影科技有限公司 Door opening action recognition method and device for vehicle and processing equipment
CN114596661A (en) * 2022-02-28 2022-06-07 安顺市成威科技有限公司 Multifunctional intelligent sales counter
CN114596661B (en) * 2022-02-28 2023-03-10 安顺市成威科技有限公司 Multifunctional intelligent sales counter
CN117253194A (en) * 2023-11-13 2023-12-19 网思科技股份有限公司 Commodity damage detection method, commodity damage detection device and storage medium
CN117253194B (en) * 2023-11-13 2024-03-19 网思科技股份有限公司 Commodity damage detection method, commodity damage detection device and storage medium

Similar Documents

Publication Publication Date Title
CN109977896A (en) A kind of supermarket's intelligence vending system
US11270260B2 (en) Systems and methods for deep learning-based shopper tracking
US10127438B1 (en) Predicting inventory events using semantic diffing
Narayana et al. Gesture recognition: Focus on the hands
Liu et al. Adversarial learning for constrained image splicing detection and localization based on atrous convolution
US20210158053A1 (en) Constructing shopper carts using video surveillance
CN104217214B (en) RGB D personage's Activity recognition methods based on configurable convolutional neural networks
CN108460356A (en) A kind of facial image automated processing system based on monitoring system
KR102554724B1 (en) Method for identifying an object in an image and mobile device for practicing the method
CA3072063A1 (en) Item put and take detection using image recognition
CN108470354A (en) Video target tracking method, device and realization device
CN114972418A (en) Maneuvering multi-target tracking method based on combination of nuclear adaptive filtering and YOLOX detection
CN109711366A (en) A kind of recognition methods again of the pedestrian based on group information loss function
CN108009493A (en) Face anti-fraud recognition methods based on action enhancing
US20210248421A1 (en) Channel interaction networks for image categorization
CN107967442A (en) A kind of finger vein identification method and system based on unsupervised learning and deep layer network
Tagare et al. A maximum-likelihood strategy for directing attention during visual search
CN110490913A (en) Feature based on angle point and the marshalling of single line section describes operator and carries out image matching method
Liu et al. Customer behavior recognition in retail store from surveillance camera
CN110516533A (en) A kind of pedestrian based on depth measure discrimination method again
CN110222587A (en) A kind of commodity attribute detection recognition methods again based on characteristic pattern
CN110096991A (en) A kind of sign Language Recognition Method based on convolutional neural networks
Zhang et al. Identification of tomato leaf diseases based on multi-channel automatic orientation recurrent attention network
CN107563293A (en) A kind of new finger vena preprocess method and system
Fang et al. Pedestrian attributes recognition in surveillance scenarios with hierarchical multi-task CNN models

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20190705

WW01 Invention patent application withdrawn after publication