WO2020228446A1 - Model training method, apparatus, terminal, and storage medium


Info

Publication number
WO2020228446A1
Authority
WO
WIPO (PCT)
Prior art keywords
tracking
response
test
recognition model
image
Prior art date
Application number
PCT/CN2020/083523
Other languages
English (en)
French (fr)
Inventor
王宁
宋奕兵
刘威
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Priority to EP20805250.6A (EP3971772B1)
Priority to KR1020217025275A (KR102591961B1)
Priority to JP2021536356A (JP7273157B2)
Publication of WO2020228446A1
Priority to US17/369,833 (US11704817B2)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/246Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • G06T7/73Determining position or orientation of objects or cameras using feature-based methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/62Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/48Matching video sequences
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/64Three-dimensional objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30241Trajectory

Definitions

  • This application relates to the field of Internet technology, and in particular to the field of visual target tracking; it provides a model training method, a model training device, a terminal, and a storage medium.
  • Visual target tracking is an important research direction in the field of computer vision.
  • The so-called visual target tracking refers to predicting the size and position of a tracking object in other images when its size and position in a certain image are known.
  • Visual target tracking is usually used in application scenarios that require high real-time performance, such as video surveillance, human-computer interaction, and unmanned driving. For example, given the size and position of the tracking object in a certain frame of a video sequence, the task is to predict the size and position of the tracking object in subsequent frames of the video sequence.
  • The embodiments of this application provide a model training method, device, terminal, and storage medium that can better train the first object recognition model, so that the first object recognition model obtained by update training has better visual target tracking performance, is more suitable for visual target tracking scenarios, and improves the accuracy of visual target tracking.
  • an embodiment of the present application provides a model training method, which is executed by a computing device, and the model training method includes:
  • acquiring a template image and a test image for training, where the template image and the test image both include a tracking object, the test image includes a tracking label of the tracking object, and the tracking label is used to indicate the marked position of the tracking object in the test image;
  • calling a first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a first reference response, and calling a second object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a second reference response;
  • calling the first object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a first test response, and calling the second object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a second test response;
  • performing tracking processing on the first test response to obtain a tracking response of the tracking object, where the tracking response is used to indicate the tracking position of the tracking object in the test image;
  • updating the first object recognition model based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response.
  • an embodiment of the present application provides a model training device, and the model training device includes:
  • the acquiring unit is configured to acquire a template image and a test image for training, where the template image and the test image both include a tracking object, the test image includes a tracking label of the tracking object, and the tracking label is used to indicate the marked position of the tracking object in the test image;
  • the processing unit is configured to call a first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a first reference response, and call a second object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a second reference response;
  • the processing unit is further configured to call the first object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a first test response, and call the second object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a second test response;
  • the processing unit is further configured to perform tracking processing on the first test response to obtain a tracking response of the tracking object, where the tracking response is used to indicate the tracking position of the tracking object in the test image;
  • the update unit is configured to update the first object recognition model based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response.
  • an embodiment of the present application provides a terminal; the terminal includes an input device and an output device, and further includes a processor and a computer storage medium;
  • the computer storage medium stores one or more instructions, and the one or more instructions are loaded by the processor to execute the following steps:
  • acquiring a template image and a test image for training, where the template image and the test image both include a tracking object, the test image includes a tracking label of the tracking object, and the tracking label is used to indicate the marked position of the tracking object in the test image;
  • calling a first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a first reference response, and calling a second object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a second reference response;
  • calling the first object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a first test response, and calling the second object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a second test response;
  • performing tracking processing on the first test response to obtain a tracking response of the tracking object, where the tracking response is used to indicate the tracking position of the tracking object in the test image;
  • updating the first object recognition model based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response.
  • an embodiment of the present application provides a computer storage medium; the computer storage medium stores one or more instructions, and the one or more instructions are loaded by a processor to execute the following steps:
  • acquiring a template image and a test image for training, where the template image and the test image both include a tracking object, the test image includes a tracking label of the tracking object, and the tracking label is used to indicate the marked position of the tracking object in the test image;
  • calling a first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a first reference response, and calling a second object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a second reference response;
  • calling the first object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a first test response, and calling the second object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a second test response;
  • performing tracking processing on the first test response to obtain a tracking response of the tracking object, where the tracking response is used to indicate the tracking position of the tracking object in the test image;
  • updating the first object recognition model based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response.
  • Fig. 1a is a scene diagram of visual target tracking based on a first object recognition model provided by an embodiment of the present application
  • FIG. 1b is a schematic diagram of the implementation environment of the model training method provided by the embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a model training method provided by an embodiment of the present application.
  • Fig. 3a is a structural diagram of a convolutional neural network provided by an embodiment of the present application.
  • Figure 3b is a schematic diagram of determining a tracking response and a tracking label provided by an embodiment of the present application
  • FIG. 4 is a schematic flowchart of another model training method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of acquiring a first object recognition model provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of joint optimization of a first object recognition model provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of obtaining positive samples and negative samples according to another embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a model training device provided by an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of a terminal provided by an embodiment of the present application.
  • Visual target tracking has mainly relied on traditional image processing models to perform tracking processing.
  • Traditional image processing models, however, are designed for image classification tasks and are obtained by training with image classification data.
  • Since visual target tracking is not an image classification task, traditional image processing models are not well suited to visual target tracking scenarios, resulting in low tracking accuracy.
  • the embodiments of the present application provide a first object recognition model.
  • The first object recognition model refers to an image recognition model with image recognition performance, such as a Visual Geometry Group (VGG) model, a GoogLeNet model, or a deep residual network (ResNet) model.
  • VGG: Visual Geometry Group
  • ResNet: deep residual network
  • The first object recognition model can accurately extract features from an image, and the extracted features are more suitable for visual target tracking scenarios. Therefore, applying the first object recognition model in combination with a related tracking algorithm to visual target tracking scenarios can improve the accuracy and real-time performance of visual target tracking.
  • The steps of using the first object recognition model and a tracking algorithm to achieve visual target tracking may include the following. (1) Obtain a to-be-processed image and a reference image that includes a tracking object; the tracking object is the image element in the reference image that needs to be tracked, such as a person or animal in the reference image. The reference image may include label information of the tracking object, and the label information is used to indicate the size and position of the tracking object; the label information may be represented in the form of a label box, for example as shown in Fig. 1a described below. Based on the label information in the reference image, the predicted tracking objects included in the image to be processed are determined; a predicted tracking object refers to an image element in the image to be processed that may be the tracking object. For example, multiple candidate boxes can be generated in the image to be processed according to the size of the label box in the reference image, and each candidate box represents a predicted tracking object; A, B, and C in Fig. 1a described below indicate three predicted tracking objects that have been determined. (2) Invoke the first object recognition model to perform recognition processing on the tracking object in the reference image to obtain a first recognition feature; the first recognition feature refers to a feature of the tracking object, such as the facial contour feature, eye feature, or posture feature of the tracking object. (3) Invoke the first object recognition model to perform recognition processing on the predicted tracking objects included in the image to be processed to obtain second recognition features; a second recognition feature refers to a feature of each predicted tracking object, such as the facial contour feature, eye feature, nose feature, or posture feature of each predicted tracking object. (4) Determine a target feature for tracking processing based on the first recognition feature and the second recognition features, and use a tracking algorithm to perform tracking processing on the target feature to obtain the position of the tracking object in the image to be processed.
  • the tracking algorithm may include a correlation filter tracking algorithm, a dual network-based tracking algorithm, a sparse representation algorithm, etc.
  • the correlation filter algorithm is taken as an example in this embodiment. After the correlation filter algorithm performs tracking processing on the target feature, a Gaussian-shaped response map is obtained, and the position of the peak on the response map represents the position of the tracked object in the image to be processed.
  • Determining the target feature for tracking processing based on the first recognition feature and the second recognition features can be understood as follows: by analyzing the feature of the tracking object and the feature of each predicted tracking object, it is determined which predicted tracking object is to be taken as the tracking object included in the image to be processed, so that the feature of that predicted tracking object is processed by the tracking algorithm to obtain the position of the tracking object in the image to be processed, thereby completing the tracking of the tracking object.
  • The implementation of step (4) may include: scoring each second recognition feature against the first recognition feature, and determining the second recognition feature with the highest matching score as the target feature (see the scoring sketch below).
  • The implementation of step (4) may alternatively include: performing fusion processing on the second recognition features and determining the result of the fusion processing as the target feature.
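As an illustration of the scoring-based implementation of step (4), the following minimal sketch compares each candidate's feature against the reference feature and picks the best match. The function and variable names (such as cosine_score) and the use of cosine similarity are assumptions for illustration, not the patent's method:

```python
import numpy as np

def cosine_score(a: np.ndarray, b: np.ndarray) -> float:
    """Similarity score between two flattened feature maps."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_target_feature(first_feature: np.ndarray, second_features: list) -> np.ndarray:
    """Score each predicted-tracking-object feature against the reference
    feature and return the one with the highest matching score."""
    scores = [cosine_score(first_feature, f) for f in second_features]
    return second_features[int(np.argmax(scores))]
```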
  • Fig. 1a shows a scene of visual target tracking provided by this embodiment of the application.
  • 101 represents a reference image
  • 102 is an image to be processed
  • 1011 represents the label information of the tracking object in the form of a label frame
  • The size of the label frame 1011 represents the size of the tracking object in the reference image, and the position of the label frame 1011 represents the location of the tracking object in the reference image.
  • 103 represents the first object recognition model.
  • The first object recognition model 103 is called to perform recognition processing on 1011 to obtain the first recognition feature, and the first object recognition model is also called to recognize the predicted tracking objects A, B, and C respectively to obtain three second recognition features.
  • The target feature is determined based on the first recognition feature and the three second recognition features; assume that the second recognition feature corresponding to the predicted tracking object C is determined as the target feature. A tracking algorithm, such as the correlation filter tracking algorithm, is then used to perform tracking processing on the target feature to obtain a Gaussian-shaped response map, and the peak point on the response map indicates the position of the tracking object in the image to be processed, as shown at 104.
  • On this basis, an embodiment of the application also proposes a model training method, which is used to train the first object recognition model so as to ensure that the first object recognition model can accurately extract features from images and that the extracted features are more suitable for tracking scenarios.
  • the model training method may be executed by a computing device such as a terminal, and specifically may be executed by a processor of the terminal.
  • the terminal may include, but is not limited to: a smart terminal, a tablet computer, a laptop computer, a desktop computer, and so on.
  • Fig. 1b is a schematic diagram of an implementation environment of the model training method provided by an embodiment of the application.
  • the terminal device 10 and the server device 20 are communicatively connected through a network 30, and the network 30 may be a wired network or a wireless network.
  • the terminal device 10 and the server device 20 are integrated with the model training device provided in any embodiment of the present application, which is used to implement the model training method provided in any embodiment of the present application.
  • the model training method proposed in the embodiment of the present application may include the following steps S201-S205:
  • Step S201 Obtain a template image and a test image for training.
  • The template image and the test image are images used to train and update the model; both include the tracking object, and the template image may also include label information of the tracking object.
  • The label information of the tracking object is used to indicate the size and position of the tracking object in the template image, and the label information may be annotated for the template image by the terminal.
  • The test image further includes a response label corresponding to the test image; the response label is used to indicate the marked position of the tracking object in the test image.
  • The marked position may refer to the true position of the tracking object in the test image as marked by the terminal; the test image may also include label information of the tracking object.
  • The label information of the tracking object in the test image is used to indicate the size and position of the tracking object in the test image.
  • The template image and the test image may be two frames of images in the same video sequence.
  • For example, a video sequence including the tracking object is recorded by a camera; any frame that includes the tracking object is selected from the video sequence as the template image, and, apart from the template image, another frame that includes the tracking object is selected from the video sequence as the test image.
  • the template image and the test image may not be images in the same video sequence.
  • For example, the template image may be an image obtained by a shooting device shooting a first shooting scene that includes the tracking object, and the test image may be an image obtained by the shooting device, before or after the template image is obtained, shooting a second shooting scene that includes the tracking object; that is, the template image and the test image are two independent images.
  • In the embodiments of the present application, the case in which the template image and the test image belong to the same video sequence is taken as an example for description.
  • Step S202: Invoke the first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a first reference response, and invoke the second object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a second reference response.
  • Step S203: Invoke the first object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a first test response, and invoke the second object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a second test response.
  • What the first object recognition model and the second object recognition model have in common is that both are image recognition models with image recognition performance.
  • the convolutional neural network model has become a commonly used image recognition model due to its strong feature extraction performance.
  • The first object recognition model and the second object recognition model in the embodiments of the present application may both be convolutional neural network models, such as the VGG model, the GoogLeNet model, and the ResNet model.
  • The difference between the first object recognition model and the second object recognition model is that the second object recognition model is an image recognition model that has already been trained and updated, or a pre-trained and tested image recognition model, while the first object recognition model is an image recognition model to be updated.
  • the convolutional neural network model is mainly used in image recognition, face recognition, and text recognition.
  • The network structure of a convolutional neural network can be as shown in Fig. 3a: it mainly includes a convolutional layer 301, a pooling layer 302, and a fully connected layer 303.
  • Each convolutional layer is connected to a pooling layer.
  • the convolutional layer 301 is mainly used for feature extraction.
  • the pooling layer 302 is also called a sub-sampling layer and is mainly used to reduce the scale of input data.
  • The fully connected layer 303 calculates classification scores according to the features extracted by the convolutional layers, and finally outputs the classes and their corresponding scores. It can be seen from this that the network structures of the first object recognition model and the second object recognition model also include convolutional layers, pooling layers, and fully connected layers.
  • Each convolutional neural network model includes multiple convolutional layers, and each convolutional layer is responsible for extracting different features of the image; the features extracted by the previous convolutional layer are used as the input of the next convolutional layer.
  • The features each convolutional layer is responsible for extracting can be set according to a specific function, or set manually.
  • For example, the first convolutional layer can be set to be responsible for extracting the overall shape features of a figure, the second convolutional layer for extracting the line features of the figure, the third convolutional layer for extracting the discontinuity features of the figure, and so on.
  • For another example, when recognizing images containing human faces, the first convolutional layer can be set to be responsible for extracting the contour features of the human face, and the second convolutional layer can be responsible for extracting the facial features of the human face.
  • Each convolutional layer includes multiple filters of the same size for convolution calculation; each filter corresponds to a filter channel, and each filter yields a set of features after convolution calculation. Therefore, each convolutional layer recognizes its input image and extracts multi-dimensional features. The more convolutional layers there are, the deeper the network structure of the convolutional neural network model and the more features are extracted; the more filters each convolutional layer includes, the higher the dimension of the features extracted by that layer.
  • If a model includes more convolutional layers, and/or more filters in each convolutional layer, a larger storage space is required when storing the model; a model that requires more storage space is called a heavyweight model.
  • Conversely, if a model includes fewer convolutional layers, and/or a smaller number of filters in each convolutional layer, the model does not require large storage space when it is stored; a model that requires less storage space is called a lightweight model.
  • The first object recognition model and the second object recognition model may both be heavyweight models, or the second object recognition model may be a heavyweight model while the first object recognition model is a lightweight model obtained by performing model compression processing on the second object recognition model.
  • If the first object recognition model is a heavyweight model, the updated first object recognition model can extract high-dimensional features and has better recognition performance, so applying it to visual target tracking scenarios can improve tracking accuracy.
  • If the first object recognition model is a lightweight model obtained by performing model compression processing on the second object recognition model, the updated first object recognition model has feature extraction performance similar to that of the second object recognition model, and because it occupies less storage space it can be used effectively in mobile devices and other low-power products.
  • In such products it can perform feature extraction quickly to achieve real-time visual target tracking. In practical applications, whether the first object recognition model is a heavyweight model or a lightweight model can be chosen according to the requirements of the specific scenario.
  • Calling the first object recognition model in step S202 to perform recognition processing on the features of the tracking object in the template image to obtain the first reference response is essentially calling the convolutional layers of the first object recognition model to perform feature extraction processing on the features of the tracking object in the template image to obtain the first reference response.
  • The first reference response is used to represent the features of the tracking object in the template image as recognized by the first object recognition model, such as size, shape, and contour, and the first reference response can be represented by a feature map. Similarly, the second reference response is used to represent the features of the tracking object in the template image as recognized by the second object recognition model; the first test response is used to represent the features of the tracking object in the test image as recognized by the first object recognition model; and the second test response is used to represent the features of the tracking object in the test image as recognized by the second object recognition model.
  • the template image may include labeling information of the tracking object
  • The function of the label information may be to determine the size and location, in the template image, of the tracking object to be recognized by the first object recognition model, so that the first object recognition model can accurately determine what needs to be recognized; the label information of the tracking object in the template image can be represented in the form of a label box.
  • Calling the first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain the first reference response may refer to calling the first object recognition model to recognize the template image in combination with the label information in the template image.
  • In other words, calling the first object recognition model to recognize the features of the tracking object in the template image to obtain the first reference response may refer to recognizing the features within the label box of the template image.
  • In another embodiment, if the template image only includes the tracking object, or includes the tracking object and background that does not affect the recognition processing of the tracking object, such as a wall, the ground, or the sky, then regardless of whether the terminal sets label information of the tracking object in the template image, the first object recognition model can accurately determine the object that needs to be recognized.
  • The implementation of calling the first object recognition model to recognize the features of the tracking object in the template image to obtain the first reference response may be: the template image is used as the input of the first object recognition model; the first convolutional layer of the first object recognition model uses multiple filters of a certain size to perform convolution calculation on the template image and extracts the first feature of the tracking object in the template image; the first feature is used as the input of the second convolutional layer, which uses multiple filters to perform convolution calculation on the first feature and extracts the second feature of the tracking object in the template image; the second feature is input to the third convolutional layer, which uses multiple filters to perform convolution calculation on the second feature to obtain the third feature of the tracking object in the template image; and so on, until the last convolutional layer completes its convolution calculation, and the output result is the first reference response.
  • The implementation of calling the second object recognition model to perform recognition processing on the template image to obtain the second reference response may be the same as the implementation described above, and is not repeated here.
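To make this layer-by-layer recognition process concrete, the following is a minimal sketch of passing a template image through successive convolutional layers to obtain a reference response. It assumes PyTorch; the number of layers, channel widths, kernel sizes, and the input size are illustrative assumptions rather than the patent's configuration:

```python
import torch
import torch.nn as nn

class SmallRecognitionModel(nn.Module):
    """Illustrative stack of convolutional layers; the output of the last
    layer plays the role of the reference/test response (a feature map)."""
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),   # first feature
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),  # second feature
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),  # third feature
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        return self.layers(image)

# template batch: 1 image, 3 channels, 127x127 pixels (the size is an assumption)
template = torch.randn(1, 3, 127, 127)
first_reference_response = SmallRecognitionModel()(template)
```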
  • Step S204 Perform tracking processing on the first test response to obtain the tracking response of the tracking object.
  • the embodiment of the present application implements the tracking training of the first object recognition model through step S204.
  • The step S204 may include: using a tracking training algorithm to perform tracking processing on the first test response to obtain a tracking response of the tracking object.
  • the tracking training algorithm is an algorithm for tracking and training the first object recognition model, and may include a correlation filter tracking algorithm, a tracking algorithm based on a dual network, a sparse representation algorithm, and the like.
  • The tracking response is used to indicate the tracking position of the tracking object in the test image determined according to the tracking training algorithm and the first test response; in effect, the tracking position can be understood as the position of the tracking object in the test image as predicted according to the tracking training algorithm and the first test response.
  • Assuming the tracking training algorithm is the correlation filter algorithm, the way of using the tracking training algorithm to track the first test response and obtain the tracking response of the tracking object may be: the first test response is tracked to obtain a Gaussian-shaped response map, and the tracking response is determined according to the response map.
  • The implementation of determining the tracking response according to the response map may be: using the response map as the tracking response.
  • The response map can reflect the tracking position of the tracking object in the test image; the maximum point or peak point in the response map can be used as the tracking position of the tracking object in the test image, as illustrated in the sketch below.
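For illustration, a minimal numpy sketch of this step follows. It assumes a single-channel first test response z and an already learned correlation filter w of the same size; the frequency-domain convention used here (conjugating the filter) is one common choice rather than the patent's exact formula:

```python
import numpy as np

def correlation_response(w: np.ndarray, z: np.ndarray) -> np.ndarray:
    """Correlate the learned filter w with the test feature z in the
    frequency domain and return the (roughly Gaussian-shaped) response map."""
    w_hat = np.fft.fft2(w)
    z_hat = np.fft.fft2(z)
    return np.real(np.fft.ifft2(np.conj(w_hat) * z_hat))

def peak_position(response: np.ndarray) -> tuple:
    """The peak of the response map is taken as the tracking position."""
    return np.unravel_index(int(np.argmax(response)), response.shape)
```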
  • the tracking tag is used to indicate the marked position of the tracked object in the test image, and the marked position may refer to the real position of the tracked object in the test image pre-marked by the terminal.
  • the tracking label may also be a Gaussian-shaped response graph, and the peak point on the response graph represents the true position of the tracking object in the test image.
  • FIG. 3b is a schematic diagram of determining the tracking label and the tracking response provided by this embodiment of the application.
  • 304 represents a test image and 3041 represents a tracking object.
  • The tracking label that the terminal pre-marks for the test image can be as shown at 306 in Fig. 3b, and the peak point 3061 on 306 indicates the marked position of the tracking object in the test image.
  • the tracking response may be determined according to the characteristics of the specific tracking training algorithm.
  • Step S205: Update the first object recognition model based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response.
  • the first reference response is used to represent the characteristics of the tracking object in the template image recognized by the first object recognition model, such as size, shape, contour, etc.
  • The second reference response is used to represent the features of the tracking object in the template image as recognized by the second object recognition model. It can be seen that the difference information between the first reference response and the second reference response may include the size of the difference between the features extracted by the first object recognition model and the second object recognition model when they perform feature extraction on the template image.
  • the size of the difference between the features can be represented by the distance between the features.
  • For example, suppose the first reference response includes the facial contour of the tracking object in the template image as recognized by the first object recognition model, denoted facial contour 1, and the second reference response includes the facial contour of the tracking object in the template image as recognized by the second object recognition model, denoted facial contour 2; the difference information between the first reference response and the second reference response may then include the distance between facial contour 1 and facial contour 2.
  • The size of the difference between features can also be represented by a similarity value between the features: the larger the similarity value, the smaller the difference between the features; the smaller the similarity value, the greater the difference.
  • Similarly, the difference information between the first test response and the second test response may include the size of the difference between the features extracted by the first object recognition model and the second object recognition model when they perform feature extraction on the test image.
  • It can be seen from the description of step S204 that the difference information between the tracking label and the tracking response reflects the distance between the tracking position of the tracking object in the test image and the marked position.
  • Specifically, the process may be: based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response, determine the value of the loss optimization function corresponding to the first object recognition model, and then update the first object recognition model according to the principle of reducing the value of the loss optimization function.
  • the update here refers to: update each model parameter in the first object recognition model.
  • the model parameters of the first object recognition model may include but are not limited to: gradient parameters, weight parameters, and so on.
  • In the embodiments of the present application, the first object recognition model and the second object recognition model are first called to recognize the features of the tracking object in the template image to obtain the first reference response and the second reference response, and are then called to recognize the features of the tracking object in the test image to obtain the first test response and the second test response. Further, the first test response is subjected to tracking processing to obtain the tracking response of the tracking object. Then, based on the difference information between the first reference response and the second reference response and the difference information between the first test response and the second test response, the loss of feature extraction performance of the first object recognition model relative to the second object recognition model is determined; and based on the difference information between the tracking label and the tracking response, the loss of tracking performance of the first object recognition model is determined.
  • FIG. 4 is a schematic flowchart of another model training method provided by an embodiment of the present application.
  • the model training method can be executed by computing devices such as terminals; the terminals here can include, but are not limited to: smart terminals, tablet computers, laptop computers, desktop computers, and so on.
  • the model training method may include the following steps S401-S408:
  • Step S401 Obtain a second object recognition model, and crop the second object recognition model to obtain a first object recognition model.
  • the second object recognition model is a trained heavyweight model for image recognition
  • the first object recognition model is a lightweight model for image recognition to be trained.
  • the model compression refers to compressing the trained heavyweight model in time and space to remove some unimportant filters or parameters included in the heavyweight model and improve the feature extraction speed.
  • the model compression may include model cropping and model training.
  • the model cropping refers to reducing the network structure of the second object recognition model by cropping the number of filters and feature channels included in the model.
  • The model training refers to using the second object recognition model, the template image, and the test image, based on the transfer learning technique, to update and train the cropped first object recognition model,
  • so that the first object recognition model has the same or similar feature recognition performance as the second object recognition model.
  • the transfer learning technology refers to transferring the performance of one model to another model.
  • In the embodiments of the present application, transfer learning refers to calling the second object recognition model to recognize the features of the tracking object in the template image to obtain the second reference response and using the second reference response as a supervision label to train the first object recognition model to recognize the features of the tracking object in the template image, and then calling the second object recognition model to recognize the features of the tracking object in the test image to obtain the second test response and using the second test response as a supervision label to train the first object recognition model to recognize the features of the tracking object in the test image.
  • The teacher-student model is a typical model compression method based on the transfer learning technique.
  • In this framework, the second object recognition model is equivalent to the teacher model,
  • and the first object recognition model is equivalent to the student model.
  • Cropping may refer to removing a certain proportion of the filters included in each convolutional layer of the second object recognition model and/or reducing the number of feature channels corresponding to each convolutional layer by the corresponding proportion. For example, the number of filters and the number of feature channels in each convolutional layer of the second object recognition model may be reduced by three-fifths, seven-eighths, or any other proportion; in practice, reducing the number of filters included in each convolutional layer of the second object recognition model and the number of feature channels corresponding to each convolutional layer by seven-eighths allows a better first object recognition model to be obtained through training and updating. For example, refer to FIG. 5.
  • FIG. 5 is a schematic diagram, provided by this embodiment of the application, of cropping the second object recognition model to obtain the first object recognition model.
  • Since the above cropping of the second object recognition model only involves the convolutional layers, for convenience of description only the convolutional layers of the first object recognition model and the second object recognition model are shown in FIG. 5.
  • the second object recognition model is a VGG-8 model
  • the first object recognition model is also a VGG-8 model.
  • The VGG-8 model includes 5 convolutional layers; 501 represents the convolutional layers of the second object recognition model, and 502 represents the convolutional layers of the first object recognition model. 503 shows the number of filters included in each convolutional layer of the second object recognition model, the number of feature channels, and the filter size; reducing the number of filters and feature channels by seven-eighths yields the number of filters, the number of feature channels, and the filter size of each convolutional layer of the first object recognition model, as shown at 504.
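As an illustration of this kind of cropping, the following sketch builds a student model whose per-layer filter counts are one eighth of the teacher's. It assumes PyTorch, and the teacher layer widths listed are placeholders, not the actual VGG-8 configuration of the patent:

```python
import torch.nn as nn

TEACHER_WIDTHS = [64, 128, 256, 512, 512]   # illustrative filter counts per conv layer

def build_model(widths, in_channels=3):
    """Stack of conv + ReLU + pooling blocks with the given filter counts."""
    layers, prev = [], in_channels
    for w in widths:
        layers += [nn.Conv2d(prev, w, kernel_size=3, padding=1),
                   nn.ReLU(),
                   nn.MaxPool2d(2)]
        prev = w
    return nn.Sequential(*layers)

teacher = build_model(TEACHER_WIDTHS)
# crop: keep one eighth of the filters (and hence feature channels) per layer
student = build_model([max(1, w // 8) for w in TEACHER_WIDTHS])
```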
  • Step S402 Obtain a template image and a test image for training, where both the template image and the test image include a tracking object, and the test image includes a tracking tag of the tracking object, and the tracking tag is used to represent the tracking object The label position in the test image.
  • Step S403: Invoke the first object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a first reference response, and invoke the second object recognition model to perform recognition processing on the features of the tracking object in the template image to obtain a second reference response.
  • Step S404: Invoke the first object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a first test response, and invoke the second object recognition model to perform recognition processing on the features of the tracking object in the test image to obtain a second test response.
  • Step S405 Perform tracking processing on the first test response to obtain the tracking response of the tracking object.
  • step S405 may include using a tracking training algorithm to track the first test response to obtain the tracking response of the tracking object.
  • The tracking training algorithm may include tracking algorithm parameters, and the implementation of using the tracking training algorithm to track the first test response and obtain the tracking response of the tracking object in the test image may be: the first test response is substituted into the tracking training algorithm with known tracking algorithm parameters for calculation, and the tracking response is determined according to the calculated result.
  • the tracking algorithm parameters in the tracking training algorithm described in the embodiment of the present application are obtained by training the tracking training algorithm according to the second object recognition model and the template image.
  • the following takes the tracking training algorithm as the correlation filter algorithm as an example to introduce the process of using the second object recognition model and template image to train the tracking training algorithm to obtain the tracking algorithm parameters of the correlation filter tracking algorithm.
  • The tracking algorithm parameters of the correlation filter tracking algorithm refer to the filter parameters of the correlation filter, and the training process of the correlation filter algorithm may include steps S11-S13:
  • Step S11 generating training samples according to the template image, and obtaining tracking labels corresponding to the training samples
  • the template image includes the tracking object and the tracking label corresponding to the tracking object
  • the training sample generated from the template image also includes the tracking object.
  • the tracking label corresponding to the tracking object included in the template image may refer to the real position of the tracking object in the template image
  • the tracking label including the tracking object in the template image may be pre-marked by the terminal.
  • The method of generating training samples based on the template image may be: cutting out an image block including the tracking object from the template image and performing cyclic shift processing on the image block to obtain the training samples; the tracking label corresponding to each training sample is determined according to the tracking label included in the template image and the degree of the cyclic shift operation.
  • The method of cyclically shifting the template image can be: pixelating the image block of the template image to determine the pixels used to represent the tracking object, these pixels forming the pixel matrix of the tracking object; each row of the pixel matrix is then cyclically shifted to obtain multiple new pixel matrices. In this cyclic shift process the value of each pixel does not change, only its position; therefore, each shifted matrix still represents the tracking object, but the position of the tracking object rendered by the new pixel matrix has changed.
  • Specifically, each row of the pixel matrix can be expressed as an n×1 vector, and each vector element corresponds to a pixel; the pixels in the vector are shifted right or left one step at a time, and a new vector is obtained after each shift.
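A minimal numpy sketch of cyclic-shift sample generation is given below; the use of np.roll and the choice of evenly spaced row shifts are illustrative assumptions about how the described operation can be realized:

```python
import numpy as np

def cyclic_shift_samples(image_block: np.ndarray, num_samples: int) -> list:
    """Generate cyclically shifted copies of an image block; pixel values are
    unchanged, only their positions move, so each copy still depicts the
    tracking object, but at a shifted location."""
    h = image_block.shape[0]
    shifts = np.linspace(0, h, num_samples, endpoint=False).astype(int)
    return [np.roll(image_block, s, axis=0) for s in shifts]
```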
  • Step S12 call the second object recognition model to perform feature extraction processing on the training sample, and obtain the feature of the tracking object in the training sample;
  • Calling the second object recognition model to perform feature extraction processing on multiple training samples is essentially a process of calling the convolutional layer of the second object recognition model to perform feature extraction on the training samples.
  • the second object recognition model includes multiple convolutional layers, and each convolutional layer includes multiple filters for convolution calculation, so the features extracted by each convolutional layer are multi-dimensional, and each convolutional layer The extracted multi-dimensional features are successively used as the input of the next convolutional layer until the output of the last convolutional layer is obtained.
  • For example, the second object recognition model includes 5 convolutional layers. Assume the feature dimension of the obtained training sample is D, and let x_i denote the feature of the i-th dimension extracted by the second object recognition model; the feature of the training sample extracted by the second object recognition model can then be expressed as x = (x_1, x_2, ..., x_D).
  • Step S13: Obtain a ridge regression equation for determining the correlation filter parameters, and solve the ridge regression equation to obtain the correlation filter parameters.
  • the working principle of the correlation filter algorithm is: extract the features of the image including the tracking object; convolve the extracted features with the correlation filter to obtain a response map, and determine the location of the tracking object in the image from the response map .
  • the convolution operation is required between two quantities of the same size. Therefore, it is necessary to ensure that the dimensions of the correlation filter and the characteristics of the training sample are the same.
  • the ridge regression equation corresponding to the correlation filter algorithm can be shown in formula (1):
  • In formula (1), ⋆ represents the convolution operation, D represents the feature dimension of the training sample extracted by the second object recognition model, w_i represents the filter parameter of the i-th dimension of the correlation filter, x represents the training sample, y represents the tracking label of the training sample x, and λ represents the regularization coefficient.
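Based on these definitions, formula (1) presumably takes the standard correlation-filter ridge regression form; the following is a reconstruction under that assumption, not a verbatim copy of the patent's equation:

```latex
\varepsilon(w) = \Big\| \sum_{i=1}^{D} w_i \star x_i - y \Big\|^{2}
               + \lambda \sum_{i=1}^{D} \| w_i \|^{2}
\tag{1}
```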
  • By minimizing formula (1) and solving it in the frequency domain, the filter parameters of the correlation filter in each dimension can be obtained.
  • the formula for solving the filter parameters in the frequency domain is introduced.
  • the formula for solving the filter parameters of the d dimension in the frequency domain is expressed as (2):
  • In formula (2), w_d represents the correlation filter parameter of the d-th dimension, ⊙ represents the element-wise (dot) multiplication operation, and * represents the complex conjugate operation.
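Under the same assumption of the standard single-template solution, and writing a hat for the discrete Fourier transform of a quantity, formula (2) would read approximately as follows (again a reconstruction, not the patent's verbatim formula):

```latex
\hat{w}_d = \frac{\hat{y} \odot (\hat{x}_d)^{*}}
                 {\sum_{i=1}^{D} \hat{x}_i \odot (\hat{x}_i)^{*} + \lambda}
\tag{2}
```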
  • the filter parameters of the correlation filter of each dimension can be calculated, and the filter parameters of each dimension constitute the filter parameter of the correlation filter algorithm.
  • the first test response can be tracked based on the correlation filter algorithm to obtain the tracking response of the tracking object in the test image .
  • the correlation filter algorithm is used to track the first test response, and the tracking response to the tracking object in the test image can be expressed by formula (3),
  • In formula (3), w represents the filter parameters of the correlation filter, and r represents the tracking response.
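Assuming the conventional correlation-filter formulation, and letting z_d denote the d-th dimension of the first test response (z is a symbol introduced here for illustration), formula (3) would be approximately:

```latex
r = \mathcal{F}^{-1}\!\Big( \sum_{d=1}^{D} (\hat{w}_d)^{*} \odot \hat{z}_d \Big)
\tag{3}
```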
  • Step S406 Obtain a loss optimization function corresponding to the first object recognition model.
  • To better train the first object recognition model, this embodiment of the application proposes to jointly optimize the feature recognition loss and the tracking loss of the first object recognition model.
  • the loss optimization function corresponding to the first object recognition model can be expressed as formula (4):
  • Step S407: Determine the value of the loss optimization function based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking label and the tracking response.
  • The loss optimization function of the first object recognition model includes a feature recognition loss function and a tracking loss function.
  • Therefore, when the value of the loss optimization function is determined in step S407, the value of the feature recognition loss function and the value of the tracking loss function can be determined first, and the value of the loss optimization function is then determined according to the value of the feature recognition loss function and the value of the tracking loss function.
  • the tracking the difference information between the responses and determining the value of the loss optimization function includes: acquiring the feature recognition loss function, and based on the difference information between the first reference response and the second reference response, The difference information between the first test response and the second test response, determine the value of the feature recognition loss function; obtain the tracking loss function, and based on the difference between the tracking tag and the tracking response The information determines the value of the tracking loss function; the value of the loss optimization function is determined based on the value of the feature recognition loss function and the value of the tracking loss function.
  • the first reference response is used to represent the feature of the tracking object in the template image recognized by the first object recognition model
  • the second is used to represent the second object
  • the characteristics of the tracking object in the template image recognized by the recognition model, and the difference information between the first reference response and the second reference response reflects the comparison of the first object recognition model and the second object recognition model to the template image
  • the difference between the extracted features can be expressed by distance, that is, the difference between the first reference response and the second reference response
  • the information includes the distance between the first reference response and the second reference response;
  • the difference information between the first test response and the second test response includes the distance between the first test response and the second test response.
  • the feature recognition loss function restricts the distance between the above-mentioned features, so that the first object recognition model and the second object recognition model have the same or similar feature extraction performance. It can be seen that the feature loss optimization function includes two parts of loss, one is the feature recognition loss of the test image, and the other is the feature recognition loss of the template image.
  • the loss value of the feature recognition loss of the test image is determined by the distance between the first reference response and the second reference response, and the loss value of the feature recognition loss of the template image is determined by the first test response and the second test response.
  • the distance between is determined, and the loss value of the feature recognition loss about the test image and the loss value of the recognition loss about the reference image are substituted into the feature recognition loss function, and the value of the feature recognition loss function can be calculated.
  • the feature recognition loss function can be expressed as formula (5):
  • the difference between the tracking label and the tracking response reflects the Euclidean distance between the tracking response and the tracking label.
  • the tracking performance of the first object recognition model is optimized.
  • the value of the tracking loss function can be obtained.
  • the tracking loss function can be expressed as formula (6):
  • r represents the tracking response
  • g represents the tracking label
  • r can be obtained by formula (7)
  • w in formula (7) represents the filter parameter of the tracking training algorithm, which can be obtained by the steps of S11-S13.
  • the first object recognition model includes multiple convolutional layers, and the first test response is to perform fusion processing on each sub-test response obtained by performing recognition processing on the test image by each convolutional layer of the first object recognition model Obtained later.
  • the first object recognition model includes a first convolutional layer, a second convolutional layer, and a third convolutional layer
  • the first test response is the first test sub-response corresponding to the first convolutional layer
  • the The second test sub-response corresponding to the second convolution layer and the third test sub-response corresponding to the third convolution layer are fused.
  • the first object recognition model can be optimized for multi-scale tracking loss.
  • multi-scale tracking loss optimization refers to: calculating the tracking loss values of multiple convolutional layers of the first object recognition model, and then determining the first object recognition based on the tracking loss values of the multiple convolutional layers The value of the model's tracking loss function.
  • the tracking loss is determined based on the difference information between the tracking tag and the tracking response
  • the value of the function includes: based on the difference information between the first tracking label corresponding to the first convolutional layer and the first tracking response obtained by tracking the first test sub-response, determining the first The tracking loss value of the convolutional layer; based on the difference information between the second tracking label corresponding to the second convolutional layer and the second tracking response obtained by tracking the second test sub-response, determine the first The tracking loss value of the second convolutional layer; based on the difference information between the third tracking label corresponding to the third convolutional layer and the third tracking response obtained by tracking the third test sub-response, determining the The tracking loss value of the third convolutional layer; the tracking loss value corresponding to the first convolutional layer, the tracking loss value corresponding to the second convolutional layer, and the
  • the first tracking sub-response, the second tracking sub-response, and the third tracking sub-response may be the first test corresponding to the first convolutional layer, the second convolutional layer, and the third convolutional layer using a tracking training algorithm.
  • the sub-response, the second test sub-response, and the third test sub-response are tracked. Since the features extracted by different convolutional layers are different, the first tracking sub-response, the second tracking sub-response, and the third tracking sub-response have different resolutions.
  • the tracking algorithm parameters used by the tracking training algorithm to track the test subresponses of different convolutional layers are different.
  • the tracking algorithm parameters under a certain convolutional layer are through the second object recognition model and the corresponding convolutional layer.
  • the corresponding template image is obtained through training, and the specific training process can refer to steps S11-S13, which will not be repeated here.
  • the multiple convolutional layers included in the first object recognition model are connected together in the order of connection, and the first, second, and third convolutional layers mentioned above may be the first Any three convolutional layers in the convolutional layer of an object recognition model, or the first convolutional layer is the first convolutional layer indicated by the connection order, and the third convolutional layer is the connection The last convolutional layer indicated by the order, the second convolutional layer is any convolutional layer except the first convolutional layer and the last convolutional layer, at this time the first convolutional layer It can be called the high-level convolutional layer of the first object recognition model, the second object recognition model is the middle convolutional layer of the first object recognition model, and the third convolutional layer is the low-level convolution layer of the first object recognition model.
  • l represents the lth convolutional layer of the first object recognition model
  • r l represents the lth tracking sub-response obtained by tracking the lth test sub-response of the lth convolutional layer by the tracking algorithm
  • g l represents The tracking label of the tracking object included in the test image corresponding to the lth convolutional layer.
  • the tracking algorithm performs tracking processing on the first test sub-response of the first convolutional layer to obtain the first tracking sub-response
  • the tracking algorithm parameters corresponding to the first convolutional layer used are obtained through the second object recognition model and the first l
  • the template image corresponding to the convolutional layer is trained.
  • FIG. 6 a schematic diagram of joint optimization of a first object recognition model provided by an embodiment of this application.
  • the feature recognition loss optimization shown in the figure is as shown in formula (5) and the multi-scale tracking loss optimization is as shown in formula ( 8)
  • 601 represents the first object recognition model
  • 602 represents the second object recognition model.
  • Step S408 According to the principle of reducing the value of the loss optimization function, the first object recognition model is updated.
  • the value of the loss optimization function is continuously reduced, and the value of the feature recognition loss function and the value of the tracking loss function can be deduced according to the value of the loss optimization function, and then the model parameters of the first object recognition model are adjusted to make the first The distance between the reference response and the second reference response, and the distance between the first test response and the second test response meet the value of the feature recognition loss function; at the same time, adjust the model parameters of the first object recognition model so that the tracking response is The Euclidean distance between the tracking labels satisfies the value of the tracking loss function.
  • the template image and test image used in the foregoing steps S401-S408 to update the first object recognition model are both images that include the tracking object, this ensures that the updated first object recognition model can perform better on the tracking object.
  • the ability to extract features may include other backgrounds in addition to the tracking object. Therefore, in order to further improve the capability of the first object recognition model, the embodiment of the present application After the first object recognition model is updated through steps S401-S408, the positive and negative samples are also used to update the first object recognition model, so that the first object recognition model has better feature discrimination ability, that is, it can Better distinguish the tracked objects and background included in the image.
  • using the positive sample and the negative sample to update the first object recognition model may include: acquiring a reference image including the tracking object, and determining the positive sample and the negative sample for training based on the reference image, the reference The image may be the first frame of image in the video sequence to be tracked using the first object recognition model, the positive sample refers to an image that includes the tracking object, and the negative sample refers to an image that does not include the tracking object , The positive sample includes the positive sample tracking label of the tracking object, and the negative sample includes the negative sample tracking label of the tracking object; calling the updated first object recognition model to identify the positive sample , Obtain a positive sample recognition response, and call the updated first object recognition model to perform recognition processing on the negative sample to obtain a negative sample recognition response; track the positive sample recognition response to obtain the positive Tracking response to the positive sample of the tracking object in the sample; tracking the negative sample identification response to obtain the negative sample tracking response to the tracking object in the negative sample; based on the positive sample The difference information between the tracking response and the positive sample tracking label, and the difference
  • the method of obtaining positive samples and negative samples based on the reference image may be: randomly cropping the reference image to obtain multiple image blocks, taking the image block containing the tracking object as the positive sample, and will not include The image block of the tracking object is taken as a negative sample.
  • the positive sample tracking label corresponding to the positive sample is the true position of the tracking object in the positive sample, and the negative sample does not contain the tracking object, and the corresponding negative sample tracking label is 0.
  • Figure 7 shows a schematic diagram of obtaining positive samples and negative samples.
  • Figure 7 701 is a reference image.
  • the reference image is randomly cropped to obtain multiple image blocks, such as multiple labeled frames included in 701, each The labeled box represents an image block; assuming that the tracking object is 702, select the image block including 702 from the multiple image blocks of 701 as positive samples, such as 703 and 704 in the figure, and select the image block not including 702 as negative samples. 705 and 706 in the figure.
  • the positive sample tracking labels corresponding to 703 and 704 are the true positions of the tracking objects in 703 and 704, as shown by the dots in the lower figure of 703 and 704. Since the negative samples 705 and 706 do not include tracking objects, their corresponding tracking labels are 0, so no dots appear.
  • the training is based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference information between the negative sample tracking response and the negative sample tracking label.
  • the updated first object recognition model includes: obtaining a tracking loss optimization function; based on difference information between the positive sample tracking response and the positive sample tracking label, and the negative sample tracking response and the negative sample. The difference information between the tracking tags is determined, and the value of the tracking loss optimization function is determined; and the updated first object recognition model is trained according to the principle of reducing the value of the tracking loss optimization function.
  • the difference information between the positive sample tracking response and the positive sample tracking label includes that the first object recognition model performs tracking processing on the positive sample to obtain the Euclidean distance between the position of the tracking object and the true position of the tracking object in the positive sample.
  • the difference information between the negative sample tracking response and the negative sample tracking label includes the tracking processing of the negative sample by the first object recognition model, and the obtained position of the tracking object is the difference between the tracking object and the true position of the tracking object in the negative sample. Euclidean distance between. Bring the above two into the tracking loss optimization function to obtain the value of the tracking loss optimization function, and then update the updated first object recognition model again according to the principle of reducing the value of the tracking loss optimization function. By repeating the steps of tracking loss optimization, the update of the updated first object recognition model is completed.
  • multi-scale tracking optimization may also be adopted.
  • the first object recognition model includes a first convolutional layer, a second convolutional layer, and a third convolutional layer
  • the positive sample tracking label includes the first positive sample tracking label corresponding to the first convolutional layer, and the The second positive sample tracking label corresponding to the second convolutional layer and the third positive sample tracking label corresponding to the third convolutional layer
  • the positive sample recognition response is the first sub-recognition response of the positive sample corresponding to the first convolutional layer
  • the second The second sub-recognition response of the positive sample corresponding to the convolutional layer and the third sub-recognition response of the positive sample corresponding to the third convolutional layer are fused
  • the negative sample recognition response is the first negative sample corresponding to the first convolutional layer.
  • the positive sample tracking response may include a first positive sample tracking response obtained by using a tracking training algorithm to track the first sub-identification response of the positive sample, and a second positive sample tracking obtained by tracking the second sub-identification response of the positive sample.
  • the negative sample tracking response may include the first negative tracking response obtained when the tracking training algorithm is used to track the first negative sample identification response, and the second negative sample tracking response obtained when the tracking training algorithm tracks the second negative sample identification response.
  • the negative sample sub-tracking response, and the third negative sample sub-tracking response obtained when the tracking training algorithm performs tracking processing on the third negative sample identification response.
  • the implementation of the multi-scale tracking loss optimization may be based on the difference information between the first positive sample tracking response and the first positive sample tracking label, and the difference information between the first negative sample tracking response and the negative sample tracking response , Determine the value of the tracking loss optimization function of the first convolutional layer; based on the difference information between the second positive sample tracking response and the second positive sample tracking label, and the difference between the second negative sample tracking response and the negative sample tracking response Difference information, determine the value of the tracking loss optimization function of the second convolutional layer, and based on the difference information between the third positive sample tracking response and the third positive sample tracking label, and the third negative sample tracking response and the negative sample tracking response Determine the value of the tracking loss optimization function of the third convolutional layer; finally, according to the value of the tracking loss optimization function of the first convolutional layer, the value of the tracking loss optimization function of the second convolutional layer, and the third The value of the tracking loss optimization function of the convolutional layer determines the value of the tracking loss optimization function. It is assumed that the tracking loss optimization function of multi-
  • g l indicates the tracking label of the positive sample corresponding to the positive sample under the first convolutional layer
  • w l represents the tracking algorithm parameter corresponding to the lth convolutional layer.
  • the tracking algorithm parameters corresponding to different convolutional layers are trained by the second object recognition model and the corresponding positive samples under the corresponding convolutional layer.
  • the corresponding positive samples under different convolutional layers have the same size and different resolution.
  • the first object recognition model and certain tracking algorithms can be combined and applied in scene analysis, monitoring equipment, human-computer interaction and other scenes that require visual target tracking.
  • the implementation of combining the first object recognition model and certain tracking algorithms in the visual target tracking scene may include: acquiring the image to be processed, and determining the image to be processed according to the annotation information of the tracking object in the reference image
  • the image to be processed may be an image other than the first frame in the video sequence to be used for visual target tracking using the first object recognition model; call the updated first object recognition model to the reference Perform recognition processing on the tracking object in the image to obtain a first recognition feature; call the updated first object recognition model to recognize the predicted tracking object in the image to be processed to obtain a second recognition feature;
  • the first identification feature and the second identification feature determine a target feature for tracking processing, and use a tracking algorithm to track the target feature to obtain position information of the tracking object in the image to be processed .
  • the heavyweight second object recognition model is used to train the lightweight first object recognition model in the embodiment of this application
  • the first object recognition model and the second object recognition model are respectively called to the template image used for training.
  • the feature of the tracking object is identified by the first reference response and the second reference response, and then the first object recognition model and the second object recognition model are called to identify the feature of the tracking object in the test image used for training.
  • the first object recognition model is optimized according to the loss in feature extraction performance and the loss in tracking performance, so that the updated lightweight first object recognition model has the same or the same value as the second object recognition model. Relatively similar feature extraction performance, faster feature extraction speed, and ensure that the features extracted by the first object recognition model are more suitable for visual target tracking scenes, thereby improving the accuracy and real-time performance of visual target tracking.
  • an embodiment of the present application also discloses a model training device, which can execute the methods shown in FIG. 2 and FIG. 4. Please refer to Figure 8.
  • the model training device can run the following units:
  • the acquiring unit 801 is configured to acquire a training template image and a test image, the template image and the test image both include a tracking object, the test image includes a tracking label of the tracking object, and the tracking label is used to indicate the Track the marked position of the object in the test image;
  • the processing unit 802 is configured to call a first object recognition model to perform recognition processing on the characteristics of the tracking object in the template image, obtain a first reference response, and call a second object recognition model to analyze all the features in the template image. Performing recognition processing on the characteristics of the tracking object to obtain a second reference response;
  • the processing unit 802 is further configured to call the first object recognition model to perform recognition processing on the characteristics of the tracked object in the test image, obtain a first test response, and call the second object recognition model to Performing identification processing on the feature of the tracking object in the test image to obtain a second test response;
  • the processing unit 802 is further configured to perform tracking processing on the first test response to obtain a tracking response of the tracking object, where the tracking response is used to indicate the tracking position of the tracking object in the test image;
  • the update unit 803 is configured to be based on the difference information between the first reference response and the second reference response, the difference information between the first test response and the second test response, and the tracking tag and And update the first object recognition model by tracking the difference information between the responses.
  • the acquisition unit 801 is further configured to: acquire a second object recognition model; the processing unit 802 is also configured to; crop the second object recognition model to obtain the first object recognition model.
  • the update unit 803 is based on the difference information between the first reference response and the second reference response, and the difference information between the first test response and the second test response.
  • the difference information between the tracking tag and the tracking response when the first object recognition model is updated, the following operations are performed: obtain the loss optimization function corresponding to the first object recognition model; based on the first reference The difference information between the response and the second reference response, the difference information between the first test response and the second test response, and the difference information between the tracking tag and the tracking response, determine all The value of the loss optimization function; according to the principle of reducing the value of the loss optimization function, the first object recognition model is updated.
  • the loss optimization function includes a feature recognition loss function and a tracking loss function
  • the update unit 803 is based on the difference information between the first reference response and the second reference response, the first The difference information between a test response and the second test response and the difference information between the tracking tag and the tracking response, when determining the value of the loss optimization function, perform the following operations: obtain the feature recognition Loss function, and based on the difference information between the first reference response and the second reference response, and the difference information between the first test response and the second test response, determine the feature recognition loss function Obtain the tracking loss function, and determine the value of the tracking loss function based on the difference information between the tracking tag and the tracking response; Identify the value of the loss function and the tracking loss function based on the feature The value of determines the value of the loss optimization function.
  • the first object recognition model includes a first convolutional layer, a second convolutional layer, and a third convolutional layer
  • the first test response is the first convolutional layer corresponding to the first convolutional layer.
  • a test sub-response, a second test sub-response corresponding to the second convolutional layer, and a third test sub-response corresponding to the third convolutional layer are fused; the update unit 803 is based on the tracking label and When the difference information between the tracking responses determines the value of the tracking loss function, the following operations are performed:
  • the first object recognition model includes a plurality of convolutional layers, the plurality of convolutional layers are connected in a connection order, and the first convolutional layer is the first convolutional layer indicated by the connection order.
  • Convolutional layers the third convolutional layer is the last convolutional layer indicated by the connection order, and the second convolutional layer is divided by the first convolutional layer and the last convolutional layer Any convolutional layer outside the layer.
  • the acquiring unit 801 is further configured to acquire a reference image including a tracking object, and determine a positive sample and a negative sample for training based on the reference image, and the positive sample refers to including the tracking An image of an object, the negative sample refers to an image that does not include the tracking object, the positive sample includes the positive sample tracking label of the tracking object, and the negative sample includes the negative sample tracking label of the tracking object, so
  • the reference image includes the annotation information of the tracking object;
  • the processing unit 802 is further configured to call the updated first object recognition model to perform recognition processing on the positive sample to obtain a positive sample recognition response, and call the updated first object recognition model to Recognize the negative sample, and get the negative sample recognition response;
  • the processing unit 802 is further configured to track the positive sample identification response to obtain a positive sample tracking response to the tracking target in the positive sample; and track the negative sample identification response, Obtaining the negative sample tracking response to the tracking object in the negative sample;
  • the update unit 803 is further configured to train based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference information between the negative sample tracking response and the negative sample tracking label The updated first object recognition model.
  • the update unit 803 is based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference between the negative sample tracking response and the negative sample tracking label Information, when training the updated first object recognition model, perform the following steps:
  • Obtain a tracking loss optimization function determine the tracking based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference information between the negative sample tracking response and the negative sample tracking label
  • the value of the loss optimization function according to the principle of reducing the value of the tracking loss function, the updated first object recognition model is updated.
  • the acquiring unit 801 is further configured to acquire an image to be processed; the processing unit 802 is further configured to determine that the image to be processed is in the reference image according to the annotation information of the tracking object
  • the processing unit 802 is also used to call the updated first object recognition model to recognize the tracking object in the reference image to obtain the first recognition feature; the processing unit 803 , Is also used to call the updated first object recognition model to recognize the predicted tracking object in the to-be-processed image to obtain the second recognition feature;
  • the processing unit 802 is also used to The first identification feature and the second identification feature determine a target feature used for tracking processing, and use a tracking algorithm to track the target feature to obtain position information of the tracking object in the image to be processed.
  • each step involved in the method shown in FIG. 2 or FIG. 4 may be executed by each unit in the model training device shown in FIG. 8.
  • step S201 shown in FIG. 2 may be executed by the acquiring unit 801 shown in FIG. 8
  • steps S202-S204 may be executed by the processing unit 802 shown in FIG. 8
  • step S205 may be executed by the updating unit 803 shown in FIG.
  • steps S401, S402, and S406 shown in FIG. 4 can be executed by the acquiring unit 801 shown in FIG. 8, and steps S403-S405, and S407 can be executed by the processing unit 802 in FIG. 8, and step S408 It can be executed by the update unit 803 shown in FIG. 8.
  • each unit in the model training device shown in FIG. 8 can be separately or completely combined into one or several other units to form, or some of the units can be disassembled. It is composed of multiple units with smaller functions, which can achieve the same operation without affecting the realization of the technical effects of the embodiments of the present application.
  • the above-mentioned units are divided based on logical functions. In practical applications, the function of one unit may also be realized by multiple units, or the functions of multiple units may be realized by one unit. In other embodiments of the present application, the model-based training device may also include other units. In actual applications, these functions may also be implemented with the assistance of other units, and may be implemented by multiple units in cooperation.
  • a general-purpose computing device such as a computer including a central processing unit (CPU), a random access storage medium (RAM), a read-only storage medium (ROM) and other processing elements and storage elements
  • CPU central processing unit
  • RAM random access storage medium
  • ROM read-only storage medium
  • Run a computer program (including program code) capable of executing the steps involved in the corresponding method shown in FIG. 2 or FIG. 4 to construct the model training device as shown in FIG. 8 and to implement the embodiments of the present application
  • the model training method can be recorded on, for example, a computer-readable recording medium, and loaded into the aforementioned computing device via the computer-readable recording medium, and run in it.
  • the first object recognition model is first called separately, and the first object recognition model and the second object recognition model
  • the feature of the tracking object is identified to obtain the first reference response and the second reference response, and then the first object recognition model and the second object recognition model are called to identify the feature of the tracking object in the test image to obtain the first test response And the second test response; further, the first test response is tracked to obtain the tracking response of the tracked object; further, the difference information between the first reference response and the second reference response, the first test response and the The difference information between the second test responses is determined to determine the loss of feature extraction performance of the first object recognition model compared to the second object recognition model; and the first object recognition is determined based on the difference information between the tracking tag and the tracking response Loss of model tracking performance.
  • an embodiment of the present application also provides a computing device, such as the terminal shown in FIG. 9.
  • the terminal includes at least a processor 901, an input device 902, an output device 903, and a computer storage medium 904.
  • the input device 902 may also include a camera component, which can be used to obtain template images and/or test images, the camera component can also be used to obtain reference images and/or images to be processed, and the camera component can be a terminal
  • the components that are configured on the terminal at the factory can also be external components connected to the terminal.
  • the terminal may also be connected to other devices to receive template images and/or test images sent by other devices, or to receive reference images and/or images to be processed sent by other devices.
  • the computer storage medium 904 may be stored in the memory of the terminal.
  • the computer storage medium 904 is used to store a computer program.
  • the computer program includes program instructions.
  • the processor 901 is used to execute the program instructions stored in the computer storage medium 904. .
  • the processor 901, or CPU Central Processing Unit, central processing unit) is the computing core and control core of the terminal.
  • the processor 901 described in this embodiment of the application may be used to execute: obtain a template image and a test image for training, and both the template image and the test image include tracking objects ,
  • the test image includes a tracking tag of the tracking object, the tracking tag is used to indicate the marked position of the tracking object in the test image; calling the first object recognition model to the tracking object in the template image To obtain a first reference response, and call the second object recognition model to recognize the features of the tracking object in the template image to obtain a second reference response; call the first object
  • the recognition model performs recognition processing on the characteristics of the tracking object in the test image to obtain a first test response, and calls the second object recognition model to perform recognition processing on the characteristics of the tracking object in the test image , Obtain a second test response; perform tracking processing on the first test response to obtain a tracking response of the tracking object, the tracking response being used to indicate the tracking position of
  • the embodiment of the present application also provides a computer storage medium (Memory).
  • the computer storage medium is a memory device in a terminal for storing programs and data. It can be understood that the computer storage medium herein may include a built-in storage medium in the terminal, and of course, may also include an extended storage medium supported by the terminal.
  • the computer storage medium provides storage space, and the storage space stores the operating system of the terminal.
  • one or more instructions suitable for being loaded and executed by the processor 901 are stored in the storage space, and these instructions may be one or more computer programs (including program codes).
  • the computer storage medium here may be a high-speed RAM memory, or a non-volatile memory (non-volatile memory), such as at least one disk memory; in an embodiment of the present application, it may also be at least one located in A computer storage medium remote from the aforementioned processor.
  • the processor 901 can load and execute one or more instructions stored in the computer storage medium to implement the corresponding steps of the method in the above-mentioned model training embodiment; in a specific implementation, one of the instructions in the computer storage medium Or multiple instructions are loaded by the processor 901 and execute the following steps:
  • the template image and the test image both include a tracking object, the test image includes a tracking label of the tracking object, and the tracking label is used to indicate that the tracking object is Test the label position in the image; call the first object recognition model to identify the features of the tracking object in the template image to obtain the first reference response, and call the second object recognition model to analyze the features in the template image
  • the feature of the tracked object is identified to obtain a second reference response; the first object recognition model is invoked to identify the feature of the tracked object in the test image to obtain the first test response, and call all
  • the second object recognition model performs recognition processing on the characteristics of the tracked object in the test image to obtain a second test response; performs tracking processing on the first test response to obtain the tracking response of the tracked object, so
  • the tracking response is used to indicate the tracking position of the tracking object in the test image; based on the difference information between the first reference response and the second reference response, the first test response and the second 2.
  • one or more instructions in the computer storage medium are loaded by the processor 901 and the following steps are also performed: obtaining a second object recognition model; cropping the second object recognition model to obtain the first object recognition model .
  • the processor 901 is based on the difference information between the first reference response and the second reference response, and the difference information between the first test response and the second test response. And the difference information between the tracking tag and the tracking response, when updating the first object recognition model, perform the following operations:
  • the loss optimization function corresponding to the first object recognition model based on the difference information between the first reference response and the second reference response, the difference between the first test response and the second test response.
  • the difference information and the difference information between the tracking tag and the tracking response determine the value of the loss optimization function; and update the first object recognition model according to the principle of reducing the value of the loss optimization function .
  • the loss optimization function includes a feature recognition loss function and a tracking loss function
  • the processor 901 is based on the difference information between the first reference response and the second reference response, the first When the difference information between a test response and the second test response and the difference information between the tracking tag and the tracking response are determined, the following operations are performed when the value of the loss optimization function is determined:
  • the value of the feature recognition loss function acquire the tracking loss function, and determine the value of the tracking loss function based on the difference information between the tracking tag and the tracking response; recognize the value of the loss function based on the feature and The value of the tracking loss function determines the value of the loss optimization function.
  • the first object recognition model includes a first convolutional layer, a second convolutional layer, and a third convolutional layer
  • the first test response is the first convolutional layer corresponding to the first convolutional layer.
  • a test sub-response, a second test sub-response corresponding to the second convolutional layer, and a third test sub-response corresponding to the third convolutional layer are fused; the processor 901 is based on the tracking label and When the difference information between the tracking responses determines the value of the tracking loss function, the following operations are performed:
  • the first object recognition model includes a plurality of convolutional layers, the plurality of convolutional layers are connected in a connection order, and the first convolutional layer is the first convolutional layer indicated by the connection order.
  • Convolutional layers the third convolutional layer is the last convolutional layer indicated by the connection order, and the second convolutional layer is divided by the first convolutional layer and the last convolutional layer Any convolutional layer outside the layer.
  • one or more instructions in the computer storage medium are loaded by the processor 901 and the following steps are also executed:
  • the positive sample refers to an image that includes the tracking object
  • the negative sample refers to an image that does not include the tracking object.
  • An image of an object the positive sample includes a positive sample tracking label of the tracking object, the negative sample includes a negative sample tracking label of the tracking object, and the reference image includes the annotation information of the tracking object;
  • the updated first object recognition model performs recognition processing on the positive sample to obtain a positive sample recognition response, and calls the updated first object recognition model to perform recognition processing on the negative sample to obtain a negative sample recognition response Tracing the positive sample identification response to obtain a positive sample tracking response to the tracking target in the positive sample; and tracking the negative sample identification response to obtain the negative sample Tracking response to the negative sample of the tracking object in the tracking object; based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference between the negative sample tracking response and the negative sample tracking label Information, training the updated first object recognition model.
  • the processor 901 is based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference between the negative sample tracking response and the negative sample tracking label Information, when training the updated first object recognition model, perform the following operations:
  • Obtain a tracking loss optimization function determine the tracking based on the difference information between the positive sample tracking response and the positive sample tracking label, and the difference information between the negative sample tracking response and the negative sample tracking label
  • the value of the loss optimization function according to the principle of reducing the value of the tracking loss function, the updated first object recognition model is updated.
  • one or more instructions in the computer storage medium are loaded by the processor 901 and the following steps are also executed:
  • the tracking object is recognized to obtain a first recognition feature
  • the updated first object recognition model is called to recognize the predicted tracking object in the image to be processed to obtain a second recognition feature
  • the first identification feature and the second identification feature determine a target feature used for tracking processing, and use a tracking algorithm to track the target feature to obtain position information of the tracking object in the image to be processed.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

本申请实施例公开了一种模型训练方法、装置、终端及存储介质,方法包括:获取模板图像和测试图像;调用第一物体识别模型对模板图像中跟踪对象的特征处理得到第一参考响应,调用第二物体识别模型对模板图像中跟踪对象的特征处理得到第一参考响应;调用第一物体识别模型对测试图像中跟踪对象的特征处理得到第一测试响应,调用第二物体识别模型对测试图像中跟踪对象的特征处理得到第二测试响应;对第一测试响应进行跟踪处理得到在跟踪对象的跟踪响应;基于第一参考响应与第二参考响应之间差异信息、第一测试响应与第二测试响应之间差异信息和跟踪标签与跟踪响应之间差异信息更新第一物体识别模型。

Description

模型训练方法、装置、终端及存储介质
本申请要求于2019年5月13日提交国家知识产权局、申请号为201910397253.X,申请名称为“模型训练方法、装置、终端及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及互联网技术领域,具体涉及视觉目标跟踪领域,尤其涉及一种模型训练方法、一种模型训练装置、一种终端及一种存储介质。
背景技术
随着科技的发展,计算机视觉技术成为当前较为热门的研究领域,视觉目标跟踪是计算视觉领域中的一个重要研究方向。所谓视觉目标跟踪是指:在已知某图像中的跟踪对象的大小与位置的情况下,预测该跟踪对象在其他图像中的大小与位置。视觉目标跟踪通常应用于视频监控、人机交互以及无人驾驶等对实时性要求较高的应用场景中,例如:在给定某视频序列中的某帧图像中的跟踪对象的大小与位置的情况下,预测该视频序列的后续帧图像中的该跟踪对象的大小与位置。
发明内容
本申请实施例提供了一种模型训练方法、装置、终端及存储介质,可以更好的对第一物体识别模型进行训练,使得更新训练得到的第一物体识别模型具备较佳的视觉目标跟踪性能,使其更适用于视觉目标跟踪场景,提高视觉目标跟踪的准确性。
一方面,本申请实施例提供了一种模型训练方法,由计算设备执行,所述模型训练方法包括:
获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在所述测试图像中的标注位置;
调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;
调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;
对所述第一测试响应进行跟踪处理,得到在所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;
基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
另一方面,本申请实施例提供了一种模型训练装置,所述模型训练装置包括:
获取单元,用于获取训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在所述测试图像中的标注位置;
处理单元,用于调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;
所述处理单元,还用于调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;
所述处理单元,还用于对所述第一测试响应进行跟踪处理,得到在所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述 测试图像中的跟踪位置;
更新单元,用于基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
再一方面,本申请实施例提供了一种终端,所述终端包括输入设备和输出设备,所述终端还包括:
处理器,用于实现一条或多条指令;以及,
计算机存储介质,所述计算机存储介质存储有一条或多条指令,所述一条或多条指令用于由所述处理器加载并执行如下步骤:
获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在所述测试图像中的标注位置;
调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;
调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;
对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;
基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
再一方面,本申请实施例提供了一种计算机存储介质,所述计算机存储介质存储有一条或多条指令,所述一条或多条指令用于由处理器加载并执行如下步骤:
获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在所述测试图像中的标注位置;
调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;
调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;
对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;
基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
附图说明
为了更清楚地说明本申请实施例技术方案,下面将对实施例描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1a是本申请实施例提供的一种基于第一物体识别模型进行视觉目标跟踪的场景图;
图1b是本申请实施例提供的模型训练方法的实施环境示意图;
图2是本申请实施例提供的一种的模型训练方法的流程示意图;
图3a是本申请实施例提供的一种卷积神经网络的结构图;
图3b是本申请实施例提供的一种确定跟踪响应和跟踪标签的示意 图;
图4是本申请实施例提供的另一种的模型训练方法的流程示意图;
图5是本申请实施例提供的一种获取第一物体识别模型的示意图;
图6是本申请实施例提供的一种第一物体识别模型联合优化的示意图;
图7是本申请另一实施例提供的一种获取正样本和负样本的示意图;
图8是本申请实施例提供的一种模型训练装置的结构示意图;
图9是本申请实施例提供的一种终端的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
目前,视觉目标跟踪主要是采用传统的图像处理模型实现跟踪处理的,但发明人在实践中发现,传统的图像处理模型是为了实现图像分类任务而设计的,采用图像分类数据进行训练得到,然而,视觉目标跟踪并不是为了实现图像分类任务,因此传统的图像处理模型并不适合应用在视觉目标跟踪场景,导致视觉目标跟踪的准确性低。
本申请实施例提供了一种第一物体识别模型,所述第一物体识别模型是指具有图像识别性能的图像识别模型,例如超分辨率测试序列(Visual Geometry Group,VGG)模型、谷歌网络GoogleNet模型以及深度残差网络(Deep residual network,ResNet)模型等。所述第一物体识别模型可以准确地对图像进行特征提取并且其提取到的特征更适用于视觉目标跟踪场景,因此将所述第一物体识别模型结合相关跟踪算法应用在视觉目标跟踪场景中,可以提高视觉目标跟踪的准确性和实时性。
具体地,利用第一物体识别模型和跟踪算法实现视觉目标跟踪的步骤可包括:(1)获取待处理图像和包括跟踪对象的参考图像,所述跟踪对象为所述参考图像中的需要被跟踪的图像元素,例如参考图像中的人、动物等;所述参考图像中可包括跟踪对象的标注信息,所述标注信息用 于表示跟踪对象的大小和位置。在本申请一实施例中,所述标注信息可以标注框的形式表示,例如下文所述的图1中101所示;(2)根据参考图像中的标注信息确定待处理图像中包括的预测跟踪对象,此处所述的预测跟踪对象是指在待处理图像中可能为跟踪对象的图像元素。在本申请一实施例中,在(2)中可以根据参考图像中的标注框的大小在待处理图像中生成多个候选框,每个候选框代表一个预测跟踪对象,例如下文所述的图1中的A、B、C表示确定出的三个预测跟踪对象;(2)调用第一物体识别模型对所述参考图像中的跟踪对象进行识别处理,得到第一识别特征,所述第一识别特征是指跟踪对象的特征,例如跟踪对象的脸部轮廓特征、眼睛特征或者跟踪对象的姿态特征等等;(3)调用第一物体识别模型对所述待处理图像中包括的预测跟踪对象进行识别处理,得到第二识别特征,所述第二识别特征是指各个预测跟踪对象的特征,例如各个预测跟踪对象的脸部轮廓特征、眼睛特征、鼻子特征或者姿态特征等等;(4)基于所述第一识别特征和所述第二识别特征确定用于跟踪处理的目标特征,并采用跟踪算法对所述目标特征进行跟踪处理,得到所述跟踪对象在所述待处理图像中的位置。在一个实施例中,所述跟踪算法可以包括相关滤波器跟踪算法、基于双网络的跟踪算法、稀疏表示算法等,本申请实施例中以相关滤波器算法为例。所述相关滤波器算法对目标特征进行跟踪处理后,得到一个高斯形状的响应图,该响应图上峰值的位置即表示跟踪到的跟踪对象在所述待处理图像中的位置。
其中,所述根据第一识别特征和所述第二识别特征确定用于跟踪处理的目标特征可以理解为:通过对跟踪对象的特征和各个预测跟踪对象的特征的分析,确定出将各个预测跟踪对象中哪个预测跟踪对象作为待处理图像中包括的跟踪对象,以便于后续利用跟踪算法对该预测跟踪对象的特征进行处理,以得到跟踪对象在所述待处理图像中的位置,从而完成对跟踪对象的跟踪。在一个实施例中,步骤(4)的实施方式可以包括:将第一识别特征分别与各个第二识别特征进行匹配度评分,将匹配度评分最高的第二识别特征确定为目标特征。在其他实施例中,步骤 (4)的实施方式还可以包括:将各个第二识别特征进行融合处理,将融合处理的结果确定为目标特征。
例如,参考图1,为本申请实施例提供的一种视觉目标跟踪的场景,101表示参考图像,102为待处理图像,1011表示以标注框形式表示的跟踪对象的标注信息,标注框1101的大小表示参考图像中跟踪对象的大小,标注框1101的位置表示跟踪对象在参考图像中的位置,103表示第一物体识别模型。假设根据标注框1011在待处理图像102中生成A、B和C三个预测跟踪对象,然后调用第一物体识别模型103对1011进行识别处理,得到第一识别特征,并调用第一物体识别模型分别对预测跟踪对象A、B以及C进行识别处理,得到三个第二识别特征。进一步地,基于第一识别特征和三个第二识别特征确定目标特征,假设将预测跟踪对象C对应的第二识别特征确定为目标特征;再采用跟踪算法比如相关跟踪滤波器算法对目标特征进行跟踪处理,得到一个高斯形状的响应图,该响应图上峰值点表示跟踪对象在待处理图像中的位置如104所示。
基于上述的第一物体识别模型,本申请实施例还提出了一种模型训练方法,所述模型训练方法用于训练第一物体识别模型,以保证第一物体识别模型可以准确对图像进行特征提取并且提取到的特征更适用于跟踪场景。具体地,所述模型训练方法可以由终端等计算设备执行,具体地可由终端的处理器执行,所述终端可包括但不限于:智能终端、平板电脑、膝上计算机、台式电脑,等等。
图1b为本申请实施例提供的模型训练方法的实施环境示意图。其中,终端设备10与服务器设备20之间通过网络30通信连接,所述网络30可以是有线网络,也可以是无线网络。在终端设备10与服务器设备20上集成有本申请任一实施例提供的模型训练装置,用于实现本申请任一实施例提供的模型训练方法。
参见图2,本申请实施例提出的模型训练方法可包括以下步骤 S201-S205:
步骤S201、获取用于训练的模板图像和测试图像。
其中,所述模板图像和所述测试图像是用来对模型进行训练更新的图像,所述模板图像和所述测试图像中均包括跟踪对象,所述模板图像中还可以包括跟踪对象的标注信息,此处,所述跟踪对象的标注信息用于表示跟踪对象在模板图像中的大小和位置,所述标注信息可以是终端为模板图像标注的;所述测试图像中还包括测试图像对应的响应标签,所述响应标签用于表示跟踪对象在测试图像中的标注位置,所述标注位置可以指终端标注的、跟踪对象在测试图像中的真实位置;所述测试图像中也可以包括跟踪对象的标注信息,此处,所述跟踪对象的标注信息用于表示跟踪对象在测试图像中的大小和位置。
在一个实施例中,所述模板图像与测试图像可以是同一个视频序列中的两帧图像,例如,利用拍摄装置录制一段包括跟踪对象的视频序列,选择视频序列中任意一帧包括跟踪对象的图像作为模板图像,选择视频序列中除该模板图像之外,且包括跟踪对象的一帧图像作为测试图像。
在其他实施例中,所述模板图像与测试图像也可以不是同一个视频序列中的图像,例如,模板图像可以是通过拍摄装置对包括跟踪对象的第一拍摄场景进行拍摄得到的图像,测试图像可以在得到模板图像之前或者之后,利用拍摄装置对包括跟踪对象的第二拍摄场景进行拍摄得到的图像,也即,模板图像和测试图像是两张相互独立的图像。
由于同一视频序列的图像之间通常具备上下语义关系,相比于相互独立的模板图像及测试图像,更有利于对第一物体识别模型进行训练,且使得训练更新后的第一物体识别模型获得更佳的性能。因此,本申请实施例以模板图像和测试图像处于同一视频序列为例进行说明。
步骤S202、调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应。
步骤S203、调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应。
其中,所述第一物体识别模型和第二物体识别模型的相同点是:两者均为具有图像识别性能的图像识别模型。在本申请一实施例中,卷积神经网络模型由于其较强的特征提取性能成为目前常用的图像识别模型,本申请实施例中所述第一物体识别模型和第二物体识别可以为卷积神经网络模型,例如VGG模型、GoogleNet模型以及ResNet模型等。所述第一物体识别模型与所述第二物体识别模型的区别在于:所述第二物体识别模型是已更新的图像识别模型,或者说第二物体识别模型是预先训练并测试好的用于图像识别的模型,所述第一物体识别模型是待更新的图像识别模型。
所述卷积神经网络模型主要应用在图像识别、人脸识别以及文字识别等方向,卷积神经网络的网络结构可如图3a所示:主要包括卷积层301、池化层302和全连接层303。每个卷积层与一个池化层连接,所述卷积层301主要用于进行特征提取,所述池化层302也叫子采样层,主要用于缩减输入数据的规模,所述全连接层303根据卷积层提取到的特征来计算分类的分类值,最后输出分类及其对应的分类值。由此可知,所述第一物体识别模型和所述第二物体识别模型的网络结构也包括卷积层、池化层和全连接层。
每个卷积神经网络模型包括多个卷积层,每个卷积层负责提取图像的不同特征,前一个卷积层提取到的特征作为后一个卷积层的输入,每个卷积层负责提取的特征可以是根据特定函数设定的,或者是人为设定的。例如,对于图形类的图像识别时,可以设定第一卷积层负责提取图形的整体形状特征;第二卷积层负责提取图形的线条特征;第三卷积层负责提取图形的非连续性特征等。再如,对于包含人脸的图像识别时,可以设定第一卷积层负责提取人脸的轮廓特征;第二卷积层负责提取人 脸的五官特征。每个卷积层中包括多个相同尺寸的用于进行卷积计算的滤波器,每个滤波器对应一个滤波器通道,每个滤波器进行卷积计算后得到一组特征,因此,每个卷积层对输入图像进行识别处理后提取到多维特征。在卷积层中,卷积层的数量越多,卷积神经网络模型的网络结构越深,提取到的特征数量也就越多;每个卷积层中包括的滤波器数量越多,每个卷积层提取到特征维度越高。
应当理解,如果一个模型包括的卷积层较多,和/或每个卷积层中滤波器数量较多,则对该模型进行存储时需要较大的存储空间,将需要较多存储空间的模型称为重量级模型;相反地,如果一个模型包括的卷积层较少、和/或每个卷积层中滤波器数量较少,则对该模型进行存储时不需要较大的存储空间,将需要较少存储空间的模型称为轻量级模型。
在本申请一实施例中,第一物体识别模型与第二物体识别模型可以均为重量级模型,或者,第二物体识别模型为重量级模型,第一物体识别模型为第二物体识别模型进行模型压缩处理得到的轻量级模型。如果第一物体识别模型属于重量级模型,则更新后的第一物体识别模型能够提取到高维度的特征,具有更好的识别性能,将其应用在视觉目标跟踪场景中时,可提高跟踪的准确性。如果第一物体识别模型是通过对第二物体识别模型进行模型压缩处理得到的轻量级模型,则更新后的第一物体识别模型具有与第二物体识别模型相似的特征提取性能,由于其更少的存储空间使其能够有效的应用在移动设备以及其他低功耗产品中。另外,如果将其应用在视觉目标跟踪场景中时,可以快速的进行特征提取,实现视觉目标跟踪的实时性。在实际应用中,可以根据具体的场景需求,选择第一物体识别模型为重量级模型还是轻量级模型。
由图1的实施例描述可知,在视觉目标跟踪领域中,影响跟踪准确性的主要因素之一是第一物体识别模型提取到的特征是否准确,而第一物体识别模型的特征提取主要依赖于卷积层,所以本申请实施例中,所述对第一物体识别模型进行更新,实质上是训练第一物体识别模型的卷积层,以提高第一物体识别模型的特征提取性能。基于此,在步骤S202 中所述调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应实质是调用第一物体识别模型的卷积层对模板图像中跟踪对象的特征进行特征提取处理得到第一参考响应。
所述第一参考响应是用于表示第一物体识别模型识别到的模板图像中的所述跟踪对象的特征,比如大小、形状、轮廓等,所述第一参考响应可以用特征图表示;同理可知,所述第二参考响应是用于表示第二物体识别模型识别到的模板图像中的所述跟踪对象的特征;所述第一测试响应是用于表示第一物体识别模型识别到的测试图像中的额跟踪对象的特征;所述第二测试响应是用于表示第二物体识别模型识别到的测试图像中跟踪对象的特征。
在一个实施例中,由前述可知,模板图像中可包括跟踪对象的标注信息,所述标注信息的作用可以是:确定出模板图像中第一物体识别模型需要识别的跟踪对象的大小及其所在的位置,以便于第一物体识别模型可以准确的确定出需要对谁进行识别处理;模板图像中跟踪对象的标注信息可以是以标注框形式表示的。在本申请一实施例中,所述调用第一物体识别模型对模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应可以指调用第一物体识别模型并结合模板图像中的标注信息对模板图像进行识别处理。例如,假设模板图像中的标注信息是以标注框的形式表示的,所述调用第一物体识别模型对模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应可以指对模板图像中的标注框的特征进行识别处理。
在其他实施例中,如果模板图像中只包括跟踪对象,或者包括跟踪对象和对跟踪对象的识别处理无影响的背景,例如墙面、地面、天空等,此种情况下,终端无论是否为模板图像设置跟踪对象的标注信息,都能使得第一物体识别模型准确地确定出需要对谁进行识别处理。
在一个实施例中,所述调用第一物体识别模型对模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应的实施方式可以为:将 模板图像作为第一物体识别模型的输入,第一物体识别模型的第一卷积层利用一定尺寸的多个滤波器对模板图像进行卷积计算,提取到模板图像中的跟踪对象的第一特征;将第一特征作为第二卷积层的输入,第二卷积层利用多个滤波器对第一特征进行卷积计算,提取到模板图像中的跟踪对象第二特征;将第二特征输入到第三卷积层,第三卷积层利用多个滤波器对第二特征进行卷积计算,得到模板图像中的跟踪对象第四特征,依次类推,直到最后一个卷积层完成卷积计算后,输出的结果即为第一参考响应。对于调用第一物体识别对测试图像进行识别处理得到第一测试响应的实施方式、调用所述第二物体识别模型对所述测试图像进行识别处理得到第二参考响应的实施方式,以及调用所述第二物体识别模型对所述测试图像进行识别处理得到第二测试响应的实施方式可与上述描述的实施方式相同,在此不一一赘述。
步骤S204、对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应。
为了保证第一物体识别模型适用于视觉目标跟踪场景中,除了要确保第一物体识别模型具有较强特征提取性能外,还要保证第一物体识别模型提取到的特征更好地适用于跟踪场景,或者说更好地用于跟踪算法中。基于此,本申请实施例通过步骤S204实现对第一物体识别模型的跟踪训练。
在一个实施例中,所述步骤S204可包括:采用跟踪训练算法对所述第一测试响应进行跟踪处理,得到在所述跟踪对象的跟踪响应。其中,所述跟踪训练算法是用于对第一物体识别模型进行跟踪训练的算法,可以包括相关滤波器跟踪算法、基于双网络的跟踪算法、稀疏表示算法等。所述跟踪响应用于表示根据跟踪训练算法和第一测试响应确定出的跟踪对象在测试图像中的跟踪位置,实际上所述跟踪位置可以理解为根据跟踪训练算法和第一测试响应预测到的跟踪对象在测试图像中所处的位置。
在一个实施例中,如果跟踪训练算法为相关滤波器算法,所述采用 跟踪训练算法对第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应的方式可以为:采用跟踪训练算法对第一测试响应进行跟踪处理得到一个高斯形状的响应图,根据所述响应图确定跟踪响应。在本申请一实施例中,所述根据所述响应图确定跟踪响应的实施方式可以为:将所述响应图作为跟踪响应。这样,所述响应图能够反映跟踪对象在测试图像中的跟踪位置,具体地,可以将所述响应图中最大值点或者峰值点作为跟踪对象在测试图像中的跟踪位置。
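作为上述“将响应图中最大值点（峰值点）作为跟踪位置”的一个简单示意，下面给出一个 Python（NumPy）草图，其中响应图由人为构造的高斯函数代替，仅用于说明由响应图确定跟踪位置的方式：

```python
import numpy as np

def peak_position(response_map):
    # 响应图中的最大值点（峰值点）即为跟踪对象在测试图像中的跟踪位置
    idx = np.argmax(response_map)
    return np.unravel_index(idx, response_map.shape)

# 人为构造一个高斯形状的响应图作为示例，峰值位于 (40, 20)
h, w = 64, 64
ys, xs = np.mgrid[0:h, 0:w]
response = np.exp(-((ys - 40) ** 2 + (xs - 20) ** 2) / (2 * 5.0 ** 2))
print(peak_position(response))   # -> (40, 20)
```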
在步骤S201中，所述跟踪标签用于表示跟踪对象在测试图像中的标注位置，所述标注位置可以指终端预先标注的、跟踪对象在测试图像中真实的位置。在一个实施例中，所述跟踪标签也可以为一个高斯形状的响应图，该响应图上的峰值点表示跟踪对象在测试图像中真实的位置。
例如，参考图3b所示为本申请实施例提供的一种确定跟踪标签和跟踪响应的示意图，假设304表示测试图像，3041表示跟踪对象，终端预先为测试图像标注的跟踪标签可以如图3b中306所示，306上的峰值点3061表示跟踪对象在测试图像中的标注位置。调用第一物体识别模型对304进行识别处理得到第一测试响应；再采用跟踪训练算法例如相关滤波器算法对第一测试响应进行跟踪处理得到跟踪响应如305所示，305上的峰值点3051表示跟踪对象在测试图像中的跟踪位置。
在其他实施例中,如果采用其他跟踪训练算法对第一测试响应进行跟踪处理时,可以根据具体的跟踪训练算法的特征确定跟踪响应。
步骤S205、基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
由前述可知,所述第一参考响应是用于表示第一物体识别模型识别到的模板图像中的所述跟踪对象的特征,比如大小、形状、轮廓等,所述第二参考响应是用于表示第二物体识别模型识别到的模板图像中的所述跟踪对象的特征;由此可知,所述第一参考响应与所述第二参考响应之间的差异信息可以包括第一物体识别模型和第二物体识别模型对 模板图像进行特征提取时,提取到的特征之间的差异大小。
在一个实施例中,所述特征之间的差异大小可以通过特征之间的距离表示,例如假设第一参考响应包括第一物体识别模型识别到的模板图像中跟踪对象的脸部轮廓,表示为脸部轮廓1,以及第二参考响应包括第二物体识别模型识别到的模板图像中跟踪对象的脸部轮廓,表示为脸部轮廓2;所述第一参考响应与所述第二参考响应之间的差异信息可以包括脸部轮廓1与脸部轮廓2之间的距离。在其他实施例中,所述特征之间的差异大小还可以通过特征之间的相似度值来表示,相似度值越大表示特征之间的差异越小,相似度值越小表示特征之间的差异越大。
同理可知,所述第一测试响应与所述第二测试响应之间的差异信息可以包括第一物体识别模型和第二物体识别模型对测试图像进行特征提取时,提取到的特征之间的差异大小。由步骤S204中描述可知,所述跟踪标签与所述跟踪响应之间的差异信息反映了跟踪对象在测试图像中的跟踪位置和标注位置之间的距离。
在具体实施过程中,可以根据基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定第一物体识别模型对应的损失优化函数的值,然后按照减小所述损失优化函数的值的原则,对所述第一物体识别模型进行更新。此处的更新是指:更新第一物体识别模型中的各模型参数。其中,第一物体识别模型的模型参数可包括但不限于:梯度参数、权重参数等等。
本申请实施例中利用第二物体识别模型对第一物体识别模型进行训练过程中,首先分别调用第一物体识别模型和第二物体识别模型对模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应和第二参考响应,再调用第一物体识别模型和第二物体识别模型对测试图像中的所述跟踪对象的特征进行识别处理得到第一测试响应和第二测试响应;进一步地,对第一测试响应进行跟踪处理,得到跟踪对象的跟踪响应;进而,便可以根据第一参考响应与第二参考响应之间的差异信息、 第一测试响应与第二测试响应之间的差异信息,确定第一物体识别模型相比于第二物体识别模型在特征提取性能上的损失;以及根据跟踪标签与跟踪响应之间的差异信息,确定第一物体识别模型在跟踪性能上的损失。基于第一物体识别模型在特征提取性能上的损失以及在跟踪性能上的损失更新第一物体识别模型,可以使得更新后的第一物体识别模型具有与第二物体识别模型相同或较相近的特征提取性能,并且提取到的特征更适用于视觉目标跟踪场景中,从而可提高视觉目标跟踪的准确性。
请参见图4,是本申请实施例提供的另一种模型训练方法的流程示意图。该模型训练方法可以由终端等计算设备执行;此处的终端可包括但不限于:智能终端、平板电脑、膝上计算机、台式电脑,等等。请参见图4,该模型训练方法可包括以下步骤S401-S408:
步骤S401,获取第二物体识别模型,并对所述第二物体识别模型进行裁剪,得到第一物体识别模型。
在本申请一实施例中,所述第二物体识别模型为已训练完成的用于图像识别的重量级模型,所述第一物体识别模型为待训练的用于图像识别的轻量级模型。由前述可知,通过对第二物体识别模型进行模型压缩处理得到轻量级的第一物体识别模型,再将轻量级的第一物体识别模型应用在视觉目标跟踪领域时可以实现实时的视觉目标跟踪。所述模型压缩是指对已训练好的重量级模型进行时间和空间上的压缩,以除去重量级模型中包括的一些不重要的滤波器或者参数,提升特征提取速度。在本申请实施例中,所述模型压缩可以包括模型裁剪和模型训练,所述模型裁剪是指可以通过裁剪模型中包括的滤波器数量和特征通道数的方式减轻第二物体识别模型的网络结构,以得到第一物体识别模型;所述模型训练是指基于迁移学习技术,采用第二物体识别模型和用于训练的模板图像和测试图像对裁剪得到的第一物体识别模型进行更新训练,以使得第一物体识别模型具有与第二物体识别模型相同或相似的特征识别性能。
所述迁移学习技术是指将一个模型的性能迁移到另一个模型上，本申请实施例中迁移学习是指调用第二物体识别模型对模板图像中的所述跟踪对象的特征进行识别处理得到第二参考响应，将所述第二参考响应作为监督标签训练第一物体识别模型对模板图像中的所述跟踪对象的特征的识别，再调用第二物体识别模型对测试图像中的所述跟踪对象的特征进行识别处理得到第二测试响应，将所述第二测试响应作为监督标签训练第一物体识别模型对测试图像中的所述跟踪对象的特征的识别。老师-学生模型是一种典型的基于迁移学习技术进行模型压缩的方法，在本申请实施例中，第二物体识别模型相当于老师模型，第一物体识别模型相当于学生模型。
在一个实施例中,在对所述第二物体识别模型裁剪得到第一物体识别模型过程中,裁剪可以指将第二物体识别模型中每个卷积层中包括的滤波器个数减去一定数量,和/或将每个卷积层对应的特征通道数也减去相应数量。例如,将第二物体识别模型的每个卷积层中滤波器个数和特征通道数减去五分之三,或者减去八分之七或者任意数量;经过实践证明,将第二物体识别模型中每个卷积层中包括的滤波器个数和每个卷积层对应的特征通道数减去八分之七,能够通过训练更新得到较好的第一物体识别模型。例如,参考图5,为本申请实施例提供的一种对第二物体识别模型进行裁剪得到第一物体识别模型的示意图,应当理解,通过上述方法对第二物体识别模型进行裁剪处理只涉及到卷积层,所以为方便描述,图5中只示出第一物体识别模型和第二物体识别模型的卷积层。假设第二物体识别模型为VGG-8模型,由前述可知第一物体识别模型也为VGG-8模型。VGG-8模型中包括5个卷积层,501表示的第二物体识别模型的卷积层,502表示第一物体识别模型的卷积层,503表示第二物体识别模型的每个卷积层中包括的滤波器个数、特征通道数、滤波器的尺寸。基于上述描述,对第二物体识别模型中每个卷积层包括的滤波器个数、特征通道数均减去八分之七,得到第一物体识别模型的各个卷积层中滤波器个数、特征通道数以及滤波器的尺寸,如504所示。
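下面用一个示意性的 Python（PyTorch）草图说明“将每个卷积层的滤波器个数与特征通道数裁剪为原来的八分之一（即减去八分之七）”这一裁剪方式；其中第二物体识别模型各层的滤波器个数为假设值，并非图5中的具体配置：

```python
import torch.nn as nn

def pruned_channels(teacher_channels, keep_ratio=1 / 8):
    # 将每个卷积层的滤波器个数/特征通道数裁剪为原来的 1/8（减去八分之七）
    return [max(1, int(c * keep_ratio)) for c in teacher_channels]

teacher_channels = [64, 128, 256, 256, 256]        # 假设的第二物体识别模型各卷积层滤波器个数
student_channels = pruned_channels(teacher_channels)
print(student_channels)                            # [8, 16, 32, 32, 32]

def build_backbone(channels, in_channels=3):
    # 按裁剪后的通道数搭建轻量级第一物体识别模型的卷积层部分
    layers = []
    for c in channels:
        layers += [nn.Conv2d(in_channels, c, 3, padding=1), nn.ReLU(inplace=True)]
        in_channels = c
    return nn.Sequential(*layers)

student = build_backbone(student_channels)
```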
步骤S402,获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示跟踪对象在测试图像中的标注位置。
步骤S403,调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用所述第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应。
步骤S404,调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应。
步骤S405,对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应。
在一个实施例中，步骤S405的实施方式可包括采用跟踪训练算法对第一测试响应进行跟踪处理，得到所述跟踪对象的跟踪响应。所述跟踪训练算法中可包括跟踪算法参数，所述采用跟踪训练算法对所述第一测试响应进行跟踪处理，得到在所述测试图像中对所述跟踪对象的跟踪响应的实施方式可以是：将第一测试响应代入已知跟踪算法参数的跟踪训练算法中进行计算，根据计算得到的结果确定跟踪响应。本申请实施例中所述跟踪训练算法中的跟踪算法参数是根据第二物体识别模型和模板图像对跟踪训练算法进行训练得到的。下面以跟踪训练算法为相关滤波器算法为例，介绍利用第二物体识别模型和模板图像对跟踪训练算法进行训练，得到相关滤波器跟踪算法的跟踪算法参数的过程。所述相关滤波器跟踪算法的跟踪算法参数是指相关滤波器的滤波器参数，对相关滤波器算法的训练过程可包括步骤S11-S13：
步骤S11,根据模板图像生成训练样本,并获取训练样本对应的跟踪标签;
在一个实施例中,模板图像中包括跟踪对象以及跟踪对象对应的跟 踪标签,根据模板图像生成的训练样本中也包括跟踪对象。其中,所述模板图像中包括的跟踪对象对应的跟踪标签可以指跟踪对象在模板图像中的真实位置,所述模板图像中包括跟踪对象的跟踪标签可以是终端预先标注的。在本申请一实施例中,根据模板图像生成训练样本的方式可以为:在模板图像中裁剪出包括跟踪对象的图像块,对图像块进行循环移位处理得到训练样本,训练样本对应的跟踪标签根据模板图像中包括的跟踪标签和循环移位操作的程度决定。
对模板图像进行循环移位处理的方式可以为:将模板图像的图像块进行像素化处理,确定用于表示跟踪对象的像素点,这些像素点组成了跟踪对象的像素矩阵,对于像素矩阵中每行进行循环移位处理,得到多个新的像素矩阵。在上述循环移位过程中,每个像素点的值没有改变,只是像素点位置发生改变,像素点的值不变所以通过循环移位后的矩阵还用于表示跟踪对象,像素点的位置发生改变,新的像素点矩阵渲染出来的跟踪对象的位置发生了变化。
上述对像素矩阵的每行进行循环移位处理，可以包括：像素矩阵的每行可以表示为一个 n×1 的向量，向量中每个向量元素对应一个像素点；将 n×1 向量中的每个像素点依次向右或者向左移动，每移动一次得到一组新的向量。
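下面给出按上述循环移位方式由一个图像块生成多个训练样本的示意性 Python（NumPy）草图；其中图像块尺寸为假设值，np.roll 用于实现“像素值不变、仅位置循环移动”的操作：

```python
import numpy as np

def cyclic_shift_samples(patch):
    # patch: 从模板图像中裁剪出的、包含跟踪对象的图像块 (H, W)
    h, w = patch.shape[:2]
    samples = []
    for dy in range(h):
        for dx in range(w):
            # 循环移位：像素点的值不变，只有像素点的位置发生改变
            samples.append(np.roll(patch, shift=(dy, dx), axis=(0, 1)))
    return samples

patch = np.arange(16).reshape(4, 4)
samples = cyclic_shift_samples(patch)
print(len(samples))      # 16 个循环移位样本（含原图像块本身）
```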
步骤S12,调用第二物体识别模型对训练样本进行特征提取处理,得到训练样本中跟踪对象的特征;
调用第二物体识别模型对多个训练样本进行特征提取处理实质是调用第二物体识别模型的卷积层对训练样本进行特征提取的过程。第二物体识别模型包括多个卷积层，每个卷积层中包括多个用于卷积计算的滤波器，所以每个卷积层提取到的特征是多维的，经每个卷积层提取到的多维特征依次作为下一个卷积层的输入，直到得到最后一个卷积层的输出。例如，第二物体识别模型包括5个卷积层，通过5个卷积层对训练样本进行特征提取处理后，得到的训练样本的特征的维度为 $D$，假设 $x^{i}$ 表示第二物体识别模型提取到的第 $i$ 维特征，则第二物体识别模型最终提取到的训练样本的特征可表示为 $x=\{x^{1},x^{2},\ldots,x^{D}\}$。
步骤S13,获取用于确定相关滤波器参数的岭回归方程,并对所述岭回归方程进行求解,得到相关滤波器参数。
相关滤波器算法的工作原理是:提取包括跟踪对象的图像的特征;将提取到的特征与相关滤波器进行卷积计算,得到响应图,从所述响应图中确定出图像中跟踪对象的位置。卷积计算时,要求两个相同大小的量之间才能进行卷积运算,因此要保证相关滤波器的维度和训练样本的特征的维度相同。相关滤波器算法对应的岭回归方程可如公式(1)所示:
$$\min_{w}\ \Big\| \sum_{i=1}^{D} w^{i} \star x^{i} - y \Big\|^{2} + \lambda \sum_{i=1}^{D} \big\| w^{i} \big\|^{2} \qquad (1)$$

其中，★表示卷积运算，$D$ 表示第二物体识别模型提取到的训练样本的特征维度，$w^{i}$ 表示相关滤波器的第 $i$ 维滤波器参数，$x$ 表示训练样本，$y$ 表示训练样本 $x$ 的跟踪标签，$x^{i}$ 表示训练样本 $x$ 的第 $i$ 维特征，λ表示正则化系数。
通过最小化式(1)的岭回归方程,便可得到相关滤波器的各个维度的滤波器参数。具体地,最小化式(1),并将式(1)在频域进行求解,得到相关滤波器的各个维度的滤波器参数。以求解第d维度的滤波器参数为例,介绍在频域求解滤波器参数的公式。在频域求解第d维度的滤波器参数的公式表示为(2):
$$\hat{w}^{d} = \frac{\hat{y} \odot (\hat{x}^{d})^{*}}{\sum_{i=1}^{D} \hat{x}^{i} \odot (\hat{x}^{i})^{*} + \lambda} \qquad (2)$$

在公式(2)中，$w^{d}$ 表示相关滤波器第 $d$ 维度的滤波器参数，⊙表示点乘运算，$\hat{\cdot}$（即 $\mathcal{F}(\cdot)$）表示离散傅里叶变换，$(\cdot)^{*}$ 表示复共轭运算。依据公式(2)可以计算得到各个维度的相关滤波器的滤波器参数，各个维度的滤波器参数组成相关滤波器算法的滤波器参数。
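下面给出按公式(1)(2)在频域求解各维度滤波器参数的一个示意性 Python（NumPy）草图；其中特征维度 D、响应图尺寸与高斯标签参数均为假设值，特征也以随机张量代替第二物体识别模型的真实输出：

```python
import numpy as np

def train_correlation_filter(x, y, lam=1e-2):
    # x: 训练样本的多维特征 (D, H, W)；y: 高斯形状的跟踪标签 (H, W)
    X = np.fft.fft2(x, axes=(-2, -1))                # 各维特征的离散傅里叶变换
    Y = np.fft.fft2(y)
    denom = np.sum(X * np.conj(X), axis=0) + lam     # 分母：各维能量之和加正则项 λ
    return (Y * np.conj(X)) / denom                  # 公式(2)：频域中的各维滤波器参数 ŵ^d

D, H, W = 32, 64, 64
feat = np.random.randn(D, H, W)                      # 以随机特征代替第二物体识别模型的输出
ys, xs = np.mgrid[0:H, 0:W]
label = np.exp(-((ys - H // 2) ** 2 + (xs - W // 2) ** 2) / (2 * 4.0 ** 2))
W_hat = train_correlation_filter(feat, label)        # 形状 (D, H, W) 的频域滤波器参数
```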
通过步骤S11-S13对相关滤波器算法训练得到相关滤波器的滤波器参数后,可以基于相关滤波器算法对第一测试响应进行跟踪处理,得到所述测试图像中对所述跟踪对象的跟踪响应。具体地,采用相关滤波器算法对第一测试响应进行跟踪处理,得到在所述测试图像中对所述跟踪 对象的跟踪响应可通过公式(3)表示,
$$r = \mathcal{F}^{-1}\Big( \sum_{d=1}^{D} (\hat{w}^{d})^{*} \odot \hat{\varphi}^{d}(z) \Big) \qquad (3)$$

在公式(3)中，$w$ 表示相关滤波器的滤波器参数，$\varphi(z)$ 表示第一测试响应，$\mathcal{F}^{-1}$ 表示反离散傅里叶变换，$r$ 表示跟踪响应。
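与公式(3)对应，下面给出由第一测试响应和已训练好的滤波器参数计算跟踪响应的示意性 Python（NumPy）草图；滤波器参数此处以随机张量占位，仅示意计算流程：

```python
import numpy as np

D, H, W = 32, 64, 64
W_hat = np.fft.fft2(np.random.randn(D, H, W), axes=(-2, -1))   # 假设已按公式(2)训练好的滤波器参数
test_feat = np.random.randn(D, H, W)                           # 第一测试响应（各维特征）

# 公式(3)：各维度频域点乘求和后做反离散傅里叶变换，得到跟踪响应 r
Z = np.fft.fft2(test_feat, axes=(-2, -1))
r = np.fft.ifft2(np.sum(np.conj(W_hat) * Z, axis=0)).real
row, col = np.unravel_index(np.argmax(r), r.shape)             # 峰值点即预测的跟踪位置
```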
步骤S406,获取所述第一物体识别模型对应的损失优化函数。
为了保证第一物体识别模型和第二物体识别模型有相同或者相近的特征提取性能,同时保证第一物体识别模型提取的特征更适用于视觉跟踪场景,本申请实施例提出了对第一物体识别模型进行特征识别损失和跟踪损失的联合优化。对第一物体识别模型进行联合优化时,第一物体识别模型对应的损失优化函数可表示为公式(4):
$$\mathcal{L} = \mathcal{L}_{recog} + \lambda\,\mathcal{L}_{track} + \Upsilon\,\|\Theta\|^{2} \qquad (4)$$

在公式(4)中，$\mathcal{L}_{recog}$ 表示特征识别损失，$\mathcal{L}_{track}$ 表示跟踪损失，λ表示特征识别损失和跟踪损失对第一物体识别模型的优化重要性的参数，其取值可以在0-1范围内，λ越大表示跟踪损失对第一物体识别模型的损失优化影响越大，Θ表示第一物体识别模型的网络参数，Υ表示正则化系数，正则项 $\Upsilon\|\Theta\|^{2}$ 用于防止第一物体识别模型过拟合。
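下面给出公式(4)所示联合损失及“按减小损失优化函数的值更新模型参数”的一个示意性 Python（PyTorch）草图；其中模型、第二物体识别模型的输出、跟踪响应与跟踪标签均以极简张量或随机张量代替，λ、Υ 的取值也仅为示例：

```python
import torch
import torch.nn as nn

student = nn.Conv2d(3, 8, 3, padding=1)                  # 以单个卷积层示意第一物体识别模型
optimizer = torch.optim.SGD(student.parameters(), lr=1e-3)

def joint_loss(first_ref, second_ref, first_test, second_test, r, g,
               params, lam=0.5, gamma=1e-4):
    # 公式(4)：特征识别损失 + λ·跟踪损失 + Υ‖Θ‖²
    recog = ((first_ref - second_ref) ** 2).sum() + ((first_test - second_test) ** 2).sum()
    track = ((r - g) ** 2).sum()
    reg = gamma * sum(p.pow(2).sum() for p in params)
    return recog + lam * track + reg

x, z = torch.randn(1, 3, 32, 32), torch.randn(1, 3, 32, 32)   # 模板图像、测试图像（示意）
first_ref, first_test = student(x), student(z)
second_ref, second_test = torch.randn_like(first_ref), torch.randn_like(first_test)  # 以随机张量代替第二物体识别模型的输出
r, g = torch.randn(1, 32, 32), torch.randn(1, 32, 32)         # 跟踪响应与跟踪标签（示意）

loss = joint_loss(first_ref, second_ref, first_test, second_test, r, g, student.parameters())
optimizer.zero_grad(); loss.backward(); optimizer.step()      # 按减小损失优化函数的值的原则更新模型参数
```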
步骤S407,基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值。
通过步骤S406可知,第一物体识别模型的损失优化函数包括特征识别损失函数和跟踪损失函数,在步骤S407中确定损失优化函数的值时,可以首先确定特征识别损失函数的值和跟踪损失函数的值,再根据特征识别损失函数的值和跟踪损失函数的值确定优化损失函数的值。
具体地,所述基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值,包括:获取所述特征识别损失函数,并基于所述第一参考响应与所 述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息,确定所述特征识别损失函数的值;获取所述跟踪损失函数,并基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值;基于所述特征识别损失函数的值和所述跟踪损失函数的值确定损失优化函数的值。
关于特征识别损失函数的值：由前述可知，所述第一参考响应用于表示第一物体识别模型识别到的模板图像中的所述跟踪对象的特征，所述第二参考响应用于表示第二物体识别模型识别到的模板图像中的所述跟踪对象的特征，所述第一参考响应与所述第二参考响应之间的差异信息反映了第一物体识别模型和第二物体识别模型对模板图像中的所述跟踪对象的特征进行特征提取时，提取到的特征之间的差异大小，所述差异大小可以用距离来表示，也即第一参考响应与所述第二参考响应之间的差异信息包括第一参考响应与所述第二参考响应之间的距离；
同理，所述第一测试响应与所述第二测试响应之间的差异信息包括第一测试响应与所述第二测试响应之间的距离。特征识别损失函数是通过约束上述的特征之间的距离，以使得第一物体识别模型和第二物体识别模型有相同或相近的特征提取性能。由此可知，特征识别损失函数包括两部分损失，一部分为关于模板图像的特征识别损失，一部分是关于测试图像的特征识别损失。
关于模板图像的特征识别损失的损失值由第一参考响应与所述第二参考响应之间的距离确定，关于测试图像的特征识别损失的损失值由第一测试响应与所述第二测试响应之间的距离确定，将关于模板图像的特征识别损失的损失值和关于测试图像的特征识别损失的损失值代入到特征识别损失函数中，便可计算得到特征识别损失函数的值。例如，特征识别损失函数可表示为公式(5)：
$$\mathcal{L}_{recog} = \big\| \varphi(x) - \psi(x) \big\|^{2} + \big\| \varphi(z) - \psi(z) \big\|^{2} \qquad (5)$$

其中，$\mathcal{L}_{recog}$ 表示特征识别损失函数，$\|\varphi(x)-\psi(x)\|^{2}$ 表示关于模板图像的特征识别损失，$\|\varphi(z)-\psi(z)\|^{2}$ 表示关于测试图像的特征识别损失，$\varphi(x)$ 表示第一参考响应，ψ(x)表示第二参考响应，$\varphi(z)$ 表示第一测试响应，ψ(z)表示第二测试响应。
关于跟踪损失函数的值：跟踪标签与跟踪响应之间的差异反映了跟踪响应与跟踪标签之间的欧式距离，通过约束两者之间的欧式距离，优化第一物体识别模型的跟踪性能。将跟踪响应与跟踪标签之间的欧式距离代入到跟踪损失函数，便可求得跟踪损失函数的值。例如，跟踪损失函数可表示为公式(6)和(7)：

$$\mathcal{L}_{track} = \big\| r - g \big\|^{2} \qquad (6)$$

$$r = \mathcal{F}^{-1}\Big( \sum_{d=1}^{D} (\hat{w}^{d})^{*} \odot \hat{\varphi}^{d}(z) \Big) \qquad (7)$$

其中，$\mathcal{L}_{track}$ 表示跟踪损失函数，$r$ 表示跟踪响应，$g$ 表示跟踪标签，$r$ 可以通过公式(7)得到，公式(7)中 $w$ 表示跟踪训练算法的滤波器参数，可以通过前述S11-S13的步骤得到。
应当理解,由前述可知,第一物体识别模型包括多个卷积层,第一测试响应是将第一物体识别模型的各个卷积层对测试图像进行识别处理得到的各个子测试响应进行融合处理后得到的。例如,假设第一物体识别模型包括第一卷积层、第二卷积层和第三卷积层,所述第一测试响应是由第一卷积层对应的第一测试子响应、所述第二卷积层对应的第二测试子响应以及所述第三卷积层对应的第三测试子响应融合得到的。为了保证第一物体识别模型提取到的特征更适用于视觉目标跟踪场景中,可以对第一物体识别模型进行多尺度的跟踪损失优化。
在本申请一实施例中，多尺度的跟踪损失优化是指：计算第一物体识别模型的多个卷积层的跟踪损失值，再根据多个卷积层的跟踪损失值确定第一物体识别模型的跟踪损失函数的值。具体地，假设第一物体识别模型包括第一卷积层、第二卷积层和第三卷积层，所述基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值，包括：基于所述第一卷积层对应的第一跟踪标签与对所述第一测试子响应进行跟踪处理得到的第一跟踪响应之间的差异信息，确定所述第一卷积层的跟踪损失值；基于所述第二卷积层对应的第二跟踪标签与对所述第二测试子响应进行跟踪处理得到的第二跟踪响应之间的差异信息，确定所述第二卷积层的跟踪损失值；基于所述第三卷积层对应的第三跟踪标签与对所述第三测试子响应进行跟踪处理得到的第三跟踪响应之间的差异信息，确定所述第三卷积层的跟踪损失值；将所述第一卷积层对应的跟踪损失值、所述第二卷积层对应的跟踪损失值以及所述第三卷积层对应的跟踪损失值进行多尺度融合处理，得到跟踪损失函数的值。
其中，第一跟踪子响应、第二跟踪子响应以及第三跟踪子响应可以是采用跟踪训练算法分别对第一卷积层、第二卷积层以及第三卷积层对应的第一测试子响应、第二测试子响应以及第三测试子响应进行跟踪处理得到的。由于不同卷积层提取到的特征不相同，所以第一跟踪子响应、第二跟踪子响应以及第三跟踪子响应具有不同的分辨率。其中，跟踪训练算法对不同卷积层的测试子响应进行跟踪处理时所使用的跟踪算法参数不相同，在某个卷积层下的跟踪算法参数是通过第二物体识别模型和相应卷积层对应的模板图像进行训练得到的，具体的训练过程可参考步骤S11-S13，在此不再赘述。
应当理解，第一物体识别模型中包括的多个卷积层是按照连接顺序连接在一起的，上述提及到的第一卷积层、第二卷积层以及第三卷积层可以是第一物体识别模型的卷积层中任意三个卷积层，或者所述第一卷积层为所述连接顺序所指示的第一个卷积层，所述第三卷积层为所述连接顺序所指示的最后一个卷积层，所述第二卷积层为除所述第一个卷积层和所述最后一个卷积层外的任意一个卷积层，此时第一卷积层可以称为第一物体识别模型的高层卷积层、第二卷积层为第一物体识别模型的中层卷积层，所述第三卷积层为第一物体识别模型的低层卷积层。经实践证明，对于只有5个卷积层的第一物体识别模型，选用所述连接顺序所指示的第一个卷积层、所述连接顺序所指示的最后一个卷积层以及第二卷积层进行多尺度跟踪损失优化，能够使得第一物体识别模型提取到的特征更好地适用于跟踪场景中。
在多尺度跟踪损失优化的情况下,上述公式(6)可改写成公式(8)和(9):
$$\mathcal{L}_{track} = \sum_{l} \big\| r_{l} - g_{l} \big\|^{2} \qquad (8)$$

$$r_{l} = \mathcal{F}^{-1}\Big( \sum_{d} (\hat{w}_{l}^{d})^{*} \odot \hat{\varphi}_{l}^{d}(z) \Big) \qquad (9)$$

其中，$l$ 表示第一物体识别模型的第 $l$ 个卷积层，$r_{l}$ 表示跟踪算法对第 $l$ 个卷积层的第 $l$ 测试子响应进行跟踪处理得到的第 $l$ 跟踪子响应，$g_{l}$ 表示第 $l$ 个卷积层对应的测试图像中包括的跟踪对象的跟踪标签。其中，跟踪算法对第 $l$ 卷积层的第 $l$ 测试子响应进行跟踪处理得到第 $l$ 跟踪子响应时，用到的第 $l$ 卷积层对应的跟踪算法参数（即公式(9)中的 $w_{l}$）是通过第二物体识别模型和第 $l$ 卷积层对应的模板图像训练得到的，具体的训练过程可参考步骤S11-S13部分的描述，在此不再赘述。
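下面给出公式(8)(9)所示多尺度跟踪损失的一个示意性 Python（NumPy）草图；各层的子响应分辨率、通道数以及滤波器参数均为假设的随机数据，仅示意“对多个卷积层分别计算跟踪损失后再求和”的过程：

```python
import numpy as np

def tracking_loss_per_layer(W_hat_l, feat_l, g_l):
    # 公式(9)：第 l 卷积层的测试子响应经相应滤波器得到第 l 跟踪子响应 r_l
    Z = np.fft.fft2(feat_l, axes=(-2, -1))
    r_l = np.fft.ifft2(np.sum(np.conj(W_hat_l) * Z, axis=0)).real
    return np.sum((r_l - g_l) ** 2)                  # 公式(8)中的单层项 ‖r_l − g_l‖²

def multi_scale_tracking_loss(W_hats, feats, labels):
    # 对选定的多个卷积层的跟踪损失求和（多尺度融合）
    return sum(tracking_loss_per_layer(W, f, g) for W, f, g in zip(W_hats, feats, labels))

layers = []
for D, H, W in [(8, 64, 64), (16, 32, 32), (32, 16, 16)]:    # 不同卷积层的子响应分辨率不同
    W_hat = np.fft.fft2(np.random.randn(D, H, W), axes=(-2, -1))
    feat = np.random.randn(D, H, W)
    ys, xs = np.mgrid[0:H, 0:W]
    g = np.exp(-((ys - H // 2) ** 2 + (xs - W // 2) ** 2) / (2 * 3.0 ** 2))
    layers.append((W_hat, feat, g))
loss = multi_scale_tracking_loss(*zip(*layers))
```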
参考图6，为本申请实施例提供的一种对第一物体识别模型进行联合优化的示意图，图中示出的特征识别损失优化如公式(5)所示，多尺度跟踪损失优化如公式(8)所示；图6中601表示第一物体识别模型，602表示第二物体识别模型。
步骤S408,按照减小所述损失优化函数的值的原则,对所述第一物体识别模型进行更新。
通过步骤S406-S407确定了第一物体识别模型的特征识别损失函数的值和跟踪损失函数的值后,两者代入公式(4),计算得到损失优化函数的值,按照减小损失优化函数的值的原则,更新第一物体识别模型。换句话说,不断减小损失优化函数的值,根据损失优化函数的值可反推出特征识别损失函数的值和跟踪损失函数的值,再通过调整第一物体识别模型的模型参数以使第一参考响应与第二参考响应之间的距离,以及第一测试响应与第二测试响应之间的距离满足特征识别损失函数的值;同时,调整第一物体识别模型的模型参数以使得跟踪响应与跟踪标签之间的欧式距离满足跟踪损失函数的值。
重复执行上述步骤S401-S408可更新得到一个既具有良好特征识别 性能又使得提取到的特征更适用于视觉目标跟踪场景中的第一物体识别模型。经实践证明,采用本申请实施例提供的模型训练方法,通过结合对第二物体识别模型进行模型压缩和知识迁移处理,得到的第一物体识别模型的容量仅有第二物体识别模型的几十分之一,并且第一物体识别模型拥有与第二物体识别模型相近的特征提取性能、更好的跟踪性能,实现了视觉跟踪的实时性。
由于前述步骤S401-S408对第一物体识别模型进行更新时使用的模板图像和测试图像均为包括跟踪对象的图像,如此可保证更新后的第一物体识别模型能够具有较好的对跟踪对象进行特征提取的能力。但是在实际应用中,调用第一物体识别模型进行识别处理的待处理图像中可能除了包括有跟踪对象外,还包括其他背景,因此,为了进一步提高第一物体识别模型的能力,本申请实施例通过步骤S401-S408对第一物体识别模型进行更新后,还利用正样本和负样本对第一物体识别模型进行更新处理,以使得第一物体识别模型具有更好的特征辨别能力,也即能够更好的区分出图像中包括的跟踪对象和背景。
具体地,利用正样本和负样本对第一物体识别模型进行更新处理,可包括:获取包括跟踪对象的参考图像,并基于所述参考图像确定用于训练的正样本和负样本,所述参考图像可以为待使用第一物体识别模型实现跟踪的视频序列中的第一帧图像,所述正样本是指包括所述跟踪对象的图像,所述负样本是指不包括所述跟踪对象的图像,所述正样本包括所述跟踪对象的正样本跟踪标签,所述负样本包括所述跟踪对象的负样本跟踪标签;调用所述已更新的第一物体识别模型对所述正样本进行识别处理,得到正样本识别响应,并调用所述已更新的第一物体识别模型对所述负样本进行识别处理,得到负样本识别响应;对所述正样本识别响应进行跟踪处理,得到在所述正样本中对所述跟踪对象的正样本跟踪响应;并对所述负样本识别响应进行跟踪处理,得到所述在所述负样本中对所述跟踪对象的负样本跟踪响应;基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负 样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型。
在本申请一实施例中,基于参考图像获取正样本和负样本的方式可以为:通过对参考图像进行随机裁剪,得到多个图像块,将包含跟踪对象的图像块作为正样本,将不包括跟踪对象的图像块作为负样本。其中,正样本对应的正样本跟踪标签即为跟踪对象在正样本中的真实位置,负样本由于不包含跟踪对象,其对应的负样本跟踪标签为0。例如,图7所示为获取正样本和负样本的示意图,图7中701为参考图像,对参考图像进行随机的裁剪,得到多个图像块,如701中包括的多个标注框,每个标注框代表一个图像块;假设跟踪对象为702,从701的多个图像块中选择包括702的图像块作为正样本,如图中的703和704,选择不包括702的图像块为负样本,如图中的705和706。703和704对应的正样本跟踪标签为跟踪对象在703和704中的真实位置,如703和704下方图中的圆点所示。由于负样本705和706中不包括跟踪对象,因此其对应的跟踪标签为0,所以不出现圆点。
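下面给出“对参考图像随机裁剪得到图像块，并按是否包含跟踪对象划分正、负样本”的一个示意性 Python（NumPy）草图；其中以图像块与标注框是否有重叠近似判断是否包含跟踪对象，图像尺寸、块尺寸与标注框均为假设值：

```python
import numpy as np

def sample_patches(image, target_box, num=8, size=64, seed=0):
    # target_box: 跟踪对象在参考图像中的标注框 (x, y, w, h)
    rng = np.random.default_rng(seed)
    H, W = image.shape[:2]
    tx, ty, tw, th = target_box
    positives, negatives = [], []
    for _ in range(num):
        x = int(rng.integers(0, W - size))
        y = int(rng.integers(0, H - size))
        patch = image[y:y + size, x:x + size]
        # 与标注框有重叠的图像块视为正样本；否则视为负样本（其负样本跟踪标签为 0）
        overlap = not (x + size <= tx or tx + tw <= x or y + size <= ty or ty + th <= y)
        (positives if overlap else negatives).append(patch)
    return positives, negatives

reference = np.random.rand(256, 256, 3)                 # 参考图像（如视频序列的第一帧，示意）
pos, neg = sample_patches(reference, target_box=(100, 120, 40, 40))
```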
在一个实施例中,所述基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型,包括:获取跟踪损失优化函数;基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,确定所述跟踪损失优化函数的值;按照减小所述跟踪损失优化函数的值的原则,对所述已更新的第一物体识别模型进行训练。
正样本跟踪响应与正样本跟踪标签之间的差异信息包括第一物体识别模型对正样本进行跟踪处理后，得到的跟踪对象的位置与跟踪对象在该正样本中的真实位置之间的欧氏距离。同样地，负样本跟踪响应与负样本跟踪标签之间的差异信息包括第一物体识别模型对负样本进行跟踪处理后，得到的跟踪对象的位置与该负样本对应的负样本跟踪标签之间的欧式距离。将上述两者代入到跟踪损失优化函数中，得到跟踪损失优化函数的值，然后按照减小跟踪损失优化函数的值的原则，再次更新已更新的第一物体识别模型。通过重复执行跟踪损失优化的步骤，完成对已更新的第一物体识别模型的更新。
在一个实施例中,基于步骤S407中关于多尺度跟踪优化的描述,此处基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,更新所述已更新的第一物体识别模型时,也可以是采用多尺度优化。
由前述可知，第一物体识别模型包括第一卷积层、第二卷积层和第三卷积层，所述正样本跟踪标签包括第一卷积层对应的第一正样本跟踪标签、第二卷积层对应的第二正样本跟踪标签以及第三卷积层对应的第三正样本跟踪标签；正样本识别响应是由第一卷积层对应的正样本第一子识别响应、第二卷积层对应的正样本第二子识别响应以及第三卷积层对应的正样本第三子识别响应融合得到的；所述负样本识别响应是由第一卷积层对应的负样本第一子识别响应、第二卷积层对应的负样本第二子识别响应以及第三卷积层对应的负样本第三子识别响应融合得到的。
所述正样本跟踪响应可以包括采用跟踪训练算法对正样本第一子识别响应进行跟踪处理得到的第一正样本跟踪响应、对正样本第二子识别响应进行跟踪处理得到的第二正样本跟踪响应以及对正样本第三子识别响应进行跟踪处理得到的第三正样本跟踪响应。所述负样本跟踪响应可以包括采用跟踪训练算法对负样本第一子识别响应进行跟踪处理得到的第一负样本跟踪响应、对负样本第二子识别响应进行跟踪处理得到的第二负样本跟踪响应，以及对负样本第三子识别响应进行跟踪处理得到的第三负样本跟踪响应。
所述多尺度跟踪损失优化的实施方式可以为：基于第一正样本跟踪响应与第一正样本跟踪标签之间的差异信息、以及第一负样本跟踪响应与第一负样本跟踪标签之间的差异信息，确定第一卷积层的跟踪损失优化函数的值；基于第二正样本跟踪响应与第二正样本跟踪标签之间的差异信息、以及第二负样本跟踪响应与第二负样本跟踪标签之间的差异信息，确定第二卷积层的跟踪损失优化函数的值；以及基于第三正样本跟踪响应与第三正样本跟踪标签之间的差异信息、以及第三负样本跟踪响应与第三负样本跟踪标签之间的差异信息，确定第三卷积层的跟踪损失优化函数的值；最后根据第一卷积层的跟踪损失优化函数的值、第二卷积层的跟踪损失优化函数的值以及第三卷积层的跟踪损失优化函数的值，确定跟踪损失优化函数的值。假设多尺度跟踪损失优化的跟踪损失优化函数可以表示为公式(10)所示：

$$\mathcal{L}_{track} = \sum_{l} \Big( \big\| r_{l}^{+} - g_{l} \big\|^{2} + \big\| r_{l}^{-} \big\|^{2} \Big) \qquad (10)$$

其中，$r_{l}^{+}$ 表示跟踪训练算法对第 $l$ 卷积层对应的正样本第 $l$ 子识别响应进行处理得到的第 $l$ 正样本跟踪响应，$g_{l}$ 表示第 $l$ 卷积层下正样本对应的正样本跟踪标签，$r_{l}^{-}$ 表示跟踪训练算法对第 $l$ 卷积层对应的负样本第 $l$ 子识别响应进行处理得到的第 $l$ 负样本跟踪响应（由于负样本跟踪标签为0，其对应项直接约束 $r_{l}^{-}$ 的响应幅值），$r_{l}^{\pm}$ 可按 $\mathcal{F}^{-1}\big( \sum_{d} (\hat{w}_{l}^{d})^{*} \odot \hat{\varphi}_{l}^{d}(\cdot) \big)$ 计算，$w_{l}$ 表示第 $l$ 卷积层对应的跟踪算法参数。
由前述可知,不同卷积层对应的跟踪算法参数由第二物体识别模型和相应的卷积层下对应的正样本训练得到的,不同卷积层下对应的正样本是具有相同尺寸不同分辨率的图像,对于具体的训练过程可参考上述步骤S11-S13,在此不再赘述。
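与公式(10)对应，下面给出“正样本项约束跟踪响应逼近其跟踪标签、负样本项（标签为0）直接约束其响应幅值”的一个示意性 Python（NumPy）草图，跟踪响应以随机张量代替：

```python
import numpy as np

def pos_neg_tracking_loss(r_pos, g_pos, r_neg):
    # 正样本项 ‖r⁺ − g‖²；负样本跟踪标签为 0，故负样本项为 ‖r⁻‖²
    return np.sum((r_pos - g_pos) ** 2) + np.sum(r_neg ** 2)

H, W = 64, 64
ys, xs = np.mgrid[0:H, 0:W]
g_pos = np.exp(-((ys - 32) ** 2 + (xs - 32) ** 2) / (2 * 3.0 ** 2))
loss = pos_neg_tracking_loss(np.random.rand(H, W), g_pos, np.random.rand(H, W))
```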
通过利用参考图像对第一物体识别模型进行再次更新后,可以将第一物体识别模型和某些跟踪算法相结合应用在场景分析、监控设备以及人机交互等需要进行视觉目标跟踪的场景中。具体地,将第一物体识别模型和某些跟踪算法相结合应用在视觉目标跟踪场景中的实施方式可以包括:获取待处理图像,并根据参考图像中跟踪对象的标注信息确定所述待处理图像中包括的预测跟踪对象,所述待处理图像可以是待使用第一物体识别模型进行视觉目标跟踪的视频序列中除第一帧以外的图像;调用已更新的第一物体识别模型对所述参考图像中的所述跟踪对象进行识别处理,得到第一识别特征;调用所述已更新的第一物体识别模型对所述待处理图像中的预测跟踪对象进行识别处理,得到第二识别特 征;基于所述第一识别特征和所述第二识别特征确定用于跟踪处理的目标特征,并采用跟踪算法对所述目标特征进行跟踪处理,得到所述跟踪对象在所述待处理图像中的位置信息。对于此部分具体的应用可参考图1部分相应的描述,在此不再赘述。
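下面给出“调用已更新的第一物体识别模型提取参考图像与待处理图像的识别特征，再结合跟踪算法确定跟踪对象位置”的一个极简示意性 Python（PyTorch）草图；其中以互相关代替具体的跟踪算法（实际可采用相关滤波等算法），模型与图像均为随机占位数据：

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

student = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU())   # 已更新的第一物体识别模型（示意）

def infer_position(reference_img, current_img):
    with torch.no_grad():
        f_ref = student(reference_img)   # 参考图像中跟踪对象的第一识别特征
        f_cur = student(current_img)     # 待处理图像中预测跟踪对象的第二识别特征
        # 以互相关作为简化的跟踪处理：响应图峰值即为跟踪对象在待处理图像中的位置信息
        score = F.conv2d(f_cur, f_ref)
        idx = int(torch.argmax(score))
    return np.unravel_index(idx, tuple(score.shape))

pos = infer_position(torch.randn(1, 3, 64, 64), torch.randn(1, 3, 128, 128))
```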
本申请实施例采用重量级的第二物体识别模型对轻量级的第一物体识别模型训练时,分别调用第一物体识别模型和第二物体识别模型对用于训练的模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应和第二参考响应,再调用第一物体识别模型和第二物体识别模型对用于训练的测试图像中的所述跟踪对象的特征进行识别处理得到第一测试响应和第二测试响应;然后对第一测试响应进行跟踪处理得到跟踪响应;最后根据第一参考响应与第二参考响应之间的差异信息、第一测试响应与第二测试响应之间的差异信息,确定第一物体识别模型相比于第二物体识别模型在特征提取性能上的损失,以及根据跟踪标签与跟踪响应之间的差异信息,确定第一物体识别模型在跟踪性能上的损失,进而再根据特征提取性能上的损失和跟踪性能上的损失联合对第一物体识别模型进行损失优化,使得更新后的轻量级第一物体识别模型具有与第二物体识别模型相同或较相近的特征提取性能,更快的特征提取速度,并且保证第一物体识别模型提取到的特征更适用于视觉目标跟踪场景中,从而提高了视觉目标跟踪的准确性和实时性。
基于上述模型训练方法实施例的描述,本申请实施例还公开了一种模型训练装置,该模型训练装置可以执行图2和图4所示的方法。请参见图8,所述模型训练装置可运行如下单元:
获取单元801，用于获取训练的模板图像和测试图像，所述模板图像和所述测试图像均包括跟踪对象，所述测试图像包括所述跟踪对象的跟踪标签，所述跟踪标签用于表示所述跟踪对象在测试图像中的标注位置；
处理单元802,用于调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体 识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;
所述处理单元802,还用于调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;
所述处理单元802,还用于对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;
更新单元803,用于基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
在一个实施例中，所述获取单元801还用于：获取第二物体识别模型；所述处理单元802还用于：对所述第二物体识别模型进行裁剪，得到第一物体识别模型。
在一个实施例中,所述更新单元803在基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型时,执行如下操作:获取所述第一物体识别模型对应的损失优化函数;基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值;按照减小所述损失优化函数的值的原则,对所述第一物体识别模型进行更新。
在一个实施例中,所述损失优化函数包括特征识别损失函数和跟踪损失函数,所述更新单元803在基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差 异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值时,执行如下操作:获取所述特征识别损失函数,并基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息,确定所述特征识别损失函数的值;获取所述跟踪损失函数,并基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值;基于所述特征识别损失函数的值和所述跟踪损失函数的值确定损失优化函数的值。
在一个实施例中,所述第一物体识别模型包括第一卷积层、第二卷积层和第三卷积层,所述第一测试响应是由所述第一卷积层对应的第一测试子响应、所述第二卷积层对应的第二测试子响应以及所述第三卷积层对应的第三测试子响应融合得到的;所述更新单元803在基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值时,执行如下操作:
基于所述第一卷积层对应的第一跟踪标签与对所述第一测试子响应进行跟踪处理得到的第一跟踪响应之间的差异信息,确定所述第一卷积层的跟踪损失值;基于所述第二卷积层对应的第二跟踪标签与对所述第二测试子响应进行跟踪处理得到的第二跟踪响应之间的差异信息,确定所述第二卷积层的跟踪损失值;基于所述第三卷积层对应的第三跟踪标签与对所述第三测试子响应进行跟踪处理得到的第三跟踪响应之间的差异信息,确定所述第三卷积层的跟踪损失值;将所述第一卷积层对应的跟踪损失值、所述第二卷积层对应的跟踪损失值以及所述第三卷积层对应的跟踪损失值进行融合处理,得到跟踪损失函数的值;其中,所述第一跟踪响应、所述第二跟踪响应以及所述第三跟踪响应具有不同分辨率。
在一个实施例中,所述第一物体识别模型包括多个卷积层,所述多个卷积层按照连接顺序相连接,所述第一卷积层为所述连接顺序所指示的第一个卷积层,所述第三卷积层为所述连接顺序所指示的最后一个卷积层,所述第二卷积层为除所述第一个卷积层和所述最后一个卷积层外 的任意一个卷积层。
在一个实施例中,所述获取单元801,还用于获取包括跟踪对象的参考图像,并基于所述参考图像确定用于训练的正样本和负样本,所述正样本是指包括所述跟踪对象的图像,所述负样本是指不包括所述跟踪对象的图像,所述正样本包括所述跟踪对象的正样本跟踪标签,所述负样本包括所述跟踪对象的负样本跟踪标签,所述参考图像中包括所述跟踪对象的标注信息;
所述处理单元802,还用于调用所述已更新的第一物体识别模型对所述正样本进行识别处理,得到正样本识别响应,并调用所述已更新的第一物体识别模型对所述负样本进行识别处理,得到负样本识别响应;
所述处理单元802,还用于对所述正样本识别响应进行跟踪处理,得到在所述正样本中对所述跟踪对象的正样本跟踪响应;并对所述负样本识别响应进行跟踪处理,得到所述在所述负样本中对所述跟踪对象的负样本跟踪响应;
所述更新单元803,还用于基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型。
在一个实施例中,所述更新单元803在基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型时,执行如下步骤:
获取跟踪损失优化函数;基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,确定所述跟踪损失优化函数的值;按照减小所述跟踪损失函数的值的原则,对所述已更新的第一物体识别模型进行更新。
在一个实施例中，所述获取单元801，还用于获取待处理图像；所述处理单元802，还用于根据所述参考图像中的所述跟踪对象的标注信息确定所述待处理图像中包括的预测跟踪对象；所述处理单元802，还用于调用已更新的第一物体识别模型对所述参考图像中的所述跟踪对象进行识别处理，得到第一识别特征；所述处理单元802，还用于调用所述已更新的第一物体识别模型对所述待处理图像中的所述预测跟踪对象进行识别处理，得到第二识别特征；所述处理单元802，还用于基于所述第一识别特征和所述第二识别特征确定用于跟踪处理的目标特征，并采用跟踪算法对所述目标特征进行跟踪处理，得到所述跟踪对象在所述待处理图像中的位置信息。
根据本申请的一个实施例,图2或图4所示的方法所涉及的各个步骤均可以是由图8所示的模型训练装置中的各个单元来执行的。例如,图2所示的步骤S201可由图8中所示的获取单元801来执行,步骤S202-S204可由图8中所示的处理单元802来执行,步骤S205可由图8所示的更新单元803来执行;又如,图4中所示的步骤S401、S402以及S406可由图8中所示的获取单元801来执行,步骤S403-S405、以及S407可由图8中处理单元802来执行,步骤S408可由图8中所示的更新单元803来执行。
根据本申请的另一个实施例,图8所示的模型训练装置中的各个单元可以分别或全部合并为一个或若干个另外的单元来构成,或者其中的某个(些)单元还可以再拆分为功能上更小的多个单元来构成,这可以实现同样的操作,而不影响本申请的实施例的技术效果的实现。上述单元是基于逻辑功能划分的,在实际应用中,一个单元的功能也可以由多个单元来实现,或者多个单元的功能由一个单元实现。在本申请的其它实施例中,基于模型训练装置也可以包括其它单元,在实际应用中,这些功能也可以由其它单元协助实现,并且可以由多个单元协作实现。
根据本申请的另一个实施例,可以通过在包括中央处理单元(CPU)、随机存取存储介质(RAM)、只读存储介质(ROM)等处理元件和存储元件的例如计算机的通用计算设备上运行能够执行如图2或图4中所示的相应方法所涉及的各步骤的计算机程序(包括程序代码),来构造如图8中所示的模型训练装置设备,以及来实现本申请实施例的模型训练 方法。所述计算机程序可以记载于例如计算机可读记录介质上,并通过计算机可读记录介质装载于上述计算设备中,并在其中运行。
本申请实施例中利用第二物体识别模型对第一物体识别模型进行训练过程中，首先分别调用第一物体识别模型和第二物体识别模型对模板图像中的所述跟踪对象的特征进行识别处理得到第一参考响应和第二参考响应，再调用第一物体识别模型和第二物体识别模型对测试图像中的所述跟踪对象的特征进行识别处理得到第一测试响应和第二测试响应；进一步地，对第一测试响应进行跟踪处理，得到跟踪对象的跟踪响应；进而，便可以根据第一参考响应与第二参考响应之间的差异信息、第一测试响应与第二测试响应之间的差异信息，确定第一物体识别模型相比于第二物体识别模型在特征提取性能上的损失；以及根据跟踪标签与跟踪响应之间的差异信息，确定第一物体识别模型在跟踪性能上的损失。基于第一物体识别模型在特征提取性能上的损失以及在跟踪性能上的损失更新第一物体识别模型，可以使得更新后的第一物体识别模型具有与第二物体识别模型相同或较相近的特征提取性能，并且提取到的特征更适用于视觉目标跟踪场景中，从而可提高视觉目标跟踪的准确性。
基于上述方法实施例以及装置实施例的描述,本申请实施例还提供一种计算设备,例如图9所示的终端。请参见图9,该终端至少包括处理器901、输入设备902、输出设备903以及计算机存储介质904。所述输入设备902中还可包括摄像组件,摄像组件可用于获取模板图像和/或测试图像,所述拍摄组件还可以用于获取参考图像和/或待处理图像,所述摄像组件可以是终端出厂时配置在终端上的组件,也可以是与终端相连接的外部组件。在本申请一实施例中,该终端还可与其他设备相连接,以接收其他设备发送的模板图像和/或测试图像,或者接受其他设备发送的参考图像和/或待处理图像。
计算机存储介质904可以存储在终端的存储器中,所述计算机存储 介质904用于存储计算机程序,所述计算机程序包括程序指令,所述处理器901用于执行所述计算机存储介质904存储的程序指令。处理器901或称CPU(Central Processing Unit,中央处理器))是终端的计算核心以及控制核心,其适于实现一条或多条指令,具体适于加载并执行一条或多条指令从而实现相应方法流程或相应功能;在一个实施例中,本申请实施例所述的处理器901可以用于执行:获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在测试图像中的标注位置;调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用所述第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
本申请实施例还提供了一种计算机存储介质(Memory),所述计算机存储介质是终端中的记忆设备,用于存放程序和数据。可以理解的是,此处的计算机存储介质既可以包括终端中的内置存储介质,当然也可以包括终端所支持的扩展存储介质。计算机存储介质提供存储空间,该存储空间存储了终端的操作***。并且,在该存储空间中还存放了适于被处理器901加载并执行的一条或多条指令,这些指令可以是一个或多个计算机程序(包括程序代码)。需要说明的是,此处的计算机存储介质可以是高速RAM存储器,也可以是非不稳定的存储器(non-volatile memory),例如至少一个磁盘存储器;在本申请一实施例中还可以是至 少一个位于远离前述处理器的计算机存储介质。
在一个实施例中,可由处理器901加载并执行计算机存储介质中存放的一条或多条指令,以实现上述有关模型训练实施例中的方法的相应步骤;具体实现中,计算机存储介质中的一条或多条指令由处理器901加载并执行如下步骤:
获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在测试图像中的标注位置;调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
在一个实施例中,计算机存储介质中的一条或多条指令由处理器901加载还执行如下步骤:获取第二物体识别模型;对所述第二物体识别模型进行裁剪,得到第一物体识别模型。
在一个实施例中,所述处理器901在基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型时,执行如下操作:
获取所述第一物体识别模型对应的损失优化函数;基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述 第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值;按照减小所述损失优化函数的值的原则,对所述第一物体识别模型进行更新。
在一个实施例中,所述损失优化函数包括特征识别损失函数和跟踪损失函数,所述处理器901在基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值时,执行如下操作:
获取所述特征识别损失函数,并基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息,确定所述特征识别损失函数的值;获取所述跟踪损失函数,并基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值;基于所述特征识别损失函数的值和所述跟踪损失函数的值确定损失优化函数的值。
在一个实施例中,所述第一物体识别模型包括第一卷积层、第二卷积层和第三卷积层,所述第一测试响应是由所述第一卷积层对应的第一测试子响应、所述第二卷积层对应的第二测试子响应以及所述第三卷积层对应的第三测试子响应融合得到的;所述处理器901在基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值时,执行如下操作:
基于所述第一卷积层对应的第一跟踪标签与对所述第一测试子响应进行跟踪处理得到的第一跟踪响应之间的差异信息,确定所述第一卷积层的跟踪损失值;
基于所述第二卷积层对应的第二跟踪标签与对所述第二测试子响应进行跟踪处理得到的第二跟踪响应之间的差异信息,确定所述第二卷积层的跟踪损失值;基于所述第三卷积层对应的第三跟踪标签与对所述第三测试子响应进行跟踪处理得到的第三跟踪响应之间的差异信息,确定所述第三卷积层的跟踪损失值;将所述第一卷积层对应的跟踪损失 值、所述第二卷积层对应的跟踪损失值以及所述第三卷积层对应的跟踪损失值进行融合处理,得到跟踪损失函数的值;其中,所述第一跟踪响应、所述第二跟踪响应以及所述第三跟踪响应具有不同分辨率。
在一个实施例中,所述第一物体识别模型包括多个卷积层,所述多个卷积层按照连接顺序相连接,所述第一卷积层为所述连接顺序所指示的第一个卷积层,所述第三卷积层为所述连接顺序所指示的最后一个卷积层,所述第二卷积层为除所述第一个卷积层和所述最后一个卷积层外的任意一个卷积层。
在一个实施例中,计算机存储介质中的一条或多条指令由处理器901加载还执行如下步骤:
获取包括跟踪对象的参考图像,并基于所述参考图像确定用于训练的正样本和负样本,所述正样本是指包括所述跟踪对象的图像,所述负样本是指不包括所述跟踪对象的图像,所述正样本包括所述跟踪对象的正样本跟踪标签,所述负样本包括所述跟踪对象的负样本跟踪标签,所述参考图像中包括所述跟踪对象的标注信息;调用所述已更新的第一物体识别模型对所述正样本进行识别处理,得到正样本识别响应,并调用所述已更新的第一物体识别模型对所述负样本进行识别处理,得到负样本识别响应;对所述正样本识别响应进行跟踪处理,得到在所述正样本中对所述跟踪对象的正样本跟踪响应;并对所述负样本识别响应进行跟踪处理,得到所述在所述负样本中对所述跟踪对象的负样本跟踪响应;基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型。
在一个实施例中,所述处理器901在基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型时,执行如下操作:
获取跟踪损失优化函数;基于所述正样本跟踪响应与所述正样本跟 踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,确定所述跟踪损失优化函数的值;按照减小所述跟踪损失函数的值的原则,对所述已更新的第一物体识别模型进行更新。
在一个实施例中,计算机存储介质中的一条或多条指令由处理器901加载还执行如下步骤:
获取待处理图像,并根据所述参考图像中的所述跟踪对象的标注信息确定所述待处理图像中包括的预测跟踪对象;调用已更新的第一物体识别模型对所述参考图像中的所述跟踪对象进行识别处理,得到第一识别特征;调用所述已更新的第一物体识别模型对所述待处理图像中的所述预测跟踪对象进行识别处理,得到第二识别特征;基于所述第一识别特征和所述第二识别特征确定用于跟踪处理的目标特征,并采用跟踪算法对所述目标特征进行跟踪处理,得到所述跟踪对象在所述待处理图像中的位置信息。
以上所揭露的仅为本申请示例性实施例而已,当然不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (12)

  1. 一种模型训练方法,由计算设备执行,包括:
    获取用于训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在所述测试图像中的标注位置;
    调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;
    调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;
    对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;
    基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
  2. 如权利要求1所述的方法,还包括:
    获取第二物体识别模型;
    对所述第二物体识别模型进行裁剪,得到第一物体识别模型。
  3. 如权利要求1所述的方法,所述基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型,包括:
    获取所述第一物体识别模型对应的损失优化函数;
    基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值;
    按照减小所述损失优化函数的值的原则,对所述第一物体识别模型进行更新。
  4. 如权利要求3所述的方法,所述损失优化函数包括特征识别损失函数和跟踪损失函数,所述基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,确定所述损失优化函数的值,包括:
    获取所述特征识别损失函数,并基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息,确定所述特征识别损失函数的值;
    获取所述跟踪损失函数,并基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值;
    基于所述特征识别损失函数的值和所述跟踪损失函数的值确定损失优化函数的值。
  5. 如权利要求4所述的方法,所述第一物体识别模型包括第一卷积层、第二卷积层和第三卷积层,所述第一测试响应是由所述第一卷积层对应的第一测试子响应、所述第二卷积层对应的第二测试子响应以及所述第三卷积层对应的第三测试子响应融合得到的;所述基于所述跟踪标签与所述跟踪响应之间的差异信息确定所述跟踪损失函数的值,包括:
    基于所述第一卷积层对应的第一跟踪标签与对所述第一测试子响应进行跟踪处理得到的第一跟踪响应之间的差异信息,确定所述第一卷积层的跟踪损失值;
    基于所述第二卷积层对应的第二跟踪标签与对所述第二测试子响应进行跟踪处理得到的第二跟踪响应之间的差异信息,确定所述第二卷积层的跟踪损失值;
    基于所述第三卷积层对应的第三跟踪标签与对所述第三测试子响应进行跟踪处理得到的第三跟踪响应之间的差异信息,确定所述第三卷积层的跟踪损失值;
    将所述第一卷积层对应的跟踪损失值、所述第二卷积层对应的跟踪损失值以及所述第三卷积层对应的跟踪损失值进行融合处理,得到跟踪损失函数的值;
    其中,所述第一跟踪响应、所述第二跟踪响应以及所述第三跟踪响应具有不同分辨率。
  6. 权利要求5所述的方法,所述第一物体识别模型包括多个卷积层,所述多个卷积层按照连接顺序相连接,所述第一卷积层为所述连接顺序所指示的第一个卷积层,所述第三卷积层为所述连接顺序所指示的最后一个卷积层,所述第二卷积层为除所述第一个卷积层和所述最后一个卷积层外的任意一个卷积层。
  7. 如权利要求1所述的方法,还包括:
    获取包括跟踪对象的参考图像,并基于所述参考图像确定用于训练的正样本和负样本,所述正样本是指包括所述跟踪对象的图像,所述负样本是指不包括所述跟踪对象的图像,所述正样本包括所述跟踪对象的正样本跟踪标签,所述负样本包括所述跟踪对象的负样本跟踪标签,所述参考图像中包括所述跟踪对象的标注信息;
    调用所述已更新的第一物体识别模型对所述正样本进行识别处理,得到正样本识别响应,并调用所述已更新的第一物体识别模型对所述负样本进行识别处理,得到负样本识别响应;
    对所述正样本识别响应进行跟踪处理,得到在所述正样本中对所述跟踪对象的正样本跟踪响应;并对所述负样本识别响应进行跟踪处理,得到所述在所述负样本中对所述跟踪对象的负样本跟踪响应;
    基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型。
  8. 如权利要求7所述的方法,所述基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,训练所述已更新的第一物体识别模型, 包括:
    获取跟踪损失优化函数;
    基于所述正样本跟踪响应与所述正样本跟踪标签之间的差异信息,以及所述负样本跟踪响应与所述负样本跟踪标签之间的差异信息,确定所述跟踪损失优化函数的值;
    按照减小所述跟踪损失优化函数的值的原则,对所述已更新的第一物体识别模型进行更新。
  9. 如权利要求7或8所述的方法,还包括:
    获取待处理图像,并根据所述参考图像中的所述跟踪对象的标注信息确定所述待处理图像中包括的预测跟踪对象;
    调用已更新的第一物体识别模型对所述参考图像中的所述跟踪对象进行识别处理,得到第一识别特征;
    调用所述已更新的第一物体识别模型对所述待处理图像中的所述预测跟踪对象进行识别处理,得到第二识别特征;
    基于所述第一识别特征和所述第二识别特征确定用于跟踪处理的目标特征,并采用跟踪算法对所述目标特征进行跟踪处理,得到所述跟踪对象在所述待处理图像中的位置信息。
  10. 一种模型训练装置,包括:
    获取单元,用于获取训练的模板图像和测试图像,所述模板图像和所述测试图像均包括跟踪对象,所述测试图像包括所述跟踪对象的跟踪标签,所述跟踪标签用于表示所述跟踪对象在所述测试图像中的标注位置;
    处理单元,用于调用第一物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第一参考响应,并调用第二物体识别模型对所述模板图像中的所述跟踪对象的特征进行识别处理,得到第二参考响应;调用所述第一物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第一测试响应,并调用所述第 二物体识别模型对所述测试图像中的所述跟踪对象的特征进行识别处理,得到第二测试响应;对所述第一测试响应进行跟踪处理,得到所述跟踪对象的跟踪响应,所述跟踪响应用于表示所述跟踪对象在所述测试图像中的跟踪位置;
    更新单元,用于基于所述第一参考响应与所述第二参考响应之间的差异信息、所述第一测试响应与所述第二测试响应之间的差异信息以及所述跟踪标签与所述跟踪响应之间的差异信息,更新所述第一物体识别模型。
  11. 一种终端,包括输入设备和输出设备,还包括:
    处理器,用于实现一条或多条指令;以及,
    计算机存储介质,所述计算机存储介质存储有一条或多条指令,所述一条或多条指令用于由所述处理器加载并执行如权利要求1-9中的任一项所述的模型训练方法。
  12. 一种计算机存储介质,所述计算机存储介质中存储有计算机程序指令,所述计算机程序指令被处理器执行时,用于执行如权利要求1-9中的任一项所述的模型训练方法。
PCT/CN2020/083523 2019-05-13 2020-04-07 模型训练方法、装置、终端及存储介质 WO2020228446A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP20805250.6A EP3971772B1 (en) 2019-05-13 2020-04-07 Model training method and apparatus, and terminal and storage medium
KR1020217025275A KR102591961B1 (ko) 2019-05-13 2020-04-07 모델 트레이닝 방법 및 장치, 및 이를 위한 단말 및 저장 매체
JP2021536356A JP7273157B2 (ja) 2019-05-13 2020-04-07 モデル訓練方法、装置、端末及びプログラム
US17/369,833 US11704817B2 (en) 2019-05-13 2021-07-07 Method, apparatus, terminal, and storage medium for training model

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910397253.XA CN110147836B (zh) 2019-05-13 2019-05-13 模型训练方法、装置、终端及存储介质
CN201910397253.X 2019-05-13

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/369,833 Continuation US11704817B2 (en) 2019-05-13 2021-07-07 Method, apparatus, terminal, and storage medium for training model

Publications (1)

Publication Number Publication Date
WO2020228446A1 true WO2020228446A1 (zh) 2020-11-19

Family

ID=67595324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/083523 WO2020228446A1 (zh) 2019-05-13 2020-04-07 模型训练方法、装置、终端及存储介质

Country Status (6)

Country Link
US (1) US11704817B2 (zh)
EP (1) EP3971772B1 (zh)
JP (1) JP7273157B2 (zh)
KR (1) KR102591961B1 (zh)
CN (1) CN110147836B (zh)
WO (1) WO2020228446A1 (zh)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20180027887A (ko) * 2016-09-07 2018-03-15 삼성전자주식회사 뉴럴 네트워크에 기초한 인식 장치 및 뉴럴 네트워크의 트레이닝 방법
CN110147836B (zh) * 2019-05-13 2021-07-02 腾讯科技(深圳)有限公司 模型训练方法、装置、终端及存储介质
KR20210061839A (ko) * 2019-11-20 2021-05-28 삼성전자주식회사 전자 장치 및 그 제어 방법
CN111401192B (zh) * 2020-03-10 2023-07-18 深圳市腾讯计算机***有限公司 基于人工智能的模型训练方法和相关装置
JP7297705B2 (ja) * 2020-03-18 2023-06-26 株式会社東芝 処理装置、処理方法、学習装置およびプログラム
US11599742B2 (en) * 2020-04-22 2023-03-07 Dell Products L.P. Dynamic image recognition and training using data center resources and data
CN111738436B (zh) * 2020-06-28 2023-07-18 电子科技大学中山学院 一种模型蒸馏方法、装置、电子设备及存储介质
CN111767711B (zh) * 2020-09-02 2020-12-08 之江实验室 基于知识蒸馏的预训练语言模型的压缩方法及平台
CN113515999A (zh) * 2021-01-13 2021-10-19 腾讯科技(深圳)有限公司 图像处理模型的训练方法、装置、设备及可读存储介质
CN114245206B (zh) * 2022-02-23 2022-07-15 阿里巴巴达摩院(杭州)科技有限公司 视频处理方法及装置
CN114359563B (zh) * 2022-03-21 2022-06-28 深圳思谋信息科技有限公司 模型训练方法、装置、计算机设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103149939A (zh) * 2013-02-26 2013-06-12 北京航空航天大学 一种基于视觉的无人机动态目标跟踪与定位方法
US20190021426A1 (en) * 2017-07-20 2019-01-24 Siege Sports, LLC Highly Custom and Scalable Design System and Method for Articles of Manufacture
CN109344742A (zh) * 2018-09-14 2019-02-15 腾讯科技(深圳)有限公司 特征点定位方法、装置、存储介质和计算机设备
CN110147836A (zh) * 2019-05-13 2019-08-20 腾讯科技(深圳)有限公司 模型训练方法、装置、终端及存储介质

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101934325B1 (ko) * 2014-12-10 2019-01-03 삼성에스디에스 주식회사 객체 분류 방법 및 그 장치
EP3336774B1 (en) 2016-12-13 2020-11-25 Axis AB Method, computer program product and device for training a neural network
KR102481885B1 (ko) * 2017-09-08 2022-12-28 삼성전자주식회사 클래스 인식을 위한 뉴럴 네트워크 학습 방법 및 디바이스
CN109215057B (zh) * 2018-07-31 2021-08-20 中国科学院信息工程研究所 一种高性能视觉跟踪方法及装置
US11010888B2 (en) * 2018-10-29 2021-05-18 International Business Machines Corporation Precision defect detection based on image difference with respect to templates
CN109671102B (zh) * 2018-12-03 2021-02-05 华中科技大学 一种基于深度特征融合卷积神经网络的综合式目标跟踪方法
CN109766954B (zh) 2019-01-31 2020-12-04 北京市商汤科技开发有限公司 一种目标对象处理方法、装置、电子设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NATIONAL INTELLECTUAL PROPERTY ADMINISTRATION, 13 May 2019 (2019-05-13)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967315A (zh) * 2021-03-02 2021-06-15 北京百度网讯科技有限公司 一种目标跟踪方法、装置及电子设备
CN112967315B (zh) * 2021-03-02 2022-08-02 北京百度网讯科技有限公司 一种目标跟踪方法、装置及电子设备
CN113469977A (zh) * 2021-07-06 2021-10-01 浙江霖研精密科技有限公司 一种基于蒸馏学习机制的瑕疵检测装置、方法、存储介质
CN113469977B (zh) * 2021-07-06 2024-01-12 浙江霖研精密科技有限公司 一种基于蒸馏学习机制的瑕疵检测装置、方法、存储介质
CN113838093A (zh) * 2021-09-24 2021-12-24 重庆邮电大学 基于空间正则化相关滤波器的自适应多特征融合跟踪方法
CN113838093B (zh) * 2021-09-24 2024-03-19 重庆邮电大学 基于空间正则化相关滤波器的自适应多特征融合跟踪方法
CN114463829A (zh) * 2022-04-14 2022-05-10 合肥的卢深视科技有限公司 模型训练方法、亲属关系识别方法、电子设备及存储介质
CN114693995A (zh) * 2022-04-14 2022-07-01 北京百度网讯科技有限公司 应用于图像处理的模型训练方法、图像处理方法和设备
CN114463829B (zh) * 2022-04-14 2022-08-12 合肥的卢深视科技有限公司 模型训练方法、亲属关系识别方法、电子设备及存储介质
CN115455306A (zh) * 2022-11-11 2022-12-09 腾讯科技(深圳)有限公司 推送模型训练、信息推送方法、装置和存储介质
CN115455306B (zh) * 2022-11-11 2023-02-07 腾讯科技(深圳)有限公司 推送模型训练、信息推送方法、装置和存储介质

Also Published As

Publication number Publication date
CN110147836A (zh) 2019-08-20
JP7273157B2 (ja) 2023-05-12
KR102591961B1 (ko) 2023-10-19
CN110147836B (zh) 2021-07-02
JP2022532460A (ja) 2022-07-15
EP3971772A4 (en) 2022-08-10
US20210335002A1 (en) 2021-10-28
KR20210110713A (ko) 2021-09-08
EP3971772A1 (en) 2022-03-23
EP3971772B1 (en) 2023-08-09
US11704817B2 (en) 2023-07-18

Similar Documents

Publication Publication Date Title
WO2020228446A1 (zh) 模型训练方法、装置、终端及存储介质
US20220092351A1 (en) Image classification method, neural network training method, and apparatus
CN112446270B (zh) 行人再识别网络的训练方法、行人再识别方法和装置
WO2019100724A1 (zh) 训练多标签分类模型的方法和装置
CN109960742B (zh) 局部信息的搜索方法及装置
EP3968179A1 (en) Place recognition method and apparatus, model training method and apparatus for place recognition, and electronic device
WO2021022521A1 (zh) 数据处理的方法、训练神经网络模型的方法及设备
CN109902548B (zh) 一种对象属性识别方法、装置、计算设备及***
US20220148291A1 (en) Image classification method and apparatus, and image classification model training method and apparatus
CN111310731A (zh) 基于人工智能的视频推荐方法、装置、设备及存储介质
CN110222718B (zh) 图像处理的方法及装置
Xia et al. Loop closure detection for visual SLAM using PCANet features
CN111914997B (zh) 训练神经网络的方法、图像处理方法及装置
WO2020098257A1 (zh) 一种图像分类方法、装置及计算机可读存储介质
US20220157041A1 (en) Image classification method and apparatus
CN111695673B (zh) 训练神经网络预测器的方法、图像处理方法及装置
US20220157046A1 (en) Image Classification Method And Apparatus
CN111179270A (zh) 基于注意力机制的图像共分割方法和装置
CN112529149A (zh) 一种数据处理方法及相关装置
CN113065575A (zh) 一种图像处理方法及相关装置
CN111242114A (zh) 文字识别方法及装置
Ocegueda-Hernandez et al. A lightweight convolutional neural network for pose estimation of a planar model
CN114155388A (zh) 一种图像识别方法、装置、计算机设备和存储介质
CN116363656A (zh) 包含多行文本的图像识别方法、装置及计算机设备
CN113822871A (zh) 基于动态检测头的目标检测方法、装置、存储介质及设备

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20805250

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021536356

Country of ref document: JP

Kind code of ref document: A

ENP Entry into the national phase

Ref document number: 20217025275

Country of ref document: KR

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020805250

Country of ref document: EP

Effective date: 20211213