Model training method and device

Publication number: CN115731530A
Authority: CN (China)
Prior art keywords: feature, feature map, image block, image, network
Legal status: Pending
Application number: CN202110976217.6A
Other languages: Chinese (zh)
Inventors: 陈铠, 洪蓝青, 徐航, 李震国
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202110976217.6A
Publication of CN115731530A


Landscapes

  • Image Analysis (AREA)

Abstract

The application discloses a model training method, which includes: sampling a sample image to obtain a first image block and a second image block; based on the intersection-over-union (IoU) of the regions of the first image block and the second image block on the sample image being greater than a threshold, respectively extracting the features of the first image block and the second image block through a feature extraction network; determining a loss according to the difference between the resulting first feature map and second feature map; and updating the feature extraction network based on the loss. In the method, the IoU is used as a measure of the relative distance between the two image blocks. Requiring the IoU of the image blocks to be greater than a given threshold guarantees semantic consistency within a local region and thus restores the basic assumption of contrastive learning: the image blocks are confined to one local region, so global inconsistency is converted into local consistency, which in turn improves the data processing accuracy of the updated feature extraction network.

Description

Model training method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to a model training method and a model training device.
Background
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis and the military. It is the study of how to use cameras/video cameras and computers to acquire the data and information about a photographed object that we need. Figuratively speaking, computer vision gives a computer eyes (a camera or video camera) and a brain (an algorithm) so that it can identify, track and measure objects in place of human eyes, enabling the computer to perceive the environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multidimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret this input information. The ultimate research goal of computer vision is to enable a computer to observe and understand the world visually, as a human does, and to adapt to the environment autonomously.
A perception network may be a neural network model that processes and analyzes an image to obtain a processing result. Perception networks now support more and more functions, for example image classification, 2D detection, semantic segmentation, key point detection, linear object detection (such as lane line or stop line detection in automatic driving), drivable region detection, and the like. In addition, a visual perception system has the advantages of low cost, non-contact operation, small size and large information content. With the continuous improvement of the accuracy of visual perception algorithms, they have become a key technology of many current artificial intelligence systems and are widely applied, for example: identifying dynamic obstacles (people or vehicles) and static objects (traffic lights, traffic signs or traffic cones) on the road surface in advanced driving assistance systems (ADAS) and automatic driving systems (ADS), and identifying human body masks and key points in the photographing and beautification function of a terminal to achieve a slimming effect, and so on.
In recent years, self-supervised representation learning based on contrastive learning has made major technological breakthroughs. Its basic assumption is that image blocks from the same image describe the same semantic information, so their features should be as close as possible, while image blocks from different images describe different semantic information, so their features should be as different as possible. Although this approach does not require manual labeling, it still relies on an additional single-instance assumption, i.e., there is only one instance in each image and this instance occupies the central, major portion of the image. Under this assumption, image blocks cut from the same image can be regarded as positive samples and image blocks cut from different images as negative samples. However, in an automatic driving scene, a street view image often contains multiple instances such as pedestrians and vehicles, has strong global inconsistency and does not conform to the single-instance assumption. Existing methods are therefore often limited to data sets satisfying the single-instance assumption and find it difficult to make full use of more realistic multi-instance automatic driving data sets.
Disclosure of Invention
In a first aspect, the present application provides a model training method, including:
acquiring a sample image; the sample image may be a multi-instance image (e.g., a street view image) in the above-described automatic driving scenario, which in one possible implementation may include a plurality of objects including at least one of a person, a vehicle, a traffic sign, a lane line, or a plant. Illustratively, the objects may include, but are not limited to, at least one of the following: dynamic obstacles (pedestrians (Pedestrian), riders (Cyclist), tricycles (Tricycle), cars (Car), trucks (Truck), buses (Bus)), static obstacles (traffic cones (TrafficCone), traffic sticks (TrafficStick), fire hydrants (FireHydrant), motorcycles (Motorcycle), bicycles (Bicycle)), traffic signs (TrafficSign), guide signs (GuideSign), billboards (Billboard), red traffic lights (TrafficLight_Red)/yellow traffic lights (TrafficLight_Yellow)/green traffic lights (TrafficLight_Green)/black traffic lights (TrafficLight_Black), and road signs (RoadSign).
Sampling the sample image to obtain a first image block and a second image block, wherein the first image block and the second image block are different image blocks on the sample image;
after the sample image is acquired, the sample image may be sampled to obtain a plurality of partial images (e.g., the first image block and the second image block). The first image block and the second image block may be small images obtained by sampling the sample image (where "small" means that the image areas of the first image block and the second image block are small relative to the size of the sample image).
The sampling may be understood as cropping: in one sampling process, a rectangular region may be randomly determined on the sample image and the image in this rectangular region used as the first image block; in another sampling process, a rectangular region may be randomly determined on the sample image and the image in this rectangular region used as the second image block.
The sampling may be random sampling; "random sampling" is to be understood as meaning that the position of the sample and/or the size of the sample is determined randomly.
The first image block and the second image block may be rectangular images.
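For illustration only (this sketch is not part of the claims), the random sampling of two rectangular image blocks described above could look as follows; the crop-size bounds and the use of plain Python are assumptions made for the example.

```python
import random

def sample_block(img_w, img_h, min_frac=0.2, max_frac=0.6):
    """Randomly pick a rectangular region (x1, y1, x2, y2) on a sample image.
    The size fractions are illustrative assumptions, not values from the patent."""
    w = int(img_w * random.uniform(min_frac, max_frac))
    h = int(img_h * random.uniform(min_frac, max_frac))
    x1 = random.randint(0, img_w - w)
    y1 = random.randint(0, img_h - h)
    return (x1, y1, x1 + w, y1 + h)

# Two independent samplings yield the first and the second image block.
block1 = sample_block(1920, 1080)
block2 = sample_block(1920, 1080)
```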
based on the intersection-over-union (IoU) between the regions of the first image block and the second image block on the sample image being greater than a threshold, respectively extracting the features of the first image block and the second image block through a feature extraction network to obtain a first feature map and a second feature map;
a common region (overlapping region) may exist between the region where the first image block is located on the sample image and the region where the second image block is located on the sample image. The ratio between the area of this common region and the total area occupied by the first image block and the second image block on the sample image (i.e., their union) is the intersection-over-union (IoU) between the regions where the first image block and the second image block are located on the sample image;
in one possible implementation, the threshold may be a value greater than or equal to 0.4; for example, the threshold may be 0.4, 0.45, 0.5, 0.55, and so on. It should be understood that the IoU threshold obviously cannot be set too low, but a higher value is not necessarily better either, since what is desired are image blocks that are correlated yet not identical. The selection of the IoU threshold is therefore, in effect, a way of controlling the balance between data noise and data complexity in multi-instance unsupervised learning.
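The IoU described above can be computed directly from the two rectangular regions. The following sketch is illustrative only; the box coordinates and the 0.4 threshold are example values consistent with the description, not mandated implementation details.

```python
def intersection_over_union(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) in sample-image coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

THRESHOLD = 0.4  # example value; the description only requires a value >= 0.4
block1 = (100, 100, 400, 300)
block2 = (150, 130, 430, 320)
iou = intersection_over_union(block1, block2)   # roughly 0.60 for these boxes
is_positive_pair = iou > THRESHOLD              # the pair would be kept for feature extraction
```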
Determining loss according to the difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network.
In the method, the IoU is used as a measure of the relative distance between the two image blocks. Requiring the IoU of the image blocks to be greater than a given threshold guarantees semantic consistency within a local region and thus restores the basic assumption of contrastive learning: the image blocks are confined to one local region, so global inconsistency is converted into local consistency, which in turn improves the data processing accuracy of the updated feature extraction network. Meanwhile, the balance between data noise and data complexity in different scenes can be effectively controlled through the selection of the threshold.
In one possible implementation, the threshold is a value greater than or equal to 0.4.
In one possible implementation, before determining the loss according to the difference between the first feature map and the second feature map, the method further includes: aligning the first feature map and the second feature map to obtain an aligned first feature map and an aligned second feature map; and the determining a loss from a difference between the first feature map and the second feature map includes: determining a loss based on a difference between the aligned first feature map and the aligned second feature map.
In one possible implementation, the sample image includes a target region, the target region being the overlapping region of the first image block and the second image block on the sample image, and the aligning includes: determining, according to the target region, a first sub-feature map corresponding to the target region in the first feature map and a second sub-feature map corresponding to the target region in the second feature map; upsampling the first sub-feature map to obtain the aligned first feature map; and upsampling the second sub-feature map to obtain the aligned second feature map, where the size of the aligned first feature map is consistent with that of the aligned second feature map.
In order to distinguish multi-instance features, the global pooling layer behind the backbone network needs to be abandoned so as to maintain the two-dimensional structure and position information of the feature map; however, this brings an additional problem of feature misalignment, i.e., the two-dimensional feature maps no longer correspond one-to-one at the same relative positions. The embodiments of the present application provide two different ways of performing feature alignment. The first is region-of-interest alignment: the overlapping part of the image blocks is used as the region of interest of both image blocks, and, for example, RoI Align can be used to extract only the features of the overlapping part for the subsequent computation. This way is intuitive but does not make full use of the information of the non-overlapping parts.
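A possible sketch of this region-of-interest alignment, assuming a PyTorch backbone and using torchvision's roi_align (the description names RoI Align but does not prescribe a library): the overlap rectangle is extracted from each two-dimensional feature map and resampled to a common size. The backbone stride and output size used below are assumptions.

```python
import torch
from torchvision.ops import roi_align

def align_by_overlap(feat1, feat2, box1, box2, out_size=(7, 7), stride=32):
    """Extract the overlap region from each feature map with RoI Align and resample both
    crops to a common out_size. box1/box2 give the overlap rectangle in each image block's
    own pixel coordinates (x1, y1, x2, y2); stride=32 is an assumed backbone stride."""
    def as_roi(box):
        # Prepend the batch index expected by roi_align: (batch_idx, x1, y1, x2, y2).
        return torch.cat([torch.zeros(1, 1), torch.tensor(box, dtype=torch.float32).view(1, 4)], dim=1)
    a1 = roi_align(feat1, as_roi(box1), output_size=out_size, spatial_scale=1.0 / stride, aligned=True)
    a2 = roi_align(feat2, as_roi(box2), output_size=out_size, spatial_scale=1.0 / stride, aligned=True)
    return a1, a2  # both (1, C, 7, 7), comparable point by point

f1, f2 = torch.randn(1, 256, 8, 10), torch.randn(1, 256, 9, 7)
a1, a2 = align_by_overlap(f1, f2, box1=(64, 32, 256, 192), box2=(0, 0, 192, 160))
```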
In one possible implementation, the first feature map and the second feature map have the same size; the first feature map includes M first feature points and the second feature map includes M second feature points; the M first feature points correspond to M first pixel points in the sample image, the M second feature points correspond to M second pixel points in the sample image, and the M first pixel points correspond to the M second pixel points one to one. The method further includes: obtaining a third feature map according to the M first pixel points and the M second pixel points, where the size of the third feature map is consistent with that of the second feature map, the third feature map includes M third feature points, and each third feature point is obtained based on the pixel position difference between the first pixel point and the second pixel point having a corresponding relationship; and fusing the third feature map with the first feature map to obtain the aligned first feature map, where the second feature map is used as the aligned second feature map.
In one possible implementation, the fusing the third feature map with the first feature map includes:
and splicing the third characteristic map and the first characteristic map in the depth direction.
The second way is displacement alignment: for each pair of feature points located at the same relative position, the coordinate displacement of the corresponding pixel points in the original image is calculated and concatenated with the feature map in the depth direction, and the result is provided to the prediction network as extra side information for implicit feature alignment, helping the subsequent feature prediction. In this way, the feature information of the non-overlapping region is fully utilized, and the subsequent similarity measurement is more flexible.
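A minimal sketch of this displacement alignment, under the assumption that each feature-map cell is mapped linearly to the image coordinates of its image block (the exact mapping is not specified here): a two-channel displacement map is built from the pixel-position differences and concatenated with the feature map along the depth (channel) dimension.

```python
import torch

def displacement_map(box1, box2, fh, fw):
    """For every feature-map cell, compute where its pixel sits in the sample image for each
    image block and return the (dx, dy) offsets, shape (1, 2, fh, fw). Boxes are
    (x1, y1, x2, y2) of the blocks on the sample image; linear spacing is an assumption."""
    def cell_coords(box):
        x = torch.linspace(float(box[0]), float(box[2]), fw).view(1, fw).expand(fh, fw)
        y = torch.linspace(float(box[1]), float(box[3]), fh).view(fh, 1).expand(fh, fw)
        return x, y
    x1, y1 = cell_coords(box1)
    x2, y2 = cell_coords(box2)
    return torch.stack([x2 - x1, y2 - y1], dim=0).unsqueeze(0)

# Concatenate the displacement map with the first feature map along the depth (channel) axis
# before feeding it to the prediction network.
feat1 = torch.randn(1, 256, 7, 7)                                   # first feature map
disp = displacement_map((100, 100, 400, 300), (150, 130, 430, 320), 7, 7)
fused = torch.cat([feat1, disp], dim=1)                             # shape (1, 258, 7, 7)
```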
In one possible implementation, the determining a loss from a difference between the first feature map and the second feature map includes: processing the M first feature points of the first feature map through a target prediction network to obtain a predicted value of each first feature point; clustering the M second feature points of the second feature map based on a target clustering algorithm to update the feature values of the M second feature points, where the updated feature value of each second feature point is the cluster-center feature value of the cluster category to which it belongs; and determining the loss according to the difference between the predicted value of each first feature point and the updated feature value of the corresponding second feature point.
According to the embodiments of the present application, intra-image clustering is used so that, on the one hand, the network has the ability to distinguish the features of different instances; on the other hand, by considering the overall information of the two-dimensional feature map, the regression target provided by the target branch becomes more robust, while the online branch deploys a self-attention module to introduce a global view and obtain a more accurate prediction result.
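The following sketch illustrates, under several assumptions, the loss described above: a toy k-means stands in for the unspecified target clustering algorithm, a small MLP stands in for the target prediction network, and a normalized mean-squared error is only one possible way to measure the difference between the predicted values and the cluster-centre feature values.

```python
import torch
import torch.nn.functional as F

def kmeans_replace(points, k=4, iters=10):
    """Toy k-means (an assumed stand-in for the target clustering algorithm): each feature
    point is replaced by the centre of the cluster it belongs to. points: (M, C)."""
    k = min(k, points.size(0))
    centres = points[torch.randperm(points.size(0))[:k]]
    for _ in range(iters):
        assign = torch.cdist(points, centres).argmin(dim=1)           # cluster category per point
        centres = torch.stack([points[assign == i].mean(dim=0) if (assign == i).any() else centres[i]
                               for i in range(k)])
    return centres[assign]                                            # (M, C) cluster-centre values

def clustering_loss(first_points, second_points, predictor):
    """first_points / second_points: (M, C) feature points of the two aligned feature maps;
    `predictor` stands for the target prediction network (its architecture is assumed)."""
    pred = predictor(first_points)                                    # predicted value per first feature point
    target = kmeans_replace(second_points).detach()                   # updated second feature points
    return F.mse_loss(F.normalize(pred, dim=-1), F.normalize(target, dim=-1))

predictor = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU(), torch.nn.Linear(256, 256))
loss = clustering_loss(torch.randn(49, 256), torch.randn(49, 256), predictor)
```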
In one possible implementation, the method further comprises:
and updating the target prediction network according to the loss.
In one possible implementation, the sample image includes a plurality of objects including at least one of a person, a vehicle, a traffic sign, a lane line, a plant.
In one possible implementation, the method further comprises:
acquiring a target network and an image to be processed, wherein the target network comprises the updated feature extraction network and a downstream task network; and processing the image to be processed through the target network to obtain a processing result.
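Purely as an illustration of how the updated feature extraction network might be combined with a downstream task network, the following sketch attaches a hypothetical classification head; the head structure, channel count and class count are assumptions, not part of the disclosed method.

```python
import torch
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Illustrative composition only: the pre-trained (updated) feature extraction network
    followed by a downstream task network (here a hypothetical classification head)."""
    def __init__(self, feature_extractor, num_classes=10):
        super().__init__()
        self.backbone = feature_extractor            # updated feature extraction network
        self.head = nn.Sequential(                   # downstream task network (assumed form)
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(256, num_classes))

    def forward(self, image):
        return self.head(self.backbone(image))       # processing result for the image to be processed

# Stand-in backbone with 256 output channels, used only to make the sketch runnable.
backbone = nn.Sequential(nn.Conv2d(3, 256, 3, stride=2, padding=1), nn.ReLU())
net = TargetNetwork(backbone)
result = net(torch.randn(1, 3, 224, 224))            # shape (1, 10)
```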
In a second aspect, the present application provides a model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring a sample image;
the sampling module is used for sampling the sample image to obtain a first image block and a second image block, wherein the first image block and the second image block are different image blocks on the sample image;
the feature extraction module is configured to, based on the intersection-over-union (IoU) between the regions of the first image block and the second image block on the sample image being greater than a threshold, respectively extract the features of the first image block and the second image block through a feature extraction network to obtain a first feature map and a second feature map;
and the model updating module is used for determining loss according to the difference between the first feature map and the second feature map and updating the feature extraction network based on the loss so as to obtain an updated feature extraction network.
In one possible implementation, the threshold is a value greater than or equal to 0.4.
In one possible implementation, the apparatus further comprises:
an alignment module, configured to align the first feature map and the second feature map before the loss is determined according to the difference between the first feature map and the second feature map, so as to obtain an aligned first feature map and an aligned second feature map;
the model update module is specifically configured to:
determining a loss based on a difference between the aligned first feature map and the aligned second feature map.
In a possible implementation, the sample image includes a target region, the target region being the overlapping region of the first image block and the second image block on the sample image, and the alignment module is specifically configured to:
determine, according to the target region, a first sub-feature map corresponding to the target region in the first feature map and a second sub-feature map corresponding to the target region in the second feature map;
upsample the first sub-feature map to obtain the aligned first feature map;
and upsample the second sub-feature map to obtain the aligned second feature map, where the size of the aligned first feature map is consistent with that of the aligned second feature map.
In a possible implementation, the first feature map and the second feature map have the same size; the first feature map includes M first feature points and the second feature map includes M second feature points; the M first feature points correspond to M first pixel points in the sample image, the M second feature points correspond to M second pixel points in the sample image, and the M first pixel points correspond to the M second pixel points one to one; the alignment module is specifically configured to:
obtain a third feature map according to the M first pixel points and the M second pixel points, where the third feature map and the second feature map have the same size, the third feature map includes M third feature points, and each third feature point is obtained based on the pixel position difference between the first pixel point and the second pixel point having a corresponding relationship;
and fuse the third feature map with the first feature map to obtain the aligned first feature map, where the second feature map is used as the aligned second feature map.
In a possible implementation, the alignment module is specifically configured to:
and splicing the third feature map and the first feature map in the depth direction.
In a possible implementation, the model updating module is specifically configured to:
process the M first feature points of the first feature map through a target prediction network to obtain a predicted value of each first feature point;
cluster the M second feature points of the second feature map based on a target clustering algorithm to update the feature values of the M second feature points, where the updated feature value of each second feature point is the cluster-center feature value of the cluster category to which it belongs;
and determine the loss according to the difference between the predicted value of each first feature point and the updated feature value of the corresponding second feature point.
In one possible implementation, the model updating module is further configured to:
and updating the target prediction network according to the loss.
In one possible implementation, the sample image includes a plurality of objects including at least one of a person, a vehicle, a traffic sign, a lane line, a plant.
In one possible implementation, the apparatus further comprises:
the data processing module is used for acquiring a target network and an image to be processed, wherein the target network comprises the updated feature extraction network and a downstream task network;
and processing the image to be processed through the target network to obtain a processing result.
In a third aspect, an embodiment of the present application provides a model training apparatus, which may include a memory, a processor and a bus system, where the memory is used to store a program, and the processor is used to execute the program in the memory to perform the method described in the first aspect or any optional implementation thereof.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium in which a computer program is stored; when the computer program runs on a computer, the computer is caused to execute the method according to the first aspect or any optional implementation thereof.
In a fifth aspect, embodiments of the present application provide a computer program, which when run on a computer, causes the computer to perform the first aspect and any optional method thereof.
In a sixth aspect, the present application provides a chip system, the chip system including a processor configured to support an execution device or a training device in implementing the functions involved in the above aspects, for example, sending or processing the data and/or information involved in the above methods. In one possible design, the chip system further includes a memory for storing the program instructions and data necessary for the execution device or the training device. The chip system may consist of a chip, or may include a chip and other discrete devices.
The embodiment of the application provides a model training method, which includes: acquiring a sample image; sampling the sample image to obtain a first image block and a second image block, where the first image block and the second image block are different image blocks on the sample image; based on the intersection-over-union (IoU) of the regions of the first image block and the second image block on the sample image being greater than a threshold, respectively extracting the features of the first image block and the second image block through a feature extraction network to obtain a first feature map and a second feature map; and determining a loss according to the difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network. The IoU is used as a measure of the relative distance between the two image blocks. Requiring the IoU of the image blocks to be greater than a given threshold guarantees semantic consistency within a local region and thus restores the basic assumption of contrastive learning: the image blocks are confined to one local region, so global inconsistency is converted into local consistency, which in turn improves the data processing accuracy of the updated feature extraction network. Meanwhile, the balance between data noise and data complexity in different scenes can be effectively controlled through the selection of the threshold.
Drawings
FIG. 1 is a schematic structural diagram of an artificial intelligence body framework;
FIG. 2 is an application scenario of an embodiment of the present application;
FIG. 3 is a schematic diagram of a convolutional neural network employed in an embodiment of the present application;
FIG. 4 is a schematic diagram of a convolutional neural network employed in an embodiment of the present application;
FIG. 5 is a diagram illustrating a system architecture according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a model training method provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of intersection-over-union (IoU) calculation provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of sampling an image block according to an embodiment of the present application;
FIG. 9 is a schematic diagram of sampling an image block according to an embodiment of the present application;
FIG. 10 is a schematic diagram of sampling an image block according to an embodiment of the present application;
FIG. 11 is a schematic diagram of sampling an image block according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a backbone network;
FIG. 13 is a schematic diagram illustrating alignment of image blocks according to an embodiment of the present application;
FIG. 14 is a schematic diagram illustrating alignment of image blocks according to an embodiment of the present application;
FIG. 15 is a clustering illustration in an embodiment of the present application;
FIG. 16 is a similarity calculation schematic according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a model training method provided in an embodiment of the present application;
FIG. 18 is a schematic diagram of a model training method provided in an embodiment of the present application;
FIG. 19 is a schematic of a downstream task network architecture;
FIG. 20 is a schematic view of a head;
FIG. 21 is a schematic of a clustering result;
FIG. 22 is a schematic diagram of a model training apparatus according to an embodiment of the present application;
fig. 23 is a schematic structural diagram of an execution device according to an embodiment of the present application;
FIG. 24 is a schematic structural diagram of a training apparatus provided in an embodiment of the present application;
FIG. 25 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings. The terminology used in the description of the embodiments of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Embodiments of the present application are described below with reference to the accompanying drawings. As can be known to those skilled in the art, with the development of technology and the emergence of new scenarios, the technical solution provided in the embodiments of the present application is also applicable to similar technical problems.
The terms "first," "second," and the like in the description and in the claims of the present application and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances and are merely descriptive of the manner in which objects of the same nature are distinguished in the embodiments of the application. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of elements is not necessarily limited to those elements, but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The general workflow of an artificial intelligence system will be described first. Referring to FIG. 1, FIG. 1 shows a schematic structural diagram of an artificial intelligence framework, which is explained below along two dimensions: the "intelligent information chain" (horizontal axis) and the "IT value chain" (vertical axis). The "intelligent information chain" reflects the series of processes starting from data acquisition, for example the general processes of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision making, and intelligent execution and output. In this process, the data undergoes a "data - information - knowledge - wisdom" refinement process. The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology) up to the industrial ecology of the system.
(1) Infrastructure
The infrastructure provides computing power support for the artificial intelligence system, realizes communication with the outside world, and is supported by a basic platform. It communicates with the outside through sensors; computing power is provided by intelligent chips (hardware acceleration chips such as CPUs, NPUs, GPUs, ASICs and FPGAs); the basic platform includes related platform guarantees and support such as distributed computing frameworks and networks, and may include cloud storage and computing, interconnection networks, and the like. For example, sensors communicate with the outside to acquire data, and the data are provided to the intelligent chips in the distributed computing system provided by the basic platform for computation.
(2) Data
The data at the layer above the infrastructure indicates the data sources in the field of artificial intelligence. The data relate to graphics, images, voice and text, and also to Internet-of-things data of traditional devices, including service data of existing systems and sensing data such as force, displacement, liquid level, temperature and humidity.
(3) Data processing
Data processing typically includes data training, machine learning, deep learning, searching, reasoning, decision making, and the like.
The machine learning and the deep learning can be used for performing symbolized and formalized intelligent information modeling, extraction, preprocessing, training and the like on data.
Inference refers to the process of simulating human intelligent inference in a computer or intelligent system, in which the machine uses formalized information to think and solve problems according to an inference control strategy; typical functions are searching and matching.
Decision making refers to the process of making decisions after reasoning over intelligent information, and generally provides functions such as classification, ranking and prediction.
(4) General capabilities
After the above-mentioned data processing, further based on the result of the data processing, some general capabilities may be formed, such as algorithms or a general system, e.g. translation, analysis of text, computer vision processing, speech recognition, recognition of images, etc.
(5) Intelligent product and industrial application
Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields; they encapsulate the overall artificial intelligence solution, commercialize intelligent information decision making, and put it into practical use. The application fields mainly include: intelligent terminals, intelligent transportation, intelligent medical care, autonomous driving, smart cities, and the like.
The following briefly introduces several application scenarios, such as the ADAS/ADS visual perception system, mobile phone beautification, image classification, and commodity classification.
Application scenario 1: ADAS/ADS visual perception system
As shown in fig. 2, in ADAS and ADS, multiple types of 2D target detection need to be performed in real time, including: dynamic obstacles (pedestrians (Pedestrian), riders (Cyclist), tricycles (Tricycle), cars (Car), trucks (Truck), buses (Bus)), static obstacles (traffic cones (TrafficCone), traffic sticks (TrafficStick), fire hydrants (FireHydrant), motorcycles (Motorcycle), bicycles (Bicycle)), traffic signs (TrafficSign), guide signs (GuideSign), billboards (Billboard), red traffic lights (TrafficLight_Red)/yellow traffic lights (TrafficLight_Yellow)/green traffic lights (TrafficLight_Green)/black traffic lights (TrafficLight_Black), and road signs (RoadSign). In addition, in order to accurately determine the region occupied by a dynamic obstacle in 3-dimensional space, it is also necessary to perform 3D estimation on the dynamic obstacle and output a 3D box. In order to fuse with laser radar data, the mask of the dynamic obstacle needs to be acquired so as to filter out the laser point clouds that hit the dynamic obstacle; in order to park accurately in a parking space, the 4 key points of the parking space need to be detected simultaneously; in order to perform pattern positioning, the key points of static objects need to be detected. Using a model (for example, the target network) trained with the technical solution provided by the embodiments of the present application, all or part of these functions can be completed in the target network.
Application scenario 2: mobile phone beautifying function
In a mobile phone, the mask and key points of a human body are detected with a model (e.g., the target network) trained by the technical solution provided in the embodiments of the present application, and corresponding parts of the human body can be enlarged or reduced, such as waist-slimming and hip-lifting operations, so as to output a beautified image.
Application scenario 3: image classification scene:
after the images to be classified are obtained, the classes of the objects in the images can be determined using a model (such as the target network) trained with the technical solution provided in the embodiments of the present application, and the images can then be classified according to the classes of the objects they contain. Photographers take many photos every day, of animals, of people and of plants. With this method, the photos can be quickly classified according to their content, for example into photos containing animals, photos containing people and photos containing plants.
Application scenario 4: and (4) commodity classification:
after images containing commodities are obtained, the categories of the commodities in the images can be determined using a model (such as the target network) trained with the technical solution provided in the embodiments of the present application, and the commodities can then be classified according to their categories. This is useful for the wide variety of goods in shopping malls or supermarkets.
Since the embodiments of the present application relate to the application of a large number of neural networks, for the convenience of understanding, the related terms and related concepts such as neural networks related to the embodiments of the present application will be described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an operation unit that takes $x_s$ (i.e., input data) and an intercept of 1 as inputs, and the output of the operation unit may be:

$$\text{output} = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by joining many of the above single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.
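A single neural unit as described by the formula above can be sketched as follows (NumPy and the tanh activation are arbitrary choices for the example; the text mentions sigmoid as one possible activation function).

```python
import numpy as np

def neural_unit(x, w, b, f=np.tanh):
    """Single neural unit: weighted sum of the inputs plus a bias, passed through an
    activation function f."""
    return f(np.dot(w, x) + b)

out = neural_unit(np.array([0.5, -1.2, 3.0]), np.array([0.1, 0.4, -0.2]), b=0.05)
```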
(2) Object detection: by using image processing, machine learning, computer graphics and other related methods, object detection can determine the category of an image object and determine a detection box for locating the object.
(3) A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of convolutional layers and sub-sampling layers, which can be regarded as a filter. The convolutional layer is a neuron layer for performing convolution processing on an input signal in a convolutional neural network. In convolutional layers of convolutional neural networks, one neuron may be connected to only a portion of the neighbor neurons. In a convolutional layer, there are usually several characteristic planes, and each characteristic plane may be composed of several neural units arranged in a rectangular shape. The neural units of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights may be understood as the way features are extracted is location independent. The convolution kernel can be formalized as a matrix of random size, and can be learned to obtain reasonable weights during the training process of the convolutional neural network. In addition, sharing weights brings the direct benefit of reducing connections between layers of the convolutional neural network, while reducing the risk of overfitting.
CNN is a very common neural network, and the structure of CNN will be described in detail below with reference to fig. 3. As described in the introduction of the basic concept, the convolutional neural network is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, and the deep learning architecture refers to performing multiple levels of learning at different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons can respond to images input thereto.
As shown in fig. 3, the Convolutional Neural Network (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully connected layer 230.
Convolutional layer/pooling layer 220:
a convolutional layer:
convolutional/pooling layer 220 as shown in fig. 3 may comprise layers as in examples 221-226, for example: in one implementation, 221 is a convolutional layer, 222 is a pooling layer, 223 is a convolutional layer, 224 is a pooling layer, 225 is a convolutional layer, 226 is a pooling layer; in another implementation, 221, 222 are convolutional layers, 223 is a pooling layer, 224, 225 are convolutional layers, and 226 is a pooling layer. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
The internal operation of a convolutional layer will be described below by taking convolutional layer 221 as an example.
Convolutional layer 221 may include many convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on the image, the weight matrix is usually slid along the horizontal direction of the input image one pixel at a time (or two pixels at a time, depending on the value of the stride), thereby completing the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends over the entire depth of the input image. Therefore, convolving with a single weight matrix produces a convolved output with a single depth dimension; but in most cases a single weight matrix is not used, and instead multiple weight matrices of the same size (rows × columns), i.e., multiple matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features in the image, for example one weight matrix to extract image edge information, another weight matrix to extract a particular color of the image, yet another weight matrix to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are combined to form the output of the convolution operation.
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 200 can make correct prediction.
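As an illustration of the preceding description, the following sketch (using PyTorch, which is an assumption) shows a convolutional layer whose 16 kernels each span the full depth of a 3-channel input, so their outputs stack into a feature map of depth 16; the kernel size, stride and input resolution are example values.

```python
import torch
import torch.nn as nn

# 16 weight matrices (kernels), each spanning the full input depth of 3 channels;
# their 16 outputs stack into the depth dimension of the resulting feature map.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)
image = torch.randn(1, 3, 224, 224)       # a single RGB input image
feature_map = conv(image)                 # shape: (1, 16, 224, 224)
```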
When convolutional neural network 200 has multiple convolutional layers, the initial convolutional layers (e.g., 221) tend to extract more general features, which may also be called low-level features; as the depth of convolutional neural network 200 increases, the later convolutional layers (e.g., 226) extract increasingly complex features, such as features with high-level semantics, and features with higher semantics are more suitable for the problem to be solved.
A pooling layer:
since it is often desirable to reduce the number of training parameters, it is often desirable to periodically introduce pooling layers after the convolutional layer, either one layer of convolutional layers followed by one pooling layer or multiple layers of convolutional layers followed by one or more pooling layers, as exemplified by 220 in FIG. 3. The only purpose of the pooling layer in the image processing process is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to smaller sized images. The average pooling operator may calculate pixel values in the image over a certain range to produce an average as a result of the average pooling. The max pooling operator may take the pixel with the largest value in a particular range as a result of the max pooling. In addition, just as the size of the weighting matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after the image processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel point in the image output by the pooling layer represents an average value or a maximum value of a corresponding sub-region of the image input to the pooling layer.
Fully connected layer 230:
after processing by convolutional layer/pooling layer 220, convolutional neural network 200 is not sufficient to output the required output information. Because, as previously described, convolutional layer/pooling layer 220 only extracts features and reduces the parameters associated with the input image. However, to generate the final output information (required class information or other relevant information), the convolutional neural network 200 needs to generate one or a set of the required number of classes of output using the fully-connected layer 230. Therefore, a plurality of hidden layers (231, 232 to 23n shown in FIG. 3) may be included in the fully-connected layer 230, and parameters included in the hidden layers may be obtained by pre-training according to training data related to a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on, \8230
The last layer of the whole convolutional neural network 200, after the hidden layers in the fully-connected layer 230, is the output layer 240. The output layer 240 has a loss function similar to the categorical cross entropy and is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 200 is completed (in fig. 3, propagation in the direction from 210 to 240 is forward propagation), back propagation (in fig. 3, propagation in the direction from 240 to 210 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 200 and the error between the result output by the convolutional neural network 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network 200 shown in fig. 3 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, only includes a part of the network structure shown in fig. 3, for example, the convolutional neural network employed in the embodiment of the present application may only include the input layer 210, the convolutional layer/pooling layer 220 and the output layer 240.
It should be noted that the convolutional neural network 200 shown in fig. 3 is only an example of a convolutional neural network; in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 4, multiple convolutional layers/pooling layers arranged in parallel, with the features extracted by each all input to the fully-connected layer 230 for processing.
(4) Deep neural network
Deep Neural Networks (DNN), also called multi-layer neural networks, can be understood as neural networks with a large number of hidden layers; "many" here has no particular threshold. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is necessarily connected with any neuron in the (i+1)-th layer. Although a DNN appears complex, the work of each layer is actually not complex and is simply the following linear relational expression:

$$\vec{y} = \alpha(W\vec{x} + \vec{b})$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset (bias) vector, $W$ is the weight matrix (also called coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In general, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W^L_{jk}$. Note that the input layer has no $W$ parameters. In deep neural networks, more hidden layers make the network better able to depict complex situations in the real world. In theory, a model with more parameters has higher complexity and a larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (formed by the vectors $W$ of many layers).
(5) Loss function
In the process of training a deep neural network, because the output of the deep neural network is expected to be as close as possible to the value that is really desired to be predicted, the weight vectors of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the really desired target value (of course, an initialization process is usually carried out before the first update, i.e., parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to lower the prediction, and the adjustment continues until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of loss functions (loss functions) or objective functions (objective functions), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.
(6) Back propagation algorithm
A convolutional neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagation of the input signal up to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, and aims to obtain the parameters of the optimal super-resolution model, such as the weight matrices.
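A toy example of the interplay between a loss function, forward propagation and back propagation (the network, data and learning rate below are made up for illustration; PyTorch is an assumed choice):

```python
import torch

# Mean-squared-error loss minimized by gradient descent on a single linear "network".
w = torch.randn(3, requires_grad=True)
x, target = torch.tensor([1.0, 2.0, 3.0]), torch.tensor(4.0)

for _ in range(100):
    pred = (w * x).sum()                 # forward propagation: predicted value of the current network
    loss = (pred - target) ** 2          # the larger the difference, the larger the loss
    loss.backward()                      # back propagation of the error loss
    with torch.no_grad():
        w -= 0.01 * w.grad               # adjust the weights to reduce the loss
        w.grad.zero_()
```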
(7) Self-supervised learning: certain attributes of the data (such as the rotation angle or the arrangement order of image blocks) are used as the "labels" of the data, unsupervised pre-training is carried out in a supervised manner, and finally the backbone network parameters are taken as the initialization of the network parameters for various downstream tasks. Here, upstream refers to the pre-training process, while downstream refers to the various actual vision problems. Among these methods, contrastive learning (contrastive learning) is a recent research focus; it constructs a contrastive task by mining the consistency of the data itself and considers that different image blocks from the same image should have similar features. Features learned in this way have even shown transfer effects beyond supervised pre-training.
(8) Intra-domain pre-training (domain-specific pre-training): pre-training for a downstream task in a specific field using an upstream data set whose data domain is the same as that of the downstream task, so as to reduce the data difference between upstream and downstream and thus reduce the optimization difficulty of downstream fine-tuning. In the present invention, the data sets used upstream and downstream are multi-instance automatic driving data sets.
(9) Backbone network (backbone): the basic network structure used to extract features from data, and also the object of pre-training and the main object of learning. On the basis of a backbone network, different task-specific network structures can be deployed for different tasks; backbone networks are generally transferable, whereas the task-specific networks often are not. For images, a commonly used backbone network is a convolutional neural network, and the final feature representation is a two-dimensional feature map (2D feature map). Existing contrastive learning models often deploy an additional global pooling layer (global pooling layer), which, for example, averages over the spatial dimensions to turn the two-dimensional feature map into a one-dimensional feature vector (1D feature vector). In the present invention, the global pooling layer is abandoned and the two-dimensional feature map is modeled directly.
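As an illustration of discarding the global pooling layer to keep the two-dimensional feature map, the sketch below strips the global average pooling and classifier from a standard ResNet-50 (the choice of ResNet-50 and of a recent torchvision with the `weights` argument is an assumption; the patent does not prescribe a specific backbone):

```python
import torch
import torchvision

# Keep the 2D feature map so per-location (multi-instance) features are preserved.
resnet = torchvision.models.resnet50(weights=None)
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])     # remove avgpool + fc

image_block = torch.randn(1, 3, 224, 224)
feature_map_2d = backbone(image_block)            # shape (1, 2048, 7, 7): a 2D feature map
# A global pooling layer would instead collapse this into a 1D feature vector:
feature_vec_1d = feature_map_2d.mean(dim=(2, 3))  # shape (1, 2048)
```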
Next, a more detailed architecture of an execution subject that executes the model training method in the embodiment of the present application is described.
The system architecture provided by the embodiment of the present application is described in detail below with reference to fig. 5. Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in FIG. 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data collection system 560.
The execution device 510 includes a computation module 511, an I/O interface 512, a pre-processing module 513, and a pre-processing module 514. The goal model/rules 501 may be included in the calculation module 511, with the pre-processing module 513 and the pre-processing module 514 being optional.
The data acquisition device 560 is used to acquire training samples. The training sample may be image data or the like, and in the embodiment of the present application, the training sample may be a sample image. After the training samples are collected, the data collection device 560 stores the training samples in the database 530.
The training device 520 may train the feature extraction network based on training samples maintained in the database 530 to obtain the target model/rule 501. In the embodiment of the present application, the target model/rule 501 may extract a network for the updated features. It should be understood that the above process of training the feature extraction network may be a pre-training process, and after obtaining the updated feature extraction network, the updated feature extraction network may be subjected to fine tuning for a target task in combination with a downstream task data set.
It should be noted that, in practical applications, the training samples maintained in the database 530 are not necessarily all collected from the data collection device 560, and may be received from other devices. It should be noted that, the training device 520 does not necessarily perform the training of the target model/rule 501 based on the training samples maintained by the database 530, and may also obtain the training samples from the cloud or other places for performing the model training, and the above description should not be taken as a limitation on the embodiment of the present application.
The target model/rule 501 obtained by training according to the training device 520 may be applied to different systems or devices, for example, the executing device 510 shown in fig. 5, where the executing device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an Augmented Reality (AR)/Virtual Reality (VR) device, a vehicle-mounted terminal, or a server or a cloud.
In particular, the target model/rule 501 obtained by the training device 520 may be passed to the execution device 510.
In fig. 5, the execution device 510 configures an input/output (I/O) interface 512 for data interaction with an external device, and a user may input data (e.g., to-be-processed data in the embodiment of the present application) to the I/O interface 512 through a client device 540.
The pre-processing module 513 and the pre-processing module 514 are configured to perform pre-processing according to input data received by the I/O interface 512. It should be understood that there may be no pre-processing module 513 and pre-processing module 514 or only one pre-processing module. When the pre-processing module 513 and the pre-processing module 514 are not present, the input data may be processed directly using the calculation module 511.
During the process of preprocessing the input data by the execution device 510 or performing the calculation and other related processes by the calculation module 511 of the execution device 510, the execution device 510 may call the data, codes and the like in the data storage system 550 for corresponding processes, or store the data, instructions and the like obtained by corresponding processes in the data storage system 550.
Finally, the I/O interface 512 presents the processing results (e.g., the processing results in the embodiment of the present application) to the client device 540, thereby providing them to the user.
In the case shown in fig. 5, the user can manually give input data, and this "manually give input data" can be operated through an interface provided by the I/O interface 512. Alternatively, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data to obtain authorization from the user, the user may set the corresponding permissions in the client device 540. The user can view the result output by the execution device 510 at the client device 540, and the specific presentation form can be display, sound, action, and the like. The client device 540 may also serve as a data collection terminal, collecting input data of the input I/O interface 512 and output results of the output I/O interface 512 as new sample data, as shown, and storing the new sample data in the database 530. Of course, the input data inputted to the I/O interface 512 and the output result outputted from the I/O interface 512 as shown in the figure may be directly stored in the database 530 as new sample data by the I/O interface 512 without being collected by the client device 540.
It should be noted that fig. 5 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 5, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510. It is to be appreciated that the execution device 510 described above can be deployed in the client device 540.
From the inference side of the model:
in this embodiment, the computation module 511 of the execution device 510 may acquire codes stored in the data storage system 550 to implement the model feed-forward process in this embodiment.
In this embodiment, the computation module 511 of the execution device 510 may include a hardware circuit (e.g., an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a digital signal processor (DSP), a microprocessor, a microcontroller, or the like), or a combination of these hardware circuits. For example, the computation module 511 may be a hardware system with a function of executing instructions, such as a CPU or a DSP, or a hardware system without a function of executing instructions, such as an ASIC or an FPGA, or a combination of a hardware system without a function of executing instructions and a hardware system with a function of executing instructions.
Specifically, the computation module 511 of the execution device 510 may be a hardware system having a function of executing instructions, the data processing method provided in the embodiment of the present application may be software code stored in a memory, and the computation module 511 of the execution device 510 may acquire the software code from the memory and execute it to implement the model feedforward process in the embodiment of the present application.
It should be understood that the computation module 511 of the execution device 510 may be a combination of a hardware system without a function of executing instructions and a hardware system with a function of executing instructions, and some steps of the data processing method provided by the embodiment of the present application may also be implemented by the hardware system without a function of executing instructions in the computation module 511 of the execution device 510, which is not limited herein.
From the training side of the model:
in this embodiment of the present application, the training device 520 may obtain codes stored in a memory (not shown in fig. 5, and may be integrated with the training device 520 or separately deployed from the training device 520) to implement the model training method in this embodiment of the present application.
In this embodiment, the training device 520 may include a hardware circuit (e.g., an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA), a general-purpose processor, a Digital Signal Processor (DSP), a microprocessor or a microcontroller, etc.), or a combination of these hardware circuits, for example, the training device 520 may be a hardware system with an instruction execution function, such as a CPU, a DSP, etc., or a hardware system without an instruction execution function, such as an ASIC, an FPGA, etc., or a combination of the above hardware systems without an instruction execution function and a hardware system with an instruction execution function.
Specifically, the training device 520 may be a hardware system having a function of executing instructions, the data processing method provided in the embodiment of the present application may be a software code stored in a memory, and the training device 520 may acquire the software code from the memory and execute the acquired software code to implement the model training method provided in the embodiment of the present application.
It should be understood that the training device 520 may be a combination of a hardware system without a function of executing instructions and a hardware system with a function of executing instructions, and some steps of the model training method provided by the embodiment of the present application may also be implemented by a hardware system without a function of executing instructions in the training device 520, which is not limited herein.
Referring to fig. 6, fig. 6 is a schematic diagram of an embodiment of a model training method provided in an embodiment of the present application. The model training method provided in the embodiment of the present application may be applied to a training device; the training device may be a terminal device such as a mobile phone, a tablet, a notebook computer, or an intelligent wearable device, or may be a device with data processing capability such as a server or a chip. As shown in fig. 6, the model training method provided in the embodiment of the present application may include:
601. a sample image is acquired.
In an automatic driving scene, a street view image often contains multiple instances (or called objects) of pedestrians, vehicles and the like, has strong global inconsistency and does not conform to a single-instance assumption (image blocks from the same image describe the same semantic information, and the features of the image blocks should be as close as possible, while image blocks from different images describe different semantic information, so the features of the image blocks should be as different as possible), so that the existing method is often limited to a data set with the single-instance assumption, and it is difficult to fully utilize a more realistic multi-instance automatic driving data set.
In this embodiment of the application, the sample image may be a multi-instance image (e.g., a street view image) in the automatic driving scene.
In one possible implementation, the sample image may include a plurality of objects including at least one of a person, a vehicle, a traffic sign, a lane line, and a plant. Illustratively, the object may include, but is not limited to, at least one of the following: dynamic obstacles (pedestrians (Pedestrian), cyclists (Cyclist), tricycles (Tricycle), cars (Car), trucks (Truck), buses (Bus)), static obstacles (traffic cones (TrafficCone), traffic sticks (TrafficStick), fire hydrants (FireHydrant), motorcycles (Motocycle), bicycles (Bicycle)), and traffic signs (general traffic signs (TrafficSign), guide signs (GuideSign), billboards (Billboard), red traffic lights (TrafficLight_Red)/yellow traffic lights (TrafficLight_Yellow)/green traffic lights (TrafficLight_Green)/black traffic lights (TrafficLight_Black), road signs (RoadSign)).
602. Sampling the sample image to obtain a first image block and a second image block, wherein the first image block and the second image block are different image blocks on the sample image.
After the sample image is acquired, the sample image may be sampled to obtain a plurality of partial images (e.g., a first image block and a second image block), and the first image block and the second image block may be small images obtained by sampling the sample image (where small refers to that the image areas of the first image block and the second image block are smaller relative to the size of the sample image).
The sampling may be understood as clipping, in one sampling process, a rectangular area may be randomly determined on a sample image, and an image in the determined rectangular area may be used as a first image block, and in another sampling process, a rectangular area may be randomly determined on the sample image, and an image in the determined rectangular area may be used as a second image block.
The above-mentioned sampling may be random sampling, so-called random sampling, which is to be understood that the position of the sampling and/or the size of the sampling is randomly determined.
The first image block and the second image block may be rectangular images.
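For ease of understanding, a minimal sketch of the random rectangular sampling described above is given below. The function name and the size bounds are assumptions made only for illustration and are not part of the claimed method.

```python
import random

def sample_random_block(img_w, img_h, min_frac=0.2, max_frac=0.8):
    """Randomly pick a rectangular block (x1, y1, x2, y2) on an img_w x img_h image.

    min_frac / max_frac bound the block's width and height relative to the image
    size; these bounds are illustrative assumptions, not values from the embodiment.
    """
    bw = random.randint(int(min_frac * img_w), int(max_frac * img_w))
    bh = random.randint(int(min_frac * img_h), int(max_frac * img_h))
    x1 = random.randint(0, img_w - bw)
    y1 = random.randint(0, img_h - bh)
    return (x1, y1, x1 + bw, y1 + bh)

# Two independent samplings give the first image block and the second image block.
first_block = sample_random_block(1920, 1280)
second_block = sample_random_block(1920, 1280)
```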
603. And respectively extracting the features of the first image block and the second image block through a feature extraction network based on the fact that the cross-over ratio of the first image block to the second image block on the sample image is larger than a threshold value, so as to obtain a first feature map and a second feature map.
A common region (overlap region) may exist between the region where the first image block is located on the sample image and the region where the second image block is located on the sample image. The ratio between the area of this common region and the total area covered by the first image block and the second image block on the sample image is the cross-over ratio (intersection over union, IoU) between the regions where the first image block and the second image block are located on the sample image (for example, see fig. 7).
In one possible implementation, the threshold may be a value greater than or equal to 0.4.
Among them, the root cause of global inconsistency in multi-instance scenes is that randomly cut image blocks may represent completely different scenes and semantic information, but if two image blocks can be controlled to be sufficiently "close" to each other, then within a local area, according to image continuity, the two image blocks can be considered to describe similar semantic information, thereby restoring the basic assumption of contrast learning. In the embodiment of the present application, an intersection ratio between two image blocks may be used as a way to measure the distance, where the intersection ratio refers to a ratio of an area of a superposition area of the two image blocks to a total coverage area (see fig. 7, left). It is required that the subsequent feature extraction and similarity calculation be performed only when the intersection ratio of two randomly generated image blocks is greater than a given threshold (see fig. 7 right).
It should be understood that the cross-over ratio threshold obviously cannot be set too low, but a higher value is not always better: it is not desirable that the two image blocks be completely uncorrelated, but it is equally undesirable that they be essentially identical. The selection of the cross-over ratio threshold is therefore, in fact, a balance between data noise and data complexity in multi-instance unsupervised learning.
Exemplarily, fig. 8 shows an example of a sample image; fig. 9 illustrates a first image block and a second image block whose cross-over ratio IoU is 0.3; fig. 10 illustrates a first image block and a second image block whose cross-over ratio IoU is 0.5; and fig. 11 illustrates a first image block and a second image block whose cross-over ratio IoU is 0.7.
In the embodiment of the application, based on that the cross-over ratio of the areas of the first image block and the second image block on the sample image is greater than a threshold, feature extraction may be performed on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map.
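An illustrative sketch of the cross-over ratio gating described above is given below, assuming the rectangular blocks are represented as (x1, y1, x2, y2) tuples. The resampling loop simply re-draws the two blocks until their IoU exceeds the threshold; the sampler can be, for example, the `sample_random_block` sketch above.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def sample_block_pair(sampler, iou_threshold=0.4):
    """Re-draw two random blocks until their cross-over ratio exceeds the threshold."""
    while True:
        a, b = sampler(), sampler()
        if box_iou(a, b) > iou_threshold:
            return a, b
```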
Next, a feature extraction network in the embodiment of the present application is described:
in one possible implementation, the feature extraction network may be a backbone network, and the backbone network is configured to receive an input image (for example, the first image block and the second image block in this embodiment) and perform convolution processing on the input image to generate a plurality of feature maps.
It should be noted that "performing convolution processing on the input image" herein is not to be understood to be performing convolution processing only on the input image, and in some implementations, convolution processing and other processing may be performed on the input image.
It should be noted that "performing convolution processing on the first image to generate a plurality of feature maps" herein should not be understood to mean performing convolution processing on the image a plurality of times, and generating one feature map each time, that is, it should not be understood that each feature map is obtained based on performing convolution processing on the image, but rather, the image is a source of a plurality of feature maps as a whole; in one implementation, the image may be convolved to obtain one feature map, and then the generated feature map may be convolved to obtain another feature map, and so on, to obtain multiple feature maps.
In addition, a series of convolution processes may be performed on the input image, and specifically, each time a convolution process is performed, a feature map obtained by a previous convolution process may be subjected to a convolution process to obtain one feature map.
It should be noted that the feature maps may be feature maps with multi-scale resolution, that is, the feature maps are not feature maps with the same resolution, and in an alternative implementation, the feature maps may form a feature pyramid.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a feature extraction network according to an embodiment of the present application. As shown in fig. 12, the backbone network is configured to receive an input image, perform convolution processing on the input image, and output feature maps (feature map C1, feature map C2, feature map C3, and feature map C4) with different resolutions corresponding to the image; that is to say, feature maps corresponding to the images in different sizes are output, and the backbone network completes extraction of basic features and provides corresponding features for subsequent detection.
Specifically, the backbone network may perform a series of convolution processes on the input image to obtain feature maps at different scales (with different resolutions). These feature maps provide the base features for subsequent detection modules. The backbone network may take various forms, such as a Visual Geometry Group (VGG) network, a residual neural network (Resnet), or the core structure of GoogLeNet (Inception-net), and the like.
The backbone network can perform convolution processing on the input image to generate a plurality of convolution feature maps with different scales, where each feature map is an H × W × C matrix, H being the height of the feature map, W the width of the feature map, and C the number of channels of the feature map.
The backbone may adopt various existing convolutional network frameworks, such as VGG16, Resnet50, Inception-Net, etc.; in the following, Resnet18 is described as the backbone by way of example.
Assume that the resolution of the input image is H × W × 3 (height H, width W, and 3 channels, i.e., the three RGB channels). The input image may be convolved by the convolution layer Res18-Conv1 of Resnet18 to generate feature map C1, which is downsampled twice (by an overall factor of 4) relative to the input image, with the number of channels expanded to 64, so that the resolution of C1 is H/4 × W/4 × 64. C1 may be subjected to a convolution operation through Res18-Conv2 of Resnet18 to obtain feature map C2, whose resolution is consistent with that of C1. C2 then continues through Res18-Conv3 to generate feature map C3, which is further downsampled compared with C2 with the number of channels doubled, giving a resolution of H/8 × W/8 × 128. Finally, C3 is convolved by Res18-Conv4 to generate feature map C4 with a resolution of H/16 × W/16 × 256.
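For illustration only, the multi-scale feature maps C1–C4 described above can be obtained roughly as sketched below, using torchvision's ResNet-18 as a stand-in for the Res18-Conv1 to Res18-Conv4 stages; the module grouping is an assumption based on the resolutions given in the text, not a verbatim specification of the embodiment.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

class Res18Backbone(nn.Module):
    """Returns feature maps C1-C4 with the resolutions described above:
    C1/C2: H/4 x W/4 x 64, C3: H/8 x W/8 x 128, C4: H/16 x W/16 x 256."""

    def __init__(self):
        super().__init__()
        net = resnet18(weights=None)
        # Res18-Conv1: conv + bn + relu + maxpool (overall 4x downsampling)
        self.conv1 = nn.Sequential(net.conv1, net.bn1, net.relu, net.maxpool)
        self.conv2 = net.layer1   # Res18-Conv2: keeps resolution, 64 channels
        self.conv3 = net.layer2   # Res18-Conv3: H/8, 128 channels
        self.conv4 = net.layer3   # Res18-Conv4: H/16, 256 channels

    def forward(self, x):
        c1 = self.conv1(x)
        c2 = self.conv2(c1)
        c3 = self.conv3(c2)
        c4 = self.conv4(c3)
        return c1, c2, c3, c4

x = torch.randn(1, 3, 224, 224)
c1, c2, c3, c4 = Res18Backbone()(x)  # shapes: 64x56x56, 64x56x56, 128x28x28, 256x14x14
```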
It should be noted that the backbone network in the embodiment of the present application may also be referred to as a backbone network, and is not limited herein.
It should be noted that the backbone network shown in fig. 12 is only one implementation and does not limit the present application.
In this embodiment of the present application, feature extraction is performed on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map, where the first feature map may be multiple feature maps (a feature pyramid) obtained by processing the first image block through the feature extraction network, or may be one feature map obtained after a certain convolution operation, and similarly, the second feature map may be multiple feature maps (a feature pyramid) obtained by processing the second image block through the feature extraction network, or may be one feature map obtained after a certain convolution operation.
The first feature map and the second feature map have the same size.
604. Determining loss according to the difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network.
In the embodiment of the application, after feature extraction is respectively performed on the first image block and the second image block through a feature extraction network based on that the cross-over ratio of the first image block and the second image block between the areas of the sample image is greater than a threshold value to obtain a first feature map and a second feature map, loss can be determined according to the difference between the first feature map and the second feature map, and the feature extraction network is updated based on the loss to obtain an updated feature extraction network.
In a possible implementation, the first image block and the second image block may be aligned first to obtain an aligned first image block and an aligned second image block;
in one implementation, the sample image may include a target area, where the target area is an overlapping area where the first image block and the second image block are located on the sample image, and a first sub-feature map corresponding to the target area in the first feature map and a second sub-feature map corresponding to the target area in the second feature map may be determined according to the target area; upsampling the first sub-feature map to obtain the aligned first image block; and upsampling the second sub-feature map to obtain the aligned second image block, wherein the size of the aligned first image block is consistent with that of the aligned second image block.
In order to distinguish multi-instance features, the global pooling layer behind the backbone network needs to be abandoned so as to maintain the two-dimensional structure and position information of the feature map; however, this brings an additional problem of feature misalignment, namely that the two two-dimensional feature maps no longer have a one-to-one correspondence at the same relative positions. In the embodiments of the present application, two different ways are provided for feature alignment. The first is region-of-interest (region of interest) alignment: the overlapping portion of the two image blocks is taken as the region of interest of both, and, for example, RoI Align may be used to extract only the features of the overlapping portion for subsequent calculation; this is intuitive but does not fully utilize the information of the non-overlapping portions (for example, see fig. 13).
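A minimal sketch of the region-of-interest style alignment is given below, under the assumption that the coordinates of the overlap region and of each image block on the sample image are known; the sub-feature map of the overlap is cropped from each feature map and resized to a common size. The function name, the integer rounding, and the use of bilinear interpolation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def crop_overlap_features(feat, block_box, overlap_box, out_size):
    """Crop the part of `feat` (1 x C x h x w) corresponding to `overlap_box`,
    where boxes are (x1, y1, x2, y2) in sample-image coordinates and `block_box`
    is the image block this feature map was extracted from, then resize the crop
    to `out_size` = (H, W)."""
    _, _, h, w = feat.shape
    bx1, by1, bx2, by2 = block_box
    ox1, oy1, ox2, oy2 = overlap_box
    # map overlap coordinates into the feature map's own grid
    sx, sy = w / (bx2 - bx1), h / (by2 - by1)
    fx1, fx2 = int((ox1 - bx1) * sx), int((ox2 - bx1) * sx)
    fy1, fy2 = int((oy1 - by1) * sy), int((oy2 - by1) * sy)
    fx2, fy2 = max(fx2, fx1 + 1), max(fy2, fy1 + 1)   # keep at least one cell
    crop = feat[:, :, fy1:fy2, fx1:fx2]
    return F.interpolate(crop, size=out_size, mode="bilinear", align_corners=False)

# The two cropped-and-resized sub-feature maps then correspond position by
# position and can be compared directly.
```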
In one implementation, the first feature map and the second feature map have the same size, the first feature map includes M first feature points, the second feature map includes M second feature points, the M first feature points correspond to M first pixel points in the sample image, the M second feature points correspond to M second pixel points in the sample image, the M first pixel points correspond to the M second pixel points one-to-one, a third feature map can be obtained according to the M first pixel points and the M second pixel points, the third feature map and the second feature map have the same size, the third feature map includes M third feature points, and each third feature point is obtained based on a pixel position difference between the first pixel points and the second pixel points having a corresponding relationship; and fusing the third feature map and the first feature map to obtain the aligned first image block, wherein the second feature map is used as the aligned second image block. Optionally, the third feature map and the first feature map may be spliced in the depth direction.
The second is displacement alignment: for each pair of feature points located at the same relative position, the coordinate displacement between the corresponding pixel points in the original image is calculated and concatenated with the feature map (spliced in the depth direction), and this displacement is provided as additional side information to the prediction network for implicit feature alignment to help subsequent feature prediction (for example, as shown in fig. 14). In this way, the feature information of the non-overlapping region is fully utilized, and the subsequent similarity measurement is more flexible.
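For illustration, a minimal sketch of the displacement-alignment variant follows, assuming each H × W feature map is accompanied by the (x, y) original-image coordinates of the pixel that each feature point corresponds to; the per-position coordinate differences are concatenated to the first feature map along the channel (depth) dimension.

```python
import torch

def displacement_align(feat1, coords1, coords2):
    """feat1: 1 x C x H x W feature map of the first image block.
    coords1 / coords2: 1 x 2 x H x W original-image (x, y) coordinates of the
    pixel that each feature point of the first / second feature map maps to.
    Returns feat1 with the per-position displacement concatenated in depth."""
    displacement = coords1 - coords2                 # 1 x 2 x H x W coordinate offsets
    return torch.cat([feat1, displacement], dim=1)   # 1 x (C + 2) x H x W
```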
After obtaining the aligned first feature map and the aligned second feature map, the loss may be determined according to a difference between the aligned first feature map and the aligned second feature map.
Next, how to obtain the difference between the aligned first feature map and the aligned second feature map is described:
in one possible implementation, the M first feature points of the first feature map may be processed through a target prediction network to obtain a predicted value of each first feature point; wherein the target prediction network may comprise a convolution operation (e.g. a 1 x 1 convolution kernel).
The step of processing the M first feature points of the first feature map through the target prediction network may be referred to as the online branch. The online branch predicts the cluster-center result of the target branch through the prediction network; optionally, the online branch may additionally be configured with a spatial self-attention module, so as to make a more accurate prediction by fully considering context information. Specifically, if the online branch input is R, the final online branch prediction result Q is:
[Equation provided as an image in the original: Q is obtained by applying the prediction network q_θ(·) to the input R, with each spatial position of R aggregated over all H × W positions using the weights sim(·,·).]

wherein H and W are the height and width of R and R′, q_θ(·) is the prediction network, and sim(·,·) is defined as:

sim(R_{i,j}, R_{i′,j′}) = (max(cos(R_{i,j}, R_{i′,j′}), 0))^2
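A sketch of the similarity weighting used by the spatial self-attention of the online branch follows, using the sim(·,·) definition above. How the weights are normalized and combined with the prediction network is an assumption made only for illustration and may differ from the exact form of the omitted equation.

```python
import torch
import torch.nn.functional as F

def sim_weights(R):
    """R: C x H x W feature map. Returns an (H*W) x (H*W) weight matrix where
    sim(a, b) = (max(cos(a, b), 0)) ** 2, as defined above."""
    C, H, W = R.shape
    flat = F.normalize(R.reshape(C, H * W), dim=0)   # unit-norm feature per position
    cos = flat.t() @ flat                            # pairwise cosine similarities
    return cos.clamp(min=0.0) ** 2

def self_attention_aggregate(R):
    """Aggregate every position of R over all H*W positions using the sim weights
    (normalized here by their row sums -- an illustrative assumption)."""
    C, H, W = R.shape
    w = sim_weights(R)
    w = w / w.sum(dim=1, keepdim=True).clamp(min=1e-6)
    flat = R.reshape(C, H * W)
    out = flat @ w.t()                               # C x (H*W) aggregated features
    return out.reshape(C, H, W)
```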
in a possible implementation, based on a target clustering algorithm, the M second feature points of the first feature point map are clustered to update feature values of the M second feature points, where the feature value of each updated second feature point is a cluster center feature value of a cluster category in which the feature value is located.
The step of clustering the M second feature points of the second feature map based on a target clustering algorithm may be referred to as the target branch. The target clustering algorithm may be, but is not limited to, K-means, mean-shift clustering, density-based clustering, hierarchical clustering, graph community detection, Gaussian mixture models (GMM), and similar clustering methods.
In a multi-instance scene there is a clustering hierarchy, which, from top to bottom, comprises category clusters, instance clusters, and pixel clusters. Meanwhile, the definition of a cluster is a relative concept, that is, the same pixel point may belong to different clusters in image blocks with different contexts (see the two image blocks shown at the lower right of fig. 15: the pixel point P belongs to the category cluster of people in the left image block and to the category cluster of men in the right image block). Based on this observation, the features of different instances can be distinguished through intra-image clustering, so that the features of pixel points in the same cluster are as close as possible and the features of pixel points in different clusters are as different as possible, and consistency between the cluster analysis results of the two image blocks is promoted. As shown in fig. 16, optionally, K-means clustering may be used to obtain the cluster-center (mean, smoothed) feature of each point on the target branch.
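An illustrative sketch of the target branch's clustering step follows: each of the H × W feature points of the target feature map is assigned to a cluster by K-means, and its value is replaced by the feature of its cluster center. The use of scikit-learn and the number of clusters are assumptions made for illustration only.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_center_features(feat, num_clusters=8):
    """feat: C x H x W numpy array (the target-branch feature map).
    Returns a C x H x W array in which every feature point is replaced by the
    center feature of the K-means cluster it belongs to."""
    C, H, W = feat.shape
    points = feat.reshape(C, H * W).T                # (H*W) x C feature points
    km = KMeans(n_clusters=num_clusters, n_init=10).fit(points)
    centers = km.cluster_centers_[km.labels_]        # (H*W) x C cluster-center features
    return centers.T.reshape(C, H, W)
```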
In the embodiment of the application, by utilizing intra-image clustering, on the one hand the network gains the ability to distinguish the features of different instances; on the other hand, by considering the overall information of the two-dimensional feature map, the regression target provided by the target branch is more robust, and the online branch introduces a global view by deploying a self-attention module to obtain a more accurate prediction result.
After obtaining the predicted value of each first feature point and the feature value of each updated second feature point, a loss may be determined according to a difference between the predicted value of each first feature point and the feature value of each updated second feature point, where the loss may represent a difference (or may be described as a similarity) between the predicted value of each first feature point and the feature value of each updated second feature point. The two branches utilize the whole information of the two-dimensional characteristic diagrams to the maximum extent through different modes, so that the similarity of the two-dimensional characteristic diagrams can be better measured.
In one implementation, the similarity may be defined as:
[Equation provided as an image in the original: the two-dimensional similarity compares, at every spatial position (i, j), the online prediction Q_{i,j} with the target cluster-center feature kmeans(R′_{i,j}), and averages the per-position similarities over the H × W positions.]

wherein the inputs of the online branch and the target branch are R and R′ respectively, Q_{i,j} is the output of the online branch, and kmeans(R′_{i,j}) is the output of the target branch.
After the loss is obtained, the online branch (e.g., including the target prediction network and the feature extraction network) may be updated based on the loss. For example, the online branch parameters may be updated using gradient descent while the target branch parameters are updated as a moving average of the online branch parameters, and the training process may be iterated until the network converges.
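A sketch of the parameter update described above follows, assuming BYOL-style roles: the online branch is updated by gradient descent on the loss, and each target-branch parameter is updated as an exponential moving average of the corresponding online-branch parameter. The momentum value and the loss form (negative cosine similarity between predictions and detached cluster-center targets) are assumptions consistent with the text, not a verbatim specification.

```python
import torch
import torch.nn.functional as F

def prediction_loss(pred, target):
    """Negative cosine similarity between online predictions and target
    cluster-center features, averaged over all N x H x W positions.
    pred / target: N x C x H x W; the target is detached (gradient truncation)."""
    pred = F.normalize(pred, dim=1)
    target = F.normalize(target.detach(), dim=1)
    return -(pred * target).sum(dim=1).mean()

@torch.no_grad()
def ema_update(online_net, target_net, momentum=0.99):
    """Update target-branch parameters as a moving average of the online ones."""
    for p_t, p_o in zip(target_net.parameters(), online_net.parameters()):
        p_t.mul_(momentum).add_(p_o, alpha=1.0 - momentum)
```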
Referring to fig. 17 and fig. 18, fig. 17 and fig. 18 are flow charts of the data enhancement, feature extraction and network training of a multi-instance self-supervised learning framework (MultiSiam) according to an embodiment of the present application:
For each input image, two image blocks are obtained by random cropping; whether the cross-over ratio of the two image blocks is greater than the given threshold is checked, and if not, cropping is repeated until the cross-over ratio exceeds the threshold, after which scale adjustment and color/texture enhancement are applied. The backbone network is then used to extract the two-dimensional feature maps of the two image blocks. A projection network reduces the dimension of the two-dimensional feature maps of the two image blocks so as to reduce the subsequent amount of calculation. The feature alignment module restores the one-to-one correspondence at the same relative positions of the two-dimensional feature maps using region-of-interest alignment or displacement alignment; note that the feature alignment module does not change the spatial scale of the input feature maps. A K-means clustering algorithm is run on the target-branch (left) feature map to obtain the clustering result of each pixel point and the corresponding cluster-center feature; at the same time, in order to prevent network degradation, a gradient truncation operation is added at the end of the target branch. The online branch (right) predicts the target-branch clustering result through the self-attention module and the prediction network, the two-dimensional clustering similarity is calculated according to its definition and weighted with the one-dimensional feature similarity (for example, the weights of the two may both be 0.5 by default) to obtain the final feature similarity measure. Finally, the online branch parameters are updated using gradient descent, the target branch parameters are updated as the moving average of the online branch parameters, and the process is iterated until the network converges.
It should be understood that the above process of training the feature extraction network may be a pre-training process, and after obtaining the updated feature extraction network, the updated feature extraction network may be subjected to fine tuning for a target task in combination with a downstream task data set.
For example, the updated feature extraction network may be connected to a downstream task network, and the updated feature extraction network and the downstream task network may be fine-tuned based on a downstream task data set to obtain a target network, and the target network may be deployed on an execution device to perform a feed-forward process.
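A minimal sketch of this fine-tuning stage is given below: the pre-trained (updated) feature extraction network is combined with a downstream task network (here an arbitrary head module) to form the target network, and both are fine-tuned on the downstream dataset. All names, the optimizer, and its settings are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Pre-trained backbone + downstream task network (e.g. a detection head)."""

    def __init__(self, backbone: nn.Module, task_head: nn.Module):
        super().__init__()
        self.backbone = backbone
        self.task_head = task_head

    def forward(self, images):
        feats = self.backbone(images)
        return self.task_head(feats)

def fine_tune(model, loader, loss_fn, epochs=12, lr=1e-3):
    """Fine-tune the whole target network on a downstream task dataset."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model
```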
For example, taking a downstream task as an example of target detection, the downstream task network may include one or more heads, as shown in fig. 19, where the one or more heads are configured to detect a task object in one task according to a feature map output by the feature extraction network, and output 2D frames of an area where the task object is located and a confidence corresponding to each 2D frame; optionally, multiple heads may be arranged in parallel, and each head may complete detection of different task objects; the task object is an object needing to be detected in the task; the higher the confidence is, the higher the probability that the object corresponding to the task exists in the 2D frame corresponding to the confidence is.
In the embodiment of the application, different heads may complete different 2D detection tasks. For example, one head of the plurality of heads may complete vehicle detection and output the 2D frames and confidences of Car/Truck/Bus; another head (head1) of the plurality of heads may complete person detection and output the 2D frames and confidences of Pedestrian/Cyclist/Tricycle; and a further head of the plurality of heads may complete traffic-light detection and output the 2D frames and confidences of Red_TrafficLight/Green_TrafficLight/Yellow_TrafficLight/Black_TrafficLight.
In the embodiment of the present application, the downstream task network may include a plurality of serial heads; the serial head is connected with a parallel head; it is emphasized here that in practice a serial head is not necessary, and for scenarios where only 2D boxes need to be detected, no serial head need be included.
Wherein the serial head may be used to: extract, using the 2D frame of a task object of a task provided by the parallel head to which it is connected, the features of the region where the 2D frame is located on one or more feature maps of the feature extraction network, and predict the 3D information, Mask information, or Keypoint information of the task object of the task according to the features of the region where the 2D frame is located. The serial head is optionally connected in series behind a parallel head, and completes 3D/Mask/Keypoint detection of the object in the 2D frame on the basis of the detected 2D frame of the task. For example, the serial 3D_head0 completes the estimation of the orientation, centroid, and length/width/height of the vehicle, thereby outputting the 3D frame of the vehicle; the serial Mask_head0 predicts a fine mask of the vehicle, thereby segmenting the vehicle; and the serial Keypoint_head0 completes the estimation of the key points of the vehicle. The serial head is not necessary: some tasks do not need 3D/Mask/Keypoint detection and therefore do not need a serial head, for example traffic-light detection only needs the 2D frame. In addition, some tasks may choose to connect one or more serial heads according to their specific requirements; for example, detection of a parking lot (Parkingslot) needs the 2D frame and the key points of the parking space, so only one serial Keypoint_head needs to be connected in series for this task, and no 3D or Mask head is needed.
In this embodiment of the application, the head is connected to the feature extraction network; the head may complete the detection of the 2D frame of a task according to the feature map provided by the feature extraction network and output the 2D frame of the object of the task and the corresponding confidence, and so on. A structural schematic of the head is described next. Referring to fig. 20, fig. 20 is a schematic of the head; as shown in fig. 20, the head includes three modules: a candidate region generation network (region proposal network, RPN), an ROI-ALIGN module, and an RCNN module.
The RPN module may be configured to predict the region where the task object is located on one or more feature maps provided by the feature extraction network, and output candidate 2D frames matching the region. Alternatively, it can be understood that the RPN predicts regions where the task object may exist on one or more feature maps output by the feature extraction network, and gives frames of these regions, which are called candidate regions (Proposals). For example, when the head is responsible for detecting a car, its RPN layer predicts candidate frames in which a car may exist; when the head is responsible for detecting a person, its RPN layer predicts candidate frames in which a person may exist. Of course, these Proposals are inaccurate: on the one hand they do not necessarily contain the object of the task, and on the other hand the frames are not compact.
The 2D candidate region prediction process may be implemented by the RPN module of the head, which predicts regions where the task object may exist according to the feature map provided by the feature extraction network and provides candidate frames (also called candidate regions, Proposals) for these regions. In this embodiment, if the head is responsible for detecting cars, its RPN layer predicts candidate frames in which cars may exist.
The RPN layer may generate a feature map RPN Hidden by, for example, a 3 × 3 convolution on a feature map provided by the feature extraction network. The RPN layer of the head then predicts Proposals from RPN Hidden. Specifically, the RPN layer of the head predicts the coordinates and confidence of a Proposal at each position of RPN Hidden by a 1 × 1 convolution. The higher this confidence, the greater the probability that the object of this task exists in this Proposal; for example, a higher score for a certain Proposal in the head indicates a higher probability that a vehicle is present. The Proposals predicted by each RPN layer then need to go through a Proposal merging module: redundant Proposals are removed according to the degree of overlap between Proposals (this process may adopt, but is not limited to, the NMS algorithm), and the N (N < K) Proposals with the largest scores are selected from the remaining K Proposals as candidate regions where objects may exist. These Proposals are inaccurate: on the one hand they do not necessarily contain the object of the task, and on the other hand the frames are not tight. Therefore, the RPN module performs only a coarse detection process, and a subsequent RCNN module is needed for refinement. When the RPN module regresses the coordinates of a Proposal, it regresses coordinates relative to an anchor rather than directly regressing the absolute coordinate values. The better these anchors match the actual objects, the greater the probability that the RPN can detect the objects.
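For illustration only, a rough sketch of the RPN computation described above follows: a 3 × 3 convolution produces the RPN Hidden feature map, and two 1 × 1 convolutions predict, at every position, a confidence and box coordinates regressed relative to anchors. The channel sizes and the single-anchor simplification are assumptions.

```python
import torch
import torch.nn as nn

class SimpleRPN(nn.Module):
    """3x3 conv -> RPN Hidden; 1x1 convs -> per-position confidence and box
    offsets relative to a single anchor (simplified illustration)."""

    def __init__(self, in_channels=256, hidden_channels=256, num_anchors=1):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Conv2d(in_channels, hidden_channels, 3, padding=1), nn.ReLU())
        self.cls = nn.Conv2d(hidden_channels, num_anchors, 1)       # confidence
        self.reg = nn.Conv2d(hidden_channels, num_anchors * 4, 1)   # box offsets

    def forward(self, feat):
        h = self.hidden(feat)
        return self.cls(h), self.reg(h)

scores, deltas = SimpleRPN()(torch.randn(1, 256, 14, 14))
# Proposals would then be decoded from the anchors + deltas, scored by `scores`,
# and pruned with NMS (e.g. torchvision.ops.nms).
```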
The ROI-ALIGN module is used for extracting, according to the region predicted by the RPN module, the features of the region where each candidate 2D frame is located from a feature map provided by the feature extraction network. That is, the ROI-ALIGN module mainly extracts, according to the Proposals provided by the RPN module, the features of the region where each Proposal is located on a certain feature map, and resizes them to a fixed size to obtain the features of each Proposal. It is understood that the ROI-ALIGN module may use, but is not limited to, feature extraction methods such as ROI-POOLING/ROI-ALIGN/PS-ROIPOOLING/PS-ROIALIGN.
The RCNN module is used for performing convolution processing on the features of the region where each candidate 2D frame is located through a neural network to obtain the confidence that the candidate 2D frame belongs to each object class, and for adjusting the coordinates of the 2D frame of the candidate region through a neural network so that the adjusted 2D frame matches the shape of the actual object better than the candidate 2D frame; an adjusted 2D frame with a confidence greater than a preset threshold is selected as the 2D frame of the region. That is to say, the RCNN module mainly refines the features of each Proposal provided by the ROI-ALIGN module, obtains the confidence of each category to which each Proposal belongs (for example, for the vehicle task, the 4 scores background/Car/Truck/Bus may be given), adjusts the coordinates of the 2D frame of the Proposal, and outputs a more compact 2D frame. These 2D frames are merged by non-maximum suppression (NMS) and output as the final 2D frames.
The subdivision and classification of the 2D candidate regions is mainly implemented by the RCNN module of the head in fig. 20, which further regresses more compact 2D frame coordinates according to the features of each Proposal extracted by the ROI-ALIGN module, classifies the Proposal, and outputs the confidence that it belongs to each category. The RCNN can take many realizable forms. The feature size output by the ROI-ALIGN module may be N × 14 × 14 × 256 (the features of the Proposals), which is first processed in the RCNN module by convolution module 4 (Res18-Conv5) of Resnet18, giving an output feature size of N × 7 × 7 × 512; a global average pooling layer (Global Avg Pool) then averages the 7 × 7 features of each channel of the input features to obtain N × 512 features, where each 1 × 512-dimensional feature vector represents the features of one Proposal. Next, the exact coordinates of the frame (an N × 4 output vector, where the 4 values indicate the x/y coordinates of the frame center and the width and height of the frame) and the confidence of the frame's category (in head0, a score that the frame is background/Car/Truck/Bus needs to be given) are regressed separately through 2 fully connected layers (FC). Finally, the frames with the largest scores are selected through a frame merging operation, and repeated frames are removed through the NMS operation, so as to obtain compact frame outputs.
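A sketch of this refinement is given below, following the stated shapes (N × 14 × 14 × 256 Proposal features → Res18-Conv5-like convolution → N × 7 × 7 × 512 → global average pooling → N × 512 → two fully connected heads for class confidences and frame coordinates); the concrete convolution stack is an assumption standing in for Res18-Conv5.

```python
import torch
import torch.nn as nn

class SimpleRCNNHead(nn.Module):
    """Refines N x 256 x 14 x 14 Proposal features into per-Proposal class
    confidences and frame coordinates, following the shapes in the text."""

    def __init__(self, num_classes=4):     # e.g. background/Car/Truck/Bus
        super().__init__()
        # stand-in for Res18-Conv5: 14x14x256 -> 7x7x512
        self.conv = nn.Sequential(
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)        # global average pooling -> N x 512
        self.fc_cls = nn.Linear(512, num_classes)  # class confidences
        self.fc_box = nn.Linear(512, 4)            # frame center x/y, width, height

    def forward(self, roi_feats):                  # roi_feats: N x 256 x 14 x 14
        x = self.pool(self.conv(roi_feats)).flatten(1)
        return self.fc_cls(x), self.fc_box(x)

cls_scores, boxes = SimpleRCNNHead()(torch.randn(8, 256, 14, 14))
```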
In some practical application scenarios, the perception network may further include other heads and may further perform 3D/Mask/Keypoint detection on the basis of detecting the 2D frame. Illustratively, taking 3D as an example, the ROI-ALIGN module extracts, according to the accurate 2D frames provided by the head, the features of the region where each 2D frame is located on the feature map output by the feature extraction network. Assuming that the number of 2D frames is M, the feature size output by the ROI-ALIGN module is M × 14 × 14 × 256, which is first processed by Res18-Conv5 of Resnet18 to give an output feature size of M × 7 × 7 × 512, and then processed by a global average pooling layer (Global Avg Pool) that averages the 7 × 7 features of each channel of the input features to obtain M × 512 features, where each 1 × 512-dimensional feature vector represents the features of one 2D frame. Next, the orientation angle (orientation, an M × 1 vector), the centroid coordinates (centroid, an M × 2 vector whose 2 values represent the x/y coordinates of the centroid), and the length, width, and height (dimension) of the object in the frame are regressed through 3 fully connected layers (FC).
It should be noted that the header shown in fig. 19 and fig. 20 is only one implementation manner, and does not limit the present application.
Taking two large-scale multi-instance autopilot datasets as the upstream pre-training datasets as an example: in the selection of experimental data, upstream autopilot pre-training may use multi-instance autopilot datasets including the Waymo public dataset and the SODA-5M dataset. The Waymo public dataset contains nearly 790,000 unlabeled images with image scales ranging from (1920, 968) to (1920, 1280), and is the largest-scale openly available autonomous driving dataset at present; the SODA-5M dataset contains 5 million high-quality autonomous driving images, all at image scale (1920, 1280). For the downstream migration datasets, two widely used autonomous driving semantic segmentation benchmark tasks are mainly considered, Cityscapes and BDD100K. Cityscapes contains labels of 8 semantic categories and comprises 2975 training set images and 500 validation set images collected from 27 cities; BDD100K contains labels of 9 semantic categories and comprises 70000 training set images and 10000 validation set images collected from 4 cities.
Because the upstream dataset lacks manual labels, the performance of the algorithm cannot be evaluated on it directly. Instead, after the self-supervised pre-trained model is used to initialize the backbone network of a downstream task, the mean intersection-over-union (mIoU) of the prediction results on the downstream task validation set is used as the evaluation index to judge the quality of the pre-trained model, where a higher mIoU value is better. First, self-supervised pre-training is performed on the upstream autonomous driving dataset according to the proposed framework; after the network converges, upstream task-specific modules such as the projection network and the prediction network are discarded, and only the backbone network parameters are retained and migrated to the downstream task. The network parameters are then fine-tuned using the downstream task training set, the fine-tuned model makes predictions on the validation set, and the mIoU of the final prediction results is reported as the evaluation standard.
ResNet-50 is used as the default backbone network. To better accommodate large-batch training, the network training process is stabilized using the LARS optimizer and cosine learning-rate decay. To better compare the pre-training results of the model on different datasets, the GPU time is kept aligned across training on the different datasets; specifically, 325-epoch and 55-epoch pre-training were performed on Waymo and SODA-5M, respectively, for a fair comparison with 200-epoch ImageNet pre-training.
By pre-training on the public autonomous driving dataset Waymo, the method provided by the embodiment of the application obtains a significant performance improvement of 4.7% on Cityscapes and 3.9% on BDD100K over the reference model BYOL, achieving the current best result for Waymo pre-training and showing that a multi-instance dataset can be better used for feature learning. However, considering that Waymo has fewer images than ImageNet (0.79M vs. 1.28M) and has a strong foreground-background imbalance, Waymo is actually inferior to ImageNet in both the quantity and the quality of images, and its pre-training result can hardly surpass ImageNet pre-training. Therefore, the autonomous driving dataset SODA-5M is additionally used for pre-training; the migration result of ImageNet pre-training is then successfully exceeded, which shows the effectiveness of intra-domain pre-training.
It is noted that, under the single-instance assumption, data screening and cleaning must be performed manually in order to collect more images similar to the ImageNet dataset, whereas the multi-instance scenario is free from the single-instance assumption and ultra-large-scale pre-training datasets can be collected at very low cost, so multi-instance self-supervised learning is more practical for industrial-level datasets.
Referring to table 1, table 1 shows pre-training results of the embodiment of the present application on a multi-instance autopilot dataset, which results in significant performance improvement compared to a reference model; when the ultra-large scale automatic driving data set is used for pre-training, the pre-training successfully exceeds ImageNet pre-training, and the effectiveness of intra-domain pre-training is shown.
TABLE 1
Illustratively, the finally learned features are subjected to visual analysis, the last layer of two-dimensional feature map of the pre-trained backbone network is taken for K-means cluster analysis, and the clustering result is shown in fig. 21.
In addition, the scheme provided by the embodiment of the application is deployed on a single-instance dataset for self-supervised learning to verify the extensibility of the framework. Apart from the experimental data, all other settings are consistent with the technical scheme above. In the selection of experimental data, the ImageNet dataset, which satisfies the single-instance assumption, can be used as the upstream dataset; it contains 1000 classes of natural images and a total of 1.28 million training set pictures. Instead of using the manual labels, the image data can be used directly for self-supervised representation learning. For the downstream datasets, two general detection datasets, VOC and COCO, are additionally considered. The VOC dataset contains labeling information for 20 semantic categories, with about 10,000 training images and 4900 test images; the COCO dataset contains labeling information for 80 semantic categories, with about 118,000 training images and 5000 validation set images in total.
Verification shows that, although the embodiment of the application is specially designed for multi-instance scenes, it can still effectively utilize single-instance scene images to obtain high-quality feature representations and obtains the currently best migration results on multiple downstream tasks (see table 2), which demonstrates the universality and scalability of the framework of the embodiment of the application.
TABLE 2
The embodiment of the application provides a model training method, which comprises the following steps: acquiring a sample image; sampling the sample image to obtain a first image block and a second image block, wherein the first image block and the second image block are different image blocks on the sample image; respectively extracting the features of the first image block and the second image block through a feature extraction network based on the fact that the cross-over ratio of the first image block to the second image block on the sample image is larger than a threshold value, so as to obtain a first feature map and a second feature map; determining a loss according to a difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network. The cross-over ratio is used as a mode for measuring the relative distance between the two image blocks, semantic consistency in a local area is guaranteed by requiring the cross-over ratio of the image blocks to be larger than a given threshold value so as to restore a basic hypothesis of contrast learning, the image blocks are limited in the local area, therefore, global inconsistency is converted into local consistency, and the data processing accuracy of the updated feature extraction network is further improved. Meanwhile, the balance between data noise and data complexity under different scenes can be effectively controlled by selecting the threshold value.
Referring to fig. 22, fig. 22 is a schematic structural diagram of a model training apparatus provided in an embodiment of the present application, and as shown in fig. 22, an apparatus 2200 provided in the embodiment of the present application includes:
an acquiring module 2201, configured to acquire a sample image.
For a specific description of the obtaining module 2201, reference may be made to the description of step 601 in the foregoing embodiment, which is not described herein again.
A sampling module 2202, configured to sample the sample image to obtain a first image block and a second image block, where the first image block and the second image block are different image blocks in the sample image;
for a detailed description of the sampling module 2202, reference may be made to the description of step 602 in the foregoing embodiment, which is not described herein again.
A feature extraction module 2203, configured to perform feature extraction on the first image block and the second image block respectively through a feature extraction network to obtain a first feature map and a second feature map based on that a cross-over ratio of areas of the first image block and the second image block on the sample image is greater than a threshold;
for a detailed description of the feature extraction module 2203, reference may be made to the description of step 603 in the foregoing embodiment, which is not described herein again.
A model updating module 2204, configured to determine a loss according to a difference between the first feature map and the second feature map, and update the feature extraction network based on the loss to obtain an updated feature extraction network.
For a detailed description of the model updating module 2204, reference may be made to the description of step 604 in the foregoing embodiment, which is not described herein again.
In one possible implementation, the threshold is a value greater than or equal to 0.4.
In one possible implementation, the apparatus further comprises:
an alignment module, configured to align the first image block and the second image block before determining a loss according to a difference between the first feature map and the second feature map, so as to obtain an aligned first image block and an aligned second image block;
the model update module is specifically configured to:
determining a loss based on a difference between the aligned first feature map and the aligned second feature map.
In a possible implementation, the sample image includes a target area, the target area is an overlapping area where the first image block and the second image block are located on the sample image, and the alignment module is specifically configured to:
according to the target area, determining a first sub-feature map corresponding to the target area in the first feature map and a second sub-feature map corresponding to the target area in the second feature map;
performing upsampling on the first sub-feature map to obtain the aligned first image block;
and upsampling the second sub-feature map to obtain the aligned second image block, wherein the size of the aligned first image block is consistent with that of the aligned second image block.
In a possible implementation, the first feature map and the second feature map have the same size, the first feature map includes M first feature points, the second feature map includes M second feature points, the M first feature points correspond to M first pixel points in the sample image, the M second feature points correspond to M second pixel points in the sample image, the M first pixel points correspond to the M second pixel points in a one-to-one manner, and the alignment module is specifically configured to:
obtaining a third feature map according to the M first pixel points and the M second pixel points, wherein the third feature map and the second feature map have the same size, the third feature map comprises M third feature points, and each third feature point is obtained based on a pixel position difference between the first pixel point and the second pixel point having a corresponding relationship;
and fusing the third feature map and the first feature map to obtain the aligned first image block, wherein the second feature map is used as the aligned second image block.
In a possible implementation, the alignment module is specifically configured to:
and splicing the third feature map and the first feature map in the depth direction.
In a possible implementation, the model updating module is specifically configured to:
processing the M first feature points of the first feature map through a target prediction network to obtain a predicted value of each first feature point;
clustering the M second feature points of the second feature map based on a target clustering algorithm to update feature values of the M second feature points, wherein the feature value of each updated second feature point is the clustering center feature value of the cluster category where the updated second feature point is located;
and determining the loss according to the difference between the predicted value of each first feature point and the feature value of each updated second feature point.
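The following is only a rough sketch of this loss under stated assumptions (PyTorch; a 1x1-convolution predictor as the target prediction network, a naive k-means as the target clustering algorithm, a feature width of 256, and an L2 distance are all choices of this sketch, not of the embodiment):

```python
import torch
import torch.nn as nn

predictor = nn.Sequential(  # target prediction network (architecture assumed)
    nn.Conv2d(256, 256, 1), nn.ReLU(inplace=True), nn.Conv2d(256, 256, 1))

def clustering_loss(feat_a, feat_b, num_clusters=8, iters=10):
    """feat_a, feat_b: (C, H, W) aligned feature maps.
    Predict values from the first feature points, replace each second feature
    point by the center of the cluster it falls into, then take an L2 loss."""
    pred = predictor(feat_a[None])[0].flatten(1).t()        # (M, C) predictions
    pts = feat_b.flatten(1).t().detach()                    # (M, C) second feature points
    centers = pts[torch.randperm(len(pts))[:num_clusters]]  # naive k-means init
    for _ in range(iters):
        assign = torch.cdist(pts, centers).argmin(dim=1)
        for k in range(num_clusters):
            if (assign == k).any():
                centers[k] = pts[assign == k].mean(dim=0)
    targets = centers[assign]                               # updated feature values
    return ((pred - targets) ** 2).sum(dim=1).mean()
```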
In one possible implementation, the model updating module is further configured to:
and updating the target prediction network according to the loss.
In one possible implementation, the sample image includes a plurality of objects, the objects including at least one of a person, a vehicle, a traffic sign, a lane line, and a plant.
In one possible implementation, the apparatus further comprises:
the data processing module is used for acquiring a target network and an image to be processed, wherein the target network comprises the updated feature extraction network and a downstream task network;
and processing the image to be processed through the target network to obtain a processing result.
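As a usage illustration only (module and variable names here are hypothetical), the target network can be viewed as the updated feature extraction network followed by a downstream task network:

```python
import torch.nn as nn

class TargetNetwork(nn.Module):
    """Target network = updated feature extraction network + downstream task network."""
    def __init__(self, backbone, head):
        super().__init__()
        self.backbone = backbone  # updated feature extraction network
        self.head = head          # downstream task network, e.g. a detection or segmentation head

    def forward(self, image):
        feature_map = self.backbone(image)
        return self.head(feature_map)

# usage (names hypothetical):
# net = TargetNetwork(updated_backbone, downstream_head)
# result = net(image_to_process)
```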
Referring to fig. 23, fig. 23 is a schematic structural diagram of an execution device provided in an embodiment of the present application. The execution device 2300 may be embodied as a virtual reality (VR) device, a mobile phone, a tablet, a notebook computer, an intelligent wearable device, a monitoring data processing device or a server, which is not limited herein. Specifically, the execution device 2300 includes: a receiver 2301, a transmitter 2302, a processor 2303 and a memory 2304 (the number of processors 2303 in the execution device 2300 may be one or more, and one processor is taken as an example in fig. 23), where the processor 2303 may include an application processor 23031 and a communication processor 23032. In some embodiments of the present application, the receiver 2301, the transmitter 2302, the processor 2303 and the memory 2304 may be connected by a bus or by other means.
The memory 2304 may include both read-only memory and random access memory, and provides instructions and data to the processor 2303. A portion of the memory 2304 may also include non-volatile random access memory (NVRAM). The memory 2304 stores operating instructions, executable modules or data structures, or a subset or an expanded set thereof, where the operating instructions may include various operating instructions for performing various operations.
The processor 2303 controls the operation of the execution apparatus. In a particular application, the various components of the execution device are coupled together by a bus system that may include a power bus, a control bus, a status signal bus, etc., in addition to a data bus. For clarity of illustration, the various buses are referred to in the figures as a bus system.
The methods disclosed in the embodiments of the present application may be applied to the processor 2303 or implemented by the processor 2303. The processor 2303 may be an integrated circuit chip having a signal processing capability. In an implementation process, the steps of the above method may be completed by an integrated logic circuit of hardware in the processor 2303 or by instructions in the form of software. The processor 2303 may be a general-purpose processor, a digital signal processor (DSP), a microprocessor or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The processor 2303 may implement or perform the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the methods disclosed with reference to the embodiments of the present application may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 2304, and the processor 2303 reads information in the memory 2304 and completes the steps of the foregoing method in combination with its hardware.
The receiver 2301 may be used to receive input numeric or character information and generate signal inputs related to setting and function control of the execution device. The transmitter 2302 may be used to output numeric or character information through a first interface; the transmitter 2302 may also be used to send instructions to a disk group through the first interface to modify data in the disk group; and the transmitter 2302 may also include a display device such as a display screen.
In one embodiment of the present application, the processor 2303 is configured to run the model obtained by the model training method in the embodiment corresponding to fig. 6.
Referring to fig. 24, fig. 24 is a schematic structural diagram of a training device provided in an embodiment of the present application. The model training apparatus described in the embodiment corresponding to fig. 22 may be deployed on the training device 2400 to implement the functions of the model training apparatus in the embodiment corresponding to fig. 22. Specifically, the training device 2400 is implemented by one or more servers. The training device 2400 may vary greatly due to different configurations or performance, and may include one or more central processing units (CPUs) 2424 (e.g., one or more processors), a memory 2432, and one or more storage media 2430 (e.g., one or more mass storage devices) for storing an application program 2442 or data 2444. The memory 2432 and the storage medium 2430 may be transient or persistent storage. The program stored on the storage medium 2430 may include one or more modules (not shown), and each module may include a series of instruction operations on the training device. Still further, the central processing unit 2424 may be configured to communicate with the storage medium 2430, and perform, on the training device 2400, the series of instruction operations stored in the storage medium 2430.
The training device 2400 may also include one or more power supplies 2426, one or more wired or wireless network interfaces 2450, one or more input/output interfaces 2458, and/or one or more operating systems 2441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
In the embodiment of the present application, the central processing unit 2424 is configured to perform the model training method according to fig. 6.
Embodiments of the present application also provide a computer program product, which when executed on a computer causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.
Also provided in an embodiment of the present application is a computer-readable storage medium, in which a program for signal processing is stored, and when the program is run on a computer, the program causes the computer to execute the steps executed by the aforementioned execution device, or causes the computer to execute the steps executed by the aforementioned training device.
The execution device, the training device, or the terminal device provided in the embodiments of the present application may specifically be a chip. The chip includes: a processing unit and a communication unit, where the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin or a circuit. The processing unit may execute the computer-executable instructions stored in the storage unit to enable the chip in the execution device to perform the data processing method described in the above embodiment, or to enable the chip in the training device to perform the data processing method described in the above embodiment. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).
Specifically, referring to fig. 25, fig. 25 is a schematic structural diagram of a chip provided in an embodiment of the present application. The chip may be represented as a neural network processing unit (NPU) 2500. The NPU 2500 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 2503, and the controller 2504 controls the arithmetic circuit 2503 to extract matrix data from the memory and perform multiplication.
In some implementations, the arithmetic circuit 2503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuit 2503 is a two-dimensional systolic array. The arithmetic circuit 2503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 2503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to the matrix B from the weight memory 2502 and buffers the data on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of the matrix A from the input memory 2501, performs a matrix operation with the matrix B, and stores partial or final results of the obtained matrix in an accumulator (accumulator) 2508.
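Purely to illustrate the dataflow described above (not the actual hardware behaviour), a NumPy sketch in which weight tiles of the matrix B are buffered and partial results are accumulated into the output matrix C, mirroring the role of the accumulator 2508; the tile size and loop structure are assumptions of this sketch:

```python
import numpy as np

def matmul_with_accumulator(A, B, tile=16):
    """Illustrative tiled A @ B: weight tiles of B are 'buffered' and partial
    results are accumulated, mirroring the accumulator's role."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)   # accumulator for the output matrix
    for k0 in range(0, K, tile):          # stream tiles of the inner dimension
        a_tile = A[:, k0:k0 + tile]
        b_tile = B[k0:k0 + tile, :]       # buffered weight tile
        C += a_tile @ b_tile              # accumulate partial results
    return C
```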
The unified memory 2506 is used for storing input data and output data. The weight data is directly transferred to the weight memory 2502 through a direct memory access controller (DMAC) 2505. The input data is also transferred into the unified memory 2506 through the DMAC.
The bus interface unit (BIU) 2510 is used for interaction among the AXI bus, the DMAC, and the instruction fetch buffer (IFB) 2509. Specifically, the bus interface unit 2510 is used by the instruction fetch buffer 2509 to obtain instructions from an external memory, and is also used by the memory access controller 2505 to obtain original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 2506, or transfer weight data to the weight memory 2502, or transfer input data to the input memory 2501.
The vector calculation unit 2507 includes a plurality of operation processing units, and further processes the output of the arithmetic circuit when necessary, for example, performs vector multiplication, vector addition, an exponential operation, a logarithmic operation, or magnitude comparison. The vector calculation unit 2507 is mainly used for non-convolution/non-fully-connected layer computation in the neural network, such as batch normalization, pixel-level summation, and up-sampling of a feature plane.
In some implementations, the vector calculation unit 2507 can store the processed output vector into the unified memory 2506. For example, the vector calculation unit 2507 may apply a linear function or a nonlinear function to the output of the arithmetic circuit 2503, for example, perform linear interpolation on a feature plane extracted by a convolutional layer, or accumulate a vector of values to generate an activation value. In some implementations, the vector calculation unit 2507 generates normalized values, pixel-level summed values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 2503, for example, for use in a subsequent layer of the neural network.
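For illustration, a small NumPy sketch of the kinds of non-convolution post-processing listed above (batch-normalization-style scaling, pixel-level summation, up-sampling of a feature plane, and an activation); the exact operations, their order, and the parameter shapes are assumptions of this sketch:

```python
import numpy as np

def vector_unit_postprocess(x, gamma, beta, eps=1e-5):
    """Illustrative post-processing of an arithmetic-circuit output x (C, H, W):
    batch-normalization-style scaling, a pixel-level summation, nearest-neighbour
    up-sampling of the feature plane, and a nonlinear activation."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x = gamma[:, None, None] * (x - mean) / np.sqrt(var + eps) + beta[:, None, None]
    pixel_sum = x.sum(axis=0)                          # pixel-level summation across channels
    upsampled = x.repeat(2, axis=1).repeat(2, axis=2)  # 2x nearest-neighbour up-sampling
    activated = np.maximum(upsampled, 0.0)             # ReLU-style activation
    return activated, pixel_sum
```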
The instruction fetch buffer 2509 is connected to the controller 2504 and is configured to store instructions used by the controller 2504.
The unified memory 2506, the input memory 2501, the weight memory 2502, and the instruction fetch buffer 2509 are all on-chip memories. The external memory is private to the NPU hardware architecture.
The processor mentioned in any of the above may be a general purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above programs.
It should be noted that the above-described embodiments of the apparatus are merely schematic, where the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the embodiments of the apparatus provided in the present application, the connection relationship between the modules indicates that there is a communication connection therebetween, which may be specifically implemented as one or more communication buses or signal lines.
Through the above description of the embodiments, those skilled in the art will clearly understand that the present application can be implemented by software plus necessary general-purpose hardware, and certainly can also be implemented by special-purpose hardware including application-specific integrated circuits, special-purpose CPUs, special-purpose memories, special-purpose components and the like. Generally, functions performed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function may also be various, such as analog circuits, digital circuits, or dedicated circuits. However, for the present application, an implementation by a software program is more often preferable. Based on such an understanding, the technical solutions of the present application, or the portions contributing to the prior art, may essentially be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disc of a computer, and includes several instructions for enabling a computer device (which may be a personal computer, a training device, or a network device) to perform the methods according to the embodiments of the present application.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, from one website, computer, training device, or data center to another website, computer, training device, or data center in a wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium accessible by a computer, or a data storage device, such as a training device or a data center, integrating one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), a semiconductor medium (e.g., a solid state disk (SSD)), or the like.

Claims (23)

1. A method of model training, the method comprising:
acquiring a sample image;
sampling the sample image to obtain a first image block and a second image block, wherein the first image block and the second image block are different image blocks on the sample image;
respectively extracting the features of the first image block and the second image block through a feature extraction network based on the fact that the cross-over ratio of the first image block to the second image block on the sample image is larger than a threshold value, so as to obtain a first feature map and a second feature map;
determining loss according to the difference between the first feature map and the second feature map, and updating the feature extraction network based on the loss to obtain an updated feature extraction network.
2. The method of claim 1, wherein the threshold is a value greater than or equal to 0.4.
3. The method according to claim 1 or 2, wherein before the loss is determined according to the difference between the first feature map and the second feature map, the method further comprises:
aligning the first feature map and the second feature map to obtain an aligned first feature map and an aligned second feature map;
said determining a loss from a difference between said first feature map and said second feature map comprises:
determining a loss according to a difference between the aligned first feature map and the aligned second feature map.
4. The method according to claim 3, wherein the sample image includes a target area, the target area is an overlapping area where the first image block and the second image block are located on the sample image, and aligning the first feature map and the second feature map includes:
according to the target area, determining a first sub-feature map corresponding to the target area in the first feature map and a second sub-feature map corresponding to the target area in the second feature map;
performing upsampling on the first sub-feature map to obtain the aligned first feature map;
and performing upsampling on the second sub-feature map to obtain the aligned second feature map, wherein the size of the aligned first feature map is consistent with that of the aligned second feature map.
5. The method of claim 3, wherein the first feature map and the second feature map are the same in size, the first feature map includes M first feature points, the second feature map includes M second feature points, the M first feature points correspond to M first pixel points in the sample image, the M second feature points correspond to M second pixel points in the sample image, the M first pixel points correspond to the M second pixel points one-to-one, and the method further comprises:
obtaining a third feature map according to the M first pixel points and the M second pixel points, wherein the third feature map and the second feature map have the same size, the third feature map comprises M third feature points, and each third feature point is obtained based on a pixel position difference between the first pixel point and the second pixel point having a corresponding relationship;
and fusing the third feature map and the first feature map to obtain the aligned first feature map, wherein the second feature map is used as the aligned second feature map.
6. The method of claim 5, wherein fusing the third feature map with the first feature map comprises:
and splicing the third feature map and the first feature map in the depth direction.
7. The method of any of claims 1 to 6, wherein determining the loss based on the difference between the first feature map and the second feature map comprises:
processing the M first feature points of the first feature map through a target prediction network to obtain a predicted value of each first feature point;
clustering the M second feature points of the second feature map based on a target clustering algorithm to update feature values of the M second feature points, wherein the feature value of each updated second feature point is the clustering center feature value of the clustering category where the updated second feature point is located;
and determining the loss according to the difference between the predicted value of each first feature point and the feature value of each updated second feature point.
8. The method of claim 7, further comprising:
and updating the target prediction network according to the loss.
9. The method of any one of claims 1 to 8, wherein the sample image comprises a plurality of objects, the objects comprising at least one of a person, a vehicle, a traffic sign, a lane line, and a plant.
10. The method according to any one of claims 1 to 9, further comprising:
acquiring a target network and an image to be processed, wherein the target network comprises the updated feature extraction network and a downstream task network;
and processing the image to be processed through the target network to obtain a processing result.
11. A model training apparatus, the apparatus comprising:
the acquisition module is used for acquiring a sample image;
the sampling module is used for sampling the sample image to obtain a first image block and a second image block, wherein the first image block and the second image block are different image blocks on the sample image;
the feature extraction module is used for respectively extracting the features of the first image block and the second image block through a feature extraction network based on the fact that the cross-over ratio of the areas of the first image block and the second image block on the sample image is larger than a threshold value, so as to obtain a first feature map and a second feature map;
and the model updating module is used for determining loss according to the difference between the first feature map and the second feature map and updating the feature extraction network based on the loss so as to obtain an updated feature extraction network.
12. The apparatus of claim 11, wherein the threshold is a value greater than or equal to 0.4.
13. The apparatus of claim 11 or 12, further comprising:
an alignment module, configured to align the first feature map and the second feature map before the loss is determined according to the difference between the first feature map and the second feature map, so as to obtain an aligned first feature map and an aligned second feature map;
the model update module is specifically configured to:
determining a loss according to a difference between the aligned first feature map and the aligned second feature map.
14. The apparatus according to claim 13, wherein the sample image includes a target area, the target area is an overlapping area where the first image block and the second image block are located on the sample image, and the alignment module is specifically configured to:
according to the target area, determining a first sub-feature map corresponding to the target area in the first feature map and a second sub-feature map corresponding to the target area in the second feature map;
performing upsampling on the first sub-feature map to obtain the aligned first feature map;
and performing upsampling on the second sub-feature map to obtain the aligned second feature map, wherein the size of the aligned first feature map is consistent with that of the aligned second feature map.
15. The apparatus according to claim 13, wherein the first feature map and the second feature map have the same size, the first feature map includes M first feature points, the second feature map includes M second feature points, the M first feature points correspond to M first pixel points in the sample image, the M second feature points correspond to M second pixel points in the sample image, the M first pixel points correspond to the M second pixel points one-to-one, and the alignment module is specifically configured to:
obtaining a third feature map according to the M first pixel points and the M second pixel points, wherein the third feature map and the second feature map have the same size, the third feature map comprises M third feature points, and each third feature point is obtained based on a pixel position difference between the first pixel point and the second pixel point having a corresponding relationship;
and fusing the third feature map and the first feature map to obtain the aligned first feature map, wherein the second feature map is used as the aligned second feature map.
16. The apparatus according to claim 15, wherein the alignment module is specifically configured to:
and splicing the third feature map and the first feature map in the depth direction.
17. The apparatus according to any one of claims 11 to 16, wherein the model update module is specifically configured to:
processing the M first feature points of the first feature map through a target prediction network to obtain a predicted value of each first feature point;
clustering the M second feature points of the second feature map based on a target clustering algorithm to update feature values of the M second feature points, wherein the feature value of each updated second feature point is the clustering center feature value of the clustering category where the updated second feature point is located;
and determining the loss according to the difference between the predicted value of each first feature point and the feature value of each updated second feature point.
18. The apparatus of claim 17, wherein the model update module is further configured to:
and updating the target prediction network according to the loss.
19. The apparatus of any of claims 11 to 18, wherein the sample image comprises a plurality of objects, the objects comprising at least one of a person, a vehicle, a traffic sign, a lane line, and a plant.
20. The apparatus of any one of claims 11 to 19, further comprising:
the data processing module is used for acquiring a target network and an image to be processed, wherein the target network comprises the updated feature extraction network and a downstream task network;
and processing the image to be processed through the target network to obtain a processing result.
21. A model training apparatus, the apparatus comprising a memory and a processor; the memory stores code, and the processor is configured to retrieve the code and perform the method of any of claims 1 to 10.
22. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 10.
23. A computer program product, characterized in that it comprises code for implementing the steps of the method of any one of claims 1 to 10 when said code is executed.
CN202110976217.6A 2021-08-24 2021-08-24 Model training method and device Pending CN115731530A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110976217.6A CN115731530A (en) 2021-08-24 2021-08-24 Model training method and device

Publications (1)

Publication Number Publication Date
CN115731530A true CN115731530A (en) 2023-03-03

Family

ID=85289521




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination