WO2023056889A1 - Model training and scene recognition method, apparatus, device and medium - Google Patents

Model training and scene recognition method, apparatus, device and medium Download PDF

Info

Publication number
WO2023056889A1
WO2023056889A1 (PCT/CN2022/123011)
Authority
WO
WIPO (PCT)
Prior art keywords
scene
convolution
layer
module
training
Prior art date
Application number
PCT/CN2022/123011
Other languages
English (en)
French (fr)
Inventor
罗雄文
卢江虎
项伟
Original Assignee
百果园技术(新加坡)有限公司
罗雄文
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 百果园技术(新加坡)有限公司, 罗雄文
Publication of WO2023056889A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/048Activation functions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Definitions

  • the present application relates to the technical field of image processing, for example, to a model training and scene recognition method, apparatus, device, and medium.
  • Machine review technology (machine review for short) is increasingly widely used in large-scale short video/picture review: violation pictures identified by machine review are pushed to staff for review (human review for short), which finally determines whether a picture violates regulations. The emergence of machine review has greatly improved the efficiency of picture review.
  • However, machine review tends to rely on the visual commonality of images to make violation judgments, ignoring changes in the review result caused by changes in the overall environment. For example, in the review of gun violations, when machine review recognizes a gun in an image it generally considers the picture a violation, but the accuracy of such a result is poor, because a gun appearing in an anime or game scene does not make the picture a violation picture. Scene recognition therefore has a great influence on the accuracy of machine review results, and a scene recognition solution is urgently needed.
  • Embodiments of the present application provide a model training and scene recognition method, device, device, and medium.
  • the embodiment of the present application provides a scene recognition model training method.
  • the scene recognition model includes a core feature extraction layer, a global information feature extraction layer connected to the core feature extraction layer, at least one level of locally supervised learning (LCS) module with attention mechanism, and a fully connected decision layer; the method includes:
  • training the parameters of the core feature extraction layer and the global information feature extraction layer using a first scene label of a sample image and a standard cross-entropy loss;
  • training the weight parameters of the LCS module at each level according to a loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
  • training the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
  • the embodiment of the present application provides a scene recognition method based on the scene recognition model trained by the method described above, the method comprising:
  • acquiring an image to be recognized; inputting the image to be recognized into a pre-trained scene recognition model, and determining scene information corresponding to the image to be recognized based on the scene recognition model.
  • the embodiment of the present application provides a scene recognition model training device, the device includes:
  • a first training unit configured to train the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss;
  • a second training unit configured to train the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
  • a third training unit configured to train the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
  • the embodiment of the present application provides a scene recognition device based on the scene recognition model trained by the above-mentioned device, and the device includes:
  • An acquisition module configured to acquire an image to be identified
  • the recognition module is configured to input the image to be recognized into a pre-trained scene recognition model, and determine the scene information corresponding to the image to be recognized based on the scene recognition model.
  • an embodiment of the present application provides an electronic device; the electronic device includes a processor configured to, when executing a computer program stored in a memory, implement the steps of the above scene recognition model training method or the steps of the above scene recognition method.
  • an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above scene recognition model training method or the steps of the above scene recognition method.
  • Embodiments of the present application provide a model training and scene recognition method, device, device, and medium.
  • the scene recognition model includes a core feature extraction layer, a global information feature extraction layer connected to the core feature extraction layer, LCS modules at each level, and a fully connected decision layer; the method includes:
  • training the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss;
  • training the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
  • training the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
  • Fig. 1 is a schematic diagram of the scene recognition model training process provided by the embodiment of the present application.
  • FIG. 2 is a schematic diagram of the application of the scene recognition method provided by the embodiment of the present application.
  • Fig. 3 is the flow chart of the model main body training stage provided by the embodiment of the present application.
  • FIG. 4 is a flow chart of the model branch expansion stage provided by the embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of the core feature extraction part of the scene recognition model provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of the structure and execution principle of the global information feature extraction layer provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of a detailed explanation of the principle of the local supervised learning module provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of the structure of the extended branch network of the scene recognition model and the principle of the first round of training provided by the embodiment of the present application;
  • FIG. 9 is a schematic diagram of the branch expansion stage structure and training of the scene recognition model provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of the scene recognition process provided by the embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a scene recognition model training device provided in an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a scene recognition device provided in an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another electronic device provided by an embodiment of the present application.
  • Convolutional neural network: an end-to-end complex mapping used to extract image or video features and complete visual tasks such as classification and detection based on the extracted features, usually stacked from multiple basic convolution modules.
  • Convolution layer: an operation layer that uses a kernel with a specific receptive field to perform weighted-sum feature extraction; generally this layer is also combined with a nonlinear activation function to improve its mapping ability.
  • Pooling: a summarizing operation, such as summarizing pixel values over a specific range or a specific dimension, usually including maximization, minimization, and averaging.
  • Group convolution: the feature maps are divided into several groups by channel, and the feature maps of each group undergo the same or different convolution operations; this can be used to reduce computational overhead.
  • Feature pyramid: a multi-scale feature extraction method, usually extracting feature maps from different levels of the network, aligning them through some upsampling scheme, and fusing them to produce multi-scale features.
  • Residual block: a module composed of multiple convolutional layers with a cross-layer connection bypass; using this module, a deep convolutional neural network can be built while avoiding vanishing gradients and accelerating training of the network.
  • Heat map: a feature map that reflects the local importance of an image; generally the higher the importance, the larger the local heat value, or vice versa.
  • Locally supervised learning: using directly connected labels and losses for some parts of the model or some local part of a feature map to learn parameters or extraction capabilities.
  • Attention mechanism: a mechanism that forces the network to pay attention to important regions by fitting the importance of different parts, and makes decisions based on the features of the important regions.
  • Sigmoid: an activation function that does not consider mutual exclusion between categories; the activated output values usually fall in the [0,1] interval, completing normalization.
  • Deformable convolution: a convolution operation in which the convolution kernel is not a regular geometric shape; the irregular shape is usually generated by adding an offset to the original shape.
  • Standard cross entropy: a conventional loss evaluation function for simple classification problems, often used to train classification networks, including single-label and multi-label classification.
  • Focal Loss: a loss function for the category imbalance problem, which gives categories with less data a larger penalty and prevents the model from leaning completely towards categories with more data.
  • the embodiments of the present application are not applied directly in the machine review stage to directly produce review results; instead, the scene information required by a specific machine review model is output in the form of a scene signal, and the push result is produced together with the machine review model through an appropriate strategy. Videos or pictures that the final push result considers to be violations are pushed to the human review stage for multiple rounds of review, yielding a punishment result; videos or pictures that the final push result considers normal are also sampled for inspection in different areas at a certain sampling rate, or pushed to the human review stage for re-examination based on reports, so as to avoid missing videos/pictures that seriously violate the regulations.
  • Fig. 1 is a schematic diagram of the scene recognition model training process provided by the embodiment of the present application, the process includes the following steps:
  • S101: Train the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss.
  • S102: Train the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image.
  • S103: Train the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
  • the scene recognition model includes a core feature extraction layer, a global information feature extraction layer connected to the core feature extraction layer, LCS modules at each level, and a fully connected decision layer.
  • the scene recognition method provided in the embodiment of the present application is applied to an electronic device, and the electronic device may be a smart device such as a personal computer (PC), a tablet computer, or a server.
  • in order to meet the requirements of scene recognition at different granularities, the scene recognition model also includes a branch extension structure; the branch extension structure includes convolution layers and a local object association relationship module;
  • the weight parameters of the convolution layers at each level of the branch extension structure are trained according to the loss value computed pixel by pixel from the feature map output by the convolution layers of the branch extension structure and the second scene label of the sample image; the parameters of the local object association relationship module are trained through a loss function with a scene confidence regularization term; the granularity of the first scene label differs from that of the second scene label.
  • the first scene label is a coarse-grained scene label
  • the second scene label is a fine-grained scene label
  • as shown in FIG. 2, the multi-level fine-grained scene recognition model proposed in the embodiment of the present application runs in a synchronized manner: for an image to be recognized in a certain video or picture set, the scene recognition model is executed before the other existing machine review models in the machine review process, and the resulting scene information is stored in a cache queue; then a series of machine review models (machine review model a, machine review model b, machine review model c, etc. in FIG. 2) are activated by a synchronization signal, take the images to be recognized from the same video or picture set as input, and produce preliminary review results.
  • a machine review model/label configured with a scene strategy takes the corresponding scene signal from the cache queue, computes it together with the preliminary machine review result to obtain the final review result, and decides whether to push to the human review stage;
  • a machine review model/label without a scene strategy decides whether to push directly based on the result given by the machine review model.
  • if the determined scene information corresponding to the image to be recognized belongs to violation scene information and the machine review result is that the image to be recognized is a violation image, the image to be recognized is determined to be a violation image and is pushed to the human review stage. If the determined scene information does not belong to violation scene information, or the machine review result is that the image to be recognized is not a violation image, the image to be recognized is determined not to be a violation image; in this case it is not pushed to human review, or it is sampled for inspection in different areas at a certain sampling rate, or pushed to the human review stage for re-examination based on reports.
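  • As a rough illustration of this decision rule, the following sketch shows how a scene signal taken from the cache queue might be combined with a preliminary machine review result to decide where an image goes; the function and field names (decide_push, VIOLATION_SCENES, SAMPLING_RATE) are hypothetical and not part of the patent.

    # Hedged sketch of the push decision described above; names and thresholds are illustrative only.
    import random

    VIOLATION_SCENES = {"real_gun_scene"}   # scenes configured as violation scene information
    SAMPLING_RATE = 0.05                    # assumed sampling rate for spot checks of "normal" results

    def decide_push(scene_info: str, machine_review_is_violation: bool, reported: bool = False) -> str:
        """Return 'human_review', 'sample_check' or 'pass' for one image."""
        if machine_review_is_violation and scene_info in VIOLATION_SCENES:
            return "human_review"           # confirmed violation: push to the human review stage
        if reported or random.random() < SAMPLING_RATE:
            return "sample_check"           # re-examined based on reports or random sampling
        return "pass"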
  • the electronic device may store in advance which scene information belongs to violation scene information; after the scene information corresponding to the image to be recognized is determined, it can be judged whether this scene information belongs to violation scene information.
  • the process of the machine review model reviewing whether the image to be recognized is a violation image can use related technologies, which will not be repeated here.
  • FIG. 3 is a schematic diagram of the training process of the main structure of the scene recognition model provided by the embodiment of the present application
  • FIG. 4 is a schematic diagram of the training process of the branch extension structure of the scene recognition model provided by the embodiment of the present application.
  • Figure 3 and Figure 4 show the overall training process of the multi-level fine-grained scene recognition model proposed in the embodiment of the present application, which consists of two stages, the "model main body training stage" and the "model branch expansion stage"; after training is completed, a scene recognition model with the structure shown on the left side of Figure 2 is generated, which is used to help improve the accuracy of the machine review models.
  • the feature extraction capability of the main structure part is very important, because each component of the main structure part is generally used as a pre-component when the branch expands the structure, which affects the feature extraction capability of specific fine-grained branches.
  • in order to greatly improve the feature extraction capability of the main structure so that it mines features of high richness, as shown in Figure 3, three rounds of training strategies with different targets are used in the model main body training stage to train each module of the main structure.
  • in the first round, the core feature extraction layer and the multi-scale global information feature extraction layer of the scene recognition model are specially optimized over multiple iterations using the first scene label (at this stage generally an image-level scene semantic label) and the standard cross-entropy loss;
  • the weight parameters of the locally supervised learning modules are not optimized in this round.
  • the main purpose of the first round of optimization is to allow the model to obtain abstract semantic features and the ability to extract global features on multiple scales.
  • the parameters (weight parameters) of the first round of optimization are fixed, and a "locally supervised learning module with attention mechanism" (Local Supervised module, LCS module) is connected to the convolutional feature map group after each pooling layer.
  • each level takes the pooled convolutional feature map group as input and outputs a "focused" feature map through the LCS module; using this feature map and the first scene label of the sample image (at this stage generally a box-level label), the weight parameters of the LCS module at each level are optimized with a pixel-by-pixel binary sigmoid loss.
  • after the second round of optimization, the model is sensitive to the features of local objects: local object features can be extracted through the LCS modules, and Fisher convolutional feature encoding can be used to reduce feature redundancy while minimizing the loss of subtle features that influence decision-making.
  • the third round of optimization focuses on the weight parameters of the decision layer for the fused features. At this point the fully connected output layer used in the first round of optimization is removed and replaced with a fully connected decision layer corresponding to the fusion of the three kinds of features; it is likewise trained and optimized using the first scene label (at this stage generally an image-level scene semantic label) and the standard cross-entropy loss, with all weight parameters other than the decision layer fixed.
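  • A minimal sketch of the three-round schedule described above, assuming a PyTorch-style model with core, global_info, lcs_modules and decision sub-modules; these attribute names, and the choice of BCEWithLogitsLoss as the pixel-wise binary sigmoid loss, are assumptions made for illustration.

    import torch.nn as nn

    def set_trainable(module: nn.Module, flag: bool) -> None:
        for p in module.parameters():
            p.requires_grad = flag

    def configure_round(model: nn.Module, round_id: int):
        """Freeze/unfreeze sub-modules for one of the three training rounds and return its loss."""
        if round_id == 1:                                   # round 1: abstract semantics + multi-scale global features
            set_trainable(model.core, True); set_trainable(model.global_info, True)
            set_trainable(model.lcs_modules, False); set_trainable(model.decision, False)
            return nn.CrossEntropyLoss()                    # image-level scene semantic labels
        if round_id == 2:                                   # round 2: per-level LCS modules with box-level masks
            set_trainable(model.core, False); set_trainable(model.global_info, False)
            set_trainable(model.lcs_modules, True)
            return nn.BCEWithLogitsLoss(reduction="none")   # pixel-wise binary sigmoid loss
        set_trainable(model.lcs_modules, False)             # round 3: only the fused-feature decision layer trains
        set_trainable(model.decision, True)
        return nn.CrossEntropyLoss()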
  • Figure 4 gives an example of extending a branch.
  • a branch generally starts from the output of a certain convolutional layer in the main structure and attaches several convolution and pooling operation layers.
  • the weight parameters of the main structure part will be fixed, and the training will be completed after two rounds of optimization.
  • the first round of optimization uses a pixel-wise binary sigmoid loss to directly optimize the associated convolutional layer, without using the LCS module as a relay.
  • the main purpose of the first round of optimization is to enable the extended branch to have feature learning capabilities for local objects.
  • the second round of optimization embeds several "local object association learning modules" composed of deformable convolutions into the extended branch, and trains them with a focal loss to learn the ability to mine associations between local objects.
  • multiple different branches can be extended on the main network to handle different task requirements.
  • the embodiment of the present application provides a solution for image scene recognition based on the scene recognition model.
  • when training the scene recognition model, the parameters of the core feature extraction layer and the global information feature extraction layer are first obtained through training with the first scene label of the sample image and the standard cross-entropy loss; then, according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image, the weight parameters of the LCS module at each level are trained; finally, the parameters of the fully connected decision layer of the scene recognition model are obtained through training.
  • as a result, the scene recognition model has the ability to extract features of high richness, and performing scene recognition based on this model greatly improves the accuracy of scene recognition.
  • the scene recognition model also includes a branch extension structure, so as to meet the requirements of different fine-grained scene recognition.
  • the core feature extraction layer in the scene recognition model is connected with the global information feature extraction layer, the LCS module at each level, the fully connected decision layer (the fully connected (FC) layer in Figure 4), and the branch extension structure.
  • when scene recognition is performed on an image to be recognized based on the scene recognition model, the core feature extraction layer first performs feature extraction on the image to be recognized and then outputs the result to the global information feature extraction layer, the LCS module at each level, the fully connected decision layer, and the branch extension structure.
  • the global information feature extraction layer, the LCS module at each level, the fully connected decision layer, and the branch extension structure process the received feature maps in parallel to obtain the final scene recognition result.
  • Figures 5 to 9 show the details of the scene recognition model and model training.
  • the core feature extraction layer includes a first type of grouped multi-receptive-field residual convolution module and a second type of grouped multi-receptive-field residual convolution module;
  • the first type of grouped multi-receptive-field residual convolution module includes a first group, a second group, and a third group whose convolution kernel sizes differ, and each of the first, second, and third groups includes a residual calculation bypass structure; each group outputs a feature map through convolution and residual calculation, and the feature maps output by the groups are concatenated in the channel dimension, channel-shuffled, fused by convolution, and then output to the next module;
  • the second type of grouped multi-receptive-field residual convolution module includes a fourth group, a fifth group, and a sixth group whose convolution kernel sizes differ, and the fifth group and the sixth group each include a residual calculation bypass structure with a 1x1 convolution; the feature maps output by the groups are concatenated in the channel dimension, channel-shuffled, and then output to the next module after convolution fusion.
  • the scene recognition model proposed in the embodiment of this application has a core feature extraction layer structure as shown on the right side of Figure 5, which consists of a series of "grouped multi-receptive field residual convolution modules".
  • the left side of Figure 5 describes two types of grouped multi-receptive field residual convolution modules, which are composed of three convolution branches with different receptive fields.
  • the feature map group calculated by the previous module will be divided into three groups. They are sent to different convolution branches for convolution operation to further extract features.
  • the multi-receptive field residual convolution module of the first group is "GM-Resblock" in Figure 5.
  • the convolution kernels of the first group, the second group, and the third group use three different sizes, 1x1, 3x3, and 5x5, respectively, and the 5x5 convolution operation is replaced by two stacked 3x3 convolution operations, which increases the number of nonlinear mappings and improves the fitting ability while keeping the receptive field unchanged.
  • Each branch also adds a bypass structure used to calculate the residual to avoid gradient disappearance while expanding the depth of the model.
  • the embodiment of this application uses multi-receptive-field convolution as the basic module mainly because scene recognition is a visually complex problem in which local features at different scales may affect scene discrimination, and the multi-receptive-field mechanism can capture as many factors that facilitate decision-making as possible.
  • the output results of the three convolution branches of GM-Resblock will be spliced and shuffled in the channel dimension, and finally fused with 1x1 convolution and passed to the next module.
  • the multi-receptive-field residual convolution module of the second type of grouping is the "GM projection block" in Figure 5. The GM projection block includes the fourth group, the fifth group, and the sixth group, whose three branches use convolution kernels of three different sizes, 1x1, 3x3, and 5x5, respectively, with the 5x5 convolution operation again replaced by two stacked 3x3 convolution operations. Because the GM projection block also down-samples the feature map, its structure is slightly modified: the bypass of the 1x1 convolution branch is removed, and a 1x1 convolution is added to the bypasses of the 3x3 and 5x5 convolution branches to keep the feature map size and the number of channels consistent.
  • the output results of the three convolution branches of the GM projection block will be concatenated and shuffled in the channel dimension, and finally fused using 1x1 convolution and passed to the next module.
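  • The following is a minimal PyTorch-style sketch of one grouped multi-receptive-field residual block of the kind described above; the channel counts, the assumption that the channel number is divisible by three, and the exact placement of activations are illustrative choices, not the patent's specification.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def channel_shuffle(x: torch.Tensor, groups: int = 3) -> torch.Tensor:
        n, c, h, w = x.shape
        return x.view(n, groups, c // groups, h, w).transpose(1, 2).reshape(n, c, h, w)

    class GMResBlock(nn.Module):
        """Three branches with 1x1, 3x3 and 5x5 (two stacked 3x3) receptive fields, each with a residual bypass."""
        def __init__(self, channels: int):
            super().__init__()
            c = channels // 3                                 # the incoming feature maps are split into 3 groups
            self.b1 = nn.Conv2d(c, c, 1)
            self.b3 = nn.Conv2d(c, c, 3, padding=1)
            self.b5 = nn.Sequential(nn.Conv2d(c, c, 3, padding=1), nn.ReLU(inplace=True),
                                    nn.Conv2d(c, c, 3, padding=1))   # 5x5 replaced by two 3x3 convolutions
            self.fuse = nn.Conv2d(3 * c, channels, 1)         # 1x1 fusion after concatenation and shuffling
        def forward(self, x):
            g1, g3, g5 = torch.chunk(x, 3, dim=1)
            out = torch.cat([g1 + F.relu(self.b1(g1)),        # residual bypass on every branch
                             g3 + F.relu(self.b3(g3)),
                             g5 + F.relu(self.b5(g5))], dim=1)
            return self.fuse(channel_shuffle(out))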
  • the feature maps of different levels in the core feature extraction layer are upsampled using deconvolution operations with different expansion factors, and a bilinear interpolation algorithm is used to align the number of channels in the channel dimension; the feature maps of the levels are added and merged channel by channel, the merged feature map group is fused by convolution, and the global information feature vector is obtained by channel-wise global average pooling; the global information feature vector is concatenated with the fully connected layer (FC) feature vector, and the parameters of the core feature extraction layer and the global information feature extraction layer are obtained through training with the standard cross-entropy loss.
  • the global information feature extraction module is trained together with the core feature extraction layer of the model.
  • Figure 6 briefly shows its details and principle.
  • this application starts from the multi-scale feature map, fuses the global information of different scales, and obtains high-quality global information feature vectors.
  • multi-scale features can reduce information loss on the one hand, and make the model more sensitive to important regions at the global spatial level on the other hand.
  • the embodiment of the present application draws on the idea of the feature pyramid to upsample the feature map groups at different levels in the core feature extraction layer of the model using deconvolution operations with different expansion factors to ensure that the size of the feature maps is consistent.
  • deconvolution is used instead of ordinary padding-based upsampling, mainly to alleviate the image distortion caused by the upsampling operation.
  • the number of channels of the feature maps at different levels is still inconsistent.
  • the bilinear interpolation algorithm is simply used circularly in the channel dimension to supplement the insufficient channels and align the number of channels.
  • the feature maps of each level are combined by channel-by-channel addition operation, and the merged feature map group is fused using 1x1 convolution, and the global information feature vector is obtained through channel-by-channel global average pooling. According to Figure 3, this vector will be concatenated with the FC feature vector that records abstract features, and then connected to the standard cross-entropy loss for optimization.
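  • A rough sketch of this multi-scale global information fusion, assuming three pyramid levels; the deconvolution strides, the channel counts, and the way bilinear interpolation is applied along the channel dimension are simplifications, not the patent's exact settings.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def align_channels(x: torch.Tensor, target_c: int) -> torch.Tensor:
        """Bilinearly interpolate along the channel dimension so every level has target_c channels."""
        n, c, h, w = x.shape
        flat = x.view(n, c, h * w).transpose(1, 2)                        # (N, H*W, C)
        flat = F.interpolate(flat, size=target_c, mode="linear", align_corners=False)
        return flat.transpose(1, 2).view(n, target_c, h, w)

    class GlobalInfoFusion(nn.Module):
        """Upsample deeper levels with deconvolution, align channels, add, fuse with 1x1 conv, then pool."""
        def __init__(self, channels=(128, 256, 512)):
            super().__init__()
            self.up2 = nn.ConvTranspose2d(channels[1], channels[1], kernel_size=2, stride=2)
            self.up4 = nn.ConvTranspose2d(channels[2], channels[2], kernel_size=4, stride=4)
            self.fuse = nn.Conv2d(max(channels), max(channels), kernel_size=1)
        def forward(self, f0, f1, f2):                                    # shallow -> deep feature maps
            c = max(f0.size(1), f1.size(1), f2.size(1))
            maps = [align_channels(f0, c), align_channels(self.up2(f1), c), align_channels(self.up4(f2), c)]
            merged = self.fuse(sum(maps))                                 # channel-by-channel addition, 1x1 fusion
            return F.adaptive_avg_pool2d(merged, 1).flatten(1)            # global information feature vector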
  • training the weight parameters of the LCS module at each level includes:
  • obtaining the importance weight of each channel with an activation function through the attention mechanism in the channel dimension, and weighting and summing the feature maps of the channels according to their importance weights to obtain a summary heat map; the loss value is then calculated pixel by pixel according to the summary heat map, the object-scene correlation importance, and the area of the objects, and the weight parameters of the LCS module at each level are trained according to the loss value.
  • the embodiment of the present application controls the importance of feature maps of different channels through the attention mechanism of the channel dimension while down-sampling. It is used to obtain a summary heat map that more accurately indicates the pixel information of each location, and better guides the learning of local object features by the LCS module.
  • the LCS module uses ordinary 3x3 convolutions to complete downsampling instead of pooling operations. This is done to avoid excessive local object activation offsets when reducing redundancy.
  • the attention mechanism in the channel dimension uses sigmoid activation to obtain the importance weights, because importance is not mutually exclusive between channels; finally, an importance weight vector is output through a fully connected layer, and each importance value is multiplied by the corresponding channel to complete the weighting of the channel feature maps.
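  • A minimal sketch of such a sigmoid-activated channel attention followed by the pixel-wise summation into a heat map; the reduction ratio and layer sizes are assumptions.

    import torch
    import torch.nn as nn

    class ChannelAttention(nn.Module):
        """Per-channel importance weights via sigmoid (channel importance is not mutually exclusive)."""
        def __init__(self, channels: int, reduction: int = 4):
            super().__init__()
            self.fc = nn.Sequential(
                nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
                nn.Linear(channels // reduction, channels), nn.Sigmoid())
        def forward(self, x):                        # x: (N, C, H, W)
            w = self.fc(x.mean(dim=(2, 3)))          # global average pool -> FC -> sigmoid weights, shape (N, C)
            weighted = x * w[:, :, None, None]       # weight each channel's feature map by its importance
            heat_map = weighted.sum(dim=1)           # pixel-by-pixel sum across channels -> summary heat map
            return weighted, heat_map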
  • after the LCS module outputs the attention-enhanced feature map group, it is connected to a dedicated "local object supervision loss" that supervises and guides the module to learn the ability to extract local object features. For example, the attention-enhanced feature map group is first summed across channels pixel by pixel to obtain a heat map reflecting the activation at different pixel positions; the loss is then calculated using the heat map, the mask map annotated from the box-level objects, and the "object-scene correlation importance", and back-propagation is performed.
  • the mask map is based on the image-level scene semantic label, and is obtained according to the degree of influence of the object in the scene image on the scene judgment.
  • each object in the image is given a mask according to the box range it occupies: objects with a large impact on scene judgment are marked as "important" (mask value 1.0), common objects that have little influence on scene judgment and appear in multiple scenes are marked as "unimportant" (mask value 0.5), and the background has a mask value of 0.0.
  • the loss uses a pixel-by-pixel binary sigmoid loss, and the penalty weight is selected according to the ratio of the area of the "important object” to the area of the "unimportant object".
  • in the loss formula (not reproduced here): p_{i,j} denotes the activation value of the pixel on the heat map; mask_{i,j} denotes the pixel-level label; i and j denote the row and column numbers of the pixel; H and W denote the height and width of the image; area denotes an area; l_bsigmoid denotes the way the loss value of each pixel is calculated; T_area denotes the threshold that triggers the different loss calculation methods; mask_area_im and mask_area_unim denote the areas of the mask regions of important objects and of ordinary objects, respectively, which can be marked manually; and λ_im, λ_unim, λ'_im, λ'_unim, and λ_back take the values 0.8, 0.6, 1.0, 0.5, and 0.3, respectively. It should be noted that when training the LCS modules in this application, the module at each level is directly connected to its loss and back-propagated independently, and the mask map is down-sampled accordingly as needed.
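  • A rough sketch of how a mask-weighted, pixel-wise binary sigmoid loss of this kind could be written; because the patent's formula is not reproduced above, the exact branching on the area ratio, the assumed threshold value, and the way the λ weights are applied are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    # Illustrative weights matching the values quoted above for lambda_im, lambda_unim, lambda'_im, lambda'_unim, lambda_back.
    LAM_IM, LAM_UNIM, LAM_IM2, LAM_UNIM2, LAM_BACK = 0.8, 0.6, 1.0, 0.5, 0.3
    T_AREA = 0.5                                   # assumed threshold on the important/unimportant area ratio

    def lcs_loss(heat_map: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        """heat_map: (N, H, W) raw activations; mask: (N, H, W) with values 1.0 / 0.5 / 0.0."""
        target = (mask == 1.0).float()             # the module should activate on important objects
        bce = F.binary_cross_entropy_with_logits(heat_map, target, reduction="none")
        area_im = (mask == 1.0).float().sum(dim=(1, 2)) + 1e-6
        area_unim = (mask == 0.5).float().sum(dim=(1, 2)) + 1e-6
        small = (area_im / area_unim < T_AREA).float()[:, None, None]     # 1 where the ratio is below threshold
        w_im = small * LAM_IM2 + (1 - small) * LAM_IM
        w_unim = small * LAM_UNIM2 + (1 - small) * LAM_UNIM
        weights = (mask == 1.0).float() * w_im + (mask == 0.5).float() * w_unim + (mask == 0.0).float() * LAM_BACK
        return (weights * bce).mean()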
  • the embodiment of the present application uses the Fisher convolutional coding method to reduce the dimension of the feature map, and uses the Fisher convolutional feature coding technology to extract the local object feature vector, while reducing the loss of subtle decisive features and avoiding the geometric transformation caused by redundant features. Influence.
  • the process of Fisher convolutional feature encoding is relatively simple: it mainly uses a mixture of several general-purpose Gaussian distributions over the vectors at different pixel positions and reduces the number of feature dimensions. The steps are as follows:
  • the feature map is flattened spatially so that it is represented as HxW C-dimensional vectors;
  • PCA is used to reduce the dimensionality of each C-dimensional vector to M dimensions;
  • the Gaussian mixture statistics are calculated over the HxW M-dimensional vectors using K Gaussian distributions.
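  • A simplified sketch of these three steps using scikit-learn; the reduced dimension M, the number of Gaussians K, fitting the mixture per feature map, and keeping only first-order statistics are assumptions made to keep the example short.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    def fisher_encode(feature_map: np.ndarray, M: int = 16, K: int = 8) -> np.ndarray:
        """feature_map: (C, H, W) array -> fixed-length local-object feature vector of size K*M."""
        C, H, W = feature_map.shape
        X = feature_map.reshape(C, H * W).T                    # step 1: H*W vectors of dimension C
        X = PCA(n_components=M).fit_transform(X)               # step 2: reduce each vector to M dimensions
        gmm = GaussianMixture(n_components=K, covariance_type="diag").fit(X)
        resp = gmm.predict_proba(X)                            # step 3: soft assignments to the K Gaussians
        # first-order Fisher statistics: responsibility-weighted, variance-normalised deviations from the means
        diff = (X[:, None, :] - gmm.means_[None]) / np.sqrt(gmm.covariances_[None])
        return ((resp[:, :, None] * diff).sum(axis=0) / (H * W)).reshape(-1)

    vec = fisher_encode(np.random.rand(256, 14, 14))           # e.g. a 256-channel 14x14 map -> 128-dim vector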
  • in step 3 of Figure 3, after the local object features and global information features are obtained, these features are combined with the FC abstract features to rebuild a main-body decision layer oriented towards high-richness features, and these features are used to complete high-precision decisions.
  • the branch extension structure is constructed using depthwise separable convolution (DW) residual blocks.
  • the middle layer uses a DW convolutional layer, and a 1x1 convolutional layer is used before and after the DW convolution.
  • the local object association learning module includes a deformable convolution layer, a convolution layer and an average pooling layer;
  • the deformable convolution layer obtains the offset value of the convolution kernel at the current pixel position, takes the current position of each convolution kernel parameter plus the offset value as its actual effective position, obtains the feature map pixel value at the actual effective position, and outputs a feature map after the convolution operation and the average pooling operation.
  • branches are usually expanded according to new fine-grained scenario requirements, and new branches can be designed with an appropriate network structure according to requirements.
  • a depth-wise separation convolution residual block (Depth-Wise, DW) is used to construct branches, as shown in FIG. 8 .
  • the middle layer uses DW convolution to replace the ordinary convolution layer, which reduces the computational overhead by about two-thirds.
  • 1x1 convolutions are used to realize the channel expansion and reduction operations, and the output uses linear activation, mainly to prevent ReLU from discarding too many features when negative values are activated.
  • This application finally uses three modules (branch module component a, branch module component b, and branch module component c) to form a fine-grained branch in series. Since the branch is obtained by extending the core feature extraction layer of the main body of the scene model, this part is not specifically optimized for the learning ability of local object features, so the corresponding level of the extended branch network will directly access the LCS loss proposed above for pre-training optimization.
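  • A small sketch of a depthwise-separable residual block of the kind used for the extension branch; the expansion ratio, the use of batch normalization, and the channel counts are assumptions.

    import torch.nn as nn

    class DWResBlock(nn.Module):
        """1x1 expand -> depthwise 3x3 -> 1x1 project with linear output, plus a residual connection."""
        def __init__(self, channels: int, expand: int = 2):
            super().__init__()
            mid = channels * expand
            self.block = nn.Sequential(
                nn.Conv2d(channels, mid, 1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, mid, 3, padding=1, groups=mid),          # depthwise conv replaces the ordinary conv
                nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
                nn.Conv2d(mid, channels, 1), nn.BatchNorm2d(channels))  # linear output: no ReLU on the projection
        def forward(self, x):
            return x + self.block(x)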
  • in order to obtain the ability to learn association relationships on top of the ability to extract local object features, in the second round of the branch expansion stage this application embeds an "association relationship learning module" between the components of each branch module; these modules are trained together with the original components of the branch network.
  • the association relationship learning module consists of a deformable convolutional layer, a 1x1 convolutional layer, and an average pooling layer.
  • the deformable convolution layer is its core, and it uses a deformed convolution kernel during the convolution operation; this is mainly because the global spatial association of local objects is generally not a regular geometric shape, and its association logic can be modeled more accurately by deformed convolution kernels.
  • before performing the convolution operation, deformable convolution needs to pass through a side branch to obtain the convolution kernel offsets at the current pixel position, which include an X offset and a Y offset (because the convolution kernel parameters usually only need to attend to the spatial dimensions); the current position of each convolution kernel parameter plus the offset value is then used as its actual effective position. Considering that the coordinates of this position may be floating-point numbers, the feature map pixel value at the position corresponding to the convolution kernel parameter can be obtained by bilinear interpolation. After the deformable convolution operation is completed, a 1x1 convolution operation and an average pooling operation (non-global average pooling, which does not change the size) are performed once, mainly to smooth the output result.
  • the relationship learning module is only a bypass of the branch expansion network connection position, and the original modules will still be directly connected.
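  • A sketch of such an association relationship learning module using torchvision's deformable convolution; the offset branch, kernel size and pooling size mirror the description, but the concrete numbers are assumptions.

    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class AssociationModule(nn.Module):
        """Deformable 3x3 conv (offsets predicted by a side branch) -> 1x1 conv -> smoothing average pool."""
        def __init__(self, channels: int):
            super().__init__()
            self.offset = nn.Conv2d(channels, 2 * 3 * 3, kernel_size=3, padding=1)   # x/y offset per kernel tap
            self.deform = DeformConv2d(channels, channels, kernel_size=3, padding=1)
            self.proj = nn.Conv2d(channels, channels, kernel_size=1)
            self.smooth = nn.AvgPool2d(kernel_size=3, stride=1, padding=1)           # non-global, keeps the size
        def forward(self, x):
            out = self.deform(x, self.offset(x))   # bilinear sampling at the offset positions happens internally
            return self.smooth(self.proj(out))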
  • since this round of training generally focuses on fine-grained scenes, it is more prone to data category imbalance and cross-category feature overlap; therefore, this round of training uses focal loss as the main part of the loss function. This loss gives more training attention to categories with few samples, and it is also well suited as a multi-label training loss.
  • this application also uses the confidence of each scene in the main part as a regularization item to improve the efficiency of this round of training.
  • the loss function has the following form (the formula itself is not reproduced here): L_focal denotes the standard focal loss, and the confidence score of the image for a given scene category i produced by the main part of the model serves as the regularization term; this application uses an L2 regularization term as the amplifying penalty term.
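  • A hedged sketch of the combined objective: a standard multi-label focal loss plus an L2 penalty built from the main body's scene confidence scores; the weighting factor and the exact form of the regularization term are assumptions.

    import torch
    import torch.nn.functional as F

    def focal_loss(logits, targets, gamma: float = 2.0, alpha: float = 0.25):
        """Multi-label focal loss: down-weights easy examples so categories with little data get more attention."""
        p = torch.sigmoid(logits)
        ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = p * targets + (1 - p) * (1 - targets)
        alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** gamma * ce).mean()

    def branch_loss(branch_logits, targets, main_scene_confidence, reg_weight: float = 0.1):
        """main_scene_confidence: confidence scores for the coarse scene categories from the model's main part."""
        reg = (main_scene_confidence ** 2).sum(dim=1).mean()   # assumed form of the L2 regularization term
        return focal_loss(branch_logits, targets) + reg_weight * reg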
  • Branch expansion can be performed at any level of the feature extraction layer of the model's main body and expanded in a tree form.
  • This application uses a three-stage training scheme to train the main feature extraction part of the model from the three perspectives of abstract features, global information features, and local object features, so that the model has the ability to extract features of high richness and performs scene discrimination based on them, greatly improving scene recognition accuracy.
  • This application combines the idea of the feature pyramid to mine global information features from multiple scales, avoiding the loss of global spatial correlation information caused by excessive downsampling and nonlinear transformation, providing high-quality global information features, and improving the ability to recognize background scenes.
  • This application uses local supervised learning at multiple levels to provide local object feature extraction capabilities for different levels. Compared with single-level local object feature extraction, it reduces the loss of subtle scene decision information and enriches local object characteristics.
  • This application enhances the attention degree of the local supervised learning module to different channels through the attention mechanism, strengthens the activation of important local object features, and points out the direction for the subsequent Fisher coding.
  • This application proposes, for the first time, optimizing on the basis of a summarized heat map, combined with the importance of local objects at the box level, using a new pixel-by-pixel binary sigmoid loss, forcing the locally supervised learning module to focus on learning "important local objects" and reducing the interference of "unimportant local objects" and the "background" on decision-making.
  • This application uses Fisher convolutional coding to extract feature vectors from feature maps, reducing redundancy while avoiding excessive abstraction and loss of information.
  • in the model main body, this application uses multi-branch residual convolution as the basic module to ensure feature extraction capability; in the model branch expansion stage, this application uses strategies such as depthwise separable convolution and shared local learning modules to reduce overhead.
  • This application proposes for the first time to use deformable convolution to build an association relationship learning module, and use the geometric flexibility of deformable convolution to accurately model the association relationship of local objects.
  • This application also uses the scene confidence of the main part as a regular term, combined with focus loss, to optimize fine-grained scene recognition with category imbalance.
  • focal loss can also be used to fully train only the core feature extraction layer, and then the global information feature extraction module is trained separately.
  • the global information feature extraction module can simply use two layers of deconvolution to complete size upsampling and channel expansion at the same time, but this will slow down the convergence speed.
  • the global information feature extraction module can also use channel-level attention mechanism and fully connected layer to complete feature fusion.
  • a locally supervised learning module can be trained with fully connected layers using image-level semantic labels combined with an auxiliary loss.
  • the fine-grained branch extension network can also be extended on the existing branch extension network, without having to use the main network as the starting point for extension.
  • the main part of the model can also use the basic module based on depth separation convolution to reduce overhead, and nxn convolution can also be converted into equivalent 1xn and nx1 convolution to reduce overhead.
  • Association learning can design a special loss function and train multiple layers independently without mixing and training in branch expansion networks.
  • Embodiment 6:
  • FIG. 10 is a schematic diagram of the scene recognition process provided by the embodiment of the present application, and the process includes:
  • S201: Acquire an image to be recognized.
  • S202: Input the image to be recognized into a pre-trained scene recognition model, and determine the scene information corresponding to the image to be recognized based on the scene recognition model.
  • the scene recognition method provided in the embodiment of the present application is applied to an electronic device, and the electronic device may be a smart device such as a PC or a tablet computer, or may be a server.
  • the electronic device for scene recognition may be the same as or different from the electronic device for model training in the foregoing embodiments.
  • the electronic device for model training trains the model through the method in the above embodiment, and can directly save the trained scene recognition model in the electronic device for scene recognition, so that subsequent scene recognition can be performed
  • the electronic equipment directly performs corresponding processing through the scene recognition model completed by the training.
  • the image input to the scene recognition model for processing is used as the image to be recognized.
  • the image to be recognized is acquired, the image to be recognized is input into the pre-trained scene recognition model, and the scene information corresponding to the image to be recognized is determined based on the scene recognition model.
  • Embodiment 7:
  • FIG. 11 is a schematic structural diagram of a scene recognition model training device provided in an embodiment of the present application, which includes:
  • the first training unit 11 is configured to obtain the parameters of the core feature extraction layer and the global information feature extraction layer by training the first scene label of the sample image and the standard cross-entropy loss.
  • the second training unit 12 is configured to train the weight parameters of the LCS modules at each level according to the feature maps output by the LCS modules at each level and the loss values calculated pixel by pixel from the first scene label of the sample image.
  • the third training unit 13 is configured to use the first scene label of the sample image and the standard cross-entropy loss to train to obtain the parameters of the fully connected decision-making layer.
  • the device also includes:
  • the fourth training unit 14 is configured to train the weight parameters of the convolution layers at each level of the branch extension structure according to the feature map output by the convolution layers of the branch extension structure and the loss value calculated pixel by pixel from the second scene label of the sample image, and to train the parameters of the local object association relationship learning module through a loss function with a scene confidence regularization term; the granularity of the first scene label differs from that of the second scene label.
  • the first training unit 11 is also configured to upsample the feature maps of different levels in the core feature extraction layer using deconvolution operations with different expansion factors, align the number of channels in the channel dimension using a bilinear interpolation algorithm, add and merge the feature maps of the levels channel by channel, fuse the merged feature map group by convolution, obtain the global information feature vector through channel-wise global average pooling, concatenate the global information feature vector with the fully connected layer (FC) feature vector, and obtain the parameters of the core feature extraction layer and the global information feature extraction layer through standard cross-entropy loss training.
  • the second training unit 12 is also configured to obtain the importance weight of each channel with an activation function through the attention mechanism in the channel dimension, perform a weighted sum of the feature maps of the channels according to their importance weights to obtain a summary heat map, calculate the loss value pixel by pixel according to the summary heat map, the object-scene correlation importance, and the area of the objects, and train the weight parameters of the LCS modules at each level according to the loss value.
  • Embodiment 8:
  • FIG. 12 is a schematic structural diagram of a scene recognition device provided in an embodiment of the present application, which includes:
  • An acquisition unit 21 configured to acquire an image to be identified
  • the recognition unit 22 is configured to input the image to be recognized into a pre-trained scene recognition model, and determine the scene information corresponding to the image to be recognized based on the scene recognition model.
  • the device also includes:
  • the determining unit 23 is configured to determine that the image to be recognized is a violation image in response to determining that the scene information corresponding to the image to be recognized belongs to illegal scene information and the review result of the machine review is that the image to be recognized is a violation image.
  • Embodiment 9:
  • an electronic device is also provided in the embodiment of the present application, as shown in FIG. 13, including a processor 301, a communication interface 302, a memory 303, and a communication bus 304, where the processor 301, the communication interface 302, and the memory 303 communicate with each other through the communication bus 304;
  • a computer program is stored in the memory 303, and when the program is executed by the processor 301, the processor 301 is made to perform the following steps:
  • training the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss;
  • training the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
  • training the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
  • an electronic device is also provided in the embodiment of the present application; since the principle by which the above electronic device solves problems is similar to that of the scene recognition model training method, the implementation of the above electronic device may refer to the implementation of the method, and repeated descriptions are omitted.
  • the embodiment of the present application also provides an electronic device, as shown in FIG. 14, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with each other through the communication bus 404;
  • a computer program is stored in the memory 403, and when the program is executed by the processor 401, the processor 401 is made to perform the following steps:
  • the image to be recognized is input into a pre-trained scene recognition model, and scene information corresponding to the image to be recognized is determined based on the scene recognition model.
  • an electronic device is also provided in the embodiment of the present application; since the principle by which the above electronic device solves problems is similar to that of the scene recognition method, the implementation of the above electronic device may refer to the implementation of the method, and repeated descriptions are omitted.
  • the embodiment of the present application also provides a computer-readable storage medium, where a computer program executable by an electronic device is stored in the computer-readable storage medium.
  • training the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss;
  • training the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
  • training the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
  • an embodiment of the present application also provides a computer-readable storage medium; since the principle by which the processor solves problems when executing the computer program stored on the computer-readable storage medium is similar to that of the scene recognition model training method, the implementation may refer to the implementation of the method, and repeated descriptions are omitted.
  • the embodiment of the present application also provides a computer-readable storage medium, where a computer program executable by an electronic device is stored in the computer-readable storage medium.
  • the image to be recognized is input into a pre-trained scene recognition model, and the scene information corresponding to the image to be recognized is determined based on the scene recognition model.
  • the embodiment of the present application also provides a computer-readable storage medium; since the principle by which the processor solves problems when executing the computer program stored on the above computer-readable storage medium is similar to that of the scene recognition method, the implementation may refer to the implementation of the method, and repeated descriptions are omitted.
  • Embodiments of the present application provide a model training and scene recognition method, device, device, and medium to provide a highly accurate scene recognition solution.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, and the instruction means implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application discloses a model training and scene recognition method, apparatus, device, and medium. When training the scene recognition model, the parameters of the core feature extraction layer and the global information feature extraction layer are first trained using the first scene label of the sample image and a standard cross-entropy loss; then, according to the loss value computed pixel by pixel from the feature map output by the locally supervised learning (LCS) module with attention mechanism at each level and the first scene label of the sample image, the weight parameters of the LCS module at each level are trained; finally, the parameters of the fully connected decision layer of the scene recognition model are trained.

Description

Model training and scene recognition method, apparatus, device and medium
This application claims priority to Chinese patent application No. 202111174534.2, filed with the Chinese Patent Office on October 09, 2021, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of image processing, for example, to a model training and scene recognition method, apparatus, device, and medium.
Background
Machine review technology (machine review for short) is increasingly widely used in large-scale short video/picture review: violation pictures identified by machine review are pushed to staff for review (human review for short), which finally determines whether a picture violates regulations. The emergence of machine review has greatly improved the efficiency of picture review. However, machine review tends to rely on the visual commonality of images to make violation judgments, ignoring changes in the review result caused by changes in the overall environment. For example, in the review of gun violations, when machine review recognizes a gun in an image it generally considers the picture a violation, but the accuracy of such a result is poor, because a gun appearing in an anime or game scene does not make the picture a violation picture. Scene recognition therefore has a great influence on the accuracy of machine review results, and a scene recognition solution is urgently needed.
Summary
The embodiments of the present application provide a model training and scene recognition method, apparatus, device, and medium.
An embodiment of the present application provides a scene recognition model training method. The scene recognition model includes a core feature extraction layer, a global information feature extraction layer connected to the core feature extraction layer, at least one level of locally supervised learning (LCS) module with attention mechanism, and a fully connected decision layer. The method includes:
training the parameters of the core feature extraction layer and the global information feature extraction layer using a first scene label of a sample image and a standard cross-entropy loss;
training the weight parameters of the LCS module at each level according to a loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
training the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
In another aspect, an embodiment of the present application provides a scene recognition method based on the scene recognition model trained by the above method. The method includes:
acquiring an image to be recognized;
inputting the image to be recognized into the pre-trained scene recognition model, and determining scene information corresponding to the image to be recognized based on the scene recognition model.
In another aspect, an embodiment of the present application provides a scene recognition model training apparatus. The apparatus includes:
a first training unit configured to train the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss;
a second training unit configured to train the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
a third training unit configured to train the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
In another aspect, an embodiment of the present application provides a scene recognition apparatus based on the scene recognition model trained by the above apparatus. The apparatus includes:
an acquisition module configured to acquire an image to be recognized;
a recognition module configured to input the image to be recognized into the pre-trained scene recognition model and determine scene information corresponding to the image to be recognized based on the scene recognition model.
In yet another aspect, an embodiment of the present application provides an electronic device. The electronic device includes a processor configured to, when executing a computer program stored in a memory, implement the steps of the above scene recognition model training method or the steps of the above scene recognition method.
In yet another aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above scene recognition model training method or the steps of the above scene recognition method.
The embodiments of the present application provide a model training and scene recognition method, apparatus, device, and medium. The scene recognition model includes a core feature extraction layer, a global information feature extraction layer connected to the core feature extraction layer, LCS modules at each level, and a fully connected decision layer. The method includes:
training the parameters of the core feature extraction layer and the global information feature extraction layer using the first scene label of the sample image and the standard cross-entropy loss;
training the weight parameters of the LCS module at each level according to the loss value computed pixel by pixel from the feature map output by the LCS module at each level and the first scene label of the sample image;
training the parameters of the fully connected decision layer using the first scene label of the sample image and the standard cross-entropy loss.
Brief Description of the Drawings
To describe the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of the scene recognition model training process provided by an embodiment of the present application;
Fig. 2 is a schematic diagram of an application of the scene recognition method provided by an embodiment of the present application;
Fig. 3 is a flowchart of the model backbone training stage provided by an embodiment of the present application;
Fig. 4 is a flowchart of the model branch expansion stage provided by an embodiment of the present application;
Fig. 5 is a schematic structural diagram of the core feature extraction part of the scene recognition model provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of the structure and operating principle of the global information feature extraction layer provided by an embodiment of the present application;
Fig. 7 is a schematic diagram explaining the principle of the local supervised learning module provided by an embodiment of the present application;
Fig. 8 is a schematic diagram of the structure of the expanded branch network of the scene recognition model and the first round of its training provided by an embodiment of the present application;
Fig. 9 is a schematic diagram of the structure and training of the branch expansion stage of the scene recognition model provided by an embodiment of the present application;
Fig. 10 is a schematic diagram of the scene recognition process provided by an embodiment of the present application;
Fig. 11 is a schematic structural diagram of the scene recognition model training apparatus provided by an embodiment of the present application;
Fig. 12 is a schematic structural diagram of the scene recognition apparatus provided by an embodiment of the present application;
Fig. 13 is a schematic structural diagram of an electronic device provided by an embodiment of the present application;
Fig. 14 is a schematic structural diagram of another electronic device provided by an embodiment of the present application.
Detailed Description
The present application is described below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
The special abbreviations or custom terms involved in the embodiments of the present application are explained as follows:
Convolutional neural network: an end-to-end complex mapping used to extract image or video features and to complete visual tasks such as classification and detection based on the extracted features; it is usually built by stacking multiple basic convolution modules.
Convolution layer: an operation layer that performs weighted-sum feature extraction on an image using kernels with a specific receptive field; this layer is generally combined with a nonlinear activation function to improve its mapping capability.
Pooling: an aggregation operation, for example aggregating pixel values over a specific range or a specific dimension, usually including max, min, and average pooling.
Group convolution: the feature maps are divided by channel into several groups, and the feature maps of each group undergo the same or different convolution operations; it can be used to reduce computational cost.
Feature pyramid: a multi-scale feature extraction method that usually takes feature maps from different levels of the network, aligns them through some upsampling scheme, and fuses them to produce multi-scale features.
Residual block: a module composed of multiple convolution layers with a cross-layer bypass connection; it allows building deeper convolutional neural networks while avoiding vanishing gradients and accelerating training.
Heatmap: a feature map that reflects the local importance of an image; generally, the higher the importance, the larger the local heat value, or vice versa.
Local supervised learning: learning parameters or extraction capability for certain parts of a model, or certain local regions of a feature map, using directly connected labels and losses.
Attention mechanism: a mechanism that forces the network to focus on important regions by fitting the importance of different parts, and to make decisions based on the features of those regions.
Sigmoid: an activation function that does not consider mutual exclusion between classes; the activated output values usually fall in the interval [0, 1] for normalization.
Deformable convolution: a convolution operation whose kernel is not of a regular geometric shape; the irregular shape is usually generated by adding offsets to the original shape.
Standard cross-entropy: a conventional loss function for simple classification problems, commonly used to train classification networks, including single-label and multi-label classification.
Focal loss: a loss function for class-imbalance problems; it gives a larger penalty to classes with less data, preventing the model from leaning entirely toward classes with more data.
It should be noted that the embodiments of the present application are not applied directly in the machine review stage to directly produce review-push results; instead, the scene information required by specific machine review models is output in the form of scene signals, and the push result is produced jointly with the machine review models through a suitable strategy. Videos or images that the final push result considers violating are pushed to the human review stage for multiple rounds of review to obtain a penalty result; videos or images that the final push result considers normal are also sampled and inspected in different regions at a certain sampling rate, or pushed to human review for re-examination according to reports, so as to avoid missing seriously violating videos/images.
Embodiment 1:
Fig. 1 is a schematic diagram of the scene recognition model training process provided by an embodiment of the present application. The process includes the following steps:
S101: train, using a first scene label of a sample image and a standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer.
S102: train the weight parameters of the LCS modules of the respective levels according to loss values computed pixel by pixel from the feature maps output by the LCS modules of the respective levels and the first scene label of the sample image.
S103: train, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the fully connected decision layer.
The scene recognition model includes a core feature extraction layer and, connected to the core feature extraction layer, a global information feature extraction layer, the LCS modules of the respective levels, and a fully connected decision layer.
The scene recognition method provided by the embodiments of the present application is applied to an electronic device, which may be a smart device such as a personal computer (PC) or a tablet computer, or may be a server.
To meet the requirements of scene recognition at different granularities, in the embodiments of the present application the scene recognition model further includes a branch expansion structure; the branch expansion structure includes convolution layers and a local object association relation module.
The weight parameters of the convolution layers of the respective levels of the branch expansion structure are trained according to loss values computed pixel by pixel from the feature maps output by the convolution layers of the branch expansion structure and a second scene label of the sample image; the parameters of the local object association relation module are trained using a loss function with a scene-confidence regularization term; the first scene label and the second scene label have different granularities.
Generally, the first scene label is a coarse-grained scene label and the second scene label is a fine-grained scene label.
As shown in Fig. 2, the multi-level fine-grained scene recognition model proposed in the embodiments of the present application runs in a synchronized manner. For an image to be recognized in a certain video or image set, it executes before the other existing machine review models in the machine review pipeline and stores the resulting scene information in a cache queue; a series of machine review models (machine review models a, b, c, etc. in Fig. 2) are then activated by a synchronization signal and run with the images to be recognized from the same video or image set as input, yielding preliminary review-push results. A machine review model/label configured with a scene strategy takes the corresponding scene signal from the cache queue and computes the final push result together with the preliminary machine review result, deciding whether to push to the human review stage; a machine review model/label without a scene strategy decides whether to push directly according to the result given by the machine review model.
If the scene information determined for the image to be recognized belongs to violating scene information and the machine review result is that the image to be recognized is a violating image, the image to be recognized is determined to be a violating image and is pushed to the human review stage. If the determined scene information does not belong to violating scene information, or the machine review result is that the image to be recognized is not a violating image, the image to be recognized is determined not to be a violating image; in this case it is not pushed to human review, or it is sampled and inspected in different regions at a certain sampling rate, or pushed to the human review stage for re-examination according to reports.
The electronic device may store in advance which scene information belongs to violating scene information, so that after the scene information corresponding to the image to be recognized is determined, it can be judged whether that scene information belongs to violating scene information. The process by which the machine review model reviews whether the image to be recognized is a violating image can use related techniques and is not described again here.
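The decision rule described above can be summarized as a short predicate. The following is a minimal illustrative Python sketch; the function name and field names (should_push_to_human_review, violating_scenes) are assumptions made for illustration and are not part of the original disclosure.

```python
def should_push_to_human_review(scene_info: str,
                                machine_review_violation: bool,
                                violating_scenes: set) -> bool:
    """Combine the scene signal with the preliminary machine review result.

    Only when the recognized scene belongs to the pre-stored violating scene
    set AND the machine review model also flags the image is the image treated
    as violating and pushed to the human review stage.
    """
    scene_is_violating = scene_info in violating_scenes
    return scene_is_violating and machine_review_violation


# Illustrative usage (scene names and values are hypothetical):
violating_scenes = {"real_world_street", "live_broadcast"}
print(should_push_to_human_review("anime", True, violating_scenes))  # False
```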
Fig. 3 is a schematic diagram of the training process of the backbone structure of the scene recognition model provided by an embodiment of the present application, and Fig. 4 is a schematic diagram of the training process of the branch expansion structure of the scene recognition model provided by an embodiment of the present application.
Figs. 3 and 4 show the overall training process of the multi-level fine-grained scene recognition model proposed in the embodiments of the present application. It consists of two stages, the "model backbone training stage" and the "model branch expansion stage"; after training, a scene recognition model with the structure shown on the left side of Fig. 2 is generated and used to improve the precision of machine review models. For a scene recognition model, the feature extraction capability of the backbone part is very important, because the components of the backbone generally serve as upstream components of the branch expansion structures and affect the feature extraction capability of the specific fine-grained branches. To substantially improve the feature extraction capability of the model backbone and let it mine highly rich features, as shown in Fig. 3, the backbone training stage uses three rounds of training strategies with different targets to train the modules of the backbone. In the first round, the core feature extraction layer and the multi-scale global information feature extraction layer of the scene recognition model are specifically optimized over multiple iterations using the first scene label (generally an image-level scene semantic label at this point) and a standard cross-entropy loss; the weight parameters of the local supervised learning modules are not yet optimized. The first round mainly aims to give the model the ability to extract abstract semantic features and multi-scale global features. Then, the parameters (weight parameters) optimized in the first round are frozen, and an "attention-based local supervised learning module" (Local Supervised module, LCS module) is attached to the convolutional feature map group after each pooling layer; each level takes the pooled convolutional feature map group as input and, through the LCS module, outputs a "focused" feature map. Using this feature map and the first scene label of the sample image (generally a bounding-box-level label at this point), the LCS module weight parameters of each level are optimized with a pixel-wise binary sigmoid loss. After the second round of optimization, the model is sensitive to local object features; local object features can be extracted through the LCS modules and encoded through Fisher convolutional feature encoding, which reduces feature redundancy while minimizing the loss of subtle decision-relevant features. The third round of optimization focuses on the decision-layer weight parameters for the fused features: the fully connected output layer used in the first round is removed and replaced with a fully connected decision layer corresponding to the fusion of the three kinds of features, which is likewise trained using the first scene label (generally an image-level scene semantic label) and the standard cross-entropy loss, with the weight parameters outside the decision layer frozen.
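The three training rounds mainly differ in which parameter groups are trainable. The following is a minimal PyTorch-style sketch of that freezing logic, assuming the model exposes attribute names such as core, global_info, lcs_modules, and decision_fc; these names and the helper set_trainable are illustrative assumptions, not the original implementation.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    # Freeze or unfreeze all parameters of a sub-module.
    for p in module.parameters():
        p.requires_grad = trainable

def configure_round(model: nn.Module, round_idx: int) -> None:
    """Round 1: core + global information layers; round 2: LCS modules only;
    round 3: fully connected decision layer only (attribute names assumed)."""
    set_trainable(model, False)
    if round_idx == 1:
        set_trainable(model.core, True)
        set_trainable(model.global_info, True)
    elif round_idx == 2:
        for lcs in model.lcs_modules:
            set_trainable(lcs, True)
    elif round_idx == 3:
        set_trainable(model.decision_fc, True)
```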
After the model backbone training stage ends, branches can be expanded on the backbone structure according to subsequent fine-grained scene requirements. Fig. 4 gives an example of expanding one branch. A branch generally starts from the output of a certain convolution layer of the backbone and is connected to several convolution-pooling operation layers; during training, the weight parameters of the backbone are frozen, and training is completed in two rounds of optimization. The first round of optimization uses a pixel-wise binary sigmoid loss to optimize the associated convolution layers directly, no longer using an LCS module as a relay; it mainly aims to give the expanded branch the ability to learn local object features. The second round of optimization embeds several "local object association relation learning modules" composed of "deformable convolutions" in the expanded branch; on the basis that the branch already has local object feature extraction capability, the ability to mine the association relations of local objects is learned through a focal loss with a scene-confidence regularization term. Using a similar method, multiple different branches can be expanded on the backbone network to handle different task requirements.
The embodiments of the present application provide a solution for image scene recognition based on a scene recognition model. When training the scene recognition model, the parameters of the core feature extraction layer and of the global information feature extraction layer are first trained using the first scene label of the sample image and a standard cross-entropy loss; then the weight parameters of the LCS modules of the respective levels are trained according to loss values computed pixel by pixel from the feature maps output by the LCS modules of the respective levels and the first scene label of the sample image; finally, the parameters of the fully connected decision layer of the scene recognition model are trained. This gives the scene recognition model the ability to extract highly rich features, and performing scene recognition based on this model greatly improves the accuracy of scene recognition. In addition, the scene recognition model further includes a branch expansion structure, thereby meeting the requirements of scene recognition at different granularities.
As can be seen from Fig. 4, the core feature extraction layer in the scene recognition model is connected respectively to the global information feature extraction layer, the LCS modules of the respective levels, the fully connected decision layer (the Fully Connected Layer (FC layer) in Fig. 4), and the branch expansion structure. When performing scene recognition on an image to be recognized based on the scene recognition model, the core feature extraction layer first performs feature extraction on the image, and the results are then output to the global information feature extraction layer, the LCS modules of the respective levels, the fully connected decision layer, and the branch expansion structure, which process the received feature maps in parallel to obtain the final scene recognition result.
Figs. 5 to 9 show the details of the scene recognition model and of the model training.
Embodiment 2:
The core feature extraction layer includes a grouped multi-receptive-field residual convolution module of a first type and a grouped multi-receptive-field residual convolution module of a second type;
the grouped multi-receptive-field residual convolution module of the first type includes a first group, a second group, and a third group; the convolution sizes of the first, second, and third groups are different, and the first, second, and third groups include residual-calculation bypass structures; each group outputs a feature map through convolution operations and residual calculation, the feature maps output by the groups are concatenated along the channel dimension and channel-shuffled, and the result is output to the next module after convolutional fusion;
the grouped multi-receptive-field residual convolution module of the second type includes a fourth group, a fifth group, and a sixth group; the convolution sizes of the fourth, fifth, and sixth groups are different, and the fifth group and the sixth group respectively include a 1×1 convolution bypass structure and a residual-calculation bypass structure; the feature maps output by the groups are concatenated along the channel dimension and channel-shuffled, and the result is output to the next module after convolutional fusion.
In the scene recognition model proposed in the embodiments of the present application, the structure of the core feature extraction layer is shown on the right side of Fig. 5 and is composed of a series of "grouped multi-receptive-field residual convolution modules". The left side of Fig. 5 describes the two types of grouped multi-receptive-field residual convolution modules, which consist of three convolution branches with different receptive fields. To save computation, the feature map group computed by the previous module is split into three groups that are sent to different convolution branches for convolution, further extracting features. The grouped multi-receptive-field residual convolution module of the first type is the "GM-Resblock" in Fig. 5. To cover different receptive fields, the convolution kernels of the three branches (the first, second, and third groups) use the three different sizes 1x1, 3x3, and 5x5, where the 5x5 convolution is replaced by two layers of 3x3 convolutions; this keeps the same receptive field while increasing the number of nonlinear mappings and improving fitting capability. Each branch also adds a bypass structure for computing residuals, so that the model depth can be increased while avoiding vanishing gradients. The embodiments of the present application adopt multi-receptive-field convolution as the basic module mainly because scene recognition is a visually complex problem: local features at different scales may all influence the scene decision, and the multi-receptive-field mechanism can capture as many decision-promoting factors as possible. To keep the channel count regular, the outputs of the three convolution branches of the GM-Resblock are concatenated along the channel dimension and channel-shuffled, and finally fused with a 1x1 convolution before being passed to the next module. It should be noted that the grouped multi-receptive-field residual convolution module of the second type is the "GM projection block" in Fig. 5; it includes a fourth, fifth, and sixth group whose convolution kernels likewise use the sizes 1x1, 3x3, and 5x5, with the 5x5 convolution replaced by two layers of 3x3 convolutions. The GM projection block is also used to downsample the feature maps, so its structure is slightly modified: the bypass of the 1x1 convolution branch is removed, and 1x1 convolutions are added to the bypasses of the 3x3 and 5x5 convolution branches to keep the feature map size and channel count consistent. To keep the channel count regular, the outputs of the three convolution branches of the GM projection block are concatenated along the channel dimension and channel-shuffled, and finally fused with a 1x1 convolution before being passed to the next module.
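As a concrete illustration of the grouped multi-receptive-field residual block described above, the following is a minimal PyTorch sketch. The even three-way channel split, the activation placement, and the absence of normalization layers are assumptions for brevity; this is not the patent's reference implementation.

```python
import torch
import torch.nn as nn

def channel_shuffle(x: torch.Tensor, groups: int) -> torch.Tensor:
    # Interleave the channels coming from the three branches after concatenation.
    n, c, h, w = x.size()
    x = x.view(n, groups, c // groups, h, w).transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

class GMResBlock(nn.Module):
    """Grouped multi-receptive-field residual block (illustrative sketch;
    assumes the channel count is divisible by 3 and splits it evenly)."""
    def __init__(self, channels: int):
        super().__init__()
        assert channels % 3 == 0
        g = channels // 3
        self.b1 = nn.Conv2d(g, g, 1)                  # 1x1 branch
        self.b3 = nn.Conv2d(g, g, 3, padding=1)       # 3x3 branch
        self.b5 = nn.Sequential(                      # 5x5 receptive field as two 3x3 convs
            nn.Conv2d(g, g, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(g, g, 3, padding=1),
        )
        self.fuse = nn.Conv2d(channels, channels, 1)  # 1x1 fusion before the next module
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = x.size(1) // 3
        x1, x3, x5 = x[:, :g], x[:, g:2 * g], x[:, 2 * g:]
        y1 = self.b1(x1) + x1                         # residual bypass on each branch
        y3 = self.b3(x3) + x3
        y5 = self.b5(x5) + x5
        y = channel_shuffle(torch.cat([y1, y3, y5], dim=1), groups=3)
        return self.act(self.fuse(y))
```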
Embodiment 3:
Training, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer includes:
upsampling the feature maps of different levels in the core feature extraction layer using deconvolution operations with different dilation factors; aligning the channel counts in the channel dimension using a bilinear interpolation algorithm; adding and merging the feature maps of the respective levels channel by channel; performing convolutional fusion on the merged feature map group and obtaining a global information feature vector through channel-wise global average pooling; concatenating the global information feature vector with the fully connected (FC) feature vector; and training, using the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer.
In the first round of the model backbone training stage, the global information feature extraction module is trained together with the core feature extraction layer. Fig. 6 briefly shows its details and principle. To extract the global information of an image, the present application starts from multi-scale feature maps and fuses global information at different scales to obtain a high-quality global information feature vector. Compared with a single-scale global information feature, multi-scale features can, on the one hand, reduce information loss and, on the other hand, make the model more sensitive to important regions at the global spatial level. Borrowing the idea of the feature pyramid, the embodiments of the present application upsample the feature map groups of different levels in the core feature extraction layer using deconvolution operations with different dilation factors to ensure consistent feature map sizes. Deconvolution rather than ordinary padding-based upsampling is used here mainly to alleviate the image distortion caused by upsampling. After upsampling, the channel counts of the feature maps at different levels are still inconsistent, so the missing channels are simply supplemented by cyclically applying a bilinear interpolation algorithm in the channel dimension to align the channel counts. The feature maps of the respective levels are then merged by channel-wise addition, the merged feature map group is fused with a 1x1 convolution, and a global information feature vector is obtained through channel-wise global average pooling. According to Fig. 3, this vector is concatenated with the FC feature vector that records abstract features and then connected to the standard cross-entropy loss for optimization.
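The following is a minimal PyTorch sketch of this multi-scale global information extraction, assuming three input levels. The transposed-convolution configuration (used here in place of the patent's dilated deconvolutions), the choice of the widest level as the channel-alignment target, and the channel-axis interpolation used to supplement missing channels are all assumptions, not the disclosed design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalInfoExtractor(nn.Module):
    """Multi-scale global information feature extraction (illustrative sketch)."""
    def __init__(self, in_channels=(128, 256, 512)):
        super().__init__()
        c3, c4, c5 = in_channels
        self.up4 = nn.ConvTranspose2d(c4, c4, 4, stride=2, padding=1)  # 2x spatial upsampling
        self.up5 = nn.ConvTranspose2d(c5, c5, 8, stride=4, padding=2)  # 4x spatial upsampling
        self.target = c5                                                # align to the widest level
        self.fuse = nn.Conv2d(self.target, self.target, 1)             # 1x1 fusion

    def _align_channels(self, x: torch.Tensor) -> torch.Tensor:
        # Supplement missing channels by interpolating along the channel axis.
        n, c, h, w = x.shape
        if c == self.target:
            return x
        flat = x.permute(0, 2, 3, 1).reshape(n * h * w, 1, c)
        flat = F.interpolate(flat, size=self.target, mode="linear", align_corners=False)
        return flat.reshape(n, h, w, self.target).permute(0, 3, 1, 2)

    def forward(self, c3, c4, c5):
        f3 = self._align_channels(c3)
        f4 = self._align_channels(self.up4(c4))
        f5 = self.up5(c5)
        fused = self.fuse(f3 + f4 + f5)                     # channel-wise addition, then fusion
        return F.adaptive_avg_pool2d(fused, 1).flatten(1)   # global information feature vector
```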
Embodiment 4:
Training the weight parameters of the LCS modules of the respective levels according to the loss values computed pixel by pixel from the feature maps output by the LCS modules of the respective levels and the first scene label of the sample image includes:
obtaining the importance weight of each channel using an activation function through a channel-dimension attention mechanism, and performing a weighted sum of the feature maps of the channels according to the importance weight of each channel to obtain a summary heatmap;
computing a loss value pixel by pixel according to the summary heatmap, the object-scene association importance, and the areas of the objects, and training the weight parameters of the LCS modules of the respective levels according to the loss value.
In the model backbone part, another important step is to give it a good capability to extract local object features as well. As shown in Fig. 7, after the first round of the model backbone training stage is completed, and likewise at multiple levels, the present application proposes using an "attention-based local supervised learning module" and a "local object supervision loss" to enhance the extraction capability of this part of the model. The structure of the attention-based local supervised learning module (LCS module) is shown at the lower left of Fig. 7. First, the feature map group of each level is taken out and mapped through a 3x3 convolution while keeping the channel count unchanged. Considering that, at the same pixel position, the feature maps of different channels are not equally important, the embodiments of the present application control the importance of the feature maps of different channels through a channel-dimension attention mechanism while downsampling the size, in order to obtain a summary heatmap that more accurately indicates the pixel information at each position and better guides the LCS module in learning local object features. In addition, the LCS module uses ordinary 3x3 convolution rather than pooling to perform the downsampling, in order to avoid excessive shifts of local object activations while reducing redundancy. The channel-dimension attention mechanism uses Sigmoid activation to obtain the importance weights, because the importance of different channels is not mutually exclusive; an importance weight vector is finally output through a fully connected layer, and the importance values are multiplied by the corresponding channels to complete the weighting of the channel feature maps.
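The following is a minimal PyTorch sketch of such an attention-based local supervised learning module. The reduction ratio of the attention MLP and the use of a single strided 3x3 convolution for downsampling are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class LCSModule(nn.Module):
    """Attention-based local supervised learning (LCS) module, illustrative sketch."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.mapping = nn.Conv2d(channels, channels, 3, padding=1)         # 3x3 mapping, channels kept
        self.down = nn.Conv2d(channels, channels, 3, stride=2, padding=1)  # conv downsampling, not pooling
        self.attn = nn.Sequential(                                          # channel attention, Sigmoid (non-exclusive)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor):
        f = self.down(self.mapping(x))
        w = self.attn(f.mean(dim=(2, 3)))          # per-channel importance weights
        f = f * w.unsqueeze(-1).unsqueeze(-1)      # weight each channel's feature map
        heatmap = f.sum(dim=1)                     # summary heatmap used by the pixel-wise loss
        return f, heatmap
```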
After the LCS module outputs the attention-enhanced feature map group, a "local object supervision loss" is specifically attached to guide this module, in a supervised way, to learn the ability to extract local object features. For example, the attention-enhanced feature map group is first summed pixel by pixel across channels to obtain a heatmap reflecting the activation at each pixel position. This heatmap and a mask map, marked based on the bounding-box objects and the "object-scene association importance", are then used to compute a loss, which is back-propagated. The mask map is a label obtained, on the basis of image-level scene semantic labels, according to the degree of influence of the objects in the scene image on the scene decision: each object in the image is given a mask according to the bounding-box region it occupies; an object with a large influence on the scene decision is marked "important" (mask value 1.0); a common object with a small influence on the scene decision that appears in multiple scenes is marked "unimportant" (mask value 0.5); the background mask value is 0.0. To achieve the effect of "local supervised learning", the loss uses a pixel-wise binary sigmoid loss, and the penalty weights are selected according to the ratio of the area of the "important objects" to the area of the "unimportant objects". When the area of the "important objects" is much smaller than that of the "unimportant objects", the relative gap between the penalty weight of the "important objects" and that of the "unimportant objects" is widened, so that the LCS module puts more learning effort on the "important objects" when they are small targets, avoiding a bias toward learning the "unimportant objects" or the "background". Note that, because the goal of the LCS module is to extract local object features, the penalty weight of the "background" takes a relatively small value in both cases. The loss expression takes the following form:

$$L_{LCS}=\frac{1}{HW}\sum_{i=1}^{H}\sum_{j=1}^{W} l_{bsigmoid}\left(p_{i,j},\ mask_{i,j}\right)$$

$$l_{bsigmoid}\left(p_{i,j},\ mask_{i,j}\right)=-\,w_{i,j}\left[mask_{i,j}\log p_{i,j}+\left(1-mask_{i,j}\right)\log\left(1-p_{i,j}\right)\right]$$

$$w_{i,j}=\begin{cases}\lambda_{im}\ \text{(or }\lambda'_{im}\text{)}, & mask_{i,j}=1.0\\[2pt] \lambda_{unim}\ \text{(or }\lambda'_{unim}\text{)}, & mask_{i,j}=0.5\\[2pt] \lambda_{back}, & mask_{i,j}=0.0\end{cases}\qquad\text{where the primed weights are used when }\ \frac{mask\_area_{im}}{mask\_area_{unim}}<T_{area}$$

Here p_{i,j} denotes the activation value of a pixel on the heatmap, mask_{i,j} denotes the pixel-level label, and area denotes an area; in the present application, λ_im, λ_unim, λ'_im, λ'_unim, and λ_back take the values 0.8, 0.6, 1.0, 0.5, and 0.3, respectively. Note that when training the LCS modules, the module of each level is directly connected to the loss and back-propagates independently, and the mask map is downsampled accordingly as needed.
H and W denote the height and width of the image, i and j denote the row and column indices of a pixel, l_bsigmoid denotes how the loss value of each pixel is computed, T_area denotes the threshold that triggers the different ways of computing the loss, mask_area_im denotes the mask region area of the important objects, and mask_area_unim denotes the mask region area of the ordinary objects; mask_area_im and mask_area_unim can be marked manually.
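A compact Python sketch of this pixel-wise weighted binary sigmoid loss is given below. The threshold value t_area and the exact weight-selection rule are assumptions reconstructed from the description above, not values taken from the patent.

```python
import torch
import torch.nn.functional as F

def lcs_loss(heatmap: torch.Tensor, mask: torch.Tensor, t_area: float = 0.25) -> torch.Tensor:
    """Pixel-wise weighted binary sigmoid loss (illustrative sketch).

    heatmap: (H, W) raw activations; mask: (H, W) with values in {0.0, 0.5, 1.0}.
    """
    lam = {"im": 0.8, "unim": 0.6, "im_s": 1.0, "unim_s": 0.5, "back": 0.3}
    area_im = (mask == 1.0).float().sum()
    area_unim = (mask == 0.5).float().sum().clamp(min=1.0)
    small_important = (area_im / area_unim) < t_area          # widen the weight gap

    weights = torch.full_like(mask, lam["back"])
    weights[mask == 1.0] = lam["im_s"] if small_important else lam["im"]
    weights[mask == 0.5] = lam["unim_s"] if small_important else lam["unim"]

    per_pixel = F.binary_cross_entropy_with_logits(heatmap, mask, reduction="none")
    return (weights * per_pixel).mean()
```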
After the LCS module is trained, the features it directly extracts are still feature map groups of size HxWxC. Used directly as features, the redundancy would still be too large, while using a nonlinear fully connected layer to extract a feature vector would cause some subtle decisive features to be lost. Therefore, the embodiments of the present application use the Fisher convolutional encoding method to reduce the dimensionality of the feature maps, using Fisher convolutional feature encoding to extract local object feature vectors, which reduces the loss of subtle decisive features while avoiding the geometric-transformation effects brought by redundant features. The procedure of Fisher convolutional feature encoding is relatively simple: it mainly uses several general Gaussian distributions to mix the vectors at different pixels, reducing the number of features in the spatial dimensions. The steps are as follows (a sketch of this encoding is given after the steps):
Flatten the feature map in the spatial dimensions so that it is represented as HxW C-dimensional vectors.
Use PCA to reduce each C-dimensional vector to M dimensions.
Compute K Gaussian mixture parameter values on the HxW M-dimensional vectors using K Gaussian distributions.
Evolve the HxW M-dimensional vectors into K M-dimensional Gaussian vectors.
Compute the mean vector and variance vector of all the Gaussian vectors, concatenate them and apply L2 normalization, and finally output a local object feature vector of length 2MK; each level outputs one vector.
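The following NumPy/scikit-learn sketch illustrates the encoding steps just listed. Because the patent does not spell out the exact statistics, first- and second-order soft-assignment statistics are used here as an assumption; fitting PCA and the Gaussian mixture on a single feature map (rather than offline on many samples) is also a simplification for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def fisher_encode(feature_map: np.ndarray, m_dims: int = 16, k_gauss: int = 4) -> np.ndarray:
    """Simplified Fisher convolutional feature encoding (illustrative sketch).

    feature_map: array of shape (H, W, C); returns a vector of length 2*M*K.
    """
    h, w, c = feature_map.shape
    x = feature_map.reshape(h * w, c)                       # HxW C-dimensional vectors
    x = PCA(n_components=m_dims).fit_transform(x)           # reduce each vector to M dimensions

    gmm = GaussianMixture(n_components=k_gauss, covariance_type="diag").fit(x)
    post = gmm.predict_proba(x)                             # (HW, K) soft assignments
    mu, var = gmm.means_, gmm.covariances_                  # (K, M) each

    diff = x[:, None, :] - mu[None, :, :]                   # (HW, K, M)
    # Per-component first-order (mean) and second-order (variance) statistics.
    f_mean = (post[..., None] * diff / np.sqrt(var)).sum(axis=0)
    f_var = (post[..., None] * (diff ** 2 / var - 1.0)).sum(axis=0)

    fv = np.concatenate([f_mean.ravel(), f_var.ravel()])    # length 2*M*K
    return fv / (np.linalg.norm(fv) + 1e-12)                # L2 normalization
```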
Unlike the global information feature extraction, here the features of different levels are no longer fused but are output separately, in order to capture some subtle local object characteristics. As shown in step 3 of Fig. 3, after the local object features and the global information features are obtained, these features are further combined with the FC abstract features to rebuild a backbone decision layer oriented toward highly rich features, and high-precision decisions are made using these features.
Embodiment 5:
The branch expansion structure is constructed using depthwise separable convolution residual blocks (DW); in the main path of the residual block, the middle layer uses a DW convolution layer, and 1x1 convolution layers are used before and after the DW convolution.
The local object association relation learning module includes a deformable convolution layer, a convolution layer, and an average pooling layer;
the deformable convolution layer obtains the convolution kernel offset values for the current pixel position; the current position of a convolution kernel parameter plus the offset value serves as its actually effective position; the feature map pixel values at the actually effective positions are obtained, and a feature map is output after a convolution operation and an average pooling operation.
After the training of the model backbone is completed, the branch expansion stage begins. Branches are usually expanded according to new fine-grained scene requirements, and a suitable network structure can be chosen to design a new branch as required. Considering the multiple expandability of branches, and in order to control the overhead of each branch, the embodiments of the present application construct branches using depthwise separable convolution residual blocks (Depth-Wise, DW), as shown in Fig. 8. In the main path of the residual block, the middle layer uses DW convolution to replace an ordinary convolution layer, reducing the computation cost by about two thirds; 1x1 convolutions before and after the DW convolution implement the inverse channel contraction operation, and the output uses linear activation, mainly to avoid ReLU discarding too many features when activating negative values. The present application finally concatenates three modules (branch module components a, b, and c) to form a fine-grained branch. Since the branch is expanded from the core feature extraction layer of the scene model backbone, which was not specifically optimized for local object feature learning, the corresponding levels of the expanded branch network are directly connected to the LCS loss proposed above for pre-training optimization. Unlike the training of the backbone part, no additional LCS module is added here; instead, the convolution layer parameters are shared with the expanded branch network. This is done, on the one hand, to reduce overhead and, on the other hand, so that in the second round of training of the branch expansion stage the branch can, on the basis of local object feature extraction, simultaneously learn the association relations of local objects, combining local object features and the global spatial associations of local objects to recognize fine-grained complex scenes.
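The following is a minimal PyTorch sketch of the depthwise separable residual block used in the expanded branch. The expansion ratio of the 1x1 convolutions and the omission of normalization layers are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class DWResBlock(nn.Module):
    """Depthwise separable convolution residual block (illustrative sketch)."""
    def __init__(self, channels: int, expand: int = 2):
        super().__init__()
        mid = channels * expand
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1),                    # 1x1 expansion
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, groups=mid),  # depthwise (DW) 3x3
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),                    # 1x1 projection, linear output
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Linear (non-ReLU) output of the main path plus the residual bypass.
        return x + self.block(x)
```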
To obtain the ability to learn association relations on top of the local object feature extraction capability, "local object association relation learning modules" are embedded in the expanded branch in the second round of the branch expansion stage; such a module is mainly composed of a deformable convolution layer, a 1x1 convolution layer, and an average pooling layer. The deformable convolution layer is its core: it uses a deformed convolution kernel when performing the convolution operation, mainly because the global spatial association of local objects is generally not a regular geometric shape, and its association logic can be modeled more accurately with a deformed kernel. The execution of deformable convolution is simple: before performing the convolution, a branch first obtains the kernel offsets for the current pixel position, including X and Y offsets (because the kernel parameters usually only need to attend to the spatial dimensions); the current position of a kernel parameter plus the offset value then serves as its actually effective position. Since the coordinates of this position may be floating-point numbers, the feature map pixel value corresponding to the kernel parameter can be obtained by bilinear interpolation. After the deformable convolution, a 1x1 convolution and an average pooling operation (not global average pooling, so the size is unchanged) are also performed, mainly to smooth the output. It should be noted that the association relation learning modules are trained together with the expanded branch network, and fine-grained scene categories usually exhibit class imbalance and cross-class feature overlap. This round of training therefore adopts focal loss as the main part of the loss function; this loss gives more training attention to classes with fewer samples and is also well suited as a multi-label training loss. In addition, the present application uses the confidence scores of the backbone part for each scene as a regularization term to improve the efficiency of this round of training. The loss function takes the following form:
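The offset-then-sample behavior of the deformable convolution and the smoothing layers can be sketched with torchvision's DeformConv2d as below. The kernel size and the configuration of the offset-predicting branch are assumptions made for illustration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RelationModule(nn.Module):
    """Local object association relation learning module (illustrative sketch)."""
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        # A plain conv branch predicts per-position (x, y) offsets for the k*k kernel.
        self.offset = nn.Conv2d(channels, 2 * k * k, k, padding=k // 2)
        self.deform = DeformConv2d(channels, channels, k, padding=k // 2)
        self.fuse = nn.Conv2d(channels, channels, 1)     # 1x1 smoothing convolution
        self.pool = nn.AvgPool2d(3, stride=1, padding=1)  # average pooling, size unchanged

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset(x)          # kernel offsets at each pixel position
        y = self.deform(x, offsets)       # deformable convolution with bilinear sampling
        return self.pool(self.fuse(y))
```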
$$L = L_{focal} + \lambda \sum_{i=1}^{Num_{class}} \left(s_i\right)^2$$

where $L_{focal}$ denotes the standard focal loss, $s_i$ denotes the confidence score of the image, in the backbone part, for a certain scene class $i$, and $\sum_{i=1}^{Num_{class}}\left(s_i\right)^2$ is the regularization term; the present application uses an L2 regularization term as the augmented penalty term. $Num_{class}$ denotes the number of classes. Branch expansion can be performed at any level of the backbone recognition feature extraction layer and unfolds in a tree-like manner.
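A compact Python sketch of this branch loss is given below: a multi-label focal loss plus an L2 penalty over the backbone's per-scene confidence scores. The hyper-parameters gamma, alpha, and reg_weight are assumed values, not values disclosed in the patent.

```python
import torch
import torch.nn.functional as F

def branch_loss(logits: torch.Tensor, targets: torch.Tensor,
                backbone_scene_conf: torch.Tensor,
                gamma: float = 2.0, alpha: float = 0.25,
                reg_weight: float = 0.1) -> torch.Tensor:
    """Multi-label focal loss with a scene-confidence L2 regularizer (sketch)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    focal = (alpha_t * (1 - p_t) ** gamma * ce).mean()

    reg = (backbone_scene_conf ** 2).sum(dim=-1).mean()   # L2 penalty over scene confidences
    return focal + reg_weight * reg
```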
The embodiments of the present application bring the following beneficial effects:
The present application uses a three-stage training scheme to train the backbone feature extraction part of the model from the three perspectives of abstract features, global information features, and local object features, giving the model the ability to extract highly rich features and to make scene decisions based on them, greatly improving scene recognition precision.
Combining the idea of the feature pyramid, the present application mines global information features at multiple scales, avoiding the loss of global spatial association information caused by excessive downsampling and nonlinear transformation, providing high-quality global information features and improving the recognition capability for background-type scenes.
Through local supervised learning at multiple levels, the present application provides local object feature extraction capability for the different levels; compared with local object feature extraction at a single level, this reduces the loss of subtle scene-decision information and enriches the local object features.
Through the attention mechanism, the present application strengthens the attention of the local supervised learning module to the different channels, enhances the activation of important local object features, and points the way for the subsequent Fisher encoding.
The present application is the first to propose optimization based on the summary heatmap, combined with bounding-box-level local object importance, using a new pixel-wise binary Sigmoid loss, forcing the local supervised learning module to focus on learning "important local objects" and reducing the interference of "unimportant local objects" and "background" with the decision.
The present application uses Fisher convolutional encoding to extract feature vectors from the feature maps, reducing redundancy while avoiding the information loss caused by excessive abstraction.
In the backbone training stage, to increase feature richness, the present application uses multi-branch residual convolution as the basic module, ensuring feature extraction capability; in the model branch expansion stage, the present application uses strategies such as depthwise separable convolution and shared local learning modules to reduce overhead.
The present application is the first to propose building the association relation learning module with deformable convolution, using the geometric flexibility of deformable convolution to accurately model the association relations of local objects.
The present application also uses the scene confidences of the backbone part as a regularization term, combined with the focal loss, to optimize fine-grained scene recognition under class imbalance well.
In the first round of the model backbone training stage, the focal loss can also be used to fully train only the core feature extraction layer, and the global information feature extraction module can then be trained separately.
The global information feature extraction module can simply use two deconvolution layers to complete size upsampling and channel expansion at the same time, but this slows down convergence.
The global information feature extraction module can also use a channel-level attention mechanism and a fully connected layer to complete feature fusion.
The local supervised learning module can be trained together with the fully connected layer using image-level semantic labels combined with an auxiliary loss.
The fine-grained branch expansion network can also be expanded on an existing branch expansion network, rather than always taking the backbone network as the starting point of expansion.
The backbone part of the model can also use basic modules based on depthwise separable convolution to reduce overhead, and an nxn convolution can also be converted into equivalent 1xn and nx1 convolutions to reduce overhead.
Association relation learning can use a specially designed loss function and be trained independently at multiple levels, without being mixed with the branch expansion network for joint training.
Embodiment 6:
Fig. 10 is a schematic diagram of the scene recognition process provided by an embodiment of the present application. The process includes:
S201: acquire an image to be recognized.
S202: input the image to be recognized into a pre-trained scene recognition model, and determine, based on the scene recognition model, the scene information corresponding to the image to be recognized.
The scene recognition method provided by the embodiments of the present application is applied to an electronic device, which may be a smart device such as a PC or a tablet computer, or may be a server. The electronic device that performs scene recognition may be the same as or different from the electronic device that performs model training in the above embodiments.
Since the model training process is generally offline, the electronic device that performs model training can, after training the model by the method in the above embodiments, directly store the trained scene recognition model in the electronic device that performs scene recognition, so that the latter can subsequently perform the corresponding processing directly through the trained scene recognition model.
In the embodiments of the present application, the image that is input into the scene recognition model for processing is taken as the image to be recognized. After the image to be recognized is acquired, it is input into the pre-trained scene recognition model, and the scene information corresponding to the image to be recognized is determined based on the scene recognition model.
Embodiment 7:
Fig. 11 is a schematic structural diagram of the scene recognition model training apparatus provided by an embodiment of the present application. The apparatus includes:
a first training unit 11 configured to train, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer;
a second training unit 12 configured to train the weight parameters of the LCS modules of the respective levels according to loss values computed pixel by pixel from the feature maps output by the LCS modules of the respective levels and the first scene label of the sample image;
a third training unit 13 configured to train, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the fully connected decision layer.
The apparatus further includes:
a fourth training unit 14 configured to train the weight parameters of the convolution layers of the respective levels of the branch expansion structure according to loss values computed pixel by pixel from the feature maps output by the convolution layers of the branch expansion structure and the second scene label of the sample image, and to train, using a loss function with a scene-confidence regularization term, the parameters of the local object association relation learning module; the first scene label and the second scene label have different granularities.
The first training unit 11 is further configured to upsample the feature maps of different levels in the core feature extraction layer using deconvolution operations with different dilation factors, align the channel counts in the channel dimension using a bilinear interpolation algorithm, add and merge the feature maps of the respective levels channel by channel, perform convolutional fusion on the merged feature map group, obtain a global information feature vector through channel-wise global average pooling, concatenate the global information feature vector with the fully connected (FC) feature vector, and train, using the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer.
The second training unit 12 is further configured to obtain the importance weight of each channel using an activation function through a channel-dimension attention mechanism, perform a weighted sum of the feature maps of the channels according to the importance weight of each channel to obtain a summary heatmap, compute a loss value pixel by pixel according to the summary heatmap, the object-scene association importance, and the areas of the objects, and train the weight parameters of the LCS modules of the respective levels according to the loss value.
Embodiment 8:
Fig. 12 is a schematic structural diagram of the scene recognition apparatus provided by an embodiment of the present application. The apparatus includes:
an acquisition unit 21 configured to acquire an image to be recognized;
a recognition unit 22 configured to input the image to be recognized into a pre-trained scene recognition model, and to determine, based on the scene recognition model, the scene information corresponding to the image to be recognized.
The apparatus further includes:
a determination unit 23 configured to determine, in response to the determined scene information corresponding to the image to be recognized belonging to violating scene information and the machine review result being that the image to be recognized is a violating image, that the image to be recognized is a violating image.
Embodiment 9:
On the basis of the above embodiments, an embodiment of the present application further provides an electronic device, as shown in Fig. 13, including a processor 301, a communication interface 302, a memory 303, and a communication bus 304, where the processor 301, the communication interface 302, and the memory 303 communicate with one another through the communication bus 304.
The memory 303 stores a computer program which, when executed by the processor 301, causes the processor 301 to perform the following steps:
training, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer;
training the weight parameters of the LCS modules of the respective levels according to loss values computed pixel by pixel from the feature maps output by the LCS modules of the respective levels and the first scene label of the sample image;
training, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the fully connected decision layer.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device. Since the principle by which the above electronic device solves the problem is similar to that of the scene recognition model training method, the implementation of the above electronic device may refer to the implementation of the method, and repeated descriptions are omitted.
Embodiment 10:
On the basis of the above embodiments, an embodiment of the present application further provides an electronic device, as shown in Fig. 14, including a processor 401, a communication interface 402, a memory 403, and a communication bus 404, where the processor 401, the communication interface 402, and the memory 403 communicate with one another through the communication bus 404.
The memory 403 stores a computer program which, when executed by the processor 401, causes the processor 401 to perform the following steps:
acquiring an image to be recognized;
inputting the image to be recognized into a pre-trained scene recognition model, and determining, based on the scene recognition model, the scene information corresponding to the image to be recognized.
Based on the same inventive concept, an embodiment of the present application further provides an electronic device. Since the principle by which the above electronic device solves the problem is similar to that of the scene recognition method, the implementation of the above electronic device may refer to the implementation of the method, and repeated descriptions are omitted.
Embodiment 11:
On the basis of the above embodiments, an embodiment of the present application further provides a computer-readable storage medium in which a computer program executable by an electronic device is stored; when the program runs on the electronic device, the electronic device is caused to perform the following steps:
training, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer;
training the weight parameters of the LCS modules of the respective levels according to loss values computed pixel by pixel from the feature maps output by the LCS modules of the respective levels and the first scene label of the sample image;
training, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the fully connected decision layer.
Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium. Since the principle by which the processor solves the problem when executing the computer program stored on the above computer-readable storage medium is similar to that of the scene recognition model training method, the implementation of the computer program stored on the above computer-readable storage medium by the processor may refer to the implementation of the method, and repeated descriptions are omitted.
Embodiment 12:
On the basis of the above embodiments, an embodiment of the present application further provides a computer-readable storage medium in which a computer program executable by an electronic device is stored; when the program runs on the electronic device, the electronic device is caused to perform the following steps:
acquiring an image to be recognized;
inputting the image to be recognized into a pre-trained scene recognition model, and determining, based on the scene recognition model, the scene information corresponding to the image to be recognized.
Based on the same inventive concept, an embodiment of the present application further provides a computer-readable storage medium. Since the principle by which the processor solves the problem when executing the computer program stored on the above computer-readable storage medium is similar to that of the scene recognition method, the implementation of the computer program stored on the above computer-readable storage medium by the processor may refer to the implementation of the method, and repeated descriptions are omitted.
The embodiments of the present application provide a model training and scene recognition method, apparatus, device, and medium to provide a highly accurate scene recognition solution.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each procedure and/or block in the flowcharts and/or block diagrams, and combinations of procedures and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, which implements the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
Although some embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including those embodiments as well as all changes and modifications falling within the scope of the present application.
Those skilled in the art can make various changes and variations to the present application without departing from the spirit and scope of the present application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include these changes and variations.

Claims (13)

  1. A scene recognition model training method, wherein the scene recognition model comprises a core feature extraction layer and, connected to the core feature extraction layer, a global information feature extraction layer, at least one level of attention-based local supervised learning (LCS) module, and a fully connected decision layer, the method comprising:
    training, using a first scene label of a sample image and a standard cross-entropy loss, parameters of the core feature extraction layer and of the global information feature extraction layer;
    training weight parameters of the LCS module of each level according to a loss value computed pixel by pixel from the feature map output by the LCS module of that level and the first scene label of the sample image;
    training, using the first scene label of the sample image and the standard cross-entropy loss, parameters of the fully connected decision layer.
  2. The method according to claim 1, wherein the scene recognition model further comprises a branch expansion structure, the branch expansion structure comprising convolution layers and a local object association relation learning module;
    the method further comprises:
    training weight parameters of at least one level of convolution layer of the branch expansion structure according to a loss value computed pixel by pixel from the feature map output by the convolution layers of the branch expansion structure and a second scene label of the sample image; and training, using a loss function with a scene-confidence regularization term, parameters of the local object association relation learning module; wherein the first scene label and the second scene label have different granularities.
  3. The method according to claim 1, wherein the core feature extraction layer comprises a grouped multi-receptive-field residual convolution module of a first type and a grouped multi-receptive-field residual convolution module of a second type;
    the grouped multi-receptive-field residual convolution module of the first type comprises a first group, a second group, and a third group; the convolution sizes of the first, second, and third groups are different, and the first, second, and third groups comprise residual-calculation bypass structures; each group outputs a feature map through convolution operations and residual calculation, the feature maps output by the groups are concatenated along the channel dimension and channel-shuffled, and the result is output to the next module after convolutional fusion;
    the grouped multi-receptive-field residual convolution module of the second type comprises a fourth group, a fifth group, and a sixth group; the convolution sizes of the fourth, fifth, and sixth groups are different, and the fifth group and the sixth group respectively comprise a 1×1 convolution bypass structure and a residual-calculation bypass structure; the feature maps output by the groups are concatenated along the channel dimension and channel-shuffled, and the result is output to the next module after convolutional fusion.
  4. The method according to claim 1, wherein training, using the first scene label of the sample image and the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer comprises:
    upsampling feature maps of different levels in the core feature extraction layer using deconvolution operations with different dilation factors; aligning channel counts in the channel dimension using a bilinear interpolation algorithm; adding and merging the feature maps of at least one level channel by channel; performing convolutional fusion on the merged feature map group and obtaining a global information feature vector through channel-wise global average pooling; concatenating the global information feature vector with a fully connected (FC) feature vector; and training, using the standard cross-entropy loss, the parameters of the core feature extraction layer and of the global information feature extraction layer.
  5. The method according to claim 1, wherein training the weight parameters of the LCS module of each level according to the loss value computed pixel by pixel from the feature map output by the LCS module of that level and the first scene label of the sample image comprises:
    obtaining an importance weight of each channel using an activation function through a channel-dimension attention mechanism, and performing a weighted sum of the feature maps of the channels according to the importance weight of each channel to obtain a summary heatmap;
    computing a loss value pixel by pixel according to the summary heatmap, object-scene association importance, and object areas, and training the weight parameters of the LCS module of each level according to the loss value.
  6. The method according to claim 2, wherein the branch expansion structure is constructed using depthwise separable convolution residual blocks (DW); in the main path of the residual block, the middle layer uses a DW convolution layer, and 1x1 convolution layers are used before and after the DW convolution layer.
  7. The method according to claim 2, wherein the local object association relation learning module comprises a deformable convolution layer, a convolution layer, and an average pooling layer;
    the deformable convolution layer obtains convolution kernel offset values for the current pixel position; the current position of a convolution kernel parameter plus the offset value serves as the actually effective position of the convolution kernel parameter; the feature map pixel values at the actually effective positions are obtained, and a feature map is output after a convolution operation and an average pooling operation.
  8. A scene recognition method based on a scene recognition model trained by the method according to any one of claims 1-7, the method comprising:
    acquiring an image to be recognized;
    inputting the image to be recognized into a pre-trained scene recognition model, and determining, based on the scene recognition model, scene information corresponding to the image to be recognized.
  9. The method according to claim 8, further comprising:
    in response to the determined scene information corresponding to the image to be recognized belonging to violating scene information and a machine review result being that the image to be recognized is a violating image, determining that the image to be recognized is a violating image.
  10. A scene recognition model training apparatus, comprising:
    a first training unit configured to train, using a first scene label of a sample image and a standard cross-entropy loss, parameters of a core feature extraction layer and of a global information feature extraction layer;
    a second training unit configured to train weight parameters of the LCS module of each level according to a loss value computed pixel by pixel from the feature map output by the LCS module of that level and the first scene label of the sample image;
    a third training unit configured to train, using the first scene label of the sample image and the standard cross-entropy loss, parameters of a fully connected decision layer.
  11. A scene recognition apparatus based on a scene recognition model trained by the apparatus according to claim 10, the apparatus comprising:
    an acquisition unit configured to acquire an image to be recognized;
    a recognition unit configured to input the image to be recognized into a pre-trained scene recognition model, and to determine, based on the scene recognition model, scene information corresponding to the image to be recognized.
  12. An electronic device, comprising a processor configured, when executing a computer program stored in a memory, to implement the steps of the scene recognition model training method according to any one of claims 1-7, or to implement the steps of the scene recognition method according to any one of claims 8-9.
  13. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the scene recognition model training method according to any one of claims 1-7, or implements the steps of the scene recognition method according to any one of claims 8-9.
PCT/CN2022/123011 2021-10-09 2022-09-30 Model training and scene recognition method, apparatus, device, and medium WO2023056889A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111174534.2A CN114049584A (zh) 2021-10-09 2021-10-09 Model training and scene recognition method, apparatus, device, and medium
CN202111174534.2 2021-10-09

Publications (1)

Publication Number Publication Date
WO2023056889A1 true WO2023056889A1 (zh) 2023-04-13

Family

ID=80205598

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/123011 WO2023056889A1 (zh) 2021-10-09 2022-09-30 模型训练和场景识别方法、装置、设备及介质

Country Status (2)

Country Link
CN (1) CN114049584A (zh)
WO (1) WO2023056889A1 (zh)


Also Published As

Publication number Publication date
CN114049584A (zh) 2022-02-15


Legal Events

Date Code Title Description

121 — Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 22877908; Country of ref document: EP; Kind code of ref document: A1)

WWE — WIPO information: entry into national phase (Ref document number: 2022877908; Country of ref document: EP)

NENP — Non-entry into the national phase (Ref country code: DE)

ENP — Entry into the national phase (Ref document number: 2022877908; Country of ref document: EP; Effective date: 20240510)