CN112348117A - Scene recognition method and device, computer equipment and storage medium


Info

Publication number: CN112348117A
Application number: CN202011376434.3A
Authority: CN (China)
Language: Chinese (zh)
Inventor: 郭卉
Assignee: Tencent Technology Shenzhen Co Ltd
Legal status: Pending
Prior art keywords: scene, sample image, model, scene recognition, target
Priority: CN202011376434.3A, filed by Tencent Technology Shenzhen Co Ltd

Classifications

    • G06F18/214: Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2415: Pattern recognition; Classification techniques relating to the classification model, based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045: Computing arrangements based on biological models; Neural networks; Combinations of networks
    • G06N3/084: Computing arrangements based on biological models; Neural networks; Learning methods; Backpropagation, e.g. using gradient descent

Abstract

The application relates to a scene recognition method, a scene recognition device, computer equipment and a storage medium. The method comprises the following steps: acquiring a training sample image; carrying out background recognition on a training sample image to obtain a background sample image corresponding to the training sample image; inputting the background sample image into a scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result; obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result; the loss value of the target model and the difference of the recognition result form a positive correlation relationship; and adjusting model parameters of the scene recognition model based on the target model loss value to obtain a trained scene recognition model. The scene recognition model can be deployed in a cloud server, and the cloud server provides artificial intelligence cloud service. By adopting the method, the scene recognition accuracy can be improved.

Description

Scene recognition method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of image recognition technologies, and in particular, to a scene recognition method, an apparatus, a computer device, and a storage medium.
Background
With the development of information technology, users increasingly transmit information through videos, and videos can involve various scenes, such as food scenes, portrait scenes, landscape scenes, and cartoon scenes. Identifying the scene to which a video belongs is of great significance in fields such as video content analysis and video retrieval.
Scene recognition of an image or video can be performed in various ways, for example, a scene in an image can be recognized by an artificial intelligence-based scene recognition model.
However, since images come in many varieties, conventional methods sometimes cannot recognize the scene accurately; that is, the accuracy of scene recognition is low.
Disclosure of Invention
In view of the above, it is necessary to provide a scene recognition method, a scene recognition apparatus, a computer device, and a storage medium capable of improving scene recognition accuracy, in order to solve the technical problem of low scene recognition accuracy.
A method of scene recognition, the method comprising: acquiring a training sample image; carrying out background recognition on a training sample image to obtain a background sample image corresponding to the training sample image; inputting the background sample image into a scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result; obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value and the identification result difference are in positive correlation; and adjusting model parameters of the scene recognition model based on the target model loss value to obtain a trained scene recognition model.
A scene recognition apparatus, the apparatus comprising: the training sample image acquisition module is used for acquiring a training sample image; the background sample image obtaining module is used for carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image; a scene recognition result obtaining module, configured to input the background sample image into a scene recognition model to obtain a first scene recognition result, and input the training sample image into the scene recognition model to obtain a second scene recognition result; a target model loss value obtaining module, configured to obtain a target model loss value based on a recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value and the identification result difference are in positive correlation; and the trained scene recognition model obtaining module is used for adjusting the model parameters of the scene recognition model based on the target model loss value to obtain the trained scene recognition model.
In some embodiments, the background sample image obtaining module comprises: a saliency image obtaining unit, configured to perform saliency recognition on the training sample image to obtain a saliency image corresponding to the training sample image; and the background sample image obtaining unit is used for obtaining a background mask according to the saliency image, and processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image.
In some embodiments, the background sample image obtaining unit is further configured to obtain a grayscale statistic value corresponding to the saliency image, and obtain a grayscale threshold according to the grayscale statistic value; and perform binarization processing on the saliency image according to the grayscale threshold to obtain a background mask.
In some embodiments, the background sample image obtaining unit is further configured to compare a gray value of a pixel point in the saliency image with the gray threshold; and when the gray value of the pixel point is smaller than the gray threshold, determining the value of the pixel point in the background mask as a pixel shielding value.
In some embodiments, the background sample image obtaining unit is further configured to obtain a saliency retention threshold; select a plurality of grayscale thresholds between the grayscale statistic value and the saliency retention threshold; perform binarization processing on the saliency image using each grayscale threshold to obtain a background mask corresponding to each grayscale threshold; process the training sample image using each background mask, and add each processed image into a candidate background image set corresponding to the training sample image; and select at least one image from the candidate background image set as a background sample image corresponding to the training sample image.
In some embodiments, the first scene recognition result comprises a first scene recognition probability, the second scene recognition result comprises a second scene recognition probability, and the object model loss value derivation module comprises: a probability difference value calculation unit configured to calculate a probability difference value between the first scene recognition probability and the second scene recognition probability; and the target model loss value obtaining unit is used for calculating to obtain the target model loss value according to the probability difference value, and the probability difference value and the target model loss value form a positive correlation relationship.
In some embodiments, the target model loss value obtaining unit is further configured to calculate a first model loss value according to the probability difference value, where the first model loss value is in a positive correlation with the probability difference value; obtaining a second model loss value according to a predicted recognition result, wherein the predicted recognition result comprises at least one of a first scene recognition result or a second scene recognition result, the second model loss value and a standard result difference form a positive correlation relationship, and the standard result difference is a difference between the predicted recognition result and a standard recognition result corresponding to the training sample image; and obtaining the target model loss value according to the first model loss value and the second model loss value.
In some embodiments, the target model loss value obtaining unit is further configured to obtain a first classification loss value according to a difference between the first scene recognition result and a standard recognition result corresponding to the training sample image; obtaining a second classification loss value according to the difference between the second scene recognition result and the standard recognition result; and obtaining a second model loss value according to the first classification loss value and the second classification loss value.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above-described scene recognition method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned scene recognition method.
In the scene recognition method, apparatus, computer device, and storage medium, a training sample image is obtained, and background recognition is performed on the training sample image to obtain a background sample image corresponding to the training sample image. The background sample image is input into a scene recognition model to obtain a first scene recognition result, and the training sample image is input into the scene recognition model to obtain a second scene recognition result. A target model loss value is obtained based on the recognition result difference between the first scene recognition result and the second scene recognition result, where the target model loss value is positively correlated with the recognition result difference, and the model parameters of the scene recognition model are adjusted based on the target model loss value to obtain a trained scene recognition model. Because the first scene recognition result is obtained by recognizing the background sample image, the target model loss value reflects the recognition result of the background sample image. The trained scene recognition model obtained from this target model loss value is therefore more sensitive to the background in the image, pays more attention to the background, and can accurately recognize the scene based on the background of the image, which improves the accuracy of scene recognition.
A video search method, the method comprising: determining a target scene to be queried in response to a scene search operation corresponding to a target video; triggering a video scene query based on the target scene, and determining a video clip corresponding to the target scene obtained based on the video scene query, wherein the video clip comprises target video images corresponding to the target scene, the target video images are obtained by inputting video images corresponding to the target video into a scene recognition model for scene recognition and taking the video images whose recognized scene is the target scene, and the scene recognition model is trained according to a training sample image and a background sample image corresponding to the training sample image; and playing the video clip.
A video search apparatus, the apparatus comprising: a target scene determining module, configured to determine a target scene to be queried in response to a scene search operation corresponding to a target video; a video clip determining module, configured to trigger a video scene query based on the target scene and determine a video clip corresponding to the target scene obtained based on the video scene query, wherein the video clip comprises target video images corresponding to the target scene, the target video images are obtained by inputting video images corresponding to the target video into a scene recognition model for scene recognition and taking the video images whose recognized scene is the target scene, and the scene recognition model is trained according to a training sample image and a background sample image corresponding to the training sample image; and a video clip playing module, configured to play the video clip.
In some embodiments, the target scene determination module is further configured to display a scene search area on a playing interface of the target video, and receive an input target scene and the scene search operation through the scene search area; the video clip playing module is further configured to jump the current playing position of the target video to the starting position of the video clip on the playing interface, and play the video clip from the starting position.
In some embodiments, the scene recognition model training module comprises: the training sample image acquisition unit is used for acquiring a training sample image; the background sample image obtaining unit is used for carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image; a scene recognition result obtaining unit, configured to input the background sample image into a scene recognition model to obtain a first scene recognition result, and input the training sample image into the scene recognition model to obtain a second scene recognition result; a target model loss value obtaining unit, configured to obtain a target model loss value based on a recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value and the identification result difference are in positive correlation; and the trained scene recognition model obtaining unit is used for adjusting the model parameters of the scene recognition model based on the target model loss value to obtain the trained scene recognition model.
A computer device comprising a memory storing a computer program and a processor implementing the steps of the above video search method when executing the computer program.
A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned video search method.
In the video search method, apparatus, computer device, and storage medium, a target scene to be queried is determined in response to a scene search operation corresponding to a target video, a video scene query is triggered based on the target scene, and a video clip corresponding to the target scene obtained based on the video scene query is determined. The video clip comprises target video images corresponding to the target scene, which are obtained by inputting video images corresponding to the target video into the scene recognition model for scene recognition and taking the video images whose recognized scene is the target scene, and the video clip is played. Because the scene recognition model is trained according to a training sample image and a background sample image corresponding to the training sample image, it is more sensitive to the background in the image; that is, the scene recognition model can recognize the scene accurately, which improves the accuracy of scene recognition and therefore the accuracy of video search.
Drawings
FIG. 1 is a diagram of an environment in which a method for scene recognition may be applied in some embodiments;
FIG. 2 is a flow diagram illustrating a method for scene recognition in some embodiments;
FIG. 3A is a schematic diagram of obtaining a set of candidate sample images in some embodiments;
FIG. 3B is a schematic illustration of a candidate sample image obtained in some embodiments;
FIG. 3C is a schematic illustration of a saliency mask obtained in some embodiments;
FIG. 4 is a schematic flow chart of a scene recognition result output by the scene recognition model in some embodiments;
FIG. 5 is a block diagram of a residual module in some embodiments;
FIG. 6 is a flow diagram illustrating a video search method in some embodiments;
FIG. 7 is a schematic diagram of a video playback interface in some embodiments;
FIG. 8 is a diagram illustrating an application scenario of the scene recognition method in some embodiments;
FIG. 9 is a block diagram of a scene recognition device in some embodiments;
FIG. 10 is a block diagram of the video search apparatus in some embodiments;
FIG. 11 is a diagram of the internal structure of a computer device in some embodiments;
FIG. 12 is a diagram of the internal structure of a computer device in some embodiments.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The scene recognition method provided by the application can be applied to the application environment shown in fig. 1, in which the terminal 102 communicates with the server 104 via a network. The terminal 102 is provided with a client for displaying content, and the terminal 102 displays pushed content through the client. The client corresponds to the type of the content and includes, for example, at least one of a video client, a browser client, an instant messaging client, or an education client; for example, the terminal 102 may present a pushed video through a video client. The terminal 102 may collect an image or a video and transmit the collected image or video to the server 104, and the server may use a video frame in the obtained image or video as a training sample image. Specifically, the server 104 may obtain a training sample image, perform background recognition on the training sample image to obtain a background sample image corresponding to the training sample image, input the background sample image into the scene recognition model to obtain a first scene recognition result, input the training sample image into the scene recognition model to obtain a second scene recognition result, and obtain a target model loss value based on a recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value is positively correlated with the recognition result difference, and the model parameters of the scene recognition model are adjusted based on the target model loss value to obtain the trained scene recognition model. The terminal 102 may determine a target scene to be queried in response to a scene search operation corresponding to a target video, trigger a video scene query based on the target scene, and send a video scene query request to the server. The server 104 may, in response to the video scene query request, input video images corresponding to the target video into the scene recognition model for scene recognition, obtain the target video images whose recognized scene is the target scene, determine a video clip corresponding to the target video images, and return the position of the video clip in the target video to the terminal 102, and the terminal 102 may play the video clip according to the position of the video clip in the target video.
The terminal 102 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart television, a smart speaker, a smart watch, a vehicle-mounted computer, and the like, but is not limited thereto. The server 104 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a web service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
The scene recognition model presented in the present application may be a model based on artificial intelligence. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the capabilities of perception, reasoning and decision making.
Computer Vision (CV) technology is the science of studying how to make a machine "see"; more specifically, cameras and computers are used in place of human eyes to perform machine vision tasks such as recognition, tracking and measurement of targets, and further image processing is performed so that the processed image is more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and techniques in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory and other disciplines. It specializes in studying how a computer simulates or implements human learning behavior to acquire new knowledge or skills and to reorganize existing knowledge structures so as to continuously improve its performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
The trained scene recognition model provided by the embodiments of the application may be deployed on a cloud server, and the cloud server provides an artificial intelligence cloud service. The so-called artificial intelligence cloud service is also generally called AIaaS (AI as a Service). It is a service mode of an artificial intelligence platform: the AIaaS platform splits several types of common AI services and provides independent or packaged services in the cloud. This service model is similar to an AI-themed application store: all developers can access one or more artificial intelligence services provided by the platform through an API (application programming interface), and some qualified developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate and maintain their own dedicated cloud artificial intelligence services.
In some embodiments, as shown in fig. 2, a scene recognition method is provided, which is described by taking the method as an example applied to the server 104 in fig. 1, and includes the following steps:
s202, acquiring a training sample image.
The training sample image is an image used for model training. A training sample image is selected from a candidate sample image set, which may contain, for example, 100,000 images. Random sampling may be performed, that is, an image is selected from the candidate sample image set at random as a training sample image; alternatively, weighted random sampling may be performed, in which sampling is based on the sampling weight of each image in the candidate sample image set, and the larger the sampling weight, the larger the probability of the image being used as a training sample image.
In order to obtain a scene recognition model with recognition capability, model training can be performed using a self-supervised training method, so that the scene recognition model can learn the capability of performing scene recognition on an image according to a training sample image and the scene label corresponding to the training sample image; that is, model parameters for scene recognition can be obtained through learning.
Specifically, the server may receive a model training instruction, and obtain a training sample image from the candidate sample image set according to the model training instruction. For example, 2 million images may be acquired from the set of candidate sample images as training sample images at each round of training. The set of candidate sample images may be stored in the server or in another device separate from the server, for example, in a database server separate from the server.
In some embodiments, acquiring the training sample image comprises: obtaining model learning difficulty corresponding to a candidate sample image in a candidate sample image set; determining the sampling weight corresponding to the candidate sample image according to the model learning difficulty corresponding to each candidate sample image, wherein the model learning difficulty corresponding to the candidate sample image and the sampling weight corresponding to the candidate sample image form a positive correlation; and sampling from the candidate sample image set based on the corresponding sampling weight of the candidate sample image to obtain a training sample image.
The model learning difficulty refers to the difficulty of recognizing the sample image by model learning. The greater the difficulty, the less likely the model will learn the ability to identify the category of the sample image. The sampling weight represents a degree of possibility that the candidate sample image is selected as a training sample image to perform model training, and a value range of the sampling weight may be 0 to 1. The greater the difficulty of model learning, the greater the likelihood that a candidate sample image will be chosen as a training sample image.
Specifically, after obtaining the sampling weight corresponding to each candidate sample image, the server may sample from the candidate sample image set using a weighted random sampling algorithm to obtain the training sample images. For example, assume that the sampling weights in the candidate sample image set sum to 1 and that a random number is generated in the range 1 to 10000. If the sampling weight of one candidate sample image is 0.1, the integers from 1 to 1000 (1000 numbers in total) may be assigned to that candidate sample image; a random number is then generated, and if it falls between 1 and 1000, the candidate sample image is selected as a training sample image.
In some embodiments, the model learning difficulty level may be determined according to a loss value corresponding to the candidate sample image, for example, the loss value may be used as the model learning difficulty level corresponding to the candidate sample image.
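The following is a minimal sketch of this weighted sampling, assuming the loss value of each candidate image is used directly as its model learning difficulty; the helper names and the example data are illustrative and not taken from the patent.

```python
# Illustrative sketch (assumed helpers and data): weighted random sampling of
# training sample images, where the sampling weight of each candidate grows with
# its model learning difficulty (here approximated by its last recorded loss value).
import random

def sampling_weights(loss_values):
    """Normalize per-image loss values into sampling weights that sum to 1."""
    total = sum(loss_values)
    return [loss / total for loss in loss_values]

def weighted_sample(candidate_images, weights, num_samples):
    """Draw training sample images with probability proportional to their weights."""
    # random.choices samples with replacement; each draw follows the weights.
    return random.choices(candidate_images, weights=weights, k=num_samples)

# Usage with hypothetical data: images with higher loss are drawn more often.
candidates = ["img_001.jpg", "img_002.jpg", "img_003.jpg"]
losses = [0.9, 0.3, 0.1]   # model learning difficulty per candidate image
batch = weighted_sample(candidates, sampling_weights(losses), num_samples=2)
```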
And S204, carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image.
The background sample image is an image corresponding to the background area of the training sample image. The background sample image may include a small amount of foreground but mainly consists of the background portion; for example, a proportion greater than a preset threshold, such as 90%, is background area. The server may store the training sample image and the corresponding background sample image in association with each other, and the label assigned to the background sample image stored in this way may be set to the correct label corresponding to the training sample image.
Specifically, the server may identify a foreground region of the training sample image, and process the training sample image according to the identified foreground region to obtain a background sample image of the training sample image. For example, a mask corresponding to the background region may be determined according to the foreground region, and the training sample image may be processed by using the mask corresponding to the background region, so as to obtain a background sample image corresponding to the training sample image. The foreground may include, but is not limited to, text, objects, animals or humans, etc. in the image.
In some embodiments, the server may remove the text in the training sample image to obtain a background sample image corresponding to the training sample image. The server can remove salient features in the training sample image by matting, where the salient features are foreground features and may include text, objects, animals, people and other features in the image. For example, the server may remove characters in the training sample image: the server may recognize text in the training sample image through a text recognition model, determine the text region of the text in the training sample image, and remove the content in the text region to obtain the background sample image. The text recognition model may be a custom model or an existing model, for example a trained OCR (Optical Character Recognition) model. Characters in an image, such as the identification card number, name, address and bank card number on an identification card, can be recognized through the OCR model.
In some embodiments, the server may identify the target in the training sample image through the target identification model, determine a target area of the target in the training sample image, and remove content in the target area to obtain the background sample image. The object recognition model is used to recognize the image, identifying the class of objects included in the image, which may be objects, animals or people. For example, the categories of people, dogs, cats, and birds in the image are identified and the name of the object is given. The target recognition model may be a custom model or an existing model. The target recognition model may also be referred to as an image saliency recognition model.
In some embodiments, the server may determine a text region and a target region in the training sample image, remove content of the text region from the training sample image, and extract the content region in the target region, resulting in the background sample image.
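A minimal sketch of this region-removal step is given below. The functions detect_text_regions and detect_object_regions are hypothetical stand-ins for a text recognition (OCR) model and a target recognition model that return bounding boxes; they are assumptions, not APIs named in the patent.

```python
# Illustrative sketch: build a background sample image by shielding detected text
# and target (foreground) regions. The two detectors are assumed to return boxes
# as (left, top, right, bottom) pixel coordinates.
import numpy as np

def build_background_sample(image: np.ndarray,
                            detect_text_regions,
                            detect_object_regions) -> np.ndarray:
    """Return a copy of `image` with foreground regions masked out (set to 0)."""
    background = image.copy()
    boxes = list(detect_text_regions(image)) + list(detect_object_regions(image))
    for (x1, y1, x2, y2) in boxes:
        background[y1:y2, x1:x2] = 0   # shield the content of the region
    return background
```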
S206, inputting the background sample image into the scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result.
A scene may be used to represent the place or situation in which the image is located, and may represent at least one of the environment or the atmosphere of the image. The scene recognition model is used for recognizing the scene type of an image. Scene types include, but are not limited to, a place name, which may be, for example, a pool hall, a seaside or a forest, or a place type, which may be, for example, a landscape, an office or an entertainment place. The scene recognition model here refers to the scene recognition model to be trained, that is, a model that still needs further training; it may be an initial scene recognition model or a scene recognition model obtained through one or more rounds of model training.

The scene recognition model may be an image recognition model. Image recognition, which may also be referred to as image saliency recognition, is a type of class-level recognition that considers only the class of an object, regardless of the particular instance of the object, and outputs the corresponding class. The class of the subject is, for example, human, dog, cat or bird. For example, image recognition includes the recognition task on the ImageNet dataset for large-scale general objects, identifying which of 1000 categories a certain object belongs to. Image recognition may also be image multi-label recognition, which refers to recognizing by computer whether an image has a combination of specified attribute labels; one image may have a plurality of attributes, and the task of multi-label recognition is to judge which preset attribute labels a certain image has. Image recognition may also include noisy recognition, which refers to an object recognition task performed with noisy samples. A noisy sample may carry an incorrect category label caused by an annotator's mistake, or an incomplete correspondence between an image and its category label caused by an unclear concept; for example, when the concepts of two categories partially overlap, an image may have the attributes of both categories but be labeled with only one of them.

The model used for recognizing the scene in which an image is located is the scene recognition model. The scene recognition model may be a single-stage model, so that generalization capability can be improved while model inference complexity is reduced. The scene recognition model in the application may adopt different network structures or different pre-trained model weights as the base model.
The scene recognition result may include at least one of a scene type or a scene confidence, which may also be referred to as a scene recognition probability. The scene confidence is used for representing the possibility that the image is of each scene type, the higher the confidence is, the higher the possibility is, and the value range of the confidence can be 0 to 1. For example, assuming that the scene recognition model is a model that recognizes a pool hall and a tennis court, the scene recognition model may output a probability that the training sample image is the pool hall, a probability that it is the tennis court, and a probability that it is neither the pool hall nor the tennis court. The first scene recognition result is a scene recognition result obtained by performing scene recognition on the background sample image by using the scene recognition model. The second scene recognition result is a scene recognition result obtained by performing scene recognition on the training sample image by using the scene recognition model.
Specifically, the server may input the background sample image into the scene recognition model to be trained, the feature extraction layer of the scene recognition model may perform feature extraction on the background sample image, the extracted feature vector is input into the classification layer of the scene recognition model, and the classification layer processes the image feature vector to obtain a scene confidence coefficient corresponding to each candidate scene type, which is used as the first scene recognition result. Similarly, the server may input the training sample image into the scene recognition model to be trained, the feature extraction layer of the scene recognition model may perform feature extraction on the training sample image, the extracted feature vector is input into the classification layer of the scene recognition model, and the classification layer processes the image feature vector to obtain a scene confidence corresponding to each candidate scene type as a second scene recognition result. The candidate scene type refers to a candidate scene type, and the finally obtained scene type is selected from the candidate scene types. For example, when performing scene recognition on an image by using a scene recognition model, a scene type with the highest scene confidence may be used as the scene type corresponding to the image.
In some embodiments, the scene recognition model may be a deep neural network model, such as a CNN (Convolutional Neural Network) based model, in which feature extraction is performed on the image by convolutional layers. The deep neural network model may have a multi-layer neural network structure, which may include a plurality of stacked convolutional layers, may also include pooling layers, and may also contain cross-layer (skip) connections.
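A minimal sketch of such a model is shown below, assuming a PyTorch-style convolutional backbone (the feature extraction layers) followed by a classification layer that outputs a confidence for each candidate scene type; the layer sizes and the number of scene types are illustrative.

```python
# Illustrative sketch: a small CNN-based scene recognition model. The feature
# extraction layers produce an image feature vector; the classification layer
# maps it to a confidence (probability) for each candidate scene type.
import torch
import torch.nn as nn

class SceneRecognitionModel(nn.Module):
    def __init__(self, num_scene_types: int = 10):
        super().__init__()
        self.features = nn.Sequential(                       # feature extraction layers
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_scene_types)      # classification layer

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.features(images).flatten(1)              # image feature vectors
        return self.classifier(feats).softmax(dim=1)          # scene confidences

# Usage: the candidate scene type with the highest confidence is the recognized scene.
model = SceneRecognitionModel()
confidences = model(torch.randn(1, 3, 224, 224))
recognized_scene = confidences.argmax(dim=1)
```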
S208, obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result; the loss value of the target model and the difference of the recognition result are in positive correlation.
Wherein the recognition result difference is a difference between the first scene recognition result and the second scene recognition result. The first scene recognition result may include a first scene recognition probability and the second scene recognition result may include a second scene recognition probability. The recognition result difference may also be a difference between the first scene recognition probability and the second scene recognition probability, or a ratio between the first scene recognition probability and the second scene recognition probability.
The loss value (loss) is obtained from a loss function, which is a function representing the "risk" or "loss" of an event. The target model loss value is obtained from a recognition result difference between the first scene recognition result and the second scene recognition result.
The positive correlation refers to: under the condition that other conditions are not changed, the changing directions of the two variables are the same, and when one variable changes from large to small, the other variable also changes from large to small. It is understood that a positive correlation herein means that the direction of change is consistent, but does not require that when one variable changes at all, another variable must also change. For example, it may be set that the variable b is 100 when the variable a is 10 to 20, and the variable b is 120 when the variable a is 20 to 30. Thus, the change directions of a and b are both such that when a is larger, b is also larger. But b may be unchanged in the range of 10 to 20 a.
Specifically, the server may obtain a first model loss value according to the recognition result difference between the first scene recognition result and the second scene recognition result, and obtain the target model loss value according to the first model loss value. The first model loss value can be calculated, for example, using the KL (Kullback-Leibler) divergence, which may also be referred to as relative entropy.
In some embodiments, the server may calculate the first model loss value according to the first scene recognition probability and the second scene recognition probability. For example, the difference between the first scene recognition probability and the second scene recognition probability may be used as the first model loss value; alternatively, a function operation may be performed on the first scene recognition probability and the second scene recognition probability respectively to obtain a processed first scene recognition probability and a processed second scene recognition probability, and the first model loss value may be obtained according to the difference between the processed first scene recognition probability and the processed second scene recognition probability. The first model loss value is positively correlated with the difference between the processed first scene recognition probability and the processed second scene recognition probability.
In some embodiments, the server may obtain a standard recognition result corresponding to the training sample image, obtain a first classification loss value according to the first scene recognition result and the standard recognition result, and obtain the target model loss value according to the recognition result difference between the first scene recognition result and the second scene recognition result together with the first classification loss value. The classification loss value can be calculated by various loss functions, for example a cross-entropy loss function, which is not limited here.
In some embodiments, the server may obtain a second classification loss value according to the second scene recognition result and the standard recognition result, and obtain the target model loss value according to the second classification loss value and the first model loss value.
In some embodiments, the server may derive the target model loss value based on the first model loss value, the first classification loss value, and the second classification loss value. For example, the first model loss value, the first classification loss value, and the second classification loss value may be summed, and the result of the summation may be used as the target model loss value.
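A minimal sketch of this combination is given below, assuming the KL divergence is used for the first model loss and cross-entropy-style losses for the two classification losses, and that both recognition results are probability distributions over the candidate scene types; the function and variable names are illustrative.

```python
# Illustrative sketch: target model loss = first model loss (grows with the
# difference between the two recognition results) + classification losses of both
# results against the standard (ground-truth) scene label, summed together.
import torch
import torch.nn.functional as F

def target_model_loss(background_probs: torch.Tensor,   # first scene recognition result
                      image_probs: torch.Tensor,        # second scene recognition result
                      scene_labels: torch.Tensor) -> torch.Tensor:
    eps = 1e-8
    log_background = (background_probs + eps).log()
    log_image = (image_probs + eps).log()
    # First model loss: KL divergence between the two recognition results.
    first_model_loss = F.kl_div(log_background, image_probs, reduction="batchmean")
    # First and second classification loss values against the standard result.
    first_cls_loss = F.nll_loss(log_background, scene_labels)
    second_cls_loss = F.nll_loss(log_image, scene_labels)
    # Target model loss as the sum of the three terms.
    return first_model_loss + first_cls_loss + second_cls_loss
```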
S210, adjusting model parameters of the scene recognition model based on the loss value of the target model to obtain the trained scene recognition model.
The model parameters refer to variable parameters inside the scene recognition model, and may also be referred to as neural network weights (weights) for the neural network model. The trained scene recognition model may be obtained through one or more training.
Specifically, the server may adjust model parameters in the scene recognition model to be trained toward a direction in which the loss value becomes smaller, and may obtain the trained scene recognition model through multiple iterative training.
In some embodiments, adjusting model parameters of the scene recognition model based on the target model loss value, resulting in the trained scene recognition model comprises: and performing back propagation based on the loss value of the target model, and updating model parameters of the scene recognition model along the gradient descending direction in the process of back propagation to obtain the trained scene recognition model.
Backward means that the direction of parameter updating is opposite to the direction of scene recognition; the parameter update is back-propagated, so that a descending gradient can be obtained according to the target model loss value, and the gradient update of the model parameters starts from the last layer of the scene recognition model and proceeds, according to the descending gradient, until the first layer of the scene recognition model is reached. The gradient descent method may be a stochastic gradient descent method or a batch gradient descent method. It is understood that the training of the model may be iterated multiple times, that is, the trained scene recognition model may be obtained by iterative training, and the training is stopped when the model convergence condition is satisfied. The model convergence condition may be that the model loss value is smaller than a preset loss value, or that the change in the model parameters is smaller than a preset parameter change value. The preset loss value and the preset parameter change value are preset values.
In some embodiments, the scene recognition model is a deep neural network model, and the SGD (Stochastic Gradient Descent) method may be used to solve the convolution template parameters w and the bias parameters b of the neural network model. All parameters of the neural network model may be set to a state that requires learning. In each iterative training step, m training sample images and 1 background sample image corresponding to each of the m training sample images are extracted, where the background sample image may be 1 image randomly drawn from a plurality of background sample images of the training sample image. m may be 1, may be greater than 1, or may be equal to the total number of training sample images. The neural network model performs a forward computation on the input m training sample images and the corresponding background sample images to obtain the first model loss value and the second model loss value, the target model loss value is obtained according to the first model loss value and the second model loss value, the target model loss value is back-propagated to the convolutional neural network model, the gradient is calculated, and the parameters of the convolutional neural network model are updated. That is, the target model loss value is passed back through the network and the network weight parameters are updated by stochastic gradient descent, thereby realizing one weight optimization; a well-performing scene recognition model is finally obtained through multiple optimizations. The initial learning rate may be set to 0.01, and in later iterations the learning rate may be multiplied by a reduction coefficient, for example 0.1 every 10 rounds.
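A minimal sketch of this training procedure is shown below, reusing the model and loss sketches above and assuming a data loader that yields batches of training sample images, their background sample images, and scene labels; the optimizer settings follow the example values in the text (initial learning rate 0.01, multiplied by 0.1 every 10 rounds), while everything else is illustrative.

```python
# Illustrative sketch: iterative training with stochastic gradient descent. Each
# iteration forwards a batch of training sample images and their corresponding
# background sample images, computes the target model loss, back-propagates it,
# and updates the network weight parameters.
import torch

model = SceneRecognitionModel()                                   # from the sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)          # initial learning rate 0.01
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)

num_epochs = 30                                                   # illustrative value
for epoch in range(num_epochs):
    # train_loader is an assumed DataLoader yielding (images, background_images, labels).
    for images, background_images, scene_labels in train_loader:
        image_probs = model(images)                    # second scene recognition result
        background_probs = model(background_images)    # first scene recognition result
        loss = target_model_loss(background_probs, image_probs, scene_labels)
        optimizer.zero_grad()
        loss.backward()                                # back-propagate the target model loss
        optimizer.step()                               # gradient descent weight update
    scheduler.step()                                   # learning rate x 0.1 every 10 rounds
```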
In the scene recognition method, a training sample image is obtained, and background recognition is performed on the training sample image to obtain a background sample image corresponding to the training sample image. The background sample image is input into the scene recognition model to obtain a first scene recognition result, and the training sample image is input into the scene recognition model to obtain a second scene recognition result. A target model loss value is obtained based on the recognition result difference between the first scene recognition result and the second scene recognition result, where the model loss value is positively correlated with the recognition result difference, and the model parameters of the scene recognition model are adjusted based on the target model loss value to obtain the trained scene recognition model. Because the first scene recognition result is obtained by recognizing the background sample image, the target model loss value is obtained according to the recognition result of the background sample image. Therefore, the trained scene recognition model obtained from the target model loss value is more sensitive to the background in the image, pays more attention to the background, and can accurately recognize the scene based on the background of the image, which improves the accuracy of scene recognition.
In this scene recognition method, semantic inference is performed on sample regions during the training stage, and semantically related samples are generated automatically for self-supervised auxiliary learning, which improves the model's ability to discriminate ambiguous scene backgrounds. By ignoring, to varying degrees, characteristics such as the texture of distracting foreground objects, the model's ability to learn the background of a scene is enhanced while a certain degree of perception of the foreground is retained, achieving a better final generalization effect. The model also learns self-supervised consistency between the original image and its background to improve the background feature description capability for abstract scenes, thereby optimizing the recognition of abstract scenes. The scene of an image without a detection target can thus be detected accurately; such an image may be, for example, an image of the seaside or of a forest, and seaside and forest scenes generally contain no common detection targets. Manual effort such as labeling can also be reduced.
The scene recognition method provided by the application can be used to perform scene recognition on video content and assist in understanding videos. Video understanding includes identifying the scenes within a video where the plot takes place. Scene features need to be identified during scene recognition; these features usually exist in the background environment of the image, whereas a conventional image recognition model or image recognition pre-trained model concentrates on foreground recognition. This easily causes scene recognition to overfit to the foreground of the target scene; that is, the scene recognition model memorizes the foreground in some scenes (such as the clothing of foreground persons) instead of recognizing the features of the background environment.
In some embodiments, performing background recognition on the training sample image to obtain a background sample image corresponding to the training sample image includes: carrying out saliency recognition on the training sample image to obtain a saliency image corresponding to the training sample image; and obtaining a background mask according to the saliency image, and processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image.
Here, saliency recognition refers to recognizing the foreground in an image. A saliency image may also be referred to as a saliency map. A mask is a binary image consisting of 0s and 1s. The background mask is a mask used to obtain the background of the training sample image. The training sample image can be masked with the background mask to obtain the background sample image corresponding to the training sample image.
Specifically, the saliency of the training sample image can be identified through the saliency identification model, and a saliency image corresponding to the training sample image is obtained. When the training sample image is masked according to the background mask, the gray value of the corresponding pixel point in the training sample image is shielded through 0 in the mask, namely the gray value of the corresponding pixel point is set to be 0, and the gray value of the corresponding pixel point in the training sample image is reserved through 1 in the mask, namely the gray value of the corresponding pixel point is kept unchanged.
In some embodiments, the server may compute statistics over the gray values of the saliency image and process the saliency image according to the statistical result to obtain the background mask. For example, the background mask can be obtained by performing binarization processing on the saliency image according to the statistical result.
In this embodiment, saliency recognition is performed on the training sample image to obtain the saliency image corresponding to the training sample image, a background mask is obtained from the saliency image, and the training sample image is processed with the background mask to obtain the background sample image corresponding to the training sample image. Since the background sample image is obtained through saliency recognition, it has high accuracy and can be obtained conveniently and quickly.
In some embodiments, deriving the background mask from the saliency image comprises: acquiring a gray scale statistic value corresponding to the saliency image, and acquiring a gray scale threshold value according to the gray scale statistic value; and carrying out binarization processing on the saliency image according to the gray threshold value to obtain a background mask.
The gray scale statistic value is a result obtained from statistics over the gray values of all pixel points in the saliency image; the statistic may comprise at least one of a mean or a variance. The gray threshold value is obtained according to the gray statistic value. There may be a plurality of gray scale thresholds; here, a plurality means at least one. The number of gray thresholds may be preset or determined as needed, for example, calculated from the gray statistic. The number of gray thresholds may be negatively correlated with the gray statistic value, that is, the smaller the gray statistic value, the larger the number of gray thresholds, and the larger the gray statistic value, the smaller the number. Binarizing the saliency image refers to setting the gray value of each pixel point in the saliency image to 1 or 0.
A negative correlation means that, with other conditions unchanged, the two variables change in opposite directions: when one variable decreases, the other increases. A negative correlation does not require that whenever one variable changes slightly, the other must change as well.
Specifically, the server may calculate an average value of gray values corresponding to each pixel point in the saliency image to obtain an average gray value, and obtain a gray statistic of the saliency image according to the average gray value. For example, the average gray value may be used as the gray statistics value of the saliency image, or the average gray value may be normalized, and the result after normalization may be used as the gray statistics value of the saliency image.
In some embodiments, the server may select a value greater than or equal to the grayscale statistic as the grayscale threshold. The grayscale threshold may be smaller than a preset value, the preset value may be preset, or may be set as needed, for example, the grayscale threshold may be determined according to the size of the foreground region to be retained, and the preset value may have a positive correlation with the size of the foreground region to be retained, that is, the larger the foreground region to be retained is, the larger the preset value is, the smaller the foreground region to be retained is, and the smaller the preset value is.
In some embodiments, when the binarization processing is performed on the saliency image according to the gray threshold, the gray value of a pixel point in the saliency image may be compared with the gray threshold, when the gray value of the pixel point is determined to be greater than the gray threshold, the gray value of the pixel point is set to 0, and when the gray value of the pixel point is determined to be less than the gray threshold, the gray value of the pixel point is set to 1, so as to obtain the background mask with the gray value of 1 or 0.
In some embodiments, the server may perform binarization processing on the saliency image according to a grayscale threshold to obtain a foreground mask, and obtain a background mask according to the foreground mask, where the foreground mask may also be referred to as a saliency mask. Specifically, the server may compare the gray value of the pixel point in the saliency image with a gray threshold, set the gray value of the pixel point to 1 when the gray value of the pixel point is determined to be greater than the gray threshold, and set the gray value of the pixel point to 0 when the gray value of the pixel point is determined to be smaller than the gray threshold, so as to obtain the foreground mask with the gray value of 1 or 0. The server may perform negation on the gray scale value in the foreground mask, that is, set the gray scale value 1 to 0, and set the gray scale value 0 to 1, to obtain the background mask.
In some embodiments, there are a plurality of gray level threshold values, and binarization processing may be performed on the saliency image according to each gray level threshold value to obtain a background mask corresponding to each gray level threshold value.
In this embodiment, since the grayscale statistic reflects the overall grayscale distribution, binarizing the saliency image according to a grayscale threshold derived from the grayscale statistic improves the accuracy of the obtained background mask.
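A minimal sketch of this binarization step is given below. It assumes the saliency image is a single-channel array normalized to [0, 1] and that the normalized mean gray value is used directly as the gray statistic; the names are illustrative only.

```python
import numpy as np

def gray_statistic(saliency: np.ndarray) -> float:
    """Gray statistic of the saliency image: here, its normalized mean gray value."""
    return float(saliency.mean())

def background_mask_from_saliency(saliency: np.ndarray, gray_threshold: float) -> np.ndarray:
    """Binarize the saliency image into a background mask.

    Pixels whose gray value is greater than the threshold (salient foreground)
    become 0 (shielded); pixels below the threshold (background) become 1 (retained).
    """
    return (saliency < gray_threshold).astype(np.uint8)
```

For example, using gray_statistic(saliency) as the threshold retains only the regions that are less salient than average.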
In some embodiments, binarizing the saliency image according to the gray threshold to obtain the background mask includes: comparing the gray value of each pixel point in the saliency image with the gray threshold; when the gray value of the pixel point is greater than the gray threshold, determining the value of the pixel point in the background mask as a pixel shielding value; and when the gray value of the pixel point is smaller than the gray threshold, determining the value of the pixel point in the background mask as a pixel reservation value.
The pixel masking value is used to mask the gray value of the pixel in the saliency image, that is, to filter the gray value of the pixel, and may be, for example, a value 0. The pixel reservation value is used to reserve a gray value in the saliency image, that is, to keep the gray value of the pixel unchanged, and may be, for example, a value 1.
Specifically, the server may compare the gray value of the pixel point in the saliency image with a gray threshold, determine that the value of the pixel point in the background mask is 0 when the gray value of the pixel point is determined to be greater than the gray threshold, and determine that the value of the pixel point in the background mask is 1 when the gray value of the pixel point is determined to be smaller than the gray threshold.
In this embodiment, the gray value of each pixel point in the saliency image is compared with the gray threshold. When the gray value of the pixel point is greater than the gray threshold, the value of the pixel point in the background mask is determined as the pixel shielding value; when it is less than the gray threshold, the value is determined as the pixel reservation value. Each value in the background mask therefore becomes either the pixel shielding value or the pixel reservation value, so the background mask can both retain and shield gray values.
In some embodiments, deriving the grayscale threshold from the grayscale statistic includes: acquiring a significance retention threshold; selecting a plurality of gray level threshold values between the gray level statistic value and the significance retention threshold value; the binarization processing of the significant image according to the gray threshold value to obtain a background mask comprises the following steps: performing binarization processing on the saliency image by using a plurality of gray level threshold values to obtain background masks corresponding to the gray level threshold values respectively; processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image comprises: processing the training sample image by using each background mask, and adding the processed image into a candidate background image set corresponding to the training sample image; and selecting at least one image from the candidate background image set as a background sample image corresponding to the training sample image.
Wherein the significance preservation threshold is used to reflect the size of the preserved foreground region. The significance preservation threshold is positively correlated with the size of the preserved foreground region. The larger the significance retention threshold, the larger the foreground region that is retained, i.e., the more significance that is retained, the smaller the significance retention threshold, the smaller the foreground region that is retained, i.e., the less significance that is retained. When the grayscale statistic is a normalized value, the significance retention threshold may be less than 1.
The set of candidate background images may include a plurality of candidate background images. The candidate background image is an image obtained by masking the training sample image with a background mask. The candidate background image set may include images obtained by masking the training sample image with each background mask.
Specifically, the server may randomly select at least one image from the candidate background image set as a background sample image corresponding to the training sample image. For example, 1 image is randomly selected as the background sample image corresponding to the training sample image.
In some embodiments, the server may select a plurality of grayscale thresholds between the grayscale statistic and the significance retention threshold, for example, by starting from the grayscale statistic and taking values spaced by a fixed difference; the fixed difference may be preset or set as needed, for example 0.05, and each grayscale threshold may be greater than or equal to the grayscale statistic and less than the significance retention threshold. For example, the server calculates the average gray value of the saliency image and normalizes it to obtain the gray statistic value r. When the significance retention threshold is 0.9, the server may select a plurality of values in the range [r, 0.9) as the gray thresholds; when r is 0.62, the server may divide [0.62, 0.9) with a step size of 0.05 to obtain gray thresholds such as 0.62, 0.67, 0.72, 0.77, 0.82, and 0.87. The upper bound of 0.9 preserves a certain amount of the salient outline in the image, which is friendlier to scenes that depend on a few salient objects.
In some embodiments, the server may perform saliency recognition on a training sample image using different saliency recognition models to obtain saliency masks, determine background masks from the saliency masks, mask the training sample image with the background masks to obtain candidate background images, and collect the candidate background images into a candidate background image set. Fig. 3A is a schematic diagram of processing a training sample image with two saliency recognition models to obtain the candidate background image set. In the figure, saliency recognition model 1 may be, for example, a target recognition model for detecting a target, such as an object, in an image, and saliency recognition model 2 may be, for example, an OCR model for recognizing characters. Fig. 3B shows a first candidate background image 304 and a second candidate background image 306 obtained by processing a training sample image 302 through the two saliency recognition models, where "saliency background detection" in the figure refers to first performing saliency detection and then obtaining the background based on the result of the saliency detection. Fig. 3C shows a saliency mask 310 corresponding to the training sample image 302.
In some embodiments, the server may rank the images in the candidate background image set, and select an image ranked before a preset rank or before a preset proportion from the candidate background image set according to a ranking result, as a background sample image corresponding to the training sample image. The server may sort the candidate background images according to the gray threshold corresponding to the candidate background images, for example, the candidate background images may be sorted according to the gray threshold from large to small, or the candidate background images may be sorted according to the gray threshold from small to large.
In this embodiment, a plurality of gray thresholds are selected between the gray statistic value and the significance retention threshold, the saliency image is binarized with each gray threshold to obtain the corresponding background masks, the training sample image is processed with each background mask, the processed images are added to the candidate background image set corresponding to the training sample image, and at least one image is selected from the candidate background image set as the background sample image corresponding to the training sample image. The background masks corresponding to different gray thresholds thus yield candidate background images that retain foreground regions of different sizes, so when the model is trained on the background sample images, its ability to learn the background of a scene is enhanced while a degree of perception of the foreground is kept, which improves the generalization of the model.
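Putting the multi-threshold variant together, the following sketch works under the same assumptions (a normalized saliency map, a significance retention threshold of 0.9, and a step of 0.05 as in the example above); all names are hypothetical.

```python
import numpy as np

def candidate_background_images(training_image, saliency, retention_threshold=0.9, step=0.05, rng=None):
    """Generate the candidate background image set and pick one background sample image.

    training_image -- H x W x C array; saliency -- H x W array normalized to [0, 1].
    """
    rng = rng or np.random.default_rng()
    r = float(saliency.mean())                               # gray statistic
    thresholds = np.arange(r, retention_threshold, step)     # e.g. 0.62, 0.67, ..., 0.87
    candidates = [training_image * (saliency < t).astype(training_image.dtype)[..., None]
                  for t in thresholds]
    if not candidates:                                       # r already at or above the retention threshold
        return [], training_image
    # Randomly select one candidate as the background sample image for this training image.
    background_sample = candidates[int(rng.integers(len(candidates)))]
    return candidates, background_sample
```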
In some embodiments, the first scene recognition result comprises a first scene recognition probability, the second scene recognition result comprises a second scene recognition probability, and deriving the target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result comprises: calculating a probability difference value between the first scene recognition probability and the second scene recognition probability; and calculating to obtain a target model loss value according to the probability difference value, wherein the probability difference value and the target model loss value form a positive correlation relationship.
The scene recognition probability represents the probability that the image belongs to each scene type; a larger probability indicates the image is more likely to belong to the corresponding scene type, and the probability ranges from 0 to 1. The scene recognition probability may also be referred to as the above-mentioned scene confidence. The first scene recognition probability is the scene recognition probability corresponding to the first scene recognition result, and the second scene recognition probability is the scene recognition probability corresponding to the second scene recognition result. The probability difference value is a result calculated from the first scene recognition probability and the second scene recognition probability, and may be the value obtained by subtracting one from the other, or the absolute value of that difference.
Specifically, the server may perform function operation processing on the first scene identification probability to obtain a processed first scene identification probability, perform function operation processing on the second scene identification probability to obtain a processed second scene identification probability, and obtain a probability difference value according to the processed first scene identification probability and the processed second scene identification probability, for example, a difference value between the processed first scene identification probability and the processed second scene identification probability may be used as the probability difference value. The function may be any function, and may be a logarithm, for example. That is, the logarithm value of the first scene recognition probability may be calculated to obtain the processed first scene recognition probability.
In some embodiments, the server may use the probability difference value as the target model loss value, or may calculate according to the probability difference value and the second scene recognition probability to obtain the target model loss value, for example, may calculate a product of the probability difference value and the second scene recognition probability to obtain the target model loss value.
In some embodiments, there are a plurality of training sample images, and the server may calculate a target model loss value according to the probability difference value corresponding to each training sample image, for example, may calculate a statistical calculation of each probability difference value as the target model loss value. The server may also calculate a product of the probability difference value and the second scene recognition probability to obtain a probability product corresponding to the probability difference value, and perform statistical calculation on each probability product to obtain a target model loss value. The statistical calculation may be, for example, a sum calculation or a mean calculation.
In some embodiments, the server may determine a first classification loss value according to the first scenario recognition result, determine a second classification loss value according to the second scenario recognition result, and calculate a target model loss value according to at least one of the first classification loss value or the second classification loss value and the probability difference value.
In this embodiment, a probability difference value between the first scene identification probability and the second scene identification probability is calculated, and a target model loss value is calculated according to the probability difference value, so that the smaller the target model loss value is, the smaller the probability difference value is, and the first scene identification probability and the second scene identification probability can be made to be the same by adjusting the target model loss value in the direction of decreasing, thereby improving the accuracy of the model.
In some embodiments, calculating a target model loss value according to the probability difference value, wherein the positive correlation between the probability difference value and the target model loss value includes: calculating to obtain a first model loss value according to the probability difference value, wherein the first model loss value and the probability difference value form a positive correlation; obtaining a second model loss value according to the predicted recognition result, wherein the predicted recognition result comprises at least one of the first scene recognition result or the second scene recognition result, the second model loss value and the standard result difference form a positive correlation relationship, and the standard result difference is the difference between the predicted recognition result and the standard recognition result corresponding to the training sample image; and obtaining a target model loss value according to the first model loss value and the second model loss value.
Wherein the first model loss value is a loss value calculated from the probability difference value. The first model loss value is positively correlated with the probability difference value. The first model loss value may be a product of the probability difference value and the second scene recognition probability, i.e., a probability product. When there are a plurality of training sample images, the first model loss value may be a statistical value of each probability product, for example, a result of a summation operation.
The predicted recognition result may include at least one of the first scene recognition result or the second scene recognition result. The second model loss value is a loss value calculated from the predicted recognition result, and may include at least one of a loss value calculated from the first scene recognition result or a loss value calculated from the second scene recognition result.
The standard recognition result corresponding to the training sample image refers to the correct recognition result for that image, and can be the real scene type label corresponding to the training sample image. For example, when the scene recognition model is used to recognize a cafe and the label 1 represents the cafe scene type, the real scene type label corresponding to a training sample image of a cafe is 1.
The standard result difference is a difference between the predicted recognition result and a standard recognition result corresponding to the training sample image, and may be, for example, a difference between a scene type label predicted by the model and a real scene type label.
The larger the second model loss value is, the larger the difference between the predicted recognition result and the standard recognition result corresponding to the training sample image is, that is, the larger the standard result difference is, and the smaller the second model loss value is, the smaller the difference between the predicted recognition result and the standard recognition result corresponding to the training sample image is, that is, the smaller the standard result difference is.
Specifically, the target model loss value may include a first model loss value and a second model loss value, and the server may sum the first model loss value and the second model loss value to obtain the target model loss value. The first model penalty value may also be referred to as a consistency-penalty (consistency-loss).
In some embodiments, the first model loss value may be calculated using equation (1), where N represents the number of training sample images used in one training iteration, p(x_i) represents the second scene recognition probability, i.e., the output of the Fc_cr layer after the training sample image is input into the model, q(x_i) represents the first scene recognition probability, i.e., the output of the Fc_cr layer after the background sample image is input into the model, and D_KL(p||q) represents the first model loss value. Equation (1) is a function that keeps the distributions of the first scene recognition probability and the second scene recognition probability consistent. q(x_i) can be understood as the output for an image that has undergone an "attack", i.e., an image obtained by one or more image enhancement methods, such as background recognition yielding the background sample image; the image enhancement methods may include, for example, adding Gaussian noise, adding salt-and-pepper noise, cropping, rotation, watermarking, or tone conversion. Obtaining the background sample image from the training sample image can likewise be understood as a form of image enhancement.

D_{KL}(p \| q) = \sum_{i=1}^{N} p(x_i) \log \frac{p(x_i)}{q(x_i)}    (1)
In this embodiment, since the predicted recognition result includes at least one of the first scene recognition result or the second scene recognition result, and the second model loss value and the standard result difference are in a positive correlation relationship, the target model loss value is obtained according to the first model loss value and the second model loss value, and the target model loss value can be adjusted, so that the adjustment of the difference between the predicted recognition result and the standard recognition result corresponding to the training sample image is realized, and thus the difference between the predicted recognition result and the standard recognition result is adjusted in a direction of decreasing, and the accuracy of the trained model is improved.
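A sketch of the consistency loss of equation (1) is shown below, assuming p and q are the softmax probability distributions over the M candidate scenes for a batch of N training sample images and their background sample images, and assuming the per-sample terms are summed as written in equation (1); PyTorch is used purely for illustration.

```python
import torch

def consistency_loss(p_train: torch.Tensor, q_background: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """First model loss value D_KL(p || q) of equation (1).

    p_train      -- probabilities for the training sample images, shape (N, M)
    q_background -- probabilities for the corresponding background sample images, shape (N, M)
    """
    p = p_train.clamp_min(eps)
    q = q_background.clamp_min(eps)
    # p * (log p - log q): the second scene recognition probability times the
    # probability difference value, summed over scenes and samples.
    return (p * (p.log() - q.log())).sum()
```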
In some embodiments, deriving the second model loss value from the predicted identification comprises: obtaining a first classification loss value according to the difference between the first scene recognition result and the standard recognition result corresponding to the training sample image; obtaining a second classification loss value according to the difference between the second scene recognition result and the standard recognition result; and obtaining a second model loss value according to the first classification loss value and the second classification loss value.
Specifically, the server may calculate a difference between the first scene recognition result and the standard recognition result as a first result difference, and may use the first result difference as a first classification loss value. When there are a plurality of training sample images, a statistical value of the first result difference corresponding to each training sample image may be used as the first classification loss value. The server may calculate a difference between the second scene recognition result and the standard recognition result as a second result difference, and may use the second result difference as a second classification loss value. When there are a plurality of training sample images, the statistical value of the second result difference corresponding to each training sample image may be used as the second classification loss value. The second model penalty values include a first classification penalty value and a second classification penalty value.
In some embodiments, the server may perform a summation calculation of the first classification loss value and the second classification loss value to obtain a second model loss value. The server may also select one of the first classification loss value and the second classification loss value as the second model loss value.
In some embodiments, the first classification loss value and the second classification loss value may be calculated using the loss function provided by equation (2), for example a cross-entropy loss, where y represents the standard recognition result and \hat{y} represents the first scene recognition result when the first classification loss value is calculated, or the second scene recognition result when the second classification loss value is calculated:

Loss = -\sum_{i=1}^{N} y_i \log \hat{y}_i    (2)
In this embodiment, a first classification loss value is obtained according to a difference between a first scene recognition result and a standard recognition result corresponding to a training sample image, a second classification loss value is obtained according to a difference between a second scene recognition result and the standard recognition result, and a second model loss value is obtained according to the first classification loss value and the second classification loss value, so that a total classification loss value is obtained according to the first classification loss value generated by the training sample image and the second classification loss value generated by the background sample image, that is, the second model loss value is obtained, parameters of the model are adjusted according to the classification loss value, and accuracy of the model is improved.
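Combining the two classification losses with the consistency loss gives a sketch of the target model loss. Cross-entropy is used here for the classification losses and the three terms are simply summed; both choices are assumptions made for illustration rather than details fixed by the embodiments.

```python
import torch
import torch.nn.functional as F

def target_model_loss(logits_train: torch.Tensor,
                      logits_background: torch.Tensor,
                      labels: torch.Tensor) -> torch.Tensor:
    """Target model loss = first model loss (consistency) + second model loss (classification)."""
    # Second classification loss value: training sample images vs. standard recognition results.
    cls_train = F.cross_entropy(logits_train, labels)
    # First classification loss value: background sample images vs. standard recognition results.
    cls_background = F.cross_entropy(logits_background, labels)
    second_model_loss = cls_train + cls_background

    # First model loss value: consistency loss between the two probability distributions.
    p = F.softmax(logits_train, dim=1).clamp_min(1e-8)
    q = F.softmax(logits_background, dim=1).clamp_min(1e-8)
    first_model_loss = (p * (p.log() - q.log())).sum()

    return first_model_loss + second_model_loss
```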
Fig. 4 is a schematic diagram, in some embodiments, of the scene recognition model outputting scene recognition results during training and of obtaining the target model loss value from those results. The training sample image and the background sample image are input into the scene recognition model for forward calculation: features are extracted by the CNN layers, the image feature vector is output by the pool layer, and the first scene recognition result and the second scene recognition result are output by the FC (fully connected) layer. The server may obtain the first classification loss value from the first scene recognition result, the second classification loss value from the second scene recognition result, and the first model loss value from the first and second scene recognition results.
In some embodiments, the scene recognition model provided by the embodiments of the present application may be a deep neural model with a multi-layer neural network structure, for example a network model based on ResNet101 (a 101-layer deep residual network). The residual module of ResNet101 is shown in fig. 5: a three-layer residual module used to reduce the number of parameters, where 3x3 represents the size of the convolution kernel, 64 represents the number of channels, the plus sign inside the circle represents an addition, i.e., the identity mapping, ReLU (Rectified Linear Unit) indicates activation by the activation function, and 256-d represents a 256-dimensional input.
Table 1 shows the structure of ResNet101 in the deep neural network model in some embodiments, where x3, x4, and x23 represent 3, 4, and 23 repeated modules respectively. There are 5 convolutional stages, Conv5_x being the 5th. Table 2 shows the structure of the output layers of the deep neural model.
Conv5_x outputs a depth feature map of the training sample image (a feature map obtained by convolving the image with filters), and the pool layer outputs the image feature vector, i.e., the deep high-dimensional feature produced by forward calculation through the deep learning neural network, a one-dimensional vector obtained by applying a pooling operation to the final feature map. The FC layer outputs the scene confidence of each scene, where M represents the number of candidate image scenes.
TABLE 1 ResNet101 structure table (standard ResNet101 configuration; output sizes for a 224x224 input)

Layer name | Output size | Layer(s)
Conv1 | 112x112 | 7x7, 64, stride 2
Conv2_x | 56x56 | 3x3 max pool, stride 2; [1x1, 64; 3x3, 64; 1x1, 256] x3
Conv3_x | 28x28 | [1x1, 128; 3x3, 128; 1x1, 512] x4
Conv4_x | 14x14 | [1x1, 256; 3x3, 256; 1x1, 1024] x23
Conv5_x | 7x7 | [1x1, 512; 3x3, 512; 1x1, 2048] x3
TABLE 2 Output layer structure table

Layer name | Output size | Layer
Pool | 1x2048 | Maximum pooling
FC | 1xM_class | Fully connected
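As a sketch of the model structure in Tables 1 and 2, the backbone can be assembled from an off-the-shelf ResNet101. The use of torchvision and the exact initialization calls shown here are assumptions; the ImageNet pre-trained parameters and the Gaussian-initialized new FC layer follow the description of the training process later in this document.

```python
import torch.nn as nn
from torchvision import models

def build_scene_recognition_model(num_scenes: int) -> nn.Module:
    """ResNet101 backbone (Conv1..Conv5_x plus global pooling, 2048-d output)
    with a new FC layer producing one confidence per candidate scene (M = num_scenes)."""
    model = models.resnet101(pretrained=True)        # ImageNet pre-trained initial parameters
    fc = nn.Linear(model.fc.in_features, num_scenes)
    # Newly added layer initialized from a Gaussian with mean 0 and variance 0.01 (std = 0.1).
    nn.init.normal_(fc.weight, mean=0.0, std=0.1)
    nn.init.zeros_(fc.bias)
    model.fc = fc
    return model
```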
In some embodiments, as shown in fig. 6, a video search method is provided, which is described by taking the method as an example applied to the terminal 102 in fig. 1, and includes the following steps:
s602, responding to the scene searching operation corresponding to the target video, and determining the target scene to be inquired.
The target video refers to a video needing scene search. The scene search operation is used to trigger a search for a scene in the target video. The target scene refers to the scene that needs to be searched. The scene search operation may be any operation, such as a click operation on a "scene search" button, or an "enter" operation when the content input is completed. The target scene may be preset or determined according to scene-related content input by a user; the scene-related content may include a scene type and may be, for example, "the cafe segment in the TV series XX", so that the scene type may be determined to be "cafe".
Specifically, the terminal may display a video playing interface corresponding to the target video, a scene search area may be displayed in the video playing interface, the scene search area may correspond to a candidate scene list, the candidate scene list may include multiple scenes, when the terminal obtains a trigger operation on the scene search area, for example, a click operation, the candidate scene list may be correspondingly displayed, and when a selection operation on a candidate scene in the candidate scene list is obtained, a candidate scene corresponding to the selection operation is taken as the target scene.
S604, triggering video scene query based on a target scene, and determining a video clip corresponding to the target scene obtained based on the video scene query; the video clip comprises a target video image corresponding to a target scene, the video image corresponding to the target video is input into a scene recognition model for scene recognition, the target video image of which the recognized scene is the target scene is obtained, and the scene recognition model is obtained by training according to a training sample image and a background sample image corresponding to the training sample image.
The video scene query is used for querying a video clip related to the target scene from the target video. The target video image refers to a video image in which a scene is a target scene. The video segment corresponding to the target scene may be a video segment including at least one target video image. The video segment corresponding to the target scene may be a part or all of the target video, that is, the video segment corresponding to the target scene may be the target video itself.
Specifically, the terminal may determine a target scene to be queried in response to a scene search operation corresponding to the target video, and send a video scene query request to the server, where the video scene query request may carry an identifier of the target video and the target scene. The server may acquire the target video according to the identifier of the target video in the video scene query request, and determine the video clip corresponding to the target scene from the target video. For example, the server may input the video images in the target video into a trained scene recognition model, determine the image scene corresponding to each video image, compare each image scene with the target scene, and take the video images whose scene matches the target scene as the target video images.
In some embodiments, the server may obtain at least one forward video image or at least one backward video image of the target video image from the target video, and generate a video clip corresponding to the target scene. The server can also determine a first playing time point corresponding to the first target video image in the target video and a second playing time point corresponding to the last target video image in the target video, and take the video clip between the first playing time point and the second playing time point in the target video as the video clip corresponding to the target scene. The server may return the starting playing time point of the video clip corresponding to the target scene to the terminal, and may also return at least one of the ending playing time point or the playing time length of that video clip to the terminal. Of course, the server may also intercept the video clip corresponding to the target scene from the target video as an independent video clip, and return the intercepted video clip to the terminal.
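A server-side sketch of this query step is given below, assuming the decoded frames and their playing time points are already available and that the index of the target scene among the M candidate scenes is known; all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def find_target_scene_clip(frames, frame_times, model, target_scene_index, device="cpu"):
    """Return the (start, end) playing time points of the clip for the target scene,
    or None if no video image is recognized as the target scene."""
    model.eval()
    matched = []
    with torch.no_grad():
        for frame, t in zip(frames, frame_times):      # frame: tensor of shape (3, H, W)
            logits = model(frame.unsqueeze(0).to(device))
            scene = int(F.softmax(logits, dim=1).argmax(dim=1))
            if scene == target_scene_index:
                matched.append(t)
    if not matched:
        return None
    # The first and last target video images delimit the clip for the target scene.
    return matched[0], matched[-1]
```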
S606, playing the video clip.
Specifically, the terminal may play the video clip according to the video clip start play time point corresponding to the target scene returned by the server, for example, may jump the current play position of the target video to the start position of the video clip, and start playing the video clip from the start position. The start position refers to a start playing time point of the video segment in the target video. Of course, the terminal may also obtain a video clip corresponding to the target scene, that is, an independent video clip, captured by the server from the target video, and play the independent video clip.
In the video search method, the target scene to be queried is determined in response to the scene search operation corresponding to the target video, a video scene query based on the target scene is triggered, and the video clip corresponding to the target scene obtained by the video scene query is determined and played. The video clip comprises target video images corresponding to the target scene, obtained by inputting the video images of the target video into the scene recognition model for scene recognition and keeping the images whose recognized scene is the target scene. Because the scene recognition model is trained on training sample images and the background sample images corresponding to them, it is more sensitive to the background in an image and can therefore recognize scenes accurately, which improves the accuracy of scene recognition and, in turn, the accuracy of the video search.
In some embodiments, in response to a scene search operation corresponding to the target video, determining the target scene to be queried includes: displaying a scene search area on a playing interface of a target video, and receiving an input target scene and a scene search operation through the scene search area; playing the video clip comprises: and jumping the current playing position of the target video to the initial position of the video clip on the playing interface, and playing the video clip from the initial position.
The playing interface of the target video refers to an interface for playing the target video. The scene search area is used for acquiring content related to the scene search input by the user, such as a target scene and a scene search operation. The start position refers to a start playing time point of the video segment in the target video.
The playing interface of the target video may be, for example, a "video playing interface" 702 in fig. 7, where the "video playing interface" 702 includes a video playing area 704, a scene searching area 706, a video playing progress bar 708, and an "ok" button 710, and a black area in the video playing progress bar 708 indicates the progress of video playing, i.e., the current playing position. When the terminal acquires the click operation corresponding to the "ok" button 710, the "cafe" input in the scene search area 706 may be acquired, the video clip corresponding to the "cafe" in the "XXX video" is determined from the server, the current playing position of the target video is skipped to the start position of the video clip, and the video clip is played from the start position.
In the embodiment, the scene search area is displayed on the playing interface of the target video, the input target scene and the scene search operation are received through the scene search area, the current playing position of the target video is jumped to the initial position of the video clip on the playing interface, and the video clip is played from the initial position, so that the video can be played from the clip corresponding to the scene concerned by the user, and the user experience is improved.
In some embodiments, the training of the scene recognition model comprises: acquiring a training sample image; carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image; inputting the background sample image into a scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result; obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result; the loss value of the target model and the difference of the recognition result form a positive correlation; and adjusting model parameters of the scene recognition model based on the loss value of the target model to obtain the trained scene recognition model.
In some embodiments, a scene recognition method is provided, comprising the steps of:
1. and acquiring a training sample image, and performing saliency recognition on the training sample image to obtain a saliency image corresponding to the training sample image.
For example, the scene recognition model may be a deep neural network model based on ResNet101 with the structure shown in Table 1. In order to converge quickly and ensure a good recognition effect, open-source ResNet101 model parameters pre-trained on the ImageNet data set may be adopted as the initial model parameters of the scene recognition model, and the newly added layer of the scene recognition model, such as the FC layer, may be initialized with a Gaussian distribution with a variance of 0.01 and a mean of 0. The ImageNet data set is a large-scale general object recognition source data set. The initial model parameters may also be referred to as the parameters of an ImageNet pre-training model, where the ImageNet pre-training model refers to a deep learning network model trained on the ImageNet data set to obtain the parameter weights of the model.
2. And acquiring a gray scale statistic value corresponding to the saliency image, and acquiring a gray scale threshold value according to the gray scale statistic value.
Specifically, a significance retention threshold may be obtained, and a plurality of grayscale thresholds may be selected between the grayscale statistic and the significance retention threshold.
3. And carrying out binarization processing on the saliency image according to the gray threshold value to obtain a background mask.
Specifically, a plurality of gray thresholds may be used to binarize the saliency image, so as to obtain a background mask corresponding to each gray threshold. For example, the gray value of each pixel point in the saliency image may be compared with a gray threshold; when the gray value of the pixel point is greater than the gray threshold, the value of the pixel point in the background mask is determined as a pixel shielding value, and when it is smaller than the gray threshold, the value is determined as a pixel reservation value.
4. And processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image.
Specifically, each background mask may be used to process a training sample image, the processed image is added to a candidate background image set corresponding to the training sample image, and at least one image is selected from the candidate background image set to serve as a background sample image corresponding to the training sample image.
5. And inputting the background sample image into the scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result.
The first scene recognition result comprises a first scene recognition probability, and the second scene recognition result comprises a second scene recognition probability.
6. And calculating a probability difference value between the first scene identification probability and the second scene identification probability, and calculating to obtain a first model loss value according to the probability difference value.
7. Obtaining a first classification loss value according to the difference between the first scene recognition result and the standard recognition result corresponding to the training sample image; obtaining a second classification loss value according to the difference between the second scene recognition result and the standard recognition result; and obtaining a second model loss value according to the first classification loss value and the second classification loss value.
8. And obtaining a target model loss value according to the first model loss value and the second model loss value.
Wherein the first model loss value is positively correlated with the probability difference value.
9. And adjusting model parameters of the scene recognition model based on the loss value of the target model to obtain the trained scene recognition model.
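Tying steps 1 to 9 together, one training iteration might look like the following sketch. It assumes the background sample images for the batch have already been prepared (for example with the candidate_background_images sketch above) and reuses the target_model_loss sketch; the optimizer and batching details are assumptions.

```python
def train_iteration(model, optimizer, train_images, background_images, labels):
    """One iteration over a batch: forward passes for the training sample images and
    their background sample images (step 5), target model loss (steps 6-8),
    and parameter adjustment (step 9)."""
    model.train()
    optimizer.zero_grad()
    logits_train = model(train_images)            # second scene recognition result
    logits_background = model(background_images)  # first scene recognition result
    loss = target_model_loss(logits_train, logits_background, labels)
    loss.backward()                               # adjust model parameters by backpropagation
    optimizer.step()
    return float(loss)
```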
The scene recognition method realizes a scene recognition learning method based on the consistency of image and text background saliency, and achieves a more accurate scene recognition effect with better generalization ability through the combination of background saliency enhancement, consistency learning on images with different levels of saliency, scene semantic data enhancement, and the learning strategy. Foreground factors that easily cause overfitting are filtered out through background saliency, which realizes pixel-level enhancement, makes full use of the labeled data, and achieves a better recognition effect. By generating salient background images under different conditions and to different degrees and performing consistency learning, a robust anti-overfitting effect is achieved. The semantic data enhancement and model learning strategies are optimized, so the effect can be improved while maintaining the inference speed of the original model. Without requiring much investment in manual labeling or data collection, this scheme can quickly provide a scene recognition model with better generalization ability. Data amplification can be performed for misrecognized scene categories through the saliency extraction method, and consistency learning promotes the correction of the model's attention features, so the method is also a framework for badcase optimization.
Fig. 8 is a diagram illustrating an application scenario of the scene recognition method provided in some embodiments. The trained scene recognition model is deployed on the cloud server. Front end A 802 can send a scene recognition request carrying an image of the scene to be recognized to the cloud server 804; the cloud server obtains the image to be recognized from the scene recognition request and performs scene recognition on it using the scene recognition method provided by the present application to obtain a scene recognition result. Cloud server 804 may send the scene recognition result to front end B 806. Front end B may be, for example, a computer or a mobile phone, and front end A may be an image capturing device. It is understood that front end A and front end B may be the same device or different devices.
It should be understood that although the steps in the flow charts of figs. 2-8 are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, these steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in figs. 2-8 may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and are not necessarily performed in sequence but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.
In some embodiments, as shown in fig. 9, a scene recognition apparatus is provided, which may be a part of a computer device using a software module or a hardware module, or a combination of the two, and specifically includes: a training sample image obtaining module 902, a background sample image obtaining module 904, a scene recognition result obtaining module 906, a target model loss value obtaining module 908, and a trained scene recognition model obtaining module 910, wherein:
a training sample image obtaining module 902, configured to obtain a training sample image.
And a background sample image obtaining module 904, configured to perform background recognition on the training sample image to obtain a background sample image corresponding to the training sample image.
A scene recognition result obtaining module 906, configured to input the background sample image into the scene recognition model to obtain a first scene recognition result, and input the training sample image into the scene recognition model to obtain a second scene recognition result.
A target model loss value obtaining module 908, configured to obtain a target model loss value based on a recognition result difference between the first scene recognition result and the second scene recognition result; the loss value of the target model and the difference of the recognition result are in positive correlation.
A trained scene recognition model obtaining module 910, configured to adjust model parameters of the scene recognition model based on the target model loss value, so as to obtain a trained scene recognition model.
In the embodiment, a training sample image is acquired and background recognition is performed on it to obtain the corresponding background sample image; the background sample image is input into the scene recognition model to obtain the first scene recognition result, and the training sample image is input into the scene recognition model to obtain the second scene recognition result; the target model loss value is obtained based on the recognition result difference between the first and second scene recognition results, with which it forms a positive correlation; and the model parameters of the scene recognition model are adjusted based on the target model loss value to obtain the trained scene recognition model. Since the first scene recognition result is obtained by recognizing the background sample image, the target model loss value is derived from the recognition result of the background sample image, which improves the accuracy of the target model loss value. The trained scene recognition model obtained from this loss value is therefore more sensitive to the background in an image, pays more attention to the background, and can accurately recognize a scene based on the image background, improving the accuracy of scene recognition.
In some embodiments, the background sample image obtaining module 904 includes:
and the saliency image obtaining unit is used for carrying out saliency recognition on the training sample image to obtain a saliency image corresponding to the training sample image.
And the background sample image obtaining unit is used for obtaining a background mask according to the saliency image, and processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image.
In this embodiment, saliency recognition is performed on the training sample image to obtain the corresponding saliency image, the background mask is obtained from the saliency image, and the training sample image is processed with the background mask to obtain the corresponding background sample image. Since the background sample image is derived through saliency recognition, it is obtained conveniently, quickly, and with high accuracy.
In some embodiments, the background sample image obtaining unit is further configured to obtain a gray scale statistic value corresponding to the saliency image, and obtain a gray scale threshold value according to the gray scale statistic value; and carrying out binarization processing on the saliency image according to the gray threshold value to obtain a background mask.
In this embodiment, since the grayscale statistic reflects the overall grayscale distribution, binarizing the saliency image according to a grayscale threshold derived from the grayscale statistic improves the accuracy of the obtained background mask.
In some embodiments, the background sample image obtaining unit is further configured to compare the gray value of each pixel point in the saliency image with the gray threshold; when the gray value of the pixel point is greater than the gray threshold, determine the value of the pixel point in the background mask as a pixel shielding value; and when the gray value of the pixel point is smaller than the gray threshold, determine the value of the pixel point in the background mask as a pixel reservation value.
In this embodiment, the gray value of each pixel point in the saliency image is compared with the gray threshold. When the gray value of the pixel point is greater than the gray threshold, the value of the pixel point in the background mask is determined as the pixel shielding value; when it is less than the gray threshold, the value is determined as the pixel reservation value. Each value in the background mask therefore becomes either the pixel shielding value or the pixel reservation value, so the background mask can both retain and shield gray values.
In some embodiments, the background sample image obtaining unit is further configured to obtain a significance retention threshold; selecting a plurality of gray level threshold values between the gray level statistic value and the significance retention threshold value; performing binarization processing on the saliency image by using a plurality of gray level threshold values to obtain background masks corresponding to the gray level threshold values respectively; processing the training sample image by using each background mask, and adding the processed image into a candidate background image set corresponding to the training sample image; and selecting at least one image from the candidate background image set as a background sample image corresponding to the training sample image.
In this embodiment, a plurality of gray thresholds are selected between the gray statistic value and the significance retention threshold, the saliency image is binarized with each gray threshold to obtain the corresponding background masks, the training sample image is processed with each background mask, the processed images are added to the candidate background image set corresponding to the training sample image, and at least one image is selected from the candidate background image set as the background sample image corresponding to the training sample image. The background masks corresponding to different gray thresholds thus yield candidate background images that retain foreground regions of different sizes, so when the model is trained on the background sample images, its ability to learn the background of a scene is enhanced while a degree of perception of the foreground is kept, which improves the generalization of the model.
In some embodiments, the first scenario identification result includes a first scenario identification probability, the second scenario identification result includes a second scenario identification probability, and the target model loss value deriving module 908 includes:
and the probability difference value calculating unit is used for calculating the probability difference value between the first scene recognition probability and the second scene recognition probability.
And the target model loss value obtaining unit is used for obtaining a target model loss value through calculation according to the probability difference value, and the probability difference value and the target model loss value form a positive correlation relationship.
In this embodiment, a probability difference value between the first scene identification probability and the second scene identification probability is calculated, and a target model loss value is calculated according to the probability difference value, so that the smaller the target model loss value is, the smaller the probability difference value is, and the first scene identification probability and the second scene identification probability can be made to be the same by adjusting the target model loss value in the direction of decreasing, thereby improving the accuracy of the model.
In some embodiments, the target model loss value obtaining unit is further configured to obtain a first model loss value according to the probability difference value, where the first model loss value and the probability difference value form a positive correlation; obtaining a second model loss value according to the predicted recognition result, wherein the predicted recognition result comprises at least one of the first scene recognition result or the second scene recognition result, the second model loss value and the standard result difference form a positive correlation relationship, and the standard result difference is the difference between the predicted recognition result and the standard recognition result corresponding to the training sample image; and obtaining a target model loss value according to the first model loss value and the second model loss value.
In this embodiment, since the predicted recognition result includes at least one of the first scene recognition result or the second scene recognition result, and the second model loss value and the standard result difference are in a positive correlation relationship, the target model loss value is obtained according to the first model loss value and the second model loss value, and the target model loss value can be adjusted, so that the adjustment of the difference between the predicted recognition result and the standard recognition result corresponding to the training sample image is realized, and thus the difference between the predicted recognition result and the standard recognition result is adjusted in a direction of decreasing, and the accuracy of the trained model is improved.
In some embodiments, the target model loss value obtaining unit is further configured to obtain a first classification loss value according to a difference between the first scene recognition result and a standard recognition result corresponding to the training sample image; obtaining a second classification loss value according to the difference between the second scene recognition result and the standard recognition result; and obtaining a second model loss value according to the first classification loss value and the second classification loss value.
In this embodiment, a first classification loss value is obtained from the difference between the first scene recognition result and the standard recognition result corresponding to the training sample image, a second classification loss value is obtained from the difference between the second scene recognition result and the standard recognition result, and the second model loss value is obtained from these two values. In other words, the total classification loss, i.e. the second model loss value, combines the classification losses produced by the background sample image and by the training sample image; adjusting the model parameters according to this classification loss improves the accuracy of the model.
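A hedged sketch of this classification part follows, assuming standard cross-entropy against the standard recognition result expressed as scene class indices; the helper and tensor names are illustrative only and are not taken from this disclosure.

```python
import torch
import torch.nn.functional as F

def second_model_loss(first_logits: torch.Tensor,
                      second_logits: torch.Tensor,
                      labels: torch.Tensor) -> torch.Tensor:
    # first_logits:  scene scores predicted from the background sample image
    # second_logits: scene scores predicted from the training sample image
    # labels:        standard recognition results (scene class indices) for the training sample images
    first_cls_loss = F.cross_entropy(first_logits, labels)    # first classification loss value
    second_cls_loss = F.cross_entropy(second_logits, labels)  # second classification loss value
    return first_cls_loss + second_cls_loss                   # total classification loss (second model loss value)
```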
In some embodiments, as shown in fig. 10, a video search apparatus is provided, which may be a part of a computer device in the form of a software module, a hardware module, or a combination of the two, and specifically includes: a target scene determination module 1002, a video clip determination module 1004, and a video clip play module 1006, wherein:
the target scene determining module 1002 is configured to determine a target scene to be queried in response to a scene search operation corresponding to a target video.
A video clip determining module 1004, configured to trigger a video scene query based on the target scene and determine a video clip corresponding to the target scene obtained from the video scene query. The video clip includes a target video image corresponding to the target scene; the target video image is obtained by inputting the video images corresponding to the target video into a scene recognition model for scene recognition and selecting the images whose recognized scene is the target scene, and the scene recognition model is obtained by training according to a training sample image and a background sample image corresponding to the training sample image.
A video clip playing module 1006, configured to play the video clip.
In this embodiment, a target scene to be queried is determined in response to the scene search operation corresponding to the target video, a video scene query based on the target scene is triggered, and a video clip corresponding to the target scene is determined from the query. The video clip includes a target video image corresponding to the target scene, obtained by inputting the video images of the target video into the scene recognition model and keeping the images whose recognized scene is the target scene, and the video clip is then played. Because the scene recognition model is trained according to the training sample image and the background sample image corresponding to the training sample image, it is more sensitive to the background in the image and can therefore recognize the scene accurately, which improves the accuracy of scene recognition and, in turn, the accuracy of video search.
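To illustrate the query flow, a minimal sketch follows; the frame-sampling rate, the `scene_recognition_model` callable, and the choice to return one span covering all matching frames are assumptions made for illustration rather than details specified by this disclosure.

```python
from typing import Callable, Optional, Sequence, Tuple

def find_scene_clip(video_frames: Sequence,
                    target_scene: str,
                    scene_recognition_model: Callable,
                    fps: float = 1.0) -> Optional[Tuple[float, float]]:
    """Return (start_time, end_time) in seconds of the clip covering all frames
    whose recognized scene equals the target scene, or None if there is none."""
    matched = []
    for idx, frame in enumerate(video_frames):       # video images of the target video
        scene = scene_recognition_model(frame)       # scene recognition on each video image
        if scene == target_scene:
            matched.append(idx)                      # target video images for the target scene
    if not matched:
        return None
    start_idx, end_idx = matched[0], matched[-1]
    return start_idx / fps, (end_idx + 1) / fps      # boundaries of the video clip
```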
In some embodiments, the target scene determining module 1002 is further configured to display a scene search area on a playing interface of the target video and to receive an input target scene and the scene search operation through the scene search area; the video clip playing module is further configured to jump, on the playing interface, the current playing position of the target video to the starting position of the video clip and to play the video clip from the starting position.
In this embodiment, the scene search area is displayed on the playing interface of the target video, the input target scene and the scene search operation are received through the scene search area, the current playing position of the target video is jumped on the playing interface to the starting position of the video clip, and the video clip is played from that position. The video is thus played from the clip corresponding to the scene the user cares about, which improves the user experience.
In some embodiments, the scene recognition model training module comprises:
A training sample image acquisition unit, configured to acquire a training sample image.
A background sample image obtaining unit, configured to perform background recognition on the training sample image to obtain a background sample image corresponding to the training sample image.
A scene recognition result obtaining unit, configured to input the background sample image into the scene recognition model to obtain a first scene recognition result, and input the training sample image into the scene recognition model to obtain a second scene recognition result.
A target model loss value obtaining unit, configured to obtain a target model loss value based on a recognition result difference between the first scene recognition result and the second scene recognition result, where the target model loss value and the recognition result difference are in a positive correlation.
A trained scene recognition model obtaining unit, configured to adjust model parameters of the scene recognition model based on the target model loss value to obtain a trained scene recognition model.
In this embodiment, a training sample image is acquired, background recognition is performed on it to obtain a corresponding background sample image, the background sample image is input into the scene recognition model to obtain a first scene recognition result, and the training sample image is input into the scene recognition model to obtain a second scene recognition result. A target model loss value is then obtained based on the recognition result difference between the two results, the target model loss value being positively correlated with that difference, and the model parameters of the scene recognition model are adjusted based on the target model loss value to obtain a trained scene recognition model. Because the first scene recognition result is obtained by recognizing the background sample image, the target model loss value reflects the recognition of the background, which improves the accuracy of the target model loss value. The trained scene recognition model is therefore more sensitive to the background in the image, pays more attention to it, and can accurately recognize the scene based on the background of the image, which improves the accuracy of scene recognition.
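A training-step sketch combining the pieces above is given below under assumed PyTorch conventions; the optimizer, the softmax normalization, and the weighting coefficient are illustrative assumptions rather than the exact procedure of this disclosure.

```python
import torch
import torch.nn.functional as F

def train_step(model, optimizer, train_images, background_images, labels, alpha: float = 1.0):
    first_logits = model(background_images)     # first scene recognition result (background sample images)
    second_logits = model(train_images)         # second scene recognition result (training sample images)

    first_probs = torch.softmax(first_logits, dim=1)
    second_probs = torch.softmax(second_logits, dim=1)
    first_model_loss = torch.abs(first_probs - second_probs).mean()   # recognition result difference term

    second_model_loss = (F.cross_entropy(first_logits, labels) +
                         F.cross_entropy(second_logits, labels))      # classification loss term

    target_loss = first_model_loss + alpha * second_model_loss        # target model loss value

    optimizer.zero_grad()
    target_loss.backward()    # adjust model parameters of the scene recognition model
    optimizer.step()
    return target_loss.item()
```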
For the specific definition of the scene recognition apparatus, reference may be made to the definition of the scene recognition method above, and details are not repeated here. Each module in the scene recognition apparatus may be implemented wholly or partially by software, by hardware, or by a combination thereof. The modules may be embedded in, or independent of, a processor of the computer device in the form of hardware, or may be stored in a memory of the computer device in the form of software, so that the processor can invoke and execute the operations corresponding to the modules.
In some embodiments, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 11. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing relevant data involved in the scene recognition method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a scene recognition method.
In some embodiments, a computer device is provided, which may be a terminal, and its internal structure diagram may be as shown in fig. 12. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless communication can be realized through WIFI, an operator network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a video search method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, a key, a track ball or a touch pad arranged on the shell of the computer equipment, an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structures shown in figs. 11 and 12 are only block diagrams of part of the structures relevant to the present disclosure and do not constitute a limitation on the computer devices to which the present disclosure is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In some embodiments, there is further provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above method embodiments when executing the computer program.
In some embodiments, a computer-readable storage medium is provided, in which a computer program is stored, which computer program, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In some embodiments, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps in the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, a database, or another medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory can include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical storage, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their descriptions are specific and detailed, but they should not therefore be construed as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (15)

1. A method for scene recognition, the method comprising:
acquiring a training sample image;
carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image;
inputting the background sample image into a scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result;
obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value and the identification result difference are in positive correlation;
and adjusting model parameters of the scene recognition model based on the target model loss value to obtain a trained scene recognition model.
2. The method according to claim 1, wherein the performing background recognition on the training sample image to obtain a background sample image corresponding to the training sample image comprises:
carrying out saliency recognition on the training sample image to obtain a saliency image corresponding to the training sample image;
and obtaining a background mask according to the saliency image, and processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image.
3. The method of claim 2, wherein the deriving a background mask from the saliency image comprises:
acquiring a gray scale statistic value corresponding to the significant image, and obtaining a gray scale threshold value according to the gray scale statistic value;
and carrying out binarization processing on the significant image according to the gray threshold value to obtain a background mask.
4. The method according to claim 3, wherein the binarizing the saliency image according to the gray threshold to obtain a background mask comprises:
comparing the gray value of the pixel point in the significant image with the gray threshold value;
and when the gray value of the pixel point is smaller than the gray threshold, determining the value of the pixel point in the background mask as a pixel shielding value.
5. The method of claim 3, wherein said deriving a gray level threshold value according to the gray level statistic comprises:
acquiring a significance retention threshold;
selecting a plurality of gray level threshold values between the gray level statistic value and the significance retention threshold value;
the binarization processing of the significant image according to the gray threshold value to obtain a background mask comprises the following steps:
carrying out binarization processing on the significant image by utilizing the gray level threshold values to obtain background masks corresponding to the gray level threshold values respectively;
the processing the training sample image by using the background mask to obtain a background sample image corresponding to the training sample image includes:
processing the training sample image by using each background mask, and adding the processed image into a candidate background image set corresponding to the training sample image;
and selecting at least one image from the candidate background image set as a background sample image corresponding to the training sample image.
6. The method of claim 1, wherein the first scene recognition result comprises a first scene recognition probability, the second scene recognition result comprises a second scene recognition probability, and the obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result comprises:
calculating a probability difference value between the first scene recognition probability and the second scene recognition probability;
and calculating to obtain the target model loss value according to the probability difference value, wherein the probability difference value and the target model loss value form a positive correlation relationship.
7. The method according to claim 6, wherein the calculating the target model loss value according to the probability difference value includes:
calculating to obtain a first model loss value according to the probability difference value, wherein the first model loss value and the probability difference value form a positive correlation;
obtaining a second model loss value according to a predicted recognition result, wherein the predicted recognition result comprises at least one of a first scene recognition result or a second scene recognition result, the second model loss value and a standard result difference form a positive correlation relationship, and the standard result difference is a difference between the predicted recognition result and a standard recognition result corresponding to the training sample image;
and obtaining the target model loss value according to the first model loss value and the second model loss value.
8. The method of claim 7, wherein the obtaining a second model loss value according to the predicted recognition result comprises:
obtaining a first classification loss value according to the difference between the first scene recognition result and a standard recognition result corresponding to the training sample image;
obtaining a second classification loss value according to the difference between the second scene recognition result and the standard recognition result;
and obtaining a second model loss value according to the first classification loss value and the second classification loss value.
9. A method for video search, the method comprising:
responding to scene searching operation corresponding to the target video, and determining a target scene to be inquired;
triggering a video scene query based on the target scene, and determining a video clip corresponding to the target scene obtained based on the video scene query; wherein the video clip comprises a target video image corresponding to the target scene, the target video image is obtained by inputting video images corresponding to the target video into a scene recognition model for scene recognition, the recognized scene of the target video image being the target scene, and the scene recognition model is obtained by training according to a training sample image and a background sample image corresponding to the training sample image;
and playing the video clip.
10. The method of claim 9, wherein the determining the target scene to be queried in response to the scene search operation corresponding to the target video comprises:
displaying a scene search area on a playing interface of the target video, and receiving an input target scene and the scene search operation through the scene search area;
the playing the video clip comprises:
and jumping the current playing position of the target video to the starting position of the video clip on the playing interface, and playing the video clip from the starting position.
11. The method of claim 9, wherein the training step of the scene recognition model comprises:
acquiring a training sample image;
carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image;
inputting the background sample image into a scene recognition model to obtain a first scene recognition result, and inputting the training sample image into the scene recognition model to obtain a second scene recognition result;
obtaining a target model loss value based on the recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value and the identification result difference are in positive correlation;
and adjusting model parameters of the scene recognition model based on the target model loss value to obtain a trained scene recognition model.
12. A scene recognition apparatus, characterized in that the apparatus comprises:
the training sample image acquisition module is used for acquiring a training sample image;
the background sample image obtaining module is used for carrying out background recognition on the training sample image to obtain a background sample image corresponding to the training sample image;
a scene recognition result obtaining module, configured to input the background sample image into a scene recognition model to obtain a first scene recognition result, and input the training sample image into the scene recognition model to obtain a second scene recognition result;
a target model loss value obtaining module, configured to obtain a target model loss value based on a recognition result difference between the first scene recognition result and the second scene recognition result; the target model loss value and the identification result difference are in positive correlation;
and the trained scene recognition model obtaining module is used for adjusting the model parameters of the scene recognition model based on the target model loss value to obtain the trained scene recognition model.
13. A video search apparatus, characterized in that the apparatus comprises:
the target scene determining module is used for responding to scene searching operation corresponding to the target video and determining a target scene to be inquired;
the video clip determining module is used for triggering a video scene query based on the target scene and determining a video clip corresponding to the target scene obtained based on the video scene query; wherein the video clip comprises a target video image corresponding to the target scene, the target video image is obtained by inputting video images corresponding to the target video into the scene recognition model for scene recognition, the recognized scene of the target video image being the target scene, and the scene recognition model is obtained by training according to a training sample image and a background sample image corresponding to the training sample image;
and the video clip playing module is used for playing the video clip.
14. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any one of claims 1 to 8 or 9 to 11 when executing the computer program.
15. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8 or 9 to 11.
CN202011376434.3A 2020-11-30 2020-11-30 Scene recognition method and device, computer equipment and storage medium Pending CN112348117A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011376434.3A CN112348117A (en) 2020-11-30 2020-11-30 Scene recognition method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112348117A true CN112348117A (en) 2021-02-09

Family

ID=74366152

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011376434.3A Pending CN112348117A (en) 2020-11-30 2020-11-30 Scene recognition method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112348117A (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108053030A (en) * 2017-12-15 2018-05-18 清华大学 A kind of transfer learning method and system of Opening field
CN108156519A (en) * 2017-12-25 2018-06-12 深圳Tcl新技术有限公司 Image classification method, television equipment and computer readable storage medium
CN108764370A (en) * 2018-06-08 2018-11-06 Oppo广东移动通信有限公司 Image processing method, device, computer readable storage medium and computer equipment
CN108900769A (en) * 2018-07-16 2018-11-27 Oppo广东移动通信有限公司 Image processing method, device, mobile terminal and computer readable storage medium
CN109241342A (en) * 2018-07-23 2019-01-18 中国科学院计算技术研究所 Video scene search method and system based on Depth cue
CN109919244A (en) * 2019-03-18 2019-06-21 北京字节跳动网络技术有限公司 Method and apparatus for generating scene Recognition model
CN110084141A (en) * 2019-04-08 2019-08-02 南京邮电大学 A kind of cross-cutting scene recognition method based on private information
CN110310263A (en) * 2019-06-24 2019-10-08 北京师范大学 A kind of SAR image residential block detection method based on significance analysis and background priori
CN110490895A (en) * 2019-08-20 2019-11-22 浙江大学 A kind of raising meat source true and false identification accuracy method based on Hyperspectral imagery processing
CN111476132A (en) * 2020-03-30 2020-07-31 微梦创科网络科技(中国)有限公司 Video scene recognition method and device, electronic equipment and storage medium
CN111460987A (en) * 2020-03-31 2020-07-28 北京奇艺世纪科技有限公司 Scene recognition and correction model training method and device

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022199500A1 (en) * 2021-03-22 2022-09-29 华为技术有限公司 Model training method, scene recognition method, and related device
EP4287068A4 (en) * 2021-03-22 2024-06-19 Huawei Tech Co Ltd Model training method, scene recognition method, and related device
CN113361363A (en) * 2021-05-31 2021-09-07 北京百度网讯科技有限公司 Training method, device and equipment for face image recognition model and storage medium
CN113361363B (en) * 2021-05-31 2024-02-06 北京百度网讯科技有限公司 Training method, device, equipment and storage medium for face image recognition model
CN113822130A (en) * 2021-07-05 2021-12-21 腾讯科技(深圳)有限公司 Model training method, scene recognition method, computing device, and medium
CN113627501A (en) * 2021-07-30 2021-11-09 武汉大学 Animal image type identification method based on transfer learning
CN114022773A (en) * 2021-11-17 2022-02-08 济南信通达电气科技有限公司 Data enhancement method and equipment based on automatic mapping
CN114022773B (en) * 2021-11-17 2024-04-02 济南信通达电气科技有限公司 Automatic mapping-based data enhancement method and equipment
CN113989857A (en) * 2021-12-27 2022-01-28 四川新网银行股份有限公司 Portrait photo content analysis method and system based on deep learning
WO2023226049A1 (en) * 2022-05-27 2023-11-30 西门子股份公司 Method and apparatus for protecting artificial intelligence model, and computer device
WO2024017226A1 (en) * 2022-07-22 2024-01-25 索尼集团公司 Information processing device and method, and computer-readable storage medium
CN116170829A (en) * 2023-04-26 2023-05-26 浙江省公众信息产业有限公司 Operation and maintenance scene identification method and device for independent private network service

Similar Documents

Publication Publication Date Title
CN112348117A (en) Scene recognition method and device, computer equipment and storage medium
Chaudhuri et al. Multilabel remote sensing image retrieval using a semisupervised graph-theoretic method
CN111523621B (en) Image recognition method and device, computer equipment and storage medium
CN111738357B (en) Junk picture identification method, device and equipment
CN109582813B (en) Retrieval method, device, equipment and storage medium for cultural relic exhibit
CN110569731A (en) face recognition method and device and electronic equipment
Squicciarini et al. Analyzing images' privacy for the modern web
CN112418195B (en) Face key point detection method and device, electronic equipment and storage medium
JP6787831B2 (en) Target detection device, detection model generation device, program and method that can be learned by search results
CN111694954B (en) Image classification method and device and electronic equipment
CN109963072B (en) Focusing method, focusing device, storage medium and electronic equipment
CN115131604A (en) Multi-label image classification method and device, electronic equipment and storage medium
Singh et al. A multilevel thresholding algorithm using HDAFA for image segmentation
CN114299546A (en) Method and device for identifying pet identity, storage medium and electronic equipment
Sumi et al. YOLOv5-based weapon detection systems with data augmentation
WO2015102711A2 (en) A method and system of enforcing privacy policies for mobile sensory devices
Phoka et al. Image based phishing detection using transfer learning
CN112257628A (en) Method, device and equipment for identifying identities of outdoor competition athletes
US11899722B2 (en) Search system, search method, and program
CN116682141A (en) Multi-label pedestrian attribute identification method and medium based on multi-scale progressive perception
Hoang Multiple classifier-based spatiotemporal features for living activity prediction
CN112507912B (en) Method and device for identifying illegal pictures
Fernández et al. Implementation of a face recognition system as experimental practices in an artificial intelligence and pattern recognition course
JP2016099716A (en) System, identifier unit, identification model generator, information processing method and program
CN113762249A (en) Image attack detection and image attack detection model training method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40038259

Country of ref document: HK