CN111881849A - Image scene detection method and device, electronic equipment and storage medium - Google Patents

Image scene detection method and device, electronic equipment and storage medium Download PDF

Info

Publication number
CN111881849A
CN111881849A (application CN202010753370.8A)
Authority
CN
China
Prior art keywords
image
sample
network model
features
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010753370.8A
Other languages
Chinese (zh)
Inventor
邹子杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202010753370.8A
Publication of CN111881849A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/25: Fusion techniques
    • G06F18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application disclose an image scene detection method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a scene image to be detected, randomly generating N random frames in the scene image, and extracting the image features in each random frame to obtain N local features of the scene image, where N is a positive integer; extracting the image features of the entire scene image to obtain a global feature of the scene image; fusing the N local features with the global feature; and performing target scene detection on the scene image according to the fused features to obtain a target scene detection result of the scene image. The image scene detection method and apparatus, the electronic device, and the storage medium can accurately detect the target scene of a scene image and improve the accuracy of image scene classification.

Description

Image scene detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image scene detection method and apparatus, an electronic device, and a storage medium.
Background
Recognizing and classifying image scenes is a basic and important task in image classification. It has attracted great research interest in recent years and has been successfully deployed in many application products to intelligently solve practical problems. In image scene classification, scene recognition is strongly correlated with object recognition, and the objects contained in a scene have a great influence on the category of the scene. However, the class of a scene does not depend only on the objects it contains; it is actually determined by the individual semantic regions and their hierarchical and spatial layout. For scenes that are complex in definition, contain many situations and objects, and are easily confused, the accuracy of traditional image scene classification methods is low.
Disclosure of Invention
The embodiment of the application discloses an image scene detection method and device, electronic equipment and a storage medium, which can accurately detect a target scene of a scene image and improve the accuracy of image scene classification.
The embodiment of the application discloses an image scene detection method, which comprises the following steps: acquiring a scene image to be detected, randomly generating N random frames in the scene image, and respectively extracting image features in each random frame to obtain N local features of the scene image, wherein N is a positive integer; extracting the image characteristics of the whole scene image to obtain the global characteristics of the scene image; fusing the N local features with the global feature; and carrying out target scene detection on the scene image according to the fused features to obtain a target scene detection result of the scene image.
Embodiments of the present application disclose an image scene detection apparatus, including: a random module, configured to acquire a scene image to be detected and randomly generate N random frames in the scene image; a first feature extraction module, configured to extract the image features in each random frame to obtain N local features of the scene image, where N is a positive integer; a second feature extraction module, configured to extract the image features of the entire scene image to obtain a global feature of the scene image; a fusion module, configured to fuse the N local features with the global feature; and a detection module, configured to perform target scene detection on the scene image according to the fused features to obtain a target scene detection result of the scene image.
The embodiment of the application discloses an electronic device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is enabled to realize the method.
An embodiment of the application discloses a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the method as described above.
Embodiments of the present application disclose an image scene detection method and apparatus, an electronic device, and a storage medium. N random frames are randomly generated in a scene image to be detected, the image features in each random frame are extracted to obtain N local features of the scene image, the image features of the entire scene image are extracted to obtain a global feature of the scene image, the N local features are fused with the global feature, and target scene detection is performed on the scene image according to the fused features to obtain a target scene detection result of the scene image. By fusing the local features and the global feature of the scene image, images belonging to the target scene can be accurately distinguished from images belonging to non-target scenes. Effective local features can be extracted by means of the randomly generated random frames, which alleviates the difficulty of extracting effective local features from complex scenes that have no definite fixed features, so that target scene detection can be performed accurately on the scene image and the accuracy of image scene classification is improved.
The embodiment of the application discloses a network model training method for image scene detection, which comprises the following steps: randomly generating N first sample random frames in the first sample image; inputting image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images; calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics, and adjusting the parameters of the neural network model according to the loss; obtaining training sample characteristics through the neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of a sample image; and training a classifier according to the training sample characteristics so that the error between the target scene prediction result obtained by the classifier and the expected result is smaller than an error threshold.
The embodiment of the application discloses a network model training device for image scene detection, including: the random sample module is used for randomly generating N first sample random frames in the first sample image; the sample feature extraction module is used for inputting image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images; the adjusting module is used for calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss; the sample acquisition module is used for acquiring training sample characteristics through the neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of a sample image; and the classifier training module is used for training a classifier according to the training sample characteristics so that the error between the target scene prediction result obtained by the classifier and the expected result is smaller than an error threshold value.
The embodiment of the application discloses an electronic device, which comprises a memory and a processor, wherein a computer program is stored in the memory, and when the computer program is executed by the processor, the processor is enabled to realize the network model training method for image scene detection.
The embodiment of the application discloses a computer readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to implement the network model training method for image scene detection.
Embodiments of the present application disclose a network model training method and apparatus for image scene detection, an electronic device, and a storage medium. N first sample random frames are randomly generated in a first sample image, and the image regions corresponding to the N first sample random frames and the first sample image are input into a neural network model for feature extraction to obtain the local sample feature in each first sample random frame and the global sample feature corresponding to the first sample image. The neural network model is trained according to the N local sample features and the global sample features, and training sample features for training a classifier are obtained through the neural network model. Local sample features can be obtained effectively by means of the randomly generated random frames, and combining the local sample features with the global sample features makes the trained neural network model and classifier more accurate, improving the accuracy of target scene detection performed on images based on the neural network model and the classifier.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
FIG. 1 is a block diagram of image processing circuitry in one embodiment;
FIG. 2 is a flow diagram of a method for image scene detection in one embodiment;
FIG. 3 is a diagram of generating a random frame in a scene image in one embodiment;
FIG. 4 is a flow chart of a method for image scene detection in another embodiment;
FIG. 5A is a diagram illustrating target scene detection for a scene image, according to an embodiment;
FIG. 5B is a diagram illustrating target scene detection performed on a scene image according to another embodiment;
FIG. 6 is a flow chart of a method for image scene detection in another embodiment;
FIG. 7A is a diagram illustrating training of a neural network model, according to one embodiment;
FIG. 7B is a diagram illustrating training of a neural network model in another embodiment;
FIG. 8 is a flow diagram of training a classifier in one embodiment;
FIG. 9 is a diagram illustrating training a classifier in one embodiment;
FIG. 10 is a flow diagram of a network model training method for image scene detection in one embodiment;
FIG. 11 is a block diagram of an image scene detection apparatus in one embodiment;
FIG. 12 is a block diagram of a network model training apparatus for image scene detection in one embodiment;
FIG. 13 is a block diagram of an electronic device in one embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It is to be noted that the terms "comprises" and "comprising" and any variations thereof in the examples and figures of the present application are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, a first image may be referred to as a second image, and similarly, a second image may be referred to as a first image, without departing from the scope of the present application. The first image and the second image are both images, but they are not the same image.
For scene classification of an image, scene recognition in the image is related to, yet different from, object recognition. Various methods have been developed for generic and specific scene classification tasks. Since scene classification is usually performed in a feature space, the representation of effective features plays an important role in building a high-performance scene classification method. Existing scene classification methods can generally be divided into three major categories: (1) methods based on handcrafted features; (2) methods based on unsupervised feature learning; and (3) methods based on deep feature learning. These three categories are not necessarily independent, and the same method may fall into more than one category.
The following briefly introduces several scene classification methods described above:
1. Methods based on handcrafted features.
Scene classification methods based on handcrafted features mainly focus on designing various engineered features, such as color, texture, shape, spatial and spectral information, using a great deal of engineering skill and domain expertise. For these main features, typical solutions of handcrafted-feature-based scene classification may include: color histograms, texture descriptors, GIST descriptors, Scale-Invariant Feature Transform (SIFT), Histograms of Oriented Gradients (HOG), and the like.
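For illustration, the following minimal sketch (not part of the original disclosure) computes two of the handcrafted descriptors mentioned above, a color histogram and a HOG descriptor, using OpenCV; the file path, the 8-bin histogram, and the 64x128 HOG window are illustrative assumptions.

```python
# A minimal sketch of two handcrafted descriptors (color histogram, HOG) using OpenCV.
# "scene.jpg" is a hypothetical example path.
import cv2
import numpy as np

img = cv2.imread("scene.jpg")                               # BGR image

# Color histogram: 8 bins per channel, concatenated and normalized.
hist = np.concatenate([
    cv2.calcHist([img], [c], None, [8], [0, 256]).flatten()
    for c in range(3)
])
hist /= hist.sum()

# HOG descriptor on a grayscale image resized to OpenCV's default 64x128 window.
gray = cv2.cvtColor(cv2.resize(img, (64, 128)), cv2.COLOR_BGR2GRAY)
hog_feat = cv2.HOGDescriptor().compute(gray).flatten()      # 3780-dimensional by default
```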
2. Methods based on unsupervised feature learning.
Unsupervised feature learning aims to learn a set of basis functions (or filters) for feature encoding, where the input is a set of handcrafted features or raw pixel intensity values, and the output is a set of learned features. By learning features from the image rather than relying on manually designed features, more discriminative features can be obtained. Typical unsupervised feature learning methods include Principal Component Analysis (PCA), k-means clustering, sparse coding, autoencoders, and the like.
3. Methods based on deep feature learning.
Deep feature learning models can obtain good feature representations, and many deep learning models are applicable, such as Deep Belief Networks (DBN), Deep Boltzmann Machines (DBM), Stacked Autoencoders (SAE), Convolutional Neural Networks (CNN), and the like.
In current image scene classification technology, the semantics of a specific scene can be complex, the scene can be diverse, and the situations and objects it contains can be numerous, so confusion is easily caused and the accuracy of scene classification is low. Moreover, some specific scenes have no definite fixed features; for example, in a street scene, objects such as people and cars on the street also appear in scenes such as exhibition halls, which easily causes scene confusion, prevents effective fixed features from being extracted, and hinders the classification of image scenes.
The present application provides an image scene detection method and apparatus, an electronic device, and a storage medium. The local features of a scene image are fused with its global feature, so that images belonging to the target scene can be accurately distinguished from images belonging to non-target scenes. Effective local features can be extracted by means of randomly generated random frames, which alleviates the difficulty of extracting effective local features from complex scenes that have no definite fixed features, so that target scene detection can be performed accurately on the scene image and the accuracy of image scene classification is improved.
The embodiment of the application provides electronic equipment. The electronic device includes therein an Image Processing circuit, which may be implemented using hardware and/or software components, and may include various Processing units defining an ISP (Image Signal Processing) pipeline. FIG. 1 is a block diagram of an image processing circuit in one embodiment. For ease of illustration, FIG. 1 illustrates only aspects of image processing techniques related to embodiments of the present application.
As shown in fig. 1, the image processing circuit includes an ISP processor 140 and control logic 150. The image data captured by the imaging device 110 is first processed by the ISP processor 140, and the ISP processor 140 analyzes the image data to capture image statistics that may be used to determine one or more control parameters of the imaging device 110. The imaging device 110 may include one or more lenses 112 and an image sensor 114. Image sensor 114 may include an array of color filters (e.g., Bayer filters), and image sensor 114 may acquire light intensity and wavelength information captured by each imaging pixel and provide a set of raw image data that may be processed by ISP processor 140. The attitude sensor 120 (e.g., a three-axis gyroscope, hall sensor, accelerometer, etc.) may provide parameters of the acquired image processing (e.g., anti-shake parameters) to the ISP processor 140 based on the type of interface of the attitude sensor 120. The attitude sensor 120 interface may employ an SMIA (Standard Mobile Imaging Architecture) interface, other serial or parallel camera interfaces, or a combination thereof.
In addition, the image sensor 114 may also transmit raw image data to the attitude sensor 120, the attitude sensor 120 may provide the raw image data to the ISP processor 140 based on the type of interface of the attitude sensor 120, or the attitude sensor 120 may store the raw image data in the image memory 130.
The ISP processor 140 processes the raw image data pixel by pixel in a variety of formats. For example, each image pixel may have a bit depth of 8, 10, 12, or 14 bits, and the ISP processor 140 may perform one or more image processing operations on the raw image data, gathering statistical information about the image data. Wherein the image processing operations may be performed with the same or different bit depth precision.
The ISP processor 140 may also receive image data from the image memory 130. For example, the attitude sensor 120 interface sends raw image data to the image memory 130, and the raw image data in the image memory 130 is then provided to the ISP processor 140 for processing. The image Memory 130 may be a portion of a Memory device, a storage device, or a separate dedicated Memory within an electronic device, and may include a DMA (Direct Memory Access) feature.
Upon receiving raw image data from the image sensor 114 interface or from the attitude sensor 120 interface or from the image memory 130, the ISP processor 140 may perform one or more image processing operations, such as temporal filtering. The processed image data may be sent to image memory 130 for additional processing before being displayed. ISP processor 140 receives the processed data from image memory 130 and performs image data processing on the processed data in the raw domain and in the RGB and YCbCr color spaces. The image data processed by ISP processor 140 may be output to display 160 for viewing by a user and/or further processed by a Graphics Processing Unit (GPU). Further, the output of the ISP processor 140 may also be sent to the image memory 130, and the display 160 may read image data from the image memory 130. In one embodiment, image memory 130 may be configured to implement one or more frame buffers.
The statistics determined by the ISP processor 140 may be sent to the control logic 150. For example, the statistical data may include image sensor 114 statistics such as gyroscope vibration frequency, auto-exposure, auto-white balance, auto-focus, flicker detection, black level compensation, lens 112 shading correction, and the like. The control logic 150 may include a processor and/or microcontroller that executes one or more routines (e.g., firmware) that may determine control parameters of the imaging device 110 and control parameters of the ISP processor 140 based on the received statistical data. For example, the control parameters of the imaging device 110 may include attitude sensor 120 control parameters (e.g., gain, integration time of exposure control, anti-shake parameters, etc.), camera flash control parameters, camera anti-shake displacement parameters, lens 112 control parameters (e.g., focal length for focusing or zooming), or a combination of these parameters. The ISP control parameters may include gain levels and color correction matrices for automatic white balance and color adjustment (e.g., during RGB processing), as well as lens 112 shading correction parameters.
In one embodiment, the scene may be image-captured by a lens 112 and an image sensor 114 in the imaging device (camera) 110, and the ISP processor may acquire an image of the scene to be detected captured by the imaging device 110. The ISP processor can randomly generate N random frames in the scene image, and respectively extract image features in each random frame to obtain N local features of the scene image, wherein N is a positive integer. The ISP can extract the image characteristics of the whole scene image to obtain the global characteristics of the scene image. After the N local features and the global features of the scene image are extracted, the N local features and the global features may be fused, and the target scene detection may be performed on the scene image according to the fused features to obtain a target scene detection result of the scene image, so as to determine whether the scene image belongs to the target scene.
In some embodiments, the ISP processor 140 may send the scene images to the image memory 130 for storage. The ISP processor 140 may also output the scene image to the display 160 for viewing by the user, etc.
As shown in fig. 2, in an embodiment, an image scene detection method is provided, which is applicable to electronic devices such as a mobile phone, an intelligent wearable device, a tablet computer, and a digital camera, and the embodiment of the present application is not limited thereto. The method may comprise the steps of:
step 210, obtaining a scene image to be detected, randomly generating N random frames in the scene image, and respectively extracting image features in each random frame to obtain N local features of the scene image.
In some embodiments, the scene image to be detected may be an image captured by an imaging device such as a camera of the electronic device, or may be an image stored in a memory in advance. The electronic device can perform target scene detection on the acquired scene image and judge whether the scene image belongs to the target scene, so that images of the target scene can be distinguished. In the embodiment of the present application, the target scene may be a scene that is complex, diverse, and contains many situations and objects, such as a street scene or a bazaar scene, but is not limited thereto.
The electronic device can randomly generate N random frames in the acquired scene image, where N is a positive integer. Optionally, N can be a preset value (such as any value from 1 to 200) or a value determined through experiments, so that the number of randomly generated random frames ensures the accuracy of scene classification while avoiding excessive computation.
As an embodiment, randomly generating N random frames in the scene image may include: and generating N random frames with random sizes and random positions in the scene image.
The random size means that the size of each random frame is random, and optionally, the shape of the random frame may include any one of various shapes such as a rectangle, a square, and a circle, for example, if the random frame is a rectangle, N random frames with random length-width ratios may be generated, and if the random frame is a circle, N random frames with random radius may be generated. Optionally, a size threshold of the random frame may be preset, so that the sizes of the N randomly generated random frames are all smaller than the size threshold, to prevent the generated random frame from being too large to accurately extract the local features of the scene image.
The random position can mean that the position of each random frame in a scene image is random, an overlapped image area can exist between every two random frames, and by generating the random frame with the random position, a plurality of image areas in different positions can be collected, so that the richness of the extracted local image is ensured.
FIG. 3 is a diagram illustrating the generation of a random frame in an image of a scene, in one embodiment. As shown in fig. 3, the electronic device may acquire a scene image 310 and randomly generate N random boxes 302 in the scene image 310, wherein 7 random boxes 302 are generated in total in fig. 3, the size of the generated random boxes 302 is random, and the position of each random box 302 in the scene image 310 is also random.
The image area contained in each generated random frame can be identified, and the image features in each random frame can be extracted. Since the image area contained in each random frame is a local image of the scene image, the image features in each random frame can serve as local features of the scene image, yielding the N local features of the scene image. Optionally, the extracted image features of each random frame may include, but are not limited to, edge features, color features, texture features, and the like, and the image features may be extracted in various ways, for example by using color histograms, texture descriptors, and the like, or by using a neural network model (e.g., CNN, SAE, etc.), which is not limited herein. Effective local features of a scene image can be extracted by means of the randomly generated random frames, which alleviates the difficulty of extracting effective local features from complex scenes that have no definite fixed features.
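The following minimal sketch illustrates this step under stated assumptions; the values N=7, the 16-pixel minimum side, and the 0.5 size-threshold ratio are illustrative, not values fixed by the method.

```python
# Generate N random boxes (random size and position, capped by a size threshold)
# and crop the corresponding local regions of the scene image.
import random

def generate_random_boxes(img_w, img_h, n=7, max_ratio=0.5):
    boxes = []
    for _ in range(n):
        w = random.randint(16, int(img_w * max_ratio))   # random width below the size threshold
        h = random.randint(16, int(img_h * max_ratio))   # random height below the size threshold
        x = random.randint(0, img_w - w)                  # random position; boxes may overlap
        y = random.randint(0, img_h - h)
        boxes.append((x, y, w, h))
    return boxes

def crop_local_regions(image, boxes):
    # image is an H x W x C array; each crop is a local region whose features
    # serve as one local feature of the scene image.
    return [image[y:y + h, x:x + w] for (x, y, w, h) in boxes]
```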
Step 220, extracting the image characteristics of the whole scene image to obtain the global characteristics of the scene image.
The electronic device may identify the entire scene image, and extract image features of the entire scene image as global features of the scene image, where the global features of the scene image may include, but are not limited to, edge features, color features, texture features, shape features, spatial relationship features, and the like of the entire scene image, and the spatial relationship features may refer to features such as mutual spatial positions or relative directional relationships among multiple objects segmented from the image. The method for extracting the global features of the scene image can be the same as the method for extracting the image features in each random frame, and the local features and the global features of the scene image can also be extracted in different methods respectively.
It is to be understood that the execution order of the local feature extraction in step 210 and the global feature extraction in step 220 is not limited herein: step 210 may be executed before step 220, or, after the scene image to be detected is acquired, step 220 may be executed before the local feature extraction of step 210, or both may be executed at the same time.
Step 230, fusing the N local features with the global feature.
Optionally, the N local features and the global feature of the scene image may be fused, integrating them into a fused feature that carries both local and global information. The fusion mode may include a parallel strategy, in which two feature vectors are combined into one complex vector, or a serial strategy, in which the two features are directly concatenated.

In some embodiments, the N local features and the global feature may instead be processed by feature mapping to obtain a fused feature that has a mapping relationship with the combination of the N local features and the global feature. By fusing the N local features with the global feature, the local features and the global feature complement each other, and the N local features make up for each other's weaknesses, so that the fused feature describes the information in the scene image more accurately and the robustness of the image features is enhanced.
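A small illustration of the serial and parallel fusion strategies described above is sketched below; the feature shapes and the random placeholder values are assumptions for illustration only.

```python
import numpy as np

local_feats = [np.random.rand(128) for _ in range(7)]   # N local feature vectors
global_feat = np.random.rand(128)                        # global feature vector

# Serial strategy: directly concatenate all features into one longer vector.
serial_fused = np.concatenate(local_feats + [global_feat])     # shape (8 * 128,)

# Parallel strategy: combine two feature vectors into a single complex vector.
parallel_fused = local_feats[0] + 1j * global_feat              # shape (128,), complex-valued
```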
And 240, performing target scene detection on the scene image according to the fused features to obtain a target scene detection result of the scene image.
The electronic equipment can detect whether the scene image belongs to the target scene or not according to the fused features and obtain a target scene detection result, wherein the target scene detection result can be used for representing whether the scene image belongs to the target scene or the non-target scene.
In some embodiments, the electronic device may compare the obtained fused feature with a feature preset in the target scene, calculate a distance between the fused feature and the feature preset in the target scene, and calculate a probability that the scene image belongs to the target scene according to the distance. By fusing the N local features and the global features of the scene image and then carrying out target scene detection, the feature dimension can be improved, more effective features can be utilized for carrying out scene detection, and the accuracy of scene detection is improved.
As a particular embodiment, the target scene may include a street scene. The electronic device can calculate the probability that the scene image belongs to the street scene according to the fused features, and obtain a street scene detection result of the scene image according to the probability, wherein the street scene detection result can be used for representing whether the scene image belongs to the street scene or the non-street scene. Alternatively, the probability that the scene image belongs to a street scene may be compared to the probability that the scene image belongs to a non-street scene, and if the probability that the scene image belongs to a street scene is greater than the probability that the scene image belongs to a non-street scene, it may be determined that the scene image belongs to a street scene, and if the probability that the scene image belongs to a street scene is not greater than the probability that the scene image belongs to a non-street scene, it may be determined that the scene image belongs to a non-street scene. Optionally, it may be further determined whether the probability that the scene image belongs to the street scene is greater than a probability threshold, and if the probability is greater than the probability threshold, it may be determined that the scene image belongs to the street scene, and if the probability is not greater than the probability threshold, it may be determined that the scene image belongs to the non-street scene.
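A sketch of the two decision rules described above for a street-scene detector is given below; `p_street` and `p_non_street` are assumed to come from the classifier, and the probability threshold 0.5 is an illustrative assumption.

```python
def is_street_scene(p_street, p_non_street=None, threshold=0.5):
    if p_non_street is not None:
        return p_street > p_non_street      # rule 1: compare the two class probabilities
    return p_street > threshold             # rule 2: compare against a probability threshold
```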
In some embodiments, if the target scene detection result of the scene image is that the scene image belongs to the target scene, the electronic device may process the scene image according to an image adjustment parameter corresponding to the target scene, where the image adjustment parameter may include, but is not limited to, one or more of a color adjustment parameter, a brightness adjustment parameter, a white balance processing parameter, and the like, so as to beautify the scene image in a targeted manner, thereby improving the visual effect of the scene image.
In the embodiment of the present application, N random frames are randomly generated in a scene image to be detected, the image features in each random frame are extracted to obtain N local features of the scene image, the image features of the entire scene image are extracted to obtain a global feature of the scene image, the N local features are fused with the global feature, and target scene detection is performed on the scene image according to the fused features to obtain a target scene detection result. By fusing the local features and the global feature of the scene image, images belonging to the target scene can be accurately distinguished from images not belonging to the target scene. Effective local features can be extracted by means of the randomly generated random frames, which alleviates the difficulty of extracting effective local features from complex scenes that have no definite fixed features, so that target scene detection can be performed accurately on the scene image and the accuracy of image scene classification is improved.
As shown in fig. 4, in an embodiment, another image scene detection method is provided, which is applicable to the electronic device described above, and the method may include the following steps:
step 402, obtaining a scene image to be detected, and randomly generating N random frames in the scene image.
The description of step 402 can refer to the related description of step 210 in the above embodiments, and is not repeated herein.
And step 404, respectively extracting the image features of the image area corresponding to each random frame through the trained neural network model, and outputting N local feature vectors.
The neural network model for extracting the image features can be obtained by pre-training, and can be a CNN model such as AlexNet, GoogleNet, SPPNet and the like, and can also be an SAE model and the like. The neural network model may be obtained by performing feature extraction training on a large number of sample images. Further, the neural network model may be obtained by performing feature extraction training on a local image region of a large number of sample images.
After the electronic equipment randomly generates N random frames in the scene image, the image areas of the N random frames can be input into a neural network model, the neural network model can analyze the input image areas of the random frames, extract the image characteristics of the image area of each random frame, and output local characteristic vectors for representing the image characteristics of the image area of each random frame.
In some embodiments, the neural network model may include convolutional layers, pooling layers, fully-connected layers, and the like, wherein the convolutional core size of each convolutional layer may be set according to actual requirements, and the output of each convolutional layer may be used as the input of the next convolutional layer. The pooling layer is arranged between the two convolution layers, the role of the pooling layer in the neural network model is in feature fusion and dimension reduction, the last convolution layer can be connected with the full-connection layer, and the full-connection layer can convert a feature matrix output by the convolution layers into a feature vector.
In some embodiments, before inputting the image regions of the N random frames into the neural network model, the electronic device may sort the N random frames by size, from large to small or from small to large. The sorting order of the N random frames should be the same as the size ordering of the random frames adopted by the neural network model during training. The sizes of the N random frames are randomly generated and may be the same or different; after feature extraction through the neural network model, the output local feature vectors have the same length, but the actual meaning represented by each local feature vector may differ because the random frames differ in size. To ensure that the output local feature vectors are described accurately, when the neural network model is trained, the N sample random frames of the sample image input to the model need to be arranged in a fixed size order, such as from small to large or from large to small. When the trained neural network model is used to extract the image features in the N random frames, the N random frames can be sorted in the same size order adopted during training, the image features of the image region corresponding to each random frame are then extracted in sequence through the trained neural network model according to this ordering, and N local feature vectors corresponding to the ordering are output. This ensures the accuracy of the local feature vectors output by the neural network model.
For example, suppose the sizes of 4 randomly generated random frames are 4 × 3, 5 × 6, 3 × 3, and 4 × 4. If the sample random frames were sorted from large to small during training of the neural network model, the 4 random frames may likewise be sorted from large to small as 5 × 6, 4 × 4, 4 × 3, and 3 × 3, and the trained neural network model then sequentially extracts features from the image region of each sorted random frame and outputs local feature vectors corresponding to the ordering.
In some embodiments, before feature extraction is performed by using the neural network model, the images may be preprocessed, so that the images input to the neural network model have the same specification size, which is convenient for the neural network model to perform calculation, and improves the calculation efficiency. Before inputting the image areas corresponding to the N random frames into the neural network model, the electronic device may perform scaling processing on the image areas corresponding to the N sorted random frames to obtain N local images with a target size, and then sequentially extract the image features of the local images corresponding to each random frame through the trained neural network model according to the sorting of the N random frames. Alternatively, the target size may be set according to actual requirements, and may be the same as the input size adopted by the neural network model during training.
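A minimal sketch of this preprocessing is shown below: the random frames are sorted by area in the same order used during training, and each crop is resized to the target input size. The 224 × 224 target size is an assumption, not a value mandated by the method.

```python
import cv2

def prepare_local_inputs(image, boxes, target_size=(224, 224), descending=True):
    # Sort the random boxes by area (w * h), matching the order used at training time.
    boxes = sorted(boxes, key=lambda b: b[2] * b[3], reverse=descending)
    crops = [image[y:y + h, x:x + w] for (x, y, w, h) in boxes]
    # Resize every local crop to the same target size expected by the network.
    return [cv2.resize(c, target_size) for c in crops], boxes
```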
And 406, extracting the image characteristics of the whole scene image through the neural network model, and outputting a global characteristic vector.
The electronic equipment can input the scene image into the trained neural network model, and extract the image characteristics of the whole scene image through the neural network model to obtain a global characteristic vector of the global characteristics for representing the scene image.
In one embodiment, before the scene image is input into the neural network model, the scene image may be scaled according to the target size, and then the image features of the scaled scene image are extracted through the neural network model. The size of each input image is ensured to be the same, and the calculation efficiency of the neural network model is improved.
As an embodiment, the global features and the local features of the scene image may be extracted through the same neural network model, which may be obtained by performing feature extraction training on a large number of local image regions of the sample image and the entire sample image. By sharing the same set of weight parameters, the extraction of local features and global features can be completed by only using one neural network model, so that the occupation of a memory can be reduced, the model efficiency is improved, and the method is suitable for being deployed in embedded equipment.
As another embodiment, the global feature and the local feature of the scene image may be extracted through different neural network models, respectively. The electronic equipment can respectively extract the image characteristics of the image area corresponding to each random frame through the first network model, output N local characteristic vectors, extract the image characteristics of the whole scene image through the second network model, and output the global characteristic vector. The first network model may be trained by using a plurality of local image regions of the sample image, and the second network model may be trained by using a plurality of entire sample images. Local features and global features are respectively extracted through two different neural network models, so that weight parameters of the neural network models are more targeted, and the extracted features are more accurate.
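The two design choices above are sketched below with a torchvision backbone: either one shared network extracts both the local and the global features (one set of weights, lower memory use), or two separate networks are used, one per branch. The choice of ResNet-18 and the 512-dimensional output are assumptions for illustration.

```python
import torch
import torchvision.models as models

shared = models.resnet18()
shared.fc = torch.nn.Identity()                      # output a 512-d feature vector per image

def extract_with_shared_model(local_batch, global_image):
    # local_batch: (N, 3, H, W) random-frame crops; global_image: (1, 3, H, W) whole image
    return shared(local_batch), shared(global_image)

# Separate-model alternative: a first network for the random-frame regions and a
# second network for the whole image, each with its own weights.
first_net, second_net = models.resnet18(), models.resnet18()
first_net.fc, second_net.fc = torch.nn.Identity(), torch.nn.Identity()

def extract_with_two_models(local_batch, global_image):
    return first_net(local_batch), second_net(global_image)
```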
And step 408, splicing the N local feature vectors and the global feature vector to obtain an M-dimensional fusion feature vector.
In some embodiments, the local feature vector and the global feature vector may each be an M-dimensional feature vector, which may be a positive integer less than or equal to a dimension threshold, e.g., M may be 1, 2, etc. The stitching of the N local feature vectors with the global feature vector may be the stitching of the global feature vector at a specific position of the N local feature vectors, e.g. after the N local feature vectors, etc. Because the local feature vector and the global feature vector are both low-dimensional feature vectors, and the fusion feature vector is also a low-dimensional feature vector, the subsequent calculation amount is small, and the processing efficiency is improved.
For example, the local feature vectors and the global feature vector may be one-dimensional feature vectors. If the N random frames are represented by {i1, i2, i3, …, in}, the neural network model may output N one-dimensional local feature vectors {v1, v2, v3, …, vn} and a one-dimensional global feature vector V, and the N local feature vectors and the global feature vector may then be spliced to obtain a fused feature vector {v1, v2, v3, …, vn, V}.
And step 410, inputting the fusion feature vector into a classifier, calculating the probability of the fusion feature vector belonging to the target scene through the classifier, and obtaining a target scene detection result of the scene image according to the probability.
The classifier may be configured to output a classification result, and in this embodiment, the classification result may include a classification result belonging to a target scene and a classification result belonging to a non-target scene. The classifier may be a linear classifier such as softmax, or may be a nonlinear classifier such as a Support Vector Machine (SVM). The classifier can be obtained by training in advance according to a large number of fusion features extracted from the sample images. The classifier can calculate the probability that the scene image belongs to the target scene according to the fusion feature vector, and for each feature contained in the input fusion feature vector, the classifier can include weight coefficients corresponding to the features one by one, and the probability that the scene image belongs to the target scene can be calculated by using the weight coefficients corresponding to the features, so that a target scene detection result is obtained.
For example, the calculation formula for obtaining the target scene detection result may be expressed as F(x) = wᵀx + b, where w is a weight coefficient vector, b is a bias coefficient, and x is the fused feature vector. When F(x) ≥ 0, the target scene detection result can be determined as the scene image belonging to the target scene, and when F(x) < 0, it can be determined as the scene image belonging to a non-target scene.
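A minimal sketch of this linear decision function is given below; w and b are assumed to have been obtained by training the classifier as described later.

```python
import numpy as np

def detect_target_scene(x, w, b):
    # x: fused feature vector; w: weight coefficients; b: bias coefficient
    score = np.dot(w, x) + b                 # F(x) = w^T x + b
    return "target scene" if score >= 0 else "non-target scene"
```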
FIG. 5A is a diagram illustrating target scene detection on a scene image, in one embodiment. As shown in fig. 5A, the same neural network model can be used to extract local features and global features of the scene image. After the electronic device acquires the scene image, the scene image needs to be processed in two parts:
the first part is that N random frames are randomly generated in a scene image, the N random frames are sequenced according to the size of the random frame, an image area corresponding to each sequenced random frame can be zoomed, a local image with a target size is obtained, and the local image is sequentially input into a neural network model according to the sequence. The neural network model can sequentially extract the image features of each local image according to the input sequence of the N local images and output N local feature vectors.
And a second part, zooming the scene image according to the target size, inputting the zoomed scene image into a neural network model, wherein the neural network model can extract the image characteristics of the zoomed scene image and output a global characteristic vector.
After the processing of the two parts is completed, the electronic equipment can fuse the N local feature vectors and the global feature vectors output by the neural network model to obtain fused feature vectors, then the fused feature vectors are input into the classifier, and the classifier can calculate the probability that the fused feature vectors belong to the target scene, so that the target scene detection result is obtained.
Fig. 5B is a schematic diagram of object scene detection performed on a scene image in another embodiment. As shown in fig. 5B, different neural network models may be respectively used to extract the local features and the global features of the scene image. After the electronic device acquires the scene image, the scene image needs to be processed in two parts:
the first part is that N random frames are randomly generated in the scene image, the N random frames are sorted according to the size of the random frame, the image area corresponding to each sorted random frame can be zoomed, a local image with a target size is obtained, and the local image is sequentially input into the first network model according to the sorting. The first network model can sequentially extract the image features of each local image according to the input sequence of the N local images and output N local feature vectors.
And a second part, zooming the scene image according to the target size, inputting the zoomed scene image into a second network model, wherein the second network model can extract the image characteristics of the zoomed scene image and output a global characteristic vector.
After the processing of the two parts is completed, the electronic device can fuse the N local feature vectors output by the first network model and the global feature vector output by the second network model to obtain a fused feature vector, and then the fused feature vector is input into the classifier, and the classifier can calculate the probability that the fused feature vector belongs to the target scene, so that the target scene detection result is obtained.
In the embodiment of the application, the local features and the global features of the scene image can be respectively extracted through the neural network model, and the local features and the global features of the scene image are fused, so that the feature dimension can be increased, and the accuracy of target scene detection on the image is effectively improved. Effective local features can be extracted by using the randomly generated random frame, and the problem that the effective local features are difficult to extract due to the fact that certain complex scenes do not have definite fixed features is solved.
As shown in fig. 6, in an embodiment, another image scene detection method is provided, which is applicable to the electronic device described above, and the method may include the following steps:
step 602, randomly generating N first sample random frames in the first sample image.
In some embodiments, the electronic device may train the neural network model and the classifier, the training of the neural network model may include steps 602-606, and the training of the classifier may include steps 608-610.
The electronic device may acquire a first sample image set for training the neural network model, where the first sample image set may include multiple first sample images, and optionally, the first sample image may be an image belonging to a target scene. N first sample random frames may be randomly generated in the first sample image, where N is a positive integer, and N may be a preset value or a value obtained through an experiment. For each first sample image, the number N of the randomly generated first sample random frames is the same, and when the trained neural network model is used for local feature extraction, the number N of the random frames generated by the scene image to be detected needs to be consistent with the number N adopted during training of the neural network model, so that the number of the local feature vectors output by the neural network model can be ensured to be fixed.
The way of randomly generating N first sample random frames in the first sample image may be the same as or similar to the way of randomly generating N random frames in the scene image in step 210 of the foregoing embodiment, and details are not repeated here.
And step 604, inputting the image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images.
In one embodiment, before the image regions and the first sample images respectively corresponding to the N first sample random frames are input into the neural network model, the input image regions and the first sample images may be preprocessed to ensure that the input image regions and the first sample images are input into the neural network model according to a certain sequence and specification size.
Optionally, step 604 may include: sequencing the N first sample random frames according to the sequence of the sizes of the random frames from large to small or from small to large; carrying out zooming processing on image areas respectively corresponding to the N sorted first sample random frames to obtain N first images with target sizes; sequentially inputting the N first images into a neural network model for feature extraction according to the sequence; and zooming the first sample image to obtain a second image with a target size, and inputting the second image into the neural network model for feature extraction.
Since the sizes of the N first sample random frames are randomly generated, the N first sample random frames may be sorted by size from large to small or from small to large; for every first sample image, the chosen sorting rule needs to be consistent, for example always from small to large, or always from large to small. This ensures that the N local feature vectors output in sequence by the neural network model correspond to this ordering, so that the meaning represented by the output local feature vectors is accurate.
After the N first sample random frames are sequenced according to a certain sequencing rule, the image areas corresponding to the sequenced N first sample random frames respectively can be subjected to scaling processing to obtain N first images with target sizes, and the target sizes can be set according to actual requirements to ensure that the sizes of the images input into the neural network model are consistent, so that calculation is performed by using the neural network model. And sequentially inputting the N first images into the neural network model for feature extraction according to the sequence of the N first sample random frames to obtain N local sample features of the first sample image so as to train the neural network model.
The electronic equipment can also perform scaling processing on the first sample image to obtain a second image with a target size, and then input the second image into the neural network model for feature extraction to obtain global sample features of the first sample image so as to train the neural network model.
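As a hedged illustration of the preprocessing described above — sorting the N first sample random frames by size and scaling each cropped area, as well as the whole first sample image, to the target size — the sketch below assumes PyTorch; the 224 x 224 target size and the helper names are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def preprocess_sample(image, frames, target_size=(224, 224)):
    """image: tensor of shape (C, H, W); frames: list of (x, y, w, h) random frames."""
    # Sort the random frames by area from small to large; the only requirement is that
    # the same sorting rule is used for every image, both at training and at detection.
    frames = sorted(frames, key=lambda b: b[2] * b[3])
    crops = []
    for (x, y, w, h) in frames:
        crop = image[:, y:y + h, x:x + w].unsqueeze(0)                  # (1, C, h, w)
        crops.append(F.interpolate(crop, size=target_size,
                                   mode='bilinear', align_corners=False))
    first_images = torch.cat(crops, dim=0)                              # N first images, target size
    second_image = F.interpolate(image.unsqueeze(0), size=target_size,
                                 mode='bilinear', align_corners=False)  # scaled whole sample image
    return first_images, second_image
```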
And 606, calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics, and adjusting the parameters of the neural network model according to the loss.
The N local sample features and the global sample features of the first sample image extracted by the neural network model can be used as a prediction result output by the neural network model. The distance between the prediction result and an expected result of the first sample image can be calculated through a loss function, and the parameters in the neural network model are corrected and adjusted according to the distance, where the distance can represent the error between the prediction result and the expected result. The expected result of the first sample image may be a preset feature vector for representing local sample features and global sample features, and may be a feature vector generated from manually selected features, or a feature vector obtained by performing feature extraction with a well-trained convolutional neural network.
As a specific embodiment, the parameters of each layer in the neural network model can be updated through a back propagation mechanism. Alternatively, the parameters of the neural network model may include weight parameters and the like in each convolutional layer, fully-connected layer and the like, and the weight parameters of each convolutional layer and fully-connected layer may be randomly set before training, so that the predicted result output by the neural network model is closer to the expected result through continuous adjustment in the training process.
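To make the loss calculation and the back-propagation update concrete, here is a minimal sketch assuming PyTorch, an MSE distance between the predicted features and an expected feature vector, and an Adam optimizer; the stand-in backbone and all hyper-parameters are assumptions of the example, not the architecture of the embodiment.

```python
import torch
import torch.nn as nn

def make_backbone(feat_dim=128):
    # Stand-in backbone: a few convolution layers, then a projection to a feat_dim vector.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))

model = make_backbone()
criterion = nn.MSELoss()                                   # distance between prediction and expectation
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def training_step(first_images, second_image, expected_features):
    """first_images: (N, C, H, W) local crops; second_image: (1, C, H, W) whole sample image."""
    local_feats = [model(img.unsqueeze(0)) for img in first_images]   # N local sample features
    global_feat = model(second_image)                                 # global sample feature
    predicted = torch.cat(local_feats + [global_feat], dim=1)         # prediction result
    loss = criterion(predicted, expected_features)                    # error vs. expected result
    optimizer.zero_grad()
    loss.backward()                                                   # back propagation
    optimizer.step()                                                  # adjust the model parameters
    return loss.item()
```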
In some embodiments, the same neural network model may be trained through the local image and the entire image of the first sample image, and the trained neural network model may be used to extract the local feature and the global feature of the image at the same time. The electronic equipment can input the image areas corresponding to the N first sample random frames into the neural network model, respectively extract the image characteristics of the image area corresponding to each first sample random frame through the neural network model, and obtain N first characteristic vectors as N local sample characteristics of the first sample image. The electronic equipment can also input the first sample image into the same neural network model, and extract the image characteristics of the whole first sample image through the neural network model to obtain a second characteristic vector as the global sample characteristics of the first sample image.
Optionally, before inputting the N image regions of the first sample image and the entire first sample image into the same neural network model, the N image regions and the entire first sample image arranged according to a certain ordering rule may be first added into an input queue, each image in the queue may be sequentially scaled according to a target size, and then the scaled image is input into the neural network model for feature extraction.
In some embodiments, after the same neural network model outputs N first feature vectors for characterizing local sample features and a second feature vector for characterizing global sample features, the N first feature vectors and the second feature vector may be fused to obtain fused sample features, then the loss of the neural network model is calculated according to the fused sample features, and the parameters of the neural network model are adjusted according to the loss. Optionally, fusing the N first feature vectors and the second feature vector may include stitching them together, for example, stitching the second feature vector after the N first feature vectors. Training the same model means that the global features and the local features share the same set of weights, which makes the generalization capability of the model stronger, improves model efficiency, saves training time, and improves training efficiency.
FIG. 7A is a diagram illustrating training of a neural network model, according to an embodiment. As shown in fig. 7A, the same neural network model is trained using the partial image and the entire sample image of the sample image. For each sample image, two parts of processing can be performed:
the first part is that N random sample frames are randomly generated in a sample image, the N random sample frames are sequenced according to the sequence of the random frame size from large to small or from small to large, the image area corresponding to each sequenced random sample frame is subjected to scaling processing to obtain N first images with target size, and the N first images are added into a queue according to the sequence.
And a second part, which is used for carrying out scaling processing on the sample image to obtain a second image with a target size and adding the second image into the queue. The order in which the second image is added to the queue may be fixed, such as after N first images, etc., but is not limited thereto.
Each image in the queue can be sequentially input into a neural network model for feature extraction, the neural network model obtains N first feature vectors and one second feature vector, the N first feature vectors and the second feature vector can be fused to obtain fused sample features, the distance between the fused sample features and an expected result is calculated by utilizing a loss function, and parameters in the neural network model are adjusted according to the distance. Multiple sample images can be continuously processed according to the flow shown in fig. 7A, and parameters of the neural network model are continuously adjusted until the loss of the neural network model is less than the loss threshold, that is, the prediction result output by the neural network model and the expected result satisfy the error condition, and the training can be considered to be completed. Optionally, the loss threshold may be set according to actual requirements, and the smaller the loss threshold, the higher the accuracy of the neural network model may be.
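The stopping criterion of the flow in fig. 7A can be sketched as an outer loop over many sample images that stops once the loss falls below the loss threshold; the code below is a hedged outline in plain Python, and the threshold and epoch cap are illustrative values only.

```python
def train_until_converged(sample_loader, train_one_sample, loss_threshold=1e-3, max_epochs=100):
    """sample_loader yields (first_images, second_image, expected_features) per sample image;
    train_one_sample runs one forward/backward pass and returns the loss value."""
    last_loss = float('inf')
    for _ in range(max_epochs):
        for first_images, second_image, expected in sample_loader:
            last_loss = train_one_sample(first_images, second_image, expected)
        if last_loss < loss_threshold:   # error condition satisfied: training is considered completed
            break
    return last_loss
```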
In some embodiments, different neural network models may be trained through a local image and an entire image of the first sample image, respectively, to obtain a first network model for extracting local features and a second network model for extracting global features.
The electronic equipment can input image areas corresponding to the N first sample random frames into the first network model, respectively extract image features of the image area corresponding to each first sample random frame through the first network model to obtain N first feature vectors serving as N local sample features of the first sample image, then calculate first loss of the first network model according to the N first feature vectors, and adjust parameters of the first network model according to the first loss.
The electronic device may input the first sample image into the second network model, extract image features of the entire first sample image through the second network model, obtain a second feature vector as global sample features of the first sample image, calculate a second loss of the second network model according to the second feature vector, and adjust parameters of the second network model according to the second loss. Training two neural network models for the local features and the global features respectively makes the weights of the trained models more targeted and improves the accuracy of feature extraction.
It should be noted that the way of training the first network model and the second network model respectively may be the same as or similar to the way of training the same neural network model in the above embodiments, the only difference being that the input images are different, and the training of the first network model and the second network model is not repeated here.
FIG. 7B is a diagram illustrating training of a neural network model, according to one embodiment. As shown in fig. 7B, different neural network models are trained using the partial image and the entire sample image of the sample image, respectively. For each sample image, two parts of processing can be performed:
the first part is that N random sample frames are randomly generated in a sample image, the N random sample frames are sequenced according to the sequence of the sizes of the random frames from large to small or from small to large, and the image area corresponding to each sequenced random sample frame is subjected to scaling processing to obtain N first images with target sizes. Each first image can be sequentially input into the first network model for feature extraction to obtain N first feature vectors, the N first feature vectors are fused, the distance between the fused feature vectors and an expected result (namely, the expected N local feature vectors) can be calculated by using a loss function, and parameters in the first network model are adjusted according to the distance.
And a second part, which is used for carrying out scaling processing on the sample image to obtain a second image with a target size, inputting the second image into a second network model for feature extraction to obtain a second feature vector, calculating the distance between the second feature vector and an expected result (namely an expected global feature vector) by using a loss function, and adjusting parameters in the second network model according to the distance.
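For the two-model variant of fig. 7B, a hedged sketch follows: the first network model is updated from the fused local feature vectors with a first loss, and the second network model from the global feature vector with a second loss. PyTorch is assumed, and the small stand-in backbone is an assumption of the example rather than the architecture of the embodiment.

```python
import torch
import torch.nn as nn

def make_backbone(feat_dim=128):
    # Stand-in backbone producing one feature vector per input image.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim))

first_network, second_network = make_backbone(), make_backbone()
opt1 = torch.optim.Adam(first_network.parameters(), lr=1e-4)
opt2 = torch.optim.Adam(second_network.parameters(), lr=1e-4)
criterion = nn.MSELoss()

def two_model_step(first_images, second_image, expected_locals, expected_global):
    # First part: the N local crops go through the first network model (first loss).
    local_feats = torch.cat([first_network(img.unsqueeze(0)) for img in first_images], dim=1)
    loss1 = criterion(local_feats, expected_locals)
    opt1.zero_grad(); loss1.backward(); opt1.step()
    # Second part: the scaled whole sample image goes through the second network model (second loss).
    global_feat = second_network(second_image)
    loss2 = criterion(global_feat, expected_global)
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```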
Step 608, the second sample image is processed through the neural network model obtained through training, and training sample characteristics are obtained.
And step 610, training the classifier according to the training sample characteristics, so that the error between the target scene prediction result obtained by the classifier and the expected result of the second sample image is smaller than an error threshold.
In some embodiments, the electronic device may train the classifier using the trained neural network model. As shown in FIG. 8, step 608 may include steps 802-808, and step 610 may include steps 810-812.
In step 802, a random frame of N second samples is randomly generated in the second sample image.
The electronic device may acquire a second sample set of images for training the classifier, the second sample set of images may include a plurality of second sample images, and the second sample images may include images belonging to a target scene and images of a non-target scene. Each second sample image may carry an image tag, which may be used to characterize whether the second sample image belongs to a target scene or a non-target scene. Alternatively, the image tag may be represented by one or more of a number, character, etc. For example, the image label is "1" to indicate that the second sample image belongs to the target scene, and the image label is "0" to indicate that the second sample image belongs to the non-target scene, but is not limited thereto.
The N second sample random frames are randomly generated in the second sample image, and the description of generating the N random frames in the scene image in the above embodiment may be referred to, and details are not repeated herein.
And step 804, respectively extracting the image characteristics of the image area corresponding to each second sample random frame through the trained neural network model to obtain N local sample characteristics of the second sample image.
Step 806, extracting image features of the whole second sample image through the trained neural network model to obtain global sample features of the second sample image.
The way of extracting the N local sample features and the global sample features of the second sample image respectively by the trained neural network model may be the same as or similar to the way of extracting the N local sample features and the global sample features of the first sample image in the above embodiment, and reference may be made to the related description in the above embodiment, which is not repeated herein.
In one embodiment, the same neural network model is obtained by training with the sample images. The N first feature vectors and the second feature vector of a second sample image can be extracted through this same neural network model, where the first feature vectors can be used for representing local sample features, and the second feature vector can be used for representing global sample features.
In another embodiment, the first network model and the second network model are obtained by training respectively with the sample images. N first feature vectors of the second sample image may be extracted through the first network model, and a second feature vector of the second sample image may be extracted through the second network model.
And 808, fusing the N local sample features and the global sample feature of the second sample image to obtain the training sample feature.
The N first feature vectors and the second feature vector output by the trained neural network model can be fused, and the training sample features obtained through fusion are used as samples for training the classifier.
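A hedged sketch of how the training sample features for the classifier might be produced from a second sample image with the already-trained model is given below; PyTorch is assumed, the trained model and the preprocessed inputs mirror the earlier sketches, and the label convention (1 = target scene, 0 = non-target scene) follows the example above.

```python
import torch

@torch.no_grad()   # the trained neural network model is only used for feature extraction here
def build_training_sample(trained_model, first_images, second_image, image_label):
    """first_images: (N, C, H, W) crops from the N second sample random frames;
    second_image: (1, C, H, W) scaled whole second sample image;
    image_label: 1 for the target scene, 0 for a non-target scene."""
    local_feats = [trained_model(img.unsqueeze(0)) for img in first_images]  # N local sample features
    global_feat = trained_model(second_image)                                # global sample feature
    fused = torch.cat(local_feats + [global_feat], dim=1)                    # training sample feature
    return fused.squeeze(0), image_label
```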
And 810, calculating the prediction probability of the training sample characteristics belonging to the target scene through the classifier, and obtaining the target scene prediction result of the second sample image according to the prediction probability.
The prediction probability is the probability predicted by the classifier for the input training sample feature belonging to the target scene. The electronic equipment can input the obtained training sample characteristics into the classifier, and the classifier can predict the probability that the training sample characteristics belong to the target scene and obtain a target scene prediction result. Wherein the target scene prediction result may include a probability that the second sample image belongs to the target scene and a probability that the second sample image belongs to the non-target scene.
Step 812, adjusting parameters of the classifier according to an error between the target scene prediction result and an expected result of the second sample image.
An error between the target scene prediction result and the expected result of the second sample image may be calculated using an objective function, which may be used to measure the gap between the calculated prediction result and the expected result. For example, if the target scene prediction result is g(x) = (0.8, 0.2) and the expected result is (1, 0), the distance between the two, i.e., the error, can be calculated, and the parameters of the classifier, such as the weight coefficients and the bias coefficients, can be adjusted according to the calculated error.
The classifier can be continuously trained through a large number of second sample images, and parameters of the classifier are iteratively updated until the error between the target scene prediction result obtained by the classifier and the expected result of the second sample image is smaller than an error threshold value, so that the parameters of the classifier are optimal. Optionally, the error threshold may be set according to actual requirements, and the smaller the error threshold is, the more accurate the prediction result obtained by the classifier is.
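As one hedged possibility for the classifier update loop — predicting the probability that each training sample feature belongs to the target scene and adjusting the classifier parameters until the error is below the error threshold — the sketch below uses a simple two-way linear head trained with cross-entropy in PyTorch; the embodiment does not prescribe this particular classifier, threshold, or optimizer.

```python
import torch
import torch.nn as nn

def train_classifier(features, labels, feat_dim, error_threshold=0.05, max_epochs=200, lr=1e-3):
    """features: (num_samples, feat_dim) fused training sample features;
    labels: (num_samples,) long tensor, 1 = target scene, 0 = non-target scene."""
    classifier = nn.Linear(feat_dim, 2)            # two outputs: target scene / non-target scene
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()              # error between prediction and expected result
    for _ in range(max_epochs):
        logits = classifier(features)
        probs = torch.softmax(logits, dim=1)       # target scene prediction result, e.g. (0.8, 0.2)
        error = criterion(logits, labels)
        optimizer.zero_grad()
        error.backward()
        optimizer.step()                           # adjust weight coefficients and bias coefficients
        if error.item() < error_threshold:         # parameters of the classifier considered optimal
            break
    return classifier
```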
FIG. 9 is a diagram illustrating training of a classifier in one embodiment. As shown in fig. 9, after the N local sample features and the global sample features of the sample image are obtained through the neural network model obtained through training, the N local sample features and the global sample features of the sample image may be fused to obtain the training sample features, and the training sample features are input into the classifier. The error between the predicted result and the expected result of the classifier can be calculated in a target supervision mode, and the parameters of the classifier are continuously updated.
In the embodiment of the application, the classifier is trained using the features extracted by the trained neural network model, so a nonlinear classifier, such as an SVM (Support Vector Machine), can be trained. A nonlinear classifier can effectively expand the classification dimension and reduce the shortcomings of a linear classifier such as softmax on nonlinear classification.
In some embodiments, if the classifier is a linear classifier, the classifier may be trained simultaneously when training the neural network model. After the N local sample features and the global sample features of the sample image are respectively extracted through the neural network model, the N local sample features and the global sample features can be directly input into the classifier after being fused, and the classifier is trained. The training speed can be accelerated, and the training efficiency is improved.
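Where a nonlinear classifier is preferred, one possibility — an assumption of this example, using scikit-learn rather than anything named by the embodiment — is a kernel SVM trained on the fused training sample features:

```python
import numpy as np
from sklearn.svm import SVC

# X: (num_samples, M) fused training sample features extracted by the trained network model,
# y: (num_samples,) image labels, 1 = target scene, 0 = non-target scene.
# The file names are placeholders for this sketch, not names from the embodiment.
X = np.load('train_features.npy')
y = np.load('train_labels.npy')

svm = SVC(kernel='rbf', probability=True)   # nonlinear (RBF-kernel) support vector machine
svm.fit(X, y)

# Predicted probability that a fused feature vector belongs to the target scene
p_target = svm.predict_proba(X[:1])[0, 1]
```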
Step 612, acquiring a scene image to be detected, and randomly generating N random frames in the scene image.
And 614, respectively extracting the image characteristics of the image area corresponding to each random frame through the trained neural network model, and outputting N local characteristic vectors.
And 616, extracting the image characteristics of the whole scene image through the neural network model, and outputting a global characteristic vector.
And step 618, splicing the N local feature vectors and the global feature vector to obtain an M-dimensional fusion feature vector.
And step 620, inputting the fusion feature vector into a classifier, calculating the probability of the fusion feature vector belonging to the target scene through the classifier, and obtaining a target scene detection result of the scene image according to the probability.
The descriptions of steps 612 to 620 refer to the descriptions of steps 402 to 410 in the above embodiments, and are not repeated herein.
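Putting steps 612 to 620 together, a hedged end-to-end detection sketch might look as follows; PyTorch is assumed, and the callables passed in stand for the random-frame, preprocessing, feature-extraction, and classifier pieces sketched in the earlier examples, not for components named by the embodiment.

```python
import torch

@torch.no_grad()
def detect_target_scene(image, n_frames, generate_frames, preprocess, trained_model, classifier):
    """image: (C, H, W) scene image to be detected; returns the probability of the target scene."""
    frames = generate_frames(image.shape[1], image.shape[2], n_frames)   # step 612: N random frames
    crops, whole = preprocess(image, frames)                             # sort by size and scale
    local_vecs = [trained_model(c.unsqueeze(0)) for c in crops]          # step 614: N local feature vectors
    global_vec = trained_model(whole)                                    # step 616: global feature vector
    fused = torch.cat(local_vecs + [global_vec], dim=1)                  # step 618: splice into fusion feature vector
    probs = torch.softmax(classifier(fused), dim=1)                      # step 620: classifier probability
    return probs[0, 1].item()   # index 1 taken as the target scene class (an assumption of this sketch)
```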
In the embodiment of the application, the neural network model is trained according to the N local sample features and the global sample features, and the training sample features for training the classifier are obtained through the neural network model. Effective local sample features can be extracted by using the randomly generated random frames, and the local sample features are combined with the global sample features, so that the trained neural network model and classifier are more accurate, and the accuracy of target scene detection on images based on the neural network model and the classifier is improved.
As shown in fig. 10, in an embodiment, a network model training method for image scene detection is provided, which is applicable to electronic devices such as a Personal Computer (PC), a notebook computer, a mobile phone, and a tablet computer, and the embodiment of the present application is not limited thereto. The method may comprise the steps of:
at step 1010, randomly generating N first sample random frames in the first sample image.
Step 1020, inputting the image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction, so as to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images.
In one embodiment, the step of inputting the image areas respectively corresponding to the N first sample random frames and the first sample image into the neural network model for feature extraction includes: sequencing the N first sample random frames according to the sequence of the sizes of the random frames from large to small or from small to large; carrying out zooming processing on image areas respectively corresponding to the N sorted first sample random frames to obtain N first images with target sizes; sequentially inputting the N first images into a neural network model for feature extraction according to the sequence; and zooming the first sample image to obtain a second image with a target size, and inputting the second image into the neural network model for feature extraction.
And step 1030, calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics, and adjusting the parameters of the neural network model according to the loss.
In one embodiment, the same neural network model is trained using the local image regions of the sample image and the entire sample image. Step 1020, comprising: inputting image areas corresponding to the N first sample random frames into a neural network model, and extracting image characteristics of the image area corresponding to each first sample random frame through the neural network model to obtain N first characteristic vectors serving as N local sample characteristics of the first sample image; and inputting the first sample image into the neural network model, and extracting the image characteristics of the whole first sample image through the neural network model to obtain a second characteristic vector as the global sample characteristics of the first sample image.
Step 1030, comprising: fusing the N first feature vectors and the second feature vectors to obtain fused sample features; and calculating the loss of the neural network model according to the fusion sample characteristics, and adjusting the parameters of the neural network model according to the loss.
In one embodiment, the local image area of the sample image and the whole sample image are used for training different neural network models respectively. The neural network model comprises a first network model and a second network model. Step 1020, comprising: inputting image areas corresponding to the N first sample random frames into a first network model, respectively extracting image features of the image area corresponding to each first sample random frame through the first network model, and obtaining N first feature vectors as N local sample features of the first sample image; and inputting the first sample image into a second network model, and extracting the image characteristics of the whole first sample image through the second network model to obtain a second characteristic vector as the global sample characteristics of the first sample image.
Step 1030, comprising: calculating a first loss of the first network model according to the N first feature vectors, and adjusting parameters of the first network model according to the first loss; and calculating a second loss of the second network model according to the second feature vector, and adjusting parameters of the second network model according to the second loss.
Step 1040, obtaining training sample characteristics through the neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of the sample image.
In one embodiment, step 1040, comprises: randomly generating N second sample random frames in the second sample image; respectively extracting the image characteristics of the image area corresponding to each second sample random frame through the trained neural network model to obtain N local sample characteristics of the second sample image; extracting the image characteristics of the whole second sample image through the trained neural network model to obtain the global sample characteristics of the second sample image; and fusing the N local sample features and the global sample features of the second sample image to obtain the training sample features.
In one embodiment, the trained neural network model includes a trained first network model and a trained second network model. Respectively extracting image characteristics of image areas corresponding to each second sample random frame through a neural network model obtained through training, wherein the image characteristics comprise: respectively extracting image characteristics of image areas corresponding to the random frames of each second sample through a first network model obtained through training; extracting image characteristics of the whole second sample image through a neural network model obtained through training, wherein the image characteristics comprise: and extracting the image characteristics of the whole second sample image through the trained second network model.
And 1050, training the classifier according to the characteristics of the training samples, so that the error between the target scene prediction result obtained by the classifier and the expected result is smaller than an error threshold.
In one embodiment, step 1050 includes: calculating the prediction probability of the training sample characteristics belonging to the target scene through a classifier, and obtaining the target scene prediction result of the second sample image according to the prediction probability; the parameters of the classifier are adjusted according to the error between the target scene prediction result and the expected result of the second sample image.
It should be noted that, for the description of the network model training method for image scene detection provided in the embodiment of the present application, reference may be made to the related description about training of the neural network model and the classifier in the image scene detection methods provided in the foregoing embodiments, and details are not repeated here.
In the embodiment of the application, N first sample random frames are randomly generated in the first sample image, and the image areas respectively corresponding to the N first sample random frames and the first sample image are input into a neural network model for feature extraction, so as to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample image. The neural network model is trained according to the N local sample features and the global sample features, and the training sample features for training the classifier are obtained through the neural network model. Effective local sample features can be extracted by using the randomly generated random frames, and the local sample features are combined with the global sample features, so that the trained neural network model and classifier are more accurate, and the accuracy of target scene detection on images based on the neural network model and the classifier is improved.
As shown in fig. 11, in an embodiment, an image scene detection apparatus 1100 is provided, which may be applied to the electronic device described above, and the image scene detection apparatus 1100 may include a random module 1110, a first feature extraction module 1120, a second feature extraction module 1130, a fusion module 1140, and a detection module 1150.
The random module 1110 is configured to obtain a scene image to be detected, and randomly generate N random frames in the scene image.
The first feature extraction module 1120 is configured to extract image features in each random frame respectively to obtain N local features of the scene image, where N is a positive integer.
The second feature extraction module 1130 is configured to extract image features of the entire scene image, so as to obtain global features of the scene image.
A fusion module 1140 for fusing the N local features with the global feature.
The detecting module 1150 is configured to perform target scene detection on the scene image according to the fused features, so as to obtain a target scene detection result of the scene image.
In one embodiment, the detection module 1150 is further configured to calculate a probability that the scene image belongs to a street scene according to the fused features, and obtain a street scene detection result of the scene image according to the probability, where the street scene detection result is used to characterize whether the scene image belongs to a street scene or a non-street scene.
In the embodiment of the application, the local features of the scene image are fused with its global features, so that images belonging to the target scene can be accurately distinguished from images belonging to a non-target scene. Effective local features can be extracted by using the randomly generated random frames, which alleviates the problem that effective local features are difficult to extract because certain complex scenes have no definite fixed features, so that target scene detection can be accurately performed on the scene image and the accuracy of image scene classification is improved.
In an embodiment, the first feature extraction module 1120 is further configured to extract, through the trained neural network model, image features of the image region corresponding to each random frame, and output N local feature vectors.
The second feature extraction module 1130 is further configured to extract image features of the entire scene image through the neural network model, and output a global feature vector.
In one embodiment, the local feature vector and the global feature vector are both M-dimensional feature vectors, M being a positive integer less than or equal to a dimension threshold. The fusion module 1140 is further configured to splice the N local feature vectors and the global feature vector to obtain an M-dimensional fusion feature vector.
In one embodiment, the detection module 1150 is further configured to input the fusion feature vector into a classifier, calculate a probability that the fusion feature vector belongs to a target scene through the classifier, and obtain a target scene detection result of the scene image according to the probability.
In one embodiment, the first feature extraction module 1120 includes a sorting unit and an extraction unit.
And the sequencing unit is used for sequencing the N random frames according to the sequence of the sizes of the random frames from large to small or from small to large, and the sequencing of the N random frames is the same as the size sequencing sequence of the random frames adopted by the neural network model during training.
And the extraction unit is used for sequentially extracting the image characteristics of the image area corresponding to each random frame through the trained neural network model according to the sequence of the N random frames.
In an embodiment, the extracting unit is further configured to scale image regions corresponding to the N sorted random frames respectively to obtain N local images with a target size, and sequentially extract image features of the local images corresponding to each random frame through a trained neural network model according to the sorting of the N random frames.
In one embodiment, the second feature extraction module 1130 is further configured to scale the scene image according to the target size, and extract the image features of the scaled scene image through the neural network model.
In one embodiment, the neural network model includes a first network model and a second network model.
The first feature extraction module 1120 is further configured to extract, through the first network model, image features of the image area corresponding to each random frame.
The second feature extraction module 1130 is further configured to extract image features of the entire scene image through the second network model.
In the embodiment of the application, the local features and the global features of the scene image can be respectively extracted through the neural network model, and the local features and the global features of the scene image are fused, so that the feature dimension can be increased, and the accuracy of target scene detection on the image is effectively improved. Effective local features can be extracted by using the randomly generated random frame, and the problem that the effective local features are difficult to extract due to the fact that certain complex scenes do not have definite fixed features is solved.
In one embodiment, the image scene detection apparatus 1100 includes a random sample module, a sample feature extraction module, an adjustment module, a sample acquisition module, and a classifier training module, in addition to the random module 1110, the first feature extraction module 1120, the second feature extraction module 1130, the fusion module 1140, and the detection module 1150.
And the random sample module is used for randomly generating N first sample random frames in the first sample image.
And the sample characteristic extraction module is used for inputting the image areas respectively corresponding to the N first sample random frames and the first sample images into the neural network model for characteristic extraction to obtain local sample characteristics in each first sample random frame and global sample characteristics corresponding to the first sample images.
In an embodiment, the sample feature extraction module is further configured to sort the N first sample random frames in a sequence from a large random frame size to a small random frame size or from a small random frame size to a large random frame size, perform scaling processing on image areas respectively corresponding to the N sorted first sample random frames to obtain N first images with a target size, sequentially input the N first images into the neural network model according to the sort to perform feature extraction, and perform scaling processing on the first sample image to obtain a second image with the target size, and input the second image into the neural network model to perform feature extraction.
And the adjusting module is used for calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss.
In one embodiment, the sample feature extraction module includes a first extraction unit and a second extraction unit.
Optionally, the same neural network model is trained using the local image and the entire sample image of the sample image.
And the first extraction unit is used for inputting the image areas corresponding to the N first sample random frames into the neural network model, respectively extracting the image characteristics of the image area corresponding to each first sample random frame through the neural network model, and obtaining N first characteristic vectors as N local sample characteristics of the first sample image.
And the second extraction unit is used for inputting the first sample image into the neural network model, extracting the image characteristics of the whole first sample image through the neural network model and obtaining a second characteristic vector as the global sample characteristics of the first sample image.
And the adjusting module is further used for fusing the N first characteristic vectors and the second characteristic vectors to obtain fused sample characteristics, calculating the loss of the neural network model according to the fused sample characteristics, and adjusting the parameters of the neural network model according to the loss.
Optionally, the different neural network models are trained by using the local image and the whole sample image of the sample image respectively. The neural network model comprises a first network model and a second network model.
The first extraction unit is further configured to input image areas corresponding to the N first sample random frames into the first network model, and extract image features of the image area corresponding to each first sample random frame through the first network model, so as to obtain N first feature vectors serving as N local sample features of the first sample image.
And the second extraction unit is also used for inputting the first sample image into a second network model, extracting the image characteristics of the whole first sample image through the second network model, and obtaining a second characteristic vector as the global sample characteristics of the first sample image.
The adjusting module is further configured to calculate a first loss of the first network model according to the N first feature vectors and adjust parameters of the first network model according to the first loss, and to calculate a second loss of the second network model according to the second feature vector and adjust parameters of the second network model according to the second loss.
And the sample acquisition module is used for acquiring training sample characteristics through a neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of the sample image.
And the classifier training module is used for training the classifier according to the training sample characteristics so that the error between the target scene prediction result obtained by the classifier and the expected result is smaller than an error threshold value.
In one embodiment, the device comprises a sample acquisition module, a generation unit, a local feature extraction unit, a global feature extraction unit and a feature fusion unit.
And the generating unit is used for randomly generating N second sample random frames in the second sample image.
And the local feature extraction unit is used for respectively extracting the image features of the image area corresponding to each second sample random frame through the trained neural network model to obtain N local sample features of the second sample image, wherein the loss of the trained neural network model is less than an expected loss threshold.
And the global feature extraction unit is used for extracting the image features of the whole second sample image through the trained neural network model to obtain the global sample features of the second sample image.
And the feature fusion unit is used for fusing the N local sample features and the global sample features of the second sample image to obtain the training sample features.
And the classifier training module comprises a probability prediction unit and a parameter adjusting unit.
And the probability prediction unit is used for calculating the prediction probability of the training sample characteristics belonging to the target scene through the classifier and obtaining the target scene prediction result of the second sample image according to the prediction probability.
And the parameter adjusting unit is used for adjusting the parameters of the classifier according to the error between the target scene prediction result and the expected result of the second sample image.
In the embodiment of the application, the neural network model is trained according to the N local sample features and the global sample features, and the training sample features for training the classifier are obtained through the neural network model. Effective local sample features can be extracted by using the randomly generated random frames, and the local sample features are combined with the global sample features, so that the trained neural network model and classifier are more accurate, and the accuracy of target scene detection on images based on the neural network model and the classifier is improved.
As shown in fig. 12, in an embodiment, a network model training apparatus 1200 for image scene detection is provided, which is applicable to the electronic device described above, and the network model training apparatus 1200 for image scene detection may include a random sample module 1210, a sample feature extraction module 1220, an adjustment module 1230, a sample acquisition module 1240 and a classifier training module 1250.
A random sample module 1210 for randomly generating N first sample random frames in the first sample image.
The sample feature extraction module 1220 is configured to input the image regions respectively corresponding to the N first sample random frames and the first sample images into the neural network model for feature extraction, so as to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images.
In an embodiment, the sample feature extraction module 1220 is further configured to sort the N first sample random frames in the order from large to small or from small to large according to the random frame size, perform scaling processing on image areas respectively corresponding to the N sorted first sample random frames to obtain N first images with a target size, sequentially input the N first images into the neural network model according to the sort to perform feature extraction, and perform scaling processing on the first sample image to obtain a second image with the target size, and input the second image into the neural network model to perform feature extraction.
And the adjusting module 1230 is configured to calculate a loss of the neural network model according to the N local sample features and the global sample feature, and adjust a parameter of the neural network model according to the loss.
In one embodiment, the sample feature extraction module 1220 includes a first extraction unit and a second extraction unit.
Optionally, the same neural network model is trained using the local image and the entire sample image of the sample image.
And the first extraction unit is used for inputting the image areas corresponding to the N first sample random frames into the neural network model, respectively extracting the image characteristics of the image area corresponding to each first sample random frame through the neural network model, and obtaining N first characteristic vectors as N local sample characteristics of the first sample image.
And the second extraction unit is used for inputting the first sample image into the neural network model, extracting the image characteristics of the whole first sample image through the neural network model and obtaining a second characteristic vector as the global sample characteristics of the first sample image.
The adjusting module 1230 is further configured to fuse the N first feature vectors and the second feature vectors to obtain fused sample features, calculate a loss of the neural network model according to the fused sample features, and adjust parameters of the neural network model according to the loss.
Optionally, the different neural network models are trained by using the local image and the whole sample image of the sample image respectively. The neural network model comprises a first network model and a second network model.
The first extraction unit is further configured to input image areas corresponding to the N first sample random frames into the first network model, and extract image features of the image area corresponding to each first sample random frame through the first network model, so as to obtain N first feature vectors serving as N local sample features of the first sample image.
And the second extraction unit is also used for inputting the first sample image into a second network model, extracting the image characteristics of the whole first sample image through the second network model, and obtaining a second characteristic vector as the global sample characteristics of the first sample image.
The adjusting module 1230 is further configured to calculate a first loss of the first network model according to the N first feature vectors and adjust parameters of the first network model according to the first loss, and to calculate a second loss of the second network model according to the second feature vector and adjust parameters of the second network model according to the second loss.
And the sample obtaining module 1240 is used for obtaining training sample characteristics through the neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of the sample image.
And a classifier training module 1250 configured to train the classifier according to the training sample features so that an error between a target scene prediction result obtained by the classifier and an expected result is smaller than an error threshold.
In one embodiment, the sample acquiring module 1240 includes a generating unit, a local feature extracting unit, a global feature extracting unit, and a feature fusing unit.
And the generating unit is used for randomly generating N second sample random frames in the second sample image.
And the local feature extraction unit is used for respectively extracting the image features of the image area corresponding to each second sample random frame through the trained neural network model to obtain N local sample features of the second sample image, wherein the loss of the trained neural network model is less than an expected loss threshold.
And the global feature extraction unit is used for extracting the image features of the whole second sample image through the trained neural network model to obtain the global sample features of the second sample image.
And the feature fusion unit is used for fusing the N local sample features and the global sample features of the second sample image to obtain the training sample features.
The classifier training module 1250 includes a probability prediction unit and a parameter adjustment unit.
And the probability prediction unit is used for calculating the prediction probability of the training sample characteristics belonging to the target scene through the classifier and obtaining the target scene prediction result of the second sample image according to the prediction probability.
And the parameter adjusting unit is used for adjusting the parameters of the classifier according to the error between the target scene prediction result and the expected result of the second sample image.
In the embodiment of the application, the neural network model is trained according to the N local sample features and the global sample features, and the training sample features for training the classifier are obtained through the neural network model. Effective local sample features can be extracted by using the randomly generated random frames, and the local sample features are combined with the global sample features, so that the trained neural network model and classifier are more accurate, and the accuracy of target scene detection on images based on the neural network model and the classifier is improved.
FIG. 13 is a block diagram showing the structure of an electronic device according to an embodiment. The electronic device can be a mobile phone, a tablet computer, intelligent wearable equipment, a PC, a notebook computer, or other equipment. As shown in fig. 13, the electronic device 1300 may include one or more of the following components: a processor 1310 and a memory 1320 coupled to the processor 1310, wherein the memory 1320 may store one or more application programs, and the one or more application programs may be configured to be executed by the one or more processors 1310 to perform the image scene detection method as described in the embodiments above.
Processor 1310 may include one or more processing cores. The processor 1310 connects various parts of the electronic device 1300 using various interfaces and lines, and performs the various functions of the electronic device 1300 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 1320 and invoking data stored in the memory 1320. Alternatively, the processor 1310 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 1310 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing display content; and the modem is used to handle wireless communications. It is to be understood that the modem may not be integrated into the processor 1310 and may instead be implemented by a separate communication chip.
The Memory 1320 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 1320 may be used to store instructions, programs, code, sets of codes, or sets of instructions. The memory 1320 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The stored data area may also store data created during use by the electronic device 1300, and the like.
It is understood that the electronic device 1300 may include more or less structural elements than those shown in the above structural block diagrams, such as a power supply, an input button, a camera, a speaker, a screen, an RF (Radio Frequency) circuit, a Wi-Fi (Wireless Fidelity) module, a bluetooth module, a sensor, etc., and is not limited thereto.
An embodiment of the present application further provides an electronic device, which includes a memory and a processor, where the memory stores a computer program, and when the computer program is executed by the processor, the processor implements the network model training method for image scene detection as described in the foregoing embodiments.
The embodiment of the application discloses a computer readable storage medium, which stores a computer program, wherein the computer program is executed by a processor to realize the image scene detection method described in the embodiments.
The embodiment of the application discloses a computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the network model training method for image scene detection as described in the embodiments above.
Embodiments of the present application disclose a computer program product comprising a non-transitory computer readable storage medium storing a computer program, and the computer program is executable by a processor to implement the image scene detection method as described in the embodiments above.
Embodiments of the present application disclose a computer program product comprising a non-transitory computer-readable storage medium storing a computer program, and the computer program is executable by a processor to implement a network model training method for image scene detection as described in the embodiments above.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a non-volatile computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the program is executed. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), or the like.
Any reference to memory, storage, database, or other medium as used herein may include non-volatile and/or volatile memory. Suitable non-volatile memory can include ROM, Programmable ROM (PROM), Erasable PROM (EPROM), Electrically Erasable PROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus Direct RAM (RDRAM), and Direct Rambus DRAM (DRDRAM).
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. Those skilled in the art should also appreciate that the embodiments described in this specification are all alternative embodiments and that the acts and modules involved are not necessarily required for this application.
In various embodiments of the present application, it should be understood that the size of the serial number of each process described above does not mean that the execution sequence is necessarily sequential, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
The integrated units, if implemented as software functional units and sold or used as stand-alone products, may be stored in a computer-accessible memory. Based on such understanding, the part of the technical solution of the present application that in essence contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product, which is stored in a memory and includes several requests for causing a computer device (which may be a personal computer, a server, a network device, or the like, and may specifically be a processor in the computer device) to execute part or all of the steps of the above-described methods of the embodiments of the present application.
The image scene detection method, the image scene detection device, the electronic device, and the storage medium disclosed in the embodiments of the present application are described in detail above, and specific examples are applied herein to explain the principles and implementations of the present application. Meanwhile, for a person skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope. In summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (27)

1. An image scene detection method, comprising:
acquiring a scene image to be detected, randomly generating N random frames in the scene image, and respectively extracting image features in each random frame to obtain N local features of the scene image, wherein N is a positive integer;
extracting the image characteristics of the whole scene image to obtain the global characteristics of the scene image;
fusing the N local features with the global feature;
and carrying out target scene detection on the scene image according to the fused features to obtain a target scene detection result of the scene image.
2. The method of claim 1, wherein the target scene comprises a street scene, and the performing the target scene detection on the scene image according to the fused feature to obtain the target scene detection result of the scene image comprises:
calculating the probability that the scene image belongs to the street scene according to the fused features;
and obtaining a street scene detection result of the scene image according to the probability, wherein the street scene detection result is used for representing whether the scene image belongs to the street scene or the non-street scene.
3. The method of claim 1, wherein the extracting the image features in each random frame separately to obtain N local features of the scene image comprises:
respectively extracting image features of image areas corresponding to the random frames through a neural network model obtained through training, and outputting N local feature vectors;
the extracting of the image features of the whole scene image to obtain the global features of the scene image comprises:
and extracting the image characteristics of the whole scene image through the neural network model, and outputting a global characteristic vector.
4. The method of claim 3, wherein the sizes of the N random frames are randomly generated; the respectively extracting, through the trained neural network model, image features of the image area corresponding to each random frame comprises:
sorting the N random frames in descending or ascending order of size, wherein the ordering of the N random frames is the same as the size ordering of random frames adopted by the neural network model during training;
and according to the sequence of the N random frames, sequentially extracting the image characteristics of the image area corresponding to each random frame through a trained neural network model.
5. The method according to claim 4, wherein the extracting the image features of the image region corresponding to each random frame sequentially by the trained neural network model according to the sequence of the N random frames comprises:
scaling the image areas respectively corresponding to the N sorted random frames to obtain N local images of a target size;
and according to the sequence of the N random frames, sequentially extracting the image characteristics of the local image corresponding to each random frame through a trained neural network model.
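A minimal sketch of the sorting and scaling of claims 4 and 5, assuming PyTorch; sorting by box area and bilinear interpolation are illustrative choices.

```python
import torch.nn.functional as F


def sorted_scaled_crops(image, frames, target_size=(224, 224)):
    """image: (3, H, W) tensor; frames: list of (x1, y1, x2, y2) boxes."""
    # Sort the frames by area, smallest to largest; the same ordering must be
    # used at inference time as was used when the neural network model was trained.
    frames = sorted(frames, key=lambda f: (f[2] - f[0]) * (f[3] - f[1]))
    crops = []
    for x1, y1, x2, y2 in frames:
        crop = image[:, y1:y2, x1:x2].unsqueeze(0)                  # (1, 3, h, w)
        crop = F.interpolate(crop, size=target_size,
                             mode="bilinear", align_corners=False)  # local image of target size
        crops.append(crop)
    return frames, crops  # crops are fed to the network in this fixed order
```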
6. The method of claim 3, wherein said extracting image features of the entire scene image through the neural network model comprises:
and scaling the scene image to the target size, and extracting the image features of the scaled scene image through the neural network model.
7. The method of claim 3, wherein the neural network model comprises a first network model and a second network model, the first network model and the second network model being different network models;
the neural network model obtained through training respectively extracts the image characteristics of the image area corresponding to each random frame, and the method comprises the following steps:
respectively extracting the image characteristics of the image area corresponding to each random frame through the first network model;
the extracting, by the neural network model, image features of the entire scene image includes:
and extracting the image characteristics of the whole scene image through the second network model.
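A sketch of the two-branch arrangement of claim 7, assuming PyTorch and torchvision; the particular backbones (MobileNetV2 for the local branch, ResNet-18 for the global branch) are assumptions, since the claim only requires that the two network models differ.

```python
import torch.nn as nn
import torch.nn.functional as F
import torchvision.models as models

# First network model: extracts the N local features from the frame crops.
first_net = models.mobilenet_v2(weights=None).features
# Second network model: a different network that extracts the global feature.
second_net = models.resnet18(weights=None)
second_net.fc = nn.Identity()  # expose the 512-dimensional pooled feature


def extract_local_and_global(crops, image):
    """crops: list of (1, 3, h, w) tensors; image: (1, 3, H, W) tensor."""
    # Local feature vectors from the first network, one per crop.
    local_feats = [F.adaptive_avg_pool2d(first_net(c), 1).flatten(1) for c in crops]
    # Global feature vector of the whole image from the second network.
    global_feat = second_net(image)
    return local_feats, global_feat
```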
8. The method according to any one of claims 2 to 6, wherein the local feature vector and the global feature vector are both M-dimensional feature vectors, wherein M is a positive integer less than or equal to a dimension threshold;
the fusing the N local features with the global feature includes:
and splicing the N local feature vectors and the global feature vector to obtain an M-dimensional fusion feature vector.
9. The method according to claim 8, wherein the performing the target scene detection on the scene image according to the fused feature to obtain the target scene detection result of the scene image comprises:
and inputting the fusion feature vector into a classifier, calculating the probability of the fusion feature vector belonging to a target scene through the classifier, and obtaining a target scene detection result of the scene image according to the probability.
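A sketch of the splicing and classification of claims 8 and 9, assuming PyTorch; the linear projection that maps the spliced vector back to M dimensions is an assumption about how an M-dimensional fusion feature vector is obtained.

```python
import torch
import torch.nn as nn

N, M = 4, 512  # illustrative values; the claim only requires M <= a dimension threshold

# Assumed projection so that the spliced vector is mapped back to M dimensions.
fuse = nn.Linear((N + 1) * M, M)
classifier = nn.Sequential(nn.Linear(M, 1), nn.Sigmoid())


def target_scene_probability(local_vectors, global_vector):
    """local_vectors: list of N (1, M) tensors; global_vector: (1, M) tensor."""
    spliced = torch.cat(local_vectors + [global_vector], dim=1)  # (1, (N + 1) * M)
    fusion_vector = fuse(spliced)                                # M-dimensional fusion feature vector
    return classifier(fusion_vector)                             # probability of the target scene
```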
10. The method according to claim 3, wherein before the trained neural network model extracts the image features of the image region corresponding to each random frame, the method further comprises:
randomly generating N first sample random frames in the first sample image;
inputting image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images;
and calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics, and adjusting the parameters of the neural network model according to the loss.
11. The method according to claim 10, wherein the inputting image regions respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images comprises:
inputting image areas corresponding to the N first sample random frames into a neural network model, and respectively extracting image features of the image area corresponding to each first sample random frame through the neural network model to obtain N first feature vectors serving as N local sample features of the first sample image;
inputting the first sample image into the neural network model, extracting image features of the whole first sample image through the neural network model, and obtaining a second feature vector as global sample features of the first sample image;
the calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss comprises:
fusing the N first feature vectors and the second feature vectors to obtain fused sample features;
and calculating the loss of the neural network model according to the fusion sample characteristics, and adjusting the parameters of the neural network model according to the loss.
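A sketch of the training step of claims 10 and 11, assuming PyTorch; the binary cross-entropy loss and the optimizer are assumptions, since the claims do not specify how the loss is computed.

```python
import torch
import torch.nn as nn


def training_step(model, fusion_head, optimizer, sample_crops, sample_image, label):
    """sample_crops: list of N (1, 3, h, w) crops; sample_image: (1, 3, H, W); label: (1, 1) float tensor."""
    local_feats = [model(c) for c in sample_crops]          # N first feature vectors (local sample features)
    global_feat = model(sample_image)                        # second feature vector (global sample feature)
    fused = torch.cat(local_feats + [global_feat], dim=1)    # fused sample features
    logit = fusion_head(fused)
    loss = nn.functional.binary_cross_entropy_with_logits(logit, label)
    optimizer.zero_grad()
    loss.backward()      # gradients of the loss with respect to the model parameters
    optimizer.step()     # adjust the parameters of the neural network model according to the loss
    return loss.item()
```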
12. The method according to claim 10, wherein inputting the image areas respectively corresponding to the N first sample random frames and the first sample image into a neural network model for feature extraction comprises:
sorting the N first sample random frames in descending or ascending order of size;
scaling the image areas respectively corresponding to the N sorted first sample random frames to obtain N first images of a target size;
sequentially inputting the N first images into a neural network model for feature extraction according to that order;
and scaling the first sample image to obtain a second image of the target size, and inputting the second image into the neural network model for feature extraction.
13. The method of claim 10, wherein the neural network model comprises a first network model and a second network model;
the inputting of the image areas respectively corresponding to the N first sample random frames and the first sample image into the neural network model for feature extraction to obtain the local sample features in each first sample random frame and the global sample features corresponding to the first sample image comprises:
inputting image areas corresponding to the N first sample random frames into the first network model, and extracting image features of the image area corresponding to each first sample random frame through the first network model to obtain N first feature vectors serving as N local sample features of the first sample image;
inputting the first sample image into the second network model, and extracting image features of the whole first sample image through the second network model to obtain a second feature vector as global sample features of the first sample image;
the calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss comprises:
calculating first loss of the first network model according to the N first eigenvectors, and adjusting parameters of the first network model according to the first loss;
and calculating a second loss of the second network model according to the second eigenvector, and adjusting parameters of the second network model according to the second loss.
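A sketch of the separate training of the two network models in claim 13, assuming PyTorch; the per-branch heads, losses, and optimizers are assumptions, since the claim does not specify how each loss is computed.

```python
import torch
import torch.nn as nn


def two_branch_training_step(first_net, second_net, first_head, second_head,
                             first_opt, second_opt, sample_crops, sample_image, label):
    # First loss: computed from the N first feature vectors of the first network model.
    local_feats = torch.cat([first_net(c) for c in sample_crops], dim=1)
    first_loss = nn.functional.binary_cross_entropy_with_logits(first_head(local_feats), label)
    first_opt.zero_grad()
    first_loss.backward()
    first_opt.step()  # adjust the parameters of the first network model

    # Second loss: computed from the second feature vector of the second network model.
    global_feat = second_net(sample_image)
    second_loss = nn.functional.binary_cross_entropy_with_logits(second_head(global_feat), label)
    second_opt.zero_grad()
    second_loss.backward()
    second_opt.step()  # adjust the parameters of the second network model

    return first_loss.item(), second_loss.item()
```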
14. The method of any one of claims 10 to 13, wherein after said calculating a loss of said neural network model based on said fused sample features and adjusting parameters of said neural network model based on said loss, said method further comprises:
randomly generating N second sample random frames in the second sample image;
respectively extracting image features of the image area corresponding to each second sample random frame through a trained neural network model to obtain N local sample features of the second sample image, wherein the loss of the trained neural network model is smaller than an expected loss threshold;
extracting the image characteristics of the whole second sample image through the trained neural network model to obtain the global sample characteristics of the second sample image;
fusing the N local sample features and the global sample features of the second sample image to obtain training sample features;
calculating the prediction probability of the training sample characteristics belonging to the target scene through a classifier, and obtaining the target scene prediction result of the second sample image according to the prediction probability;
adjusting parameters of the classifier according to an error between the target scene prediction result and an expected result of the second sample image.
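A sketch of the classifier training of claim 14, assuming PyTorch and a classifier whose output is a probability (for example, one ending in a sigmoid); the binary cross-entropy error measure and the optimizer are assumptions.

```python
import torch
import torch.nn as nn


def train_classifier_step(classifier, optimizer, training_features, expected_results):
    """training_features: (B, M) fused training sample features; expected_results: (B, 1) tensor of 0/1 labels."""
    prediction = classifier(training_features)               # predicted probability of the target scene
    error = nn.functional.binary_cross_entropy(prediction, expected_results)
    optimizer.zero_grad()
    error.backward()
    optimizer.step()                                          # adjust the classifier parameters
    return error.item()                                       # compare against the error threshold
```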
15. A network model training method for image scene detection, comprising:
randomly generating N first sample random frames in the first sample image;
inputting image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images;
calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics, and adjusting the parameters of the neural network model according to the loss;
obtaining training sample characteristics through the neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of a sample image;
and training a classifier according to the training sample characteristics so that the error between the target scene prediction result obtained by the classifier and the expected result is smaller than an error threshold.
16. The method of claim 15, wherein the deriving training sample features from the neural network model comprises:
randomly generating N second sample random frames in the second sample image;
respectively extracting image features of the image area corresponding to each second sample random frame through a trained neural network model to obtain N local sample features of the second sample image, wherein the loss of the trained neural network model is smaller than an expected loss threshold;
extracting the image characteristics of the whole second sample image through the trained neural network model to obtain the global sample characteristics of the second sample image;
and fusing the N local sample features and the global sample features of the second sample image to obtain the training sample features.
17. The method of claim 16, wherein the trained neural network model comprises a trained first network model and a trained second network model; the step of respectively extracting the image features of the image area corresponding to each second sample random frame by the trained neural network model comprises the following steps:
respectively extracting image features of image areas corresponding to the random frames of each second sample through the trained first network model;
the extracting of the image features of the whole second sample image by the neural network model obtained by the training includes:
and extracting the image characteristics of the whole second sample image through the second network model obtained through training.
18. The method of claim 17, wherein the inputting image regions respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images comprises:
inputting image areas corresponding to the N first sample random frames into the first network model, and extracting image features of the image area corresponding to each first sample random frame through the first network model to obtain N first feature vectors serving as N local sample features of the first sample image;
inputting the first sample image into the second network model, and extracting image features of the whole first sample image through the second network model to obtain a second feature vector as global sample features of the first sample image;
the calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss comprises:
calculating first loss of the first network model according to the N first eigenvectors, and adjusting parameters of the first network model according to the first loss;
and calculating a second loss of the second network model according to the second eigenvector, and adjusting parameters of the second network model according to the second loss.
19. The method according to any one of claims 16 to 18, wherein the training of the classifier according to the training sample features comprises:
calculating the prediction probability of the training sample characteristics belonging to the target scene through a classifier, and obtaining the target scene prediction result of the second sample image according to the prediction probability;
adjusting parameters of the classifier according to an error between the target scene prediction result and an expected result of the second sample image.
20. The method of claim 15, wherein inputting the image regions respectively corresponding to the N first sample random frames and the first sample image into a neural network model for feature extraction comprises:
sorting the N first sample random frames in descending or ascending order of size;
scaling the image areas respectively corresponding to the N sorted first sample random frames to obtain N first images of a target size;
sequentially inputting the N first images into a neural network model for feature extraction according to that order;
and scaling the first sample image to obtain a second image of the target size, and inputting the second image into the neural network model for feature extraction.
21. The method of claim 15, wherein the inputting image regions respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images comprises:
inputting image areas corresponding to the N first sample random frames into a neural network model, and respectively extracting image features of the image area corresponding to each first sample random frame through the neural network model to obtain N first feature vectors serving as N local sample features of the first sample image;
inputting the first sample image into the neural network model, extracting image features of the whole first sample image through the neural network model, and obtaining a second feature vector as global sample features of the first sample image;
the calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss comprises:
fusing the N first feature vectors and the second feature vectors to obtain fused sample features;
and calculating the loss of the neural network model according to the fusion sample characteristics, and adjusting the parameters of the neural network model according to the loss.
22. An image scene detection apparatus, comprising:
the random module is used for acquiring a scene image to be detected and randomly generating N random frames in the scene image;
the first feature extraction module is used for respectively extracting image features in each random frame to obtain N local features of the scene image, wherein N is a positive integer;
the second feature extraction module is used for extracting the image features of the whole scene image to obtain the global features of the scene image;
a fusion module for fusing the N local features with the global feature;
and the detection module is used for carrying out target scene detection on the scene image according to the fused features to obtain a target scene detection result of the scene image.
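A sketch of how the five modules of claim 22 could be composed; the class and method names are illustrative assumptions, and each module may reuse the corresponding sketches above.

```python
class ImageSceneDetectionApparatus:
    """Composition of the five modules recited in claim 22; names are illustrative."""

    def __init__(self, random_module, first_feature_extractor,
                 second_feature_extractor, fusion_module, detection_module):
        self.random_module = random_module                        # generates N random frames
        self.first_feature_extractor = first_feature_extractor    # N local features
        self.second_feature_extractor = second_feature_extractor  # global feature
        self.fusion_module = fusion_module                        # fuses local and global features
        self.detection_module = detection_module                  # target scene detection

    def detect(self, scene_image):
        frames = self.random_module(scene_image)
        local_features = self.first_feature_extractor(scene_image, frames)
        global_feature = self.second_feature_extractor(scene_image)
        fused = self.fusion_module(local_features, global_feature)
        return self.detection_module(fused)
```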
23. A network model training device for image scene detection, comprising:
the random sample module is used for randomly generating N first sample random frames in the first sample image;
the sample feature extraction module is used for inputting image areas respectively corresponding to the N first sample random frames and the first sample images into a neural network model for feature extraction to obtain local sample features in each first sample random frame and global sample features corresponding to the first sample images;
the adjusting module is used for calculating the loss of the neural network model according to the N local sample characteristics and the global sample characteristics and adjusting the parameters of the neural network model according to the loss;
the sample acquisition module is used for acquiring training sample characteristics through the neural network model, wherein the training sample characteristics are obtained by fusing local sample characteristics and global sample characteristics of a sample image;
and the classifier training module is used for training a classifier according to the training sample characteristics so that the error between the target scene prediction result obtained by the classifier and the expected result is smaller than an error threshold value.
24. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program that, when executed by the processor, causes the processor to carry out the method of any one of claims 1 to 14.
25. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program that, when executed by the processor, causes the processor to carry out the method of any one of claims 15 to 21.
26. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 1 to 14.
27. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the method according to any one of claims 15 to 21.
CN202010753370.8A 2020-07-30 2020-07-30 Image scene detection method and device, electronic equipment and storage medium Pending CN111881849A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010753370.8A CN111881849A (en) 2020-07-30 2020-07-30 Image scene detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010753370.8A CN111881849A (en) 2020-07-30 2020-07-30 Image scene detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111881849A true CN111881849A (en) 2020-11-03

Family

ID=73204605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010753370.8A Pending CN111881849A (en) 2020-07-30 2020-07-30 Image scene detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111881849A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160035078A1 (en) * 2014-07-30 2016-02-04 Adobe Systems Incorporated Image assessment using deep convolutional neural networks
CN110096933A (en) * 2018-01-30 2019-08-06 华为技术有限公司 The method, apparatus and system of target detection
CN110580466A (en) * 2019-09-05 2019-12-17 深圳市赛为智能股份有限公司 infant quilt kicking behavior recognition method and device, computer equipment and storage medium
CN110751218A (en) * 2019-10-22 2020-02-04 Oppo广东移动通信有限公司 Image classification method, image classification device and terminal equipment
CN111079674A (en) * 2019-12-22 2020-04-28 东北师范大学 Target detection method based on global and local information fusion
CN111160275A (en) * 2019-12-30 2020-05-15 深圳元戎启行科技有限公司 Pedestrian re-recognition model training method and device, computer equipment and storage medium
CN111325271A (en) * 2020-02-18 2020-06-23 Oppo广东移动通信有限公司 Image classification method and device

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705293A (en) * 2021-02-26 2021-11-26 腾讯科技(深圳)有限公司 Image scene recognition method, device, equipment and readable storage medium
US11989939B2 (en) 2021-03-17 2024-05-21 Samsung Electronics Co., Ltd. System and method for enhancing machine learning model for audio/video understanding using gated multi-level attention and temporal adversarial training
CN112699855A (en) * 2021-03-23 2021-04-23 腾讯科技(深圳)有限公司 Image scene recognition method and device based on artificial intelligence and electronic equipment
CN113313126A (en) * 2021-04-30 2021-08-27 杭州好安供应链管理有限公司 Method, computing device, and computer storage medium for image recognition
WO2023082687A1 (en) * 2021-11-10 2023-05-19 上海商汤智能科技有限公司 Feature detection method and apparatus, and computer device, storage medium and computer program product
CN114004963A (en) * 2021-12-31 2022-02-01 深圳比特微电子科技有限公司 Target class identification method and device and readable storage medium
CN114004963B (en) * 2021-12-31 2022-03-29 深圳比特微电子科技有限公司 Target class identification method and device and readable storage medium
CN114782797A (en) * 2022-06-21 2022-07-22 深圳市万物云科技有限公司 House scene classification method, device and equipment and readable storage medium
CN114782797B (en) * 2022-06-21 2022-09-20 深圳市万物云科技有限公司 House scene classification method, device and equipment and readable storage medium

Similar Documents

Publication Publication Date Title
CN111881849A (en) Image scene detection method and device, electronic equipment and storage medium
TWI805869B (en) System and method for computing dominant class of scene
CN109447169B (en) Image processing method, training method and device of model thereof and electronic system
US11087447B2 (en) Systems and methods for quality assurance of image recognition model
CN109815843B (en) Image processing method and related product
US12002259B2 (en) Image processing apparatus, training apparatus, image processing method, training method, and storage medium
US8792722B2 (en) Hand gesture detection
WO2019233263A1 (en) Method for video processing, electronic device and computer-readable storage medium
CN109960742B (en) Local information searching method and device
CN111767882A (en) Multi-mode pedestrian detection method based on improved YOLO model
WO2019237887A1 (en) Image processing method, electronic device, and computer readable storage medium
US20120027263A1 (en) Hand gesture detection
KR101896357B1 (en) Method, device and program for detecting an object
CN108959462B (en) Image processing method and device, electronic equipment and computer readable storage medium
JP5361524B2 (en) Pattern recognition system and pattern recognition method
JP6149710B2 (en) Image processing apparatus and program
CN110909724B (en) Thumbnail generation method of multi-target image
CN110516707B (en) Image labeling method and device and storage medium thereof
CN112836625A (en) Face living body detection method and device and electronic equipment
CN111598065A (en) Depth image acquisition method, living body identification method, apparatus, circuit, and medium
CN106407978B (en) Method for detecting salient object in unconstrained video by combining similarity degree
CN112800978A (en) Attribute recognition method, and training method and device for part attribute extraction network
WO2023279799A1 (en) Object identification method and apparatus, and electronic system
JP2022068282A (en) White balance adjustment device, focus control device, exposure control device, white balance adjustment method, focus control method, exposure control method and program
CN109785439A (en) Human face sketch image generating method and Related product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination