CN113971761A - Multi-input scene recognition method, terminal device and readable storage medium - Google Patents

Multi-input scene recognition method, terminal device and readable storage medium Download PDF

Info

Publication number
CN113971761A
CN113971761A (application CN202111310734.6A)
Authority
CN
China
Prior art keywords
image
scene
model
application program
output
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111310734.6A
Other languages
Chinese (zh)
Inventor
朱捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanchang Black Shark Technology Co Ltd
Original Assignee
Nanchang Black Shark Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanchang Black Shark Technology Co Ltd filed Critical Nanchang Black Shark Technology Co Ltd
Priority to CN202111310734.6A priority Critical patent/CN113971761A/en
Publication of CN113971761A publication Critical patent/CN113971761A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/22: Matching criteria, e.g. proximity measures
    • G06F 18/24: Classification techniques
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a multi-input scene recognition method, a terminal device and a readable storage medium, relating to the field of computer technology. The method comprises the following steps: acquiring a plurality of scene images under an application program, and selecting a plurality of feature regions from each scene image to obtain a feature image set of the scene image; generating a training sample set; creating a recognition model and training it, wherein the recognition model comprises a splicing recognition model, a stacking recognition model and a feature extraction recognition model that are executed synchronously; confirming the optimal model corresponding to the application program; and acquiring a scene image to be recognized, identifying the application program corresponding to the scene image to be recognized, and calling the optimal model for processing to obtain a target result containing the scene information of the scene image to be recognized. The method solves the problems of high power consumption and high training difficulty caused by whole-image recognition in existing scene recognition.

Description

Multi-input scene recognition method, terminal device and readable storage medium
Technical Field
The invention relates to the technical field of computers, in particular to a multi-input scene recognition method, a terminal device and a readable storage medium.
Background
At present, deep learning has developed rapidly in the field of image recognition: a deep learning model can recognize many image states with satisfactory results, and such models can also run on mobile terminals. Based on this, the common practice today is to take screenshots of the mobile phone screen, recognize them with a model to identify the user's game scene, and notify each business module to perform its business processing. However, recognizing the whole image brings problems in power consumption, training difficulty and recognition effect, while selecting only a single specific region leads to repeated features and therefore a poor recognition effect.
Disclosure of Invention
In order to overcome the above technical defects, the invention aims to provide a multi-input scene recognition method, a terminal device and a readable storage medium that solve the problems of high power consumption and high training difficulty caused by whole-image recognition in existing scene recognition.
The invention discloses a multi-input scene recognition method, which comprises the following steps:
acquiring a plurality of scene images under an application program; for each acquired scene image, selecting a plurality of feature regions based on the scene image, cropping feature images according to the feature regions, and collecting them into a feature image set of the scene image;
correspondingly generating a training sample set according to the feature image set of each scene image and a preset image information label of the scene image;
creating an identification model, and training the identification model by adopting the training sample set, wherein the identification model comprises a splicing identification model, a stacking identification model and a feature extraction identification model which are synchronously executed;
respectively calculating the accuracy rates of the splicing identification model, the stacking identification model and the feature extraction identification model in the training process, and confirming the optimal model corresponding to the application program according to the accuracy rates;
the method comprises the steps of obtaining a scene image to be identified, identifying an application program corresponding to the scene image to be identified, calling an optimal model for processing, and obtaining a target result containing scene information corresponding to the scene image to be identified.
Preferably, training the recognition model using the training sample set includes:
obtaining a training sample from a training sample set;
inputting the training sample into the recognition model, and respectively adopting a splicing recognition model, a stacking recognition model and a feature extraction recognition model for synchronous processing to obtain a first output, a second output and a third output for determining scene information of a scene image in the training sample;
comparing the first output, the second output and the third output with preset image information labels of the scene image respectively, and reversely adjusting the splicing identification model, the stacking identification model and the feature extraction identification model;
and acquiring another training sample to train the recognition model again, stopping training once a preset training condition is reached.
Preferably, a stitching recognition model process is employed to obtain a first output for determining scene information of scene images in the training sample, including the following:
acquiring a characteristic image set of the scene image based on the training sample;
inputting each characteristic image in the characteristic image set to a splicing identification model;
splicing all the characteristic images in the splicing identification model to obtain a first processed image;
and performing feature extraction on the first processed image by adopting a first depth separable convolution network to obtain a first output.
Preferably, a stack recognition model process is employed to obtain a second output for determining scene information for scene images in the training sample, including the following:
acquiring a characteristic image set of the scene image based on the training sample;
inputting each feature image in the set of feature images to a stack recognition model;
performing channel stacking on each feature image in the stacking recognition model to obtain a second processed image;
and performing feature extraction on the second processed image by adopting a second depth separable convolution network to obtain a second output.
Preferably, a feature extraction recognition model process is employed to obtain a third output for determining scene information of scene images in the training sample, including the following:
acquiring a characteristic image set of the scene image based on the training sample;
inputting each characteristic image in the characteristic image set to a characteristic extraction and identification model;
synchronously extracting the features of each feature image by adopting a plurality of third depth separable convolution networks in the feature extraction and identification model, and then merging each feature image after feature extraction into a third processed image;
processing based on the third processed image using a third depth separable convolutional network to obtain a third output.
Preferably, calculating the accuracy rates of the splicing recognition model, the stacking recognition model and the feature extraction recognition model respectively, and confirming the optimal model corresponding to the application program according to the accuracy rates, includes the following steps:
acquiring each scene image of an application program and a corresponding characteristic image set thereof;
acquiring a first output, a second output and a third output of the splicing recognition model, the stacking recognition model and the feature extraction recognition model under each training;
calculating the accuracy of the spliced identification model, the stacked identification model and the feature extraction identification model according to the preset image information label of the scene image and the first output, the second output and the third output under each training;
and selecting, according to the accuracy rates, the model with the highest accuracy rate as the optimal model corresponding to the application program.
Preferably, acquiring a plurality of scene images under an application program and performing category marking on each scene image includes the following steps:
acquiring an application program and running the application program;
recording a screen when the application program runs to obtain a video of the application program, and framing the video to generate a scene image under the application program;
and marking preset image information labels on the scene images according to preset classification rules.
Preferably, the acquisition of the image of the scene to be identified comprises the following:
receiving an application program starting signal, and calling an optimal model corresponding to the application program;
monitoring the application program, and intercepting an operation picture in the application program according to preset setting parameters to obtain a picture screenshot;
and determining and marking a characteristic area for the picture screenshot to obtain a scene image to be identified.
The invention also provides a terminal device, which comprises a memory, a processor and a computer program which is stored on the memory and can run on the processor, and is characterized in that the processor realizes the steps of the multi-input scene recognition method when executing the computer program.
The present invention also provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the multi-input scene recognition method described above.
After the technical scheme is adopted, compared with the prior art, the method has the following beneficial effects:
the game scene is recognized by splitting the scene image of the application program into a plurality of characteristic regions, the scale of model training data is greatly reduced, a better model is trained more quickly, meanwhile, the image size corresponding to the characteristic regions is smaller than that of the whole scene image, the time and power consumption for preprocessing more pictures can be saved, meanwhile, the characteristic regions are utilized to remove some redundant information, and the accuracy of the recognition result is further improved.
Drawings
Fig. 1 is a flowchart of a multi-input scene recognition method, a terminal device and a readable storage medium according to a first embodiment of the present invention;
fig. 2 is a flowchart of training the recognition model by using the training sample set in a first embodiment of the multi-input scene recognition method, the terminal device and the readable storage medium according to the present invention;
fig. 3 is a schematic structural diagram of a stitching recognition model in a first embodiment of the multi-input scene recognition method, the terminal device, and the readable storage medium according to the present invention;
fig. 4 is a schematic structural diagram of a stack recognition model in a first embodiment of the multi-input scene recognition method, the terminal device and the readable storage medium according to the present invention;
fig. 5 is a schematic structural diagram of a feature extraction recognition model in the first embodiment of the multi-input scene recognition method, the terminal device and the readable storage medium according to the present invention;
fig. 6 is an exemplary diagram of training samples in a first embodiment of a multi-input scene recognition method, a terminal device and a readable storage medium according to the present invention;
fig. 7 is a flowchart of confirming the optimal model corresponding to the application program in the first embodiment of the multi-input scene recognition method, the terminal device and the readable storage medium according to the present invention;
fig. 8 is a schematic diagram of device modules of a second embodiment of the multiple-input scene recognition method, the terminal device, and the readable storage medium according to the present invention.
Reference numerals:
6-terminal equipment; 61-a memory; 62-a processor; 63-processing module for a multiple input scene recognition method.
Detailed Description
The advantages of the invention are further illustrated in the following description of specific embodiments in conjunction with the accompanying drawings.
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings, in which like numerals refer to the same or similar elements throughout the different views, unless otherwise specified. The implementations described in the following exemplary examples do not represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "upon", "when" or "in response to determining", depending on the context.
In the description of the present invention, it is to be understood that the terms "longitudinal," "lateral," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are used in the orientation or positional relationship indicated in the drawings, which are used for convenience in describing the present invention and for simplicity in description, and are not intended to indicate or imply that the referenced device or element must have a particular orientation, be constructed and operated in a particular orientation, and are not to be considered limiting.
In the description of the present invention, unless otherwise expressly specified and limited, the terms "mounted," "connected," and "coupled" are to be understood in a broad sense: for example, a connection may be mechanical or electrical, and two elements may be connected directly or indirectly through an intermediary. Those skilled in the art will understand the specific meaning of these terms as used in a specific case.
In the following description, suffixes such as "module", "part", or "unit" used to denote elements are adopted only to facilitate the explanation of the present invention and have no specific meaning in themselves. Thus, "module" and "component" may be used interchangeably.
The first embodiment is as follows: the embodiment provides a method for recognizing a multi-input scene, which is shown in fig. 1 and includes the following steps:
s100: acquiring a plurality of scene images under an application program, acquiring a scene image, selecting a plurality of characteristic regions based on the scene image, intercepting the characteristic images according to the characteristic regions, and collecting to obtain a characteristic image set of the scene image;
In the above step, the feature regions may be selected in advance. For example, the feature points of each scene are observed, the feature points that distinguish the scenes are summarized, and several positions with obvious distinguishing features are selected; the selected regions should be as small as possible while still carrying obvious features. Alternatively, a preset list may be used, which contains the information of each application program and the corresponding set of feature region positions. The preset list may be pre-stored in a database; an application program and its set of feature region positions may be stored in the database after manual marking, or may be determined automatically by a trained model. Specifically, a target detection model may be established and trained on collected samples. For example, when an application program is first opened and enters its guidance interface, the target detection model may screen out regions meeting preset conditions, such as regions whose brightness exceeds a threshold value or whose brightness increases in the guidance interface, as feature regions, and the set of feature region positions is obtained from these regions. After the model obtains the set of feature region positions, the information of the application program and the obtained set can be stored in the database correspondingly, so that the preset list is updated along with updates of the application program.
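As a purely illustrative sketch (not part of the patent), a preset list and the cropping of feature regions could be organized as follows in Python; the package name, region coordinates and the use of Pillow are assumptions:

# Hypothetical preset list mapping each application to its feature-region
# positions, and a helper that crops those regions from one scene image.
from PIL import Image

# application id -> list of (left, top, right, bottom) boxes (illustrative values)
FEATURE_REGIONS = {
    "com.example.game_a": [(0, 0, 256, 128), (512, 0, 768, 128), (0, 952, 256, 1080)],
}

def crop_feature_images(scene_path: str, app_id: str):
    """Crop the preset feature regions of one scene image into a feature image set."""
    scene = Image.open(scene_path).convert("RGB")
    return [scene.crop(box) for box in FEATURE_REGIONS[app_id]]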
Specifically, the step of acquiring a plurality of scene images under an application program and performing category marking on each scene image includes the following steps:
s110: acquiring an application program and running the application program;
In this embodiment, the application program may be any application on the mobile terminal, including but not limited to various types of games.
S120: recording a screen when the application program runs so as to obtain a video of the application program, and framing the video so as to generate a scene image under the application program;
in the above steps, the video data obtained after screen recording is converted into picture data by framing, that is, the scene image in the application program.
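As a hedged illustration of this framing step, the following Python sketch uses OpenCV (an assumed library; the patent names no specific tool) to split a screen-recording video into scene images:

# Yield every n-th frame of the recorded video as one scene image.
import cv2

def frames_from_recording(video_path: str, every_n: int = 30):
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % every_n == 0:
            yield frame  # BGR ndarray, one scene image under the application program
        index += 1
    cap.release()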
S130: and marking preset image information labels on the scene images according to preset classification rules.
Specifically, the scene images are labeled with preset image information labels, for example "loading" or "waiting", and are classified according to the desired game scenes, so that they can be quickly identified from the images of the feature regions during recognition.
S200: correspondingly generating a training sample set according to the characteristic image set of each scene image and a preset image information label of the scene image;
In the above step, data may be captured from the specified regions by a script and stored in a data directory, which produces the feature image set of each scene image. When generating the training sample set, the feature image set and the preset image information label of the scene image may be format-standardized, that is, converted into a training sample set usable by the recognition model described below; an example is shown in fig. 6.
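A minimal sketch of such format standardization, assuming PyTorch/torchvision and hypothetical label names (only "loading" and "waiting" are mentioned in the text; the third label is invented for illustration), might look as follows:

# Pack feature image sets and their preset image information labels
# into a training sample set usable by the recognition models below.
import torch
from torch.utils.data import Dataset
from torchvision import transforms

LABELS = {"loading": 0, "waiting": 1, "in_battle": 2}  # hypothetical label set

class SceneSampleSet(Dataset):
    def __init__(self, samples):
        # samples: list of (list_of_3_PIL_feature_images, label_name)
        self.samples = samples
        self.to_tensor = transforms.Compose([
            transforms.Resize((64, 64)),   # format standardization to a fixed size
            transforms.ToTensor(),
        ])

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, i):
        feature_images, label = self.samples[i]
        tensors = [self.to_tensor(img) for img in feature_images]  # each 3 x 64 x 64
        return tensors, LABELS[label]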
S300: creating an identification model, and training the identification model by adopting the training sample set, wherein the identification model comprises a synchronously executed splicing identification model, a stacking identification model and a feature extraction identification model;
specifically, in the above step, the training of the recognition model is performed by using the training sample set, referring to fig. 2, which includes the following steps:
s310: obtaining a training sample from a training sample set;
for example, each training sample includes a scene image, a feature image set of the scene image, and a preset image information tag of the scene image.
S320: inputting the training sample into the recognition model, and respectively adopting a splicing recognition model, a stacking recognition model and a feature extraction recognition model for synchronous processing to obtain a first output, a second output and a third output for determining scene information of a scene image in the training sample;
In the above step, the splicing recognition model, the stacking recognition model and the feature extraction recognition model all combine the feature images in the feature image set of the training sample and then perform feature recognition. The difference lies in how they combine the images: the splicing recognition model splices all feature images into one image before recognizing features; the stacking recognition model stacks all feature images along the channel dimension before recognizing features; and the feature extraction recognition model first extracts features from each feature image separately, then merges them and recognizes features again. The three models respectively produce the first output, the second output and the third output. In this embodiment, a feature image set containing 3 feature images is used as an example.
Specifically, in the above steps, the step of processing each model separately includes the following steps:
referring to fig. 3, a first output for determining scene information of a scene image in the training sample is obtained by using a stitching recognition model process, which includes the following steps:
s320-11: acquiring a characteristic image set of the scene image based on the training sample;
the generated training samples are stored in a database, and in each training process, one training sample is acquired from the database one by one.
S320-12: inputting each characteristic image in the characteristic image set to a splicing identification model;
s320-13: splicing all the characteristic images in the splicing identification model to obtain a first processed image;
In the above step, 3 regions in the game scene serve as the 3 inputs of the model. The function of this model is mainly to splice the 3 input pictures into one picture: each input picture has 3 channels, and the pictures are spliced correspondingly on each channel, so the spliced image also has 3 channels.
S320-14: and performing feature extraction on the first processed image by adopting a first depth separable convolution network to obtain a first output.
In the above step, the first depth separable convolution network is a model modified from MobileNetV1; the specific modification is to reduce the number of convolution layers so that the network suits mobile terminals and uses less memory. MobileNets build lightweight deep neural networks from a streamlined structure based on depthwise separable convolution. A depthwise separable convolution decomposes a standard convolution into a depthwise convolution and a 1x1, i.e. pointwise, convolution: for MobileNet, the depthwise convolution applies a single filter to each input channel, and the pointwise convolution then applies a 1x1 convolution to combine the outputs of all the depthwise convolutions. This decomposition greatly reduces both the amount of computation and the size of the model. MobileNetV1 replaces the normal convolution with depthwise separable convolution and reduces the number of parameters using a width multiplier, but reducing the number of parameters and operations also loses features and lowers accuracy. Except for the first layer, which is a standard convolution layer, the other convolution layers are depthwise separable convolutions; an average pooling layer follows the convolutions, then a fully connected layer, and finally a Softmax activation function normalizes the output to a probability value between 0 and 1, from which the classification of the image can be obtained.
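To make the structure concrete, the following PyTorch sketch (PyTorch itself, the layer counts and channel widths are assumptions, not the patent's exact network) shows a depthwise separable convolution block and a trimmed MobileNetV1-style splicing recognition model that concatenates the three inputs along the width before extracting features:

import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, stride, 1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn1, self.bn2 = nn.BatchNorm2d(in_ch), nn.BatchNorm2d(out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.bn1(self.depthwise(x)))
        return self.relu(self.bn2(self.pointwise(x)))

class StitchingModel(nn.Module):
    """Splicing recognition model: splice 3 inputs side by side, then classify."""
    def __init__(self, num_classes, in_ch=3):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(in_ch, 32, 3, 2, 1, bias=False),
                                  nn.BatchNorm2d(32), nn.ReLU(inplace=True))
        self.blocks = nn.Sequential(           # fewer layers than full MobileNetV1
            DepthwiseSeparableConv(32, 64),
            DepthwiseSeparableConv(64, 128, stride=2),
            DepthwiseSeparableConv(128, 128),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, num_classes))

    def forward(self, imgs):                    # imgs: list of three B x 3 x H x W tensors
        stitched = torch.cat(imgs, dim=3)       # splice side by side along the width
        x = self.blocks(self.stem(stitched))
        return torch.softmax(self.head(x), dim=1)   # probabilities normalized to 0-1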
Referring to fig. 4, a second output for determining scene information for a scene image in the training sample is obtained using a stack recognition model process, including the following:
s320-21: acquiring a characteristic image set of the scene image based on the training sample;
similar to the above step S320-11, which is not described herein.
S320-22: inputting each feature image in the set of feature images to a stack recognition model;
s320-23: performing channel stacking on each characteristic image in the stacking identification model to obtain a second processing image;
In the above step, 3 regions in the game scene serve as the 3 inputs of the model. The function of this model is mainly to stack the 3 input pictures first: for example, three 3-channel pictures are stacked into a 9-channel image, so the second processed image contains the information of all 3 pictures.
S320-24: and performing feature extraction on the first processed image by adopting a second depth separable convolution network to obtain a second output.
In the above step, the second depth separable convolution network is similar to the first depth separable convolution network: it is also a model modified from MobileNetV1, and the specific modification is to reduce the number of convolution layers so as to suit the mobile terminal (see step S320-14 above). The number of convolution layers in the second depth separable convolution network may be the same as or different from that in the first depth separable convolution network, and can be set according to the mobile terminal actually used.
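Under the same assumptions, the stacking recognition model differs only in that the three images are stacked along the channel dimension, so the first convolution reads 9 channels; a sketch reusing the classes from the splicing sketch above:

class StackingModel(StitchingModel):
    """Stacking recognition model: stack 3 three-channel inputs into 9 channels."""
    def __init__(self, num_classes):
        super().__init__(num_classes, in_ch=9)   # first convolution now reads 9 channels

    def forward(self, imgs):                      # imgs: list of three B x 3 x H x W tensors
        stacked = torch.cat(imgs, dim=1)          # channel stacking: B x 9 x H x W
        x = self.blocks(self.stem(stacked))
        return torch.softmax(self.head(x), dim=1)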
Referring to fig. 5, a third output for determining scene information of a scene image in the training sample is obtained by processing a feature extraction recognition model, including the following:
s320-31: acquiring a characteristic image set of the scene image based on the training sample;
similar to the above step S320-11, which is not described herein.
S320-32: inputting each characteristic image in the characteristic image set to a characteristic extraction and identification model;
s320-33: synchronously extracting the features of each feature image by adopting a plurality of third depth separable convolution networks in the feature extraction and identification model, and then merging each feature image after feature extraction into a third processed image;
In the above step, 3 regions in the game scene serve as the 3 inputs of the model. The function of this model is mainly to first extract the features of each of the 3 input pictures separately, then combine the 3 extracted features, and then perform feature extraction again on the combined result.
S320-34: processing based on the third processed image using a third depth separable convolutional network to obtain a third output.
In the above step, the third depth separable convolution network is similar to the first depth separable convolution network: it is also a model modified from MobileNetV1, and the specific modification is to reduce the number of convolution layers so as to suit the mobile terminal (see step S320-14 above). The number of convolution layers may be the same as or different from that of the first depth separable convolution network and can be set according to the mobile terminal actually used. In this model, the top structure of the modified MobileNetV1 is changed into an operation that extracts the features of the 3 inputs and then combines them. The feature extraction recognition model can therefore extract the features of each input better and does not confuse the information of the inputs at the beginning.
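A corresponding sketch of the feature extraction recognition model, again reusing the DepthwiseSeparableConv block above and with assumed layer sizes, gives each input its own branch before merging:

class FeatureExtractionModel(nn.Module):
    """Per-input feature extraction, merge, then a shared stage and classifier."""
    def __init__(self, num_classes, n_inputs=3):
        super().__init__()
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, 2, 1, bias=False), nn.BatchNorm2d(32), nn.ReLU(inplace=True),
                DepthwiseSeparableConv(32, 64),
            )
        self.branches = nn.ModuleList([branch() for _ in range(n_inputs)])
        self.merged = nn.Sequential(                        # shared stage after merging
            DepthwiseSeparableConv(64 * n_inputs, 128, stride=2),
            DepthwiseSeparableConv(128, 128),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, num_classes))

    def forward(self, imgs):                                # imgs: list of three B x 3 x H x W tensors
        feats = [b(x) for b, x in zip(self.branches, imgs)]  # per-input feature extraction
        merged = torch.cat(feats, dim=1)                     # merge into the third processed image
        return torch.softmax(self.head(self.merged(merged)), dim=1)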
S330: comparing the first output, the second output and the third output with preset image information labels of the scene image respectively, and reversely adjusting the splicing identification model, the stacking identification model and the feature extraction identification model;
s340: and acquiring another training sample pair for the recognition model, and stopping training until a preset training condition is reached.
In the above step, the preset training condition may be a number of training iterations, or the condition that the differences between the first output, the second output, the third output and the preset image information label of the scene image fall within a preset threshold range; it can be set according to the actual usage scene.
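Steps S310 to S340 amount to a standard supervised training loop; the following hedged sketch (assuming the dataset and model sketches above, a negative-log-likelihood loss and an epoch count as the preset training condition) illustrates the synchronous training and reverse adjustment of the three models:

import torch.nn.functional as F
from torch.utils.data import DataLoader

def train(models, dataset, epochs=10, lr=1e-3, device="cpu"):
    loader = DataLoader(dataset, batch_size=32, shuffle=True)
    models = [m.to(device) for m in models]      # splicing, stacking, feature extraction
    opts = [torch.optim.Adam(m.parameters(), lr=lr) for m in models]
    for _ in range(epochs):                      # preset training condition: epoch count
        for feature_images, labels in loader:
            feature_images = [t.to(device) for t in feature_images]
            labels = labels.to(device)
            for model, opt in zip(models, opts):
                opt.zero_grad()
                probs = model(feature_images)    # first / second / third output
                # compare the output with the preset image information label
                loss = F.nll_loss(torch.log(probs + 1e-8), labels)
                loss.backward()                  # reverse adjustment of the model
                opt.step()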
S400: respectively calculating the accuracy of the splicing identification model, the stacking identification model and the feature extraction identification model in the training process, and confirming the optimal model corresponding to the application program according to the accuracy;
Specifically, in the above step, the accuracy rates of the splicing recognition model, the stacking recognition model and the feature extraction recognition model are calculated respectively, and the optimal model corresponding to the application program is confirmed according to the accuracy rates; referring to fig. 7, this includes the following steps:
s410: acquiring each scene image of an application program and a corresponding characteristic image set thereof;
s420: acquiring first output, second output and third output of the splicing recognition model, the stacking recognition model and the feature extraction recognition model under each training;
according to the above S300, the first output, the second output and the third output in each training process can be obtained.
S430: calculating the accuracy of the splicing recognition model, the stacking recognition model and the feature extraction recognition model according to the preset image information label of the scene image and the first output, the second output and the third output under each training;
Specifically, in the above step, the accuracy may be calculated with a similarity measure, for example by computing the Euclidean distances between the preset image information label and the first output, the second output and the third output in each training round. Besides similarity measures, other methods of calculating the difference between the preset image information label and the first, second and third outputs of each training round may also be used. A simple sketch of the selection is given after step S440 below.
S440: and screening out the model corresponding to the highest confirmation accuracy according to the accuracy as the optimal model corresponding to the application program.
S500: the method comprises the steps of obtaining a scene image to be identified, identifying an application program corresponding to the scene image to be identified, calling an optimal model for processing, and obtaining a target result containing scene information corresponding to the scene image to be identified.
In the above steps S300 to S400, the training of the recognition model and the determination of the optimal model may be performed on a server or on the mobile terminal; alternatively, the training may be performed on a server, and the mobile terminal obtains in advance the model of the recognition model that corresponds to the application program.
Specifically, the acquiring of the scene image to be recognized in the above steps includes the following steps:
s510: receiving an application program starting signal, and calling an optimal model corresponding to the application program;
In the above step, the parameters of the started application program may be obtained according to the user ID, and the optimal model of the recognition model matching the application program may be downloaded according to the user ID.
S520; monitoring the application program, and intercepting an operation picture in the application program according to preset setting parameters to obtain a picture screenshot;
Specifically, the preset parameter may be a preset frame acquisition frequency or a preset time interval; that is, scene recognition is performed on the operation pictures meeting this requirement to obtain a target result containing scene information.
S530: and determining and marking a characteristic area for the picture screenshot to obtain a scene image to be identified.
In the above step, as described with reference to step S100, the set of feature region positions matching the application program may be obtained from the preset list, and the feature regions are determined and marked accordingly. Because the feature regions occupy only a small proportion of the whole image while still providing enough feature information, the scene information of the game can be obtained more quickly and accurately.
In the scheme described in this embodiment, the model is trained on a plurality of feature regions split from the scene images of an application program, and the game scene is recognized on the mobile terminal. This greatly reduces the scale of the model training data, so that a better model can be trained more quickly. Meanwhile, the recognized pictures are smaller than the original picture, which saves the time and power consumption of preprocessing larger pictures. A further benefit is that, after the feature regions are selected, the distinguishing features are concentrated and some redundant information is removed, so the recognition effect is better. Since the scenes of some application programs, such as games, do not change greatly, the multi-input model can run faster and better meets the requirement of recognizing game scenes.
The method uses a plurality of regions of a scene image as model inputs for scene recognition. Using a plurality of regions as input reduces the possibility of recognition failure when a single region is repeated or similar across multiple scenes, and avoids the poor recognition effect caused by the excessive redundant information of whole-image recognition. Because the mobile terminal performs the splicing of image data inside the recognition model (specifically, in the three combination modes described above), the model can run on a DSP or a GPU, which accelerates image processing and reduces the load on the CPU. Reducing redundant information also greatly accelerates the convergence of the model and greatly reduces the amount of data required.
Example two:
To achieve the above object, the present invention also provides a terminal device 6. Referring to fig. 8, the intelligent terminal can be implemented in various forms. For example, the terminal described in the present invention may include mobile intelligent terminals such as a mobile phone, a smart phone, a notebook computer, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player) and a navigation device, as well as fixed terminals such as a digital TV and a desktop computer. However, it will be understood by those skilled in the art that the configuration according to the embodiment of the present invention can also be applied to fixed terminals, except for elements used particularly for mobile purposes. The terminal device of this embodiment at least includes, but is not limited to: a memory 61, a processor 62 and a processing module 63 for the multi-input scene recognition method, which may be communicatively connected to each other through a system bus, as shown in fig. 8. It is noted that fig. 8 only shows a terminal device having these components, but it is to be understood that not all of the shown components are required to be implemented, and that more or fewer components may be implemented instead.
In this embodiment, the memory 61 may include a program storage area and a data storage area, wherein the program storage area may store the application program required for at least one function of the system, and the data storage area may store the data information of a user at the terminal device. Further, the memory 61 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the memory 61 may optionally include memory located remotely from the processor, which may be connected to the terminal device 6 via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The processor 62 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip in some embodiments. The processor 62 is generally configured to control the overall operation of the terminal device. In this embodiment, the processor 62 is configured to run the program code stored in the memory 61 or process data, for example to run the processing module 63 for the multi-input scene recognition method, so as to implement the multi-input scene recognition method according to the first embodiment.
It is noted that fig. 8 only shows the terminal device 6 with components 61-62, but it is to be understood that not all shown components are required to be implemented, and that more or less components may be implemented instead.
In the present embodiment, the processing module 63 for the multiple input scene recognition method stored in the memory 61 may also be divided into one or more program modules, which are stored in the memory 61 and executed by one or more processors (in the present embodiment, the processor 62) to complete the present invention.
Example three:
to achieve the above objects, the present invention also provides a computer-readable storage medium including a plurality of storage media such as a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor 62, implements corresponding functions. The computer-readable storage medium of this embodiment is used to store a processing module 63 for a multiple-input scene recognition method, and when executed by the processor 62, implements the multiple-input scene recognition method of the first embodiment.
It should be noted that the embodiments of the present invention have been described above by way of preferred examples and are not limited to any particular form; those skilled in the art may modify and adapt the above-described embodiments in accordance with the principles of the present invention without departing from its scope.

Claims (10)

1. A multi-input scene recognition method is characterized by comprising the following steps:
acquiring a plurality of scene images under an application program; for each acquired scene image, selecting a plurality of feature regions based on the scene image, cropping feature images according to the feature regions, and collecting them into a feature image set of the scene image; correspondingly generating a training sample set according to the feature image set of each scene image and a preset image information label of the scene image;
creating an identification model, and training the identification model by adopting the training sample set, wherein the identification model comprises a splicing identification model, a stacking identification model and a feature extraction identification model which are synchronously executed;
respectively calculating the accuracy of the splicing identification model, the stacking identification model and the feature extraction identification model in the training process, and confirming the optimal model corresponding to the application program according to the accuracy;
the method comprises the steps of obtaining a scene image to be identified, identifying an application program corresponding to the scene image to be identified, calling an optimal model for processing, and obtaining a target result containing scene information corresponding to the scene image to be identified.
2. The method of claim 1, wherein training the recognition model using the training sample set comprises:
obtaining a training sample from a training sample set;
inputting the training sample into the recognition model, and respectively adopting a splicing recognition model, a stacking recognition model and a feature extraction recognition model for synchronous processing to obtain a first output, a second output and a third output for determining scene information of a scene image in the training sample;
comparing the first output, the second output and the third output with preset image information labels of the scene image respectively, and reversely adjusting the splicing identification model, the stacking identification model and the feature extraction identification model;
and acquiring another training sample to train the recognition model again, stopping training once a preset training condition is reached.
3. The method of claim 2, wherein the processing of the stitching recognition model to obtain the first output for determining scene information of the scene images in the training samples comprises:
acquiring a characteristic image set of the scene image based on the training sample;
inputting each characteristic image in the characteristic image set to a splicing identification model;
splicing all the characteristic images in the splicing identification model to obtain a first processed image;
and performing feature extraction on the first processed image by adopting a first depth separable convolution network to obtain a first output.
4. The method of claim 2, wherein the processing of the stacked recognition models to obtain the second output for determining scene information of the scene images in the training samples comprises:
acquiring a characteristic image set of the scene image based on the training sample;
inputting each feature image in the set of feature images to a stack recognition model;
performing channel stacking on each feature image in the stacking recognition model to obtain a second processed image;
and performing feature extraction on the second processed image by adopting a second depth separable convolution network to obtain a second output.
5. The method of claim 2, wherein the processing using a feature extraction recognition model to obtain a third output for determining scene information of the scene images in the training samples comprises:
acquiring a characteristic image set of the scene image based on the training sample;
inputting each characteristic image in the characteristic image set to a characteristic extraction and identification model;
synchronously extracting the features of each feature image by adopting a plurality of third depth separable convolution networks in the feature extraction and identification model, and then merging each feature image after feature extraction into a third processed image;
processing based on the third processed image using a third depth separable convolutional network to obtain a third output.
6. The method according to claim 1, wherein calculating the accuracy rates of the splicing recognition model, the stacking recognition model and the feature extraction recognition model respectively, and confirming the optimal model corresponding to the application program according to the accuracy rates, comprises:
acquiring each scene image of an application program and a corresponding characteristic image set thereof;
acquiring first output, second output and third output of the splicing recognition model, the stacking recognition model and the feature extraction recognition model under each training;
calculating the accuracy of the splicing recognition model, the stacking recognition model and the feature extraction recognition model according to the preset image information label of the scene image and the first output, the second output and the third output under each training;
and selecting, according to the accuracy rates, the model with the highest accuracy rate as the optimal model corresponding to the application program.
7. The method of claim 1, wherein the capturing a plurality of scene images under an application program, and the class labeling of each scene image comprises the following steps:
acquiring an application program and running the application program;
recording a screen when the application program runs to obtain a video of the application program, and framing the video to generate a scene image under the application program;
and marking preset image information labels on the scene images according to preset classification rules.
8. The multiple-input scene recognition method of claim 1, wherein obtaining the scene image to be recognized comprises:
receiving an application program starting signal, and calling an optimal model corresponding to the application program;
monitoring the application program, and intercepting an operation picture in the application program according to preset setting parameters to obtain a picture screenshot;
and determining and marking a characteristic area for the picture screenshot to obtain a scene image to be identified.
9. A terminal device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the steps of the multi-input scene recognition method according to any one of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when being executed by a processor, carries out the steps of the multi-input scene recognition method according to any one of claims 1 to 8.
CN202111310734.6A 2021-11-05 2021-11-05 Multi-input scene recognition method, terminal device and readable storage medium Pending CN113971761A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111310734.6A CN113971761A (en) 2021-11-05 2021-11-05 Multi-input scene recognition method, terminal device and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111310734.6A CN113971761A (en) 2021-11-05 2021-11-05 Multi-input scene recognition method, terminal device and readable storage medium

Publications (1)

Publication Number Publication Date
CN113971761A true CN113971761A (en) 2022-01-25

Family

ID=79589582

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111310734.6A Pending CN113971761A (en) 2021-11-05 2021-11-05 Multi-input scene recognition method, terminal device and readable storage medium

Country Status (1)

Country Link
CN (1) CN113971761A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635719A (en) * 2018-12-10 2019-04-16 宽凳(北京)科技有限公司 A kind of image-recognizing method, device and computer readable storage medium
CN111062871A (en) * 2019-12-17 2020-04-24 腾讯科技(深圳)有限公司 Image processing method and device, computer equipment and readable storage medium
CN112633064A (en) * 2020-11-19 2021-04-09 深圳市银星智能科技股份有限公司 Scene recognition method and electronic equipment
CN113033507A (en) * 2021-05-20 2021-06-25 腾讯科技(深圳)有限公司 Scene recognition method and device, computer equipment and storage medium
CN113486804A (en) * 2021-07-07 2021-10-08 科大讯飞股份有限公司 Object identification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN109961009B (en) Pedestrian detection method, system, device and storage medium based on deep learning
CN111260666B (en) Image processing method and device, electronic equipment and computer readable storage medium
CN112052186B (en) Target detection method, device, equipment and storage medium
CN110222686B (en) Object detection method, object detection device, computer equipment and storage medium
EP2797051B1 (en) Image processing device, image processing method, program, and recording medium
CN111061898A (en) Image processing method, image processing device, computer equipment and storage medium
CN113627402B (en) Image identification method and related device
CN112381104A (en) Image identification method and device, computer equipment and storage medium
CN111368682A (en) Method and system for detecting and identifying station caption based on faster RCNN
CN112417947B (en) Method and device for optimizing key point detection model and detecting face key points
CN112417970A (en) Target object identification method, device and electronic system
Shah et al. Efficient portable camera based text to speech converter for blind person
WO2022252089A1 (en) Training method for object detection model, and object detection method and device
CN114419739A (en) Training method of behavior recognition model, behavior recognition method and equipment
CN112241736A (en) Text detection method and device
CN110969173A (en) Target classification method and device
CN113780116A (en) Invoice classification method and device, computer equipment and storage medium
CN111539390A (en) Small target image identification method, equipment and system based on Yolov3
CN113569613A (en) Image processing method, image processing apparatus, image processing device, and storage medium
CN112257628A (en) Method, device and equipment for identifying identities of outdoor competition athletes
CN113971761A (en) Multi-input scene recognition method, terminal device and readable storage medium
CN117011630A (en) Training method and device for target detection model
CN114639013A (en) Remote sensing image airplane target detection and identification method based on improved Orient RCNN model
CN109325521B (en) Detection method and device for virtual character
CN114241202A (en) Method and device for training dressing classification model and method and device for dressing classification

Legal Events

Code  Description
PB01  Publication
SE01  Entry into force of request for substantive examination