CN117789103A - Scene recognition method, model training method, device and electronic equipment - Google Patents

Scene recognition method, model training method, device and electronic equipment

Info

Publication number
CN117789103A
Authority
CN
China
Prior art keywords
scene
data
network model
training
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311835354.3A
Other languages
Chinese (zh)
Inventor
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd filed Critical Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311835354.3A priority Critical patent/CN117789103A/en
Publication of CN117789103A publication Critical patent/CN117789103A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a scene recognition method, a model training method, a device and electronic equipment, wherein the method comprises the following steps: acquiring a first feature vector of data to be identified through a first network model, wherein the data to be identified is data of an image mode or a voice mode; acquiring respective second feature vectors of a plurality of candidate scenes; taking, among the second feature vectors of the candidate scenes, the second feature vector whose similarity with the first feature vector meets a target similarity condition as a target feature vector; and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified. Therefore, when a new candidate scene needs to be added, only the scene description data corresponding to the new candidate scene needs to be added and converted through the second network model into the corresponding second feature vector, so that the new candidate scene can be identified and the recognizable scenes can be expanded more simply and conveniently.

Description

Scene recognition method, model training method, device and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a scene recognition method, a model training method, a device, and an electronic apparatus.
Background
With advances in technology, in addition to requiring electronic devices (e.g., cell phones, tablet computers, etc.) to have their basic capabilities, more intelligent devices are desired. For example, in some cases, the scene in which the user is located may be perceived by the electronic device, thereby achieving better intelligentization. However, the related scene recognition method has a problem that the scene which can be recognized is not easily expanded.
Disclosure of Invention
In view of the above, the present application proposes a scene recognition method, a model training method, a device and an electronic apparatus, so as to improve the above problem.
In a first aspect, the present application provides a scene recognition method, the method including: acquiring a first feature vector of data to be identified through a first network model, wherein the data to be identified is data of an image mode or a voice mode; acquiring respective second feature vectors of a plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode; taking a second feature vector, of the second feature vectors of each of the plurality of candidate scenes, of which the similarity with the first feature vector meets a target similarity condition, as a target feature vector; and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
In a second aspect, the present application provides a model training method, the method comprising: acquiring a first training data set, wherein the first training data set comprises a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a voice mode; training a first network model to be trained and a second network model to be trained through the first training data set to obtain a first network model and a second network model, wherein the second network model is used for converting scene description data of a plurality of candidate scenes to obtain a plurality of second feature vectors, and the first network model is used for acquiring a first feature vector of data to be identified, so that the scene corresponding to the data to be identified is acquired by acquiring the second feature vector whose similarity with the first feature vector meets a target similarity condition.
In a third aspect, the present application provides a scene recognition device, the device comprising: the data processing unit to be identified is used for acquiring a first feature vector of data to be identified through a first network model, wherein the data to be identified is data of an image mode or a voice mode; the candidate scene acquisition unit is used for acquiring second feature vectors of each of a plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode; the vector comparison unit is used for taking a second feature vector, of the second feature vectors of the candidate scenes, of which the similarity with the first feature vector meets the target similarity condition, as a target feature vector; and the scene acquisition unit is used for taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
In a fourth aspect, the present application provides a model training apparatus, the apparatus comprising: the training data acquisition unit is used for acquiring a first training data set, wherein the first training data set comprises a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a voice mode; the training unit is used for training a first network model to be trained and a second network model to be trained through the first training data set to obtain a first network model and a second network model, wherein the second network model is used for converting scene description data of a plurality of candidate scenes to obtain a plurality of second feature vectors, and the first network model is used for acquiring a first feature vector of data to be identified, so that the scene corresponding to the data to be identified is acquired by acquiring the second feature vector whose similarity with the first feature vector meets the target similarity condition.
In a fifth aspect, the present application provides an electronic device comprising at least a processor, and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the methods described above.
In a sixth aspect, the present application provides a computer readable storage medium having program code stored therein, wherein the program code, when executed by a processor, performs the above-described method.
According to the scene recognition method, the model training method, the device and the electronic equipment provided by the application, after the data to be recognized of the image mode or the voice mode is obtained, the first feature vector of the data to be recognized can be obtained through the first network model, then the second feature vectors of the plurality of candidate scenes are obtained, the second feature vector most similar to the first feature vector is taken as the target feature vector, and the candidate scene corresponding to the target feature vector is taken as the scene corresponding to the data to be recognized. Therefore, with the second feature vectors of the plurality of candidate scenes obtained in advance through the second network model, the data to be identified can, once acquired, be converted into the corresponding first feature vector through the first network model, and the scene to which the data to be identified belongs can be determined by computing the similarity between the first feature vector and the plurality of second feature vectors. Furthermore, when a new candidate scene (a scene that can be identified) needs to be added, only the scene description data corresponding to the new candidate scene needs to be added and converted through the second network model into the corresponding second feature vector, so that the new candidate scene can be identified and the recognizable scenes can be expanded more simply and conveniently.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of an application scenario of a scenario recognition method according to an embodiment of the present application;
fig. 2 is a schematic diagram of another application scenario of the scenario recognition method according to the embodiment of the present application;
FIG. 3 is a flowchart of a scene recognition method according to an embodiment of the present application;
fig. 4 is a schematic diagram illustrating how the scene to which the data to be identified belongs is obtained by using the scene recognition method according to an embodiment of the present application;
FIG. 5 is a flowchart of a scene recognition method according to another embodiment of the present application;
fig. 6 is a schematic diagram illustrating updating scene vector data in the scene recognition method according to the embodiment of the present application;
FIG. 7 is a flowchart of a scene recognition method according to a further embodiment of the present application;
FIG. 8 is a schematic diagram of training a first model to be trained and a second model to be trained according to an embodiment of the present application;
FIG. 9 shows a schematic diagram of a multi-modal encoder in an embodiment of the present application;
FIG. 10 is a flowchart of a scene recognition method according to another embodiment of the present application;
FIG. 11 illustrates a schematic diagram of updating a learnable part in an embodiment of the present application;
FIG. 12 is a schematic diagram illustrating scene recognition optimization and incremental training in an embodiment of the present application;
FIG. 13 is a flow chart illustrating a model training method according to an embodiment of the present application;
fig. 14 shows a block diagram of a scene recognition device according to an embodiment of the present application;
fig. 15 is a block diagram showing a configuration of a scene recognition device according to another embodiment of the present application;
FIG. 16 is a block diagram showing a model training apparatus according to another embodiment of the present application;
FIG. 17 shows a block diagram of another electronic device of the present application for performing a scene recognition method according to an embodiment of the present application;
fig. 18 shows a storage unit for storing or carrying program code that implements the scene recognition method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
With the vigorous development of information technology and the internet, electronic devices such as mobile phones and tablet computers are becoming popular. With advances in technology, in addition to requiring electronic devices to have basic capabilities, more intelligent electronic devices are desired. For example, in some cases, the scene in which the user is located may be perceived by the electronic device, thereby achieving better intelligentization.
However, the inventors found in their study that the related scene recognition method is not easily expanded. For example, the inventors found that, in the case of scene recognition by a scene recognition model, the network parameters (e.g., network weights) of the trained scene recognition model are already fixed, and therefore the trained scene recognition model can effectively recognize only the several scenes for which samples were collected and used in training. If the recognizable scenes need to be expanded with a new scene, data samples corresponding to the new scene need to be collected again, and then the model needs to be trained again. Therefore, model training has to be restarted every time a newly supported scene is added, which costs considerable manpower and material resources, makes the process cumbersome, and makes it difficult to expand the recognizable scenes.
Therefore, after the inventors found the above problems in the study, they have proposed a scene recognition method, a model training method, a device, and an electronic apparatus that can improve the above problems in the present application. In the method, after the data to be identified of the image mode or the voice mode is obtained, a first feature vector of the data to be identified can be obtained through a first network model, then second feature vectors of a plurality of candidate scenes are obtained, the second feature vector which is the most similar to the first feature vector is used as a target feature vector, and the candidate scene corresponding to the target feature vector is used as the scene corresponding to the data to be identified.
Therefore, with the second feature vectors of the plurality of candidate scenes obtained in advance through the second network model, the data to be identified can, once acquired, be converted into the corresponding first feature vector through the first network model, so that the scene to which the data to be identified belongs can be determined by computing the similarity between the first feature vector and the plurality of second feature vectors. When a new candidate scene (a scene that can be identified) needs to be added, only the scene description data corresponding to the new candidate scene needs to be added and converted through the second network model into the corresponding second feature vector, so that the new candidate scene can be identified and the recognizable scenes can be expanded more simply and conveniently.
Before describing embodiments of the present application in further detail, an application environment related to embodiments of the present application will be described.
The application scenario according to the embodiment of the present application will be described first.
In the embodiment of the application, the provided scene recognition method or model training method may be executed by an electronic device (end-side device). In this manner performed by the electronic device, all steps in the scene recognition method or the model training method provided by the embodiments of the present application may be performed by the electronic device. For example, as shown in fig. 1, in the case where all steps in the scene recognition method or the model training method provided in the embodiment of the present application may be performed by an electronic device, all steps may be performed by a processor of the electronic device 100.
Furthermore, the scene recognition method or the model training method provided in the embodiment of the present application may also be executed by the server. Correspondingly, in this manner executed by the server, the server may start executing steps in the scene recognition method or the model training method provided in the embodiments of the present application in response to the trigger instruction. The triggering instruction may be sent by an electronic device used by a user, or may be triggered locally by a server in response to some automation event.
In addition, the scene recognition method or the model training method provided by the embodiment of the application can be cooperatively executed by the electronic device and the server. In this manner, some steps in the scene recognition method or the model training method provided in the embodiment of the present application are performed by the electronic device, and the other steps are performed by the server. Taking the scene recognition method in the present application as an example, as shown in fig. 2, the electronic device 100 may obtain a first feature vector of the data to be identified through the first network model and transmit the first feature vector to the server 200; the server 200 may then obtain the respective second feature vectors of the plurality of candidate scenes, take the second feature vector whose similarity with the first feature vector meets the target similarity condition as the target feature vector, take the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified, and return this scene to the electronic device 100.
In this way, the steps performed by the electronic device and the server are not limited to those described in the above examples, and in practical applications, the steps performed by the electronic device and the server may be dynamically adjusted according to practical situations.
It should be noted that, the electronic device 100 may be a tablet computer, a smart watch, a smart voice assistant, or other devices besides the smart phone shown in fig. 1 and 2. The server 200 may be a stand-alone physical server, or may be a server cluster or a distributed system formed by a plurality of physical servers. In the case where the scene recognition method provided in the embodiment of the present application is performed by a server cluster or a distributed system formed by a plurality of physical servers, different steps in the scene recognition method may be performed by different physical servers, or may be performed by servers built based on the distributed system in a distributed manner.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 3, a scene recognition method provided in an embodiment of the present application includes:
s110: and acquiring a first feature vector of data to be identified through the first network model, wherein the data to be identified is data of an image mode or a voice mode.
In the embodiment of the present application, the data to be identified may be understood as data for performing scene identification. The data to be identified may be data of an image mode or a voice mode. In the embodiments of the present application, a modality may be understood as a form or state of data existence. The data of the image modality can be understood as data in the form of images. For example, the data of the image modality may include a picture or video. Data of a speech modality may be understood as data in the form of speech. For example, the data of the speech modality may comprise a piece of audio, e.g. a piece of music or a recording. The electronic device may collect a picture through the camera as the data to be identified, or the electronic device may record a video through the camera as the data to be identified. For example, the electronic device may collect a piece of audio through a microphone as the data to be identified. In addition, the electronic device may use data (data in an image mode or a voice mode) transmitted from another device as data to be identified.
In the embodiment of the present application, the first network model may be understood as a model for converting the data to be identified into the corresponding feature vector. The vector converted by the first network model may be a first feature vector. Wherein the first network model may have a corresponding modality. The modality corresponding to the first network model may be understood as a modality that can be input into the first network model to be processed by the first network model. Optionally, the mode corresponding to the first network model may be an image mode or a voice mode.
Wherein, in the case that the mode corresponding to the first network model is an image mode, the first network model can be understood as an image encoder. In this case, the first network model may be a CNN (convolutional neural network) model. In the case that the mode corresponding to the first network model is a speech mode, the first network model may be understood as a speech encoder, and in this case, the first network model may be a Mel-spectrum CNN model.
As a way, in the case where the data to be recognized is data of a voice modality, the data to be recognized of the voice modality may be converted into the data to be recognized of an image modality, and then subsequent processing may be performed.
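As an illustration only (this is not part of the original disclosure), the following is a minimal sketch of one possible way to convert voice-modality data into an image-like representation before feeding it to an image-style encoder; it assumes the open-source librosa library, and the sampling rate and number of Mel bands are illustrative assumptions.

```python
# Hedged sketch: turn speech-modality data into an image-like log-Mel spectrogram.
import librosa
import numpy as np

def speech_to_image(audio_path: str, sr: int = 16000) -> np.ndarray:
    waveform, sr = librosa.load(audio_path, sr=sr)                # mono waveform
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=128)
    return librosa.power_to_db(mel, ref=np.max)                   # 2-D "image" of the audio
```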
After the data to be identified is obtained, the data to be identified may be input into the first network model to obtain a first feature vector output by the first network model.
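As an illustration only, a minimal sketch of this step is given below; the ResNet-18 backbone, the 512-dimensional output and the normalization are assumptions made for the example rather than the actual first network model of the application.

```python
# Hedged sketch: obtain the first feature vector of the data to be identified
# through an (assumed) CNN-based first network model.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])
encoder = models.resnet18(weights=None)          # stand-in for the first network model
encoder.fc = torch.nn.Identity()                 # keep the 512-d feature, drop the classifier
encoder.eval()

def first_feature_vector(image_path: str) -> torch.Tensor:
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        v = encoder(x)                           # shape (1, 512)
    return torch.nn.functional.normalize(v, dim=-1).squeeze(0)
```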
S120: and acquiring respective second feature vectors of the plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode.
In the embodiment of the present application, the candidate scene may be understood as a scene that can be identified.
For each scene, there may be corresponding scene content as well as scene description data. Also, for one scene, the scene content corresponds to the scene description data. The scene content may be used to introduce what objects, sounds, light, etc., are objectively present in the scene. The scene description data may be data of a text modality, and the scene description data may be understood as data for defining a corresponding scene. The data defining the scene may enable the user to know what is specifically a scene. For example, a scene includes scene contents such as a driver, a steering wheel, a car window, and passengers, and then the scene description data corresponding to the scene contents may be "a scene inside a bus". For another example, a scene includes a computer, a desk, a chair, etc., and the scene description data corresponding to the scene content may be "a scene inside an office". For another example, if the scene content of a scene includes audio with the content of "the good taste", the scene description data corresponding to the scene content may be "a scene of eating".
In the embodiment of the present application, the data to be identified may be understood as data introducing scene contents in a scene. Under the condition that the scene content of the same scene corresponds to the scene description data, the scene to which the data to be identified belongs can be determined by comparing the data to be identified with the scene description data. In order to facilitate comparison, to determine a scene to which the data to be identified belongs from the multiple candidate scenes, corresponding second feature vectors may be obtained for respective scene description data of the multiple candidate scenes.
As one way, acquiring the second feature vector of each of the plurality of candidate scenes may be understood as acquiring a plurality of second feature vectors converted in advance by the second network model. In this manner, after obtaining the scene description data of the candidate scenes, the second feature vector corresponding to each piece of scene description data may be obtained through the second network model, and the obtained second feature vectors of the plurality of candidate scenes may then be stored. Further, after the first feature vector is acquired, the second feature vectors of the plurality of candidate scenes stored in advance may be directly read. Optionally, the second feature vectors of the plurality of candidate scenes may be stored locally or in another device.
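As an illustration only, a minimal sketch of pre-computing and storing the second feature vectors is given below; the scene names, descriptions, file name and the `text_encoder` callable (standing in for the second network model) are assumptions made for the example.

```python
# Hedged sketch: convert scene description data into second feature vectors in
# advance and store them as a local scene vector database.
import numpy as np

scene_descriptions = {
    "bus":    "a scene inside a bus",
    "office": "a scene inside an office",
    "eating": "a scene of eating",
}

def build_scene_vector_database(text_encoder, path: str = "scene_vectors.npz") -> None:
    names = list(scene_descriptions)
    vectors = np.stack([text_encoder(scene_descriptions[n]) for n in names])
    np.savez(path, names=np.array(names), vectors=vectors)
```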
Alternatively, the obtaining of the second feature vector of each of the plurality of candidate scenes may be understood as obtaining the second feature vector of each of the plurality of candidate scenes from scene description data of the plurality of candidate scenes in real time. In this manner, the device running the scene recognition method may convert, in real time, respective scene description data of the candidate scenes through the second network model, so as to obtain respective second feature vectors of the plurality of candidate scenes.
In this embodiment of the present application, the mode of the input corresponding to the second network model is a text mode, and in this case, the second network model may also be understood as a text encoder. Alternatively, the second network model may be a Transformer network.
S130: and taking the second feature vector, of the second feature vectors of the candidate scenes, of which the similarity with the first feature vector meets the target similarity condition, as a target feature vector.
After the second feature vectors of the candidate scenes are obtained, the similarity between the second feature vectors and the first feature vectors of the candidate scenes can be calculated, so that the similarity between the second feature vectors and the first feature vectors of each candidate scene can be obtained.
In the embodiments of the present application, there may be various ways of calculating the similarity between two vectors (e.g., the first feature vector and the second feature vector). For example, a cosine distance or euclidean distance may be used to calculate the distance between the two vectors, and this distance may be used as the similarity.
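As an illustration only, a minimal sketch of the cosine-similarity computation mentioned above is given below (assuming the feature vectors are 1-D numpy arrays):

```python
# Hedged sketch: similarity between a first feature vector and a second feature vector.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```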
The target similarity condition is a condition for selecting a scene to which the data to be identified belongs from a plurality of candidate scenes. As one approach, the target similarity condition may be a second feature vector having a similarity to the first feature vector greater than a specified similarity threshold. In this way, in the case of obtaining the similarity between the second feature vector and the first feature vector of each candidate scene, the similarity between the second feature vector and the first feature vector of each candidate scene may be compared with the specified similarity threshold value, so as to take the second feature vector, of which the similarity with the first feature vector is greater than the specified similarity threshold value, as the target feature vector. Illustratively, the plurality of candidate scenes includes candidate scene C1, candidate scene C2, candidate scene C3, and candidate scene C4. The similarity between the second feature vector and the first feature vector of the candidate scene C1 is S1, the similarity between the second feature vector and the first feature vector of the candidate scene C2 is S2, the similarity between the second feature vector and the first feature vector of the candidate scene C3 is S3, and the similarity between the second feature vector and the first feature vector of the candidate scene C4 is S4. If S4 is greater than the specified similarity threshold, the second feature vector of the candidate scene C4 may be used as the target feature vector.
Alternatively, the target similarity condition may be a second feature vector having the greatest similarity with the first feature vector. In this way, under the condition that the similarity between the second feature vector and the first feature vector of each candidate scene is obtained, the similarity between the second feature vector and the first feature vector of each candidate scene is compared with each other, so as to determine the second feature vector with the largest similarity with the first feature vector.
Illustratively, the plurality of candidate scenes includes candidate scene C5, candidate scene C6, candidate scene C7, and candidate scene C8. The similarity between the second feature vector and the first feature vector of the candidate scene C5 is S5, the similarity between the second feature vector and the first feature vector of the candidate scene C6 is S6, the similarity between the second feature vector and the first feature vector of the candidate scene C7 is S7, and the similarity between the second feature vector and the first feature vector of the candidate scene C8 is S8. Wherein S5 is greater than S6, S6 is greater than S7, and S7 is greater than S8. In this case, the second feature vector of the candidate scene C5 is taken as the target feature vector.
Depending on the content of the target similarity condition, there may be a plurality of second feature vectors satisfying it. For example, in the case where the target similarity condition is that the similarity with the first feature vector is greater than a specified similarity threshold, there may be a plurality of second feature vectors whose corresponding similarity is greater than the specified similarity threshold; in this case, the second feature vector with the largest corresponding similarity may be further selected from them as the target feature vector.
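As an illustration only, the selection logic described above can be sketched as follows; the threshold value and the data layout follow the earlier sketches and are assumptions made for the example.

```python
# Hedged sketch: keep the candidate scenes whose similarity with the first feature
# vector exceeds a specified threshold, then take the most similar one as the target.
import numpy as np

def select_scene(first_vec, names, vectors, threshold: float = 0.3):
    sims = vectors @ first_vec / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(first_vec) + 1e-12
    )
    candidates = [i for i, s in enumerate(sims) if s > threshold]
    if not candidates:
        return None                               # no candidate scene satisfies the condition
    best = max(candidates, key=lambda i: sims[i])
    return names[best], float(sims[best])
```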
S140: and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
For example, as shown in fig. 4, the candidate scenes may be related to a scene of a bus, a scene of a subway, a scene of a car. In the case where the input data to be recognized is an image of the internal condition of the bus, the last recognized scene may be "scene about the bus".
According to the scene recognition method provided by this embodiment, with the second feature vectors of the plurality of candidate scenes obtained in advance through the second network model, the data to be recognized can, once acquired, be converted into the corresponding first feature vector through the first network model, so that the scene to which the data to be recognized belongs is determined by computing the similarity between the first feature vector and the plurality of second feature vectors. Furthermore, when a new candidate scene (a scene that can be recognized) needs to be added, only the scene description data corresponding to the new candidate scene needs to be added and converted through the second network model into the corresponding second feature vector, so that the new candidate scene can be recognized and the recognizable scenes can be expanded more simply.
Referring to fig. 5, a scene recognition method provided in an embodiment of the present application includes:
s210: and acquiring a first feature vector of data to be identified through the first network model, wherein the data to be identified is data of an image mode or a voice mode.
S220: and acquiring respective second feature vectors of the plurality of candidate scenes from a local scene vector database, wherein the second feature vectors in the scene vector database are obtained by converting a second network model in advance, and the scene description data are data of a text mode.
In the embodiment of the present application, "local" may be understood as referring to the apparatus that executes the scene recognition method provided in the embodiment of the present application. For example, if the scene recognition method is performed by an electronic device, then the local scene vector database may be understood as being stored in the electronic device. Storing the scene vector database, which records the second feature vectors of the candidate scenes, locally allows the second feature vectors of the candidate scenes to be acquired more quickly.
Alternatively, the first network model may be deployed locally. In this case, the first network model may be deployed together with the scene vector database into the electronic device that performs the scene recognition method.
It should be noted that, after the scene vector database is deployed, there may be a case where candidate scenes need to be added. In this case, the scene description data of the candidate scene to be added may be acquired in response to the scene adding instruction, and then the second feature vector of the scene description data of the candidate scene to be added is obtained through the second network model, and is used as the second feature vector to be added, and the second feature vector to be added is added to the scene vector database.
Alternatively, the scene recognition method is applied to the electronic device as an example. As shown in fig. 6, the scenario shown in fig. 6 includes an electronic device 100 and a server 200. The scene recognition method provided in the embodiment of the present application may be executed by the electronic device 100. In this case, the first network model and the scene vector database may be deployed together into the electronic device 100. If a candidate scene to be added is available, the server 200 may acquire the candidate scene to be added, and obtain a second feature vector corresponding to the candidate scene to be added through a second network model in the server 200, as the second feature vector to be added. The server 200 may then transmit the second feature vector to be added to the electronic device 100 so that the electronic device 100 adds the second feature vector to be added to the local scene vector database.
For example, the existing candidate scenes may include candidate scene C1, candidate scene C2, and candidate scene C3, in which case, in the local scene vector database, the second feature vectors of each of candidate scene C1, candidate scene C2, and candidate scene C3 are stored. The device performing the scene recognition method can recognize only the candidate scene C1, the candidate scene C2, and the candidate scene C3. In this case, if it is necessary to add the candidate scene C4, the second feature vector of the candidate scene C4 may be added to the scene vector database, so that the apparatus performing the scene recognition method can also recognize the candidate scene C4.
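As an illustration only, adding a candidate scene in this way can be sketched as follows; the file layout follows the earlier database sketch and the `text_encoder` callable stands in for the second network model, both being assumptions made for the example.

```python
# Hedged sketch: expand the recognizable scenes by appending only the second
# feature vector of the new candidate scene to the local scene vector database.
import numpy as np

def add_candidate_scene(text_encoder, name: str, description: str,
                        path: str = "scene_vectors.npz") -> None:
    db = np.load(path, allow_pickle=True)
    new_vec = np.asarray(text_encoder(description))        # second feature vector to be added
    names = np.append(db["names"], name)
    vectors = np.vstack([db["vectors"], new_vec[None, :]])
    np.savez(path, names=names, vectors=vectors)
```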
It should be noted that, there may be an inaccurate case for the scene description data of the scene, or there may be a case that the second feature vector converted from the model (for example, the second network model) cannot accurately represent the scene. To improve this, scene content of candidate scenes to be added and corresponding initial scene description data may be acquired. The scene content is of an image mode or a voice mode, and the initial scene description data comprises a first data part and a second data part, wherein the first data part can be updated in a learning mode.
Then, updating the first data part in the initial scene description data based on the first network model, the second network model and the scene content to obtain updated scene description data, wherein in the process of updating the first data part, network parameters of the first network model and the second network model remain unchanged, and the updated scene description data is used as scene description data of a candidate scene to be added.
Therefore, the scene description data of the candidate scenes to be added can be obtained by the mode of updating the initial scene description data, so that the obtained scene description data of the candidate scenes to be added can be expressed more accurately.
S230: and taking the second feature vector, of the second feature vectors of the candidate scenes, of which the similarity with the first feature vector meets the target similarity condition, as a target feature vector.
S240: and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
According to the scene recognition method provided by the embodiment, when a new candidate scene (scene which can be recognized) needs to be added, only scene description data corresponding to the new candidate scene needs to be added, the second network model is used for conversion to obtain a corresponding second feature vector, and the second feature vector is stored in the scene vector database, so that the scene which can be recognized can be expanded more simply and conveniently. In addition, when the scene recognition method provided by the embodiment of the application runs in the electronic device, the function is upgraded (for example, more scenes can be recognized) without updating the whole model (for example, the first network model and the second network model), but only the scene vector database is required to be updated, so that the data update in the upgrading process is small, and the speed and the efficiency are high.
Referring to fig. 7, a scene recognition method provided in an embodiment of the present application includes:
s310: a first training data set is obtained, wherein the first training data set comprises a plurality of first training samples and respective scene description data of the first training samples, and the first training samples are samples of an image mode or a voice mode.
In the present embodiment, the first training data set may be understood as a data set for training the model.
As one way, the first training data set may be obtained from the internet. It should be noted that there is a large amount of multimodal related data on the internet, such as pictures and their associated text descriptions (e.g., pictures of clothes sold on shopping websites together with text descriptions of their attributes), or voices (e.g., audio/video resources) and their corresponding text descriptions. Therefore, a large amount of semantically corresponding data can be collected at very low cost as a multimodal data set to serve as the first training data set. Such data has a natural self-labeling property and can be secondarily annotated at a later stage, so that the cost is reduced.
Alternatively, the first training data set may be completed by manual annotation. In this manner, a plurality of first training samples (pictures or audio) may be obtained first, and then corresponding scene description data may be configured for each first training sample manually, so as to obtain a first training data set.
It should be noted that the first training sample in the first training data set may be a training sample of an image mode or a training sample of a voice mode. The mode of the first training sample in the first training data set may be determined according to an input mode of the first network model to be trained. If the input mode of the first network model is an image mode, the first network model may be understood as an image encoder, and thus the mode of the first training sample in the first training data set may be an image mode. If the input mode of the first network model is a speech mode, the first network model may be understood as a speech encoder, and thus the mode of the first training sample in the first training data set may be a speech mode.
S320: and training the first network model to be trained and the second network model to be trained through the first training data set so as to obtain the first network model and the second network model.
In the embodiment of the application, the first network model to be trained and the second network model to be trained can be trained by a training method of contrast learning. In this case, the process of model training can be understood as a training process of inter-modality semantic alignment by contrast learning training methods.
As one way, the process of training the first network model to be trained and the second network model to be trained to obtain the first network model and the second network model may include: and acquiring respective third feature vectors of the plurality of first training samples through the first network model to be trained so as to obtain a plurality of third feature vectors. And acquiring fourth feature vectors of the scene description data of each of the plurality of first training samples through the second network model to be trained so as to obtain a plurality of fourth feature vectors. And training the first network model to be trained to obtain a first network model through the third feature vectors and the fourth feature vectors, and training the second network model to be trained to obtain a second network model.
The first network model and the second network model are obtained through training, so that the similarity between the third feature vector corresponding to each first training sample and the fourth feature vector of the respective scene description data is larger than the similarity between the third feature vector and the fourth feature vector of other scene description data.
The first training dataset may include N images and their corresponding text descriptions, organized as paired data. The N images may be understood as the first training samples, and the text descriptions corresponding to the images may be understood as the scene description data corresponding to the first training samples. The contrastive training process based on this example may be as shown in fig. 8: for the N images, the feature vectors I_1...I_N are calculated by the first network model to be trained, and for the text descriptions corresponding to the N images, the corresponding feature vectors T_1...T_N are calculated by the second network model to be trained. The first network model to be trained can be understood as an image encoder to be trained, and the second network model to be trained can be understood as a text encoder to be trained. The feature vectors I_1...I_N can be understood as the plurality of third feature vectors, and correspondingly the feature vectors T_1...T_N can be understood as the plurality of fourth feature vectors.
Then, the N text feature vectors (i.e., T_1...T_N) and the N image feature vectors (i.e., I_1...I_N) are combined two by two, and the similarity of the N² possible text-image pairs is calculated; this similarity can be obtained directly by calculating the cosine similarity of the text feature vector and the image feature vector. In the example shown in fig. 8 there are N positive samples in total, i.e., the texts and images that truly belong to a matching pair (the diagonal elements in the matrix in fig. 8), while the remaining N²-N unmatched text-image pairs are negative samples. The training goal of the contrast learning is to maximize the similarity of the N positive samples while minimizing the similarity of the N²-N negative samples. Through this contrast learning method, the image semantics and the text description semantics can be aligned.
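As an illustration only, the contrastive objective described above can be sketched as follows; the temperature value and the symmetric cross-entropy formulation are common assumptions for this kind of training rather than details taken from the application.

```python
# Hedged sketch: the N matched image-text pairs (the diagonal of the N x N similarity
# matrix) are positives, the remaining N^2 - N pairs are negatives, and a symmetric
# cross-entropy pulls matched pairs together while pushing unmatched pairs apart.
import torch
import torch.nn.functional as F

def contrastive_loss(image_feats: torch.Tensor, text_feats: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    image_feats = F.normalize(image_feats, dim=-1)         # (N, d) third feature vectors
    text_feats = F.normalize(text_feats, dim=-1)           # (N, d) fourth feature vectors
    logits = image_feats @ text_feats.t() / temperature    # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```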
It should be noted that fig. 8 is an exemplary description of a training process for semantically aligning the image modality with the text modality; the process of semantically aligning the speech modality with the text modality in the embodiment of the present application may also refer to the training process shown in fig. 8. In addition, in the embodiment of the application, corresponding models can be trained for the image modality and the voice modality respectively. For example, as shown in fig. 9, for the image modality, a CNN network may be trained to obtain an image encoder, and a Transformer network may be trained synchronously to obtain a text encoder, so that the image modality is semantically aligned with the text modality, and the feature vectors (e.g., the first feature vector and the second feature vector) produced by the trained image encoder and text encoder may be N-dimensional feature vectors. Similarly, for the speech modality, a Mel-spectrum CNN network may be trained to obtain a speech encoder, and a Transformer network may be trained synchronously to obtain a text encoder, so that the speech modality is semantically aligned with the text modality, and the feature vectors converted by the trained speech encoder and text encoder may likewise be N-dimensional feature vectors.
S330: and acquiring a first feature vector of data to be identified through the first network model, wherein the data to be identified is data of an image mode or a voice mode.
S340: and acquiring respective second feature vectors of the plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode.
S350: and taking the second feature vector, of the second feature vectors of the candidate scenes, of which the similarity with the first feature vector meets the target similarity condition, as a target feature vector.
S360: and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
The scene recognition method provided by the embodiment realizes simpler and more convenient scene expansion and recognition. In this embodiment, the first network model to be trained is trained to obtain a first network model, and the second network model to be trained is trained to obtain a second network model, so that the first network model and the second network model can convert the input of the corresponding modes into the same vector space, and after the first feature vector corresponding to the data to be identified is obtained through the first network model, the scene to which the data to be identified belongs can be determined based on the similarity between the first feature vector and the second feature vector of each of the plurality of candidate scenes.
Referring to fig. 10, a scene recognition method provided in an embodiment of the present application includes:
s410: and acquiring a second training data set, wherein the second training data set comprises a plurality of second training samples and initial scene description data corresponding to the second training samples, the initial scene description data comprises a first data part and a second data part, the first data part is positioned in front of the second data part, and the first data part can be updated in a learning mode.
As one way, the second training data set is selected from the first training data set, and the first training data set is used for training the first network model to be trained and the second network model to be trained so as to obtain the first network model and the second network model.
S420: and updating the first data part in the initial scene description data based on the first network model, the second network model and a contrast learning manner to obtain updated scene description data.
The process of updating the first data portion in the initial scene description data may be understood as a process of scene recognition optimization and incremental training. In this embodiment, the first network model and the second network model may be understood as the models obtained by training the first network model to be trained and the second network model to be trained through the aforementioned first training data set, and the network parameters of the first network model and the second network model remain unchanged during the updating of the first data portion. As shown in fig. 11, in the model optimization and incremental training process, M learnable text words (a learnable context) are added before the fixed scene text description to obtain the initial scene description data, so the initial scene description data corresponding to the scene (the scene represented by the second training sample) may be divided into two parts, namely a learnable part and a fixed part. The learnable part is the first data part, and the fixed part is the second data part. The fixed part adopts a unified template (such as "a picture of an airplane", "a picture of a butterfly", and the like), while the learnable part is randomly initialized (to XXX), and the learnable text word vectors are iteratively updated along with training of the model so as to learn the most accurate text expression of the scene.
In the model optimization and incremental training process, a contrast learning method can be continuously adopted, and in the training process, the trainable scene text description part can automatically learn the most accurate language description of the scene. The learned scene description text (learnable context) plus the fixed portion scene description text is finally used as final scene description data.
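As an illustration only, updating only the learnable part can be sketched as follows; the number of context vectors, the embedding dimension, the loss and the optimizer step are assumptions made for the example, and `text_encoder` is assumed to accept token embeddings directly.

```python
# Hedged sketch: M learnable context vectors are prepended to the embedding of the
# fixed scene description; the encoders stay frozen and only the context is updated.
import torch
import torch.nn.functional as F

class LearnableSceneDescription(torch.nn.Module):
    def __init__(self, m_context: int = 4, dim: int = 512):
        super().__init__()
        self.context = torch.nn.Parameter(torch.randn(m_context, dim) * 0.02)

    def forward(self, fixed_part_embedding: torch.Tensor) -> torch.Tensor:
        # [learnable first data part | fixed second data part]
        return torch.cat([self.context, fixed_part_embedding], dim=0)

def incremental_step(prompt, text_encoder, image_feat, fixed_embed, optimizer):
    for p in text_encoder.parameters():
        p.requires_grad_(False)                   # network parameters remain unchanged
    text_feat = text_encoder(prompt(fixed_embed))
    loss = 1.0 - F.cosine_similarity(image_feat, text_feat, dim=-1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```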
As one way, because the model optimization and incremental training process uses the second training data set, which is composed of partial samples selected from the full data set (e.g., the first training data set), the training process is very fast and consumes very little computation time, and the training cost is relatively low because no training is performed on the full data set.
For example, as shown in fig. 12, in the case where scene recognition optimization and incremental training are performed by the server, the server may transmit the resulting updated scene description data to the electronic device after the incremental training is completed.
S430: and taking the scene included in the second training sample as a candidate scene, and taking updated scene description data corresponding to the second training sample as scene description data of the candidate scene.
S440: and acquiring a first feature vector of data to be identified through the first network model, wherein the data to be identified is data of an image mode or a voice mode.
S450: and acquiring respective second feature vectors of the plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode.
S460: taking the second feature vector, of the second feature vectors of the candidate scenes, of which the similarity with the first feature vector meets the target similarity condition, as a target feature vector;
s470: and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
The scene recognition method provided by the embodiment realizes simpler and more convenient scene expansion and recognition. In addition, in this embodiment, there may be one initial scene description data corresponding to each candidate scene, and the initial scene description data may include a first data portion that may be trained, so that the first data portion may be updated by a training manner, so that the scene description data obtained by final training may describe the candidate scene more accurately.
Referring to fig. 13, a model training method provided in an embodiment of the present application includes:
s510: a first training data set is obtained, wherein the first training data set comprises a plurality of first training samples and respective scene description data of the first training samples, and the first training samples are samples of an image mode or a voice mode.
S520: training a first network model to be trained and a second network model to be trained through the first training data set to obtain the first network model and the second network model, wherein the second network model is used for converting scene description data of a plurality of candidate scenes to obtain a plurality of second feature vectors, and the first network model is used for acquiring the first feature vector of the data to be identified, so that the scene corresponding to the data to be identified is acquired by acquiring the second feature vector whose similarity with the first feature vector meets the target similarity condition.
Referring to fig. 14, in an embodiment of the present application, a scene recognition device 600 is provided, where the device 600 includes:
the to-be-identified data processing unit 610 is configured to obtain, through the first network model, a first feature vector of to-be-identified data, where the to-be-identified data is data of an image mode or a voice mode.
The candidate scene obtaining unit 620 is configured to obtain respective second feature vectors of the plurality of candidate scenes, where the second feature vectors are obtained by converting scene description data of the candidate scenes through the second network model, and the scene description data is data of a text modality.
The vector comparison unit 630 is configured to take, as the target feature vector, a second feature vector, of the second feature vectors of the plurality of candidate scenes, whose similarity with the first feature vector satisfies a target similarity condition.
The scene acquisition unit 640 is configured to use a candidate scene corresponding to the target feature vector as a scene corresponding to the data to be identified.
As one way, the candidate scene obtaining unit 620 is specifically configured to obtain, from a local scene vector database, respective second feature vectors of the plurality of candidate scenes, where the second feature vectors in the scene vector database are obtained by converting in advance by the second network model.
Optionally, the candidate scene obtaining unit 620 is further configured to obtain scene description data of a candidate scene to be added in response to a scene adding instruction; obtaining second feature vectors of scene description data of the candidate scene to be added through a second network model, wherein the second feature vectors are used as the second feature vectors to be added; the second feature vector to be added is added to the scene vector database.
Optionally, the candidate scene obtaining unit 620 is specifically configured to obtain scene content of a candidate scene to be added and corresponding initial scene description data, where the scene content is content of an image mode or a voice mode, and the initial scene description data includes a first data portion and a second data portion, and the first data portion is updatable in a learning manner; updating a first data part in the initial scene description data based on the first network model, the second network model and scene content to obtain updated scene description data, wherein network parameters of the first network model and the second network model remain unchanged in the process of updating the first data part; and taking the updated scene description data as scene description data of candidate scenes to be added.
Optionally, as shown in fig. 15, the apparatus 600 further includes:
the training unit 650 is configured to obtain a first training data set, where the first training data set includes a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a voice mode; and training the first network model to be trained and the second network model to be trained through the first training data set so as to obtain the first network model and the second network model.
Optionally, the training unit 650 is specifically configured to obtain, through the first network model to be trained, a third feature vector of each of the plurality of first training samples, so as to obtain a plurality of third feature vectors; acquiring fourth feature vectors of respective scene description data of a plurality of first training samples through a second network model to be trained so as to obtain a plurality of fourth feature vectors; training a first network model to be trained to obtain a first network model and training a second network model to be trained to obtain a second network model through a plurality of third feature vectors and a plurality of fourth feature vectors, wherein the first network model and the second network model obtained through training enable the similarity between the third feature vector corresponding to each first training sample and the fourth feature vector of the respective scene description data to be larger than the similarity between the third feature vector corresponding to each first training sample and the fourth feature vector of other scene description data.
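One common way to realize this training objective is a contrastive loss over matched sample/description pairs. The following sketch is an assumed, CLIP-style formulation rather than the embodiment's exact loss:

```python
# Hedged, CLIP-style contrastive objective: the third feature vector of each
# first training sample should be more similar to the fourth feature vector of
# its own scene description than to those of other descriptions.
import torch
import torch.nn.functional as F

def contrastive_loss(third_vectors, fourth_vectors, temperature=0.07):
    """third_vectors: (B, D) from the first network model to be trained;
    fourth_vectors: (B, D) from the second network model to be trained,
    where row i of both matrices comes from the same first training sample."""
    a = F.normalize(third_vectors, dim=-1)
    b = F.normalize(fourth_vectors, dim=-1)
    logits = a @ b.t() / temperature                     # pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)   # matched pairs on the diagonal
    # Symmetric cross-entropy pulls matched pairs together and pushes
    # mismatched pairs apart.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```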
Optionally, the training unit 650 is further configured to obtain a second training data set, where the second training data set includes a plurality of second training samples, and initial scene description data corresponding to each of the plurality of second training samples, where the initial scene description data includes a first data portion and a second data portion, and the first data portion is located before the second data portion, and the first data portion is updatable by a learning manner; updating a first data part in the initial scene description data based on the first network model, the second network model and the comparison learning mode to obtain updated scene description data, wherein network parameters of the first network model and the second network model are kept unchanged in the process of updating the first data part; and taking the scene included in the second training sample as a candidate scene, and taking updated scene description data corresponding to the second training sample as scene description data of the candidate scene.
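For the second training data set, the same prompt-update idea can be applied per candidate scene. The sketch below is hypothetical and reuses the learn_first_data_part helper sketched earlier; the data-structure choices are assumptions:

```python
# Hypothetical usage over a second training data set: each candidate scene's
# small sample set is paired with its initial scene description data, and only
# the first data part is updated per scene (both models stay frozen).
def build_scene_descriptions(second_training_set: dict, second_data_part_embeddings: dict,
                             first_network_model, second_network_model) -> dict:
    learned = {}
    for scene_name, scene_contents in second_training_set.items():
        learned[scene_name] = learn_first_data_part(
            scene_contents, second_data_part_embeddings[scene_name],
            first_network_model, second_network_model)
    return learned
```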
Referring to fig. 16, in an embodiment of the present application, a model training apparatus 700 is provided, where the apparatus 700 includes:
the training data obtaining unit 710 is configured to obtain a first training data set, where the first training data set includes a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a speech mode.
The training unit 720 is configured to train the first network model to be trained and the second network model to be trained through the first training data set to obtain a first network model and a second network model, where the second network model is configured to convert scene description data of a plurality of candidate scenes to obtain a plurality of second feature vectors, and the first network model is configured to obtain a first feature vector of the data to be identified so as to obtain a scene corresponding to the data to be identified by obtaining a second feature vector with similarity to the first feature vector meeting a target similarity condition.
It should be noted that, in the present application, the device embodiments correspond to the foregoing method embodiments; for the specific principles of the device embodiments, reference may be made to the content of the foregoing method embodiments, which is not repeated herein.
An electronic device provided in the present application will be described with reference to fig. 17.
Referring to fig. 17, based on the foregoing scene recognition method, model training method, and apparatus, an embodiment of the present application further provides an electronic device 100 capable of executing the foregoing scene recognition method or model training method. The electronic device 100 includes one or more (only one is shown) processors 102, a memory 104, and a network module 106 coupled to one another. The memory 104 stores a program capable of executing the contents of the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104.
Wherein the processor 102 may include one or more processing cores. The processor 102 uses various interfaces and lines to connect various parts of the overall electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 104 and by invoking data stored in the memory 104. Alternatively, the processor 102 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA) form. The processor 102 may integrate one or a combination of several of a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 102 and may instead be implemented by a single communication chip alone.
The memory 104 may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). The memory 104 may be used to store instructions, programs, code sets, or instruction sets. The memory 104 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The stored data area may also store data created by the electronic device 100 in use (such as a phonebook, audio and video data, and chat record data).
The network module 106 is configured to receive and transmit electromagnetic waves and to convert between electromagnetic waves and electrical signals, so as to communicate with a communication network or with other devices, such as an audio playing device. The network module 106 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a subscriber identity module (SIM) card, memory, and the like. The network module 106 may communicate with various networks, such as the Internet, an intranet, or a wireless network, or may communicate with other devices through a wireless network. The wireless network may include a cellular telephone network, a wireless local area network, or a metropolitan area network. For example, the network module 106 may interact with base stations.
Referring to fig. 18, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable medium 800 has stored therein program code which can be invoked by a processor to perform the methods described in the method embodiments described above.
The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 comprises a non-volatile computer readable medium (a non-transitory computer-readable storage medium). The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 810 may, for example, be compressed in a suitable form.
In summary, according to the scene recognition method, model training method, device, and electronic equipment provided by the application, after data to be identified in an image mode or a voice mode is obtained, a first feature vector of the data to be identified can be obtained through the first network model, respective second feature vectors of a plurality of candidate scenes can then be obtained, the second feature vector most similar to the first feature vector is taken as a target feature vector, and the candidate scene corresponding to the target feature vector is taken as the scene corresponding to the data to be identified. Therefore, with the second feature vectors of the plurality of candidate scenes already obtained through the second network model, the data to be identified can be converted into the corresponding first feature vector through the first network model once it is acquired, and the scene to which the data to be identified belongs can be determined from the similarity between the first feature vector and the plurality of second feature vectors. When a new candidate scene (a scene that can be identified) needs to be added, only the scene description data corresponding to the new candidate scene needs to be added and converted by the second network model into a second feature vector, after which the new candidate scene can be identified, so the set of recognizable scenes is expanded more simply and conveniently.
By semantically aligning the modes (image, voice, and text), inputs of different modes are mapped into a unified embedding space to obtain feature vectors (for example, the first feature vector and the second feature vector), thereby achieving semantic alignment. Scene classification is performed by computing the similarity (for example, the cosine distance) between the feature vector of the input mode (image or voice) and the feature vector (second feature vector) of the text description (scene description data) of the scene to be recognized. Inputs of multiple modes (such as image and voice) are thus supported, and the expansibility is good: when more recognition scenes need to be added, only the text descriptions corresponding to those scenes are required.
In addition, the application also provides a scene recognition optimization and incremental training scheme based on small-sample (few-shot) learning. The user only needs to collect a small amount of data (such as images and voices) of the scene to be recognized to complete incremental training. In the incremental training stage, the most accurate text description of the scene to be identified can be learned. The data, computing power, and time required for the whole incremental training stage are very small, so expanding the variety of recognizable scenes is simple and efficient.
Also, once the end-side models (e.g., the first network model and the second network model) are deployed, no subsequent changes to them are required. When the variety of recognizable scenes needs to be increased, only the second feature vector corresponding to the scene description data of the new scene is trained on the server and then pushed through the cloud to the end-side scene vector database, which completes the incremental function update. The whole function expansion and update process is very efficient, and the amount of data transmitted is very small. Therefore, a function upgrade does not require updating the whole model but only the scene vector database, achieving smaller data updates with high speed and efficiency.
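The incremental update can therefore be very small. A hedged illustration of such a cloud-to-end-side payload is given below; the JSON field names and the transport format are assumptions, not part of the embodiment:

```python
# Illustrative (assumed) shape of an incremental update: only the new scene's
# identifier, description, and second feature vector travel from the cloud to
# the end-side scene vector database; no model weights are transferred.
import json

def build_scene_update_payload(scene_id: str, scene_description: str,
                               second_feature_vector) -> str:
    payload = {
        "scene_id": scene_id,
        "scene_description": scene_description,
        "second_feature_vector": [float(x) for x in second_feature_vector],
    }
    return json.dumps(payload)

def apply_scene_update(payload: str, scene_vector_database: dict) -> None:
    update = json.loads(payload)
    scene_vector_database[update["scene_id"]] = update["second_feature_vector"]
```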
Finally, it should be noted that: the above embodiments are only for illustrating the technical solutions of the present application and are not limiting thereof; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments can still be modified, or some of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (14)

1. A method of scene recognition, the method comprising:
acquiring a first feature vector of data to be identified through a first network model, wherein the data to be identified is data of an image mode or a voice mode;
acquiring respective second feature vectors of a plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode;
taking a second feature vector, of the second feature vectors of each of the plurality of candidate scenes, of which the similarity with the first feature vector meets a target similarity condition, as a target feature vector;
and taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
2. The method of claim 1, wherein the obtaining a second feature vector for each of the plurality of candidate scenes comprises:
and obtaining respective second feature vectors of a plurality of candidate scenes from a local scene vector database, wherein the second feature vectors in the scene vector database are obtained by converting the second network model in advance.
3. The method according to claim 2, wherein the method further comprises:
responding to a scene adding instruction, and acquiring scene description data of candidate scenes to be added;
obtaining a second feature vector of the scene description data of the candidate scene to be added as a second feature vector to be added through the second network model;
and adding the second feature vector to be added to the scene vector database.
4. A method according to claim 3, wherein the obtaining scene description data of candidate scenes to be added comprises:
acquiring scene contents of candidate scenes to be added and corresponding initial scene description data, wherein the scene contents are contents of image modes or voice modes, and the initial scene description data comprise a first data part and a second data part, and the first data part can be updated in a learning mode;
updating a first data part in the initial scene description data based on the first network model, the second network model and the scene content to obtain updated scene description data, wherein network parameters of the first network model and the second network model remain unchanged in the process of updating the first data part;
and taking the updated scene description data as scene description data of the candidate scene to be added.
5. The method of claim 1, wherein before the acquiring of the data to be identified, the method further comprises:
acquiring a first training data set, wherein the first training data set comprises a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a voice mode;
and training the first network model to be trained and the second network model to be trained through the first training data set so as to obtain the first network model and the second network model.
6. The method of claim 5, wherein training the first network model to be trained and the second network model to be trained via the first training data set to obtain the first network model and the second network model comprises:
acquiring respective third feature vectors of the plurality of first training samples through a first network model to be trained so as to obtain a plurality of third feature vectors;
acquiring fourth feature vectors of the scene description data of each of the plurality of first training samples through a second network model to be trained so as to obtain a plurality of fourth feature vectors;
and training the first network model to be trained to obtain a first network model and training the second network model to be trained to obtain a second network model through the plurality of third feature vectors and the plurality of fourth feature vectors, wherein the first network model and the second network model obtained through training enable the similarity between the third feature vector corresponding to each first training sample and the fourth feature vector of the respective scene description data to be larger than the similarity between the third feature vector corresponding to each first training sample and the fourth feature vector of other scene description data.
7. The method of claim 1, wherein before the acquiring of the data to be identified, the method further comprises:
acquiring a second training data set, wherein the second training data set comprises a plurality of second training samples and initial scene description data corresponding to the second training samples, the initial scene description data comprises a first data part and a second data part, the first data part is positioned in front of the second data part, and the first data part can be updated in a learning mode;
updating a first data part in the initial scene description data based on the first network model, the second network model and a comparison learning mode to obtain updated scene description data, wherein network parameters of the first network model and the second network model remain unchanged in the process of updating the first data part;
and taking the scene included in the second training sample as the candidate scene, and taking updated scene description data corresponding to the second training sample as the scene description data of the candidate scene.
8. The method of claim 7, wherein the second training data set is selected from the first training data set, and wherein the first training data set is used to train a first network model to be trained and a second network model to be trained to obtain a first network model and a second network model.
9. The method of any one of claims 1-8, wherein the target similarity condition comprises:
a second feature vector having a similarity to the first feature vector greater than a specified similarity threshold;
or a second feature vector having the greatest similarity with the first feature vector.
10. A method of model training, the method comprising:
acquiring a first training data set, wherein the first training data set comprises a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a voice mode;
training a first network model to be trained and a second network model to be trained through the first training data set to obtain a first network model and a second network model, wherein the second network model is used for converting scene description data of a plurality of candidate scenes to obtain a plurality of second feature vectors, and the first network model is used for acquiring a first feature vector of data to be identified, so as to acquire the scene corresponding to the data to be identified by acquiring the second feature vector whose similarity with the first feature vector satisfies a target similarity condition.
11. A scene recognition device, the device comprising:
the to-be-identified data processing unit is used for acquiring a first feature vector of data to be identified through a first network model, wherein the data to be identified is data of an image mode or a voice mode;
the candidate scene acquisition unit is used for acquiring second feature vectors of each of a plurality of candidate scenes, wherein the second feature vectors are obtained by converting scene description data of the candidate scenes through a second network model, and the scene description data are data of a text mode;
the vector comparison unit is used for taking a second feature vector, of the second feature vectors of the candidate scenes, of which the similarity with the first feature vector meets the target similarity condition, as a target feature vector;
and the scene acquisition unit is used for taking the candidate scene corresponding to the target feature vector as the scene corresponding to the data to be identified.
12. A model training apparatus, the apparatus comprising:
the training data acquisition unit is used for acquiring a first training data set, wherein the first training data set comprises a plurality of first training samples and scene description data of each of the plurality of first training samples, and the first training samples are samples of an image mode or a voice mode;
the training unit is used for training a first network model to be trained and a second network model to be trained through the first training data set to obtain a first network model and a second network model, wherein the second network model is used for converting scene description data of a plurality of candidate scenes to obtain a plurality of second feature vectors, and the first network model is used for acquiring a first feature vector of data to be identified, so as to acquire the scene corresponding to the data to be identified by acquiring the second feature vector whose similarity with the first feature vector satisfies the target similarity condition.
13. An electronic device comprising a processor and a memory; one or more programs are stored in the memory and configured to be executed by the processor to implement the method of any of claims 1-9 or the method of claim 10.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, wherein the program code, when being executed by a processor, performs the method of any of claims 1-9 or the method of claim 10.
CN202311835354.3A 2023-12-27 2023-12-27 Scene recognition method, model training method, device and electronic equipment Pending CN117789103A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311835354.3A CN117789103A (en) 2023-12-27 2023-12-27 Scene recognition method, model training method, device and electronic equipment

Publications (1)

Publication Number Publication Date
CN117789103A (en) 2024-03-29

Family

ID=90385179

Country Status (1)

Country Link
CN (1) CN117789103A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination