CN111797873A - Scene recognition method and device, storage medium and electronic equipment

Info

Publication number
CN111797873A
Authority
CN
China
Prior art keywords
preset
scene
feature vector
data
feature
Prior art date
Legal status
Pending
Application number
CN201910282441.8A
Other languages
Chinese (zh)
Inventor
何明
陈仲铭
李姬俊男
刘耀勇
陈岩
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN201910282441.8A
Publication of CN111797873A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04M TELEPHONIC COMMUNICATION
    • H04M2250/00 Details of telephonic subscriber devices
    • H04M2250/12 Details of telephonic subscriber devices including a sensor for measuring a physical value, e.g. temperature or motion


Abstract

An embodiment of the application provides a scene recognition method and apparatus, a storage medium, and an electronic device. The scene recognition method includes: acquiring perception data of a current scene; acquiring a feature vector according to the perception data; sequentially calculating the similarity between the feature vector and each of a plurality of preset feature vectors to obtain a plurality of similarity values; determining a target feature vector from the preset feature vectors according to the similarity values; and determining the preset scene corresponding to the target feature vector as the current scene. With this method, the electronic device can obtain the feature vector of the current scene from the perception data of the current scene, determine the target feature vector from the similarity values between that feature vector and the preset feature vectors, and determine the preset scene corresponding to the target feature vector as the current scene, thereby recognizing the current scene and making it convenient for the electronic device to perform intelligent operations for it.

Description

Scene recognition method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of electronic technologies, and in particular, to a scene recognition method and apparatus, a storage medium, and an electronic device.
Background
With the development of electronic technology, electronic devices such as smart phones are capable of providing more and more services to users. For example, the electronic device may provide social services, navigation services, travel recommendation services, and the like for the user. In order to provide targeted and personalized services for users, the electronic device needs to identify the scene where the user is located, and then provide the services for the user based on the identified scene.
Disclosure of Invention
The embodiments of the application provide a scene recognition method and apparatus, a storage medium, and an electronic device, which enable an electronic device to recognize its current scene.
The embodiment of the application provides a scene identification method, which comprises the following steps:
acquiring perception data of a current scene;
acquiring a feature vector of the current scene according to the perception data;
sequentially calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene;
determining a target feature vector from the preset feature vectors according to the similarity values;
and determining the preset scene corresponding to the target feature vector as the current scene.
An embodiment of the present application further provides a scene recognition apparatus, including:
the first acquisition module is used for acquiring the perception data of the current scene;
the second acquisition module is used for acquiring the feature vector of the current scene according to the perception data;
the calculation module is used for calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors in sequence to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene;
a first determining module, configured to determine a target feature vector from the plurality of preset feature vectors according to the plurality of similarity values;
and the second determining module is used for determining the preset scene corresponding to the target feature vector as the current scene.
An embodiment of the present application further provides a storage medium storing a computer program which, when run on a computer, causes the computer to execute the above scene recognition method.
The embodiment of the present application further provides an electronic device, which includes a processor and a memory, where the memory stores a computer program, and the processor is configured to execute the scene recognition method by calling the computer program stored in the memory.
The scene recognition method provided by the embodiment of the application includes: acquiring perception data of a current scene; acquiring a feature vector of the current scene according to the perception data; sequentially calculating the similarity between the feature vector and each of a plurality of preset feature vectors to obtain a plurality of similarity values, where each preset feature vector corresponds to one preset scene; determining a target feature vector from the preset feature vectors according to the similarity values; and determining the preset scene corresponding to the target feature vector as the current scene. In this method, the electronic device can obtain the feature vector of the current scene from its perception data, determine the target feature vector from the similarity values between the feature vector and the preset feature vectors, and determine the preset scene corresponding to the target feature vector as the current scene, thereby recognizing the current scene and making it convenient for the electronic device to perform intelligent operations for it.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments will be briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
Fig. 1 is a schematic view of an application scenario of a scenario identification method according to an embodiment of the present application.
Fig. 2 is a schematic flowchart of a first method for scene recognition according to an embodiment of the present disclosure.
Fig. 3 is a schematic flowchart of a second method for scene recognition according to an embodiment of the present disclosure.
Fig. 4 is a third flowchart illustrating a scene recognition method according to an embodiment of the present application.
Fig. 5 is a fourth flowchart illustrating a scene recognition method according to an embodiment of the present application.
Fig. 6 is a fifth flowchart illustrating a scene recognition method according to an embodiment of the present application.
Fig. 7 is a sixth flowchart illustrating a scene recognition method according to an embodiment of the present application.
Fig. 8 is a schematic structural diagram of a first scene recognition device according to an embodiment of the present application.
Fig. 9 is a schematic structural diagram of a second scene recognition device according to an embodiment of the present application.
Fig. 10 is a schematic structural diagram of a first electronic device according to an embodiment of the present application.
Fig. 11 is a second structural schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without inventive step, are within the scope of the present application.
Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a scenario identification method provided in an embodiment of the present application. The scene recognition method is applied to electronic equipment. A panoramic perception framework is arranged in the electronic equipment. The panoramic perception architecture is an integration of hardware and software for implementing the scene recognition method in an electronic device.
The panoramic perception architecture comprises an information perception layer, a data processing layer, a feature extraction layer, a scene modeling layer and an intelligent service layer.
The information perception layer is used for acquiring information about the electronic device itself or about the external environment. The information perception layer may include a plurality of sensors, for example a distance sensor, a magnetic field sensor, a light sensor, an acceleration sensor, a fingerprint sensor, a Hall sensor, a position sensor, a gyroscope, an inertial sensor, an attitude sensor, an image sensor, and an audio sensor.
Among other things, a distance sensor may be used to detect a distance between the electronic device and an external object. The magnetic field sensor may be used to detect magnetic field information of the environment in which the electronic device is located. The light sensor can be used for detecting light information of the environment where the electronic equipment is located. The acceleration sensor may be used to detect acceleration data of the electronic device. The fingerprint sensor may be used to collect fingerprint information of a user. The Hall sensor is a magnetic field sensor manufactured according to the Hall effect, and can be used for realizing automatic control of electronic equipment. The location sensor may be used to detect the geographic location where the electronic device is currently located. Gyroscopes may be used to detect angular velocity of an electronic device in various directions. Inertial sensors may be used to detect motion data of an electronic device. The gesture sensor may be used to sense gesture information of the electronic device. An image sensor, which may be, for example, a camera, may be used to capture images of the surrounding environment. An audio sensor, which may be a microphone, for example, may be used to capture sound signals in the surrounding environment.
And the data processing layer is used for processing the data acquired by the information perception layer. For example, the data processing layer may perform data cleaning, data integration, data transformation, data reduction, and the like on the data acquired by the information sensing layer.
Data cleaning refers to cleaning the large amount of data acquired by the information perception layer to remove invalid data and repeated data. Data integration refers to integrating multiple single-dimensional data acquired by the information perception layer into a higher or more abstract dimension, so that data of multiple single dimensions can be processed comprehensively. Data transformation refers to converting the type or format of the data acquired by the information perception layer so that the transformed data meet the processing requirements. Data reduction refers to reducing the data volume as much as possible while preserving the original character of the data.
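As a rough illustration of the data cleaning step, the following Python sketch removes invalid and repeated records; the record format (a flat dict of scalar sensor readings) is an assumption chosen for illustration, not part of the disclosure.

```python
from typing import Any

def clean(records: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Data cleaning sketch: drop records with missing values and exact duplicates."""
    seen: set[tuple] = set()
    kept: list[dict[str, Any]] = []
    for record in records:
        key = tuple(sorted(record.items()))  # hashable fingerprint of the record
        if None in record.values() or key in seen:
            continue  # invalid or repeated data is removed
        seen.add(key)
        kept.append(record)
    return kept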
The characteristic extraction layer is used for extracting characteristics of the data processed by the data processing layer so as to extract the characteristics included in the data. The extracted features may reflect the state of the electronic device itself or the state of the user or the environmental state of the environment in which the electronic device is located, etc.
The feature extraction layer may extract features, or further process the extracted features, using methods such as the filter method, the wrapper method, or the ensemble method.
The filter method filters the extracted features to remove redundant feature data. The wrapper method screens the extracted features. The ensemble method combines multiple feature extraction methods to construct a more efficient and accurate feature extraction method.
The scene modeling layer is used for building a model according to the features extracted by the feature extraction layer, and the obtained model can be used for representing the state of the electronic equipment, the state of a user, the environment state and the like. For example, the scenario modeling layer may construct a key value model, a pattern identification model, a graph model, an entity relation model, an object-oriented model, and the like according to the features extracted by the feature extraction layer.
The intelligent service layer is used for providing intelligent services for the user according to the model constructed by the scene modeling layer. For example, the intelligent service layer can provide basic application services for users, perform system intelligent optimization for electronic equipment, and provide personalized intelligent services for users.
In addition, the panoramic perception architecture may further include a number of algorithms, each of which can be used to analyse and process data; together they form an algorithm library. For example, the algorithm library may include the Markov algorithm, latent Dirichlet allocation, Bayesian classification, word vector models, K-means clustering, K-nearest neighbors, the cosine similarity algorithm, residual networks, long short-term memory networks, convolutional neural networks, and recurrent neural networks.
The embodiments of the application provide a scene recognition method that can be applied to an electronic device. The electronic device may be a smartphone, a tablet computer, a gaming device, an AR (Augmented Reality) device, an automobile, a data storage device, an audio playback device, a video playback device, a laptop computer, a desktop computing device, or a wearable device such as an electronic watch, electronic glasses, an electronic helmet, an electronic bracelet, an electronic necklace, or electronic clothing.
Referring to fig. 2, fig. 2 is a schematic flowchart of a first method for scene recognition according to an embodiment of the present disclosure. The scene recognition method comprises the following steps:
and 110, acquiring the perception data of the current scene.
The electronic device may obtain perceptual data of a current scene. The current scene is a scene of an environment where the electronic device is currently located, that is, a scene of an environment where a user of the electronic device is currently located. It should be noted that, since the electronic device identifies the current scene through the acquired perceptual data, the current scene is an unknown scene for the electronic device.
The electronic device can acquire the perception data of the current scene through the information perception layer in the panoramic perception architecture. The perception data may comprise arbitrary data. For example, the sensing data may include various data such as temperature, humidity, ambient light intensity, image information, audio information, and the like.
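For illustration only, one possible container for such perception data is sketched below; the field names and types are assumptions mirroring the examples in the text, not structures defined by the patent.

```python
from dataclasses import dataclass, field

@dataclass
class PerceptionData:
    """Assumed container for one round of perception data of the current scene."""
    temperature: float | None = None       # degrees Celsius
    humidity: float | None = None          # relative humidity, percent
    ambient_light: float | None = None     # illuminance, lux
    images: list[bytes] = field(default_factory=list)       # camera frames
    audio_clips: list[bytes] = field(default_factory=list)  # microphone clips
    texts: list[str] = field(default_factory=list)          # e.g. notification text
```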
For example, the current scene may be a conference scene. The perception data acquired by the electronic device may include a plurality of image information, a plurality of sound information, and the like in the conference scene.
And 120, acquiring a feature vector of the current scene according to the perception data.
After the electronic device acquires the sensing data of the current scene, the feature vector of the current scene can be acquired according to the sensing data. Wherein the feature vector may comprise a plurality of features. The feature vector is used to quantize the current scene so that the current scene can be represented by the feature vector.
For example, the feature vector may be P (a, B, C). Wherein A, B, C each represent a feature, e.g., a may represent an image feature, B may represent an audio feature, and C may represent a text feature. The situation of the current scene can be represented by the feature vector P (A, B, C).
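A minimal sketch of this quantization, assuming each of A, B, and C is a fixed-length numeric vector (the use of numpy arrays is an assumption for illustration):

```python
import numpy as np

def scene_feature_vector(image_feat: np.ndarray,
                         audio_feat: np.ndarray,
                         text_feat: np.ndarray) -> np.ndarray:
    """Quantize the current scene as P(A, B, C): the concatenation of its
    image, audio and text features (each assumed to be a 1-D vector)."""
    return np.concatenate([image_feat, audio_feat, text_feat])
```

Represented this way, the current scene can be compared against preset scenes vector-to-vector.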
And 130, sequentially calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene.
A plurality of preset feature vectors may be preset in the electronic device, and each preset feature vector corresponds to one preset scene. That is, each preset feature vector is used to represent a preset scene. It will be appreciated that a preset scene is a scene known to the electronic device.
For example, preset feature vectors P1, P2, P3, and so on may be stored in the electronic device, where P1 may correspond to a conference scene, P2 to a restaurant scene, and P3 to a subway scene. That is, P1 is used to represent a conference scene, P2 a restaurant scene, and P3 a subway scene.
It should be noted that a large number of preset feature vectors may be set in the electronic device; for example, 100 preset feature vectors may be set. Accordingly, a large number of preset scenes can be preset in the electronic device, and each preset scene can be a specific scene in the user's life.
After the electronic equipment obtains the feature vector of the current scene, the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors is calculated in sequence, and a plurality of similarity values are obtained. The greater the similarity between the feature vector and a preset feature vector, the more similar the feature vector and the preset feature vector, that is, the more similar the current scene and the preset scene corresponding to the preset feature vector.
For example, after acquiring the feature vector P of the current scene, the electronic device sequentially calculates the similarity N1 between P and P1, the similarity N2 between P and P2, and the similarity N3 between P and P3. Here N1 denotes the similarity between P and P1, which can also be understood as the similarity between the current scene and the preset scene corresponding to P1; N2 denotes the similarity between P and P2, i.e. between the current scene and the preset scene corresponding to P2; and N3 denotes the similarity between P and P3, i.e. between the current scene and the preset scene corresponding to P3.
140, determining a target feature vector from the preset feature vectors according to the similarity values.
After the electronic device obtains the plurality of similarity values, a target feature vector can be determined from the plurality of preset feature vectors according to those values. The target feature vector is the preset feature vector with the greatest similarity to the feature vector of the current scene; that is, the scene represented by the target feature vector is the one most similar to the current scene.
And 150, determining a preset scene corresponding to the target feature vector as a current scene.
After the electronic device determines the target feature vector, a preset scene corresponding to the target feature vector can be determined as a current scene, so that the current scene is identified. Subsequently, the electronic device may perform an intelligent operation according to the identified scene, for example, the electronic device may automatically perform a mode switch, automatically adjust screen brightness, or provide an intelligent suggestion for the current scene to the user.
For example, if the electronic device determines the target feature vector to be P3, and the preset scene corresponding to P3 is a conference scene, the electronic device determines the conference scene as the current scene. Subsequently, the electronic device may perform an intelligent operation according to the determined scene; for example, it may automatically switch to silent mode.
In the embodiment of the application, the electronic device can acquire the feature vector of the current scene according to the sensing data of the current scene, determine the target feature vector according to the similarity values of the feature vector and the preset feature vectors, and determine the preset scene corresponding to the target feature vector as the current scene, so that the current scene is identified, and the electronic device can perform intelligent operation on the current scene conveniently.
In some embodiments, referring to fig. 3, fig. 3 is a second flowchart illustrating a scene recognition method according to an embodiment of the present disclosure.
Step 120, obtaining the feature vector of the current scene according to the perception data, including the following steps:
121, selecting a corresponding feature extraction model according to the data type of the perception data;
122, extracting data features from the perception data through the feature extraction model;
and 123, acquiring a feature vector of the current scene according to the data features.
A plurality of feature extraction models may be preset in the electronic device, and each feature extraction model is used for performing feature extraction on one type of data. For example, a convolutional neural network model, a recurrent neural network model, a word vector model, or the like may be set in advance in the electronic device. The convolutional neural network model is used for processing the image data so as to extract image features from the image data; the recurrent neural network model is used for processing the audio data so as to extract audio features from the audio data; the word vector model is used for processing the text data to extract text features from the text data.
After the electronic equipment acquires the perception data of the current scene, the corresponding feature extraction model can be selected according to the data type of the perception data. When the perception data comprises a plurality of data types, the electronic device can select a corresponding feature extraction model according to each data type.
And then, the electronic equipment extracts data features from the perception data through the selected feature extraction model and obtains the feature vector of the current scene according to the data features. For example, the electronic device may combine the extracted data features to obtain a feature vector for the current scene.
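The selection step can be pictured as a simple dispatch table. In the sketch below the extractor functions are placeholders standing in for the convolutional network, recurrent network/LSTM, and word-vector models named above; their internals are not specified by the patent.

```python
import numpy as np

# Placeholder extractors; real implementations would wrap trained models.
def cnn_image_features(image: bytes) -> np.ndarray: ...
def rnn_audio_features(audio: bytes) -> np.ndarray: ...
def word_vector_text_features(text: str) -> np.ndarray: ...

EXTRACTORS = {
    "image": cnn_image_features,        # convolutional neural network model
    "audio": rnn_audio_features,        # recurrent / long short-term memory model
    "text": word_vector_text_features,  # word vector model
}

def extract_features(data_type: str, data) -> np.ndarray:
    """Select the feature extraction model matching the data type."""
    return EXTRACTORS[data_type](data)
```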
In some embodiments, referring to fig. 4, fig. 4 is a third flowchart illustrating a scene recognition method according to an embodiment of the present application.
Step 122, extracting data features from the perception data through the feature extraction model, including the following steps:
1221, extracting image features from the image data through a convolutional neural network model;
1222 extracting audio features from the audio data through a recurrent neural network model or a long-short term memory network model;
1223, extracting text features from the text data through a word vector model;
step 123, obtaining the feature vector of the current scene according to the data features, including the following steps:
1231, obtaining the feature vector of the current scene according to the image feature, the audio feature and the text feature.
The data type of the perception data acquired by the electronic equipment comprises image data, audio data and text data. The feature extraction model selected by the electronic device for image data may be a convolutional neural network model, the feature extraction model selected for audio data may be a recurrent neural network model or a long-short term memory network model, and the feature extraction model selected for text data may be a word vector model.
Subsequently, the electronic device may extract image features from the image data through the convolutional neural network model, audio features from the audio data through the recurrent neural network model or long short-term memory network model, and text features from the text data through the word vector model.
And then, the electronic equipment acquires a feature vector of the current scene according to the extracted image feature, audio feature and text feature.
For example, the image feature extracted by the electronic device may be A, the audio feature may be B, and the text feature may be C. Subsequently, the electronic device may splice the extracted image, audio, and text features to obtain the feature vector P(A, B, C) of the current scene.
In some embodiments, the electronic device may further perform feature extraction on the obtained image features, audio features, and text features again to obtain new image features, new audio features, and new text features, and obtain a feature vector of the current scene according to the new image features, the new audio features, and the new text features.
For example, after the electronic device obtains the image feature A, the audio feature B, and the text feature C, it may perform feature extraction again on A, B, and C in sequence using a statistical method to obtain a new image feature A1, a new audio feature B1, and a new text feature C1, and splice A1, B1, and C1 to obtain the feature vector P(A1, B1, C1) of the current scene.
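A sketch of this second pass, assuming the "statistical method" computes simple summary statistics (an assumption for illustration; the patent does not name the statistics used):

```python
import numpy as np

def restat(feature: np.ndarray) -> np.ndarray:
    """Second-pass statistical re-extraction: summarise a first-pass feature
    by assumed statistics (mean, standard deviation, minimum, maximum)."""
    return np.array([feature.mean(), feature.std(), feature.min(), feature.max()])

def second_pass_scene_vector(a: np.ndarray, b: np.ndarray, c: np.ndarray) -> np.ndarray:
    """Build P(A1, B1, C1) by re-extracting A, B, C and splicing the results."""
    a1, b1, c1 = restat(a), restat(b), restat(c)
    return np.concatenate([a1, b1, c1])
```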
In some embodiments, referring to fig. 5, fig. 5 is a fourth flowchart illustrating a scene recognition method according to an embodiment of the present application.
Step 130, calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors in sequence to obtain a plurality of similarity values, including the following steps:
131, calculating cosine similarity between the feature vector and each preset feature vector by sequentially adopting a cosine similarity algorithm to obtain a plurality of cosine similarity values;
132, determining the cosine similarity value between the feature vector and each of the predetermined feature vectors as the similarity value between the feature vector and the predetermined feature vector to obtain a plurality of similarity values.
The electronic device may sequentially calculate the cosine similarity between the feature vector and each of the preset feature vectors by using a cosine similarity algorithm, so as to obtain a plurality of cosine similarity values.
The cosine similarity value ranges over [-1, 1]. A cosine similarity value of 1 indicates that the two vectors point in the same direction, a value of 0 indicates that the two vectors are orthogonal, and a value of -1 indicates that they point in opposite directions. The closer the cosine similarity value is to 1, the closer the directions of the two vectors.
For example, suppose the electronic device obtains the feature vector P of the current scene and the preset feature vectors include P1, P2, P3. The electronic device then applies the cosine similarity algorithm in sequence to calculate the cosine similarity of P with each of P1, P2, P3, obtaining the cosine similarity value K1 of P and P1, the value K2 of P and P2, and the value K3 of P and P3.
Then, the electronic device determines the cosine similarity value between the feature vector of the current scene and each preset feature vector as the similarity value between the feature vector and the preset feature vector to obtain a plurality of similarity values.
For example, the electronic device may determine the cosine similarity value K1 as the similarity value of P and P1, the value K2 as the similarity value of P and P2, and the value K3 as the similarity value of P and P3.
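A minimal implementation of the cosine similarity computation in steps 131-132, using numpy; this is an illustrative sketch rather than the patent's exact procedure:

```python
import numpy as np

def cosine_similarity(p: np.ndarray, q: np.ndarray) -> float:
    """Cosine similarity in [-1, 1]; 1 means same direction, -1 opposite."""
    denom = float(np.linalg.norm(p) * np.linalg.norm(q))
    return float(np.dot(p, q)) / denom if denom else 0.0

def similarity_values(p: np.ndarray, presets: list[np.ndarray]) -> list[float]:
    """Steps 131-132: one similarity value per preset feature vector."""
    return [cosine_similarity(p, q) for q in presets]
```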
In some embodiments, with continued reference to fig. 5, the step 140 of determining the target feature vector from the plurality of preset feature vectors according to the plurality of similarity values includes the following steps:
141, determining the maximum similarity value from the plurality of similarity values;
and 142, determining the preset feature vector corresponding to the maximum similarity value as a target feature vector.
After obtaining the plurality of similarity values, the electronic device may compare the plurality of similarity values with each other to determine a maximum similarity value from the plurality of similarity values. And then, determining the preset feature vector corresponding to the maximum similarity value as a target feature vector.
For example, if among the three similarity values N1, N2, N3, N1 is less than N2 and N2 is less than N3, the electronic device determines that the maximum similarity value is N3. The electronic device then determines the preset feature vector P3 corresponding to N3 as the target feature vector.
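Steps 141-142 amount to an argmax over the similarity values. A self-contained sketch is given below; the dict mapping scene names to preset feature vectors is an assumption for illustration.

```python
import numpy as np

def target_scene(p: np.ndarray, preset_scenes: dict[str, np.ndarray]) -> str:
    """Steps 141-142: pick the preset scene whose preset feature vector has
    the maximum similarity value with the current feature vector p."""
    def sim(q: np.ndarray) -> float:  # cosine similarity, as in step 131
        return float(np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q)))
    return max(preset_scenes, key=lambda name: sim(preset_scenes[name]))
```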
In some embodiments, referring to fig. 6, fig. 6 is a fifth flowchart illustrating a scene recognition method according to an embodiment of the present application.
Before acquiring the perception data of the current scene in step 110, the method further includes the following steps:
161, obtaining preset sensing data of each preset scene in the plurality of preset scenes for multiple times to obtain a plurality of preset sensing data of each preset scene;
and 162, acquiring preset feature vectors corresponding to the preset scenes according to the preset perception data of each preset scene.
Wherein a plurality of preset scenes may be determined by a user first. For example, a user may determine a plurality of scenes, such as a meeting scene, a restaurant scene, a subway scene, and so on.
The electronic device may obtain preset sensing data of each preset scene in the plurality of preset scenes for multiple times to obtain a plurality of preset sensing data of each preset scene.
For example, the electronic device may acquire the preset perception data of the conference scene multiple times to obtain multiple preset perception data X1, X2, X3 of the conference scene; acquire the preset perception data of the restaurant scene multiple times to obtain multiple preset perception data Y1, Y2, Y3 of the restaurant scene; and acquire the preset perception data of the subway scene multiple times to obtain multiple preset perception data Z1, Z2, Z3 of the subway scene.
And then, the electronic equipment acquires preset feature vectors corresponding to the preset scenes according to the preset perception data of each preset scene. The electronic equipment can extract features of a plurality of preset sensing data of each preset scene, and obtains preset feature vectors corresponding to the preset scenes according to the extracted features.
For example, the electronic device may perform feature extraction on X1, X2, X3 and obtain the preset feature vector P1 of the conference scene from the extracted features; perform feature extraction on Y1, Y2, Y3 and obtain the preset feature vector P2 of the restaurant scene; and perform feature extraction on Z1, Z2, Z3 and obtain the preset feature vector P3 of the subway scene.
In some embodiments, referring to fig. 7, fig. 7 is a sixth flowchart illustrating a scene recognition method according to an embodiment of the present application.
Step 162, obtaining a preset feature vector corresponding to each preset scene according to a plurality of preset sensing data of the preset scene, includes the following steps:
1621, sequentially obtaining a preset sub-feature vector according to each preset sensing data to obtain a plurality of preset sub-feature vectors of the preset scene;
1622, calculating an average feature vector of the plurality of preset sub-feature vectors;
1623, determining the average feature vector as a preset feature vector corresponding to the preset scene.
After the electronic equipment obtains a plurality of preset perception data of each preset scene, a preset sub-feature vector is obtained in sequence according to each preset perception data, so that a plurality of preset sub-feature vectors of the preset scene are obtained. The electronic device can extract features of each preset sensing data, and obtains a corresponding preset sub-feature vector according to the extracted features. Thus, for each preset scene, the electronic device may obtain a plurality of preset sub-feature vectors.
For example, for the conference scene, the electronic device may perform feature extraction on the preset perception data X1 to obtain the corresponding preset sub-feature vector P11, on X2 to obtain the corresponding preset sub-feature vector P12, and on X3 to obtain the corresponding preset sub-feature vector P13. The electronic device thus obtains three preset sub-feature vectors P11, P12, P13 of the conference scene.
Then, the electronic device calculates an average feature vector of the plurality of preset sub-feature vectors, and determines the average feature vector as a preset feature vector corresponding to the preset scene.
For example, after obtaining the three preset sub-feature vectors P11, P12, P13 of the conference scene, the electronic device calculates their average feature vector P1 and determines P1 as the preset feature vector of the conference scene.
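A sketch of steps 1621-1623, where the preset feature vector of a scene is the element-wise mean of its sub-feature vectors (numpy is assumed for illustration):

```python
import numpy as np

def preset_feature_vector(sub_vectors: list[np.ndarray]) -> np.ndarray:
    """Steps 1621-1623: average the preset sub-feature vectors obtained from
    several rounds of preset perception data of one preset scene."""
    return np.mean(np.stack(sub_vectors), axis=0)

# e.g. P1 = preset_feature_vector([P11, P12, P13]) for the conference scene
```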
In particular implementation, the present application is not limited by the execution sequence of the described steps, and some steps may be performed in other sequences or simultaneously without conflict.
As can be seen from the above, the scene recognition method provided by the embodiment of the present application includes: acquiring perception data of a current scene; acquiring a feature vector of the current scene according to the perception data; sequentially calculating the similarity between the feature vector and each of a plurality of preset feature vectors to obtain a plurality of similarity values, where each preset feature vector corresponds to one preset scene; determining a target feature vector from the preset feature vectors according to the similarity values; and determining the preset scene corresponding to the target feature vector as the current scene. In this method, the electronic device can obtain the feature vector of the current scene from its perception data, determine the target feature vector from the similarity values, and determine the preset scene corresponding to the target feature vector as the current scene, thereby recognizing the current scene and making it convenient for the electronic device to perform intelligent operations for it.
The embodiment of the application also provides a scene recognition apparatus, which can be integrated in an electronic device. The electronic device may be a smartphone, a tablet computer, a gaming device, an AR (Augmented Reality) device, an automobile, a data storage device, an audio playback device, a video playback device, a laptop computer, a desktop computing device, or a wearable device such as an electronic watch, electronic glasses, an electronic helmet, an electronic bracelet, an electronic necklace, or electronic clothing.
Referring to fig. 8, fig. 8 is a schematic view of a first structure of a scene recognition apparatus according to an embodiment of the present application. Wherein the scene recognition apparatus 200 comprises: a first obtaining module 201, a second obtaining module 202, a calculating module 203, a first determining module 204, and a second determining module 205.
A first obtaining module 201, configured to obtain perceptual data of a current scene.
The first obtaining module 201 may obtain perceptual data of a current scene. The current scene is a scene of an environment where the electronic device is currently located, that is, a scene of an environment where a user of the electronic device is currently located. It should be noted that, since the electronic device identifies the current scene through the acquired perceptual data, the current scene is an unknown scene for the electronic device.
The first obtaining module 201 may collect the sensing data of the current scene through an information sensing layer in a panoramic sensing architecture in the electronic device. The perception data may comprise arbitrary data. For example, the sensing data may include various data such as temperature, humidity, ambient light intensity, image information, audio information, and the like.
For example, the current scene may be a conference scene. The perception data acquired by the first acquisition module 201 may include a plurality of image information, a plurality of sound information, and the like in a conference scene.
A second obtaining module 202, configured to obtain a feature vector of the current scene according to the sensing data.
After the first obtaining module 201 obtains the sensing data of the current scene, the second obtaining module 202 may obtain the feature vector of the current scene according to the sensing data. Wherein the feature vector may comprise a plurality of features. The feature vector is used to quantize the current scene so that the current scene can be represented by the feature vector.
For example, the feature vector may be P (a, B, C). Wherein A, B, C each represent a feature, e.g., a may represent an image feature, B may represent an audio feature, and C may represent a text feature. The situation of the current scene can be represented by the feature vector P (A, B, C).
The calculating module 203 is configured to sequentially calculate a similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, where each preset feature vector corresponds to one preset scene.
A plurality of preset feature vectors may be preset in the electronic device, and each preset feature vector corresponds to one preset scene. That is, each preset feature vector is used to represent a preset scene. It will be appreciated that a preset scene is a scene known to the electronic device.
For example, preset feature vectors P1, P2, P3, and so on may be stored in the electronic device, where P1 may correspond to a conference scene, P2 to a restaurant scene, and P3 to a subway scene. That is, P1 is used to represent a conference scene, P2 a restaurant scene, and P3 a subway scene.
It should be noted that a large number of preset feature vectors may be set in the electronic device; for example, 100 preset feature vectors may be set. Accordingly, a large number of preset scenes can be preset in the electronic device, and each preset scene can be a specific scene in the user's life.
After the second obtaining module 202 obtains the feature vector of the current scene, the calculating module 203 sequentially calculates the similarity between the feature vector and each preset feature vector in the plurality of preset feature vectors to obtain a plurality of similarity values. The greater the similarity between the feature vector and a preset feature vector, the more similar the feature vector and the preset feature vector, that is, the more similar the current scene and the preset scene corresponding to the preset feature vector.
For example, after the second obtaining module 202 obtains the feature vector P of the current scene, the calculating module 203 sequentially calculates the similarity N1 between P and P1, the similarity N2 between P and P2, and the similarity N3 between P and P3. Here N1 denotes the similarity between P and P1, which can also be understood as the similarity between the current scene and the preset scene corresponding to P1; N2 denotes the similarity between P and P2, i.e. between the current scene and the preset scene corresponding to P2; and N3 denotes the similarity between P and P3, i.e. between the current scene and the preset scene corresponding to P3.
A first determining module 204, configured to determine a target feature vector from the plurality of preset feature vectors according to the plurality of similarity values.
After the calculating module 203 obtains the plurality of similarity values, the first determining module 204 may determine the target feature vector from the plurality of preset feature vectors according to those values. The target feature vector is the preset feature vector with the greatest similarity to the feature vector of the current scene; that is, the scene represented by the target feature vector is the one most similar to the current scene.
A second determining module 205, configured to determine a preset scene corresponding to the target feature vector as a current scene.
After the first determining module 204 determines the target feature vector, the second determining module 205 may determine a preset scene corresponding to the target feature vector as a current scene, so as to implement identification of the current scene. Subsequently, the electronic device may perform an intelligent operation according to the identified scene, for example, the electronic device may automatically perform a mode switch, automatically adjust screen brightness, or provide an intelligent suggestion for the current scene to the user.
For example, if the target feature vector determined by the first determining module 204 is P3, and the preset scene corresponding to P3 is a conference scene, the second determining module 205 determines the conference scene as the current scene. Subsequently, the electronic device may perform an intelligent operation according to the determined scene; for example, it may automatically switch to silent mode.
In the embodiment of the application, the electronic device can acquire the feature vector of the current scene according to the sensing data of the current scene, determine the target feature vector according to the similarity values of the feature vector and the preset feature vectors, and determine the preset scene corresponding to the target feature vector as the current scene, so that the current scene is identified, and the electronic device can perform intelligent operation on the current scene conveniently.
In some embodiments, the second obtaining module 202 is configured to perform the following steps:
selecting a corresponding feature extraction model according to the data type of the perception data;
extracting data features from the perception data through the feature extraction model;
and acquiring a feature vector of the current scene according to the data features.
A plurality of feature extraction models may be preset in the electronic device, and each feature extraction model is used for performing feature extraction on one type of data. For example, a convolutional neural network model, a recurrent neural network model, a word vector model, or the like may be set in advance in the electronic device. The convolutional neural network model is used for processing the image data so as to extract image features from the image data; the recurrent neural network model is used for processing the audio data so as to extract audio features from the audio data; the word vector model is used for processing the text data to extract text features from the text data.
After the first obtaining module 201 obtains the sensing data of the current scene, the second obtaining module 202 may select a corresponding feature extraction model according to a data type of the sensing data. When the sensing data includes a plurality of data types, the second obtaining module 202 may select a corresponding feature extraction model according to each data type.
Subsequently, the second obtaining module 202 extracts data features from the sensing data through the selected feature extraction model, and obtains feature vectors of the current scene according to the data features. For example, the second obtaining module 202 may combine the extracted data features to obtain a feature vector of the current scene.
In some embodiments, when extracting data features from the perceptual data through the feature extraction model, the second obtaining module 202 is configured to perform the following steps:
extracting image features from the image data through a convolutional neural network model;
extracting audio features from the audio data through a recurrent neural network model or a long-short term memory network model;
extracting text features from the text data through a word vector model;
when the feature vector of the current scene is obtained according to the data feature, the second obtaining module 202 is configured to perform the following steps:
and acquiring a feature vector of the current scene according to the image feature, the audio feature and the text feature.
The data type of the perception data acquired by the first acquiring module 201 includes image data, audio data, and text data. The feature extraction model selected for the image data may be a convolutional neural network model, the feature extraction model selected for the audio data may be a recurrent neural network model or a long-short term memory network model, and the feature extraction model selected for the text data may be a word vector model.
Subsequently, the second obtaining module 202 may extract image features from the image data through the convolutional neural network model, audio features from the audio data through the recurrent neural network model or long short-term memory network model, and text features from the text data through the word vector model.
Subsequently, the second obtaining module 202 obtains a feature vector of the current scene according to the extracted image feature, audio feature and text feature.
For example, the image feature extracted by the second obtaining module 202 may be A, the audio feature may be B, and the text feature may be C. Subsequently, the second obtaining module 202 may splice the extracted image, audio, and text features to obtain the feature vector P(A, B, C) of the current scene.
In some embodiments, the second obtaining module 202 may further perform feature extraction on the obtained image features, audio features, and text features again to obtain new image features, new audio features, and new text features, and obtain a feature vector of the current scene according to the new image features, the new audio features, and the new text features.
For example, after the second obtaining module 202 obtains the image feature A, the audio feature B, and the text feature C, it may perform feature extraction again on A, B, and C in sequence using a statistical method to obtain a new image feature A1, a new audio feature B1, and a new text feature C1, and splice A1, B1, and C1 to obtain the feature vector P(A1, B1, C1) of the current scene.
In some embodiments, the calculation module 203 is configured to perform the following steps:
calculating cosine similarity between the eigenvectors and each preset eigenvector by sequentially adopting a cosine similarity algorithm to obtain a plurality of cosine similarity values;
and determining the cosine similarity value of the feature vector and each preset feature vector as the similarity value of the feature vector and the preset feature vector to obtain a plurality of similarity values.
The calculating module 203 may sequentially calculate the cosine similarity between the feature vector and each of the preset feature vectors by using a cosine similarity algorithm, so as to obtain a plurality of cosine similarity values.
The cosine similarity value ranges over [-1, 1]. A cosine similarity value of 1 indicates that the two vectors point in the same direction, a value of 0 indicates that the two vectors are orthogonal, and a value of -1 indicates that they point in opposite directions. The closer the cosine similarity value is to 1, the closer the directions of the two vectors.
For example, suppose the second obtaining module 202 obtains the feature vector P of the current scene and the preset feature vectors include P1, P2, P3. The calculating module 203 then applies the cosine similarity algorithm in sequence to calculate the cosine similarity of P with each of P1, P2, P3, obtaining the cosine similarity value K1 of P and P1, the value K2 of P and P2, and the value K3 of P and P3.
Subsequently, the calculating module 203 determines the cosine similarity value between the feature vector of the current scene and each of the preset feature vectors as the similarity value between the feature vector and the preset feature vector, so as to obtain a plurality of similarity values.
For example, the calculating module 203 may determine the cosine similarity value K1 as the similarity value of P and P1, the value K2 as the similarity value of P and P2, and the value K3 as the similarity value of P and P3.
In some embodiments, the first determination module 204 is configured to perform the following steps:
determining a maximum similarity value from the plurality of similarity values;
and determining the preset feature vector corresponding to the maximum similarity value as a target feature vector.
After the calculating module 203 obtains the plurality of similarity values, the first determining module 204 may compare the plurality of similarity values with each other to determine a maximum similarity value from the plurality of similarity values. And then, determining the preset feature vector corresponding to the maximum similarity value as a target feature vector.
For example, if among the three similarity values N1, N2, N3, N1 is less than N2 and N2 is less than N3, the first determining module 204 may determine that the maximum similarity value is N3. The first determining module 204 then determines the preset feature vector P3 corresponding to N3 as the target feature vector.
In some embodiments, referring to fig. 9, fig. 9 is a schematic diagram of a second structure of a scene recognition apparatus provided in an embodiment of the present application.
The scene recognition apparatus 200 further includes a third obtaining module 206, where the third obtaining module 206 is configured to perform the following steps:
acquiring preset sensing data of each preset scene in a plurality of preset scenes for a plurality of times to obtain a plurality of preset sensing data of each preset scene;
and acquiring the preset feature vector corresponding to each preset scene according to the plurality of preset perception data of that scene.
Wherein a plurality of preset scenes may be determined by a user first. For example, a user may determine a plurality of scenes, such as a meeting scene, a restaurant scene, a subway scene, and so on.
The third obtaining module 206 may obtain the preset sensing data of each preset scene in the plurality of preset scenes for multiple times to obtain a plurality of preset sensing data of each preset scene.
For example, the third obtaining module 206 may acquire the preset perception data of the conference scene multiple times to obtain multiple preset perception data X1, X2, X3 of the conference scene; acquire the preset perception data of the restaurant scene multiple times to obtain multiple preset perception data Y1, Y2, Y3 of the restaurant scene; and acquire the preset perception data of the subway scene multiple times to obtain multiple preset perception data Z1, Z2, Z3 of the subway scene.
Subsequently, the third obtaining module 206 obtains the preset feature vector corresponding to each preset scene according to the plurality of preset sensing data of the preset scene. The third obtaining module 206 may perform feature extraction on a plurality of preset sensing data of each preset scene, and obtain a preset feature vector corresponding to the preset scene from the extracted features.
For example, the third obtaining module 206 may perform feature extraction on X1, X2, X3 and obtain the preset feature vector P1 of the conference scene from the extracted features; perform feature extraction on Y1, Y2, Y3 and obtain the preset feature vector P2 of the restaurant scene; and perform feature extraction on Z1, Z2, Z3 and obtain the preset feature vector P3 of the subway scene.
In some embodiments, when obtaining the preset feature vector corresponding to each preset scene according to the plurality of preset perception data of the preset scene, the third obtaining module 206 is configured to perform the following steps:
acquiring a preset sub-feature vector according to each preset sensing data in sequence to obtain a plurality of preset sub-feature vectors of the preset scene;
calculating an average feature vector of the preset sub-feature vectors;
and determining the average feature vector as the preset feature vector corresponding to the preset scene.
After obtaining the plurality of preset sensing data of each preset scene, the third obtaining module 206 acquires a preset sub-feature vector according to each preset sensing data in sequence, so as to obtain a plurality of preset sub-feature vectors of the preset scene. The third obtaining module 206 may perform feature extraction on each preset sensing data and obtain the corresponding preset sub-feature vector from the extracted features. Thus, for each preset scene, the third obtaining module 206 may obtain a plurality of preset sub-feature vectors.
For example, for the conference scene, the third obtaining module 206 may perform feature extraction on the preset sensing data X1 to obtain the corresponding preset sub-feature vector P11; perform feature extraction on the preset sensing data X2 to obtain the corresponding preset sub-feature vector P12; and perform feature extraction on the preset sensing data X3 to obtain the corresponding preset sub-feature vector P13. Thus, the third obtaining module 206 may obtain three preset sub-feature vectors P11, P12, P13 of the conference scene.
Subsequently, the third obtaining module 206 calculates an average feature vector of the preset sub-feature vectors, and determines the average feature vector as a preset feature vector corresponding to the preset scene.
For example, after obtaining the three preset sub-feature vectors P11, P12, P13 of the conference scene, the third obtaining module 206 calculates the average feature vector P1 of the three preset sub-feature vectors P11, P12, P13. Subsequently, the third obtaining module 206 determines the average feature vector P1 as the preset feature vector of the conference scene.
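A minimal sketch of this averaging step, assuming all preset sub-feature vectors share the same dimension (the values are invented for illustration):

    import numpy as np

    # Three preset sub-feature vectors from three acquisitions of the conference scene.
    P11 = np.array([0.1, 0.8, 0.1])
    P12 = np.array([0.2, 0.7, 0.1])
    P13 = np.array([0.0, 0.9, 0.1])

    # The element-wise mean is taken as the preset feature vector P1 of the scene.
    P1 = np.mean([P11, P12, P13], axis=0)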
In a specific implementation, the above modules may be implemented as independent entities, or may be combined arbitrarily and implemented as one or several entities.
As can be seen from the above, the scene recognition apparatus 200 according to the embodiment of the present application includes: a first obtaining module 201, configured to obtain perceptual data of a current scene; a second obtaining module 202, configured to obtain a feature vector of the current scene according to the sensing data; a calculating module 203, configured to sequentially calculate a similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, where each preset feature vector corresponds to a preset scene; a first determining module 204, configured to determine a target feature vector from the plurality of preset feature vectors according to the plurality of similarity values; a second determining module 205, configured to determine a preset scene corresponding to the target feature vector as a current scene. The scene recognition device can acquire the feature vector of the current scene according to the sensing data of the current scene, determine the target feature vector according to the similarity values of the feature vector and the preset feature vectors, and determine the preset scene corresponding to the target feature vector as the current scene, so that the current scene is recognized, and the electronic equipment can perform intelligent operation on the current scene conveniently.
The embodiment of the application also provides an electronic device. The electronic device may be a smartphone, a tablet computer, a gaming device, an AR (Augmented Reality) device, an automobile, a data storage device, an audio playback device, a video playback device, a laptop computer, a desktop computing device, or a wearable device such as an electronic watch, electronic glasses, an electronic helmet, an electronic bracelet, an electronic necklace, or electronic clothing.
Referring to fig. 10, fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Electronic device 300 includes, among other things, a processor 301 and a memory 302. The processor 301 is electrically connected to the memory 302.
The processor 301 is the control center of the electronic device 300. It connects various parts of the entire electronic device using various interfaces and lines, and performs the various functions of the electronic device and processes data by running or calling the computer program stored in the memory 302 and by calling the data stored in the memory 302, thereby monitoring the electronic device as a whole.
In this embodiment, the processor 301 in the electronic device 300 loads instructions corresponding to one or more processes of the computer program into the memory 302, and runs the computer program stored in the memory 302, so as to implement the following functions:
acquiring perception data of a current scene;
acquiring a feature vector of the current scene according to the perception data;
sequentially calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene;
determining a target feature vector from the preset feature vectors according to the similarity values;
and determining the preset scene corresponding to the target feature vector as the current scene.
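Taken together, these five steps can be sketched as a single recognition routine; this is a hedged outline under the assumption that feature extraction and similarity computation are supplied as functions (extract_feature_vector and cosine_similarity here are placeholders, not names defined by the embodiment):

    import numpy as np

    def recognize_scene(perception_data, preset_vectors, preset_scenes,
                        extract_feature_vector, cosine_similarity):
        # Steps 1-2: obtain the feature vector of the current scene.
        v = extract_feature_vector(perception_data)
        # Step 3: similarity against every preset feature vector.
        sims = [cosine_similarity(v, p) for p in preset_vectors]
        # Steps 4-5: the scene of the most similar preset vector is returned.
        return preset_scenes[int(np.argmax(sims))]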
In some embodiments, when obtaining the feature vector of the current scene according to the perception data, the processor 301 performs the following steps:
selecting a corresponding feature extraction model according to the data type of the perception data;
extracting data features from the perception data through the feature extraction model;
and acquiring a feature vector of the current scene according to the data features.
In some embodiments, the data types of the perception data include image data, audio data, and text data, and when the data features are extracted from the perception data by the feature extraction model, the processor 301 performs the following steps:
extracting image features from the image data by a convolutional neural network model;
extracting audio features from the audio data through a recurrent neural network model or a long-short term memory network model;
extracting text features from the text data through a word vector model;
when obtaining the feature vector of the current scene according to the data feature, the processor 301 executes the following steps:
and acquiring a feature vector of the current scene according to the image feature, the audio feature and the text feature.
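One way to picture this per-type dispatch and fusion is the following sketch, which uses small PyTorch modules as stand-ins; the concrete architectures, input shapes, and concatenation-based fusion are assumptions made for illustration, not the networks specified by the embodiment:

    import torch
    import torch.nn as nn

    # Stand-in models for the three data types.
    cnn = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(),
                        nn.AdaptiveAvgPool2d(1), nn.Flatten())
    lstm = nn.LSTM(input_size=40, hidden_size=16, batch_first=True)
    embed = nn.EmbeddingBag(num_embeddings=10000, embedding_dim=16)

    image = torch.randn(1, 3, 64, 64)        # image data
    audio = torch.randn(1, 100, 40)          # audio data, 100 frames of 40-dim features
    text = torch.randint(0, 10000, (1, 12))  # text data as token ids

    img_feat = cnn(image)                    # image features via a convolutional network
    _, (h, _) = lstm(audio)                  # audio features via a long short-term memory network
    aud_feat = h[-1]
    txt_feat = embed(text)                   # text features via a word-vector embedding

    # One possible fusion: concatenate per-type features into the scene feature vector.
    feature_vector = torch.cat([img_feat, aud_feat, txt_feat], dim=1)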
In some embodiments, when the similarity between the feature vector and each of the preset feature vectors is sequentially calculated to obtain a plurality of similarity values, the processor 301 performs the following steps:
sequentially calculating the cosine similarity between the feature vector and each preset feature vector by using a cosine similarity algorithm to obtain a plurality of cosine similarity values;
and determining the cosine similarity value of the feature vector and each preset feature vector as the similarity value of the feature vector and the preset feature vector to obtain a plurality of similarity values.
In some embodiments, when determining the target feature vector from the plurality of preset feature vectors according to the plurality of similarity values, the processor 301 performs the following steps:
determining a maximum similarity value from the plurality of similarity values;
and determining the preset feature vector corresponding to the maximum similarity value as a target feature vector.
In some embodiments, before obtaining the perceptual data of the current scene, the processor 301 further performs the following steps:
acquiring preset sensing data of each preset scene in a plurality of preset scenes for a plurality of times to obtain a plurality of preset sensing data of each preset scene;
and acquiring the preset feature vector corresponding to each preset scene according to the plurality of preset perception data of the preset scene.
In some embodiments, when obtaining the preset feature vector corresponding to each preset scene according to the plurality of preset sensing data of the preset scene, the processor 301 performs the following steps:
acquiring a preset sub-feature vector according to each preset sensing data in sequence to obtain a plurality of preset sub-feature vectors of the preset scene;
calculating an average feature vector of the preset sub-feature vectors;
and determining the average feature vector as the preset feature vector corresponding to the preset scene.
The memory 302 may be used to store computer programs and data. The memory 302 stores computer programs containing instructions executable by the processor. The computer programs may constitute various functional modules. The processor 301 executes various functional applications and performs data processing by calling the computer programs stored in the memory 302.
In some embodiments, referring to fig. 11, fig. 11 is a schematic view of a second structure of an electronic device provided in an embodiment of the present application.
Wherein, the electronic device 300 further comprises: a display 303, a control circuit 304, an input unit 305, a sensor 306, and a power supply 307. The processor 301 is electrically connected to the display 303, the control circuit 304, the input unit 305, the sensor 306, and the power source 307.
The display screen 303 may be used to display information entered by or provided to the user as well as various graphical user interfaces of the electronic device, which may be comprised of images, text, icons, video, and any combination thereof.
The control circuit 304 is electrically connected to the display 303, and is configured to control the display 303 to display information.
The input unit 305 may be used to receive input numbers, character information, or user characteristic information (e.g., a fingerprint), and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. The input unit 305 may include a fingerprint recognition module.
The sensor 306 is used to collect information of the electronic device itself or information of the user or external environment information. For example, the sensor 306 may include a plurality of sensors such as a distance sensor, a magnetic field sensor, a light sensor, an acceleration sensor, a fingerprint sensor, a hall sensor, a position sensor, a gyroscope, an inertial sensor, an attitude sensor, a barometer, a heart rate sensor, and the like.
The power supply 307 is used to power the various components of the electronic device 300. In some embodiments, the power supply 307 may be logically coupled to the processor 301 through a power management system, such that functions of managing charging, discharging, and power consumption are performed through the power management system.
Although not shown in fig. 11, the electronic device 300 may further include a camera, a bluetooth module, and the like, which are not described in detail herein.
As can be seen from the above, an embodiment of the present application provides an electronic device, where the electronic device performs the following steps: acquiring perception data of a current scene; acquiring a feature vector of the current scene according to the perception data; sequentially calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene; determining a target feature vector from the preset feature vectors according to the similarity values; and determining a preset scene corresponding to the target characteristic vector as a current scene. The electronic device provided by the embodiment of the application can acquire the feature vector of the current scene according to the sensing data of the current scene, determine the target feature vector according to the similarity values of the feature vector and the preset feature vectors, and determine the preset scene corresponding to the target feature vector as the current scene, so that the current scene is identified, and the electronic device can perform intelligent operation on the current scene conveniently.
An embodiment of the present application further provides a storage medium, where a computer program is stored in the storage medium, and when the computer program runs on a computer, the computer executes the scene recognition method according to any of the above embodiments.
It should be noted that all or part of the steps in the methods of the above embodiments may be implemented by hardware under the control of instructions of a computer program, and the computer program may be stored in a computer-readable storage medium, which may include, but is not limited to: a Read Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The scene recognition method, the scene recognition device, the storage medium, and the electronic device provided in the embodiments of the present application are described in detail above. The principle and the implementation of the present application are explained herein by applying specific examples, and the above description of the embodiments is only used to help understand the method and the core idea of the present application; meanwhile, for those skilled in the art, according to the idea of the present application, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present application.

Claims (10)

1. A method for scene recognition, comprising:
acquiring perception data of a current scene;
acquiring a feature vector of the current scene according to the perception data;
sequentially calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene;
determining a target feature vector from the preset feature vectors according to the similarity values;
and determining the preset scene corresponding to the target feature vector as the current scene.
2. The method according to claim 1, wherein said obtaining a feature vector of the current scene according to the perceptual data comprises:
selecting a corresponding feature extraction model according to the data type of the perception data;
extracting data features from the perception data through the feature extraction model;
and acquiring a feature vector of the current scene according to the data features.
3. The scene recognition method of claim 2, wherein the data types of the perception data comprise image data, audio data and text data, and the extracting data features from the perception data through the feature extraction model comprises:
extracting image features from the image data by a convolutional neural network model;
extracting audio features from the audio data through a recurrent neural network model or a long-short term memory network model;
extracting text features from the text data through a word vector model;
the obtaining of the feature vector of the current scene according to the data feature includes:
and acquiring a feature vector of the current scene according to the image feature, the audio feature and the text feature.
4. The method according to claim 1, wherein the sequentially calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors to obtain a plurality of similarity values comprises:
calculating the cosine similarity between the feature vector and each preset feature vector by sequentially using a cosine similarity algorithm to obtain a plurality of cosine similarity values;
and determining the cosine similarity value of the feature vector and each preset feature vector as the similarity value of the feature vector and the preset feature vector to obtain a plurality of similarity values.
5. The method according to claim 1, wherein the determining a target feature vector from the plurality of preset feature vectors according to the plurality of similarity values comprises:
determining a maximum similarity value from the plurality of similarity values;
and determining the preset feature vector corresponding to the maximum similarity value as a target feature vector.
6. The scene recognition method according to claim 1, further comprising, before the obtaining the perception data of the current scene:
acquiring preset sensing data of each preset scene in a plurality of preset scenes for a plurality of times to obtain a plurality of preset sensing data of each preset scene;
and acquiring the preset feature vector corresponding to each preset scene according to the plurality of preset perception data of the preset scene.
7. The method according to claim 6, wherein the obtaining the preset feature vector corresponding to each preset scene according to the plurality of preset perception data of the preset scene comprises:
acquiring a preset sub-feature vector according to each preset sensing data in sequence to obtain a plurality of preset sub-feature vectors of the preset scene;
calculating an average feature vector of the preset sub-feature vectors;
and determining the average feature vector as the preset feature vector corresponding to the preset scene.
8. A scene recognition apparatus, comprising:
the first acquisition module is used for acquiring the perception data of the current scene;
the second acquisition module is used for acquiring the feature vector of the current scene according to the perception data;
the calculation module is used for calculating the similarity between the feature vector and each preset feature vector in a plurality of preset feature vectors in sequence to obtain a plurality of similarity values, wherein each preset feature vector corresponds to one preset scene;
a first determining module, configured to determine a target feature vector from the plurality of preset feature vectors according to the plurality of similarity values;
and the second determining module is used for determining the preset scene corresponding to the target characteristic vector as the current scene.
9. A storage medium having stored therein a computer program which, when run on a computer, causes the computer to execute the scene recognition method according to any one of claims 1 to 7.
10. An electronic device, characterized in that the electronic device comprises a processor and a memory, wherein the memory stores a computer program, and the processor is used for executing the scene recognition method according to any one of claims 1 to 7 by calling the computer program stored in the memory.
CN201910282441.8A 2019-04-09 2019-04-09 Scene recognition method and device, storage medium and electronic equipment Pending CN111797873A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910282441.8A CN111797873A (en) 2019-04-09 2019-04-09 Scene recognition method and device, storage medium and electronic equipment


Publications (1)

Publication Number Publication Date
CN111797873A true CN111797873A (en) 2020-10-20

Family

ID=72805364

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910282441.8A Pending CN111797873A (en) 2019-04-09 2019-04-09 Scene recognition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN111797873A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113157889A (en) * 2021-04-21 2021-07-23 韶鼎人工智能科技有限公司 Visual question-answering model construction method based on theme loss
CN114065340A (en) * 2021-10-15 2022-02-18 南方电网数字电网研究院有限公司 Construction site safety monitoring method and system based on machine learning and storage medium
CN115396831A (en) * 2021-05-08 2022-11-25 ***通信集团浙江有限公司 Interaction model generation method, device, equipment and storage medium


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942523A (en) * 2013-01-18 2014-07-23 华为终端有限公司 Sunshine scene recognition method and device
CN103617432A (en) * 2013-11-12 2014-03-05 华为技术有限公司 Method and device for recognizing scenes
CN108710847A (en) * 2018-05-15 2018-10-26 北京旷视科技有限公司 Scene recognition method, device and electronic equipment
CN109241903A (en) * 2018-08-30 2019-01-18 平安科技(深圳)有限公司 Sample data cleaning method, device, computer equipment and storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination