CN108549850B - Image identification method and electronic equipment - Google Patents

Image identification method and electronic equipment

Info

Publication number
CN108549850B
CN108549850B (application CN201810260038.0A)
Authority
CN
China
Prior art keywords
image
text information
information
decoding
coding
Prior art date
Legal status
Active
Application number
CN201810260038.0A
Other languages
Chinese (zh)
Other versions
CN108549850A (en)
Inventor
田疆
李聪
Current Assignee
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201810260038.0A
Publication of CN108549850A
Application granted
Publication of CN108549850B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/10 - Terrestrial scenes
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an image identification method comprising the following steps: acquiring image information and first text information; and generating second text information based on the image information and the first text information, wherein the second text information represents the content of both the image information and the text information. The invention also discloses an electronic device.

Description

Image identification method and electronic equipment
Technical Field
The present invention relates to image recognition technologies, and in particular, to an image recognition method and an electronic device.
Background
In the prior art, image recognition can only make simple judgments about the composition of an image, or requires a human operator to judge the image; as a result, recognition efficiency is low and the recognition error rate is high.
Disclosure of Invention
Embodiments of the invention provide an image identification method and an electronic device which, while performing image identification, encode and decode the visual features of the image together with acquired text information to obtain second text information that describes the image by fusing the visual features and the text information.
The technical scheme of the embodiment of the invention is realized as follows:
the embodiment of the invention provides an image identification method, which comprises the following steps:
acquiring image information and first text information;
and generating second text information based on the image information and the first text information, wherein the second text information is used for representing the image information and the text information content.
In the foregoing solution, the acquiring the image information and the first text information includes:
extracting visual features from the image;
and coding at least two different types of text information of the image to obtain a coding result representing the semantics of the text information.
In the foregoing solution, the generating second text information based on the image information and the first text information includes:
performing decoding based on the visual features and the coding result to obtain second text information that fuses the visual features and the text information to describe the image.
In the above scheme, the extracting visual features from the image includes:
alternately processing the image through a convolutional layer and a max-pooling layer of a convolutional neural network model to obtain a down-sampling result of the image;
and processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual characteristics of the image.
In the above scheme, the method further comprises:
and processing the image visual characteristics through an average pooling layer of the convolutional neural network model to obtain the label representing the classification of the image.
In the above scheme, the encoding at least two different types of text information of the image includes:
performing word-level coding on at least two types of text information of the image through neural network models corresponding to the different types of text information;
and carrying out sentence-level coding on the coding result of the word level.
In the above scheme, the decoding based on the visual features and the encoding result includes:
performing sentence-level decoding on the coding result through a first decoder model;
and performing word-level decoding on the sentence-level decoding result through a second decoder model.
In the above scheme, the method further comprises:
distributing corresponding weights to the visual features and the coding results through an attention model;
and inputting the first weight matrix, the second weight matrix, the coding result and the visual features into the first decoder for decoding.
In the above scheme, the method further comprises:
training a convolutional neural network model for extracting visual features from the image based on image samples and classification labels of the image samples;
training a first decoder model based on sentence samples and the corresponding decoding results;
a second decoder model is trained based on the word samples and the corresponding decoding results.
In the above scheme,
when the image is a medical image of a diseased part, the first text information comprises an indication and a clinical report of the diseased part, and the second text information comprises a diagnosis result of the diseased part.
An embodiment of the present invention further provides an electronic device, where the electronic device includes:
the information acquisition module is used for acquiring the image and the first text information;
and the information processing module is used for generating second text information based on the image information and the first text information, and the second text information is used for representing the image information and the text information.
In the above scheme,
the information acquisition module is used for extracting visual features from the image;
the information processing module is used for coding at least two different types of text information of the image to obtain a coding result representing the semantics of the text information.
In the above scheme,
and the information processing module is used for decoding based on the visual characteristics and the coding result to obtain second text information which is fused with the visual characteristics and the text information to describe the image.
In the above scheme,
the information acquisition module is used for alternately processing the image through a convolutional layer and a max-pooling layer of a convolutional neural network model to obtain a down-sampling result of the image;
and the information acquisition module is used for processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual characteristics of the image.
In the above scheme,
the information acquisition module is used for processing the image visual features through an average pooling layer of the convolutional neural network model to obtain labels representing the classification of the images.
In the above scheme,
the information processing module is used for performing word-level coding on at least two types of text information of the image through neural network models corresponding to the different types of text information;
and the information processing module is used for performing sentence-level coding on the word-level coding result.
In the above scheme,
the information processing module is used for performing sentence-level decoding on the coding result through a first decoder model;
and the information processing module is used for performing word-level decoding on the sentence-level decoding result through a second decoder model.
In the above scheme,
the information processing module is further used for distributing corresponding weights to the visual features and the coding results through an attention model;
the information processing module is further configured to input the first weight matrix, the second weight matrix, the encoding result, and the visual characteristic into the first decoder for decoding.
In the above solution, the electronic device further includes:
the training module is used for training a convolutional neural network model for extracting visual features from the image based on an image sample and a classification label of the image sample;
the training module is used for training a first decoder model based on sentence samples and the corresponding decoding results;
the training module is configured to train a second decoder model based on the word samples and the corresponding decoding results.
In the above scheme,
when the image is a medical image of a diseased part, the text information comprises an indication and a clinical report of the diseased part, and the second text information comprises a diagnosis result of the diseased part.
The present invention also provides an electronic device, including:
a memory for storing executable instructions;
a processor, configured to execute the executable instructions stored in the memory, to:
acquiring image information and first text information;
and generating second text information based on the image information and the first text information, wherein the second text information is used for representing the image information and the text information content.
The acquiring of the image information and the first text information includes:
extracting visual features from the image;
and coding at least two different types of text information of the image to obtain a coding result representing the semantics of the text information.
Generating second text information based on the image information and the first text information, including:
performing decoding based on the visual features and the coding result to obtain second text information that fuses the visual features and the text information to describe the image.
The visual feature extraction from the image comprises:
alternately processing the image through a convolutional layer and a max-pooling layer of a convolutional neural network model to obtain a down-sampling result of the image;
and processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual characteristics of the image.
The method further comprises the following steps:
and processing the image visual characteristics through an average pooling layer of the convolutional neural network model to obtain the label representing the classification of the image.
The encoding of at least two different types of textual information of the image includes:
performing word-level coding on at least two types of text information of the image through neural network models corresponding to the different types of text information;
and carrying out sentence-level coding on the coding result of the word level.
The decoding based on the visual features and the encoding results comprises:
performing sentence-level decoding on the coding result through a first decoder model;
and performing word-level decoding on the sentence-level decoding result through a second decoder model.
The method further comprises the following steps:
distributing corresponding weights to the visual features and the coding results through an attention model;
and inputting the first weight matrix, the second weight matrix, the coding result and the visual features into the first decoder for decoding.
The method further comprises the following steps:
training a convolutional neural network model for extracting visual features from the image based on image samples and classification labels of the image samples;
training a first decoder model based on sentence samples and the corresponding decoding results;
a second decoder model is trained based on the word samples and the corresponding decoding results.
When the image is a medical image of a diseased part, the first text information comprises an indication and a clinical report of the diseased part, and the second text information comprises a diagnosis result of the diseased part.
In embodiments of the invention, second text information that can represent the content of both the image information and the text information is generated from the acquired image information and first text information, so that automatic identification of the image is achieved. Because the output second text information represents both the image and the text content, a reader of the second text information can understand the image and the first text information more clearly and form an intuitive impression.
Drawings
FIG. 1 is a schematic flow chart of an alternative image recognition method provided by an embodiment of the present invention;
FIG. 2 is an alternative schematic diagram of an electronic device according to an embodiment of the invention;
FIG. 3 is an alternative schematic diagram of an electronic device according to an embodiment of the invention;
FIG. 4 is an alternative schematic diagram of an electronic device provided by an embodiment of the invention;
FIG. 5 is a schematic diagram of processing images at a convolutional layer and a pooling layer with an activation function;
FIG. 6 is a schematic flow chart of an alternative image recognition method according to an embodiment of the present invention;
FIG. 7 is an alternative structural schematic diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Fig. 1 is a schematic flow chart of an optional image recognition method according to an embodiment of the present invention. The steps shown in Fig. 1 are described below.
Step 101: image information and first text information are acquired.
Step 102: generating second text information based on the image information and the first text information;
wherein the second text information is used for representing the image information and the text information content.
In an embodiment of the present invention, acquiring the image information and the first text information includes: extracting visual features from the image; and encoding at least two different types of text information of the image to obtain an encoding result representing the semantics of the text information. With this technical scheme, the image information and the text information can be extracted accurately. Specifically, the image information may be a photograph or a medical image, and the first text information may be text information from at least two different sources.
In one embodiment of the present invention, generating the second text information based on the image information and the first text information includes: performing decoding based on the visual features and the encoding result to obtain second text information that fuses the visual features and the text information to describe the image. With this technical scheme, the visual features and the encoding result are decoded into second text information that fuses them, so that the extracted information is merged.
In one embodiment of the present invention, extracting visual features from the image includes: alternately processing the image through the convolutional layers and max-pooling layers of a convolutional neural network model to obtain a down-sampling result of the image; and processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual features of the image. In this scheme, the alternation of convolutional and max-pooling layers lets the convolutional layers analyze each small patch of the input in depth and obtain more abstract features, while the pooling layers shrink the feature matrices; this in turn reduces the number of nodes in the final fully connected layer and thereby the number of parameters in the whole neural network.
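The down-sampling and pooling steps above can be sketched in miniature as follows (a hypothetical pure-Python stand-in: a real implementation would use the convolution, max-pooling, and average-pooling layers of a deep-learning framework, and the 4x4 input is invented for the example):

```python
def max_pool_2x2(image):
    """Down-sample a 2-D grid by keeping the maximum of each 2x2 block,
    standing in for the alternating convolution/max-pooling stages."""
    h, w = len(image), len(image[0])
    return [[max(image[r][c], image[r][c + 1],
                 image[r + 1][c], image[r + 1][c + 1])
             for c in range(0, w, 2)]
            for r in range(0, h, 2)]

def global_average_pool(feature_map):
    """Collapse a 2-D feature map into one scalar visual feature,
    as a final average pooling layer does per channel."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)

image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 1, 0],
         [2, 3, 4, 5]]

downsampled = max_pool_2x2(image)           # 4x4 -> 2x2 down-sampling result
feature = global_average_pool(downsampled)  # scalar visual feature
```

Each pooling pass halves the spatial size, which is what reduces the node count of any subsequent fully connected layer.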
In an embodiment of the present invention, the image visual features may further be processed through an average pooling layer of the convolutional neural network model to obtain a label representing the classification of the image. Obtaining such classification labels makes it possible to classify multiple images, or to classify different features of the same image.
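A toy sketch of such a classification head follows (hypothetical: the feature values, class weights, and label names are invented, and a real model would learn the weights):

```python
def classify(pooled_features, class_weights, class_labels):
    """Score the pooled visual features against each class's weight vector
    and return the label of the highest-scoring class."""
    scores = [sum(f * w for f, w in zip(pooled_features, weights))
              for weights in class_weights]
    return class_labels[scores.index(max(scores))]

# Invented example: two pooled features, two classes.
label = classify([0.9, 0.1],
                 [[1.0, 0.0],    # weight vector for "normal"
                  [0.0, 1.0]],   # weight vector for "abnormal"
                 ["normal", "abnormal"])
```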
In one embodiment of the present invention, encoding at least two different types of text information of the image includes: performing word-level coding on the at least two types of text information through neural network models corresponding to the different types of text information; and performing sentence-level coding on the word-level coding result. In this scheme, word-level and sentence-level coding of the at least two types of text information can each be performed by a bidirectional long short-term memory recurrent neural network (bidirectional LSTM RNN), and the same encoder model may be used for the word-level or sentence-level coding of the different types of text information.
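The two-stage (word-level, then sentence-level) encoding can be sketched with a toy mean-pooling encoder; the deterministic character-code "embedding" below is a hypothetical stand-in for the bidirectional LSTM RNN named above:

```python
def word_vector(word, dim=4):
    """Toy word-level encoding: fold character codes into a fixed-size vector
    (stands in for a learned embedding plus a bidirectional LSTM)."""
    vec = [0.0] * dim
    for i, ch in enumerate(word):
        vec[i % dim] += ord(ch) / 100.0
    return vec

def encode_sentence(sentence, dim=4):
    """Word-level pass: encode each word, then mean-pool into one sentence code."""
    vecs = [word_vector(w, dim) for w in sentence.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def encode_text(sentences, dim=4):
    """Sentence-level pass: fuse the per-sentence codes into one text code."""
    codes = [encode_sentence(s, dim) for s in sentences]
    return [sum(col) / len(codes) for col in zip(*codes)]

# Invented report fragments standing in for two types of text information.
code = encode_text(["no acute fracture", "heart size is normal"])
```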
In one embodiment of the present invention, decoding based on the visual features and the encoding result includes: performing sentence-level decoding on the encoding result through a first decoder model; and performing word-level decoding on the sentence-level decoding result through a second decoder model. In this scheme, when the first decoder model is a sentence decoder, the second decoder model may be a Long Short-Term Memory (LSTM) network.
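The hierarchy can be pictured with the following toy decoder (everything here is invented for illustration: the vocabulary, the arithmetic "states", and the greedy word choice merely mimic the sentence-decoder/word-decoder split; a real word-level decoder would be an LSTM):

```python
VOCAB = ["the", "region", "appears", "normal", "abnormal"]

def sentence_decoder(context, n_sentences):
    """Sentence-level pass: derive one topic state per target sentence."""
    return [[(v + 0.1 * k) % 1.0 for v in context]
            for k in range(n_sentences)]

def word_decoder(topic_state, n_words):
    """Word-level pass: greedily map each step of the state to a vocab word."""
    return [VOCAB[int(topic_state[t % len(topic_state)] * 10) % len(VOCAB)]
            for t in range(n_words)]

def decode(context, n_sentences=2, n_words=3):
    """Full two-stage decoding: sentence states first, then words per sentence."""
    return [" ".join(word_decoder(state, n_words))
            for state in sentence_decoder(context, n_sentences)]
```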
In an embodiment of the present invention, the visual features and the encoding results may further be assigned corresponding weights through an attention model, and the first weight matrix, the second weight matrix, the encoding result, and the visual features are then input into the first decoder for decoding.
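A sketch of the assumed attention step (softmax-normalized scores weighting the encoded vectors before they enter the decoder; the vectors and score values are invented):

```python
import math

def softmax(scores):
    """Normalize raw attention scores into weights that sum to 1."""
    exps = [math.exp(s - max(scores)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(vectors, scores):
    """Weight each vector (visual feature or encoding result) by its
    attention weight and sum them into one context vector."""
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors))
            for d in range(dim)]

# Invented example: two encoded vectors with equal attention scores.
context = attend([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```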
In one embodiment of the invention, a convolutional neural network model for extracting visual features from an image is trained based on image samples and the classification labels of the image samples; a first decoder model is trained based on sentence samples and the corresponding decoding results; and a second decoder model is trained based on word samples and the corresponding decoding results. In this way, the neural network model and the different decoders can each be trained in a targeted manner.
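The shape of such supervised training, reduced to its simplest possible case, looks like this (a one-parameter linear model fitted by gradient descent; purely a hypothetical stand-in for training the CNN and the two decoders on their respective sample/label pairs):

```python
def train_linear(samples, labels, lr=0.1, epochs=100):
    """Fit y ~= w * x by stochastic gradient descent on squared error."""
    w = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            grad = 2 * (w * x - y) * x   # d/dw of (w*x - y)^2
            w -= lr * grad
    return w

# Invented data where the true relation is y = 2x.
w = train_linear([1.0, 2.0], [2.0, 4.0])
```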
In an embodiment of the invention, when the image is a medical image of a diseased part, the text information includes an indication of the diseased part and a clinical report, and the second text information includes a diagnosis result of the diseased part. With this scheme, a diagnosis result of the diseased part, obtained by fusing the medical image with the indication and the clinical report, can be output in natural language.
Fig. 2 is a schematic diagram of an optional structure of the electronic device according to an embodiment of the present invention. The modules shown in Fig. 2 are described below.
An information obtaining module 201, configured to obtain an image and first text information;
an information processing module 202, configured to generate second text information based on the image information and the first text information, where the second text information is used to represent the image information and the text information content.
In an embodiment of the present invention, the information obtaining module 201 is configured to extract visual features from the image, and the information processing module 202 is configured to encode at least two different types of text information of the image to obtain an encoding result representing the semantics of the text information. With this technical scheme, the image information and the text information can be extracted accurately. Specifically, the image information may be a photograph or a medical image, and the first text information may be text information from at least two different sources.
In an embodiment of the present invention, the information processing module 202 is configured to perform decoding based on the visual features and the encoding result to obtain second text information that fuses the visual features and the text information to describe the image. With this technical scheme, the visual features and the encoding result are decoded into second text information that fuses them, so that the extracted information is merged.
In an embodiment of the present invention, the information obtaining module 201 is configured to alternately process the image through a convolutional layer and a max-pooling layer of a convolutional neural network model to obtain a down-sampling result of the image, and to process the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual features of the image. In this scheme, the alternation of convolutional and max-pooling layers lets the convolutional layers analyze each small patch of the input in depth and obtain more abstract features, while the pooling layers shrink the feature matrices; this in turn reduces the number of nodes in the final fully connected layer and thereby the number of parameters in the whole neural network.
In an embodiment of the present invention, the information obtaining module 201 is configured to process the image visual features through an average pooling layer of the convolutional neural network model to obtain labels representing the classification of the images. Obtaining such classification labels makes it possible to classify multiple images, or to classify different features of the same image.
In an embodiment of the present invention, the information processing module 202 is configured to perform word-level coding on at least two types of text information of the image through neural network models corresponding to the different types of text information, and to perform sentence-level coding on the word-level coding result. In this scheme, word-level and sentence-level coding of the at least two types of text information can each be performed by a bidirectional long short-term memory recurrent neural network (bidirectional LSTM RNN), and the same encoder model may be used for the word-level or sentence-level coding of the different types of text information.
In an embodiment of the present invention, the information processing module 202 is configured to perform sentence-level decoding on the encoding result through a first decoder model, and to perform word-level decoding on the sentence-level decoding result through a second decoder model. In this scheme, when the first decoder model is a sentence decoder, the second decoder model may be a Long Short-Term Memory (LSTM) network.
In an embodiment of the present invention, the information processing module 202 is further configured to assign corresponding weights to the visual features and the encoding results through an attention model; the information processing module 202 is further configured to input the first weight matrix, the second weight matrix, the encoding result, and the visual characteristic into the first decoder for decoding.
In one embodiment of the present invention, the electronic device further includes a training module configured to: train a convolutional neural network model for extracting visual features from the image based on an image sample and the classification label of the image sample; train a first decoder model based on sentence samples and the corresponding decoding results; and train a second decoder model based on word samples and the corresponding decoding results. In this way, the neural network model and the different decoders can each be trained in a targeted manner.
In an embodiment of the invention, when the image is a medical image of a diseased part, the text information includes an indication of the diseased part and a clinical report, and the second text information includes a diagnosis result of the diseased part. With this scheme, a diagnosis result of the diseased part, obtained by fusing the medical image with the indication and the clinical report, can be output in natural language.
Fig. 3 is a schematic diagram of an optional structure of the electronic device according to an embodiment of the present invention. The modules shown in Fig. 3 are described below.
The image encoder 301 is configured to alternately process the image through a convolutional layer and a max-pooling layer of a convolutional neural network model to obtain a down-sampling result of the image, and to process the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual features of the image.
The text encoder 302 is configured to obtain the first text information and encode the obtained first text information.
A text decoder 303, configured to generate second text information based on the image visual feature and the first text information, where the second text information is used to characterize the image visual feature and the text information content. The information processing procedures of the image encoder 301, the text encoder 302 and the text decoder 303 are as shown in fig. 4.
Fig. 4 is a schematic diagram of an optional structure of the electronic device according to an embodiment of the present invention. The modules shown in Fig. 4 are described below.
The first neural network 401 is configured to encode the first-type text information in the first text information to obtain an encoding result representing the semantics of that first-type text information.
A first text decoder 402 is configured to decode the encoding result of the first neural network 401 so as to output the first-type text information in the first text information in natural language.
The second neural network 403 is configured to encode the first information in the second-type text information in the first text information to obtain an encoding result representing the semantics of that first information; specifically, this first information includes at least two sentences.
A second text decoder 404 is configured to decode the encoding result of the second neural network 403 so as to output, in natural language, the first information in the second-type text information in the first text information.
The third neural network 405 is configured to encode the second information in the second-type text information in the first text information to obtain an encoding result representing the semantics of that second information; specifically, this second information includes at least two sentences.
A third text decoder 406 is configured to decode the encoding result of the third neural network 405 so as to output, in natural language, the second information in the second-type text information in the first text information.
In an embodiment of the present invention, the first neural network 401, the second neural network 403, and the third neural network 405 may all be bidirectional long short-term memory recurrent neural networks (Bi-directional LSTM RNNs), and the encoder models corresponding to the different types of text information may be the same; likewise, the first text decoder 402, the second text decoder 404, and the third text decoder 406 may be decoders of the same type.
In an embodiment of the present invention, through the word-level encoding and sentence-level encoding performed by the first neural network 401, the second neural network 403, and the third neural network 405, a sentence-level encoding result that fuses the first text information from at least two different sources can be obtained.
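The word-level then sentence-level encoding hierarchy can be illustrated with a minimal pure-Python sketch. The hash-style pseudo-embeddings and mean pooling below are illustrative stand-ins for the patent's Bi-directional LSTM encoders, and every function name here is hypothetical:

```python
def word_vector(word, dim=4):
    """Toy word-level embedding: a deterministic vector derived from the
    characters. A real system would use learned embeddings + a Bi-LSTM."""
    s = sum(ord(c) for c in word)
    return [((s * (i + 1)) % 97) / 97.0 for i in range(dim)]

def encode_sentence(sentence, dim=4):
    """Word-level encoding followed by pooling into one sentence vector."""
    vecs = [word_vector(w, dim) for w in sentence.split()]
    return [sum(col) / len(vecs) for col in zip(*vecs)]

def encode_report(sentences, dim=4):
    """Sentence-level encoding: fuse the per-sentence vectors into a single
    representation of the whole text source (e.g. a multi-sentence report)."""
    vecs = [encode_sentence(s, dim) for s in sentences]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Feeding each text source (indication, report sections) through such a two-level encoder yields one fused sentence-level vector per source, which is the shape of input the decoders below expect.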
A convolutional neural network 407 for extracting visual features from the image.
In one embodiment of the present invention, the illustrated solution supports images in any format, including but not limited to JPG, PNG, TIF, and BMP. In practice, to ensure uniform processing and an acceptable processing rate, a received sample image may first be converted into a uniform format supported by the system before further processing. Likewise, to match the processing capacity of the system, sample images of different sizes may be cut into the fixed size the system supports before processing.
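The "cut to a fixed size the system supports" step can be sketched as a centre crop over a 2-D pixel grid. A real implementation would use an image library; `center_crop` is a hypothetical helper for illustration only:

```python
def center_crop(image, size):
    """Crop a 2-D pixel grid (list of rows) to size x size around its centre,
    mirroring the 'cut sample images to a fixed supported size' step."""
    h, w = len(image), len(image[0])
    if h < size or w < size:
        raise ValueError("image smaller than target size")
    top = (h - size) // 2
    left = (w - size) // 2
    return [row[left:left + size] for row in image[top:top + size]]
```

For example, a 6 x 6 grid cropped with `size=4` keeps the central 4 x 4 block, so every image handed to the network has the same spatial dimensions.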
In one embodiment of the present invention, extracting visual features from the image includes: performing alternating processing on the image through a convolution layer and a maximum pooling layer of a convolutional neural network model to obtain a down-sampling result of the image; and processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual features of the image. In the alternative configuration of the electronic device shown in fig. 4, when a medical image of a patient site is identified by the convolutional neural network, the first text information includes an indication of the patient site and a clinical report, and the clinical report includes at least two statements.
In an embodiment of the present invention, the visual features of the image may be further processed through the average pooling layer of the convolutional neural network model to obtain labels representing the classification of the image. With such labels, multiple images can be classified, or different features of the same image can be distinguished.
In one embodiment of the present invention, the visual processing of an image with an original size of 256 pixels by 256 pixels is implemented by the convolutional neural network through an activation function, as shown in fig. 5.
The attention model 408 is used to assign corresponding weights to the visual features and the encoding results to obtain a first weight matrix and a second weight matrix, where the weight matrices represent the significance of the target features.
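The weight assignment performed by the attention model can be sketched as a dot-product score followed by a softmax. This is a generic attention mechanism offered as an illustration, not the patent's exact formulation:

```python
import math

def softmax(scores):
    """Normalise raw relevance scores into attention weights summing to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(query, features):
    """Toy attention: score each feature vector against the decoder query
    by dot product, then softmax the scores into one weight row."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    return softmax(scores)
```

Applying `attention_weights` once over the visual features and once over the encoding results gives two weight rows, which play the role of the first and second weight matrices passed to the decoder.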
And a fourth text decoder 409, configured to receive the first weight matrix, the second weight matrix, the encoding result, and the visual characteristics, and perform corresponding decoding.
An information generator 410 for sending the processing result of the fourth text decoder to a second neural network.
The second neural network includes:
a first decoder model 411, configured to perform statement-level decoding on the encoded result;
and the second decoder model 412 is configured to perform word-level decoding on the sentence-level decoding result to obtain second text information that fuses the visual features and the text information to describe the image.
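The two-stage decoding above (sentence-level topics first, then word-level output) can be sketched as follows. The scoring rules are toy stand-ins for the LSTM decoder models, and all names are hypothetical:

```python
def decode_sentences(fused_vector, n_sentences=3):
    """Sentence-level decoding: emit one topic vector per sentence to be
    written (a real system would run a recurrent sentence decoder)."""
    return [[x * (i + 1) / n_sentences for x in fused_vector]
            for i in range(n_sentences)]

def decode_words(topic_vector, vocab):
    """Word-level decoding: greedily pick the vocabulary entry whose toy
    score matches the topic (illustrative only, not a trained decoder)."""
    score = sum(topic_vector)
    index = int(score * 1000) % len(vocab)
    return vocab[index]

def generate_report(fused_vector, vocab):
    """Chain the two stages: topics from the fused encoding, words per topic."""
    return [decode_words(t, vocab) for t in decode_sentences(fused_vector)]
```

The point of the sketch is the control flow: the first decoder model fixes what each sentence is about, and the second decoder model turns each topic into words, producing the second text information sentence by sentence.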
In one embodiment of the present invention, the electronic device further includes a training module, used to: train a convolutional neural network model for extracting visual features from the image, based on image samples and the classification labels of the image samples; train a first decoder model based on statement samples and the corresponding decoding results; and train a second decoder model based on word samples and the corresponding decoding results. In this way, the neural network model and the different decoders can each be trained in a targeted manner.
In an embodiment of the invention, when the accuracies observed over multiple rounds of training the neural network model and the different decoders are judged to have stabilized and no longer fluctuate, the trained neural network model and decoders have reached a stable state, and training need not continue.
In an embodiment of the present invention, a training iteration count of 2000 may be preset; when the number of model training iterations reaches 2000, the currently trained neural network model and decoders may be deemed to have reached a stable state, and training may be stopped.
Fig. 5 is a schematic diagram of processing an image at the convolution and pooling layers through an activation function. As shown in fig. 5, convolution processing and pooling processing are performed on an image with an original size of 256 pixels by 256 pixels at the convolution layer and the pooling layer, respectively, through the activation function, to obtain the visual features of the image. Through the alternating processing of the convolution layers and maximum pooling layers of the convolutional neural network model, the convolution layers analyze each small patch of the input in greater depth to obtain more abstract features, while pooling reduces the size of the matrices; this in turn reduces the number of nodes in the final fully connected layer, and therefore the number of parameters in the whole neural network.
Fig. 6 is an alternative flowchart of the image recognition method according to the embodiment of the present invention, and as shown in fig. 6, the steps shown in the alternative flowchart of the image recognition method according to the embodiment of the present invention are described.
Step 601: and processing the image intersection through a convolution layer and a maximum pooling layer of the convolution neural network model to obtain a down-sampling result of the image.
Wherein the image is a facial feature image of a person.
Step 602: and processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual characteristics of the image.
In this way, the visual features corresponding to the facial features of the person in the image can be extracted. Through the alternating processing of the convolution layers and maximum pooling layers of the convolutional neural network model, the convolution layers analyze each small patch of the input in greater depth to obtain more abstract features, while pooling reduces the size of the matrices; this reduces the number of nodes in the final fully connected layer, and therefore the number of parameters in the whole neural network, which is especially useful when the number of face images is large.
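The down-sampling role of max pooling (step 601) and the feature-collapsing role of average pooling (step 602) can be sketched in pure Python. The convolution step is omitted; this illustrates only the pooling semantics, not the patented network:

```python
def max_pool_2x2(grid):
    """2x2 max pooling: the down-sampling half of the alternating
    convolution / max-pooling stages. Halves each spatial dimension."""
    return [[max(grid[r][c], grid[r][c + 1],
                 grid[r + 1][c], grid[r + 1][c + 1])
             for c in range(0, len(grid[0]) - 1, 2)]
            for r in range(0, len(grid) - 1, 2)]

def global_average_pool(grid):
    """Average pooling over the whole map: collapses the down-sampled
    result into a single visual-feature value per channel."""
    values = [v for row in grid for v in row]
    return sum(values) / len(values)
```

Running `max_pool_2x2` on a 4 x 4 activation map yields a 2 x 2 map, and `global_average_pool` then reduces it to one scalar, mirroring how the matrix sizes, and hence the fully connected layer's node count, shrink.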
In an embodiment of the present invention, the visual features of the facial features may be further processed through the average pooling layer of the convolutional neural network model to obtain labels representing the classification of the facial feature images. With such labels, multiple facial images can be classified, or different features of the same facial image can be distinguished.
Step 603: and performing word-level coding on at least two types of text information of the picture through the neural network models corresponding to different types of first text information, and performing statement-level coding on the word-level coding result.
Step 604: and distributing corresponding weights to the visual features and the coding results through an attention model, and inputting the first weight matrix, the second weight matrix, the coding results and the visual features into the first decoder for decoding.
In one embodiment of the present invention, encoding at least two different types of text information of the image includes: performing word-level coding on the at least two types of text information through the neural network model corresponding to each type of text information; and performing sentence-level coding on the word-level coding results. Word-level coding and sentence-level coding can each be performed through a bidirectional long short-term memory recurrent neural network (Bi-directional LSTM RNN), and the same encoder model can be used for the word-level or sentence-level coding of the at least two types of text information.
Step 605: and performing sentence-level decoding on the encoding result and performing word-level decoding on the decoding result of the second decoder model in the first decoder model to form second text information which fuses the visual features and the text information to describe the image.
Fig. 7 is an alternative structure diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 7, the electronic device 700 may be a mobile phone, a computer, a digital broadcast terminal, an information transceiver device, a game console, a tablet device, a medical device, an exercise device, or a personal digital assistant with an image recognition function. The electronic device 700 shown in fig. 7 includes: at least one processor 701, a memory 702, at least one network interface 704, and a user interface 703. The various components in the electronic device 700 are coupled together by a bus system 705, which enables communications among them. In addition to a data bus, the bus system 705 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are labeled in fig. 7 as the bus system 705.
The user interface 703 may include, among other things, a display, a keyboard, a mouse, a trackball, a click wheel, a key, a button, a touch pad, or a touch screen.
It will be appreciated that the memory 702 can be volatile memory, nonvolatile memory, or a combination of both. The nonvolatile memory may be a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a Ferroelectric Random Access Memory (FRAM), a Flash Memory, a magnetic surface memory, an optical disc, or a Compact Disc Read-Only Memory (CD-ROM); the magnetic surface memory may be disk memory or tape memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDR SDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 702 described in the embodiments of the invention is intended to comprise these and any other suitable types of memory.
The memory 702 in embodiments of the present invention is configured to store various types of data, such as image data, text data, and image recognition programs, to support the operation of the electronic device 700. Examples of such data include: any computer program for operating on the electronic device 700, such as an operating system 7021 and application programs 7022. The operating system 7021 includes various system programs, such as a framework layer, a core library layer, and a driver layer, for implementing various basic services and processing hardware-based tasks. The application programs 7022 may include various applications, such as a client or application with an image recognition function, for implementing application services including acquiring image information and first text information, and generating second text information based on the image information and the first text information. A program implementing the image recognition method according to an embodiment of the present invention may be included in the application programs 7022.
The method disclosed in the above embodiments of the present invention may be applied to, or implemented by, the processor 701. The processor 701 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by integrated logic circuits of hardware in the processor 701 or by instructions in the form of software. The processor 701 may be a general purpose processor, a Digital Signal Processor (DSP), another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor 701 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of the invention may be directly embodied as being executed by a hardware decoding processor, or executed by a combination of the hardware and software modules in a decoding processor. The software module may be located in a storage medium in the memory 702; the processor 701 reads the information in the memory 702 and completes the steps of the foregoing method in combination with its hardware.
In an exemplary embodiment, the electronic device 700 may be implemented by one or more Application Specific Integrated Circuits (ASICs), DSPs, Programmable Logic Devices (PLDs), Complex Programmable Logic Devices (CPLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, Micro Controller Units (MCUs), microprocessors, or other electronic components, for performing the image recognition method.
In an exemplary embodiment, the present invention further provides a computer readable storage medium, such as the memory 702 comprising a computer program, which is executable by the processor 701 of the electronic device 700 to perform the steps of the foregoing method. The computer readable storage medium can be Memory such as FRAM, ROM, PROM, EPROM, EEPROM, Flash Memory, magnetic surface Memory, optical disk, or CD-ROM; or may be a variety of devices including one or any combination of the above memories, such as a mobile phone, computer, tablet device, personal digital assistant, etc.
An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; the computer program, when executed by a processor, performs:
acquiring image information and first text information; and
generating second text information based on the image information and the first text information, where the second text information represents the image information and the content of the text information.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, embodiments of the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware aspects. Furthermore, embodiments of the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including magnetic disk storage, optical storage, and the like) having computer-usable program code embodied in the medium.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to the embodiments of the invention. It will be understood that each flow and/or block of the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, a special purpose computer, an embedded processor, or another programmable data processing apparatus to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing apparatus produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus, causing a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer-implemented process, such that the instructions executed on the computer or other programmable apparatus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
The above description is only exemplary of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents, improvements, etc. that are within the spirit and principle of the present invention should be included in the present invention.

Claims (8)

1. An image recognition method, characterized in that the method comprises:
extracting visual features from the image;
coding at least two different types of text information of the image to obtain a coding result representing the semantics of the text information;
decoding based on the visual features and the coding result to obtain second text information which fuses the visual features and the text information to describe the image, wherein the second text information is used for representing the image information and the text information content.
2. The method of claim 1, wherein the extracting visual features from the image comprises:
performing alternating processing on the image through a convolution layer and a maximum pooling layer of a convolutional neural network model to obtain a down-sampling result of the image;
and processing the down-sampling result through an average pooling layer of the convolutional neural network model to obtain the visual characteristics of the image.
3. The method of claim 1, wherein encoding at least two different types of textual information for the image comprises:
performing word-level coding on at least two types of text information of the image through a neural network model corresponding to different types of text information;
and carrying out sentence-level coding on the coding result of the word level.
4. The method of claim 1, wherein the decoding based on the visual characteristic and the encoding result comprises:
decoding the coding result at statement level through a first decoder model;
and performing word-level decoding on the sentence-level decoding result through a second decoder model.
5. The method of claim 4, further comprising:
distributing corresponding weights to the visual features and the coding results through an attention model to obtain a first weight matrix and a second weight matrix;
inputting the first weight matrix, the second weight matrix, the coding result and the visual features into the first decoder model for decoding.
6. The method of claim 1, further comprising:
training a convolutional neural network model for extracting visual features from the image based on image samples and classification labels of the image samples;
training a first decoder model for performing statement-level decoding based on statement samples and corresponding decoding results;
a second decoder model for word-level decoding is trained based on the word samples and corresponding decoding results.
7. An electronic device, characterized in that the electronic device comprises:
the information acquisition module is used for extracting visual features from the image;
the information acquisition module is also used for coding at least two different types of text information of the image to obtain a coding result representing the semantics of the text information;
and the information processing module is used for decoding based on the visual features and the coding result to obtain second text information which fuses the visual features and the text information to describe the image, wherein the second text information is used for representing the image information and the text information content.
8. An electronic device, characterized in that the electronic device comprises:
a memory for storing executable instructions;
a processor for executing the executable instructions stored by the memory to perform the image recognition method of any one of claims 1 to 6.
CN201810260038.0A 2018-03-27 2018-03-27 Image identification method and electronic equipment Active CN108549850B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810260038.0A CN108549850B (en) 2018-03-27 2018-03-27 Image identification method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810260038.0A CN108549850B (en) 2018-03-27 2018-03-27 Image identification method and electronic equipment

Publications (2)

Publication Number Publication Date
CN108549850A CN108549850A (en) 2018-09-18
CN108549850B true CN108549850B (en) 2021-07-16

Family

ID=63517392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810260038.0A Active CN108549850B (en) 2018-03-27 2018-03-27 Image identification method and electronic equipment

Country Status (1)

Country Link
CN (1) CN108549850B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543714B (en) * 2018-10-16 2020-03-27 北京达佳互联信息技术有限公司 Data feature acquisition method and device, electronic equipment and storage medium
CN109614613B (en) * 2018-11-30 2020-07-31 北京市商汤科技开发有限公司 Image description statement positioning method and device, electronic equipment and storage medium
CN109933320B (en) * 2018-12-28 2021-05-18 联想(北京)有限公司 Image generation method and server
CN112231275B (en) * 2019-07-14 2024-02-27 阿里巴巴集团控股有限公司 Method, system and equipment for classifying multimedia files, processing information and training models
CN110717514A (en) * 2019-09-06 2020-01-21 平安国际智慧城市科技股份有限公司 Session intention identification method and device, computer equipment and storage medium
US11380116B2 (en) * 2019-10-22 2022-07-05 International Business Machines Corporation Automatic delineation and extraction of tabular data using machine learning
CN111191715A (en) * 2019-12-27 2020-05-22 深圳市商汤科技有限公司 Image processing method and device, electronic equipment and storage medium
CN111755118B (en) * 2020-03-16 2024-03-08 腾讯科技(深圳)有限公司 Medical information processing method, device, electronic equipment and storage medium
CN111460791B (en) * 2020-03-30 2023-12-01 北京百度网讯科技有限公司 Text classification method, device, equipment and storage medium
CN111861949B (en) * 2020-04-21 2023-07-04 北京联合大学 Multi-exposure image fusion method and system based on generation countermeasure network
CN112686263B (en) * 2020-12-29 2024-04-16 科大讯飞股份有限公司 Character recognition method, character recognition device, electronic equipment and storage medium
CN117710510B (en) * 2024-02-04 2024-06-11 支付宝(杭州)信息技术有限公司 Image generation method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106126596A (en) * 2016-06-20 2016-11-16 中国科学院自动化研究所 A kind of answering method based on stratification memory network
CN106227721A (en) * 2016-08-08 2016-12-14 中国科学院自动化研究所 Chinese Prosodic Hierarchy prognoses system
CN106339591A (en) * 2016-08-25 2017-01-18 汤平 Breast cancer prevention self-service health cloud service system based on deep convolutional neural network
CN106897573A (en) * 2016-08-01 2017-06-27 12西格玛控股有限公司 Use the computer-aided diagnosis system for medical image of depth convolutional neural networks
CN106980683A (en) * 2017-03-30 2017-07-25 中国科学技术大学苏州研究院 Blog text snippet generation method based on deep learning
WO2017163230A1 (en) * 2016-03-24 2017-09-28 Ramot At Tel-Aviv University Ltd. Method and system for converting an image to text
CN107665736A (en) * 2017-09-30 2018-02-06 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN107818306A (en) * 2017-10-31 2018-03-20 天津大学 A kind of video answering method based on attention model


Also Published As

Publication number Publication date
CN108549850A (en) 2018-09-18

Similar Documents

Publication Publication Date Title
CN108549850B (en) Image identification method and electronic equipment
CN109874029B (en) Video description generation method, device, equipment and storage medium
JP7112536B2 (en) Method and apparatus for mining entity attention points in text, electronic device, computer-readable storage medium and computer program
CN110705214B (en) Automatic coding method and device
CN110956018B (en) Training method of text processing model, text processing method, text processing device and storage medium
CN110288980A (en) Audio recognition method, the training method of model, device, equipment and storage medium
CN111598979B (en) Method, device and equipment for generating facial animation of virtual character and storage medium
CN111914076B (en) User image construction method, system, terminal and storage medium based on man-machine conversation
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN114676234A (en) Model training method and related equipment
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN113516152A (en) Image description method based on composite image semantics
CN116385584A (en) Poster generation method, device and system and computer readable storage medium
CN116050425A (en) Method for establishing pre-training language model, text prediction method and device
CN115757725A (en) Question and answer processing method and device, computer equipment and storage medium
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN110390015B (en) Data information processing method, device and system
CN110750637B (en) Text abstract extraction method, device, computer equipment and storage medium
CN115810215A (en) Face image generation method, device, equipment and storage medium
CN111723186A (en) Knowledge graph generation method based on artificial intelligence for dialog system and electronic equipment
CN116956953A (en) Translation model training method, device, equipment, medium and program product
CN114155417B (en) Image target identification method and device, electronic equipment and computer storage medium
CN115050371A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN115132182A (en) Data identification method, device and equipment and readable storage medium
CN113392722A (en) Method and device for recognizing emotion of object in video, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant