CN110414673B - Multimedia recognition method, device, equipment and storage medium - Google Patents

Multimedia recognition method, device, equipment and storage medium

Info

Publication number
CN110414673B
Authority
CN
China
Prior art keywords
target
recognition model
multimedia
pruning
neurons
Prior art date
Legal status
Active
Application number
CN201910699746.9A
Other languages
Chinese (zh)
Other versions
CN110414673A (en)
Inventor
曹效伦
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910699746.9A
Publication of CN110414673A
Application granted
Publication of CN110414673B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure relates to a multimedia recognition method, apparatus, device and storage medium. The method includes: acquiring a first target training layer from the training layers of a multimedia recognition model, where the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to the input layer of the multimedia recognition model; acquiring a plurality of target neurons in the first target training layer; determining the weight of each target neuron in the multimedia recognition model; determining target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model; performing a pruning operation on the target pruning neurons; training the pruned multimedia recognition model to obtain a target multimedia recognition model; and recognizing multimedia data based on the target multimedia recognition model. By selecting the target pruning neurons from the neurons close to the output layer, the accuracy of the recognition result of the multimedia recognition model after the pruning operation is improved.

Description

Multimedia recognition method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a multimedia recognition method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence technology, neural network models play an important role in the recognition of multimedia data such as images, speech, and natural language. To ensure the accuracy of the recognition result, a multimedia recognition model that meets the recognition requirements must be obtained by continuously adjusting the parameters in the model's training layers. However, because the connection structure of a neural network model is complex and the parameters of its components are tightly interrelated, the multimedia recognition model carries a large amount of redundant computation, which slows down its analysis and processing.
To address this problem, the related art determines the importance of all neurons in the multimedia recognition model and prunes the neurons whose importance does not meet a requirement, thereby reducing the redundant computation of the multimedia recognition model and increasing its processing speed.
However, the pruning method in the related art tends to prune low-importance neurons that are close to the input layer. If a neuron close to the input layer is pruned, the parameters of all neurons in the later training layers that are associated with the pruned neuron will change, and when the number of neurons with changed parameters is large, the accuracy of the recognition result of the whole multimedia recognition model for multimedia data is affected.
Disclosure of Invention
The present disclosure provides a multimedia recognition method, apparatus, device and storage medium, so as to at least solve the problem that the neuron pruning manner in the related art affects the accuracy of the recognition result of the whole neural network model. The technical solution of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided a multimedia recognition method, the method including: acquiring a first target training layer from the training layers of a multimedia recognition model, where the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to the input layer of the multimedia recognition model; acquiring a plurality of target neurons in the first target training layer; determining the weight of each target neuron in the multimedia recognition model; determining target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model; performing a pruning operation on the target pruning neurons; training the pruned multimedia recognition model to obtain a target multimedia recognition model; and recognizing multimedia data based on the target multimedia recognition model.
Optionally, the determining the weight of each target neuron in the multimedia recognition model includes: determining a quantized value obtained after pruning each target neuron, and taking the quantized value corresponding to each target neuron as the weight of that target neuron, where the quantized value represents the degree of change of the output result of the multimedia recognition model.
Optionally, the determining target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model includes: sorting the quantized values corresponding to each target neuron; and determining the target pruning neurons among the plurality of target neurons according to the sorting result.
Optionally, the determining target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model includes: obtaining a quantized value distribution function according to the number of target neurons and the quantized value corresponding to each target neuron; determining a quantized value threshold according to the distribution function; and taking the target neurons whose corresponding quantized values are lower than the quantized value threshold as target pruning neurons.
Optionally, the training the pruned multimedia recognition model to obtain a target multimedia recognition model includes: training the pruned multimedia recognition model with training data to obtain a loss value, where the loss value represents the error between a predicted value and an actual value of the multimedia recognition model; and adjusting, according to the loss value, the parameters of reference neurons in the training layers of the pruned multimedia recognition model to obtain the target multimedia recognition model, where the reference neurons are associated with the target pruning neurons.
Optionally, after the parameters of the reference neurons in the training layers of the pruned multimedia recognition model are adjusted according to the loss value to obtain the target multimedia recognition model, the method further includes: acquiring the recognition accuracy of the target multimedia recognition model; when the recognition accuracy meets a target condition, acquiring a second target training layer from the training layers of the target multimedia recognition model, where the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model is greater than the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model; acquiring a plurality of target neurons in the second target training layer; and pruning the target neurons acquired from the second target training layer in the same manner as the target neurons in the first target training layer, until the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model meets a target number of layers.
According to a second aspect of the embodiments of the present disclosure, there is provided a multimedia recognition apparatus, the apparatus including: a first obtaining module configured to acquire a first target training layer from the training layers of a multimedia recognition model, where the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to the input layer of the multimedia recognition model; a second obtaining module configured to acquire a plurality of target neurons in the first target training layer; a first determining module configured to determine the weight of each target neuron in the multimedia recognition model; a second determining module configured to determine target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model; a pruning module configured to perform a pruning operation on the target pruning neurons; a training module configured to train the pruned multimedia recognition model to obtain a target multimedia recognition model; and a recognition module configured to recognize multimedia data based on the target multimedia recognition model.
Optionally, the first determining module is configured to determine a quantization value obtained by performing pruning on each target neuron, and use the quantization value corresponding to each target neuron as a weight of each target neuron, where the quantization value is used to characterize a degree of change of an output result of the multimedia recognition model.
Optionally, the second determining module is configured to sort the quantized values corresponding to each target neuron, and to determine the target pruning neurons among the plurality of target neurons according to the sorting result.
Optionally, the second determining module is configured to obtain a quantized value distribution function according to the number of target neurons and the quantized value corresponding to each target neuron; determine a quantized value threshold according to the distribution function; and take the target neurons whose corresponding quantized values are lower than the quantized value threshold as target pruning neurons.
Optionally, the training module is configured to train the pruned multimedia recognition model with training data to obtain a loss value, where the loss value represents the error between a predicted value and an actual value of the multimedia recognition model; and to adjust, according to the loss value, the parameters of reference neurons in the training layers of the pruned multimedia recognition model to obtain the target multimedia recognition model, where the reference neurons are associated with the target pruning neurons.
Optionally, the recognition module is further configured to obtain the recognition accuracy of the target multimedia recognition model; when the recognition accuracy meets a target condition, acquire a second target training layer from the training layers of the target multimedia recognition model, where the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model is greater than the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model; acquire a plurality of target neurons in the second target training layer; and prune the target neurons acquired from the second target training layer in the same manner as the target neurons in the first target training layer, until the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model meets the target number of layers.
According to a third aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing the processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method as in the first aspect or any one of the possible implementations of the first aspect.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium comprising: the instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the method as in the first aspect or any one of the possible implementations of the first aspect.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program (product) comprising: computer program code which, when run by a computer, causes the computer to perform the method of the above aspects.
The technical solution provided by the embodiments of the disclosure has at least the following beneficial effects:
the method obtains a first target training layer on the side close to the output layer of the multimedia recognition model, obtains a plurality of target neurons in the first target training layer, determines target pruning neurons according to the weights of the obtained target neurons, performs a pruning operation on the target pruning neurons, trains the pruned multimedia recognition model to obtain a target multimedia recognition model, and recognizes multimedia data based on the target multimedia recognition model. By selecting the target pruning neurons from the target neurons close to the output layer, the accuracy of the recognition result of the multimedia recognition model for multimedia data after the pruning operation is improved.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is a schematic diagram illustrating an application scenario of a multimedia recognition method according to an exemplary embodiment;
FIG. 2 is a flow diagram illustrating a multimedia recognition method according to an exemplary embodiment;
FIG. 3 is a diagram illustrating a multimedia recognition method according to an exemplary embodiment;
FIG. 4 is a block diagram illustrating a multimedia recognition device according to an example embodiment;
FIG. 5 is a block diagram illustrating an electronic device in accordance with an exemplary embodiment;
FIG. 6 is a diagram illustrating a terminal according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in other sequences than those illustrated or described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
With the development of artificial intelligence technology, neural network models play an important role in the recognition of multimedia data such as images, speech, and natural language. To ensure the accuracy of the recognition result, a multimedia recognition model that meets the recognition requirements must be obtained by continuously adjusting the parameters in the model's training layers. However, because the connection structure of a neural network model is complex and the parameters of its components are tightly interrelated, the multimedia recognition model carries a large amount of redundant computation, which slows down its analysis and processing.
To address this problem, the related art determines the importance of all neurons in the multimedia recognition model and prunes the neurons whose importance does not meet a requirement, thereby reducing the redundant computation of the multimedia recognition model and increasing its processing speed.
However, the pruning method in the related art tends to prune low-importance neurons that are close to the input layer. If a neuron close to the input layer is pruned, the parameters of all neurons in the later training layers that are associated with the pruned neuron will change, and when the number of neurons with changed parameters is large, the accuracy of the recognition result of the whole multimedia recognition model for multimedia data is affected. To solve these problems in the related art, the embodiments of the present application provide a new pruning method that preserves the accuracy of multimedia data recognition by the multimedia recognition model.
Before the technical solution provided by the embodiments of the present application is introduced, the notion of pruning a neural network model is defined. In a neural network, pruning a neural network model means finding the neuron that contributes least to the output features and removing that neuron together with the parameters it generates; this process is called pruning. As shown in fig. 1, the input layer of the neural network model receives (H, W, C) three-dimensional input data, where H and W represent the size of the image and C represents the number of its color channels. An (H', W', K)-dimensional output is obtained after computation by (K, M, N, C)-dimensional neurons, where K represents the number of neurons A in a given training layer. The neuron A that needs to be pruned is determined by comparing the contributions of the several neurons A to the output features. When one neuron A is pruned at this layer, the output of the output layer changes from (H', W', K) to (H', W', K-1); that is, after a neuron is pruned, the number of output channels of the subsequent layers of the neural network model is affected, which in turn affects the recognition result of the neural network model. Only one training layer is shown in fig. 1; a multimedia recognition model based on a neural network model may comprise multiple training layers.
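Illustratively, the shape change described above can be reproduced with the following minimal Python sketch (assuming the PyTorch library; the snippet is an illustration of the channel arithmetic, not the disclosed method itself):

    import torch
    import torch.nn as nn

    H, W, C = 32, 32, 3   # input height, width, and number of color channels
    K, M, N = 8, 3, 3     # K neurons A (filters) of spatial size M x N and depth C

    conv = nn.Conv2d(in_channels=C, out_channels=K, kernel_size=(M, N))
    x = torch.randn(1, C, H, W)     # one (H, W, C) input in NCHW layout
    print(conv(x).shape)            # torch.Size([1, K, H', W'])

    # Prune one neuron A: keep the remaining K-1 filters and their biases.
    idx = 2                         # the neuron judged to contribute least
    keep = [i for i in range(K) if i != idx]
    pruned = nn.Conv2d(in_channels=C, out_channels=K - 1, kernel_size=(M, N))
    with torch.no_grad():
        pruned.weight.copy_(conv.weight[keep])
        pruned.bias.copy_(conv.bias[keep])
    print(pruned(x).shape)          # torch.Size([1, K-1, H', W']), one channel fewer

As the last line shows, every later layer that consumed K input channels must now accept K-1 channels, which is exactly why pruning near the input layer disturbs far more of the network than pruning near the output layer.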
Based on the above introduction to pruning, fig. 2 is a flow chart illustrating a multimedia recognition method according to an exemplary embodiment. As shown in fig. 2, the multimedia recognition method is used in an electronic device, which may be a terminal or a server; the embodiments of the present application take a terminal as an example. The multimedia recognition model includes an input layer, training layers, and an output layer. The training layers contain neurons with parameters, and the output result of the multimedia recognition model is changed by adjusting the parameters of the neurons in the training layers. Training the multimedia recognition model means training the parameters of its convolutional layers and fully connected layers to adjust the model's output, where the parameters include the weights and offsets of the neurons. The method includes the following steps.
In step S21, a first target training layer is obtained from the training layers of the multimedia recognition model, and the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to the input layer of the multimedia recognition model.
Illustratively, the first target training layer may be one layer or multiple layers, and those skilled in the art may determine the number of target training layers according to the actual needs of the pruning operation. From left to right, the multimedia recognition model may include an input layer, training layers, and an output layer. When the target neurons are obtained from a training layer close to the output layer, the influence of the pruning operation on the other neurons of the training layers is reduced. For example, if there are 10 training layers in total and the target neurons to prune are obtained from the fifth layer (counting from left to right), the neurons in the sixth through tenth layers are affected; if the pruned neurons are instead selected from the target neurons of the tenth layer, or of the ninth and tenth layers, fewer neurons in the training layers are affected. Therefore, when the multimedia recognition model needs to be pruned, the neurons to prune can be selected from a training layer close to the output layer.
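Illustratively, the layer-selection rule of step S21 can be expressed as the following Python sketch; the indexing convention (0 for the training layer nearest the input) is an assumption made for illustration:

    def candidate_target_layers(num_training_layers):
        """Indices of training layers whose number of spacing layers to the
        output side is smaller than to the input side (step S21)."""
        return [i for i in range(num_training_layers)
                if (num_training_layers - 1 - i) < i]

    print(candidate_target_layers(10))  # [5, 6, 7, 8, 9]: the layers near the output

Under this convention, for the 10-layer example above, the candidate first target training layers are the sixth through tenth layers.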
In step S22, in the first target training layer, a plurality of target neurons are acquired.
For example, when the first target training layer is a single layer, the plurality of target neurons may be all of the neurons of that layer or a part of them. When the first target training layer comprises multiple layers, the plurality of target neurons may include all neurons of each layer, or any number of neurons selected from each target training layer. The number of target neurons is not limited in the embodiments of the present application.
In step S23, the weight of each target neuron in the multimedia recognition model is determined.
As an alternative embodiment of the present application, step S23 includes:
determining a quantized value obtained after a pruning operation is performed on each target neuron, and taking the quantized value corresponding to each target neuron as the weight of that target neuron. The quantized value represents the degree of change of the output result of the multimedia recognition model: a small quantized value indicates that the target neuron corresponding to the neuron to be pruned has a low weight in the multimedia recognition model, and a large quantized value indicates that it has a high weight.
Illustratively, in the embodiments of the present application, the quantized value may be the partial derivative of the output error of the output layer after each target neuron is pruned. The magnitude of the partial derivative represents the degree of influence of the neuron to be pruned on the multimedia recognition model. A large partial derivative means that, after the corresponding target neuron is pruned, the output result of the multimedia recognition model changes to a large degree; the pruned target neuron has a strong influence on the model, i.e., the corresponding target neuron has a high weight in the multimedia recognition model. A small partial derivative means that, after the corresponding target neuron is pruned, the output result changes only slightly; the pruned target neuron has little influence on the model, i.e., the corresponding target neuron has a low weight in the multimedia recognition model. Those skilled in the art may also determine the weights of the target neurons in the multimedia recognition model in other ways, which are not limited in the embodiments of the present application.
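Illustratively, such a quantized value can be estimated with the following Python sketch (assuming the PyTorch library). The first-order form |activation x partial derivative|, averaged over the batch and spatial positions, is one plausible realization of the "partial derivative of the output error" and is an assumption, not a limitation of the disclosure:

    import torch

    def neuron_quantized_values(activations, loss):
        """activations: (batch, K, H', W') output of the first target training
        layer, still attached to the autograd graph; loss: scalar output error.
        Returns one quantized value per neuron (output channel)."""
        grads, = torch.autograd.grad(loss, activations, retain_graph=True)
        # |activation * partial derivative| estimates how much the output
        # error would change if the corresponding neuron were pruned.
        return (activations * grads).abs().mean(dim=(0, 2, 3))

A small returned value marks a neuron whose removal barely changes the output result, i.e., a low-weight candidate for pruning.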
In step S24, a target pruning neuron is determined among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model.
As an alternative embodiment of the present application, step S24 includes:
s241a, sorting the quantization values corresponding to each target neuron.
And S242a, determining a target pruning neuron in the target neurons according to the sequencing result.
Illustratively, the obtained quantized values of the degree of change of the output result of the characterization multimedia recognition model corresponding to each target neuron are ranked from large to small. And according to the sequencing result, the target neuron corresponding to the minimum quantization value can be used as a target pruning neuron. Or selecting the target neuron corresponding to the quantized value of the last reference number as the target pruning neuron. The reference number may be determined according to an actual sorting result, for example, when the quantized values of the last reference number are the same or are all within the same threshold range, the target neurons corresponding to the quantized values of the reference number may all be regarded as target pruned neurons. The reference number is not limited in the examples of the present application. After determining the target pruning neuron, carrying out final pruning operation on the target pruning neuron in the multimedia recognition model
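Illustratively, this sorting variant of step S24 may be sketched in Python as follows; the reference number is passed in as a parameter because the disclosure leaves its value open:

    def select_by_ranking(quantized_values, reference_number=1):
        """Rank quantized values from large to small and return the indices
        of the target neurons holding the last `reference_number` positions."""
        order = sorted(range(len(quantized_values)),
                       key=lambda i: quantized_values[i],
                       reverse=True)          # large -> small
        return order[-reference_number:]      # smallest quantized values

    print(select_by_ranking([0.9, 0.05, 0.4, 0.06], reference_number=2))  # [3, 1]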
As another alternative embodiment of the present application, step S24 includes:
S241b, obtaining a quantized value distribution function according to the number of target neurons and the quantized value corresponding to each target neuron.
Illustratively, the target neurons are numbered one by one in sequence; the abscissa of the quantized value distribution function is the number of the target neuron, and the ordinate is the quantized value corresponding to that target neuron.
S242b, determining a quantized value threshold according to the distribution function.
Exemplarily, as shown in (1) and (2) of fig. 3, when the obtained distribution function follows an exponential distribution or a linear distribution, it is apparent that, for the same quantized value L, the number of target neurons included under the exponential distribution is larger than under the linear distribution. Different thresholds can therefore be determined based on the obtained distribution function. For example, if the number of pruned neurons is to be kept the same under the different distributions, the quantized value threshold may be set to L1 under the exponentially distributed function and to L2 under the linearly distributed function, with L1 smaller than L2.
S243b, taking the target neurons whose corresponding quantized values are lower than the quantized value threshold as target pruning neurons.
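Illustratively, steps S241b to S243b may be sketched in Python as follows; deriving the threshold as a fixed quantile of the distribution is an assumption, chosen so that the same number of neurons is pruned whether the quantized values follow an exponential-like or a linear-like distribution (the roles of L1 and L2 above):

    def select_by_threshold(quantized_values, prune_fraction=0.2):
        """Build the quantized-value distribution (neuron number vs. value),
        derive a threshold from it, and return the indices of target neurons
        whose quantized values are at or below that threshold."""
        ordered = sorted(quantized_values)            # the distribution
        cut = max(1, int(len(ordered) * prune_fraction))
        threshold = ordered[cut - 1]                  # plays the role of L1 or L2
        return [i for i, q in enumerate(quantized_values) if q <= threshold]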
In step S25, a pruning operation is performed on the target pruning neuron.
In step S26, the multimedia recognition model after pruning operation is trained to obtain a target multimedia recognition model.
As an alternative embodiment of the present application, step S26 includes:
First, the pruned multimedia recognition model is trained with training data to obtain a loss value, where the loss value represents the error between a predicted value and an actual value of the multimedia recognition model.
Illustratively, after the training data is used for training the multimedia recognition model after pruning, a loss value representing the error between the predicted value and the actual value of the multimedia recognition model is obtained through a loss function of the multimedia recognition model. The embodiment of the present application does not limit the specific form of the loss function, and those skilled in the art can select different loss functions according to the specific application scenario of the multimedia recognition model.
Second, the parameters of reference neurons in the training layers of the pruned multimedia recognition model are adjusted according to the loss value to obtain the target multimedia recognition model, where the reference neurons are associated with the target pruning neurons.
For example, after the pruning operation is performed on the multimedia recognition model, the parameters of the pruned model need to be adjusted to ensure its accuracy; this process may be referred to as fine-tuning. During fine-tuning, the partial derivative of the loss value with respect to each neuron remaining after pruning is determined to obtain the gradient of the neuron's weight, and the gradient is added to the neuron's original weight to obtain the updated weight. The offsets of the neurons are updated in the same way. The accuracy of the multimedia recognition model is ensured by adjusting the parameters of the reference neurons in the model that are associated with the target pruning neurons. Since only the parameters of the reference neurons associated with the target pruning neurons need to be adjusted, fine-tuning efficiency is improved.
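Illustratively, the fine-tuning step may be sketched in Python as follows (assuming the PyTorch library). Here `reference_params` stands for the parameters of the reference neurons associated with the target pruning neurons; restricting the optimizer to them realizes the efficiency gain described above. The names are illustrative only:

    import torch

    def fine_tune(model, loader, loss_fn, reference_params, lr=1e-3, epochs=1):
        for p in model.parameters():
            p.requires_grad_(False)          # freeze all parameters ...
        for p in reference_params:
            p.requires_grad_(True)           # ... except the reference neurons
        opt = torch.optim.SGD(reference_params, lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                opt.zero_grad()
                loss = loss_fn(model(x), y)  # error between predicted and actual
                loss.backward()              # partial derivatives -> gradients
                opt.step()                   # apply the gradient-based update
        return model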
In step S27, the multimedia data is recognized based on the target multimedia recognition model.
In the multimedia recognition method described above, a first target training layer is obtained on the side close to the output layer of the multimedia recognition model, a plurality of target neurons are obtained in the first target training layer, target pruning neurons are determined according to the weights of the obtained target neurons, a pruning operation is performed on the target pruning neurons, the pruned multimedia recognition model is trained to obtain a target multimedia recognition model, and multimedia data are recognized based on the target multimedia recognition model. By selecting the target pruning neurons from the target neurons close to the output layer, the accuracy of the recognition result of the multimedia recognition model for multimedia data after the pruning operation is improved.
As an optional embodiment of the present application, after the parameters of the reference neurons in the training layers of the pruned multimedia recognition model are adjusted according to the loss value to obtain the target multimedia recognition model, the method further includes:
and S31, acquiring the identification accuracy of the target multimedia identification model.
And S32, when the recognition accuracy meets the target condition, acquiring a second target training layer from the training layers of the target multimedia recognition model, wherein the number of the spacing layers from the second target training layer to the output layer of the target multimedia recognition model is larger than the number of the spacing layers from the first target training layer to the output layer of the multimedia recognition model.
Illustratively, after the pruning operation is performed on the neurons in the first target training layer and the neuron parameters of the pruned multimedia recognition model are adjusted, a plurality of target recognition data can be acquired again and recognized using the parameter-adjusted target multimedia recognition model, to obtain the recognition accuracy of the target multimedia recognition model. The target condition may be a permitted range of change in recognition accuracy before and after pruning: it is determined whether the difference between the recognition accuracy of the pruned model and the original accuracy of the multimedia recognition model is within a target threshold range, and when it is, the pruning operation can be performed again. For example, suppose the target threshold range is 2% and the original accuracy of the multimedia recognition model is 80%. If, after the first pruning operation, the recognition accuracy of the parameter-adjusted target multimedia recognition model is 79%, the difference between the original accuracy and the post-pruning accuracy is 1%, which is within the target threshold range, so the pruning operation can be performed on the target multimedia recognition model again. The target condition is not limited in the embodiments of the present application; those skilled in the art may determine by other criteria whether to continue pruning a target multimedia recognition model that has already been pruned.
S33, acquiring a plurality of target neurons in the second target training layer.
S34, pruning the target neurons obtained from the second target training layer in the same manner as the target neurons in the first target training layer, until the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model meets the target number of layers.
Illustratively, in the embodiments of the present application, pruning proceeds in order from the output layer toward the input layer. As long as the recognition accuracy meets the target condition, the target number of layers may be the total number of training layers; that is, the multimedia recognition model may be pruned multiple times, until the training layer closest to the input layer has been pruned.
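Illustratively, the overall iteration of steps S31 to S34 may be sketched in Python as follows. `prune_and_finetune_layer` and `evaluate` are placeholders for the per-layer pruning procedure and the accuracy measurement described above, and the 2% bound comes from the example target threshold range:

    def iterative_pruning(model, layers_output_to_input,
                          evaluate, prune_and_finetune_layer,
                          max_accuracy_drop=0.02):
        baseline = evaluate(model)                # original accuracy, e.g. 80%
        for layer_idx in layers_output_to_input:  # move from output toward input
            candidate = prune_and_finetune_layer(model, layer_idx)
            if baseline - evaluate(candidate) > max_accuracy_drop:
                break                             # target condition no longer met
            model = candidate                     # accept and prune the next layer
        return model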
Fig. 4 is a block diagram illustrating a multimedia recognition apparatus according to an exemplary embodiment. The multimedia recognition model includes an input layer, training layers, and an output layer; the training layers contain neurons with parameters, and the output result of the multimedia recognition model is changed by adjusting the parameters of the neurons in the training layers. Referring to fig. 4, the apparatus includes a first obtaining module 41, a second obtaining module 42, a first determining module 43, a second determining module 44, a pruning module 45, a training module 46, and a recognition module 47.
A first obtaining module 41 configured to acquire a first target training layer from the training layers of a multimedia recognition model, where the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to the input layer of the multimedia recognition model;
a second obtaining module 42 configured to acquire a plurality of target neurons in the first target training layer;
a first determining module 43 configured to determine the weight of each target neuron in the multimedia recognition model;
a second determining module 44 configured to determine target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model;
a pruning module 45 configured to perform a pruning operation on the target pruning neurons;
a training module 46 configured to train the pruned multimedia recognition model to obtain a target multimedia recognition model;
a recognition module 47 configured to recognize multimedia data based on the target multimedia recognition model.
In the multimedia recognition device provided by the embodiments of the present application, a first target training layer is acquired on the side close to the output layer of the multimedia recognition model, a plurality of target neurons are acquired in the first target training layer, target pruning neurons are determined according to the weights of the acquired target neurons, a pruning operation is performed on the target pruning neurons, the pruned multimedia recognition model is trained to obtain a target multimedia recognition model, and multimedia data are recognized based on the target multimedia recognition model. By selecting the target pruning neurons from the target neurons close to the output layer, the accuracy of the recognition result of the multimedia recognition model for multimedia data after the pruning operation is improved.
As an optional embodiment of the present application, the first determining module 43 is configured to determine a quantization value obtained by performing pruning on each target neuron, and use the quantization value corresponding to each target neuron as a weight of each target neuron, where the quantization value is used to characterize a degree of change of an output result of the multimedia recognition model.
As an optional embodiment of the present application, the second determining module 44 is configured to sort the quantized values corresponding to each target neuron, and to determine the target pruning neurons among the plurality of target neurons according to the sorting result.
As an optional embodiment of the present application, the second determining module 44 is configured to obtain a quantized value distribution function according to the number of target neurons and the quantized value corresponding to each target neuron; determine a quantized value threshold according to the distribution function; and take the target neurons whose corresponding quantized values are lower than the quantized value threshold as target pruning neurons.
As an optional embodiment of the present application, the training module 46 is configured to train the pruned multimedia recognition model with training data to obtain a loss value, where the loss value represents the error between a predicted value and an actual value of the multimedia recognition model; and to adjust, according to the loss value, the parameters of reference neurons in the training layers of the pruned multimedia recognition model to obtain the target multimedia recognition model, where the reference neurons are associated with the target pruning neurons.
As an optional embodiment of the present application, the recognition module 47 is further configured to obtain the recognition accuracy of the target multimedia recognition model; when the recognition accuracy meets a target condition, acquire a second target training layer from the training layers of the target multimedia recognition model, where the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model is greater than the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model; acquire a plurality of target neurons in the second target training layer; and prune the target neurons obtained from the second target training layer in the same manner as the target neurons in the first target training layer, until the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model meets the target number of layers.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Based on the same concept, an embodiment of the present application further provides an electronic device, as shown in fig. 5, the electronic device includes:
a processor 51;
a memory 52 for storing instructions executable by the processor 51;
wherein the processor is configured to execute the instructions to implement the multimedia recognition method of the above embodiments. The processor 51 and the memory 52 are connected by a communication bus 53.
It should be understood that the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be an advanced reduced instruction set machine (ARM) architecture supported processor.
Further, in an alternative embodiment, the memory may include both read-only memory and random access memory, and provide instructions and data to the processor. The memory may also include non-volatile random access memory. For example, the memory may also store device type information.
The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may be a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can be Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, static Random Access Memory (SRAM), dynamic Random Access Memory (DRAM), synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
Fig. 6 is a block diagram illustrating a terminal 600 according to an example embodiment. The terminal 600 may be: a smartphone, a tablet, a laptop, or a desktop computer. The terminal 600 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, etc.
In general, the terminal 600 includes: a processor 601 and a memory 602.
Processor 601 may include one or more processing cores, such as 4-core processors, 8-core processors, and so forth. The processor 601 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content required to be displayed on the display screen. In some embodiments, processor 601 may also include an AI (Artificial Intelligence) processor for processing computational operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one instruction for execution by processor 601 to implement the multimedia recognition methods provided by the method embodiments herein.
In some embodiments, the terminal 600 may further optionally include: a peripheral interface 603 and at least one peripheral. The processor 601, memory 602, and peripheral interface 603 may be connected by buses or signal lines. Various peripheral devices may be connected to the peripheral interface 603 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of a radio frequency circuit 604, a display 605, a camera 606, an audio circuit 607, a positioning component 608, and a power supply 609.
The peripheral interface 603 may be used to connect at least one peripheral related to I/O (Input/Output) to the processor 601 and the memory 602. In some embodiments, the processor 601, memory 602, and peripheral interface 603 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 601, the memory 602, and the peripheral interface 603 may be implemented on separate chips or circuit boards, which is not limited by the present embodiment.
The Radio Frequency circuit 604 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuitry 604 communicates with communication networks and other communication devices via electromagnetic signals. The rf circuit 604 converts an electrical signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electrical signal. Optionally, the radio frequency circuit 604 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuitry 604 may communicate with other terminals via at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: metropolitan area networks, various generation mobile communication networks (2G, 3G, 4G, and 5G), wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 604 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display 605 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 605 is a touch display screen, the display screen 605 also has the ability to capture touch signals on or over the surface of the display screen 605. The touch signal may be input to the processor 601 as a control signal for processing. At this point, the display 605 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 605 may be one, providing the front panel of the terminal 600; in other embodiments, the display 605 may be at least two, respectively disposed on different surfaces of the terminal 600 or in a folded design; in still other embodiments, the display 605 may be a flexible display disposed on a curved surface or on a folded surface of the terminal 600. Even more, the display 605 may be arranged in a non-rectangular irregular pattern, i.e., a shaped screen. The Display 605 may be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), and the like.
The camera assembly 606 is used to capture images or video. Optionally, camera assembly 606 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 606 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp and can be used for light compensation under different color temperatures.
The audio circuitry 607 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals to the processor 601 for processing or inputting the electric signals to the radio frequency circuit 604 to realize voice communication. For the purpose of stereo sound collection or noise reduction, a plurality of microphones may be provided at different portions of the terminal 600. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 601 or the radio frequency circuit 604 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, audio circuitry 607 may also include a headphone jack.
The positioning component 608 is used for positioning the current geographic Location of the terminal 600 to implement navigation or LBS (Location Based Service). The Positioning component 608 can be a Positioning component based on the united states GPS (Global Positioning System), the chinese beidou System, the russian graves System, or the european union's galileo System.
A power supply 609 is used to supply power to the various components in terminal 600. The power supply 609 may be ac, dc, disposable or rechargeable. When the power supply 609 includes a rechargeable battery, the rechargeable battery may support wired or wireless charging. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the terminal 600 also includes one or more sensors 610. The one or more sensors 610 include, but are not limited to: acceleration sensor 611, gyro sensor 612, pressure sensor 613, fingerprint sensor 614, optical sensor 615, and proximity sensor 616.
The acceleration sensor 611 may detect the magnitude of acceleration in three coordinate axes of the coordinate system established with the terminal 600. For example, the acceleration sensor 611 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 601 may control the touch screen display 605 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 611. The acceleration sensor 611 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 612 may detect a body direction and a rotation angle of the terminal 600, and the gyro sensor 612 and the acceleration sensor 611 may cooperate to acquire a 3D motion of the user on the terminal 600. The processor 601 may implement the following functions according to the data collected by the gyro sensor 612: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
The pressure sensor 613 may be disposed on a side frame of the terminal 600 and/or on a lower layer of the touch display screen 605. When the pressure sensor 613 is disposed on the side frame of the terminal 600, a user's holding signal of the terminal 600 can be detected, and the processor 601 performs left-right hand recognition or shortcut operation according to the holding signal collected by the pressure sensor 613. When the pressure sensor 613 is arranged at the lower layer of the touch display screen 605, the processor 601 controls the operability control on the UI interface according to the pressure operation of the user on the touch display screen 605. The operability control comprises at least one of a button control, a scroll bar control, an icon control, and a menu control.
The fingerprint sensor 614 is used for collecting a fingerprint of a user, and the processor 601 identifies the identity of the user according to the fingerprint collected by the fingerprint sensor 614, or the fingerprint sensor 614 identifies the identity of the user according to the collected fingerprint. Upon identifying that the user's identity is a trusted identity, the processor 601 authorizes the user to perform relevant sensitive operations including unlocking the screen, viewing encrypted information, downloading software, paying, and changing settings, etc. The fingerprint sensor 614 may be provided on the front, back, or side of the terminal 600. When a physical button or vendor Logo is provided on the terminal 600, the fingerprint sensor 614 may be integrated with the physical button or vendor Logo.
The optical sensor 615 is used to collect the ambient light intensity. In one embodiment, the processor 601 may control the display brightness of the touch display screen 605 based on the ambient light intensity collected by the optical sensor 615: when the ambient light intensity is high, the display brightness of the touch display screen 605 is increased; when the ambient light intensity is low, the display brightness is decreased. In another embodiment, the processor 601 may also dynamically adjust the shooting parameters of the camera assembly 606 according to the ambient light intensity collected by the optical sensor 615.
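As a rough illustration of the brightness adjustment just described, the sketch below maps ambient light to a display brightness level; the lux scale and the clamp range are assumptions for illustration only, since the patent does not specify them:

```python
# Hedged sketch: ambient light intensity (lux) -> display brightness in [0.1, 1.0],
# rising with brighter surroundings, as an optical sensor such as 615 might drive it.

def brightness_for_lux(lux: float) -> float:
    level = 0.1 + lux / 1000.0          # brighter surroundings -> brighter screen
    return max(0.1, min(1.0, level))    # clamp to the displayable range

print(brightness_for_lux(50))    # dim room  -> 0.15
print(brightness_for_lux(800))   # daylight  -> 0.9
```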
A proximity sensor 616, also known as a distance sensor, is typically disposed on the front panel of the terminal 600. The proximity sensor 616 is used to measure the distance between the user and the front of the terminal 600. In one embodiment, when the proximity sensor 616 detects that the distance between the user and the front of the terminal 600 is gradually decreasing, the processor 601 controls the touch display screen 605 to switch from the bright-screen state to the dark-screen state; when the proximity sensor 616 detects that the distance is gradually increasing, the processor 601 controls the touch display screen 605 to switch from the dark-screen state to the bright-screen state.
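The proximity behavior above amounts to a simple distance-driven state switch; a minimal sketch, assuming an arbitrary toggle threshold that the patent does not specify:

```python
# Hedged sketch: toggling screen state from proximity-sensor distance readings,
# as a processor such as 601 might do with a sensor such as 616.

def screen_state(distance_cm: float, threshold_cm: float = 5.0) -> str:
    # Near the face -> dark screen (e.g., during a call); away -> bright screen.
    return "dark" if distance_cm < threshold_cm else "bright"

print(screen_state(2.0))    # dark
print(screen_state(20.0))   # bright
```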
Those skilled in the art will appreciate that the configuration shown in fig. 6 does not limit the terminal 600, which may include more or fewer components than shown, combine certain components, or adopt a different arrangement of components.
The present application provides a computer program which, when executed by a computer, may cause the processor or the computer to perform the respective steps and/or procedures of the above-described method embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another by wire (e.g., coaxial cable, optical fiber, or digital subscriber line) or wirelessly (e.g., infrared, radio, or microwave). The computer-readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, hard disk, or magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid-state drive).
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (12)

1. A method for multimedia recognition, the method comprising:
acquiring a first target training layer from the training layers of a multimedia recognition model, wherein the number of spacing layers from the first target training layer to an output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to an input layer of the multimedia recognition model;
in the first target training layer, acquiring a plurality of target neurons;
determining a weight of each target neuron in the multimedia recognition model;
determining a target pruning neuron in the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model;
performing a pruning operation on the target pruning neurons;
training the multimedia recognition model after pruning to obtain a target multimedia recognition model, wherein the target multimedia recognition model is obtained by adjusting parameters of reference neurons in a training layer of the multimedia recognition model, and the reference neurons are associated with the target pruning neurons;
acquiring the recognition accuracy of the target multimedia recognition model;
when the recognition accuracy meets a target condition, acquiring a second target training layer from the training layers of the target multimedia recognition model, wherein the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model is greater than the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model, and the target condition is the range within which the recognition accuracy of the multimedia recognition model changes before and after the pruning operation;
in the second target training layer, acquiring a plurality of target neurons;
pruning the target neurons acquired from the second target training layer according to the method used for pruning the target neurons in the first target training layer, until the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model meets a target layer number;
and recognizing the multimedia data based on the target multimedia recognition model obtained after the pruning operation is performed on the target neurons acquired from the second target training layer.
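The iterative structure of claim 1 can be summarized in a short sketch. Everything below is one illustrative reading, not the claimed implementation: layers are indexed from input to output, neuron_importance stands in for the per-neuron weights of claim 1 (the quantized values of claim 2), and the keep ratio is an arbitrary assumption:

```python
# Hedged sketch of claim 1's flow: prune layers starting nearest the output
# and move toward the input until the target layer number is reached.

import random

def neuron_importance(layer_idx: int, n_neurons: int) -> list:
    # Stand-in for the per-neuron weights / quantized values of claims 1-2.
    random.seed(layer_idx)
    return [random.random() for _ in range(n_neurons)]

def prune_toward_input(layer_sizes: list, target_depth: int, keep_ratio: float = 0.8) -> list:
    n_layers = len(layer_sizes)
    # dist_to_output = 1 is the first target training layer (closest to output).
    for dist_to_output in range(1, target_depth + 1):
        idx = n_layers - 1 - dist_to_output            # skip the output layer itself
        scores = neuron_importance(idx, layer_sizes[idx])
        cutoff = sorted(scores)[int(len(scores) * (1 - keep_ratio))]
        pruned = sum(1 for s in scores if s < cutoff)  # the target pruning neurons
        layer_sizes[idx] -= pruned                     # the pruning operation
        # ...here the claim fine-tunes the reference neurons and checks accuracy...
    return layer_sizes

print(prune_toward_input([128, 256, 256, 128, 10], target_depth=3))
```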
2. The method of claim 1, wherein determining the weight of each target neuron in the multimedia recognition model comprises:
determining a quantized value obtained after pruning each target neuron, and taking the quantized value corresponding to each target neuron as the weight of that target neuron, wherein the quantized value is used to represent the degree of change of the output result of the multimedia recognition model.
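One natural reading of this quantized value is sketched below under stated assumptions: a toy two-layer network, a random probe batch, and an L2 distance over the outputs, none of which the claim specifies:

```python
# Hedged sketch of claim 2: silence one hidden neuron at a time and measure
# how much the model's output changes; a larger change means a larger score.

import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))     # input -> hidden
W2 = rng.normal(size=(8, 3))     # hidden -> output
x = rng.normal(size=(16, 4))     # probe batch (stand-in data)

def forward(mask=None):
    h = np.maximum(x @ W1, 0.0)  # ReLU hidden layer
    if mask is not None:
        h = h * mask             # zero out the "pruned" neuron
    return h @ W2

baseline = forward()
for j in range(W1.shape[1]):
    mask = np.ones(W1.shape[1])
    mask[j] = 0.0
    delta = np.linalg.norm(forward(mask) - baseline)   # degree of output change
    print(f"neuron {j}: quantized value {delta:.3f}")
```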
3. The method of claim 2, wherein determining a target pruned neuron among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model comprises:
sorting the quantized values corresponding to the target neurons;
and determining a target pruning neuron among the plurality of target neurons according to the sorting result.
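A minimal sketch of this sorting step, with made-up scores and an assumed bottom-third prune fraction (the claim does not fix the fraction):

```python
# Hedged sketch of claim 3: sort neurons by their quantized values and take
# the lowest-scoring ones as the target pruning neurons.

scores = {0: 0.91, 1: 0.07, 2: 0.44, 3: 0.02, 4: 0.63, 5: 0.18}
ranked = sorted(scores, key=scores.get)    # neuron ids, least important first
to_prune = ranked[: len(ranked) // 3]      # bottom third (illustrative fraction)
print(to_prune)                            # [3, 1]
```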
4. The method of claim 2, wherein determining a target pruned neuron among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model comprises:
obtaining a quantized value distribution function according to the number of target neurons and the quantized value corresponding to each target neuron;
determining a quantized value threshold according to the quantized value distribution function;
and taking the target neurons whose corresponding quantized values are below the quantized value threshold as target pruning neurons.
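A sketch of this distribution-based variant, where the empirical distribution of scores supplies the threshold; the 20th-percentile cut is an assumption, since the claim does not fix how the threshold is derived from the distribution function:

```python
# Hedged sketch of claim 4: derive a threshold from the distribution of
# quantized values, then prune every neuron scoring below it.

import numpy as np

scores = np.array([0.91, 0.07, 0.44, 0.02, 0.63, 0.18, 0.35, 0.51])
threshold = np.percentile(scores, 20)               # from the score distribution
target_pruning_neurons = np.flatnonzero(scores < threshold)
print(threshold, target_pruning_neurons)            # ~0.114, neurons [1 3]
```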
5. The method according to any one of claims 1 to 4, wherein training the multimedia recognition model after the pruning operation to obtain a target multimedia recognition model comprises:
training the multimedia recognition model after the pruning operation by using training data to obtain a loss value, wherein the loss value is used to represent the error between a predicted value and an actual value of the multimedia recognition model;
and adjusting, according to the loss value, the parameters of the reference neurons in the training layer of the multimedia recognition model after the pruning operation to obtain the target multimedia recognition model.
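The fine-tuning of claim 5 is sketched below with PyTorch, under illustrative assumptions: a toy post-pruning network, random stand-in training data, and an MSE loss, none of which the claim prescribes:

```python
# Hedged sketch of claim 5: train the pruned model on training data, compute
# a loss value, and adjust the surviving (reference) parameters from it.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(4, 6), nn.ReLU(), nn.Linear(6, 3))  # post-pruning sizes
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()                         # error between predicted and actual values

x, y = torch.randn(32, 4), torch.randn(32, 3)  # stand-in training data
for _ in range(100):
    opt.zero_grad()
    loss = loss_fn(model(x), y)                # the claim's loss value
    loss.backward()                            # gradients for the reference neurons
    opt.step()                                 # adjust their parameters
print(float(loss))                             # should shrink over the loop
```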
6. A multimedia recognition apparatus, the apparatus comprising:
the first acquisition module is configured to acquire a first target training layer from the training layers of a multimedia recognition model, wherein the number of spacing layers from the first target training layer to an output layer of the multimedia recognition model is smaller than the number of spacing layers from the first target training layer to an input layer of the multimedia recognition model;
a second acquisition module configured to acquire a plurality of target neurons in the first target training layer;
a first determination module configured to determine a weight of each target neuron in the multimedia recognition model;
a second determination module configured to determine target pruning neurons among the plurality of target neurons according to the weight of each target neuron in the multimedia recognition model;
a pruning module configured to perform a pruning operation on the target pruning neurons;
a training module configured to train the multimedia recognition model after the pruning operation to obtain a target multimedia recognition model, wherein the target multimedia recognition model is obtained by adjusting parameters of reference neurons in a training layer of the multimedia recognition model, and the reference neurons are associated with the target pruning neurons;
a recognition module configured to acquire the recognition accuracy of the target multimedia recognition model; when the recognition accuracy meets a target condition, acquire a second target training layer from the training layers of the target multimedia recognition model, wherein the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model is greater than the number of spacing layers from the first target training layer to the output layer of the multimedia recognition model, and the target condition is the range within which the recognition accuracy of the multimedia recognition model changes before and after the pruning operation; acquire a plurality of target neurons in the second target training layer; and prune the target neurons acquired from the second target training layer according to the method used for pruning the target neurons in the first target training layer, until the number of spacing layers from the second target training layer to the output layer of the target multimedia recognition model meets a target layer number;
the recognition module is further configured to recognize multimedia data based on the target multimedia recognition model obtained after the pruning operation is performed on the target neurons acquired from the second target training layer.
7. The apparatus of claim 6, wherein the first determination module is configured to determine a quantized value obtained after pruning each target neuron and to take the quantized value corresponding to each target neuron as the weight of that target neuron, wherein the quantized value is used to characterize the degree of change of the output result of the multimedia recognition model.
8. The apparatus of claim 7, wherein the second determination module is configured to sort the quantized values corresponding to the target neurons and to determine a target pruning neuron among the plurality of target neurons according to the sorting result.
9. The apparatus of claim 7, wherein the second determination module is configured to obtain a quantized value distribution function according to the number of target neurons and the quantized value corresponding to each target neuron; determine a quantized value threshold according to the quantized value distribution function; and take the target neurons whose corresponding quantized values are below the quantized value threshold as target pruning neurons.
10. The apparatus according to any one of claims 6 to 9, wherein the recognition module is configured to train the multimedia recognition model after the pruning operation by using training data to obtain a loss value, the loss value being used to represent the error between a predicted value and an actual value of the multimedia recognition model, and to adjust, according to the loss value, the parameters of the reference neurons in the training layer of the multimedia recognition model after the pruning operation to obtain the target multimedia recognition model.
11. An electronic device, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the multimedia recognition method of any of claims 1 to 5.
12. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the multimedia recognition method of any one of claims 1 to 5.
CN201910699746.9A 2019-07-31 2019-07-31 Multimedia recognition method, device, equipment and storage medium Active CN110414673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910699746.9A CN110414673B (en) 2019-07-31 2019-07-31 Multimedia recognition method, device, equipment and storage medium


Publications (2)

Publication Number Publication Date
CN110414673A (en) 2019-11-05
CN110414673B (en) 2022-10-28

Family

ID=68364526





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant