CN115270981A - Object processing method and device, readable medium and electronic equipment


Info

Publication number
CN115270981A
Authority
CN
China
Prior art keywords
target
model
processing
encoder
determined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210940360.4A
Other languages
Chinese (zh)
Inventor
曾妍
周王春澍
王天楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd filed Critical Beijing Youzhuju Network Technology Co Ltd
Priority to CN202210940360.4A priority Critical patent/CN115270981A/en
Publication of CN115270981A publication Critical patent/CN115270981A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06N — COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 — Computing arrangements based on biological models
    • G06N 3/02 — Neural networks
    • G06N 3/08 — Learning methods
    • G06N 3/082 — Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to an object processing method and apparatus, a readable medium, and an electronic device. The method comprises the following steps: acquiring a target object to be processed, the target object comprising a target image and/or target text; and inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object. The target model is obtained by performing target compression processing on a pending model, the pending model is obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample, and the target compression processing may include pruning processing. Because the pending model is obtained through knowledge distillation and target compression processing (such as pruning) is then performed on it, the trained target model reduces model parameters and accelerates model inference while preserving the performance of the first teacher model on various tasks to the greatest extent.

Description

Object processing method and device, readable medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to an object processing method, an object processing apparatus, a readable medium, and an electronic device.
Background
With the development of computer technology, multimodal understanding tasks, which solve problems by jointly understanding visual information (vision) and language information (language), are increasingly widely applied, for example, the image-text retrieval task (Image-Text Retrieval), the visual question answering task (Visual Question Answering), and the visual reasoning task (Visual Reasoning).
However, in the related art, the model structures corresponding to multimodal understanding tasks are complex, which limits their deployment.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter.
According to a first aspect of embodiments of the present disclosure, there is provided an object processing method, the method including:
acquiring a target object to be processed; wherein the target object comprises a target image and/or target text;
inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object;
wherein the target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing includes pruning processing.
According to a second aspect of the embodiments of the present disclosure, there is provided an object processing apparatus, the apparatus including:
the object acquisition module is used for acquiring a target object to be processed, wherein the target object comprises a target image and/or target text;
the object processing module is used for inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object, wherein the target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing includes pruning processing.
According to a third aspect of embodiments of the present disclosure, there is provided a computer readable medium having stored thereon a computer program which, when executed by a processing apparatus, performs the steps of the method of the first aspect of the present disclosure.
According to a fourth aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including:
a storage device having a computer program stored thereon;
processing means for executing the computer program in the storage means to implement the steps of the method of the first aspect of the present disclosure.
By adopting the above technical solution, a target object to be processed is acquired and input into a pre-trained target model to obtain a processing result corresponding to the target object, where the target object comprises a target image and/or target text. The target model is obtained by performing target compression processing on a pending model, the pending model is obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample, and the target compression processing may include pruning processing. Because the pending model is obtained through knowledge distillation and target compression processing (such as pruning) is then performed on it, the trained target model reduces model parameters and accelerates model inference while preserving the performance of the first teacher model on various tasks to the greatest extent, which facilitates deploying and applying the target model in various scenarios.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The above and other features, advantages, and aspects of embodiments of the present disclosure will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. Throughout the drawings, the same or similar reference numbers refer to the same or similar elements. It should be understood that the drawings are schematic and that elements and components are not necessarily drawn to scale. In the drawings:
FIG. 1 is a flow diagram illustrating an object processing method in accordance with an exemplary embodiment.
FIG. 2 is a flowchart illustrating a method of training an object model, according to an example embodiment.
FIG. 3 is a schematic diagram illustrating a method of pruning a pending model according to an exemplary embodiment.
FIG. 4 is a schematic diagram illustrating another method of training a target model in accordance with an exemplary embodiment.
Fig. 5 is a block diagram illustrating an object processing apparatus according to an example embodiment.
Fig. 6 is a block diagram illustrating another object processing apparatus according to an example embodiment.
FIG. 7 is a block diagram illustrating an electronic device in accordance with an example embodiment.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as being limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifications "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise. In the description of the present disclosure, unless otherwise indicated, "plurality" means two or more, and other quantifiers are similar. "At least one of" a list of items refers to any combination of those items, including any single item; for example, at least one of a, b, or c may represent: a, b, c, a and b, a and c, b and c, or a, b, and c, where a, b, and c may each be singular or plural. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone, where A and B may be singular or plural.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
It is understood that before the technical solutions disclosed in the embodiments of the present disclosure are used, the user should be informed, in a proper manner and in accordance with the relevant laws and regulations, of the type, scope of use, usage scenarios, and the like of the personal information involved in the present disclosure, and the user's authorization should be obtained.
For example, in response to receiving an active request from a user, prompt information is sent to the user to explicitly prompt the user that the requested operation to be performed would require the acquisition and use of the user's personal information. Thus, the user can autonomously choose whether to provide personal information to the software or hardware, such as an electronic device, an application program, a server, or a storage medium, that performs the operations of the disclosed technical solution, according to the prompt information.
As an alternative but non-limiting implementation, in response to receiving an active request from the user, the prompt information may be sent to the user, for example, in the form of a pop-up window, in which the prompt information may be presented as text. In addition, the pop-up window may carry a selection control for the user to choose "agree" or "disagree" to providing personal information to the electronic device.
It is understood that the above notification and user authorization process is only illustrative and not limiting, and other ways of satisfying relevant laws and regulations may be applied to the implementation of the present disclosure.
Meanwhile, it is understood that the data involved in the present technical solution (including but not limited to the data itself, the acquisition or use of the data) should comply with the requirements of the corresponding laws and regulations and the related regulations.
It should be noted that all the actions of acquiring signals, information or data in the present disclosure are performed under the premise of complying with the corresponding data protection regulation policy of the country of the location and obtaining the authorization given by the owner of the corresponding device.
First, an application scenario of the present disclosure will be explained. The present disclosure may be applied to model structure compression scenarios. In the related art, models corresponding to multimodal understanding tasks have complex structures, which limits deployment. In some embodiments, model compression may be performed for multimodal understanding tasks based on visual region features; for example, MiniVLM (Vision-Language Model) or DistillVLM may be employed for model compression. However, the parameter count of the compressed encoder remains considerable, which makes the model difficult to deploy and apply on edge devices with limited inference time or limited memory.
In order to solve the above problems, the present disclosure provides an object processing method and apparatus, a readable medium, and an electronic device. A pending model is obtained through knowledge distillation, and target compression processing (such as pruning processing) is then performed on the pending model, so that the trained target model reduces model parameters and accelerates model inference while preserving the performance of the first teacher model on various tasks to the greatest extent, which facilitates deploying and applying the target model in various scenarios.
The present disclosure is described below with reference to specific examples.
Fig. 1 illustrates an object processing method according to an exemplary embodiment. The method may be applied to an electronic device, which may include a terminal device such as a smartphone, a smart wearable device, a smart speaker, a smart tablet, a PDA (Personal Digital Assistant), CPE (Customer Premises Equipment), a personal computer, or a vehicle terminal; the electronic device may also include a server, such as a local server or a cloud server. As shown in fig. 1, the method may include:
s101, acquiring a target object to be processed.
The target object includes a target image and/or a target text.
In some embodiments, the target object may include a target image. The target image may be an image acquired by the electronic device in real time, a pre-stored image, or an image received from a higher-level device. The target image may be a picture or a video; the present disclosure does not limit the type of the target image.
In other embodiments, the target object may include target text, such as Chinese text, English text, text in another language, or a mixture of multiple languages.
In other embodiments, the target object may include a target image and a target text.
S102, inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object.
The target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing may include pruning processing.
In some embodiments, the target model may comprise a Transformer-based multimodal model. The target model may take various forms, and the processing result corresponding to the target object may vary accordingly. For example:
in some embodiments, the object model may comprise a teletext Retrieval model for performing a teletext Retrieval task (Image-Text Retrieval). Under the condition that the target object is the target character, the image-text retrieval model can retrieve through the target character to obtain an image corresponding to the target character, namely the processing result corresponding to the target object is to obtain the image corresponding to the target character; when the target object is the target image, the image-text retrieval model can retrieve through the target image to obtain the characters corresponding to the target image, that is, the processing result corresponding to the target object is to obtain the characters corresponding to the target image.
In some embodiments, the target model may include a Visual Question-Answering model for conducting a Visual Question-Answering task (Visual Question Answering). The visual question-answering model can determine a question corresponding to the target image according to the target image and answer the question.
In some embodiments, the target model may include a visual reasoning model for performing a visual reasoning task (Visual Reasoning). The visual reasoning model may determine whether the target text correctly describes the target image, for example, by judging whether the matching relationship between the target text and the target image holds.
In some embodiments, the target model may include an image caption generation model for performing an image caption generation task (Image Captioning). The image caption generation model may generate, from the target image, target text describing the target image.
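By way of illustration only, invoking such a target model for the task types above might look like the following sketch. The model class, task-head method names, and argument names here are hypothetical; the patent does not define an API:

```python
import torch

def process(model, target_image=None, target_text=None,
            task="image_text_retrieval"):
    """Steps S101-S102: feed a target object (a target image and/or
    target text) to the pre-trained target model and return the
    processing result. The method names below (retrieve, answer,
    matches, caption) are assumptions chosen to mirror the four
    task types; they are not defined by the patent."""
    model.eval()
    with torch.no_grad():
        if task == "image_text_retrieval":
            # Text query -> matching image, or image query -> matching text.
            return model.retrieve(image=target_image, text=target_text)
        if task == "visual_question_answering":
            return model.answer(image=target_image, question=target_text)
        if task == "visual_reasoning":
            # Does the target text correctly describe the target image?
            return model.matches(image=target_image, text=target_text)
        if task == "image_captioning":
            return model.caption(image=target_image)
    raise ValueError(f"unsupported task: {task}")
```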
By adopting the above method, a target object to be processed is acquired and input into a pre-trained target model to obtain a processing result corresponding to the target object, where the target object comprises a target image and/or target text. The target model is obtained by performing target compression processing on a pending model, the pending model is obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample, and the target compression processing may include pruning processing. Because the pending model is obtained through knowledge distillation and target compression processing (such as pruning) is then performed on it, the trained target model reduces model parameters and accelerates model inference while preserving the performance of the first teacher model on various tasks to the greatest extent, which facilitates deploying and applying the target model in various scenarios.
FIG. 2 is a flow diagram illustrating a method of training a target model in accordance with an exemplary embodiment. As shown in fig. 2, the training method may include:
s201, obtaining a first teacher model trained in advance.
The first teacher model may be one or more pre-trained models; for example, it may be a single large neural network model or a combination of multiple neural network models.
In some embodiments, the first teacher model may include a Transformer-based multimodal model. Illustratively, the first teacher model may include X-VLM, one of the best-performing Transformer-based multimodal models at present. It should be noted that Oscar, the teacher model used in existing multimodal model compression techniques, is a representative of multimodal models based on visual region features.
S202, performing knowledge distillation on the first teacher model according to the first training sample to obtain a pending model.
In this step, knowledge learned by a large model or by multiple models can be transferred through Knowledge Distillation to another lightweight single model that is convenient to deploy. During distillation, the large model or models may be referred to as the teacher model (Teacher), and the lightweight single model may be referred to as the student model (Student).
Illustratively, knowledge distillation can simplify both the layer hierarchy and the parameters of the first teacher model, yielding a lightweight pending model on which the performance of the first teacher model on various tasks is preserved.
In some embodiments, the first training sample may include image-text paired data. Illustratively, it may include a plurality of first sample images, each with corresponding first sample text; it may also include a plurality of first sample texts, each with a corresponding first sample image. In this way, the first training sample can be used to train both single-modal and multimodal models.
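For instance, one distillation step on such image-text paired data might look as follows. The loss formulation (temperature-softened KL divergence plus the hard task loss) and all function signatures are assumptions of this sketch; the patent does not fix a particular distillation objective:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, mix=0.5):
    """A common knowledge-distillation objective (an assumption; the
    patent does not prescribe one): a weighted sum of
    (a) KL divergence between temperature-softened teacher and student
        output distributions, and
    (b) the ordinary task loss against the ground-truth labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return mix * soft + (1.0 - mix) * hard

def distill_step(student, teacher, images, texts, labels, optimizer):
    """Sketch of step S202: distill the first teacher model into the
    pending (student) model on one batch of the first training sample."""
    teacher.eval()
    with torch.no_grad():
        t_logits = teacher(images, texts)
    s_logits = student(images, texts)
    loss = distillation_loss(s_logits, t_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```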
S203, performing target compression processing on the pending model to obtain the target model.
In this step, the pending model obtained after knowledge distillation can be further subjected to target compression processing. Because the pending model obtained through knowledge distillation preserves the performance of the first teacher model on various tasks, the target compression processing can further refine it into the target model.
Illustratively, the target compression processing may include one or more of pruning processing, knowledge distillation processing, and quantization processing.
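Of these options, quantization is the simplest to illustrate in isolation. A minimal sketch assuming PyTorch dynamic int8 quantization of the pending model's linear layers (the patent does not prescribe a quantization scheme; this one is chosen for illustration):

```python
import torch

def quantize_pending_model(pending_model):
    # Dynamically quantize linear layers to int8 — one possible
    # realization of the quantization option.
    return torch.ao.quantization.quantize_dynamic(
        pending_model, {torch.nn.Linear}, dtype=torch.qint8
    )
```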
By adopting the above training method, a pre-trained first teacher model is acquired, knowledge distillation is performed on the first teacher model according to a first training sample to obtain a pending model, and target compression processing is then performed on the pending model to obtain the target model. Because the pending model is obtained through knowledge distillation and target compression processing (such as pruning) is then performed on it, the trained target model reduces model parameters and accelerates model inference while preserving the performance of the first teacher model on various tasks to the greatest extent, which facilitates deploying and applying the target model in various scenarios. In some scenarios, this training method can preserve more than 98% of the first teacher model's performance on various tasks, thereby improving the performance of the trained target model.
In some embodiments of the present disclosure, the target compression processing may include first pruning processing, and accordingly step S203 may include the following sub-steps:
first, a first target task type corresponding to a target model is determined.
The first target task type may be used to characterize the type of downstream task that the target model performs. Illustratively, the first target task type may include one or more of an image-text retrieval task, a visual question answering task, a visual reasoning task, and an image caption generation task.
Then, first pruning processing is performed on the pending model according to the first target task type to obtain a target model corresponding to the first target task type.
In some embodiments, the pending model may include at least one target encoder, so that the first pruning processing may be performed on at least one network layer of the at least one target encoder according to the first target task type.
The first pruning processing may include modality-adaptive pruning, in which the pending model is pruned adaptively according to the first target task type. For example, the first pruning processing may include pruning based on differentiable regularization: a differentiable L0 regularization method may be used, with Lagrange multipliers controlling the sparsity of the model during pruning.
Illustratively, a first loss function for model training is shown in equation (1):

$$\max_{\lambda_1,\lambda_2}\ \min_{\theta,\alpha}\ \mathbb{E}_{u\sim U(0,1)}\left[\frac{1}{D}\sum_{i=1}^{D}\mathcal{L}\left(x_i,y_i;\tilde{\theta}\right)\right]+\lambda_1\left(s(\alpha)-t\right)+\lambda_2\left(s(\alpha)-t\right)^2\qquad(1)$$

where λ1 denotes the first Lagrange multiplier and λ2 denotes the second Lagrange multiplier, both of which are learnable through training; θ̃ denotes the model parameters; α denotes a trainable hyperparameter; the expectation over u denotes the expected value of the loss under the random variable u, which is sampled with equal probability between (0, 1); the minimum over θ and α indicates that the expectation on its right is minimized by updating θ and α, and the maximum over λ1 and λ2 indicates that this expected minimum is maximized by updating λ1 and λ2; D denotes the total number of training samples, x_i denotes the i-th training sample, and y_i denotes the training label corresponding to the i-th training sample; L denotes the pending loss function used for training; s(α) denotes the current sparsity of the model being trained, and t denotes the desired sparsity of the trained model, which may be a preset value.
In some embodiments, s(α) in the above equation (1) may be given by a preset function following the Hard Concrete distribution, which may also be referred to as the S function. For example, the preset function may include the following equation (2):

$$u\sim U(0,1),\qquad s=\mathrm{sigmoid}\left(\frac{\log u-\log(1-u)+\alpha}{\beta}\right),\qquad \bar{s}=\min\left(1,\max\left(0,\ s\cdot(r-l)+l\right)\right)\qquad(2)$$

where u denotes a random parameter sampled between 0 and 1, which may be generated, for example, by the random sampling function U(0, 1); the sigmoid function is an activation function that can be used to map a variable into (0, 1); α and β are both trainable hyperparameters; r denotes a preset upper limit (e.g., 1.1) of the stretched S function, and l denotes a preset lower limit (e.g., 0.1) of the stretched S function.
With equation (2), a parameter u can first be sampled with equal probability between 0 and 1, and s(α) can then be determined through the three sub-formulas in equation (2).
When the model is trained with the above first loss function, the learned value of α drives the S-function values corresponding to redundant parameters close to 0, so that the number of model parameters can be reduced.
In this way, pruning based on L0 regularization can be realized, and the Lagrange multipliers can be used to control the sparsity of the model during pruning.
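To make the above concrete, the following PyTorch sketch implements a hard-concrete gate per equation (2) and the Lagrangian sparsity terms of equation (1). It is a minimal illustration under stated assumptions: the gate granularity, the temperature default β, and the sparsity estimate are choices of this sketch, not prescribed by the patent:

```python
import torch
import torch.nn as nn

class HardConcreteGate(nn.Module):
    """Differentiable gate following equation (2): sample u ~ U(0,1),
    squash it through a sigmoid with trainable alpha and temperature
    beta, then stretch to (l, r) and clip to [0, 1]."""
    def __init__(self, n_units, beta=0.66, l=0.1, r=1.1):
        super().__init__()
        # One trainable alpha per prunable unit (e.g., a head or FFN dim).
        self.alpha = nn.Parameter(torch.zeros(n_units))
        self.beta, self.l, self.r = beta, l, r

    def forward(self):
        u = torch.rand_like(self.alpha).clamp(1e-6, 1 - 1e-6)  # u ~ U(0,1)
        s = torch.sigmoid((u.log() - (1 - u).log() + self.alpha) / self.beta)
        return (s * (self.r - self.l) + self.l).clamp(0.0, 1.0)

def lagrangian_penalty(gates, t, lam1, lam2):
    """Sparsity terms of equation (1): lam1*(s(alpha)-t) + lam2*(s(alpha)-t)**2.
    Here the current sparsity s(alpha) is estimated as the mean fraction
    of closed gates (one possible estimate; the patent does not fix one).
    lam1 and lam2 are the Lagrange multipliers, updated by gradient
    ascent, while model weights and alpha are updated by descent."""
    s_alpha = 1.0 - torch.cat([g() for g in gates]).mean()
    return lam1 * (s_alpha - t) + lam2 * (s_alpha - t) ** 2
```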
FIG. 3 is a schematic diagram illustrating a method of pruning a pending model according to an exemplary embodiment. As shown in fig. 3, the pending model includes a target encoder, which may include one or more of a first image encoder, a first text encoder, and a first cross-modal encoder.
Where the target encoder comprises a first image encoder or a first text encoder, the network layer of the target encoder may comprise a feed-forward neural network and a self-attention module.
Where the target encoder comprises a first cross-modal encoder, the network layer of the first cross-modal encoder comprises a feed-forward neural network, a self-attention module, and a cross-attention module.
The gray rectangular blocks in fig. 3 indicate the portions pruned away by the above first pruning processing. With the first pruning processing, each network layer of each encoder can be pruned adaptively, thereby ensuring the performance of the pruned model.
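As an illustration of how such gates could attach to the network layers shown in fig. 3, the sketch below gates the FFN hidden dimensions of one encoder layer, reusing HardConcreteGate from above; head-level gating of the attention modules is declared but omitted from the forward pass for brevity. The layer sizes and the sub-module granularity are assumptions of this sketch:

```python
import torch.nn as nn

class GatedEncoderLayer(nn.Module):
    """One network layer of a target encoder, as in fig. 3: a
    self-attention module and a feed-forward network, plus a
    cross-attention module when the layer belongs to the first
    cross-modal encoder."""
    def __init__(self, d_model=768, n_heads=12, d_ffn=3072, cross_modal=False):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = (nn.MultiheadAttention(d_model, n_heads, batch_first=True)
                           if cross_modal else None)
        self.fc1 = nn.Linear(d_model, d_ffn)
        self.fc2 = nn.Linear(d_ffn, d_model)
        self.act = nn.GELU()
        # Gates over prunable units; a head gate would multiply per-head
        # outputs (application omitted in forward for brevity).
        self.ffn_gate = HardConcreteGate(d_ffn)
        self.head_gate = HardConcreteGate(n_heads)

    def forward(self, x, context=None):
        x = x + self.self_attn(x, x, x, need_weights=False)[0]
        if self.cross_attn is not None and context is not None:
            # Cross-attention over the other modality's features.
            x = x + self.cross_attn(x, context, context, need_weights=False)[0]
        # Mask individual FFN hidden dimensions with the hard-concrete gate.
        h = self.act(self.fc1(x)) * self.ffn_gate()
        return x + self.fc2(h)
```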
In other embodiments of the present disclosure, the target compression processing may include second pruning processing and knowledge distillation processing, and accordingly step S203 may include the following sub-steps:
first, a second target task type corresponding to the target model is determined.
The second target task type may be used to characterize the type of downstream task performed by the target model. Illustratively, the second target task type may include one or more of an image-text retrieval task, a visual question answering task, a visual reasoning task, and an image caption generation task. The second target task type may be the same as or different from the first target task type.
Second, a second teacher model is determined based on the second target task type and the first teacher model.
Illustratively, the first teacher model may be fine-tuned according to the second target task type to obtain the second teacher model.
Finally, second pruning processing and knowledge distillation processing are performed on the pending model according to the second target task type and the second teacher model, to obtain the target model.
The specific implementation manner of the second pruning processing may be the same as that of the first pruning processing, and is not described herein again.
In this way, knowledge distillation can be performed end to end, that is, introduced both in the pre-training stage and in the downstream-task fine-tuning stage, so that the trained target model keeps the performance of the teacher model as much as possible. In some scenarios, the target model trained in this way can retain more than 98% of the performance of the first teacher model.
FIG. 4 is a schematic diagram illustrating another method of training a target model according to an exemplary embodiment. As shown in fig. 4, the training method may include:
s401, obtaining a first training sample.
Illustratively, the first training sample may include image-text pairing data.
S402, obtaining a first teacher model trained in advance.
In some embodiments, the first teacher model may include a Transformer-based multimodal model. Illustratively, the first teacher model may include X-VLM.
S403, performing knowledge distillation on the first teacher model according to the first training sample.
S404, obtaining the pending model after knowledge distillation.
S405, determining a second teacher model.
For example, a second teacher model may be determined according to a second target task type corresponding to the target model and the first teacher model.
Wherein the second target task type may be used to characterize the type of downstream task executed by the target model. Illustratively, the second target task type may include one or more of an image-text retrieval task, a visual question answering task, a visual reasoning task, and an image caption generation task.
S406, performing second pruning processing and knowledge distillation processing on the pending model.
For example, the second pruning processing and the knowledge distillation processing may be performed on the pending model according to the second target task type and the second teacher model.
S407, obtaining the target model.
Illustratively, the target model may be applied to one or more of an image-text retrieval task, a visual reasoning task, a visual question answering task, and an image caption generation task.
It should be noted that, for specific implementation of the above steps, reference may be made to the specific description in the foregoing embodiments of the present disclosure, and details are not described here again.
In this way, knowledge distillation is performed on the first teacher model according to the first training sample to obtain the pending model, and pruning processing and further knowledge distillation processing are then performed on the pending model to train the target model. The target model thus preserves the performance of the first teacher model on various tasks to the greatest extent while reducing model parameters and accelerating model inference, which facilitates deploying and applying the target model in various scenarios.
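Putting steps S401 to S407 together, the following sketch shows one plausible joint loop for S406, reusing distillation_loss, HardConcreteGate, and lagrangian_penalty from the earlier sketches. The loop structure, optimizers, and data format are assumptions; the patent prescribes none of them:

```python
import torch

def prune_and_distill(student, second_teacher, loader, gates,
                      target_sparsity=0.5, epochs=3):
    """Sketch of S406: second pruning processing and knowledge
    distillation processing performed jointly on the pending model."""
    lam1 = torch.zeros((), requires_grad=True)  # Lagrange multipliers
    lam2 = torch.zeros((), requires_grad=True)
    opt = torch.optim.AdamW(
        list(student.parameters()) + [p for g in gates for p in g.parameters()],
        lr=3e-5)
    opt_lam = torch.optim.SGD([lam1, lam2], lr=1e-2)
    second_teacher.eval()
    for _ in range(epochs):
        for images, texts, labels in loader:
            with torch.no_grad():
                t_logits = second_teacher(images, texts)
            s_logits = student(images, texts)
            loss = distillation_loss(s_logits, t_logits, labels)
            loss = loss + lagrangian_penalty(gates, target_sparsity, lam1, lam2)
            opt.zero_grad(); opt_lam.zero_grad()
            loss.backward()
            opt.step()  # inner min over theta and alpha
            # Outer max over lam1, lam2: ascend by flipping their gradients.
            for lam in (lam1, lam2):
                if lam.grad is not None:
                    lam.grad.neg_()
            opt_lam.step()
    return student  # the target model after S407
```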
Fig. 5 is a block diagram illustrating an object processing apparatus 500 according to an exemplary embodiment, and as shown in fig. 5, the apparatus 500 may include:
an object acquisition module 501, configured to acquire a target object to be processed, wherein the target object comprises a target image and/or target text;
an object processing module 502, configured to input the target object into a pre-trained target model to obtain a processing result corresponding to the target object, wherein the target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing includes pruning processing.
Fig. 6 is a block diagram illustrating another object processing apparatus according to an exemplary embodiment, and as shown in fig. 6, the apparatus may further include:
the model training module 601 is used for acquiring a pre-trained first teacher model; performing knowledge distillation on the first teacher model according to a first training sample to obtain a to-be-determined model; and performing the target compression processing on the undetermined model to obtain a target model.
In some embodiments, the target compression processing comprises first pruning processing; the model training module 601 is configured to determine a first target task type corresponding to the target model, and perform the first pruning processing on the pending model according to the first target task type to obtain a target model corresponding to the first target task type.
In some embodiments, the pending model includes a target encoder, and the model training module 601 is configured to perform a first pruning process on at least one network layer of at least one target encoder according to the first target task type.
In some embodiments, the target encoder comprises one or more of a first image encoder, a first text encoder, and a first cross-modality encoder.
In some embodiments, where the target encoder comprises the first image encoder or the first text encoder, the network layer of the target encoder comprises a feed-forward neural network and a self-attention module.
In some embodiments, where the target encoder comprises the first cross-modal encoder, the network layer of the first cross-modal encoder comprises a feed-forward neural network, a self-attention module, and a cross-attention module.
In some embodiments, the target compression processing comprises second pruning processing and knowledge distillation processing; the model training module 601 is configured to determine a second target task type corresponding to the target model, determine a second teacher model according to the second target task type and the first teacher model, and perform the second pruning processing and the knowledge distillation processing on the pending model according to the second target task type and the second teacher model, to obtain a target model corresponding to the second target task type.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Referring now to fig. 7, shown is a schematic diagram of an electronic device 2000 (e.g., a terminal device or a server) suitable for use in implementing embodiments of the present disclosure. The terminal device in the embodiments of the present disclosure may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle terminal (e.g., a car navigation terminal), and the like, and a fixed terminal such as a digital TV, a desktop computer, and the like. The electronic device shown in fig. 7 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 7, the electronic device 2000 may include a processing means (e.g., a central processing unit, a graphics processor, etc.) 2001, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 2002 or a program loaded from a storage means 2008 into a Random Access Memory (RAM) 2003. In the RAM 2003, various programs and data necessary for the operation of the electronic device 2000 are also stored. The processing device 2001, the ROM 2002, and the RAM 2003 are connected to each other by a bus 2004. An input/output (I/O) interface 2005 is also connected to the bus 2004.
Generally, the following devices may be connected to the input/output interface 2005: input devices 2006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 2007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 2008 including, for example, magnetic tapes, hard disks, and the like; and a communication device 2009. The communication means 2009 may allow the electronic device 2000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 7 illustrates an electronic device 2000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication device 2009, or installed from the storage device 2008, or installed from the ROM 2002. The computer program, when executed by the processing device 2001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
In some embodiments, the clients and servers may communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (e.g., a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future-developed network.
The computer readable medium may be embodied in the electronic device; or may be separate and not incorporated into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to: acquire a target object to be processed, wherein the target object comprises a target image and/or target text; and input the target object into a pre-trained target model to obtain a processing result corresponding to the target object, wherein the target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing includes pruning processing.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The modules described in the embodiments of the present disclosure may be implemented by software or hardware. The name of a module does not in some cases constitute a limitation of the module itself, and for example, the object acquisition module may also be described as a "module that acquires a target object to be processed".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field-Programmable Gate Arrays (FPGAs), Application-Specific Integrated Circuits (ASICs), Application-Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
According to one or more embodiments of the present disclosure, there is provided an object processing method including:
acquiring a target object to be processed; wherein the target object comprises a target image and/or target text;
inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object;
wherein the target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing includes pruning processing.
According to one or more embodiments of the present disclosure, the target model is trained by:
acquiring a pre-trained first teacher model;
performing knowledge distillation on the first teacher model according to the first training sample to obtain a pending model;
and performing the target compression processing on the pending model to obtain the target model.
According to one or more embodiments of the present disclosure, the target compression processing includes first pruning processing; and performing the target compression processing on the pending model to obtain the target model includes:
determining a first target task type corresponding to a target model;
and performing first pruning processing on the pending model according to the first target task type to obtain a target model corresponding to the first target task type.
According to one or more embodiments of the present disclosure, the pending model includes a target encoder, and performing the first pruning processing on the pending model according to the first target task type to obtain a target model corresponding to the first target task type includes:
and performing first pruning processing on at least one network layer of at least one target encoder according to the first target task type.
According to one or more embodiments of the present disclosure, the target encoder includes one or more of a first image encoder, a first text encoder, and a first cross-modality encoder.
In accordance with one or more embodiments of the present disclosure, where the target encoder comprises the first image encoder or the first text encoder, the network layer of the target encoder comprises a feed-forward neural network and a self-attention module.
In accordance with one or more embodiments of the present disclosure, where the target encoder comprises the first cross-modal encoder, the network layer of the first cross-modal encoder comprises a feed-forward neural network, a self-attention module, and a cross-attention module.
According to one or more embodiments of the present disclosure, the target compression processing includes second pruning processing and knowledge distillation processing; and performing the target compression processing on the pending model to obtain the target model includes:
determining a second target task type corresponding to the target model;
determining a second teacher model according to the second target task type and the first teacher model;
and performing the second pruning processing and the knowledge distillation processing on the pending model according to the second target task type and the second teacher model, to obtain a target model corresponding to the second target task type.
According to one or more embodiments of the present disclosure, there is provided an object processing apparatus including:
an object acquisition module, configured to acquire a target object to be processed, wherein the target object comprises a target image and/or target text;
an object processing module, configured to input the target object into a pre-trained target model to obtain a processing result corresponding to the target object, wherein the target model is a model obtained by performing target compression processing on a pending model, and the pending model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression processing includes pruning processing.
According to one or more embodiments of the present disclosure, the apparatus further comprises:
a model training module, configured to acquire a pre-trained first teacher model, perform knowledge distillation on the first teacher model according to a first training sample to obtain a pending model, and perform the target compression processing on the pending model to obtain the target model.
According to one or more embodiments of the present disclosure, the target compression processing includes first pruning processing; the model training module is configured to determine a first target task type corresponding to the target model, and perform the first pruning processing on the pending model according to the first target task type to obtain a target model corresponding to the first target task type.
According to one or more embodiments of the present disclosure, the pending model includes a target encoder, and the model training module is configured to perform a first pruning process on at least one network layer of at least one target encoder according to the first target task type.
According to one or more embodiments of the present disclosure, the target encoder includes one or more of a first image encoder, a first text encoder, and a first cross-modality encoder.
In accordance with one or more embodiments of the present disclosure, where the target encoder comprises the first image encoder or the first text encoder, the network layer of the target encoder comprises a feed-forward neural network and a self-attention module.
In accordance with one or more embodiments of the present disclosure, where the target encoder comprises the first cross-modal encoder, the network layer of the first cross-modal encoder comprises a feed-forward neural network, a self-attention module, and a cross-attention module.
According to one or more embodiments of the present disclosure, the target compression processing includes second pruning processing and knowledge distillation processing; the model training module is configured to determine a second target task type corresponding to the target model, determine a second teacher model according to the second target task type and the first teacher model, and perform the second pruning processing and the knowledge distillation processing on the pending model according to the second target task type and the second teacher model, to obtain a target model corresponding to the second target task type.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. Those skilled in the art will appreciate that the scope of the disclosure is not limited to technical solutions formed by the particular combinations of the features described above, but also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the spirit of the disclosure, for example, technical solutions formed by substituting the above features with (but not limited to) features having similar functions disclosed in this disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (11)

1. An object processing method, characterized in that the method comprises:
acquiring a target object to be processed; the target object comprises a target image and/or target text;
inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object;
the model to be determined is a model obtained by performing target compression treatment on a model to be determined, and the model to be determined is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; the target compression process includes a pruning process.
2. The method of claim 1, wherein the target model is trained by:
acquiring a pre-trained first teacher model;
performing knowledge distillation on the first teacher model according to the first training sample to obtain a to-be-determined model;
and performing the target compression processing on the to-be-determined model to obtain the target model.
3. The method according to claim 1, wherein the target compression processing includes first pruning processing, and performing the target compression processing on the to-be-determined model to obtain the target model comprises:
determining a first target task type corresponding to the target model; and
performing the first pruning processing on the to-be-determined model according to the first target task type to obtain a target model corresponding to the first target task type.
4. The method of claim 3, wherein the to-be-determined model comprises a target encoder, and the performing the first pruning processing on the to-be-determined model according to the first target task type to obtain the target model corresponding to the first target task type comprises:
performing the first pruning processing on at least one network layer of at least one target encoder according to the first target task type.
5. The method of claim 4, wherein the target encoder comprises one or more of a first image encoder, a first text encoder, and a first cross-modal encoder.
6. The method of claim 5, wherein in a case where the target encoder comprises the first image encoder or the first text encoder, the network layer of the target encoder comprises a feed-forward neural network and a self-attention module.
7. The method of claim 5, wherein in a case where the target encoder comprises the first cross-modal encoder, the network layer of the first cross-modal encoder comprises a feed-forward neural network, a self-attention module, and a cross-attention module.
8. The method according to claim 2, wherein the target compression processing includes second pruning processing and knowledge distillation processing, and the performing the target compression processing on the to-be-determined model to obtain the target model comprises:
determining a second target task type corresponding to the target model;
determining a second teacher model according to the second target task type and the first teacher model;
and performing, according to the second target task type and the second teacher model, the second pruning processing and the knowledge distillation processing on the to-be-determined model to obtain a target model corresponding to the second target task type.
9. An object processing apparatus, characterized in that the apparatus comprises:
the object acquisition module is used for acquiring a target object to be processed; the target object comprises a target image and/or target text;
the object processing module is used for inputting the target object into a pre-trained target model to obtain a processing result corresponding to the target object; wherein the target model is a model obtained by performing target compression processing on a to-be-determined model, and the to-be-determined model is a model obtained by performing knowledge distillation on a pre-trained first teacher model according to a first training sample; and the target compression processing includes pruning processing.
10. A computer-readable medium having a computer program stored thereon, wherein the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1 to 8.
11. An electronic device, comprising:
a storage device having a computer program stored thereon;
a processing apparatus for executing the computer program in the storage device to implement the steps of the method according to any one of claims 1 to 8.
CN202210940360.4A 2022-08-05 2022-08-05 Object processing method and device, readable medium and electronic equipment Pending CN115270981A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210940360.4A CN115270981A (en) 2022-08-05 2022-08-05 Object processing method and device, readable medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210940360.4A CN115270981A (en) 2022-08-05 2022-08-05 Object processing method and device, readable medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN115270981A true CN115270981A (en) 2022-11-01

Family

ID=83749861

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210940360.4A Pending CN115270981A (en) 2022-08-05 2022-08-05 Object processing method and device, readable medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN115270981A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2024097167A1 (en) * 2022-11-04 2024-05-10 Nec Laboratories America, Inc. Visual question answering with unlabeled image augmentation

Similar Documents

Publication Publication Date Title
CN110321958B (en) Training method of neural network model and video similarity determination method
CN111767371B (en) Intelligent question-answering method, device, equipment and medium
CN110929780B (en) Video classification model construction method, video classification device, video classification equipment and medium
CN110413812B (en) Neural network model training method and device, electronic equipment and storage medium
CN112883968B (en) Image character recognition method, device, medium and electronic equipment
CN110084317B (en) Method and device for recognizing images
CN113313064A (en) Character recognition method and device, readable medium and electronic equipment
CN113449070A (en) Multimodal data retrieval method, device, medium and electronic equipment
CN113222983A (en) Image processing method, image processing device, readable medium and electronic equipment
CN116310582A (en) Classification model training method, image classification method, device, medium and equipment
CN115908640A (en) Method and device for generating image, readable medium and electronic equipment
CN115578570A (en) Image processing method, device, readable medium and electronic equipment
CN114067327A (en) Text recognition method and device, readable medium and electronic equipment
CN115270981A (en) Object processing method and device, readable medium and electronic equipment
CN111915689B (en) Method, apparatus, electronic device, and computer-readable medium for generating an objective function
CN112241761B (en) Model training method and device and electronic equipment
CN117036827A (en) Multi-mode classification model training, video classification method, device, medium and equipment
CN116912734A (en) Video abstract data set construction method, device, medium and electronic equipment
CN115546487A (en) Image model training method, device, medium and electronic equipment
CN111737575B (en) Content distribution method, content distribution device, readable medium and electronic equipment
CN113610228B (en) Method and device for constructing neural network model
CN113435528B (en) Method, device, readable medium and electronic equipment for classifying objects
CN116092092A (en) Matching method, device, medium and electronic equipment
CN113220922B (en) Image searching method and device and electronic equipment
CN111680754B (en) Image classification method, device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination