CN114627282A - Target detection model establishing method, application method, device, apparatus and medium


Info

Publication number
CN114627282A
CN114627282A (application CN202210254685.7A)
Authority
CN
China
Prior art keywords
target detection
detection model
network
depth separable
digital image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210254685.7A
Other languages
Chinese (zh)
Inventor
郑喜民
贾云舒
周成昊
舒畅
陈又新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202210254685.7A priority Critical patent/CN114627282A/en
Priority to PCT/CN2022/090664 priority patent/WO2023173552A1/en
Publication of CN114627282A publication Critical patent/CN114627282A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F 18/253 — Pattern recognition; analysing; fusion techniques of extracted features
    • G06N 3/045 — Neural networks; architecture; combinations of networks
    • G06N 3/082 — Neural networks; learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections


Abstract

The invention discloses a method for establishing a target detection model, together with an application method, a device, an apparatus and a medium, applicable to the field of image recognition. The establishing method comprises: acquiring a basic target detection network, replacing its ordinary convolution layers with depth separable convolution layers, and adding a multi-scale feature fusion mechanism to obtain an initial target detection model; acquiring a preset digital image and inputting it into the initial target detection model; performing feature extraction on the preset digital image through the depth separable convolution layers of the initial target detection model and outputting a feature map; performing target detection on the feature map through the multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model; and optimizing the intermediate target detection model with the NetAdapt algorithm and a pruning algorithm to obtain a final target detection model. The invention can effectively improve the target detection efficiency of embedded devices.

Description

Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
Technical Field
The invention relates to the field of image recognition, in particular to a method for establishing a target detection model, an application method, equipment, a device and a medium.
Background
Target detection is widely used in fields such as robot navigation, intelligent video surveillance, industrial inspection and aerospace. It is also a prerequisite for many vision tasks, playing a vital role in downstream tasks such as face recognition, gait recognition, crowd counting and instance segmentation. With the wide adoption of deep learning, target detection algorithms have developed rapidly. Deep-learning-based target detection algorithms fall mainly into two categories: (1) two-stage detection, which first generates candidate regions containing approximate target position information and then, in a second stage, classifies the candidate regions and refines their positions; and (2) single-stage detection, which directly predicts class probabilities and the corresponding position coordinates of objects.
Existing target detection algorithms have relatively large numbers of network parameters. In practice, real-time detection can be achieved only by relying on large computers such as servers; when the network structure is ported to embedded devices such as mobile phones, this effect is difficult to achieve, because the processor performance of embedded devices is far inferior to that of a server.
Disclosure of Invention
The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.
The embodiment of the invention provides a method for establishing a target detection model, an application method, equipment, a device and a medium, which can effectively improve the target detection efficiency of embedded equipment.
In a first aspect, an embodiment of the present invention provides a method for establishing a target detection model, including: acquiring a basic target detection network, replacing a common convolution layer of the basic target detection network with a depth separable convolution layer, and adding a multi-scale feature fusion mechanism to the basic target detection network to obtain an initial target detection model; acquiring a preset digital image, and inputting the preset digital image to the initial target detection model; performing feature extraction on the preset digital image through the depth separable convolution layer of the initial target detection model, and outputting a feature map; performing target detection on the feature map through the multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model; and optimizing the intermediate target detection model by adopting a NetAdapt algorithm and a pruning algorithm to obtain a final target detection model.
In some embodiments, the NetAdapt algorithm optimizes convolution kernels of the depth separable convolution layers of the intermediate target detection model; and the pruning algorithm is used for optimizing the network structure of the intermediate target detection model.
In some embodiments, performing feature extraction on the preset digital image through the depth separable convolution layer of the initial target detection model and outputting a feature map comprises: performing channel dimension-raising on the preset digital image using point-by-point (1×1) convolution; performing feature extraction on the dimension-raised image using depthwise convolution to obtain a plurality of initial feature maps; and performing channel dimension-reduction on the plurality of initial feature maps using point-by-point convolution, outputting a final feature map.
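The parameter saving of this pointwise–depthwise–pointwise structure can be illustrated by counting weights. The sketch below is our own illustration (the function names and the 4× expansion factor are assumptions, not taken from the patent); it compares a standard k×k convolution with the three-step separable block described above:

```python
def standard_conv_params(c_in, c_out, k):
    # A standard k x k convolution mixes every input channel into every
    # output channel: k * k * c_in weights per output channel.
    return k * k * c_in * c_out

def separable_block_params(c_in, c_mid, c_out, k):
    # Step 1: point-by-point (1x1) convolution raises the channel
    # dimension from c_in to c_mid.
    expand = c_in * c_mid
    # Step 2: depthwise k x k convolution filters each of the c_mid
    # channels independently (no cross-channel mixing).
    depthwise = k * k * c_mid
    # Step 3: point-by-point (1x1) convolution reduces the channel
    # dimension from c_mid to c_out, producing the final feature map.
    project = c_mid * c_out
    return expand + depthwise + project

# Example: 32 input channels, 64 output channels, 3x3 kernels.
standard = standard_conv_params(32, 64, 3)          # 18432 weights
separable = separable_block_params(32, 128, 64, 3)  # 13440 weights
```

Even with a 4× channel expansion in the middle, the separable block uses fewer weights than the standard convolution; without expansion (c_mid = c_in) the saving is larger still, which is why the data-processing burden on an embedded processor drops.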
In some embodiments, performing target detection on the feature map through the multi-scale feature fusion mechanism of the initial target detection model comprises: obtaining the height and width of a first final feature map output by a first depth separable convolution layer; obtaining a second final feature map output by a second depth separable convolution layer and adjusting its height and width to match those of the first final feature map; performing channel splicing and convolution on the adjusted second final feature map and the first final feature map to obtain fused features; and performing target detection according to the fused features.
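The adjust-then-splice step above can be sketched as follows. This is a toy illustration under our own assumptions (nested lists stand in for tensors, and nearest-neighbour upsampling is one common way to adjust the spatial size — the patent does not specify the resizing method); a real implementation would follow the concatenation with a convolution:

```python
def upsample_nearest(fmap, out_h, out_w):
    """Resize a [channels][h][w] feature map with nearest-neighbour sampling."""
    in_h, in_w = len(fmap[0]), len(fmap[0][0])
    return [[[channel[r * in_h // out_h][c * in_w // out_w]
              for c in range(out_w)]
             for r in range(out_h)]
            for channel in fmap]

def fuse_feature_maps(shallow, deep):
    """Adjust `deep` to `shallow`'s height and width, then splice channels."""
    h, w = len(shallow[0]), len(shallow[0][0])
    resized = upsample_nearest(deep, h, w)
    return shallow + resized  # channel concatenation

# A 1-channel 4x4 shallow map and a 1-channel 2x2 deep map.
shallow = [[[1, 1, 2, 2],
            [1, 1, 2, 2],
            [3, 3, 4, 4],
            [3, 3, 4, 4]]]
deep = [[[5, 6],
         [7, 8]]]
fused = fuse_feature_maps(shallow, deep)  # 2 channels, each 4x4
```

After fusion the detector sees both the shallow map's fine spatial detail and the deep map's semantic features in one stack of channels.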
In some embodiments, optimizing the intermediate target detection model using the NetAdapt algorithm and the pruning algorithm comprises: the NetAdapt algorithm optimizing the convolution kernels of one original depth separable convolution network layer to obtain a plurality of second depth separable convolution networks; and the NetAdapt algorithm comparing the latency and accuracy of each second depth separable convolution network with those of the corresponding original depth separable convolution network, and selecting a final depth separable convolution network according to the comparison result.
In some embodiments, selecting the final depth separable convolution network according to the comparison result comprises: when the latency of the second depth separable convolution network is greater than that of the original depth separable convolution network and/or its accuracy is lower, selecting the original depth separable convolution network as the final depth separable convolution network; and when the latency of the second depth separable convolution network is lower than that of the original depth separable convolution network and its accuracy is higher, selecting the second depth separable convolution network as the final depth separable convolution network.
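The selection rule above amounts to accepting a kernel-reduced candidate only when it strictly improves on both measured axes. A minimal sketch (the dictionary keys and function name are our own, not from the patent):

```python
def select_final_network(original, candidate):
    """Keep a kernel-reduced candidate only if it is strictly better on
    both latency (lower) and accuracy (higher); otherwise fall back to
    the original depth separable convolution network."""
    better_latency = candidate["latency"] < original["latency"]
    better_accuracy = candidate["accuracy"] > original["accuracy"]
    return candidate if (better_latency and better_accuracy) else original

original = {"name": "original", "latency": 12.0, "accuracy": 0.81}
faster_and_better = {"name": "candidate", "latency": 9.0, "accuracy": 0.83}
faster_but_worse = {"name": "candidate", "latency": 9.0, "accuracy": 0.78}
```

Note that under this rule a candidate that is faster but less accurate (or more accurate but slower) is rejected, so each accepted NetAdapt step is a strict improvement.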
In some embodiments, the method further comprises: the pruning algorithm pruning the network structure of the intermediate target detection model to remove its redundant weight parameters; and the pruning algorithm fine-tuning the pruned intermediate target detection model.
In some embodiments, removing the redundant weight parameters of the network structure comprises: encoding the network structure based on the number of channels in each layer of the pruned intermediate target detection model to obtain a plurality of encoding vectors; inputting an encoding vector into the intermediate target detection model and generating pruned network weights; obtaining the performance of the pruned intermediate target detection model from the network structure, the network weights and a preset validation set; and screening the plurality of encoding vectors with an evolutionary algorithm to obtain a final encoding vector, from which the final target detection model is obtained.
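The steps above — encoding each layer's channel count as a vector, evaluating the resulting pruned network, and screening the vectors with an evolutionary algorithm — can be sketched as follows. This is a toy illustration under our own assumptions: the fitness function is a stand-in for "performance on the preset validation set", and the mutation scheme (perturb one layer's channel count by ±4) is arbitrary:

```python
import random

def evolve_channel_encoding(base_channels, fitness, generations=20,
                            population_size=8, seed=0):
    """Screen channel-count encoding vectors with a simple evolutionary loop."""
    rng = random.Random(seed)

    def mutate(vec):
        # Perturb one layer's channel count by +/-4, keeping it positive.
        out = list(vec)
        i = rng.randrange(len(out))
        out[i] = max(1, out[i] + rng.choice([-4, 4]))
        return out

    population = [mutate(base_channels) for _ in range(population_size)]
    for _ in range(generations):
        # Keep the fitter half, refill by mutating survivors.
        population.sort(key=fitness, reverse=True)
        survivors = population[:population_size // 2]
        offspring = [mutate(rng.choice(survivors))
                     for _ in range(population_size - len(survivors))]
        population = survivors + offspring
    return max(population, key=fitness)

def toy_fitness(encoding):
    # Stand-in objective: prefer fewer total channels, but reject
    # encodings that prune any layer below 8 channels.
    if any(c < 8 for c in encoding):
        return float("-inf")
    return -sum(encoding)

best = evolve_channel_encoding([32, 64, 128], toy_fitness)
```

In the patent's setting the fitness would come from running the decoded, pruned network against the preset validation set rather than from a closed-form objective.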
In some embodiments, the initial target detection model further comprises a system loss function, which comprises a bounding box coordinate error function, a bounding box confidence error function and a classification error function.
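The patent names the three components of the system loss but gives no formula. A common arrangement (our assumption, in the style of single-stage detectors) is a weighted sum, where the weights rebalance the localization error against the confidence and classification errors:

```python
def system_loss(coord_error, confidence_error, classification_error,
                w_coord=5.0, w_conf=1.0, w_cls=1.0):
    """Weighted sum of the three error terms; the weights are illustrative."""
    return (w_coord * coord_error
            + w_conf * confidence_error
            + w_cls * classification_error)

loss = system_loss(0.2, 0.4, 0.1)  # 5.0*0.2 + 1.0*0.4 + 1.0*0.1 = 1.5
```

Each error term would itself be computed over all predicted boxes; the weights shown (e.g. emphasizing coordinate error) are hypothetical defaults, not values from the patent.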
In a second aspect, an embodiment of the present invention provides an application method of a target detection model, including: acquiring an actual digital image, and inputting the actual digital image to a final target detection model; performing feature extraction on the actual digital image through the depth separable convolution layer of the target detection model, and outputting a feature map; and carrying out target detection on the feature map through a multi-scale feature fusion mechanism of the target detection model.
In a third aspect, an embodiment of the present invention provides an apparatus for establishing a target detection model, including: a network modification module for acquiring a basic target detection network, replacing ordinary convolution layers of the basic target detection network with depth separable convolution layers, and adding a multi-scale feature fusion mechanism to the basic target detection network to obtain an initial target detection model; a digital image acquisition module for acquiring a preset digital image and inputting it into the initial target detection model; a feature extraction module for performing feature extraction on the preset digital image through the depth separable convolution layer of the initial target detection model and outputting a feature map; a target detection module for performing target detection on the feature map through the multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model; and a model optimization module for optimizing the intermediate target detection model using the NetAdapt algorithm and a pruning algorithm to obtain a final target detection model.
In a fourth aspect, an embodiment of the present invention provides a target detection apparatus, including: a digital image acquisition module for acquiring an actual digital image and inputting it into the target detection model; a feature extraction module for performing feature extraction on the actual digital image through the depth separable convolution layer of the target detection model and outputting a feature map; and a target detection module for performing target detection on the feature map through the multi-scale feature fusion mechanism of the target detection model.
In a fifth aspect, an embodiment of the present invention provides a target detection device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the computer program, the processor implements the method for establishing a target detection model according to the first aspect and/or the method for applying a target detection model according to the second aspect.
In a sixth aspect, an embodiment of the present invention provides a computer-readable storage medium storing computer-executable instructions configured to perform the method for establishing a target detection model according to the first aspect and/or the method for applying a target detection model according to the second aspect.
The embodiments of the invention have the following beneficial effects. An initial target detection model is obtained by acquiring a basic target detection network, replacing its ordinary convolution layers with depth separable convolution layers, and adding a multi-scale feature fusion mechanism; a preset digital image is acquired and input into the initial target detection model; features are extracted from the preset digital image through the depth separable convolution layers and a feature map is output; target detection is performed on the feature map through the multi-scale feature fusion mechanism to obtain an intermediate target detection model; and the intermediate model is optimized with the NetAdapt algorithm and a pruning algorithm to obtain the final target detection model. Because depth separable convolution requires far fewer parameters than ordinary convolution, feature extraction places a lighter data-processing burden on the processor of an embedded device. The multi-scale feature fusion mechanism lets the model learn both deep and shallow features during detection, yielding better feature representation and higher detection accuracy.
The preset digital image is input into the initial target detection model, which is trained to obtain the intermediate target detection model. The NetAdapt algorithm and the pruning algorithm then miniaturize the intermediate model into the final target detection model, accelerating inference; the final model can run on different embedded devices with higher overall detection speed, improving the effect of target detection on embedded devices.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification; they illustrate embodiments of the invention and, together with the description, serve to explain the principles of the invention without limiting it.
FIG. 1 is a schematic diagram of a system architecture platform for building and applying a target detection model according to an embodiment of the present invention;
FIG. 2 is a flowchart of a method for building a target detection model according to an embodiment of the present invention;
FIG. 3 is a flow chart of a feature extraction method provided by an embodiment of the invention;
FIG. 4 is a flowchart of a method for target detection using a multi-scale feature fusion mechanism according to an embodiment of the present invention;
FIG. 5 is a flowchart of the NetAdapt algorithm optimization process provided by an embodiment of the present invention;
FIG. 6 is a flow chart of a pruning algorithm optimization process provided by an embodiment of the present invention;
FIG. 7 is a flowchart of a method for applying a target detection model according to an embodiment of the present invention;
FIG. 8 is a schematic structural diagram of a target detection model building apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic structural diagram of a target detection apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In the description of the present invention, "several" means one or more and "a plurality of" means two or more; "greater than", "less than", "exceeding" and the like are understood to exclude the stated number, while "above", "below", "within" and the like are understood to include it. Where "first" and "second" are used, they serve only to distinguish technical features and are not to be understood as indicating or implying relative importance, the number of the indicated technical features, or their precedence.
In the description of the present invention, unless otherwise explicitly limited, terms such as "arranged", "installed" and "connected" should be understood in a broad sense; those skilled in the art can reasonably determine the specific meanings of these terms in combination with the specific content of the technical solutions.
It should be noted that although functional blocks are partitioned in a schematic diagram of an apparatus and a logical order is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the partitioning of blocks in the apparatus or the order in the flowchart. The terms first, second and the like in the description and in the claims, and the drawings described above, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
First, several terms related to the present invention are analyzed:
Artificial Intelligence (AI): a field of science and technology that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. As a branch of computer science, it attempts to understand the essence of intelligence and to produce intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking; it uses digital computers, or machines controlled by digital computers, to perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
Natural Language Processing (NLP): NLP uses computer to process, understand and use human language (such as chinese, english, etc.), and belongs to a branch of artificial intelligence, which is a cross discipline between computer science and linguistics, also commonly called computational linguistics. Natural language processing includes parsing, semantic analysis, discourse understanding, and the like. Natural language processing is commonly used in the technical fields of machine translation, handwriting and print character recognition, speech recognition and text-to-speech conversion, information retrieval, information extraction and filtering, text classification and clustering, public opinion analysis and opinion mining, and relates to data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language calculation, and the like related to language processing.
Information Extraction (IE): a text processing technology that extracts specified types of factual information — entities, relations, events and the like — from natural language text and outputs structured data. Text data is composed of specific units such as sentences, paragraphs and chapters, and text information is composed of smaller units such as words, phrases, sentences and paragraphs, or combinations thereof. Extracting noun phrases, personal names, place names and the like from text data is text information extraction; the extracted information can, of course, be of many kinds.
In the related art, target detection is a hot topic in computer vision and digital image processing, widely used in robot navigation, intelligent video surveillance, industrial inspection, aerospace and other fields; it is also a prerequisite for many vision tasks, playing a vital role in downstream tasks such as face recognition, gait recognition, crowd counting and instance segmentation. With the wide adoption of deep learning, target detection algorithms have developed rapidly. Deep-learning-based target detection algorithms fall mainly into two categories: (1) two-stage detection, which first generates candidate regions containing approximate target position information and then classifies the candidate regions and refines their positions; and (2) single-stage detection, which directly predicts class probabilities and the corresponding position coordinates of objects. Compared with two-stage algorithms, existing single-stage algorithms need no candidate regions, so the whole process is simpler and faster, but their accuracy is not high enough; two-stage algorithms guarantee accuracy but are not fast enough. Moreover, existing detection networks have relatively large numbers of parameters, so real-time detection is achievable only on large computers such as servers and is difficult to reach when the network structure is ported to embedded devices such as mobile phones, whose processor performance is far inferior to that of a server.
Based on this, the embodiment of the invention provides a method for establishing a target detection model, an application method, equipment, a device and a medium, which can effectively improve the target detection efficiency of embedded equipment.
The embodiment of the invention can acquire and process related data based on an artificial intelligence technology. Among them, Artificial Intelligence (AI) is a theory, method, technique and application system that simulates, extends and expands human Intelligence using a digital computer or a machine controlled by a digital computer, senses the environment, acquires knowledge and uses the knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
The embodiment of the invention provides methods for establishing and applying a target detection model, relating to the technical fields of artificial intelligence and digital healthcare. The methods provided by the embodiment of the invention can be applied to a terminal, a server, or software running in a terminal or server. In some embodiments, the terminal may be a smartphone, tablet, laptop, desktop computer, or the like; the server may be configured as an independent physical server, as a server cluster or distributed system composed of multiple physical servers, or as a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and big data and artificial intelligence platforms; the software may be an application implementing the target detection model, but is not limited to the above forms.
The invention is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The embodiments of the present invention will be further explained with reference to the drawings.
As shown in fig. 1, fig. 1 is a schematic diagram of a system architecture platform for building and applying a target detection model according to an embodiment of the present invention.
The system architecture platform 100 of the present invention includes one or more processors 110 and a memory 120, and fig. 1 illustrates one processor 110 and one memory 120 as an example.
The processor 110 and the memory 120 may be connected by a bus or other means, such as the bus connection shown in FIG. 1.
The memory 120, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs and non-transitory computer-executable programs. Further, the memory 120 may include high-speed random access memory, and may also include non-transitory memory such as at least one disk storage device, flash memory device, or other non-transitory solid-state storage device. In some embodiments, the memory 120 optionally includes memory located remotely from the processor 110, which may be connected to the system architecture platform 100 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
Those skilled in the art will appreciate that the device architecture shown in fig. 1 does not constitute a limitation of system architecture platform 100, and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.
As shown in fig. 2, fig. 2 is a flowchart of a method for establishing a target detection model according to an embodiment of the present invention, and the method for establishing a target detection model according to an embodiment of the present invention includes, but is not limited to, step S200, step S210, step S220, step S230, step S240, and step S250.
Step S200, acquiring a basic target detection network;
step S210, replacing the common convolution layer of the basic target detection network with a depth separable convolution layer, and adding a multi-scale feature fusion mechanism to the basic target detection network to obtain an initial target detection model;
step S220, acquiring a preset digital image, and inputting the preset digital image to an initial target detection model;
step S230, performing feature extraction on a preset digital image through the depth separable convolution layer of the initial target detection model, and outputting a feature map;
step S240, carrying out target detection on the feature map through a multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model;
and S250, optimizing the intermediate target detection model by adopting a NetAdapt algorithm and a pruning algorithm to obtain a final target detection model.
In the embodiment of the invention, a basic target detection network is obtained whose backbone structure has been replaced with a lightweight neural network, i.e., a neural network model that requires fewer parameters and incurs a lower computational cost. Because of this low computational overhead, the lightweight model can be deployed on devices with limited computing resources, such as smartphones, tablet computers, or other embedded devices.
The common convolutional layers of the basic target detection network are replaced with depth separable convolution layers. In a depth separable convolution, each channel is convolved independently, and the resulting feature maps are then weighted and combined along the depth direction by a point-by-point convolution to generate new feature maps. This requires fewer parameters than common convolution and effectively reduces the data-processing burden on the embedded device's processor; for the same parameter budget, a neural network built from depth separable convolutions can be made deeper, which greatly improves the feature-extraction effect for target detection.
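The parameter saving can be illustrated with a short calculation; the channel counts and kernel size below are illustrative assumptions, not values fixed by the embodiment:

```python
# Parameter-count comparison: common (standard) convolution vs. depth separable
# convolution. All sizes here are illustrative assumptions.

def standard_conv_params(c_in, c_out, k):
    # Each of the c_out filters spans all c_in channels with a k x k kernel.
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    depthwise = c_in * k * k          # one k x k kernel per input channel
    pointwise = c_in * c_out * 1 * 1  # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

std = standard_conv_params(32, 64, 3)        # 18432
sep = depthwise_separable_params(32, 64, 3)  # 288 + 2048 = 2336
print(std, sep, round(std / sep, 1))
```

For this assumed layer the separable variant needs roughly an eighth of the parameters, which is the saving that allows a deeper network at the same parameter budget.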
The multi-scale feature fusion mechanism is added to the basic target detection network so that the target detection model can learn deep features and shallow features simultaneously during target detection, which yields better feature representations and enhances the precision of target detection.
The common convolution layers of the basic target detection network are replaced with depth separable convolution layers, and the multi-scale feature fusion mechanism is added to obtain an initial target detection model. A preset digital image is then acquired and input into the initial target detection model for model training. During training, features of the preset digital image are extracted through the depth separable convolution layers of the initial target detection model and feature maps are output; target detection is performed on the feature maps through the multi-scale feature fusion mechanism of the initial target detection model, yielding an intermediate target detection model. Because the intermediate target detection model is still too bulky, it is optimized with the NetAdapt (Platform-Aware Neural Network Adaptation for Mobile Applications) algorithm and a pruning algorithm to obtain the final target detection model. Miniaturizing the intermediate model with these two algorithms achieves inference acceleration, allows the final target detection model to run on different embedded devices, raises the overall detection speed, and improves the effect of target detection on embedded devices.
It should be noted that, in this embodiment, the NetAdapt algorithm performs optimization processing on the convolution kernel of the depth separable convolution layer of the intermediate target detection model; and the pruning algorithm optimizes the network structure of the intermediate target detection model.
In addition, in this embodiment, the model training is a conventional model training procedure.
As shown in fig. 3, fig. 3 is a flowchart of a feature extraction method according to an embodiment of the present invention, which includes, but is not limited to, step S300, step S310, and step S320.
Step S300, performing channel dimension-increasing processing on a preset digital image by utilizing point-by-point convolution;
step S310, performing feature extraction processing on the preset digital image subjected to channel dimension increasing by utilizing depth convolution to obtain a plurality of initial feature maps;
and step S320, performing channel dimensionality reduction processing on the plurality of initial feature maps by utilizing point-by-point convolution, and outputting a final feature map.
In the embodiment of the invention, feature extraction is performed on the preset digital image through the depth separable convolution layer, and the final feature map is output. The depth separable convolution (Depthwise Separable Convolution) mainly comprises a depthwise convolution (Depthwise Convolution) and a pointwise convolution (Pointwise Convolution).
After the depth separable convolution layer receives the preset digital image, a point-by-point convolution first raises the channel dimension of the image. Because of how it is computed, a depthwise convolution cannot change the number of channels: it can only output as many channels as the previous layer provides. If the previous layer provides few channels, the depthwise convolution can only extract features in a low-dimensional space, and the effect is not good enough. To mitigate this, a point-by-point convolution is applied before the depthwise convolution to raise the image's channel dimension, with a channel-expansion coefficient T, so that regardless of the number of input channels, the depthwise convolution works efficiently in a relatively higher-dimensional space after the point-by-point channel expansion.
The channel-expanded preset digital image is then processed by the depthwise convolution. Its convolution kernels correspond one-to-one with the input channels: each kernel is responsible for exactly one input channel, and each input channel is convolved by exactly one kernel, so the number of output initial feature maps equals the number of kernels. Because the point-by-point convolution has already raised the channel dimension before the depthwise feature extraction, the depthwise convolution can produce a plurality of initial feature maps.
Finally, a point-by-point convolution reduces the channel dimension of the plurality of initial feature maps, combining the features extracted by the depthwise convolution, and outputs the final feature map. This dimensionality reduction preserves network performance while making the network lighter, and the lower-dimensional features still contain all the necessary information.
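The three-step flow above (point-by-point channel expansion, depthwise feature extraction, point-by-point channel reduction) can be sketched in plain NumPy; every shape, as well as the expansion coefficient T = 4, is an illustrative assumption rather than a value fixed by the embodiment:

```python
import numpy as np

def pointwise_conv(x, w):
    # x: (C_in, H, W); w: (C_out, C_in) -> (C_out, H, W). A 1x1 convolution
    # is a per-pixel linear map over the channel dimension.
    c, h, wd = x.shape
    return (w @ x.reshape(c, -1)).reshape(w.shape[0], h, wd)

def depthwise_conv(x, k):
    # x: (C, H, W); k: (C, kh, kw). Each channel is convolved with its own
    # kernel ("valid" padding), so the channel count is unchanged.
    c, h, wd = x.shape
    _, kh, kw = k.shape
    out = np.zeros((c, h - kh + 1, wd - kw + 1))
    for ci in range(c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[ci, i, j] = np.sum(x[ci, i:i + kh, j:j + kw] * k[ci])
    return out

def separable_block(x, w_expand, k_depth, w_project):
    x = pointwise_conv(x, w_expand)      # step S300: raise channel dimension
    x = depthwise_conv(x, k_depth)       # step S310: per-channel extraction
    return pointwise_conv(x, w_project)  # step S320: reduce channel dimension

rng = np.random.default_rng(0)
t = 4                                    # assumed channel-expansion coefficient T
x = rng.standard_normal((8, 16, 16))     # toy stand-in for the preset digital image
y = separable_block(
    x,
    rng.standard_normal((8 * t, 8)),     # expand: 8 -> 32 channels
    rng.standard_normal((8 * t, 3, 3)),  # depthwise 3x3, 32 kernels
    rng.standard_normal((16, 8 * t)),    # project: 32 -> 16 channels
)
print(y.shape)  # (16, 14, 14)
```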
As shown in fig. 4, fig. 4 is a flowchart of a method for detecting a target by using a multi-scale feature fusion mechanism according to an embodiment of the present invention, and the method for detecting a target according to an embodiment of the present invention includes, but is not limited to, step S400, step S410, step S420, and step S430.
Step S400, obtaining a first final feature map output by the first depth separable convolution layer and the height and width of the first final feature map;
step S410, obtaining a second final feature map output by the second depth separable convolution layer and adjusting the height and width of the second final feature map to make the height and width of the second final feature map the same as the height and width of the first final feature map;
step S420, channel splicing and convolution are carried out on the adjusted second final feature map and the first final feature map to obtain fusion features;
and step S430, carrying out target detection according to the fusion characteristics.
In the embodiment of the invention, a multi-scale feature fusion mechanism is used for target detection. A high-level feature network has a large receptive field and a strong ability to represent semantic information, but its resolution is low and its representation of spatial detail is weak (it lacks fine spatial-geometric features). A low-level feature network has a small receptive field and, thanks to its high resolution, represents geometric detail well, but its ability to represent semantic information is weak. It is the semantic information of the high-level feature network that allows the target to be accurately detected or segmented; the two kinds of features are therefore combined in target detection to improve the detection and segmentation effect. A small-scale feature network has a large receptive field and is suited to detecting large targets, while a large-scale feature network has a small receptive field and is suited to detecting small targets. Using several feature networks at different scales to detect targets of different sizes enhances the precision of target detection.
The method obtains the first final feature map output by the first depth separable convolution layer together with its height and width; obtains the second final feature map output by the second depth separable convolution layer and adjusts its height and width to match those of the first; performs channel splicing and convolution on the adjusted second final feature map and the first final feature map to obtain fused features; and carries out target detection on these fused features. By resizing the feature maps output by a later layer and an earlier layer of the network and splicing them, deep features and shallow features are detected simultaneously, which improves the expressive power of the features and, at the same time, the detection model's ability to handle targets of different sizes.
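A minimal sketch of steps S400 to S420, assuming nearest-neighbor resizing and illustrative feature-map shapes (a 1×1 convolution over the concatenated channels would normally follow the splice):

```python
import numpy as np

def nearest_upsample(x, factor):
    # x: (C, H, W); repeat rows and columns to enlarge the height and width.
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

def fuse(shallow, deep):
    # Step S410: resize the deep (second) map to the shallow (first) map's H, W.
    factor = shallow.shape[1] // deep.shape[1]
    deep_up = nearest_upsample(deep, factor)
    # Step S420: channel splicing; the subsequent convolution is omitted here.
    return np.concatenate([shallow, deep_up], axis=0)

shallow = np.ones((16, 26, 26))  # earlier layer: high resolution, shallow features
deep = np.ones((32, 13, 13))     # later layer: low resolution, deep semantics
fused = fuse(shallow, deep)
print(fused.shape)  # (48, 26, 26)
```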
It should be noted that, when the initial target detection model is trained, 4 bounding boxes are predicted for each feature grid cell. If the center point of an object falls within a cell, only the bounding box with the largest IOU (Intersection over Union) overlap with the ground-truth box is selected for detection, and the other bounding boxes with smaller IOU values are discarded. This improves the model's ability to detect targets of different sizes and the generalization ability of the grid cells' bounding boxes.
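The IOU-based selection of the responsible bounding box can be sketched as follows; the box coordinates are illustrative:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); intersection over union in [0, 1].
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def pick_responsible_box(predicted, truth):
    # Of the (here, 4) boxes predicted by one feature grid cell, keep only
    # the one with the largest IOU against the ground-truth box.
    return max(predicted, key=lambda p: iou(p, truth))

truth = (10, 10, 50, 50)
boxes = [(0, 0, 40, 40), (12, 8, 52, 48), (30, 30, 90, 90), (5, 5, 20, 20)]
print(pick_responsible_box(boxes, truth))  # (12, 8, 52, 48)
```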
As shown in fig. 5, fig. 5 is a flowchart of NetAdapt algorithm optimization processing provided by the embodiment of the present invention, and the NetAdapt algorithm optimization processing provided by the embodiment of the present invention includes, but is not limited to, step S500 and step S510.
Step S500, optimizing convolution kernels of a layer of original depth separable convolution network to obtain a plurality of second depth separable convolution networks;
step S510, comparing the time delay and the precision of a second depth separable convolutional network with the original depth separable convolutional network corresponding to the second depth separable convolutional network, and selecting a final depth separable convolutional network according to the comparison result.
In the prior art, there are generally two common approaches to designing an efficient network structure: 1. design a single network model uniformly, regardless of platform, in which case the designed model performs differently on different platforms; 2. hand-design a network structure for the hardware of a given platform, which requires detailed knowledge of the underlying hardware and means the structure must be redesigned whenever the platform changes.
Neither method meets the requirements of the invention. The embodiment of the invention instead adopts the NetAdapt network compression algorithm: the optimized network is deployed on the device to obtain actual performance metrics directly, and these measured metrics then guide the next compression strategy, so that compression proceeds iteratively until the final result is reached. NetAdapt optimizes the network in an automated fashion, gradually reducing the resource consumption of the pre-trained network while maximizing accuracy; the optimization loop runs until the resource budget is met. With this design, NetAdapt can not only generate a network that meets the budget but also a series of simplified networks with different trade-offs, enabling dynamic network selection and further research.
In an embodiment of the invention, the NetAdapt algorithm searches for the number of convolution kernels in each depth separable convolution network layer. The number of kernels in each layer is optimized, and from the set of second depth separable convolution networks that satisfy the latency constraint, a network with high precision and small delay is finally chosen as the final depth separable convolution network. Accuracy is maintained while the target detection model is optimized, reducing the size of the expansion layer and of the bottleneck in each depth separable convolution network layer. Concretely, the NetAdapt algorithm optimizes the convolution kernels of one layer of the original depth separable convolution network to obtain a plurality of second depth separable convolution networks, which form a candidate set; one second depth separable convolution network is selected from this set, its time delay and precision are compared with those of the corresponding original depth separable convolution network, and the final network is chosen according to the comparison result: when the time delay of the second depth separable convolution network is smaller than that of the original and its precision is higher, the second depth separable convolution network is selected as the final depth separable convolution network.
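The selection rule at the end of the paragraph — accept a candidate only when it is both faster and at least as accurate as the current network — can be sketched as below; the latency and accuracy figures are fabricated stand-ins for on-device measurements:

```python
# Minimal sketch of the NetAdapt-style candidate selection described above.
# All numbers are illustrative; real NetAdapt measures latency on the device.

def select_final_network(original, candidates):
    best = original
    for cand in candidates:
        # Accept only candidates that are faster and no less accurate.
        if cand["latency"] < best["latency"] and cand["accuracy"] >= best["accuracy"]:
            best = cand
    return best

original = {"filters": 64, "latency": 12.0, "accuracy": 0.910}
candidates = [
    {"filters": 48, "latency": 9.5, "accuracy": 0.912},  # faster and more accurate
    {"filters": 32, "latency": 7.0, "accuracy": 0.880},  # faster but less accurate
]
print(select_final_network(original, candidates)["filters"])  # 48
```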
As shown in fig. 6, fig. 6 is a flowchart of a pruning algorithm optimization process provided in the embodiment of the present invention, and the pruning algorithm optimization process provided in the embodiment of the present invention includes, but is not limited to, step S600 and step S610.
Step S600, pruning the network structure of the intermediate target detection model to remove the redundant weight parameters of the network structure;
and step S610, fine-tuning the intermediate target detection model after the pruning processing.
In the embodiment of the invention, the intermediate target detection model, obtained by training the initial target detection model, carries a large number of redundant weight parameters and neurons that are useless for target detection, which makes the model bloated as a whole. A pruning algorithm therefore prunes the network structure of the intermediate target detection model, removing the redundant weight parameters and useless neurons and yielding a more compact target detection model.
Pruning the network structure of the intermediate target detection model and removing its redundant weight parameters includes, but is not limited to, the following steps. First, the network structure is encoded by the per-layer channel counts after pruning, converting it into a coding vector; to search for the optimal pruned network, various coding vectors are tried in succession, and feeding each candidate into the pruning network generates the corresponding post-pruning network weights. The performance of the pruned intermediate target detection model is then obtained from the network structure, the network weights, and a preset validation set. Finally, an evolutionary algorithm searches for the optimal coding vector as the final coding vector, and the final target detection model is obtained from it. In this search, a user-defined objective function is used, which includes, but is not limited to, the network's accuracy function, time-delay function, and computation-cost function.
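The idea of encoding a network by per-layer channel counts and cutting the generated weight matrices to match can be sketched as follows; the layer sizes and the candidate coding vector are illustrative assumptions:

```python
import numpy as np

# Sketch of "encode the network by per-layer channel counts, then cut the
# weight matrices to match the encoding". All sizes are illustrative.

full_channels = [16, 32, 64]
code = [8, 24, 40]  # one candidate coding vector (channels kept per layer)

rng = np.random.default_rng(1)
# Full weights: one (C_out, C_in, 3, 3) tensor per layer; 3 input channels
# for the first layer (an RGB image is assumed).
full_weights = [rng.standard_normal((c_out, c_in, 3, 3))
                for c_in, c_out in zip([3] + full_channels[:-1], full_channels)]

pruned = []
prev = 3
for w, keep in zip(full_weights, code):
    # Keep only the first `keep` output channels and the surviving `prev`
    # input channels from the previous (already pruned) layer.
    pruned.append(w[:keep, :prev])
    prev = keep
print([w.shape for w in pruned])  # [(8, 3, 3, 3), (24, 8, 3, 3), (40, 24, 3, 3)]
```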
It should be noted that the evolutionary search for the final coding vector, from which the final target detection model is obtained, proceeds, specifically but not exclusively, as follows. The coding vector is treated as a vector of the network's per-layer channel counts, where each layer's channel count corresponds to a gene in the evolutionary algorithm. A large number of genes are first selected at random, the accuracy of the network weights generated by the pruning network is computed on the preset validation set, and the top K genes with the highest accuracy are retained; new genes are then generated by crossover and mutation. Mutation randomly changes the proportions of elements within a gene, while crossover randomly recombines two parent genes into a new gene combination. Iterating this process repeatedly yields the final coding vector, and the final target detection model is obtained from it.
It should be noted that, in the embodiment of the present invention, AutoML (Automated Machine Learning) is used to search for the final coding vector, after which the final target detection model is obtained from it; AutoML has the characteristic of automatically searching for the optimal structure.
In addition, it should be noted that, in the embodiment of the invention, the pruning network must be trained in advance. The training pruning network is composed of l pruning blocks, each consisting of two fully connected layers. In the forward pass, the training pruning network takes the network coding vector as input and generates a weight matrix; at the same time, it uses the values in the coding vector as output channel counts and cuts the generated weight matrix to match its own input and output. For an input image, the forward loss of the pruned network can then be computed. In the backward pass, the weights of the training pruning network, i.e., the parameters of the fully connected layers, are updated by computing its gradients. Throughout training, randomly generating different network coding vectors yields different training pruning network structures; given a network structure and its weights, the network's performance can be tested on the validation set. Finally, an evolutionary algorithm searches for the optimal coding vector to obtain the optimal training pruning network. Concretely, the network code is regarded as a vector of the network's per-layer channel counts, each of which corresponds to a gene in the evolutionary algorithm: a large number of genes are first selected at random, the precision of the weights generated by the pruning network is computed on the validation set, the top K genes with the highest accuracy are retained, and new genes are then generated by crossover and mutation.
Mutation randomly changes the proportions of elements within a gene, and crossover randomly recombines two parent genes to generate a new gene combination; iterating this process repeatedly yields the optimal training pruning network code.
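The crossover/mutation search described above can be sketched as a toy evolutionary loop; the fitness function here is a synthetic stand-in for accuracy measured on a validation set, and all sizes are illustrative:

```python
import random

# Toy evolutionary search over channel-count encodings ("genes"), following
# the top-K selection, crossover, and mutation procedure described above.

random.seed(0)
LAYERS, MAX_CH, TOP_K = 4, 64, 4

def fitness(gene):
    # Synthetic proxy for validation accuracy: prefer networks close to a
    # hypothetical ideal channel plan (a real system would evaluate weights).
    ideal = [16, 32, 48, 64]
    return -sum(abs(g, ) if False else abs(g - i) for g, i in zip(gene, ideal))

def mutate(gene):
    # Randomly change one element's proportion (one layer's channel count).
    g = list(gene)
    g[random.randrange(LAYERS)] = random.randrange(1, MAX_CH + 1)
    return g

def crossover(a, b):
    # Randomly recombine two parent genes into a new gene combination.
    return [random.choice(pair) for pair in zip(a, b)]

population = [[random.randrange(1, MAX_CH + 1) for _ in range(LAYERS)]
              for _ in range(20)]
for _ in range(30):
    population.sort(key=fitness, reverse=True)
    parents = population[:TOP_K]  # keep the top-K genes
    children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                for _ in range(len(population) - TOP_K)]
    population = parents + children

best = max(population, key=fitness)
print(best)
```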
In the embodiment of the present invention, the initial target detection model further includes a system loss function for target detection, and as shown in the following formula, the system loss function includes a bounding box coordinate error function, a bounding box confidence error function, and a classification error function.
$$
\begin{aligned}
Loss ={} & \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_{i}-\hat{x}_{i})^{2}+(y_{i}-\hat{y}_{i})^{2}\right] \\
& + \lambda_{coord}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_{i}}-\sqrt{\hat{w}_{i}}\right)^{2}+\left(\sqrt{h_{i}}-\sqrt{\hat{h}_{i}}\right)^{2}\right] \\
& + \sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(C_{i}-\hat{C}_{i}\right)^{2} + \lambda_{noobj}\sum_{i=0}^{S^{2}}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(C_{i}-\hat{C}_{i}\right)^{2} \\
& + \sum_{i=0}^{S^{2}}\mathbb{1}_{i}^{obj}\sum_{c=1}^{C}\left(p_{i}(c)-\hat{p}_{i}(c)\right)^{2}
\end{aligned}
$$

The first term is the bounding-box coordinate error function; the second term is the loss function over the bounding-box height and width; the third term is the bounding-box confidence error function for boxes in which an object is present; the fourth term is the bounding-box confidence loss function for boxes in which no object is present; the fifth term is the classification error function of the unit grid cells in which an object is present. S is the unit-grid division coefficient of the picture (the picture is divided into S × S unit grid cells); B is the number of bounding boxes predicted by each grid cell; C is the total number of classifications; p is the class probability; \(\mathbb{1}_{ij}^{obj}\) means that an object is present in the i-th unit grid cell and that the j-th bounding box of that cell predicts the target; \(\lambda_{coord}\) and \(\lambda_{noobj}\) are the weighting coefficients of the different loss functions.
It should be noted that the system loss function is also included in the intermediate object detection model and the final object detection model.
As shown in fig. 7, fig. 7 is a flowchart of an application method of the object detection model according to the embodiment of the present invention, and the application method of the object detection model according to the embodiment of the present invention includes, but is not limited to, step S700, step S710, and step S720.
Step S700, acquiring an actual digital image, and inputting the actual digital image to a target detection model;
step S710, extracting the characteristics of the actual digital image through the depth separable convolution layer of the target detection model, and outputting a characteristic diagram;
and S720, carrying out target detection on the feature map through a multi-scale feature fusion mechanism of the target detection model.
In the embodiment of the invention, an actual digital image is obtained and input into a target detection model; extracting the features of the actual digital image through the depth separable convolution layer of the target detection model, and outputting a feature map; and carrying out target detection on the feature map through a multi-scale feature fusion mechanism of the target detection model.
As shown in fig. 8, an embodiment of the present invention further provides a target detection model establishing apparatus, including:
a network modification module 800, configured to obtain a basic target detection network, replace a normal convolutional layer of the basic target detection network with a deep separable convolutional layer, and add a multi-scale feature fusion mechanism to the basic target detection network to obtain an initial target detection model;
a digital image obtaining module 810, configured to obtain a preset digital image and input the preset digital image to the initial target detection model;
a feature extraction module 820, configured to perform feature extraction on a preset digital image through a depth separable convolution layer of the initial target detection model, and output a feature map;
the target detection module 830 is configured to perform target detection on the feature map through a multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model;
and the model optimization module 840 is used for optimizing the intermediate target detection model by adopting a NetAdapt algorithm and a pruning algorithm to obtain a final target detection model.
It should be noted that the contents of the method embodiment of the present invention are all applicable to the apparatus embodiment, the functions specifically implemented by the apparatus embodiment are the same as those of the method embodiment, and the beneficial effects achieved by the apparatus embodiment are also the same as those achieved by the method, which are not described herein again.
As shown in fig. 9, an embodiment of the present invention further provides an object detection apparatus, including:
a digital image acquisition module 900, configured to acquire an actual digital image and input the actual digital image to the target detection model;
the feature extraction module 910 is configured to perform feature extraction on the actual digital image through the depth separable convolution layer of the target detection model, and output a feature map;
and the target detection module 920 is configured to perform target detection on the feature map through a multi-scale feature fusion mechanism of the target detection model.
It should be noted that the contents of the method embodiment of the present invention are all applicable to the apparatus embodiment, the functions specifically implemented by the apparatus embodiment are the same as those of the method embodiment, and the beneficial effects achieved by the apparatus embodiment are also the same as those achieved by the method, which are not described herein again.
In addition, an embodiment of the present invention further provides a target detection device, where the target detection device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor.
The processor and memory may be connected by a bus or other means.
The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory located remotely from the processor, and these remote memories may be connected to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
It should be noted that the object detection device in this embodiment may be applied as the method for establishing the object detection model in the above-mentioned embodiment and/or the method for applying the object detection model in the above-mentioned embodiment, and the object detection device in this embodiment has the same inventive concept as the method for establishing the object detection model in the above-mentioned embodiment and/or the method for applying the object detection model in the above-mentioned embodiment, so these embodiments have the same implementation principle and technical effect, and are not described in detail here.
The non-transitory software programs and instructions required to implement the method of building an object detection model as described in the above embodiments and/or the method of applying an object detection model as described in the above embodiments are stored in a memory and, when executed by a processor, perform the method of building an object detection model as described in the above embodiments and/or the method of applying an object detection model as described in the above embodiments, e.g. perform the above described method steps S200 to S250 in fig. 2, method steps S300 to S320 in fig. 3, method steps S400 to S430 in fig. 4, method steps S500 to S510 in fig. 5, method steps S600 to S610 in fig. 6, and method steps S700 to S720 in fig. 7.
The above described embodiments of the object detection device are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium stores computer-executable instructions, which are executed by a processor or a controller, for example, by a processor in the above-mentioned embodiment of the object detection device, and enable the above-mentioned processor to execute the method for building the object detection model according to the above-mentioned embodiment and/or the method for applying the object detection model according to the above-mentioned embodiment, for example, execute the above-mentioned method steps S200 to S250 in fig. 2, method steps S300 to S320 in fig. 3, method steps S400 to S430 in fig. 4, method steps S500 to S510 in fig. 5, method steps S600 to S610 in fig. 6, and method steps S700 to S720 in fig. 7.
One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
It will be appreciated by those skilled in the art that the solutions shown in fig. 1-9 are not intended to limit the embodiments of the present invention, and may include more or fewer steps than those shown, or some of the steps may be combined, or different steps may be included.
The above-described embodiments of the apparatus are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may also be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
One of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the invention and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It is to be understood that, in the present invention, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may indicate: only A is present, only B is present, or both A and B are present, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. "At least one of the following" or similar expressions refers to any combination of the listed items, including any combination of single or plural items. For example, "at least one of a, b, or c" may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only one type of logical functional division, and other divisions may be realized in practice; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes a number of instructions for enabling a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing programs, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
While the preferred embodiments of the present invention have been described in detail, it will be understood by those skilled in the art that the foregoing and various other changes, omissions and deviations in the form and detail thereof may be made without departing from the scope of this invention.

Claims (14)

1. A method for establishing a target detection model, characterized by comprising the following steps:
acquiring a basic target detection network, replacing a common convolution layer of the basic target detection network with a depth separable convolution layer, and adding a multi-scale feature fusion mechanism to the basic target detection network to obtain an initial target detection model;
acquiring a preset digital image, and inputting the preset digital image to the initial target detection model;
performing feature extraction on the preset digital image through the depth separable convolution layer of the initial target detection model, and outputting a feature map;
performing target detection on the feature map through the multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model;
and optimizing the intermediate target detection model by adopting a NetAdapt algorithm and a pruning algorithm to obtain a final target detection model.
2. The method for establishing a target detection model according to claim 1, wherein the NetAdapt algorithm is used for optimizing the convolution kernels of the depth separable convolution layers of the intermediate target detection model, and the pruning algorithm is used for optimizing the network structure of the intermediate target detection model.
3. The method for building an object detection model according to claim 1, wherein said performing feature extraction on said preset digital image by said depth separable convolution layer of said initial object detection model and outputting a feature map comprises:
performing channel dimension-increasing processing on the preset digital image by using point-by-point convolution;
performing feature extraction processing on the preset digital image subjected to channel dimension increasing by utilizing deep convolution to obtain a plurality of initial feature maps;
and performing channel dimensionality reduction on the plurality of initial feature maps by using point-by-point convolution, and outputting a final feature map.
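The three steps recited in this claim (point-by-point channel dimension-increasing, depthwise feature extraction, point-by-point channel dimensionality reduction) can be sketched in plain NumPy; the layer widths, the 3x3 kernel size, and the random weights below are illustrative assumptions, not values from the application:

```python
import numpy as np

def pointwise_conv(x, w):
    # x: (H, W, C_in), w: (C_in, C_out) -- a 1x1 (point-by-point)
    # convolution is a per-pixel linear map across channels
    return x @ w

def depthwise_conv(x, k):
    # x: (H, W, C), k: (3, 3, C) -- one 3x3 filter per channel,
    # 'same' zero padding, stride 1
    H, W, C = x.shape
    p = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(p[i:i+3, j:j+3] * k, axis=(0, 1))
    return out

def separable_block(x, w_up, k_dw, w_down):
    # claim-3 ordering: expand channels, extract features, reduce channels
    return pointwise_conv(depthwise_conv(pointwise_conv(x, w_up), k_dw), w_down)

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8, 3))                  # toy "preset digital image"
y = separable_block(x,
                    rng.standard_normal((3, 16)),   # 3 -> 16 channels (up)
                    rng.standard_normal((3, 3, 16)),
                    rng.standard_normal((16, 4)))   # 16 -> 4 channels (down)
print(y.shape)  # (8, 8, 4)
```

Since a 1x1 convolution is just a per-pixel matrix multiply across channels, the up- and down-projection steps reduce to a single `@`.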
4. The method for building the target detection model according to claim 1, wherein the target detection of the feature map by the multi-scale feature fusion mechanism of the initial target detection model comprises:
obtaining the height and width of a first final feature map output by the first depth separable convolution layer;
acquiring and adjusting the height and width of a second final feature map output by a second depth separable convolutional layer to enable the height and width of the second final feature map to be the same as those of the first final feature map;
performing channel splicing and convolution on the adjusted second final feature map and the first final feature map to obtain a fusion feature;
and carrying out target detection according to the fusion characteristics.
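The fusion steps of this claim (adjust the second map's height and width to match the first, channel-splice, then convolve) can be sketched as follows; nearest-neighbour resizing and a 1x1 mixing convolution are assumptions, since the claim does not fix the resizing or convolution method:

```python
import numpy as np

def resize_nearest(x, h, w):
    # nearest-neighbour resize of an (H, W, C) feature map to (h, w, C)
    H, W, _ = x.shape
    return x[np.arange(h) * H // h][:, np.arange(w) * W // w]

def fuse(f1, f2, w_mix):
    # claim-4 steps: adjust f2's height/width to those of f1,
    # channel-concatenate, then mix with a 1x1 convolution
    h, w, _ = f1.shape
    f2r = resize_nearest(f2, h, w)
    cat = np.concatenate([f1, f2r], axis=-1)
    return cat @ w_mix      # (h, w, C1+C2) @ (C1+C2, C_out)

rng = np.random.default_rng(1)
f1 = rng.standard_normal((16, 16, 8))   # finer-scale map (first layer)
f2 = rng.standard_normal((8, 8, 8))     # coarser-scale map (second layer)
fused = fuse(f1, f2, rng.standard_normal((16, 4)))  # weights are illustrative
print(fused.shape)  # (16, 16, 4)
```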
5. The method for establishing the target detection model according to claim 1, wherein the optimizing the intermediate target detection model by using a NetAdapt algorithm and a pruning algorithm comprises:
the NetAdapt algorithm optimizes the convolution kernels of one layer of the original depth separable convolutional network to obtain a plurality of second depth separable convolutional networks;
and the NetAdapt algorithm compares the time delay and the precision of each second depth separable convolutional network with those of the corresponding original depth separable convolutional network, and selects a final depth separable convolutional network according to the comparison result.
6. The method for building the object detection model according to claim 5, wherein the selecting the final deep separable convolution network according to the comparison result comprises:
when the time delay of the second depth separable convolutional network is larger than the time delay of the original depth separable convolutional network and/or the precision of the second depth separable convolutional network is lower than that of the original depth separable convolutional network, selecting the original depth separable convolutional network as the final depth separable convolutional network;
and when the time delay of the second depth separable convolutional network is less than the time delay of the original depth separable convolutional network and the precision of the second depth separable convolutional network is higher than that of the original depth separable convolutional network, selecting the second depth separable convolutional network as the final depth separable convolutional network.
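The selection rule above reduces to a comparison of (latency, accuracy) pairs. A minimal sketch, with field names that are assumptions; note the claim leaves exact ties unspecified, so this sketch falls back to the original network in that case:

```python
from collections import namedtuple

Net = namedtuple("Net", ["name", "latency_ms", "accuracy"])

def select_final(original, candidate):
    # claim-6 rule: keep the candidate only if it is strictly better on
    # BOTH latency and accuracy; otherwise (including ties) keep the
    # original network
    if (candidate.latency_ms < original.latency_ms
            and candidate.accuracy > original.accuracy):
        return candidate
    return original

orig = Net("original", latency_ms=12.0, accuracy=0.71)
fast = Net("second",   latency_ms=9.5,  accuracy=0.73)
slow = Net("second",   latency_ms=14.0, accuracy=0.74)

print(select_final(orig, fast).name)  # second
print(select_final(orig, slow).name)  # original (higher latency)
```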
7. The method for building the object detection model according to claim 5, wherein the method further comprises:
the pruning algorithm is used for pruning the network structure of the intermediate target detection model to remove the redundant weight parameters of the network structure;
and the pruning algorithm is used for finely adjusting the intermediate target detection model after pruning.
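The claim names pruning of redundant weight parameters but not the redundancy criterion; magnitude pruning is a common choice and is assumed in this sketch:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.5):
    # zero out the smallest-magnitude fraction of weights; magnitude as
    # the redundancy criterion is an assumption -- the claim only states
    # that redundant weight parameters are removed (fine-tuning would
    # then restore accuracy on the remaining weights)
    flat = np.abs(weights).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return weights.copy()
    threshold = np.partition(flat, k - 1)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

rng = np.random.default_rng(2)
w = rng.standard_normal((4, 4))
pw = magnitude_prune(w, sparsity=0.5)
print(np.count_nonzero(pw))  # 8 of 16 weights survive
```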
8. The method for building an object detection model according to claim 7, wherein said removing the redundant weight parameters of the network structure comprises:
coding the network structure based on the number of channels of each layer in the intermediate target detection model after pruning processing to obtain a plurality of coding vectors;
inputting the coding vector into the intermediate target detection model and generating a network weight after pruning;
obtaining the performance of the intermediate target detection model after pruning according to the network structure, the network weight and a preset verification set;
and screening the plurality of code vectors by using an evolutionary algorithm to obtain a final code vector, and obtaining the final target detection model according to the final code vector.
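The encoding-and-screening loop of this claim can be sketched as a toy evolutionary search over channel-count vectors; the fitness function below is only a stand-in for evaluating each candidate on the preset verification set, and all bounds, step sizes, and population numbers are illustrative:

```python
import random

random.seed(3)
MAX_CHANNELS = [32, 64, 128]   # per-layer channel upper bounds (illustrative)

def fitness(code):
    # toy stand-in for "performance on a preset verification set":
    # reward capacity, penalise parameter count -- purely illustrative
    capacity = sum(c ** 0.5 for c in code)
    cost = 0.01 * sum(code)
    return capacity - cost

def mutate(code):
    # perturb one layer's channel count, clamped to [1, bound]
    i = random.randrange(len(code))
    child = list(code)
    child[i] = max(1, min(MAX_CHANNELS[i], child[i] + random.choice((-8, 8))))
    return child

def evolve(pop, generations=30):
    for _ in range(generations):
        pop = pop + [mutate(random.choice(pop)) for _ in range(len(pop))]
        pop = sorted(pop, key=fitness, reverse=True)[:4]  # screening step
    return pop[0]

initial = [[32, 64, 128], [16, 32, 64]]   # encoded channel-count vectors
best = evolve(initial)
print(best)
```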
9. The method for building an object detection model according to claim 1, wherein the initial object detection model further comprises a system loss function, and the system loss function comprises a bounding box coordinate error function, a bounding box confidence error function, and a classification error function.
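Claim 9 names the three components of the system loss without fixing their form; a YOLO-style squared-error combination, including the weighting factors, is assumed in this sketch:

```python
import numpy as np

def system_loss(pred_box, true_box, pred_conf, true_conf, pred_cls, true_cls,
                w_coord=5.0, w_conf=1.0, w_cls=1.0):
    # sum of the three error terms named in claim 9; the squared-error
    # forms and the weights are assumptions, not recited in the claim
    coord_err = np.sum((pred_box - true_box) ** 2)   # bounding-box coordinates
    conf_err = np.sum((pred_conf - true_conf) ** 2)  # bounding-box confidence
    cls_err = np.sum((pred_cls - true_cls) ** 2)     # classification
    return w_coord * coord_err + w_conf * conf_err + w_cls * cls_err

loss = system_loss(np.array([0.5, 0.5, 0.2, 0.3]),
                   np.array([0.55, 0.5, 0.25, 0.3]),
                   np.array([0.9]), np.array([1.0]),
                   np.array([0.1, 0.8, 0.1]), np.array([0.0, 1.0, 0.0]))
print(round(float(loss), 4))  # 0.095
```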
10. A method for applying an object detection model, the method comprising:
acquiring an actual digital image, and inputting the actual digital image to a target detection model;
performing feature extraction on the actual digital image through the depth separable convolution layer of the target detection model, and outputting a feature map;
and carrying out target detection on the feature map through a multi-scale feature fusion mechanism of the target detection model.
11. An object detection model creation apparatus, characterized in that the apparatus comprises:
the network modification module is used for acquiring a basic target detection network, replacing a common convolutional layer of the basic target detection network with a depth separable convolutional layer, and adding a multi-scale feature fusion mechanism to the basic target detection network to obtain an initial target detection model;
the digital image acquisition module is used for acquiring a preset digital image and inputting the preset digital image to the initial target detection model;
the characteristic extraction module is used for extracting the characteristics of the preset digital image through the depth separable convolution layer of the initial target detection model and outputting a characteristic diagram;
the target detection module is used for carrying out target detection on the feature map through the multi-scale feature fusion mechanism of the initial target detection model to obtain an intermediate target detection model;
and the model optimization module is used for optimizing the intermediate target detection model by adopting a NetAdapt algorithm and a pruning algorithm to obtain a final target detection model.
12. An object detection apparatus, characterized in that the apparatus comprises:
the digital image acquisition module is used for acquiring an actual digital image and inputting the actual digital image to the target detection model;
the characteristic extraction module is used for extracting the characteristics of the actual digital image through the depth separable convolution layer of the target detection model and outputting a characteristic diagram;
and the target detection module is used for carrying out target detection on the feature map through a multi-scale feature fusion mechanism of the target detection model.
13. An object detection device, comprising: memory, processor and computer program stored on the memory and executable on the processor, the processor implementing a method of establishing an object detection model according to any one of claims 1 to 9 and/or a method of applying an object detection model according to claim 10 when executing the computer program.
14. A computer-readable storage medium storing computer-executable instructions for performing the method of building an object detection model according to any one of claims 1 to 9 and/or the method of applying an object detection model according to claim 10.
CN202210254685.7A 2022-03-15 2022-03-15 Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium Pending CN114627282A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210254685.7A CN114627282A (en) 2022-03-15 2022-03-15 Target detection model establishing method, target detection model application method, target detection model establishing device, target detection model application device and target detection model establishing medium
PCT/CN2022/090664 WO2023173552A1 (en) 2022-03-15 2022-04-29 Establishment method for target detection model, application method for target detection model, and device, apparatus and medium


Publication Number Publication Date
CN114627282A true CN114627282A (en) 2022-06-14

Family

ID=81901213

Country Status (2)

Country Link
CN (1) CN114627282A (en)
WO (1) WO2023173552A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117170418B (en) * 2023-11-02 2024-02-20 杭州华橙软件技术有限公司 Cloud deck control method, device, equipment and storage medium
CN117911679B (en) * 2024-03-15 2024-05-31 青岛国实科技集团有限公司 Hull identification system and method based on image enhancement and tiny target identification

Citations (4)

Publication number Priority date Publication date Assignee Title
CN110532859A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Remote Sensing Target detection method based on depth evolution beta pruning convolution net
US20200134772A1 (en) * 2018-10-31 2020-04-30 Kabushiki Kaisha Toshiba Computer vision system and method
CN113313162A (en) * 2021-05-25 2021-08-27 国网河南省电力公司电力科学研究院 Method and system for detecting multi-scale feature fusion target
CN113780211A (en) * 2021-09-16 2021-12-10 河北工程大学 Lightweight aircraft detection method based on improved yolk 4-tiny

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
CN112347936A (en) * 2020-11-07 2021-02-09 南京天通新创科技有限公司 Rapid target detection method based on depth separable convolution
CN112699958A (en) * 2021-01-11 2021-04-23 重庆邮电大学 Target detection model compression and acceleration method based on pruning and knowledge distillation
CN114120019B (en) * 2021-11-08 2024-02-20 贵州大学 Light target detection method
CN114170526A (en) * 2021-11-22 2022-03-11 中国电子科技集团公司第十五研究所 Remote sensing image multi-scale target detection and identification method based on lightweight network

Non-Patent Citations (1)

Title
TIEN-JU YANG et al.: "NetAdapt: Platform-Aware Neural Network Adaptation for Mobile Applications", Computer Vision - ECCV 2018, 14 September 2018 (2018-09-14), page 289 *

Also Published As

Publication number Publication date
WO2023173552A1 (en) 2023-09-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination