CN116994024A - Method, device, equipment, medium and product for identifying components in a container image

Method, device, equipment, medium and product for identifying components in a container image

Info

Publication number
CN116994024A
Authority
CN
China
Prior art keywords
image
container
feature
representation
component
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211522957.3A
Other languages
Chinese (zh)
Inventor
侯嘉悦
龚星
汪铖杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211522957.3A priority Critical patent/CN116994024A/en
Publication of CN116994024A publication Critical patent/CN116994024A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G06T 7/194 Segmentation; Edge detection involving foreground-background segmentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20021 Dividing image into blocks, subimages or windows

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a method, a device, equipment, a medium and a product for identifying components in a container image, and relates to the field of image processing. The method comprises the following steps: acquiring a container image, wherein the container image comprises at least one component constituting a target container and is divided into a plurality of image blocks of a first size; performing feature encoding on the container image with a second size as the feature encoding range to obtain image encoding representations corresponding to the plurality of image blocks; performing feature extraction on the degree of association between pixels in the container image through the image encoding representations to obtain an image feature representation corresponding to the container image; and classifying pixels in the container image based on the inter-pixel association degree indicated by the image feature representation to generate a component identification result corresponding to the container image. The method can improve the accuracy of component identification in container images.

Description

Method, device, equipment, medium and product for identifying components in a container image
Technical Field
The present application relates to the field of image processing, and in particular, to a method, an apparatus, a device, a medium, and a product for identifying a component in a container image.
Background
A container is a standard loading tool assembled from hundreds of components of different sizes and structures. Quality inspection of containers is an important link in ensuring that a container can remain in service over the long term: when some components of a container are missing or damaged, the continued use of the container and the successful transportation of the goods may be affected to some extent. Because the components differ in structure and function, they also differ greatly in their tolerance standards for defects, so the same defect, such as deformation, is judged differently on different components. The accuracy of component identification is therefore a prerequisite for this task, which places higher precision requirements on component identification.
In the related art, automatic container component identification is implemented through image processing: an image threshold is calculated from a received scanned image of a container, the scanned image is binarized according to the image threshold to obtain a binarized image, and when a division point is detected in the image contour of the binarized image, the scanned image is divided at the division point to obtain component images corresponding to the container components.
However, segmenting the components in the container image through simple image processing operations alone is only suitable for cases in which the components to be identified are widely separated; the identification is coarse and the accuracy is low.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment, a medium and a product for identifying components in a container image, which can improve the accuracy of identifying the components in the container image. The technical scheme is as follows:
in one aspect, there is provided a method of component identification in a container image, the method comprising:
acquiring a container image, wherein the container image comprises at least one component for forming a target container, and the container image comprises a plurality of image blocks with a first size;
performing feature coding on the container image by taking a second size as a feature coding range to obtain image coding representations corresponding to a plurality of image blocks, wherein the second size is larger than the first size;
extracting features of the correlation degree among pixels in the container image through the image coding representation to obtain an image feature representation corresponding to the container image;
and classifying pixels in the container image based on the pixel correlation degree indicated by the image feature representation, and generating a component identification result corresponding to the container image, wherein the component identification result is used for indicating the component type of the at least one component in the container image.
In another aspect, there is provided a component identification apparatus in a container image, the apparatus comprising:
an acquisition module, configured to acquire a container image, where the container image includes at least one component that forms a target container, and the container image includes a plurality of image blocks of a first size;
the encoding module is used for carrying out feature encoding on the container image by taking a second size as a feature encoding range to obtain image encoding representations corresponding to a plurality of image blocks, and the second size is larger than the first size;
the extraction module is used for extracting the characteristics of the correlation between pixels in the container image through the image coding representation to obtain the image characteristic representation corresponding to the container image;
the generation module is used for classifying pixels in the container image based on the pixel association degree indicated by the image feature representation, and generating a component identification result corresponding to the container image, wherein the component identification result is used for indicating the component type of the at least one component in the container image.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, and the memory stores at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the method for identifying components in a container image according to any one of the embodiments of the present application.
In another aspect, a computer-readable storage medium is provided, in which at least one piece of program code is stored, the program code being loaded and executed by a processor to implement the method for identifying components in a container image according to any of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of identifying components in the container image as described in any of the above embodiments.
The technical scheme provided by the application at least comprises the following beneficial effects:
when the container image formed by a plurality of image blocks with the first size is subjected to image coding, the image blocks are subjected to image coding in the container image by taking the second size larger than the first size as a feature coding range, so that corresponding image coding representations are obtained, partial overlapping information exists among the image coding representations corresponding to different image blocks, more associated information in the image is effectively reserved, and when the feature extraction is carried out through the image coding representations, the association among the features is stronger, therefore, the association degree among the image blocks can be ensured while the identification accuracy is improved when the downstream identification and classification of the components in the container image are improved, even for the components in tight connection or with overlapping relation in the container image, the identification accuracy of the components in the container image is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments are briefly described below. It is apparent that the drawings in the following description show only some embodiments of the present application, and a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a computer system provided in accordance with an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a method for identifying components in an image of a container provided in an exemplary embodiment of the present application;
FIG. 3 is a schematic illustration of a plate surface of a container provided by an exemplary embodiment of the present application;
FIG. 4 is a schematic view of a door lock lever of a container provided in an exemplary embodiment of the present application;
FIG. 5 is a schematic illustration of a corner fitting of a container provided in accordance with an exemplary embodiment of the present application;
FIG. 6 is a schematic illustration of a cross beam of a container provided in accordance with an exemplary embodiment of the present application;
FIG. 7 is a schematic illustration of a corner post of a container provided in accordance with an exemplary embodiment of the present application;
FIG. 8 is a schematic illustration of a container image acquired by a handheld mobile terminal in accordance with an exemplary embodiment of the present application;
FIG. 9 is a schematic diagram of a component recognition result provided by an exemplary embodiment of the present application;
FIG. 10 is a schematic diagram of a component recognition result provided by an exemplary embodiment of the present application;
FIG. 11 is a flowchart of a method for identifying components in an image of a container provided in an exemplary embodiment of the present application;
FIG. 12 is a schematic illustration of window segmentation of a container image provided in accordance with an exemplary embodiment of the present application;
FIG. 13 is a schematic diagram of the structure of a feed-forward neural network subunit provided in accordance with an exemplary embodiment of the present application;
fig. 14 is a schematic structural view of a feature extraction section provided by an exemplary embodiment of the present application;
FIG. 15 is a flowchart of a method for identifying components in an image of a container provided in an exemplary embodiment of the present application;
fig. 16 is a schematic diagram showing a connection manner between a feature extraction section and a feature decoding section provided by an exemplary embodiment of the present application;
FIG. 17 is a flowchart of component identification of a container image provided by an exemplary embodiment of the present application;
FIG. 18 is a block diagram of a component identification device in a container image provided in accordance with an exemplary embodiment of the present application;
FIG. 19 is a block diagram of a component identification device in a container image provided in accordance with an exemplary embodiment of the present application;
Fig. 20 is a schematic diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, the terms involved in the embodiments of the present application will be briefly described:
artificial intelligence (Artificial Intelligence, AI): the system is a theory, a method, a technology and an application system which simulate, extend and extend human intelligence by using a digital computer or a machine controlled by the digital computer, sense environment, acquire knowledge and acquire an optimal result by using the knowledge. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include directions such as computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, and intelligent transportation.
Machine Learning (ML): a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied throughout all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
Computer Vision (CV) technology: computer vision is a science that studies how to make machines "see"; more specifically, it refers to using cameras and computers instead of human eyes to identify and measure targets and perform other machine vision tasks, and to further process the images so that they become more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can obtain information from images or multidimensional data. Computer vision technologies typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, mapping, autonomous driving, intelligent transportation, and other technologies.
Container: a standardized steel box with a uniform structure that is convenient for mechanized loading and unloading. The circulation of commodities between countries has greatly deepened the globalization of the economy, and containers, as the main commodity loading tool in maritime transport, are widely used in various transportation scenarios such as ports and railways. Quality inspection of containers is an important link in ensuring that a container can remain in service over the long term; when some components of a container are missing or damaged, the continued use of the container and the successful transportation of the goods may be affected to some extent. However, different components of a container have different quality control criteria, so overall quality inspection of a container, including defect detection for each component, involves varied and complex requirements, consumes considerable human resources, and is inefficient. In order to save manpower and material resources and further improve quality inspection efficiency, the embodiments of the present application apply the computer vision technology in AI to the component identification process for container images and provide a method for identifying components in a container image.
Optionally, the method provided by the embodiment of the application can be applied to at least one of the following implementation environments:
first kind: terminal-server
Referring to fig. 1, a schematic diagram of a computer system according to an exemplary embodiment of the present application is shown, where the computer system includes: an image acquisition device 111, a terminal 112, a server 120 and a communication network 130.
The image acquisition device 111 is used for acquiring a container image corresponding to a container; the terminal 112 is configured to communicate with the server 120 via the communication network 130.
Alternatively, the image capturing device 111 and the terminal 112 may be implemented as the same device, i.e. the terminal 112 is provided with an image capturing function, and in one example, the terminal 112 captures an image of the container by calling its own camera; alternatively, the image capturing device 111 and the terminal 112 may be implemented as independent devices, that is, after the image capturing device 111 captures the container image, the container image is sent to the terminal 112 by a designated transmission mode, and the terminal 112 uploads the container image to the server 120.
Optionally, the terminal 112 includes various forms of devices such as a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, a smart home appliance, a vehicle-mounted terminal, an aircraft, a mobile image collector, and the like.
Alternatively, the server 120 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides cloud services, cloud databases, cloud security, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.
Cloud Technology refers to a hosting technology that unifies a series of resources such as hardware, software, and networks in a wide area network or a local area network to realize the computing, storage, processing, and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on that are applied based on the cloud computing business model; it can form a resource pool that is used on demand in a flexible and convenient manner. Cloud computing technology will become an important support. Background services of technical network systems require a large amount of computing and storage resources, such as video websites, picture websites, and more portal websites. With the rapid development and application of the internet industry, every item may have its own identification mark in the future, which needs to be transmitted to a background system for logical processing. Data at different levels will be processed separately, and all kinds of industry data require strong system backing support, which can only be realized through cloud computing.
In some embodiments, the server 120 described above may also be implemented as a node in a blockchain system. Blockchain (Blockchain) is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, encryption algorithms, and the like.
In one example, taking the terminal 112 as a mobile phone, the image capturing device 111 is implemented as a camera in the terminal 112, a user holds the terminal 112 to capture a container in the environment, so as to obtain a container image, the terminal 112 uploads the container image to the server 120, and the server 120 identifies the container image by calling a component identification service, so as to obtain a component identification result, wherein the component identification service obtains the component identification result through a component identification model after dividing and encoding the container image. In one example, the server 120 directly transmits the component identification result to the terminal 112; in another example, the server 120 transmits the component recognition result to a downstream quality detection service to recognize the component quality of the currently photographed container, and the quality recognition result is obtained, and the server 120 returns the quality recognition result to the terminal 112.
Second kind: terminal
In some embodiments, the method provided by the embodiment of the present application may also be implemented independently by the terminal. Illustratively, the terminal is capable of performing a component recognition function of the container, wherein the component recognition function may be performed by a specific application, or the terminal is a device dedicated to performing a related recognition function of the container, and is provided with a component recognition assembly. Schematically, a user shoots a container image through a terminal, and the terminal identifies a component in the container image by calling a component identification function to obtain a component identification result.
Referring to fig. 2, a flowchart of a method for identifying components in a container image according to an embodiment of the present application is schematically illustrated by taking the method applied to the server shown in fig. 1 as an example, and it should be noted that the method may also be implemented by a terminal, where the method includes:
step 210, acquiring a container image, wherein the container image comprises at least one component constituting a target container.
A container is a standardized loading tool with a uniform structure that is convenient for mechanized loading and unloading. A container is assembled from hundreds of components of different sizes and structures, such as corrugated plate surfaces, door lock rods, door handles, corner posts, corner fittings, bottom cross beams, and floors. The components differ in size, structure, shape, and color: components such as the transom and the cam head on the door have complex and fine structures; the corrugated plate surfaces, floor, and door panels of the container cover large areas with simple plate structures; and corner posts, cross beams, and the like have long edges.
As shown in fig. 3, which illustrates a schematic diagram of a plate surface of a container according to an exemplary embodiment of the present application, a container image 310 shows a side plate surface 311 of the container.
As shown in fig. 4, which illustrates a schematic view of a door lock lever of a container according to an exemplary embodiment of the present application, a door lock lever 411 is included in a container image 410.
As shown in fig. 5, which illustrates a schematic view of a corner fitting of a container provided by an exemplary embodiment of the present application, a container image 510 illustrates a corner fitting 511 of a container.
As shown in fig. 6, which illustrates a schematic view of a beam of a container provided by an exemplary embodiment of the present application, a container image 610 illustrates a beam 611 of the container.
As shown in fig. 7, which illustrates a schematic view of a corner post of a container provided by an exemplary embodiment of the present application, a container image 710 illustrates a corner post 711 of the container.
Alternatively, the container image may be an image including a plurality of target containers, that is, the container image includes a plurality of target containers, and it is necessary to identify corresponding components for the plurality of target containers in the container image, respectively.
Alternatively, the container image may be an image of a local area of the target container, that is, a local area of the container image including the target container, and the component corresponding to the target container in the local area needs to be identified.
Optionally, the container image is acquired from a storage area in the server; alternatively, the container image is uploaded by the terminal.
In some embodiments, the server receives the container image from the terminal when the container image is uploaded by the terminal, wherein the terminal may be implemented as a mobile terminal, and the mobile terminal may be implemented as a mobile image capturing device, that is, the container image is obtained by photographing the target container through the mobile image capturing device.
That is, in order to make container image acquisition more convenient, the container image may be acquired through a handheld mobile terminal (e.g., a mobile phone). The greatest advantage of using a handheld mobile device as the acquisition facility for container pictures is the convenience brought by its mobility; however, the viewing angle and position are not fixed, so the containers appear in various forms in the acquired pictures, and the same component varies in size and angle. In addition, besides the container area to be inspected, a picture contains various background areas, such as the ground, the sky, and even other stacked containers. The diversity of containers and backgrounds therefore increases the difficulty of identifying container components in pictures acquired by a handheld device.
As shown in fig. 8, which illustrates a schematic view of a container image acquired by a handheld mobile terminal according to an exemplary embodiment of the present application, a worker may photograph a container from different angles and different distances through the handheld mobile terminal, wherein a container image 810 is an image photographed downward from a top view, a container image 820 and a container image 830 are images photographed from different distances from a side view, a container image 840 is an image photographed inward of the container, a container image 850 is an image photographed at a close distance, and a container image 860 is an image photographed from a rear side in a bottom view.
The method provided by the embodiment of the application can identify the parts in the container image shot by the handheld mobile terminal, and can ensure the accuracy of part identification while improving the convenience of container image acquisition.
In some embodiments, the server performs image division on the container image to obtain a plurality of image blocks of a first size. It should be noted that this division into image blocks is performed by determining the pixel area corresponding to each token when the container image is input and encoded to obtain the image encoding representation (token sequence) corresponding to the container image; the container image is not actually cut apart.
Illustratively, the container image is image-partitioned according to a first size to obtain a plurality of image blocks. Alternatively, the first size may be preset by the server, may be indicated by the terminal, or may be determined according to the container image. In one example, the first size may be implemented as 4px×4px.
In some embodiments, the first size is determined according to the image size of the container image; illustratively, a preset number of image blocks is obtained, and the first size is determined according to that number and the image size, i.e., container images of different image sizes are divided into the same number of image blocks.
In some embodiments, to reduce the data throughput of the component identification process, a background region in the container image is identified prior to segmentation of the container image, wherein the background region is a region in the image that does not include a container, and the identified background region is subjected to pixel modification by specifying a pixel value, e.g., modifying the pixel value of the background region to 0.
In some embodiments, the container image may also be subjected to pre-processing operations such as image graying, image denoising, image enhancement, etc., prior to segmentation of the container image.
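As an illustration of the background-masking preprocessing described above, the following is a minimal sketch that sets the pixels of an already-identified background region to a specified value before block division; the function name and the boolean-mask representation of the background region are assumptions, since the text does not state how the background region is represented.

```python
import numpy as np

def mask_background(image: np.ndarray, background_mask: np.ndarray,
                    fill_value: int = 0) -> np.ndarray:
    """Set pixels of an identified background region to a specified value
    (0 here) before block division, reducing the data handled downstream."""
    out = image.copy()
    out[background_mask] = fill_value   # background_mask: boolean array, True = background
    return out
```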
Step 220, performing feature encoding on the container image with the second size as the feature encoding range to obtain image encoding representations corresponding to the plurality of image blocks.
And the second size is larger than the first size, namely, the second size larger than the first size is used as a feature coding range to perform feature coding on the image blocks in the container image so as to obtain image coding representations corresponding to the image blocks.
Optionally, the second size may be preset by the server, may be indicated by the terminal, or may be determined according to the first size. In one example, the second size may be implemented as 7px×7px.
In some embodiments, there is a specified multiple relationship between the first size and the second size, i.e., the second size is determined according to the first size and the specified multiple relationship, and in one example, the specified multiple relationship indicates that the corresponding side length of the second size is 1.5 times the corresponding side length of the first size, and when the first size is 4px×4px, the second size is 6px×6px.
In some embodiments, feature encoding of an image block may be implemented by a linear embedding layer, i.e., the pixels of the feature encoding range of the second size corresponding to the image block are input into the linear embedding layer, which outputs the image encoding representation corresponding to the image block. In the embodiment of the present application, the linear embedding layer is an embedding layer with overlap (Overlap).
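A minimal sketch of such an overlapping embedding layer is shown below, assuming the first size of 4 px x 4 px and the second size of 7 px x 7 px mentioned above. Implementing it as a strided convolution (stride equal to the block size, kernel equal to the encoding range) is one common realization and is an assumption, not something stated in the patent; the embedding dimension is likewise illustrative.

```python
import torch
import torch.nn as nn

class OverlapPatchEmbed(nn.Module):
    """Overlapping patch embedding: each 4x4 image block is encoded from a
    7x7 pixel window, so neighbouring tokens share border pixels."""
    def __init__(self, in_chans=3, embed_dim=96, patch_size=4, encode_size=7):
        super().__init__()
        # stride = patch_size keeps one token per 4x4 block; kernel = encode_size
        # makes the encoding range of each token the larger 7x7 window.
        self.proj = nn.Conv2d(in_chans, embed_dim,
                              kernel_size=encode_size,
                              stride=patch_size,
                              padding=encode_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(x)                       # (B, C, H/4, W/4)
        B, C, H, W = x.shape
        tokens = x.flatten(2).transpose(1, 2)  # (B, H/4*W/4, C) token sequence
        return self.norm(tokens), (H, W)
```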
Step 230, performing feature extraction on the degree of association between pixels in the container image through the image encoding representations to obtain an image feature representation corresponding to the container image.
Illustratively, each image block corresponds to one image encoding representation (token). When the image feature representation of the container image is extracted, the pixel similarity between image blocks can be indicated by calculating the similarity between the image encoding representations, so the degree of association between pixels in the container image is determined from the pixel similarity, and the features of other image blocks are fused into the current image block according to that degree of association. The image feature representation extracted for each image block therefore contains the association information between the other image blocks and the current image block, and the image feature representation corresponding to each image block is obtained in this way.
In some embodiments, when the first size of the image blocks is relatively large, feature extraction based on the inter-pixel association degree in the container image may first be performed within each image block according to the association degree between pixels inside the block, and then performed across the container image according to the association degree between image blocks. Illustratively, the image encoding representation is divided with a preset division size to obtain a plurality of image sub-encoding representations corresponding to the image block; the similarity between the image sub-encoding representations is calculated and converted into weights; the image sub-encoding representations are weighted and summed to obtain a new image encoding representation that captures the association inside the image block; and the inter-pixel association degree between image blocks is then determined based on the new image encoding representation.
In some embodiments, the feature extraction process is implemented by a feature extraction portion in a pre-trained component recognition model, that is, an image-encoded representation is input into the component recognition model, and the feature extraction portion in the component recognition model performs feature extraction on the image-encoded representation to obtain the image feature representation.
In some embodiments, the feature extraction portion converts the image coding representation into the image feature representation based on the image association degree between different image blocks, and performs feature integration on the image feature representations corresponding to the plurality of image blocks respectively to obtain an integrated feature representation, where the integrated feature representation is used for iterative coding based on the image association degree.
In some embodiments, the image correlation between image blocks described above may be determined by feature similarity between image encoded representations, and in one example, for image correlation between image block a and image block B, may be determined by calculating the vector distances of image encoded representation a and image encoded representation B in the vector space. Alternatively, the above-described vector distances may be implemented as at least one of a euclidean distance, a cosine distance, a mahalanobis distance, a hamming distance, etc. between the image encoded representations.
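As an illustration, one of the measures listed above could be computed between two image encoding representations as follows; cosine similarity is used here purely as an example, and the function name is not from the patent.

```python
import torch
import torch.nn.functional as F

def image_association(token_i: torch.Tensor, token_j: torch.Tensor) -> float:
    """Association degree between two image blocks, measured here as the cosine
    similarity between their image encoding representations; a Euclidean or
    other vector distance could be substituted."""
    return F.cosine_similarity(token_i.unsqueeze(0), token_j.unsqueeze(0)).item()
```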
In some embodiments, after determining the image association degree between the ith image block and other image blocks, converting the image association degree into a corresponding weight, and carrying out weighted summation on the image coding representations corresponding to the image blocks in the container image according to the weight so as to obtain the image feature representation corresponding to the ith image block.
In one example, the image association degree is used directly as the weight for weighting the image encoding representations.
In another example, the weight corresponding to the image association degree is looked up in a preset weight list, in which the weights corresponding to different ranges of the image association degree are recorded, for example as shown in Table 1. That is, let the image association degree between the j-th image block and the i-th image block be r ∈ [0, 1]. When 0 ≤ r < 0.5, the weight applied to the image encoding representation j of the j-th image block is 0, i.e., when determining the image feature representation of the i-th image block, the influence of image blocks whose association degree is lower than 0.5 is ignored; when 0.5 ≤ r < 0.75, the weight applied to the image encoding representation j of the j-th image block is 0.2; when 0.75 ≤ r < 1, the weight is 0.3; and when r = 1, the weight is 0.5.
Table 1
Image association degree range    Weight
[0, 0.5)                          0
[0.5, 0.75)                       0.2
[0.75, 1)                         0.3
1                                 0.5
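The following sketch shows how the Table 1 mapping could be applied: the association degree between block i and every block j is converted to a weight, and the image encoding representations are weighted and summed to form the feature for block i. The tensor shapes and function names are illustrative assumptions.

```python
import torch

def table_weight(r: float) -> float:
    """Map an image association degree r in [0, 1] to the weight from Table 1."""
    if r < 0.5:
        return 0.0
    if r < 0.75:
        return 0.2
    if r < 1.0:
        return 0.3
    return 0.5

def fuse_block_features(i: int, tokens: torch.Tensor, assoc: torch.Tensor) -> torch.Tensor:
    """Weighted sum of all image encoding representations (tokens: (N, C)) for
    block i, using Table 1 weights derived from association degrees assoc[i, j]."""
    weights = torch.tensor([table_weight(assoc[i, j].item()) for j in range(tokens.shape[0])])
    return (weights.unsqueeze(1) * tokens).sum(dim=0)
```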
In some embodiments, when feature integration is performed on the image feature representations corresponding to the plurality of image blocks, the number of image feature representations is reduced and the dimension of each feature representation is increased, so that the network forms a hierarchical design: the receptive field corresponding to the features grows, the semantic information carried by the features becomes richer, and the detail information of the image decreases.
Step 240, classifying pixels in the container image based on the inter-pixel association degree indicated by the image feature representation, and generating a component identification result corresponding to the container image.
Illustratively, the component identification result is used to indicate a component type of at least one component in the container image.
In some embodiments, the generation of the component identification result is implemented by a feature decoding part in a pre-trained component identification model: the extracted image feature representation contains rich semantic features of the image, and the feature decoding part decodes the image feature representation so that pixels in the container image are identified and classified during decoding, thereby generating the component identification result corresponding to the container image.
Alternatively, the component identification result may distinguish different components in the container image by regions with different filling effects, where the filling effect corresponds to the component type. The filling effects may be distinguished by different filling colors, for example, yellow filling for the region corresponding to the door lock rod and green filling for the region corresponding to the corner post; alternatively, the filling effects may be distinguished by filling different text information into the regions, for example, marking the number "1" in the region corresponding to the door lock rod and the number "2" in the region corresponding to the corner post.
In an example, as shown in fig. 9, a schematic diagram of a component recognition result 920 provided by an exemplary embodiment of the present application is shown, where the container image 910 is subjected to the above processing procedure to obtain a corresponding component recognition result 920, where areas corresponding to different components of the component recognition result 920 adopt different filling effects.
Alternatively, the above component identification result may also be that a plurality of labeling frames with labels are superimposed on the original container image, that is, the components in the container are labeled by the labeling frames in the component identification result, and the component types are labeled by the labels.
In an example, as shown in fig. 10, a schematic diagram of a component recognition result 1020 provided by an exemplary embodiment of the present application is shown, where a container image 1010 is subjected to the above processing procedure to obtain a corresponding component recognition result 1020, where container components in the component recognition result 1020 are marked with a frame, and the marked frame corresponds to a label of a component type.
In some embodiments, the component identification result is directly sent to the terminal by the server. In other embodiments, the component identification results are used in a downstream quality inspection service to detect component defects in a target container.
In some embodiments, the component recognition result includes a component area of at least one component, and when the component recognition result is used for the downstream quality detection service, in order to further improve the detection efficiency of the downstream component defect detection and the utilization rate of the extracted image feature representation, the abnormal pixel points existing in the component area are pre-recognized through the component recognition result and the image feature representation, so as to obtain an abnormal labeling result for the downstream detection process.
Illustratively, determining a region feature representation from a plurality of image feature representations based on the location of the component region in the container image; determining an abnormal point corresponding to the component region based on the pixel change condition indicated by the region characteristic representation; and marking the abnormal points on the component identification result to obtain an abnormal marking result. The component recognition result and the abnormal labeling result are commonly used for a component defect detection task of the target container.
In some embodiments, after the component region is determined, the corresponding region feature representation is obtained from the image feature representations corresponding to the plurality of image blocks. When the component region lies within a single target image block, the image feature representation is cropped according to the mapping relationship between image pixels and the image feature representation to obtain the region feature representation; when the component region spans a plurality of image blocks, the image feature representations are cropped and spliced according to that mapping relationship to obtain the region feature representation. In some embodiments, the mapping relationship between the image feature representation and the image pixels is automatically saved by the server during feature extraction.
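A sketch of cropping the region feature representation via the pixel-to-feature mapping is given below; the stride-based mapping and the bounding-box form of the component region are assumptions made for illustration.

```python
import torch

def crop_region_feature(feature_map: torch.Tensor, box: tuple, stride: int = 4) -> torch.Tensor:
    """Crop the region feature representation for a component region given in
    pixel coordinates (x0, y0, x1, y1), assuming the feature map is downsampled
    from the image by `stride` (feature_map: (B, C, H/stride, W/stride))."""
    x0, y0, x1, y1 = box
    return feature_map[:, :, y0 // stride: -(-y1 // stride), x0 // stride: -(-x1 // stride)]
```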
In some embodiments, the pixel change condition is determined from the differences between feature values in the region feature representation; when a feature value with an abrupt change is found in the region feature representation according to the pixel change condition and that feature value does not lie on the edge line of the component region, an abnormal point in the component region is determined.
In other embodiments, the server segments the region feature representation according to the specified feature length to obtain a plurality of region sub-feature representations, performs feature clustering on the region sub-feature representations, and determines that an image pixel corresponding to an isolated region sub-feature representation is an outlier when the isolated region sub-feature representation exists in the clustering result. Wherein the determining of the isolated region sub-feature representation may be based on a region sub-feature representation determination in a cluster in which the region sub-feature representation is located, i.e. in response to the region sub-feature representation in the target cluster being less than a specified number threshold, determining that the region sub-feature representation within the target cluster is an isolated region sub-feature representation.
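The following sketch illustrates the clustering-based variant: the region feature representation is split into sub-features of a specified length, the sub-features are clustered, and members of very small ("isolated") clusters are treated as candidate abnormal points. The choice of k-means, the number of clusters, and the size threshold are illustrative assumptions, not values given in the patent.

```python
import numpy as np
from sklearn.cluster import KMeans

def find_outlier_subfeatures(region_feat: np.ndarray, seg_len: int,
                             n_clusters: int = 4, min_cluster_size: int = 3):
    """Split a flattened region feature representation into sub-features of
    length seg_len, cluster them, and return the indices of sub-features that
    fall into clusters smaller than min_cluster_size."""
    n = len(region_feat) // seg_len
    subs = region_feat[: n * seg_len].reshape(n, seg_len)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(subs)
    counts = np.bincount(labels, minlength=n_clusters)
    return [i for i, lab in enumerate(labels) if counts[lab] < min_cluster_size]
```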
In summary, in the method for identifying components in a container image provided by the embodiments of the application, when the container image composed of a plurality of image blocks of a first size is encoded, each image block is encoded within the container image using a second size, larger than the first size, as the feature encoding range, so that the resulting image encoding representations of different image blocks partially overlap. This overlap effectively retains more of the association information in the image, and the associations between features are stronger when features are extracted from the image encoding representations. As a result, the degree of association between image blocks is preserved while the downstream identification and classification of components in the container image becomes more accurate, even for components that are tightly connected or overlap in the container image, thereby improving the accuracy of component identification in the container image.
Referring to fig. 11, a flowchart of a method for identifying a component in a container image according to an exemplary embodiment of the present application is shown, in which a feature extraction part in a component identification model is schematically described. The method comprises the following steps:
At step 1110, a container image is acquired.
Illustratively, the container image includes at least one component that makes up the target container. Alternatively, the components may be corrugated panels, door locks, door handles, corner posts, corner pieces, bottom beams, floors, and the like. Illustratively, the container image is divided into a plurality of image blocks of a first size.
In some embodiments, the container image is an image acquired and uploaded by a handheld mobile terminal.
Step 1120, performing feature encoding on the container image with the second size as the feature encoding range to obtain image encoding representations corresponding to the plurality of image blocks.
And the second size is larger than the first size, namely, the second size larger than the first size is used as a feature coding range to perform feature coding on the image blocks in the container image so as to obtain image coding representations corresponding to the image blocks. Optionally, the second size may be preset by the server, may be indicated by the terminal, or may be determined according to the first size. In one example, the second size may be implemented as 7px×7px.
In step 1131, attention representations corresponding to the plurality of image blocks, respectively, are determined based on the multi-head attention mechanism.
In the embodiment of the application, the feature extraction part in the component identification model is composed of a transformer (Transformer) unit and an integration (Patch Merging) unit, where the transformer unit is used to obtain the image feature representation corresponding to each image encoding representation through a multi-head attention mechanism, and the integration unit is used to perform feature integration on the image feature representations to obtain an integrated feature representation.
Illustratively, the transformer unit includes a multi-head self-attention (Multi-Head Self-Attention) subunit and a feed-forward neural network (Feedforward Neural Network, FFN) subunit. The multi-head attention subunit, i.e., the multi-head self-attention structure, applies a typical self-attention mechanism in each head, as shown in Formula 1. Specifically, in each head, for each input image encoding representation (token), a query vector Q, a key K, and a feature vector Value of that image encoding representation are calculated; an association weight is then computed between the Q of this image encoding representation and the K of all image encoding representations; finally, the feature vectors Value of all image encoding representations are weighted and summed to obtain a new feature vector for this image encoding representation. By computing the degree of association between image encoding representations, this mechanism better extracts contextual and long-range feature information.
Equation one: multiHead (Q, K, V) =Concat (head) i ,head 2 ,...,head h )W O
Wherein W is O 、W i QAnd->For a learnable network parameter, concat () is a merge function and Attention () is a kernel function of the Attention mechanism based on vector similarity.
That is, the multi-head attention subunit in the transformer unit determines the image block similarity between the first image encoding representation of the i-th image block and the second image encoding representation of the j-th image block, where i and j are positive integers; for the i-th image block, the image block similarities corresponding to the i-th image block are used as weights, and the image encoding representations of the plurality of image blocks are weighted and summed to obtain the attention representation corresponding to the i-th image block.
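A compact PyTorch sketch of the multi-head self-attention of Formula 1, operating on the token sequence of image encoding representations, is given below; the embedding dimension and number of heads are illustrative values, not values specified in the patent.

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Minimal multi-head self-attention over a token sequence (Formula 1)."""
    def __init__(self, dim=96, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)   # W_i^Q, W_i^K, W_i^V for all heads
        self.proj = nn.Linear(dim, dim)      # W^O

    def forward(self, x):                    # x: (B, N, dim), N tokens
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)              # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        attn = attn.softmax(dim=-1)                        # association weights between tokens
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)  # Concat(head_1, ..., head_h)
        return self.proj(out)                              # multiply by W^O
```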
In some embodiments, one transformer unit may include multiple sets of multi-head attention subunits and feedforward neural network subunits, where the output of the (i-1)-th set of multi-head attention and feedforward neural network subunits serves as the input of the i-th set.
In some embodiments, to increase computational efficiency, a windowed computing mechanism may be introduced, i.e., window segmentation of the container image, with multi-headed attention computation for image-encoded representations within each window.
Illustratively, window division is performed on the container image with a third size to obtain a plurality of candidate windows, where the third size is larger than the first size. The image block similarity between the image encoding representations of the image blocks in the k-th candidate window is determined; for the i-th image block in the k-th candidate window, the image block similarities corresponding to the i-th image block within the k-th candidate window are used as weights, and the image encoding representations within the k-th candidate window are weighted and summed to obtain the attention representation corresponding to the i-th image block.
Optionally, the third size may be preset by the server, may be indicated by the terminal, or may be determined according to the first size.
In one example, when the third dimension is determined from the first dimension, the third dimension may be implemented as a dimension that is a specified multiple of the first dimension.
As shown in fig. 12, a schematic diagram of window segmentation of a container image 1200 according to an exemplary embodiment of the present application is shown, where the container image 1200 corresponds to a plurality of image blocks 1210, and four candidate windows 1220 are obtained after window segmentation of the container image 1200 with a third size, where each candidate window 1220 includes a plurality of image blocks 1210.
When the window-based transformer is adopted, a certain area restriction is added when the multi-head attention mechanism is applied, which introduces a degree of locality and reduces the amount of computation; at the same time, better detail is retained for components with small areas and fine edges, such as door handles, thereby improving the recognition of small-size components in the component identification result.
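A sketch of the window division that enables this windowed attention is shown below; attention (for example, the MultiHeadSelfAttention sketch above) would then be applied independently inside each candidate window. The assumption that the window size divides the token grid evenly is made for brevity.

```python
import torch

def window_partition(tokens: torch.Tensor, h: int, w: int, win: int) -> torch.Tensor:
    """Split an (h*w, C) token map into non-overlapping windows of win x win
    image blocks, so multi-head attention is computed inside each window only.
    Assumes h and w are divisible by win."""
    C = tokens.shape[-1]
    x = tokens.reshape(h // win, win, w // win, win, C)
    x = x.permute(0, 2, 1, 3, 4)        # group the blocks of each window together
    return x.reshape(-1, win * win, C)  # (num_windows, win*win, C)
```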
Step 1132, stitching and compressing the attention representations corresponding to the image blocks to obtain a combined attention representation.
Illustratively, the attention representations output by the multi-head attention subunit correspond one-to-one to the image encoding representations, i.e., the container image corresponds to a plurality of attention representations; therefore, before they are input into the feedforward neural network of the feedforward neural network subunit, the attention representations need to be spliced and compressed to obtain the input of that subunit.
In some embodiments, the multiple attention representations are stitched to obtain a stitched attention representation, and the stitched attention representation is multiplied by the assigned weight matrix to obtain the combined attention representation.
Alternatively, the above-mentioned assigned weight matrix may be preset by the system, or may be a learnable parameter, that is, may be trained during the training process of the component recognition model.
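To make the computation in steps 1131 and 1132 concrete, a minimal sketch is given below, assuming PyTorch, a (B, H, W, C) grid of image coding representations whose height and width are divisible by the window size, and illustrative names such as WindowAttention that are not taken from the embodiment. It computes the image block similarities inside each candidate window, uses them as weights over the value vectors, and then stitches the heads and compresses them with a learned projection, which here plays the role of the assigned weight matrix of step 1132.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Multi-head self-attention restricted to fixed-size windows of image blocks."""
    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.num_heads = num_heads
        self.window_size = window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)   # query/key/value projections
        self.proj = nn.Linear(dim, dim)      # compresses the stitched heads (step 1132)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of image coding representations
        B, H, W, C = x.shape
        ws = self.window_size
        # split the grid into non-overlapping ws x ws candidate windows
        x = x.view(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)

        qkv = self.qkv(x).reshape(x.shape[0], ws * ws, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)            # (windows*B, heads, ws*ws, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale   # block-to-block similarity in the window
        attn = attn.softmax(dim=-1)                     # similarities used as weights
        out = (attn @ v).transpose(1, 2).reshape(x.shape[0], ws * ws, C)
        out = self.proj(out)                            # stitch heads and compress

        # restore the (B, H, W, C) layout
        out = out.view(B, H // ws, W // ws, ws, ws, C)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
```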
And step 1133, performing feature mapping processing on the combined attention representation through a pre-trained feedforward neural network to obtain an image feature representation.
In some embodiments, the feedforward neural network subunit may include one feedforward neural network, or may include a plurality of feedforward neural networks.
The computation in the multi-head attention subunit ignores the positional relationship between the image coding representations, in particular the two-dimensional structural information between image pixels. To address this, in the feedforward neural network subunit of the embodiment of the present application, the feedforward neural network is configured as a feedforward neural network with convolution, that is, a convolution layer is added between the two fully-connected layers of a conventional feedforward neural network.
The feedforward neural network comprises a first full-connection layer, a convolution layer and a second full-connection layer, wherein the first full-connection layer is used for extracting features of the combined attention representation to obtain a first feature representation, the convolution kernel corresponding to the convolution layer is used for extracting relative position relations among image blocks in the first feature representation to obtain a second feature representation, and the second full-connection layer is used for extracting features of the second feature representation to obtain an image feature representation.
As shown in fig. 13, which illustrates a schematic diagram of a feedforward neural network subunit 1300 according to an exemplary embodiment of the present application, in the feedforward neural network subunit 1300, the combined attention representation is input into the first fully-connected layer 1310, the first feature representation output by the first fully-connected layer 1310 is input into the convolution layer 1320 again, the second feature representation output by the convolution layer 1320 is input into the second fully-connected layer 1330 again, and the resulting image feature representation is output.
In other words, when the convolution kernel of the convolution layer computes features, it only extracts associated features between adjacent pixels, which effectively introduces the relative position relationship between image blocks in the combined attention representation. The introduction of the convolution layer also brings a degree of translational invariance and spatial relevance to the model during feature extraction and adds two-dimensional spatial structure information between pixels, so that the recognition accuracy is improved when component recognition is performed based on the extracted features.
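The feedforward neural network with convolution described above can be sketched as follows, assuming PyTorch; the 3×3 depth-wise kernel and the GELU activations are illustrative assumptions, as the embodiment only specifies a convolution layer between the two fully connected layers.

```python
import torch
import torch.nn as nn

class ConvFeedForward(nn.Module):
    """Feedforward subunit with a convolution layer between the two fully connected layers."""
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)                 # first fully connected layer
        self.conv = nn.Conv2d(hidden_dim, hidden_dim, kernel_size=3,
                              padding=1, groups=hidden_dim)   # relations between adjacent blocks
        self.fc2 = nn.Linear(hidden_dim, dim)                 # second fully connected layer
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) combined attention representation, N = h * w image blocks
        x = self.act(self.fc1(x))                             # first feature representation
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, h, w)             # restore the 2-D block grid
        x = self.conv(x)                                      # second feature representation
        x = x.reshape(B, C, N).transpose(1, 2)
        return self.fc2(self.act(x))                          # image feature representation
```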
In some embodiments, the feature extraction portion further includes an integration unit, where the integration unit performs feature integration on the image feature representations respectively corresponding to the plurality of image blocks to obtain an integrated feature representation, and the integrated feature representation is used for iterative encoding based on the image association degree. That is, the image feature representations output by the Transformer unit are integrated so that their number is halved and their dimension is doubled, and the whole network therefore forms a hierarchical structure: the larger the receptive field of the features, the richer the semantic information.
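A minimal sketch of such an integration unit is given below, assuming PyTorch; pairing sequentially adjacent feature representations and the LayerNorm plus linear mixing layer are assumptions, as the embodiment only states that the number is halved and the dimension is doubled.

```python
import torch
import torch.nn as nn

class IntegrationUnit(nn.Module):
    """Halves the number of image feature representations and doubles their dimension."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(2 * dim)
        self.reduction = nn.Linear(2 * dim, 2 * dim)   # learnable mixing of merged features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, C) image feature representations, N even
        B, N, C = x.shape
        x = x.reshape(B, N // 2, 2 * C)                # merge neighbouring pairs: N -> N/2, C -> 2C
        return self.reduction(self.norm(x))
```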
In one example, the feature extraction portion includes a plurality of Transformer units and a plurality of integration units. As shown in fig. 14, which shows a schematic structural diagram of the feature extraction portion 1400 provided in an exemplary embodiment of the present application, the feature extraction portion 1400 includes a first Transformer unit 1410, a second Transformer unit 1420, a third Transformer unit 1430, and a fourth Transformer unit 1440, as well as a first integration unit 1450 that integrates the output features of the first Transformer unit 1410, a second integration unit 1460 that integrates the output features of the second Transformer unit 1420, and a third integration unit 1470 that integrates the output features of the third Transformer unit 1430.
Illustratively, the image coding representations output by the preceding layer, which carry partially overlapping information, are input into the first Transformer unit 1410, and the image feature representation A is obtained through the windowed multi-head attention mechanism and the feedforward neural network with convolution. The first integration unit 1450 halves the number and doubles the dimension of the image feature representation A to obtain the integrated feature representation A.
The integrated feature representation A is input into the second Transformer unit 1420, and the image feature representation B is obtained through the windowed multi-head attention mechanism and the feedforward neural network with convolution. The second integration unit 1460 halves the number and doubles the dimension of the image feature representation B to obtain the integrated feature representation B.
The integrated feature representation B is input into the third Transformer unit 1430, and the image feature representation C is obtained through the windowed multi-head attention mechanism and the feedforward neural network with convolution. The third integration unit 1470 halves the number and doubles the dimension of the image feature representation C to obtain the integrated feature representation C.
The integrated feature representation C is input into the fourth Transformer unit 1440, and the image feature representation D is obtained through the windowed multi-head attention mechanism and the feedforward neural network with convolution. The image feature representation D is the feature output by the feature extraction portion 1400 and is input to the downstream feature decoding portion.
The image feature representation of each layer is integrated and then further feature-extracted, so that the feature association degree between different pixels is obtained over a larger range, yielding a larger receptive field and a better recognition effect on large-area components such as side panels and cross beams.
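Putting the stages together, the following sketch mirrors the hierarchical structure of fig. 14, assuming PyTorch; make_unit stands for a hypothetical factory producing a Transformer unit (windowed attention plus convolutional feedforward network), make_integration for a factory producing an integration unit such as the sketch above, and the stage dimensions are illustrative.

```python
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Four Transformer stages with an integration unit between consecutive stages."""
    def __init__(self, make_unit, make_integration, dims=(96, 192, 384, 768)):
        super().__init__()
        self.units = nn.ModuleList(make_unit(d) for d in dims)
        self.merges = nn.ModuleList(
            [make_integration(d) for d in dims[:-1]] + [nn.Identity()]
        )

    def forward(self, x):
        features = []                        # image feature representations A, B, C, D
        for unit, merge in zip(self.units, self.merges):
            x = unit(x)
            features.append(x)               # kept for the multi-resolution feature decoder
            x = merge(x)                     # halve the count, double the dimension
        return features
```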
In step 1140, pixels in the container image are classified based on the degree of inter-pixel association indicated by the image feature representation, and a component recognition result corresponding to the container image is generated.
In some embodiments, the generation of the component recognition result is implemented by a feature decoding portion in the pre-trained component recognition model. The image feature representation extracted by the feature extraction portion contains rich semantic features of the image, and the feature decoding portion decodes the image feature representation so that pixels in the container image are recognized and classified through the decoding process, thereby generating the component recognition result corresponding to the container image.
In summary, in the method for identifying components in a container image provided by the embodiment of the present application, when image coding is performed on a container image composed of a plurality of image blocks of a first size, the image blocks in the container image are encoded with a second size larger than the first size as the feature coding range to obtain the corresponding image coding representations. Partial overlapping information therefore exists between the image coding representations corresponding to different image blocks, which effectively retains more of the associated information in the image, and when feature extraction is performed on the image coding representations, the association between features is stronger. As a result, the association degree between image blocks is preserved and the accuracy of the downstream recognition and classification of components in the container image is improved, even for components that are closely connected or overlap in the container image, thereby improving the accuracy of component recognition in the container image.
Referring to fig. 15, a flowchart of a method for identifying components in a container image according to an exemplary embodiment of the present application is shown. In this embodiment, the feature decoding portion of the component recognition model is schematically described; the feature encoding portion connected to the feature decoding portion includes a plurality of Transformer units and therefore corresponds to image feature representations at a plurality of different resolutions. The method includes:
At step 1510, a container image is acquired.
Illustratively, the container image includes at least one component that makes up the target container. Alternatively, the components may be corrugated panels, door locks, door handles, corner posts, corner pieces, bottom beams, floors, and the like. Illustratively, the container image is divided into a plurality of image blocks of a first size.
In some embodiments, the container image is an image acquired and uploaded by a handheld mobile terminal.
Step 1520, performing feature encoding on the container image with the second size as the feature encoding range to obtain image encoding representations corresponding to the plurality of image blocks.
The second size is larger than the first size; that is, the image blocks in the container image are feature-encoded with a second size larger than the first size as the feature encoding range to obtain the image coding representations corresponding to the image blocks. Optionally, the second size may be preset by the server, may be indicated by the terminal, or may be determined according to the first size. In one example, the second size may be implemented as 7px×7px.
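Such overlapping feature encoding can be sketched as a strided convolution whose kernel (the second size) is larger than its stride (the first size), so that encodings of neighbouring image blocks share context. The sketch assumes PyTorch; the strided-convolution formulation and the 4px×4px first size are assumptions, and only the 7px×7px second size is stated in this embodiment.

```python
import torch
import torch.nn as nn

class OverlappingEncoding(nn.Module):
    """Encodes each image block over a range larger than the block itself."""
    def __init__(self, in_channels=3, embed_dim=96, first_size=4, second_size=7):
        super().__init__()
        # stride = first size (block grid), kernel = second size (feature encoding range)
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=second_size,
                              stride=first_size, padding=second_size // 2)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, img: torch.Tensor) -> torch.Tensor:
        # img: (B, 3, H, W) container image -> (B, H/first_size, W/first_size, embed_dim)
        x = self.proj(img)              # 7x7 encoding range over a 4x4 block grid
        x = x.permute(0, 2, 3, 1)       # one image coding representation per block
        return self.norm(x)
```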
In step 1530, feature extraction is performed on the degree of correlation between pixels in the container image by the image encoded representation, resulting in a multi-resolution image feature representation.
In the embodiment of the present application, the feature extraction portion of the component recognition model includes a plurality of windowed Transformer units with convolution layers, and the number and dimension of the features are integrated between Transformer units by integration units, so that the plurality of Transformer units form a hierarchical structure and therefore correspond to image feature representations at a plurality of resolutions.
Step 1541, convolving and upsampling the image feature representation corresponding to the ith resolution in the container image to obtain a first decoded feature representation corresponding to the ith resolution.
Step 1542, performing feature stitching on the first decoded feature representation corresponding to the ith resolution and the image feature representation corresponding to the jth resolution to obtain a second decoded feature representation corresponding to the jth resolution.
Wherein the j-th resolution is higher than the i-th resolution.
In the embodiment of the present application, the feature decoding portion includes the same number of feature decoding units as Transformer units, and each feature decoding unit is connected to the previous-stage feature decoding unit and the corresponding Transformer unit.
Schematically, as shown in fig. 16, a schematic diagram of the connection between a feature extraction portion 1610 and a feature decoding portion 1620 provided by an exemplary embodiment of the present application is shown. The component recognition model 1600 includes the feature extraction portion 1610 and the feature decoding portion 1620, and the feature extraction portion 1610 includes a first Transformer unit 1611, a second Transformer unit 1612, a third Transformer unit 1613, and a fourth Transformer unit 1614, as well as a first integration unit 1615 for integrating the output features of the first Transformer unit 1611, a second integration unit 1616 for integrating the output features of the second Transformer unit 1612, and a third integration unit 1617 for integrating the output features of the third Transformer unit 1613.
The image feature representation D output by the fourth Transformer unit 1614 is input to the first decoding unit 1621 in the feature decoding portion 1620, and the image feature representation D is convolved and upsampled by the first decoding unit 1621 to obtain the decoded feature representation A.
The image feature representation C output by the third Transformer unit 1613 is input to the second decoding unit 1622, combined with the decoded feature representation A in the second decoding unit 1622, and subjected to convolution and upsampling to obtain the decoded feature representation B.
The image feature representation B output by the second Transformer unit 1612 is input to the third decoding unit 1623, combined with the decoded feature representation B in the third decoding unit 1623, and subjected to convolution and upsampling to obtain the decoded feature representation C.
The image feature representation A output by the first Transformer unit 1611 is input to the fourth decoding unit 1624, combined with the decoded feature representation C in the fourth decoding unit 1624, and subjected to convolution to obtain the decoded feature representation D.
Step 1543, performing feature fusion on the second decoded feature representations corresponding to the respective resolutions to obtain a fused feature representation.
Illustratively, the feature decoding portion 1620 in fig. 16 further includes a feature fusion unit 1625. The feature fusion unit 1625 splices and fuses the decoded features output by the decoding units to obtain the component recognition result; that is, the feature fusion unit 1625 splices the decoded feature representation A, the decoded feature representation B, the decoded feature representation C, and the decoded feature representation D to obtain a spliced decoded feature representation, and performs a convolution operation on the spliced decoded feature representation to output the component recognition result.
Step 1544, classifying pixels in the container image based on the fused feature representation, and generating a component recognition result of the container image.
For the component identification service for containers, components such as the left and right side panels of the container have large areas and smooth edges, while components such as the door lock rod have small areas and intricate edges. The scales of the different component types differ greatly and their edge requirements differ, so multi-scale features are used to take the recognition effect on all component types into account.
Meanwhile, when multi-scale features are fused for decoding, the fused features retain both high-resolution edge features and low-resolution semantic features, and the increased diversity of the information expressed by the features improves the accuracy of the recognition result.
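A sketch of such a feature decoding portion is given below, assuming PyTorch and (B, C, H, W) feature maps; the channel widths, bilinear upsampling, and batch normalisation are assumptions. Each decoding unit splices the incoming decoded feature with the image feature representation of matching resolution, convolves and upsamples it, and the fusion step splices all decoded features before classifying every pixel into a component type.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecodingUnit(nn.Module):
    """Splice with the same-resolution image feature, then convolve and upsample."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_dim, out_dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip=None):
        if skip is not None:
            x = torch.cat([x, skip], dim=1)    # feature stitching with the skip connection
        x = self.conv(x)
        return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

class FeatureDecoder(nn.Module):
    """Decode from coarse to fine, fuse all decoded features, classify pixels."""
    def __init__(self, dims=(96, 192, 384, 768), decode_dim=128, num_classes=10):
        super().__init__()
        self.first = DecodingUnit(dims[3], decode_dim)
        self.units = nn.ModuleList(
            DecodingUnit(decode_dim + d, decode_dim) for d in (dims[2], dims[1], dims[0])
        )
        self.classify = nn.Conv2d(4 * decode_dim, num_classes, kernel_size=1)

    def forward(self, feats):
        # feats: image feature representations [A, B, C, D], fine to coarse
        a, b, c, d = feats
        dec_a = self.first(d)                # decoded feature representation A
        dec_b = self.units[0](dec_a, c)      # decoded feature representation B
        dec_c = self.units[1](dec_b, b)      # decoded feature representation C
        dec_d = self.units[2](dec_c, a)      # decoded feature representation D
        # bring all decoded features to one size, splice, and classify each pixel
        size = dec_d.shape[-2:]
        fused = torch.cat([F.interpolate(f, size=size, mode="bilinear", align_corners=False)
                           for f in (dec_a, dec_b, dec_c, dec_d)], dim=1)
        # per-pixel component logits (a final upsample to the input resolution is omitted)
        return self.classify(fused)
```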
In summary, in the method for identifying components in a container image provided by the embodiment of the present application, when image coding is performed on a container image composed of a plurality of image blocks of a first size, the image blocks in the container image are encoded with a second size larger than the first size as the feature coding range to obtain the corresponding image coding representations. Partial overlapping information therefore exists between the image coding representations corresponding to different image blocks, which effectively retains more of the associated information in the image, and when feature extraction is performed on the image coding representations, the association between features is stronger. As a result, the association degree between image blocks is preserved and the accuracy of the downstream recognition and classification of components in the container image is improved, even for components that are closely connected or overlap in the container image, thereby improving the accuracy of component recognition in the container image.
Referring to fig. 17 in combination with the above embodiments, a flowchart of component recognition of a container image according to an exemplary embodiment of the present application is shown. An input container image 1701 is image-encoded by an embedding layer 1710 to obtain a plurality of image coding representations, the plurality of image coding representations are input to Transformer units 1720 for feature extraction to obtain image feature representations, and the image feature representations are input to a decoder 1730 for multi-scale feature fusion, after which the component recognition result 1702 is output.
In the embodiment of the present application, the component recognition model is obtained by training a candidate component recognition model. Training of the candidate component recognition model adopts a strategy combining online hard example mining (Online Hard Example Mining, OHEM) and cross entropy loss (Cross Entropy Loss); that is, the difficulty level of the mined samples is adjusted during training, and extremely difficult samples are discarded to prevent the network from overfitting, where the extremely difficult samples may be implemented as the top 5% of samples when ranked by difficulty.
Meanwhile, class weights (cls_weight) are adopted for the various components during training; the specific formula is shown as Formula II.
Formula II: cls_weight = (median_freq / cls_freq)^ratio
where median_freq is the median of the per-class pixel counts, cls_freq is the pixel count of each class, and ratio is a smoothing coefficient, generally set to 0.3, which alleviates the edge expansion problem for components with very few pixels, such as door handles.
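A sketch of this training objective follows, assuming PyTorch; the keep ratio and the exact OHEM selection rule are assumptions, and only the 5% drop rate, the combination with cross entropy loss, and the 0.3 smoothing coefficient of Formula II come from the description above.

```python
import torch
import torch.nn.functional as F

def class_weights(pixel_counts: torch.Tensor, ratio: float = 0.3) -> torch.Tensor:
    # Formula II: cls_weight = (median_freq / cls_freq) ** ratio
    median_freq = pixel_counts.float().median()
    return (median_freq / pixel_counts.float()) ** ratio

def ohem_weighted_ce(logits, target, cls_weight, keep_ratio=0.25, drop_ratio=0.05):
    """Online hard example mining over class-weighted per-pixel cross entropy."""
    loss = F.cross_entropy(logits, target, weight=cls_weight, reduction="none").flatten()
    loss_sorted, _ = loss.sort(descending=True)
    n = loss_sorted.numel()
    drop = int(n * drop_ratio)     # abandon the extremely difficult samples (top 5%)
    keep = int(n * keep_ratio)     # train on the remaining hardest pixels
    return loss_sorted[drop:drop + keep].mean()
```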
Referring to fig. 18, a block diagram of a component recognition apparatus in a container image according to an exemplary embodiment of the present application is shown, where the apparatus includes the following modules:
an acquisition module 1810 for acquiring a container image, wherein the container image includes at least one component forming a target container, and the container image includes a plurality of image blocks with a first size;
the encoding module 1820 is configured to perform feature encoding on the container image with a second size as a feature encoding range, so as to obtain image encoded representations corresponding to a plurality of image blocks, where the second size is greater than the first size;
the extracting module 1830 is configured to perform feature extraction on the degree of correlation between pixels in the container image through the image coding representation, so as to obtain an image feature representation corresponding to the container image;
and the generating module 1840 is configured to classify pixels in the container image based on the degree of correlation between pixels indicated by the image feature representation, and generate a component identification result corresponding to the container image, where the component identification result is used to indicate a component type of the at least one component in the container image.
In some alternative embodiments, as shown in fig. 19, the extraction module 1830 further comprises:
a determining unit 1831 for determining attention representations respectively corresponding to the plurality of image blocks based on a multi-head attention mechanism;
a stitching unit 1832, configured to stitch and compress attention representations corresponding to the plurality of image blocks to obtain a combined attention representation;
and a mapping unit 1833, configured to perform feature mapping processing on the combined attention representation through a pre-trained feedforward neural network, to obtain the image feature representation.
In some alternative embodiments, the determining unit 1831 is further configured to determine a similarity of image blocks between the first image-encoded representation of the ith image block and the second image-encoded representation of the jth image block, where i and j are positive integers; and for the ith image block, taking the similarity of the image blocks corresponding to the ith image block as a weight, and carrying out weighted summation on the image coding representations of the plurality of image blocks to obtain the attention representation corresponding to the ith image block.
In some alternative embodiments, the extraction module 1830 further comprises:
a dividing unit 1834, configured to perform window division on the container image with a third size, to obtain a plurality of candidate windows, where the third size is greater than the first size;
The determining unit 1831 is further configured to determine a similarity of image blocks between the image-encoded representations of the image blocks in the kth candidate window; and for the ith image block in the kth candidate window, taking the similarity of the image block corresponding to the ith image block in the kth candidate window as a weight, and carrying out weighted summation on the image coding representation in the kth candidate window to obtain the attention representation corresponding to the ith image block.
In some alternative embodiments, the feed forward neural network includes a first fully connected layer, a convolutional layer, and a second fully connected layer;
the mapping unit 1833 is further configured to perform feature extraction on the combined attention representation through the first fully-connected layer to obtain a first feature representation; extracting the relative position relation between the image blocks in the first characteristic representation through a convolution kernel corresponding to the convolution layer to obtain a second characteristic representation; and carrying out feature extraction on the second feature representation through the second full connection layer to obtain the image feature representation.
In some optional embodiments, the image feature representations of the container image include image feature representations corresponding to the plurality of image blocks respectively;
The apparatus further comprises: the integration module 1850 is configured to perform feature integration on image feature representations corresponding to the plurality of image blocks, to obtain an integrated feature representation, where the integrated feature representation is used for performing iterative encoding based on the image association degree.
In some alternative embodiments, the image block includes a plurality of image feature representations of different resolutions;
the generating module 1840 further includes:
a decoding unit 1841, configured to perform convolution and upsampling on an image feature representation corresponding to an ith resolution in the container image, to obtain a first decoded feature representation corresponding to the ith resolution; performing feature stitching on a first decoding feature representation corresponding to an ith resolution and an image feature representation corresponding to a jth resolution to obtain a second decoding feature representation corresponding to the jth resolution, wherein the jth resolution is higher than the ith resolution;
a fusion unit 1842, configured to perform feature fusion on the second decoded feature representation corresponding to each resolution, so as to obtain a fused feature representation;
a generating unit 1843, configured to classify pixels in the container image based on the fusion feature representation, and generate the component identification result of the container image.
In some alternative embodiments, the acquiring module 1810 is further configured to receive the container image, where the container image is obtained by capturing the target container with a mobile image capturing device.
In some alternative embodiments, the component recognition result includes a component area of at least one component;
the apparatus further comprises: an annotating module 1860 for determining a region feature representation from a plurality of image feature representations based on a position of the component region in the container image; determining an abnormal point corresponding to the component area based on the pixel change condition indicated by the area characteristic representation; and marking the abnormal points on the component identification result to obtain an abnormal marking result, wherein the component identification result and the abnormal marking result are commonly used for a component defect detection task of the target container.
In summary, in the component recognition device in the container image provided by the embodiment of the application, when the container image formed by a plurality of image blocks with the first size is subjected to image encoding, the image blocks are subjected to image encoding in the container image by taking the second size larger than the first size as the feature encoding range, so that corresponding image encoding representations are obtained, partial overlapping information exists between the image encoding representations corresponding to different image blocks, more associated information in the image is effectively reserved, and when feature extraction is performed through the image encoding representations, the association between features is stronger, therefore, the association degree between the image blocks can be ensured while the recognition accuracy is improved when the downstream recognition and classification of the components in the container image are improved, even the components which are closely connected or have overlapping relation in the container image can be accurately recognized, and the accuracy of the component recognition in the container image is improved.
It should be noted that: the component recognition device in the container image provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to complete all or part of the functions described above. In addition, the component recognition device in the container image provided in the above embodiment belongs to the same concept as the component recognition method in the container image, and the detailed implementation process of the component recognition device is detailed in the method embodiment, which is not described herein.
Fig. 20 is a schematic diagram showing a structure of a server according to an exemplary embodiment of the present application. Specifically, the following structure is included.
The server 2000 includes a central processing unit (Central Processing Unit, CPU) 2001, a system Memory 2004 including a random access Memory (Random Access Memory, RAM) 2002 and a Read Only Memory (ROM) 2003, and a system bus 2005 connecting the system Memory 2004 and the central processing unit 2001. The server 2000 also includes a mass storage device 2006 for storing an operating system 2013, application programs 2014, and other program modules 2015.
The mass storage device 2006 is connected to the central processing unit 2001 through a mass storage controller (not shown) connected to the system bus 2005. The mass storage device 2006 and its associated computer-readable media provide non-volatile storage for the server 2000. That is, the mass storage device 2006 may include a computer-readable medium (not shown) such as a hard disk or compact disc read only memory (Compact Disc Read Only Memory, CD-ROM) drive.
Computer readable media may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), charged erasable programmable read-only memory (Electrically Erasable Programmable Read Only Memory, EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that computer storage media are not limited to the ones described above. The system memory 2004 and mass storage 2006 described above may be collectively referred to as memory.
According to various embodiments of the present application, the server 2000 may also be operated through a remote computer connected to a network such as the Internet. That is, the server 2000 may be connected to the network 2012 via the network interface unit 2011 coupled to the system bus 2005, or the network interface unit 2011 may be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, which are stored in the memory and configured to be executed by the CPU.
Embodiments of the present application also provide a computer device including a processor and a memory storing at least one instruction, at least one program, code set, or instruction set, the at least one instruction, at least one program, code set, or instruction set being loaded and executed by the processor to implement the method of identifying a component in a container image provided by the above method embodiments. Alternatively, the computer device may be a terminal or a server.
Embodiments of the present application also provide a computer readable storage medium having stored thereon at least one instruction, at least one program, a code set, or an instruction set, the at least one instruction, the at least one program, the code set, or the instruction set being loaded and executed by a processor to implement the method for identifying a component in a container image provided by the above method embodiments.
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method of identifying components in the container image as described in any of the above embodiments.
Alternatively, the computer-readable storage medium may include: read Only Memory (ROM), random access Memory (RAM, random Access Memory), solid state disk (SSD, solid State Drives), or optical disk, etc. The random access memory may include resistive random access memory (ReRAM, resistance Random Access Memory) and dynamic random access memory (DRAM, dynamic Random Access Memory), among others. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description relates only to preferred embodiments of the present application and is not intended to limit the application; the scope of protection of the application is defined by the appended claims.

Claims (13)

1. A method of identifying components in an image of a container, the method comprising:
acquiring a container image, wherein the container image comprises at least one component for forming a target container, and the container image comprises a plurality of image blocks with a first size;
performing feature coding on the container image by taking a second size as a feature coding range to obtain image coding representations corresponding to a plurality of image blocks, wherein the second size is larger than the first size;
extracting features of the correlation degree among pixels in the container image through the image coding representation to obtain an image feature representation corresponding to the container image;
and classifying pixels in the container image based on the pixel correlation degree indicated by the image feature representation, and generating a component identification result corresponding to the container image, wherein the component identification result is used for indicating the component type of the at least one component in the container image.
2. The method according to claim 1, wherein the feature extraction of the degree of correlation between pixels in the container image by the image coding representation to obtain an image feature representation corresponding to the container image comprises:
determining attention representations respectively corresponding to the plurality of image blocks based on a multi-head attention mechanism;
splicing and compressing the attention representations corresponding to the image blocks to obtain a combined attention representation;
and performing feature mapping processing on the combined attention representation through a pre-trained feedforward neural network to obtain the image feature representation.
3. The method of claim 2, wherein the determining, based on the multi-head attention mechanism, attention representations respectively corresponding to the plurality of image blocks comprises:
determining a similarity of image blocks between a first image encoded representation of an ith image block and a second image encoded representation of a jth image block, i and j being positive integers;
and for the ith image block, taking the similarity of the image blocks corresponding to the ith image block as a weight, and carrying out weighted summation on the image coding representations of the plurality of image blocks to obtain the attention representation corresponding to the ith image block.
4. The method of claim 2, wherein the determining, based on the multi-head attention mechanism, attention representations respectively corresponding to the plurality of image blocks comprises:
window segmentation is carried out on the container image with a third size, so that a plurality of candidate windows are obtained, and the third size is larger than the first size;
determining image block similarity between image coded representations of respective image blocks within a kth candidate window;
and for the ith image block in the kth candidate window, taking the similarity of the image block corresponding to the ith image block in the kth candidate window as a weight, and carrying out weighted summation on the image coding representation in the kth candidate window to obtain the attention representation corresponding to the ith image block.
5. The method of claim 2, wherein the feed forward neural network comprises a first fully connected layer, a convolutional layer, and a second fully connected layer;
the feature mapping processing is performed on the combined attention representation through a pre-trained feedforward neural network to obtain the image feature representation, and the feature mapping processing comprises the following steps:
performing feature extraction on the combined attention representation through the first full connection layer to obtain a first feature representation;
Extracting the relative position relation between the image blocks in the first characteristic representation through a convolution kernel corresponding to the convolution layer to obtain a second characteristic representation;
and carrying out feature extraction on the second feature representation through the second full connection layer to obtain the image feature representation.
6. The method according to any one of claims 1 to 5, wherein the image feature representation of the container image comprises image feature representations corresponding to the plurality of image blocks, respectively;
the method further comprises the steps of:
and carrying out feature integration on the image feature representations corresponding to the image blocks respectively to obtain integrated feature representations, wherein the integrated feature representations are used for carrying out iterative encoding based on the image association degree.
7. The method of any one of claims 1 to 5, wherein the image block comprises a plurality of image feature representations of different resolutions;
the classifying pixels in the container image based on the pixel correlation degree indicated by the image feature representation, and generating a component identification result corresponding to the container image, includes:
convolving and upsampling the image feature representation corresponding to the ith resolution in the container image to obtain a first decoded feature representation corresponding to the ith resolution;
Performing feature stitching on a first decoding feature representation corresponding to an ith resolution and an image feature representation corresponding to a jth resolution to obtain a second decoding feature representation corresponding to the jth resolution, wherein the jth resolution is higher than the ith resolution;
performing feature fusion on the second decoding feature representation corresponding to each resolution to obtain a fusion feature representation;
classifying pixels in the container image based on the fused feature representation, generating the component identification result of the container image.
8. A method according to any one of claims 1 to 3, wherein said acquiring an image of a container comprises:
and receiving the container image, wherein the container image is obtained by shooting the target container through a mobile image acquisition device.
9. A method according to any one of claims 1 to 3, wherein the component recognition result includes a component area of at least one component;
the method further comprises the steps of:
determining a region feature representation from a plurality of image feature representations based on the location of the component region in the container image;
determining an abnormal point corresponding to the component area based on the pixel change condition indicated by the area characteristic representation;
And marking the abnormal points on the component identification result to obtain an abnormal marking result, wherein the component identification result and the abnormal marking result are commonly used for a component defect detection task of the target container.
10. A component identification device in a container image, the device comprising:
an acquisition module, configured to acquire a container image, where the container image includes at least one component that forms a target container, and the container image includes a plurality of image blocks of a first size;
the encoding module is used for carrying out feature encoding on the container image by taking a second size as a feature encoding range to obtain image encoding representations corresponding to a plurality of image blocks, and the second size is larger than the first size;
the extraction module is used for extracting the characteristics of the correlation between pixels in the container image through the image coding representation to obtain the image characteristic representation corresponding to the container image;
the generation module is used for classifying pixels in the container image based on the pixel association degree indicated by the image feature representation, and generating a component identification result corresponding to the container image, wherein the component identification result is used for indicating the component type of the at least one component in the container image.
11. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the method of component identification in a container image as claimed in any one of claims 1 to 9.
12. A computer readable storage medium having stored therein at least one program code loaded and executed by a processor to implement the method of component identification in a container image as claimed in any one of claims 1 to 9.
13. A computer program product comprising a computer program or instructions which, when executed by a processor, implements a method of component identification in an image of a container as claimed in any one of claims 1 to 9.
CN202211522957.3A 2022-11-30 2022-11-30 Method, device, equipment, medium and product for identifying parts in container image Pending CN116994024A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211522957.3A CN116994024A (en) 2022-11-30 2022-11-30 Method, device, equipment, medium and product for identifying parts in container image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211522957.3A CN116994024A (en) 2022-11-30 2022-11-30 Method, device, equipment, medium and product for identifying parts in container image

Publications (1)

Publication Number Publication Date
CN116994024A true CN116994024A (en) 2023-11-03

Family

ID=88522017

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211522957.3A Pending CN116994024A (en) 2022-11-30 2022-11-30 Method, device, equipment, medium and product for identifying parts in container image

Country Status (1)

Country Link
CN (1) CN116994024A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117911411A (en) * 2024-03-19 2024-04-19 南京认知物联网研究院有限公司 Computer vision detection method and device based on parallel detection of picture streams
CN117911411B (en) * 2024-03-19 2024-05-24 南京认知物联网研究院有限公司 Computer vision detection method and device based on parallel detection of picture streams

Similar Documents

Publication Publication Date Title
Min et al. Hyperpixel flow: Semantic correspondence with multi-layer neural features
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN110782420A (en) Small target feature representation enhancement method based on deep learning
CN114663670A (en) Image detection method and device, electronic equipment and storage medium
CN109492610B (en) Pedestrian re-identification method and device and readable storage medium
CN110188827A (en) A kind of scene recognition method based on convolutional neural networks and recurrence autocoder model
CN115937655A (en) Target detection model of multi-order feature interaction, and construction method, device and application thereof
Mehrjardi et al. A survey on deep learning-based image forgery detection
Zhang et al. Fine localization and distortion resistant detection of multi-class barcode in complex environments
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN113807218B (en) Layout analysis method, device, computer equipment and storage medium
Cerón et al. Real‐time transmission tower detection from video based on a feature descriptor
CN116994024A (en) Method, device, equipment, medium and product for identifying parts in container image
CN112668662B (en) Outdoor mountain forest environment target detection method based on improved YOLOv3 network
CN114187506A (en) Remote sensing image scene classification method of viewpoint-aware dynamic routing capsule network
WO2024027347A1 (en) Content recognition method and apparatus, device, storage medium, and computer program product
CN115240121B (en) Joint modeling method and device for enhancing local features of pedestrians
Jiafa et al. A scene recognition algorithm based on deep residual network
US20230038097A1 (en) Document clusterization using neural networks
Wang et al. Cascading classifier with discriminative multi-features for a specific 3D object real-time detection
CN115063831A (en) High-performance pedestrian retrieval and re-identification method and device
Hu et al. Detection of material on a tray in automatic assembly line based on convolutional neural network
CN112784838A (en) Hamming OCR recognition method based on locality sensitive hashing network
Jun et al. Two-view correspondence learning via complex information extraction
Ma et al. STN: saliency-guided Transformer network for point-wise semantic segmentation of urban scenes

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication