CN117636326A - License plate detection method and device, storage medium and electronic equipment - Google Patents

License plate detection method and device, storage medium and electronic equipment

Info

Publication number
CN117636326A
CN117636326A (application CN202311706833.5A)
Authority
CN
China
Prior art keywords
license plate
text
detection
image
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311706833.5A
Other languages
Chinese (zh)
Inventor
张玉亭
刘江
左鑫孟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Corp Ltd
Original Assignee
China Telecom Corp Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Corp Ltd filed Critical China Telecom Corp Ltd
Priority to CN202311706833.5A priority Critical patent/CN117636326A/en
Publication of CN117636326A publication Critical patent/CN117636326A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/625 License plates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a license plate detection method and device, a storage medium, and electronic equipment. The method comprises the following steps: acquiring a license plate image to be identified; and detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, wherein the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features. The method solves the technical problems that related multi-modal detection methods lack learning of local region information and have no zero-shot detection capability in license plate detection.

Description

License plate detection method and device, storage medium and electronic equipment
Technical Field
The application relates to the technical field of data processing, in particular to a license plate detection method, a license plate detection device, a storage medium and electronic equipment.
Background
With the development of intelligent technology and the growing number of vehicles, license plate information currently needs to be acquired from images in many scenarios to identify vehicle behavior, vehicle identity, and the like, for example in roadside parking systems and intelligent checkpoint (bayonet) systems.
At present, most license plate detection is similar to natural scene text detection and is based on text-box detection. The main difference is that license plates have different background colors, so a license plate needs to be distinguished from natural scene text during detection; license plate detection therefore focuses more on local region detection. Natural scene text detection currently adopts the CLIP (Contrastive Language-Image Pre-training) model. CLIP is a way to effectively learn image-level visual representations from a large number of raw image-text pairs; that is, it focuses more on the global features of the whole image and achieves good performance on image classification tasks, but because it lacks learning of local information, local instance features are harder to distinguish. In addition, the related art also proposes a method for turning the CLIP model into a scene text detector, which can directly use the prior knowledge in the CLIP model in a scene text detector without a pre-training process, can improve existing scene text detectors, and can improve few-shot training capability, but it does not have zero-shot detection capability.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiments of the present application provide a license plate detection method and device, a storage medium, and electronic equipment, which at least solve the technical problems that related multi-modal detection methods lack learning of local region information and have no zero-shot detection capability in license plate detection.
According to an aspect of an embodiment of the present application, there is provided a license plate detection method, including: acquiring a license plate image to be identified; and detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, wherein the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features.
Optionally, the training process of the target license plate detection network includes: constructing an initial license plate detection network; obtaining multiple groups of sample data, wherein each group of sample data comprises a first license plate image and a first text prompt, the first license plate image being used for reflecting a first license plate bounding box of a first license plate, and the first text prompt being used for reflecting a first license plate type of the first license plate; and sequentially and iteratively training the initial license plate detection network according to the multiple groups of sample data to obtain the target license plate detection network.
Optionally, the initial license plate detection network includes: a visual encoder, a text encoder, a text embedding module, and a region visual embedding module in the GLIP model; a text region alignment module newly added on the basis of the GLIP model; a visual prompt generator, a fine-grained embedding module, a locality embedding module, a coarse-grained text region module, and a text detector in the scene text detection model; and a text classifier consisting of a multi-layer convolutional neural network.
Optionally, sequentially and iteratively training the initial license plate detection network according to the multiple groups of sample data to obtain the target license plate detection network includes: analyzing each group of sample data through the initial license plate detection network to obtain a corresponding predicted license plate bounding box and predicted license plate type; constructing a target loss function based on the first license plate bounding box and the first license plate type, and the predicted license plate bounding box and the predicted license plate type, within each set of sample data; and adjusting network parameters of the initial license plate detection network according to the target loss function to obtain the target license plate detection network.
Optionally, analyzing each set of sample data through the initial license plate detection network to obtain a corresponding predicted license plate bounding box and predicted license plate type includes: for each set of sample data, processing the first license plate image with the visual encoder to obtain corresponding first visual features, performing feature extraction on the first visual features with the region visual embedding module to obtain second visual features, processing the first text prompt with the text encoder to obtain corresponding first text features, and performing feature extraction on the first text features with the text embedding module to obtain corresponding second text features; processing the second visual features and the second text features through the visual prompt generator to obtain a visual prompt, converting the visual prompt with the fine-grained embedding module to obtain fine-grained visual features, and fusing the second visual features and the fine-grained visual features with the locality embedding module to obtain local visual features; matching and aligning the second text features and the local visual features with the coarse-grained text region module, and obtaining a corresponding text segmentation map after dot product and sigmoid operations, wherein the text segmentation map comprises corresponding text instance type labels and text instance bounding boxes; and fusing the text segmentation map with the local visual features, and outputting, with the text detector and the text classifier respectively, a predicted license plate bounding box and a predicted license plate type of the first license plate in the first license plate image.
Optionally, constructing the target loss function based on the first license plate bounding box and the first license plate type, and the predicted license plate bounding box and the predicted license plate type, within each set of sample data includes: constructing a detection loss function based on the first license plate bounding box and the predicted license plate bounding box in each set of sample data; constructing a classification loss function based on the first license plate type and the predicted license plate type in each set of sample data; constructing an alignment loss function based on the second text feature and the local visual feature extracted from each set of sample data; constructing an auxiliary loss function based on the text instance types and text instance bounding boxes in the text segmentation map, the predicted license plate bounding box, and the predicted license plate type; and composing the target loss function from the detection loss function, the classification loss function, the alignment loss function, and the auxiliary loss function.
Optionally, the tag of the target license plate type includes at least one of: license plate color, license plate font, license plate number.
According to another aspect of the embodiments of the present application, there is also provided a license plate detection device, including: an acquisition module, configured to acquire a license plate image to be identified; and a detection module, configured to detect and analyze the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, wherein the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features.
According to another aspect of the embodiments of the present application, there is further provided a non-volatile storage medium, where the non-volatile storage medium includes a stored computer program, and a device where the non-volatile storage medium is located executes the license plate detection method described above by running the computer program.
According to another aspect of the embodiments of the present application, there is also provided an electronic device including: the license plate detection device comprises a memory and a processor, wherein the memory stores a computer program, and the processor is configured to execute the license plate detection method through the computer program.
In the embodiments of the present application, a license plate image to be identified is acquired, and a pre-trained target license plate detection network is adopted to detect and analyze the license plate image to be identified to obtain the target license plate type and target license plate bounding box corresponding to the license plate image to be identified. The target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, where the scene text detection model uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features, and obtains a text detection result based on the visual features and the text features.
With the target license plate detection network that combines the scene text detection model and the language-image pre-trained GLIP model, zero-shot detection capability can be improved through deep cross-modal fusion, local region information can be learned through the GLIP model, and recognition of local regions can be improved by training each region/box to align with phrases in the text prompt, thereby solving the technical problems that related multi-modal detection methods lack learning of local region information and have no zero-shot detection capability in license plate detection.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
fig. 1 is a hardware block diagram of a computer terminal for implementing a license plate detection method according to an embodiment of the present application;
FIG. 2 is a flow chart of an alternative license plate detection method according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a TCM architecture of the related art;
FIG. 4 is a schematic architecture diagram of an alternative initial license plate detection network according to an embodiment of the present application;
FIG. 5 is a schematic illustration of electric vehicle license plates in the related art;
FIG. 6 is an inference scene diagram of an alternative target license plate detection model according to an embodiment of the application;
fig. 7 is a schematic structural diagram of an alternative license plate detection device according to an embodiment of the present application.
Detailed Description
To help those skilled in the art better understand the solutions of the present application, the following clearly and completely describes the technical solutions in the embodiments of the present application with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In addition, the related information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for presentation, analyzed data, etc.) referred to in this application are information and data authorized by the user or sufficiently authorized by the parties. For example, an interface is provided between the system and the relevant user or institution, before acquiring the relevant information, the system needs to send an acquisition request to the user or institution through the interface, and acquire the relevant information after receiving the consent information fed back by the user or institution.
For better understanding of the embodiments of the present application, the technical terms involved in the embodiments of the present application are explained below:
The English full name of CLIP is Contrastive Language-Image Pre-training, a pre-training method (or model) based on contrastive text-image pairs. CLIP is a multi-modal model based on contrastive learning whose training data consists of text-image pairs: an image and its corresponding text description, from which the CLIP model learns the matching relationship between text and image by contrastive learning. CLIP comprises two models: a Text Encoder and an Image Encoder. The Text Encoder is used to extract text features and can adopt a Transformer model commonly used in NLP; the Image Encoder is used to extract image features and can use, for example, the common ResNet50. The extracted text features and image features are then contrastively learned by CLIP. As the above description shows, the trained CLIP actually consists of two models, with a text model in addition to the visual model, so the CLIP model can directly perform zero-shot image classification, that is, classification on a specific downstream task without any training data. This requires only the following two steps: 1) construct a description text for each category according to the classification labels of the task ("a photo of {label}"), and input these texts into the Text Encoder to obtain the corresponding text features; if there are N categories, N text features are obtained; 2) input the image to be predicted into the Image Encoder to obtain the image feature, compute the scaled cosine similarity with the N text features, and select the category whose text has the highest similarity as the image classification prediction; further, the similarities can be fed as logits into a softmax to obtain the prediction probability of each category.
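For illustration only, the two zero-shot steps above can be sketched with OpenAI's publicly released CLIP package; the model name, label strings, prompt template, and image path below are illustrative assumptions, not part of this application:

```python
import torch
import clip  # OpenAI's reference CLIP implementation (pip install clip)
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet50 image encoder

# Step 1: build one description text per category and encode them.
labels = ["blue license plate", "green license plate", "yellow license plate"]
text_tokens = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

# Step 2: encode the image and compare it with the N text features.
image = preprocess(Image.open("car.jpg")).unsqueeze(0).to(device)  # example path
with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feats = model.encode_text(text_tokens)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_feat @ text_feats.T  # scaled cosine similarity
    probs = logits.softmax(dim=-1)              # per-category prediction probability

print(labels[probs.argmax().item()])
```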
GLIP (Grounded Language-Image Pre-training) is used to learn visual representations that are object-level, language-aware, and semantically rich. GLIP unifies the tasks of object detection and phrase grounding for pre-training, which brings two benefits: 1) it allows GLIP to learn from both detection and grounding data, improving both tasks and bootstrapping a good grounding model; 2) GLIP can generate bounding boxes by self-training to take advantage of massive image-text pairs, making the learned representations semantically rich. In contrast to the CLIP model, GLIP emphasizes phrase grounding, the task of identifying the fine-grained correspondence between phrases in a sentence and objects (or regions) in an image, and therefore learns at the object level.
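The object-level word-region alignment that distinguishes GLIP from CLIP can be illustrated schematically as follows; the tensors are random stand-ins for real region and token embeddings, so this shows only the scoring step, not GLIP's actual implementation:

```python
import torch

M, N, D = 100, 16, 256             # M candidate regions, N prompt tokens, shared dim D (assumed)
region_embeds = torch.randn(M, D)  # object-level features from the image branch
word_embeds = torch.randn(N, D)    # token features from the language branch

# GLIP replaces fixed-class classification logits with word-region alignment
# scores: one dot product per (region, prompt-token) pair.
alignment_logits = region_embeds @ word_embeds.T  # shape (M, N)
match_probs = alignment_logits.sigmoid()          # fine-grained correspondences
```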
Example 1
According to the embodiments of the present application, there is provided an embodiment of a license plate detection method. It should be noted that the steps shown in the flowchart of the drawings may be performed in a computer system such as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in an order different from the one described herein.
The method embodiments provided by the embodiments of the present application may be performed in a mobile terminal, a computer terminal, or a similar computing device. Fig. 1 shows a hardware block diagram of a computer terminal for implementing the license plate detection method. As shown in fig. 1, the computer terminal 10 may include one or more processors 102 (shown as 102a, 102b, ..., 102n; the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), a memory 104 for storing data, and a transmission device 106 for communication functions. In addition, the computer terminal may further include: a display, an input/output interface (I/O interface), a Universal Serial Bus (USB) port (which may be included as one of the ports of the I/O interface), a network interface, a power supply, and/or a camera. It will be appreciated by those of ordinary skill in the art that the configuration shown in fig. 1 is merely illustrative and does not limit the configuration of the electronic device described above. For example, the computer terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
It should be noted that the one or more processors 102 and/or other data processing circuits described above may be referred to generally herein as a "data processing circuit". The data processing circuit may be embodied in whole or in part in software, hardware, firmware, or any other combination thereof. Furthermore, the data processing circuit may be a single stand-alone processing module, or incorporated, in whole or in part, into any of the other elements in the computer terminal 10. As referred to in the embodiments of the present application, the data processing circuit acts as a kind of processor control (for example, the selection of a variable-resistance termination path connected to an interface).
The memory 104 may be used to store software programs and modules of application software, such as the program instructions/data storage devices corresponding to the license plate detection method in the embodiments of the present application; the processor 102 executes the software programs and modules stored in the memory 104, thereby executing various functional applications and data processing, that is, implementing the license plate detection method of the application program. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the computer terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission means 106 is arranged to receive or transmit data via a network. The specific examples of the network described above may include a wireless network provided by a communication provider of the computer terminal 10. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module for communicating with the internet wirelessly.
The display may be, for example, a touch screen type Liquid Crystal Display (LCD) that may enable a user to interact with a user interface of the computer terminal 10.
In the above operating environment, fig. 2 is a schematic flow chart of an alternative license plate detection method according to an embodiment of the present application, as shown in fig. 2, the method at least includes steps S202-S204, where:
Step S202, obtaining a license plate image to be identified.
The license plate image to be identified can be understood as a picture containing the license plate to be identified.
Step S204, detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified.
The target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, where the scene text detection model is a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features, and obtains a text detection result based on the visual features and the text features.
Specifically, the scene text detection model is TCM (Turning a CLIP Model into a scene text detector), which can directly use the prior knowledge in the CLIP model in a scene text detector without a pre-training process, can improve existing scene text detectors, and can improve few-shot training capability. Fig. 3 is a schematic architecture diagram of the TCM in the related art. As shown in fig. 3, the CLIP pre-trained ResNet50 model is used as the Image Encoder, and the global image embedding feature is obtained using the Image Encoder. The CLIP pre-trained Transformer model is taken as the Text Encoder and used to extract text features; the model uses a predefined discrete language prompt such as "Text" and adds learnable prompts to learn strongly transferable text embeddings, thereby promoting zero-shot transfer of the CLIP model. Although predefined prompts and learnable prompts are effective for steering the CLIP model, in open scenarios where the test text instances do not match the training images, they may be limited by few-shot or generalization capability. For this reason, a Language Prompt Generator is added to the TCM to generate a feature vector as a conditional cue, which can be combined with the text encoder input for each image, yielding a new prompt input for the text encoder that is conditioned on the input image. It should be noted that the TCM is described in detail in the related literature, so the detailed frame structure of the model is not repeated in the embodiments of the present application.
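A minimal sketch of such a conditional language-prompt generator is given below; the two-layer bottleneck structure and all dimensions (img_dim, prompt_dim, hidden) are illustrative assumptions, and the actual TCM design may differ:

```python
import torch
import torch.nn as nn

class LanguagePromptGenerator(nn.Module):
    """Maps a global image embedding to a conditional prompt vector.

    One plausible realization of "a feature vector as a conditional cue";
    the output would be combined with the learnable prompts fed into the
    text encoder, conditioning the prompt on the input image.
    """
    def __init__(self, img_dim: int = 1024, prompt_dim: int = 512, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.LayerNorm(img_dim),
            nn.Linear(img_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, prompt_dim),
        )

    def forward(self, global_img_emb: torch.Tensor) -> torch.Tensor:
        # (B, img_dim) -> (B, prompt_dim): one conditional cue per image.
        return self.net(global_img_emb)

cc = LanguagePromptGenerator()(torch.randn(2, 1024))  # shape (2, 512)
```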
As an optional implementation manner, the training process of the target license plate detection network includes the following steps:
step S1: and constructing an initial license plate detection network.
Specifically, as shown in fig. 4, the model architecture of the initial license plate detection network includes: a visual encoder (Image Encoder), a text encoder (Text Encoder), a text embedding (Text Embedding) module, and a region visual embedding (Region Image Embeddings) module within the GLIP model; a text region alignment (Word-Region Alignment) module based on the GLIP model; a visual prompt generator (Vision Prompt), a fine-grained embedding (Fine-grained Embeddings) module, a locality embedding (Locality Embeddings) module, a coarse-grained text region (Coarse Text Region) module, and a text detector (Detection) within the scene text detection model; and a text classifier (Classification) consisting of a multi-layer convolutional neural network.
The text region alignment module is a module, newly added on the basis of the GLIP model, for aligning categories with local features, in order to promote local feature learning and accurate localization.
In this embodiment of the application, the constructed initial license plate detection network performs text feature extraction and visual feature extraction through the visual encoder and text encoder of the GLIP model, and improves recognition of local regions through GLIP-style phrase alignment training between each region/box and the text prompt. At the same time, the zero-shot detection capability of the model is improved by deep cross-modal fusion.
Step S2: multiple sets of sample data are acquired.
Each set of sample data comprises a first license plate image and a first text prompt, where the first license plate image reflects a first license plate bounding box of a first license plate, and the first text prompt reflects a first license plate type of the first license plate.
Optionally, the first text prompt may be "The {Type class} of license plate in the photo", where {Type class} is filled with a license plate type, and the tag of the license plate type may be a license plate color, a license plate font, or a license plate number. For example, in common traffic scenes the license plate patterns of two-wheeled and four-wheeled electric vehicles are varied; as shown in fig. 5, it is easy to see that different types of license plates have different colors, so the license plate color can be used as the license plate type.
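As a small sketch, first text prompts of this form can be generated from a set of type labels as follows; the label list here is an illustrative assumption:

```python
# Illustrative construction of first text prompts from license plate type
# labels; the labels below are examples, not an exhaustive list.
plate_types = ["blue", "green", "yellow", "white"]
prompts = [f"The {t} of license plate in the photo" for t in plate_types]
```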
In particular, in this embodiment of the application, the text prompt is designed according to the license plate type label, which adds license plate type recognition capability at little extra inference cost; if other recognition tasks need to be added, they can likewise be realized by adding a corresponding decoding head, thereby solving the technical problem that related scene text detection methods do not have multi-task capability.
Step S3: and sequentially and iteratively training the initial license plate detection network according to the plurality of groups of sample data to obtain the target license plate detection network.
Specifically, the step S3 may be further refined into the following steps:
step S31: analyzing each group of sample data through the initial license plate detection network to obtain a corresponding predicted license plate bounding box and a predicted license plate type.
In the technical scheme provided in the step S31, the method further includes the following steps:
the first step: for each group of sample data, processing the first license plate image by using the visual encoder to obtain corresponding first visual features, extracting the first visual features by using the region visual embedding module to obtain second visual features, processing the first text prompt by using the text encoder to obtain corresponding first text features, and extracting the first text features by using the text embedding module to obtain corresponding second text features.
The above step can be understood as the extraction of image features and text features. That is, the first license plate image $x \in \mathbb{R}^{H \times W \times 3}$ in the sample data is input into the GLIP pre-trained visual encoder (Image Encoder) to obtain the corresponding first visual feature $I = \mathrm{ImageEncoder}(x)$, where the input image size defaults to 1024×1024, and $I \in \mathbb{R}^{\tilde{H} \times \tilde{W} \times C}$ represents the output first visual feature, with $\tilde{H} = H/s$ and $\tilde{W} = W/s$, C the number of channels of the visual feature, and s the downsampling ratio of the image. Then, feature extraction is performed on the first visual feature through the region visual embedding module (Region Image Embedding) to obtain the second visual feature.
At the same time, the first text prompt Text in the sample data is input into the GLIP pre-trained Text Encoder to obtain the corresponding first text feature $T = \mathrm{TextEncoder}(\text{Text})$, where the text feature dimension D is set to 512 by default. The first text feature is then extracted by the Text Embedding module to obtain the second text feature $t_{out}$.
And a second step of: processing the second visual features and the second text features through the visual prompt generator to obtain a visual prompt, converting the visual prompt by using the fine-grained embedding module to obtain fine-grained visual features, and fusing the second visual features and the fine-grained visual features by using the locality embedding module to obtain local visual features.
The above step can be understood as follows: a visual prompt is generated, and the visual prompt generator adaptively propagates the second text feature $t_{out}$ into the second visual feature, formally modeling the interaction between the visual embedding (Q) and the text embedding (K, V) with the cross-attention mechanism of a Transformer; the learned visual prompt $\tilde{\theta}$ then carries information from the image level to the text instance level, which may be defined as follows:

$\tilde{\theta} = \mathrm{CrossAttention}(Q = I,\; K = t_{out},\; V = t_{out})$
The fine-grained embedding module is then used to convert the visual prompt $\tilde{\theta}$ into the fine-grained visual feature $\tilde{I}$, and finally the locality embedding module fuses the fine-grained visual feature $\tilde{I}$ with the second visual feature $I$ to obtain the text-aware local visual feature $\hat{I}$, which serves, for example, language matching and the downstream detection head and classification head:

$\hat{I} = \mathrm{LocalityEmbedding}(I,\; \tilde{I})$
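A minimal sketch of the cross-attention step above, with the image embedding as query and the text embedding as key and value; the embedding dimension, head count, sequence sizes, and the final additive fusion are illustrative assumptions rather than the exact design:

```python
import torch
import torch.nn as nn

C = 256  # shared embedding dimension (assumed)
cross_attn = nn.MultiheadAttention(embed_dim=C, num_heads=8, batch_first=True)

I = torch.randn(1, 32 * 32, C)  # second visual feature, flattened to (B, HW, C)
t_out = torch.randn(1, 4, C)    # second text features, one per prompt phrase

# Visual prompt: propagate text information into the visual feature map,
# with the visual embedding as Q and the text embedding as K, V.
theta, _ = cross_attn(query=I, key=t_out, value=t_out)

I_hat = I + theta  # one simple (assumed) way to fuse into a text-aware local feature
```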
and a third step of: and matching and aligning the second text feature and the local visual feature by using the coarse-granularity text region module, and obtaining a corresponding text segmentation diagram after dot product and sigmoid operation, wherein the text segmentation diagram comprises a corresponding text instance type label and a text instance boundary box.
The text segmentation map in the above step is obtained according to the matching relationship between the text instances and the language, and its expression can be written as:

$P = \mathrm{sigmoid}\left(\frac{\hat{I}\, t_{out}^{\top}}{\tau}\right)$

where τ represents the temperature coefficient.
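Continuing the shapes of the previous snippet, the dot product and sigmoid that produce the text segmentation map P can be sketched as below; the temperature value 0.07 is an assumed constant:

```python
import torch

Ht, Wt, C = 32, 32, 256
I_hat = torch.randn(Ht * Wt, C)  # text-aware local visual feature, flattened
t_out = torch.randn(1, C)        # text embedding for the "text" prompt
tau = 0.07                       # temperature coefficient (assumed value)

# P = sigmoid(I_hat @ t_out^T / tau): per-pixel probability of a text instance.
P = torch.sigmoid(I_hat @ t_out.T / tau).reshape(Ht, Wt)
```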
Fourth step: fusing the text segmentation map with the local visual features, and outputting, by using the text detector and the text classifier respectively, a predicted license plate bounding box and a predicted license plate type of the first license plate in the first license plate image.
The above step can be understood as follows: in the inference stage, the outputs of the corresponding task heads are used as the final outputs of the model, namely the predicted license plate bounding box output by the text detection head and the predicted license plate type output by the text classification head.
Step S32: a target loss function is constructed based on the first license plate bounding box and the first license plate type, and the predicted license plate bounding box and the predicted license plate type, within each set of sample data.
Specifically, the target loss function comprises four parts, namely a detection loss function, a classification loss function, an alignment loss function, and an auxiliary loss function, where:

a detection loss function $L_{det}$ is constructed based on the first license plate bounding box and the predicted license plate bounding box in each set of sample data;

a classification loss function $L_{cls}$ is constructed based on the first license plate type and the predicted license plate type in each set of sample data;

an alignment loss function $L_{alignment}$ is constructed based on the second text feature and the local visual feature extracted from each set of sample data;

an auxiliary loss function $L_{aux}$ is constructed based on the text instance types and text instance bounding boxes in the text segmentation map, together with the predicted license plate bounding box and the predicted license plate type. The auxiliary loss uses the labels of the text regions as supervision for the text segmentation map in the TCM, and the auxiliary loss so defined can be written as:

$L_{aux} = -\frac{1}{\tilde{H}\tilde{W}} \sum_{i,j} \left[ y_{ij} \log P_{ij} + (1 - y_{ij}) \log (1 - P_{ij}) \right]$

where $y_{ij}$ and $P_{ij}$ respectively denote the label and the predicted probability that the pixel at position (i, j) belongs to a text instance. The text segmentation map P is fused with the local visual feature $\hat{I}$, and the fused output is fed to the downstream detection head and the end-to-end text recognition head, so that the language prior is explicitly incorporated into the text detection and recognition tasks.
Thus, the target loss function of the embodiment of the present application can be written as:

$L_{total} = L_{det} + L_{cls} + L_{alignment} + L_{aux}$
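Under the assumption that the four terms take standard forms (L1 box regression, cross-entropy classification, cosine-embedding alignment, and the binary cross-entropy auxiliary loss above), the combination can be sketched as follows; the concrete loss choices are placeholders, since the application does not fix them:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_boxes, gt_boxes, pred_type_logits, gt_types,
               text_feats, local_feats, seg_map, seg_labels):
    # L_det: box regression (placeholder: L1 between predicted and GT boxes).
    l_det = F.l1_loss(pred_boxes, gt_boxes)
    # L_cls: license plate type classification.
    l_cls = F.cross_entropy(pred_type_logits, gt_types)
    # L_alignment: pull matched text/visual features together (placeholder:
    # cosine-embedding loss with all-positive targets).
    target = torch.ones(text_feats.size(0))
    l_align = F.cosine_embedding_loss(text_feats, local_feats, target)
    # L_aux: BCE between the text segmentation map (values in [0, 1]) and
    # the text-region labels.
    l_aux = F.binary_cross_entropy(seg_map, seg_labels)
    return l_det + l_cls + l_align + l_aux
```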
step S33: and adjusting network parameters of the initial license plate detection network according to the target loss function to obtain the target license plate detection network.
In the above embodiment, the multi-modal target license plate detection model provided in the embodiment of the present application has the following features: features are extracted from the input text prompt and picture by the GLIP pre-trained image encoder and text encoder; the zero-shot detection capability of the model is improved through deep cross-modal fusion; recognition of local regions is improved by training each region/box to align with phrases in the text prompt; and, to give the model multi-task capability, corresponding text prompts are determined according to license plate types, and a license plate type classification head is added at the decoding end to complete license plate type recognition.
For example, fig. 6 is an inference scene diagram of an alternative target license plate detection model according to an embodiment of the present application. As shown in fig. 6, the license plate bounding boxes of all vehicles in a picture and the license plate type recognition result for each box can be obtained through the multi-modal target license plate detection model of the present application.
Based on the solutions defined in steps S202 to S204, it can be seen that in this embodiment, text prompts are designed according to license plate type labels during model training, which adds license plate type recognition capability at little extra inference cost; if other recognition tasks are to be added, they can likewise be realized by adding a corresponding decoding head, solving the technical problem that related scene text detection methods do not have multi-task capability. Meanwhile, a local-feature and text-feature alignment module is designed with the GLIP model, which strengthens the learning of local features and improves the local detection and recognition capability of the model, solving the technical problem that local instance features are difficult to distinguish in related scene text detection. In addition, deep cross-modal fusion is adopted in this embodiment to improve the model, solving the technical problem that related license plate detection methods have no zero-shot detection capability.
Example 2
Based on embodiment 1 of the present application, an embodiment of a license plate detection device is also provided, and the device executes the license plate detection method of the embodiment when running. Fig. 7 is a schematic structural diagram of an optional license plate detection device according to an embodiment of the present application, as shown in fig. 7, where the license plate detection device at least includes an acquisition module 71 and a detection module 73, where:
an acquisition module 71, configured to acquire a license plate image to be identified;
the detection module 73 is configured to detect and analyze the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, wherein the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features.
Step S1: and constructing an initial license plate detection network.
Specifically, as shown in fig. 4, the model architecture of the initial license plate detection network includes: a visual encoder (Image Encoder), a text encoder (Text Encoder), a text embedding (Text Embedding) module, and a region visual embedding (Region Image Embeddings) module within the GLIP model; a text region alignment (Word-Region Alignment) module based on the GLIP model; a visual prompt generator (Vision Prompt), a fine-grained embedding (Fine-grained Embeddings) module, a locality embedding (Locality Embeddings) module, a coarse-grained text region (Coarse Text Region) module, and a text detector (Detection) within the scene text detection model; and a text classifier (Classification) consisting of a multi-layer convolutional neural network.
The text region alignment module is a module, newly added on the basis of the GLIP model, for aligning categories with local features, in order to promote local feature learning and accurate localization.
In this embodiment of the application, the constructed initial license plate detection network performs text feature extraction and visual feature extraction through the visual encoder and text encoder of the GLIP model, and improves recognition of local regions through GLIP-style phrase alignment training between each region/box and the text prompt. At the same time, the zero-shot detection capability of the model is improved by deep cross-modal fusion.
Step S2: multiple sets of sample data are acquired.
Each set of sample data comprises a first license plate image and a first text prompt, where the first license plate image reflects a first license plate bounding box of a first license plate, and the first text prompt reflects a first license plate type of the first license plate.
Optionally, the first text prompt may be "The {Type class} of license plate in the photo", where {Type class} is filled with a license plate type, and the tag of the license plate type may be a license plate color, a license plate font, or a license plate number.
Step S3: and sequentially and iteratively training the initial license plate detection network according to the plurality of groups of sample data to obtain the target license plate detection network.
Specifically, the step S3 may be further refined into the following steps:
step S31: analyzing each group of sample data through the initial license plate detection network to obtain a corresponding predicted license plate bounding box and a predicted license plate type.
In the technical scheme provided in the step S31, the method further includes the following steps:
the first step: for each group of sample data, processing the first license plate image by using the visual encoder to obtain corresponding first visual features, extracting the first visual features by using the region visual embedding module to obtain second visual features, processing the first text prompt by using the text encoder to obtain corresponding first text features, and extracting the first text features by using the text embedding module to obtain corresponding second text features;
and a second step of: processing the second visual features and the second text features through the visual prompt generator to obtain a visual prompt, converting the visual prompt by using the fine-grained embedding module to obtain fine-grained visual features, and fusing the second visual features and the fine-grained visual features by using the locality embedding module to obtain local visual features;
And a third step of: performing text instance-language matching and alignment on the second text feature and the local visual feature by using the coarse-grained text region module, and obtaining a corresponding text segmentation map after dot product and sigmoid operations, wherein the text segmentation map comprises corresponding text instance type labels and text instance bounding boxes;
fourth step: fusing the text segmentation map with the local visual features, and outputting, by using the text detector and the text classifier respectively, a predicted license plate bounding box and a predicted license plate type of the first license plate in the first license plate image.
Step S32: a target loss function is constructed based on the first license plate bounding box and the first license plate type, and the predicted license plate bounding box and the predicted license plate type, within each set of sample data.
Specifically, the target loss function comprises four parts, namely a detection loss function, a classification loss function, an alignment loss function, and an auxiliary loss function, where:

a detection loss function $L_{det}$ is constructed based on the first license plate bounding box and the predicted license plate bounding box in each set of sample data;

a classification loss function $L_{cls}$ is constructed based on the first license plate type and the predicted license plate type in each set of sample data;

an alignment loss function $L_{alignment}$ is constructed based on the second text feature and the local visual feature extracted from each set of sample data;

an auxiliary loss function $L_{aux}$ is constructed based on the text instance types and text instance bounding boxes in the text segmentation map, together with the predicted license plate bounding box and the predicted license plate type.

Thus, the target loss function of the embodiment of the present application can be written as:

$L_{total} = L_{det} + L_{cls} + L_{alignment} + L_{aux}$
step S33: and adjusting network parameters of the initial license plate detection network according to the target loss function to obtain the target license plate detection network.
Note that each module in the license plate detection device may be a program module (for example, a set of program instructions for realizing a specific function), or may be a hardware module, and for the latter, it may be represented by the following form, but is not limited thereto: the expression forms of the modules are all a processor, or the functions of the modules are realized by one processor.
Example 3
According to an embodiment of the present application, there is also provided a non-volatile storage medium having a program stored therein, where when the program runs, a device on which the non-volatile storage medium is located is controlled to execute the license plate detection method in embodiment 1.
Optionally, the device where the non-volatile storage medium is located performs the following steps by running the program: acquiring a license plate image to be identified; and detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, where the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features.
According to an embodiment of the present application, there is further provided a processor, where the processor is configured to run a program, and when the program runs, the license plate detection method in embodiment 1 is executed.
Optionally, the program, when executed, implements the following steps: acquiring a license plate image to be identified; and detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, where the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features.
There is also provided, in accordance with an embodiment of the present application, an electronic device, including: one or more processors; and a memory for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to run a program, the program being configured to perform the license plate detection method in embodiment 1 when run.
Optionally, the processor is configured to implement the following steps through execution of the computer program: acquiring a license plate image to be identified; and detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, where the target license plate detection network is obtained by combining a scene text detection model with a language-image pre-trained GLIP model, the scene text detection model being a model that uses a contrastive language-image pre-trained CLIP model to encode the input image and the input text separately into visual features and text features and obtains a text detection result based on the visual features and the text features.
The foregoing embodiment numbers of the present application are merely for describing, and do not represent advantages or disadvantages of the embodiments.
In the foregoing embodiments of the present application, the descriptions of the embodiments are emphasized, and for a portion of this disclosure that is not described in detail in this embodiment, reference is made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed technology content may be implemented in other manners. The above-described embodiments of the apparatus are merely exemplary, and the division of units may be a logic function division, and there may be another division manner in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interfaces, units or modules, or may be in electrical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be essentially or a part contributing to the related art or all or part of the technical solution may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those of ordinary skill in the art may make several improvements and modifications without departing from the principles of the present application, and such improvements and modifications shall also fall within the protection scope of the present application.

Claims (10)

1. A license plate detection method, comprising:
acquiring a license plate image to be identified;
and detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, wherein the target license plate detection network is obtained by combining a scene text detection model with a pre-trained Grounded Language-Image Pre-training (GLIP) model, and the scene text detection model is a model that encodes an input image and an input text respectively by using a pre-trained Contrastive Language-Image Pre-training (CLIP) model to obtain visual features and text features, and obtains a text detection result based on the visual features and the text features.
2. The method of claim 1, wherein the training process of the target license plate detection network comprises:
constructing an initial license plate detection network;
obtaining a plurality of groups of sample data, wherein each group of sample data comprises a first license plate image and a first text prompt, the first license plate image reflects a first license plate bounding box of a first license plate, and the first text prompt reflects a first license plate type of the first license plate;
and sequentially and iteratively training the initial license plate detection network according to the plurality of groups of sample data to obtain the target license plate detection network.
3. The method according to claim 2, wherein the initial license plate detection network comprises: a visual encoder, a text encoder, a text embedding module, and a region visual embedding module from the GLIP model; a text region alignment module newly added on the basis of the GLIP model; a visual cue generator, a fine-granularity embedding module, a local embedding module, a coarse-granularity text region module, and a text detector from the scene text detection model; and
a text classifier consisting of a multi-layer convolutional neural network.
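As a purely structural illustration, the composition in claim 3 could be sketched as follows. Every placeholder below (the nn.Identity modules and the small CNN) is an assumption standing in for the real GLIP- and CLIP-derived components, which the patent does not spell out at code level.

```python
import torch.nn as nn

class InitialPlateDetectionNetwork(nn.Module):
    """Placeholder composition mirroring the module list in claim 3."""
    def __init__(self, num_plate_types: int = 4):
        super().__init__()
        # reused from the GLIP model
        self.visual_encoder = nn.Identity()
        self.text_encoder = nn.Identity()
        self.text_embedding = nn.Identity()
        self.region_visual_embedding = nn.Identity()
        # newly added on the basis of the GLIP model
        self.text_region_alignment = nn.Identity()
        # reused from the scene text detection model
        self.visual_cue_generator = nn.Identity()
        self.fine_granularity_embedding = nn.Identity()
        self.local_embedding = nn.Identity()
        self.coarse_granularity_text_region = nn.Identity()
        self.text_detector = nn.Identity()
        # text classifier: a small multi-layer convolutional neural network
        self.text_classifier = nn.Sequential(
            nn.Conv2d(64, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, num_plate_types),
        )
```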
4. The method of claim 2, wherein iteratively training the initial license plate detection network according to the plurality of groups of sample data to obtain the target license plate detection network comprises:
analyzing each group of the sample data through the initial license plate detection network to obtain a corresponding predicted license plate bounding box and a corresponding predicted license plate type;
constructing a target loss function based on the first license plate bounding box, the first license plate type, the predicted license plate bounding box, and the predicted license plate type within each group of the sample data;
and adjusting network parameters of the initial license plate detection network according to the target loss function to obtain the target license plate detection network.
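A minimal sketch of the training procedure of claims 2 and 4 follows, assuming a toy stand-in network and commonly used loss terms (smooth L1 for boxes, cross-entropy for types). The real network and target loss function are the ones defined in claims 3, 5, and 6; everything concrete here is an illustrative assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# toy stand-in for the initial license plate detection network
backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 32), nn.ReLU())
box_head, type_head = nn.Linear(32, 4), nn.Linear(32, 3)
params = (list(backbone.parameters()) + list(box_head.parameters())
          + list(type_head.parameters()))
optimizer = torch.optim.AdamW(params, lr=1e-4)

# each group of sample data: (first license plate image, first text prompt,
# first license plate bounding box, first license plate type)
sample_groups = [
    (torch.rand(1, 3, 64, 64), "blue plate", torch.rand(1, 4), torch.tensor([0])),
    (torch.rand(1, 3, 64, 64), "yellow plate", torch.rand(1, 4), torch.tensor([1])),
]

for epoch in range(3):  # sequential, iterative training over the groups
    for image, prompt, gt_box, gt_type in sample_groups:
        feats = backbone(image)  # analyze the group (the prompt would condition the real network; unused here)
        pred_box, pred_type = box_head(feats), type_head(feats)
        # toy target loss function: detection term + classification term
        loss = F.smooth_l1_loss(pred_box, gt_box) + F.cross_entropy(pred_type, gt_type)
        optimizer.zero_grad()    # adjust network parameters according to the loss
        loss.backward()
        optimizer.step()
```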
5. The method of claim 3, wherein analyzing each group of the sample data through the initial license plate detection network to obtain a corresponding predicted license plate bounding box and predicted license plate type comprises:
for each group of the sample data, processing the first license plate image by using the visual encoder to obtain corresponding first visual features, performing feature extraction on the first visual features by using the region visual embedding module to obtain second visual features, processing the first text prompt by using the text encoder to obtain corresponding first text features, and performing feature extraction on the first text features by using the text embedding module to obtain corresponding second text features;
processing the second visual features and the second text features through the visual cue generator to obtain a visual cue, converting the visual cue by using the fine-granularity embedding module to obtain fine-granularity visual features, and fusing the second visual features and the fine-granularity visual features by using the local embedding module to obtain local visual features;
matching and aligning the second text features and the local visual features by using the coarse-granularity text region module, and obtaining a corresponding text segmentation map after a dot product and a sigmoid operation, wherein the text segmentation map comprises a corresponding text instance type label and a text instance bounding box;
and fusing the text segmentation map with the local visual features, and outputting, by using the text detector and the text classifier respectively, the predicted license plate bounding box and the predicted license plate type of the first license plate in the first license plate image.
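The claim-5 data flow can be traced with plain tensors as below. The elementwise operations chosen for the cue generation and fusion steps are assumptions made so the sketch runs, since the claim fixes only which module produces which feature; the dot product followed by sigmoid for the segmentation map, however, comes from the claim itself.

```python
import torch
import torch.nn as nn

B, D = 1, 64                                  # batch size, feature width (assumed)
second_visual = torch.rand(B, D)              # output of the region visual embedding module
second_text = torch.rand(B, D)                # output of the text embedding module

visual_cue = second_visual * second_text      # visual cue generator (toy fusion)
fine_grained = torch.tanh(visual_cue)         # fine-granularity embedding module
local_visual = second_visual + fine_grained   # local embedding module (fusion)

# coarse-granularity text region module: match/align, then dot product + sigmoid
segmentation = torch.sigmoid((second_text * local_visual).sum(dim=-1, keepdim=True))

fused = local_visual * segmentation           # fuse segmentation map with local features
pred_box = nn.Linear(D, 4)(fused)             # untrained text detector head -> predicted bounding box
pred_type = nn.Linear(D, 4)(fused)            # untrained text classifier head -> predicted plate type
```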
6. The method of claim 5, wherein constructing the target loss function based on the first license plate bounding box, the first license plate type, the predicted license plate bounding box, and the predicted license plate type within each group of the sample data comprises:
constructing a detection loss function based on the first license plate bounding box and the predicted license plate bounding box within each group of the sample data;
constructing a classification loss function based on the first license plate type and the predicted license plate type within each group of the sample data;
constructing an alignment loss function based on the second text features and the local visual features extracted from each group of the sample data;
constructing an auxiliary loss function based on the text instance type label and the text instance bounding box in the text segmentation map, the predicted license plate bounding box, and the predicted license plate type;
wherein the target loss function is composed of the detection loss function, the classification loss function, the alignment loss function, and the auxiliary loss function.
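A sketch of the four-part target loss of claim 6 is given below. The concrete loss choices (smooth L1, cross-entropy, cosine alignment) are illustrative assumptions; the claim fixes only the four components and their combination.

```python
import torch
import torch.nn.functional as F

def target_loss(pred_box, gt_box, type_logits, gt_type,
                second_text, local_visual, aux_box, aux_type_logits):
    detection = F.smooth_l1_loss(pred_box, gt_box)            # detection loss
    classification = F.cross_entropy(type_logits, gt_type)    # classification loss
    # alignment between the second text features and the local visual features
    alignment = 1.0 - F.cosine_similarity(second_text, local_visual, dim=-1).mean()
    # auxiliary term from the text segmentation map's instance labels and boxes
    auxiliary = F.smooth_l1_loss(aux_box, gt_box) + F.cross_entropy(aux_type_logits, gt_type)
    return detection + classification + alignment + auxiliary
```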
7. The method of claim 1, wherein a label of the target license plate type comprises at least one of: a license plate color, a license plate font, and a license plate number.
8. A license plate detection device, comprising:
the acquisition module is used for acquiring a license plate image to be identified;
the detection module is used for detecting and analyzing the license plate image to be identified by adopting a pre-trained target license plate detection network to obtain a target license plate type and a target license plate bounding box corresponding to the license plate image to be identified, wherein the target license plate detection network is obtained by combining a scene text detection model with a pre-trained Grounded Language-Image Pre-training (GLIP) model, and the scene text detection model is a model that encodes an input image and an input text respectively by using a pre-trained Contrastive Language-Image Pre-training (CLIP) model to obtain visual features and text features, and obtains a text detection result based on the visual features and the text features.
9. A non-volatile storage medium, wherein a computer program is stored in the non-volatile storage medium, and a device in which the non-volatile storage medium is located executes the license plate detection method according to any one of claims 1 to 7 by running the computer program.
10. An electronic device, comprising: a memory and a processor, wherein the processor is configured to run a program stored in the memory, and the program, when run, performs the license plate detection method according to any one of claims 1 to 7.
CN202311706833.5A 2023-12-12 2023-12-12 License plate detection method and device, storage medium and electronic equipment Pending CN117636326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311706833.5A CN117636326A (en) 2023-12-12 2023-12-12 License plate detection method and device, storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311706833.5A CN117636326A (en) 2023-12-12 2023-12-12 License plate detection method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN117636326A 2024-03-01

Family

ID=90019837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311706833.5A Pending CN117636326A (en) 2023-12-12 2023-12-12 License plate detection method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN117636326A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097686A * 2024-04-25 2024-05-28 Alipay (Hangzhou) Information Technology Co., Ltd. Multi-modal multi-task medical large model training method and device

Similar Documents

Publication Publication Date Title
Wang et al. From two to one: A new scene text recognizer with visual language modeling network
CN109949317B (en) Semi-supervised image example segmentation method based on gradual confrontation learning
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN110750959A (en) Text information processing method, model training method and related device
Rupprecht et al. Guide me: Interacting with deep networks
Da et al. Levenshtein ocr
CN114170468B (en) Text recognition method, storage medium and computer terminal
CN113360621A (en) Scene text visual question-answering method based on modal inference graph neural network
CN117636326A (en) License plate detection method and device, storage medium and electronic equipment
CN116229056A (en) Semantic segmentation method, device and equipment based on double-branch feature fusion
EP4302234A1 (en) Cross-modal processing for vision and language
CN110490189A (en) A kind of detection method of the conspicuousness object based on two-way news link convolutional network
CN116434023A (en) Emotion recognition method, system and equipment based on multi-mode cross attention network
CN115909374B (en) Information identification method, device, equipment, storage medium and program product
Yuan et al. Rrsis: Referring remote sensing image segmentation
CN117012370A (en) Multi-mode disease auxiliary reasoning system, method, terminal and storage medium
Zheng et al. Cmfn: Cross-modal fusion network for irregular scene text recognition
CN115982652A (en) Cross-modal emotion analysis method based on attention network
CN113362088A (en) CRNN-based telecommunication industry intelligent customer service image identification method and system
CN116612466B (en) Content identification method, device, equipment and medium based on artificial intelligence
CN117746441B (en) Visual language understanding method, device, equipment and readable storage medium
CN110472728B (en) Target information determining method, target information determining device, medium and electronic equipment
Patel et al. Connectionist Temporal Classification Model for Dynamic Hand Gesture Recognition using RGB and Optical flow Data.
CN114281938A (en) Relationship extraction method, device, equipment and storage medium
Robinet Minimizing Supervision for Vision-Based Perception and Control in Autonomous Driving

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination