CN115035455A - Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation - Google Patents

Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation

Info

Publication number
CN115035455A
CN115035455A CN202210707517.9A CN202210707517A CN115035455A CN 115035455 A CN115035455 A CN 115035455A CN 202210707517 A CN202210707517 A CN 202210707517A CN 115035455 A CN115035455 A CN 115035455A
Authority
CN
China
Prior art keywords
video
features
text
category
visual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210707517.9A
Other languages
Chinese (zh)
Inventor
佘清顺
黄海烽
赵洲
陈哲乾
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
Original Assignee
Hangzhou Yizhi Intelligent Technology Co ltd
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Yizhi Intelligent Technology Co ltd, Zhejiang University ZJU filed Critical Hangzhou Yizhi Intelligent Technology Co ltd
Priority to CN202210707517.9A priority Critical patent/CN115035455A/en
Priority to LU502690A priority patent/LU502690B1/en
Publication of CN115035455A publication Critical patent/CN115035455A/en
Pending legal-status Critical Current

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation, belonging to the field of computer vision. Videos of different categories and their corresponding query texts are acquired, and visual features and text features are extracted; semantic information calibration is performed on the visual features and text features of the target-category videos through a cross-modal feature calibrator; the visual features of the target-category videos are randomly masked and reconstructed through a video feature reconstructor; the video features and text features are fused through a cross-modal feature fuser; single-modal domain-invariant feature representation learning is performed on the video features and text features, and cross-modal domain-invariant feature representation learning is performed on the initial fusion features, through a domain discriminator; and the final fusion features of the source-category videos are predicted through a biaffine predictor. The method realizes temporal localization for cross-category videos and improves the generalization ability of the model to unknown target videos.

Description

Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation
Technical Field
The invention relates to the field of computer vision, and in particular to a cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation.
Background
The video temporal localization task aims to determine, in an untrimmed video, the temporal boundary of the video segment corresponding to a query text. Traditional methods include fully supervised and weakly supervised learning methods. Fully supervised learning takes a long time in the training stage and requires a large amount of manually labeled data, which is time-consuming and labor-intensive. Weakly supervised learning does not require a large amount of manually labeled data, but due to the lack of sufficient labels its performance differs greatly from that of fully supervised models. Moreover, both kinds of methods are developed under the premise that the training data and the test data are identically distributed, without considering the domain shift between different categories of scenes in the real world. Models trained with these two methods therefore generalize poorly to data of unknown categories and cannot well meet the requirements of real-world scenarios.
Disclosure of Invention
In view of the above problems, the invention provides a cross-category video temporal localization method based on adversarial multi-modal domain adaptation, so as to improve the generalization ability of the model when dealing with unknown target data.
To this end, the technical solution adopted by the invention is as follows:
In a first aspect, the invention provides a cross-category video temporal localization method based on adversarial multi-modal domain adaptation, comprising the following steps:
S1: acquiring source-category videos, target-category videos and the query text corresponding to each video, extracting the initial visual features of the videos and the initial text features of the query texts, and encoding the initial visual features and the initial text features as the final visual features and text features;
S2: performing semantic information calibration on the visual features and text features of the target-category videos obtained in step S1 through a cross-modal feature calibrator;
S3: randomly masking the visual features of the target-category videos obtained in step S1 through a video feature reconstructor and performing visual feature reconstruction to obtain reconstructed visual features;
S4: fusing the video features and text features obtained in step S1 through a cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
S5: performing single-modal domain-invariant feature representation learning on the video features and text features obtained in step S1 and cross-modal domain-invariant feature representation learning on the initial fusion features obtained in step S4 through a domain discriminator;
S6: predicting the final fusion features of the source-category videos obtained in step S4 through a biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
In a second aspect, the invention provides a cross-category video temporal localization system based on adversarial multi-modal domain adaptation, which is used to implement the above cross-category video temporal localization method based on adversarial multi-modal domain adaptation.
In a third aspect, the invention provides a computer-readable storage medium, on which a program is stored, wherein the program, when executed by a processor, implements the above cross-category video temporal localization method based on adversarial multi-modal domain adaptation.
Compared with the prior art, the invention has the advantages that:
the invention provides a basic training module which is mainly used for supervised learning of source type data with labels; a confrontation field identification module is provided for learning the field invariance characteristics; a cross-module feature calibration module is provided for reducing semantic gaps between different modal features in the target category data; a video reconstruction module is provided for learning temporal semantic relationships and discriminable feature expressions. The invention firstly provides and realizes the video time positioning task aiming at cross-category video data, and the experimental result shows that the model provided by the invention has good generalization capability.
Drawings
FIG. 1 is a block diagram illustrating the overall architecture of a cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to an exemplary embodiment;
FIG. 2 is a schematic diagram illustrating an adversarial domain discrimination module according to an exemplary embodiment;
FIG. 3 is a schematic diagram illustrating a cross-modal feature calibration module according to an exemplary embodiment;
FIG. 4 is a schematic diagram illustrating a video feature reconstruction module according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a terminal of a device with data processing capability according to an exemplary embodiment.
Detailed Description
The invention is further illustrated with reference to the following figures and examples. The figures are only schematic illustrations of the invention, and some of the block diagrams shown in the figures are functional entities, which do not necessarily have to correspond to physically or logically separate entities, and which may be implemented in software, or in one or more hardware modules or integrated circuits, or in different network and/or processor systems and/or microcontroller systems.
As shown in fig. 1, the cross-category video temporal localization method based on adversarial multi-modal domain adaptation proposed by the present invention comprises the following steps:
S1: First, the semantic information of the source-category videos and the target-category videos is extracted through a 3-dimensional convolutional neural network to obtain the initial visual features V̂^s and V̂^t, which are then encoded through a visual encoder to obtain the encoded visual features V^s and V^t; then, the semantic information of the source-category and target-category query texts is extracted through a GloVe model to obtain the initial text features Q̂^s and Q̂^t, which are encoded through a text encoder to obtain the encoded text features Q^s and Q^t.
The invention realizes the temporal localization task for cross-category videos for the first time, where the input data comprise labeled source-category data and unlabeled target-category data, so as to improve the generalization ability of the model to videos of unknown categories. The input data are defined as follows:
{V^s, Q^s, T^s} = {(V_i^s, Q_i^s, τ_i^s)}_{i=1}^{B}
{V^t, Q^t, T^t} = {(V_i^t, Q_i^t)}_{i=1}^{B}, T^t = ∅
where V^s is the set of source-category videos, Q^s is the set of query texts of the source-category videos, and T^s is the label set of the source-category data; (V_i^s, Q_i^s) denotes the ith source video in the source-category videos and its corresponding query text, and τ_i^s denotes the ground-truth temporal boundary matching the ith query text in the source-category videos to the ith source video; V^t is the set of target-category videos, Q^t is the set of query texts of the target-category videos, and T^t is the label set of the target-category data, which is empty in this embodiment; (V_i^t, Q_i^t) denotes the ith target video in the target-category videos and its corresponding query text; B denotes the batch size, i.e., the number of source or target videos and their corresponding query texts input to the model each time.
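As an illustration only, a labeled source batch and an unlabeled target batch of this form might be organized as in the following sketch; the field names and tensor shapes are assumptions made for exposition and are not part of the invention:

```python
# Illustrative sketch only: field names and shapes are assumptions for exposition.
from dataclasses import dataclass
from typing import Optional
import torch

@dataclass
class Batch:
    videos: torch.Tensor        # (B, n, D_v) raw clip features, n frames per video
    queries: torch.Tensor       # (B, m, D_q) token embeddings, m words per query
    boundaries: Optional[torch.Tensor] = None  # (B, 2) ground-truth (fs, fe); None for target data

# A labeled source batch and an unlabeled target batch (T^t is empty).
B, n, m, D_v, D_q = 64, 128, 20, 1024, 300
source = Batch(torch.randn(B, n, D_v), torch.randn(B, m, D_q), torch.randint(0, n, (B, 2)))
target = Batch(torch.randn(B, n, D_v), torch.randn(B, m, D_q), boundaries=None)
```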
In this step, the same visual feature extractor and the same visual encoder are used to obtain and encode the visual features of the videos of the different categories, and the same text feature extractor and the same text encoder are used to obtain and encode the text features of the query texts corresponding to the videos of the different categories.
Specifically, step S1 is implemented as follows:
S1-1: The semantic information of the input source-category videos V^s and target-category videos V^t is extracted through a 3-dimensional convolutional neural network to obtain the initial visual features V̂^s and V̂^t; the semantic information of the query texts Q^s of the source-category videos and the query texts Q^t of the target-category videos is extracted through a GloVe model to obtain the initial text features Q̂^s and Q̂^t.
S1-2: The initial visual features V̂^s, V̂^t and the initial text features Q̂^s, Q̂^t obtained in step S1-1 are respectively projected to the same hidden dimension through a convolutional layer and a linear projection layer, and encoded through a multi-head attention layer; the encoded visual features and text features are taken as the final features for subsequent computation. The encoded visual features and text features are expressed as:
V^s = {V_i^s}_{i=1}^{B} ∈ R^{B×n×d}, V_i^s = {V_{i,j}^s}_{j=1}^{n}
V^t = {V_i^t}_{i=1}^{B} ∈ R^{B×n×d}, V_i^t = {V_{i,j}^t}_{j=1}^{n}
Q^s = {Q_i^s}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^s = {Q_{i,j}^s}_{j=1}^{m}
Q^t = {Q_i^t}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^t = {Q_{i,j}^t}_{j=1}^{m}
where n is the number of frames in each video, m is the query text length, and d is the hidden dimension; V^s are the visual features of the source-category videos, V_i^s the visual features of the ith source-category video, and V_{i,j}^s the visual features of the jth frame of the ith source-category video; V^t, V_i^t and V_{i,j}^t are defined analogously for the target-category videos; Q^s are the text features of the query texts of the source-category videos, Q_i^s the text features of the query text of the ith source-category video, and Q_{i,j}^s the text features of the jth word in the query text of the ith source-category video; Q^t, Q_i^t and Q_{i,j}^t are defined analogously for the target-category videos.
In this embodiment, the convolutional layer and linear projection layer used to encode the initial visual features V̂^s, V̂^t are denoted as the visual encoder, and the convolutional layer and linear projection layer used to encode the initial text features Q̂^s, Q̂^t are denoted as the text encoder.
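A minimal sketch of such a visual/text encoder (projection to the hidden dimension d followed by a multi-head attention layer) could look as follows; the layer sizes, kernel size and module names are assumptions, not a reference implementation:

```python
# Sketch under assumptions: layer sizes and ordering follow the description loosely.
import torch
import torch.nn as nn

class FeatureEncoder(nn.Module):
    """Projects initial features to hidden dim d and encodes them with multi-head attention."""
    def __init__(self, in_dim: int, d: int = 256, heads: int = 8):
        super().__init__()
        self.conv = nn.Conv1d(in_dim, d, kernel_size=3, padding=1)  # convolutional layer
        self.proj = nn.Linear(d, d)                                 # linear projection layer
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:             # x: (B, L, in_dim)
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)             # (B, L, d)
        h = self.proj(h)
        out, _ = self.attn(h, h, h)                                  # self-attention encoding
        return out                                                   # (B, L, d)

visual_encoder = FeatureEncoder(in_dim=1024, d=256)   # encoder for the 3D-CNN visual features
text_encoder = FeatureEncoder(in_dim=300, d=256)      # encoder for the GloVe text features
```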
S2: the visual characteristics obtained in the step S1 are calibrated by the cross-modal characteristic calibrator
Figure BDA00037059490500000424
And text features
Figure BDA00037059490500000425
Semantic information alignment is performed as shown in fig. 3.
In this step, the cross-modal feature calibrator applies a loss function
Figure BDA00037059490500000426
The visual features and the text features in the positive sample are made to be closer in the semantic space, and the similarity of the visual features and the text features in the negative sample in the semantic space is further reduced.
Loss function
Figure BDA0003705949050000051
The definition is as follows:
Figure BDA0003705949050000052
wherein, l (,) is used for calculating cosine similarity between vectors;
Figure BDA0003705949050000053
is the visual characteristics obtained according to step S1
Figure BDA0003705949050000054
And text features
Figure BDA0003705949050000055
Positive samples taken along the time axis average;
Figure BDA0003705949050000056
and
Figure BDA0003705949050000057
is directed to
Figure BDA0003705949050000058
And
Figure BDA0003705949050000059
selected corresponding negative examples, Z V Is directed to
Figure BDA00037059490500000510
Set of negative examples of (2), Z Q Is directed to
Figure BDA00037059490500000511
A set of negative examples of; Δ is a boundary.
The visual features and the text features of the target category after being encoded in the defining step S1 are respectively:
Figure BDA00037059490500000512
computing
Figure BDA00037059490500000513
And
Figure BDA00037059490500000514
average along the time axis, resulting in positive samples:
Figure BDA00037059490500000515
definition of
Figure BDA00037059490500000516
Negative sample of
Figure BDA00037059490500000517
Figure BDA00037059490500000518
Negative sample of
Figure BDA00037059490500000519
Figure BDA00037059490500000520
Loss function across modal feature calibrators during training
Figure BDA00037059490500000521
The calculation process is as follows:
Figure BDA00037059490500000522
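One possible way to realize this margin-based calibration objective is sketched below; treating the other samples in the batch as the negative sets Z_V and Z_Q is an assumption made for illustration:

```python
# Sketch: cosine-similarity triplet calibration with in-batch negatives (an assumption).
import torch
import torch.nn.functional as F

def calibration_loss(v_t: torch.Tensor, q_t: torch.Tensor, margin: float = 0.2) -> torch.Tensor:
    """v_t: (B, n, d) target visual features; q_t: (B, m, d) target text features."""
    v_bar = F.normalize(v_t.mean(dim=1), dim=-1)   # (B, d) average along the time axis
    q_bar = F.normalize(q_t.mean(dim=1), dim=-1)   # (B, d) average over query words
    sim = v_bar @ q_bar.t()                        # sim[i, j] = cosine(v_bar_i, q_bar_j)
    pos = sim.diag().unsqueeze(1)                  # similarity of each matched (positive) pair
    off_diag = ~torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    # Anchor = visual average, negatives = non-matching text averages (set Z_Q).
    text_neg = F.relu(margin - pos + sim)[off_diag]
    # Anchor = text average, negatives = non-matching visual averages (set Z_V).
    vis_neg = F.relu(margin - pos + sim.t())[off_diag]
    return text_neg.mean() + vis_neg.mean()
```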
S3: The initial visual features V̂^t of the target-category videos obtained in step S1 are randomly masked and reconstructed by the video feature reconstructor to obtain the reconstructed visual features V_recon, which further learns the temporal semantic relations and discriminable features between the visual and text features, as shown in fig. 4.
S3-1: The initial visual features V̂^t of the target-category videos obtained in step S1 are randomly masked with probability β and encoded through the visual encoder to obtain the encoded masked visual features V_m^t.
S3-2: The masked visual features V_m^t obtained in step S3-1 and the text features Q^t of the query texts of the target-category videos obtained in step S1 are fused by the cross-modal feature fuser to obtain the initial fusion features F_m of the masked videos.
S3-3: Video feature reconstruction is performed on the masked visual features V_m^t from step S3-1 and the initial fusion features F_m from step S3-2 to obtain the reconstructed visual features V_recon: V_m^t and F_m are combined by element-wise addition (⊕) and passed through 1-dimensional convolutional layers (Conv1D) with a ReLU activation function.
In this embodiment, the training loss of the video feature reconstructor in step S3 is the mean square error loss (MSELoss) between the reconstructed visual features V_recon and the original, unmasked visual features of the target-category videos.
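A brief sketch of the masking and reconstruction branch is given below; masking by zeroing frame features, the two-layer Conv1D head and regressing against the unmasked encoded features are assumptions made for illustration:

```python
# Sketch: random frame masking and a Conv1D/ReLU reconstruction head (design details assumed).
import torch
import torch.nn as nn

class VideoFeatureReconstructor(nn.Module):
    def __init__(self, d: int = 256, beta: float = 0.2):
        super().__init__()
        self.beta = beta
        self.head = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1),
        )

    def mask(self, v_hat: torch.Tensor) -> torch.Tensor:
        """Zero out each frame's feature vector with probability beta."""
        keep = (torch.rand(v_hat.shape[:2], device=v_hat.device) > self.beta).float()
        return v_hat * keep.unsqueeze(-1)

    def forward(self, v_mask: torch.Tensor, f_m: torch.Tensor) -> torch.Tensor:
        """v_mask, f_m: (B, n, d); element-wise addition then Conv1D/ReLU reconstruction."""
        x = (v_mask + f_m).transpose(1, 2)            # (B, d, n) for Conv1d
        return self.head(x).transpose(1, 2)           # reconstructed features V_recon

# Training signal: mean squared error against the unmasked visual features.
mse = nn.MSELoss()
```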
S4: respectively carrying out cross-modal feature fusion on the video features and the text features obtained in the step S1
Figure BDA0003705949050000065
Figure BDA0003705949050000066
And
Figure BDA0003705949050000067
performing feature fusion to obtain a fusion feature F s
Figure BDA0003705949050000068
F t
Figure BDA0003705949050000069
S4-1: calculating the visual characteristics of the source type video obtained in step S1
Figure BDA00037059490500000610
And text features
Figure BDA00037059490500000611
Cross-mode similarity matrix S between them belongs to R n×m (ii) a In this embodiment, a cross-modal similarity matrix is calculated by using cosine similarity;
s4-2: respectively normalizing the cross-modal similarity matrix S obtained in the step S4-1 along rows and columns by utilizing a SoftMax function to obtain a similarity density matrix S r And S c
S4-3: the similarity density matrix S obtained according to the step S4-2 r And S c And calculating to obtain a video-query text attention matrix A v ∈R n×d And query the text-to-video attention matrix A q ∈R m×d The calculation process is as follows:
Figure BDA00037059490500000612
Figure BDA00037059490500000613
s4-4: the video-query text attention matrix A calculated according to the step S4-3 v And query the text-to-video attention matrix A q Calculating to obtain initial fusion characteristics F of source type videos s The calculation process is as follows:
Figure BDA00037059490500000614
wherein, W f ∈R 4d×d ,b f ∈R d Both are learnable parameters;
Figure BDA00037059490500000615
Figure BDA00037059490500000616
Figure BDA00037059490500000617
representing the initial fusion characteristics of the ith source class video,
Figure BDA00037059490500000618
and representing the initial fusion characteristics of the jth frame of the ith source category video.
S4-5: the initial fusion feature F obtained according to step S4-4 s Obtaining final fusion characteristics of source type videos through a multi-head attention mechanism
Figure BDA0003705949050000071
Wherein,
Figure BDA0003705949050000072
representing the final fusion features of the ith source class video,
Figure BDA0003705949050000073
representing the final fusion characteristics of the jth frame of the ith source category video;
similarly, according to the method of steps S4-1 to S4-5, the
Figure BDA0003705949050000074
By replacement with
Figure BDA0003705949050000075
Will be provided with
Figure BDA0003705949050000076
By replacement with
Figure BDA0003705949050000077
Obtaining the initial fusion characteristics F of the target category video t And final blend feature
Figure BDA0003705949050000078
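The cross-modal feature fuser can be sketched as the attention-based module below; the exact concatenation order of the 4d-dimensional representation and the use of a single projection layer are assumptions consistent with the shapes stated above:

```python
# Sketch: cosine-similarity cross-modal attention fusion (concatenation order is assumed).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalFuser(nn.Module):
    def __init__(self, d: int = 256, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(4 * d, d)                       # W_f in R^{4d x d}, b_f in R^d
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, v: torch.Tensor, q: torch.Tensor):
        """v: (B, n, d) visual features; q: (B, m, d) text features."""
        s = F.normalize(v, dim=-1) @ F.normalize(q, dim=-1).transpose(1, 2)  # (B, n, m) cosine sims
        s_r = F.softmax(s, dim=-1)                            # row-normalized
        s_c = F.softmax(s, dim=1)                             # column-normalized
        a_v = s_r @ q                                         # (B, n, d) video-to-query attention
        a_q = s_c.transpose(1, 2) @ v                         # (B, m, d) query-to-video attention
        fused = torch.cat([v, a_v, v * a_v, v * (s_r @ a_q)], dim=-1)
        f_init = self.proj(fused)                             # initial fusion features F
        f_final, _ = self.attn(f_init, f_init, f_init)        # final fusion via multi-head attention
        return f_init, f_final
```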
S5: respectively comparing the video features obtained in step S1 with the domain identifier
Figure BDA0003705949050000079
And text features
Figure BDA00037059490500000710
Performing unimodal domain invariance feature expression learning and performing initial fusion feature F obtained in step S4 s 、F t Performing cross-modal domain invariance feature expression learning so as to reduce the domain gaps of different modal features;
in this embodiment, as shown in FIG. 2, the domain identifier includes a visual feature identifier D v Text feature identifier D q And a fusion feature discriminator D f
For video features
Figure BDA00037059490500000711
And text features
Figure BDA00037059490500000712
Video feature discriminator D v And a text feature discriminator D q The calculation process is as follows:
Figure BDA00037059490500000713
Figure BDA00037059490500000714
wherein, MLP v And MLP p The method is a multi-layer perceptron for visual features and text features, GRL is a gradient inversion layer, and sigma is a sigmod function.
For the initial fusion feature F s /F t Merging features identifier D f The calculation process is as follows:
D f (F k )=σ(MLP f (GRL(F k ))),k∈{s,t}
wherein, MLP f Is a multi-layer perceptron.
Video feature discriminator D v Text feature identifier D q And a fusion feature discriminator D f Output [0,1 ]]The scalar in between to represent the probability that the input feature is from the target class.
In the training process, firstly, the domain label is defined
Figure BDA00037059490500000715
Figure BDA00037059490500000716
That is, the feature label from the source category is 0, and the feature label from the target category is 1.
The calculation process of the confrontation loss of the domain discriminator in the training process is as follows:
Figure BDA0003705949050000081
wherein BCELoss is a binary cross entropy loss.
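The adversarial branch can be sketched as follows; the gradient reversal layer follows the standard formulation, while mean-pooling the feature sequences before the MLPs is an assumption, since the patent does not specify it:

```python
# Sketch: gradient reversal layer and a feature-level domain discriminator (pooling assumed).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam: float = 1.0):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lam * grad_output, None   # reverse gradients flowing back into the encoders

class DomainDiscriminator(nn.Module):
    """Outputs the probability in [0, 1] that a feature sequence comes from the target category."""
    def __init__(self, d: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (B, L, d)
        h = GradReverse.apply(feats.mean(dim=1))               # GRL on time-pooled features
        return torch.sigmoid(self.mlp(h)).squeeze(-1)          # (B,)

d_v, d_q, d_f = DomainDiscriminator(), DomainDiscriminator(), DomainDiscriminator()
bce = nn.BCELoss()  # source features labeled 0, target features labeled 1
```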
S6: final fusion characteristics of the source category video obtained in step S4 through double affine predictor
Figure BDA0003705949050000087
And predicting to obtain the prediction probability of all possible results corresponding to each query text, wherein the maximum prediction probability of all possible results corresponding to the query text is the final prediction result.
S6-1: defining the candidate segment P (fs, fe) of query text match of each video in the source category video, wherein fs is the start frame number, fe is the end frame number, fs, fe are e [1, n ∈]. The final fusion feature of each video obtained from step S4
Figure BDA0003705949050000082
Extracting the corresponding fusion characteristics of the starting frame and the ending frame of the candidate segment
Figure BDA0003705949050000083
Generation of start frames and knots through linear layersHidden feature r of bundle frame fs 、r fe ∈R d The calculation process is as follows:
Figure BDA0003705949050000084
wherein, W 1 /W 2 ∈R 4d×d 、b 1 /b 2 ∈R d Parameters may be learned for the linear layer.
Since each video in the source category video consists of n video frames, all candidate segments can be represented as P ═ fs, fs ∈ [1, n ], fe ∈ [1, n ], fs ≦ fe.
S6-2: introducing the hidden feature r obtained in the step S6-1 fs 、r fe Obtaining a matrix containing the prediction probabilities of all candidate video segments P through a double affine mechanism
Figure BDA0003705949050000085
And combining the frames corresponding to the maximum value of the matrix M into a final prediction result. M is a group of P The calculation process is as follows:
Figure BDA0003705949050000086
wherein, U m ∈R d×d ,W m ∈R d ,b m E.g. R is a learnable parameter, M P Representing the probability that the video clip P corresponds to the query text.
In this embodiment, the dual affine predictor in step S6 uses a scaling intersection ratio (IoU) function as a supervision signal to complete the training process in the training process. The specific contents are as follows:
IoU Score of video clip P P The calculation process is as follows:
Figure BDA0003705949050000091
wherein τ is (τ) fsfe ) To search forThe true time boundary for text matching is queried. (Max score) representing IoU score matrix
Figure BDA0003705949050000092
Maximum value of (2).
The loss calculation process of the double affine predictor in the training process is as follows:
Figure BDA0003705949050000093
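A sketch of scoring all candidate (start, end) frame pairs with a biaffine function is shown below; producing r_fs and r_fe with two linear heads over the final fusion features and scoring all n×n pairs at once are illustrative assumptions:

```python
# Sketch: biaffine scoring of all candidate (start, end) frame pairs (details assumed).
import torch
import torch.nn as nn

class BiaffinePredictor(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.start = nn.Sequential(nn.Linear(d, d), nn.ReLU())   # produces r_fs
        self.end = nn.Sequential(nn.Linear(d, d), nn.ReLU())     # produces r_fe
        self.U = nn.Parameter(torch.randn(d, d) * 0.02)          # bilinear term U_m
        self.w = nn.Linear(2 * d, 1)                             # linear term over spliced features

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        """fused: (B, n, d) final fusion features; returns (B, n, n) probabilities M."""
        r_s, r_e = self.start(fused), self.end(fused)             # (B, n, d) each
        bilinear = torch.einsum("bid,de,bje->bij", r_s, self.U, r_e)
        n = fused.size(1)
        pair = torch.cat([r_s.unsqueeze(2).expand(-1, -1, n, -1),
                          r_e.unsqueeze(1).expand(-1, n, -1, -1)], dim=-1)
        score = bilinear + self.w(pair).squeeze(-1)               # (B, n, n)
        return torch.sigmoid(score)                               # M_P for every candidate P

# The argmax over the upper triangle (fs <= fe) gives the predicted segment; training
# compares M against IoU scores scaled by their per-video maximum.
```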
Preferably, the final loss function L of the training phase combines the above losses: the biaffine prediction loss L_pred together with the adversarial domain-discrimination loss L_adv, the cross-modal calibration loss L_cal and the video reconstruction loss L_recon weighted by the hyper-parameters λ_1, λ_2 and λ_3.
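Putting the pieces together, the joint objective could be assembled as in the sketch below; the assignment of λ_1, λ_2 and λ_3 to the individual auxiliary losses is an assumption, since the text only states that they are hyper-parameters:

```python
# Sketch: joint training objective (the mapping of the weights to the losses is assumed).
def total_loss(loss_pred, loss_adv, loss_cal, loss_recon,
               lambda1: float = 0.5, lambda2: float = 0.2, lambda3: float = 0.5):
    """Supervised prediction loss on source data plus weighted auxiliary losses."""
    return loss_pred + lambda1 * loss_adv + lambda2 * loss_cal + lambda3 * loss_recon
```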
The present embodiment also provides a cross-category video temporal localization system based on adversarial multi-modal domain adaptation, which is used to implement the above embodiment. The terms "module", "unit" and the like used below may be a combination of software and/or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or a combination of software and hardware, is also possible.
The system comprises:
a data acquisition and feature extraction module, configured to acquire the source-category videos, the target-category videos and the query text corresponding to each video, extract the initial visual features of the videos and the initial text features of the query texts, and encode the initial visual features and the initial text features as the final visual features and text features;
a cross-modal feature calibration module, configured to perform semantic information calibration on the obtained visual features and text features of the target-category videos through the cross-modal feature calibrator;
a video feature reconstruction module, configured to randomly mask the visual features of the target-category videos through the video feature reconstructor and reconstruct them to obtain the reconstructed visual features;
a cross-modal feature fusion module, configured to fuse the obtained video features and text features through the cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
a domain discrimination module, configured to perform single-modal domain-invariant feature representation learning on the obtained video features and text features and cross-modal domain-invariant feature representation learning on the obtained initial fusion features through the domain discriminator;
and a biaffine prediction module, configured to predict the obtained final fusion features of the source-category videos through the biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
The data acquisition and feature extraction module, the cross-modal feature fusion module and the biaffine prediction module form the basic training module.
The implementation of the functions and effects of each module in the system is described in detail in the implementation of the corresponding steps of the method; for example, one specific workflow of the system may be:
(1) acquiring the source-category videos, the target-category videos and the query text corresponding to each video, extracting the initial visual features of the videos and the initial text features of the query texts, and encoding them as the final visual features and text features;
(2) performing semantic information calibration on the visual features and text features of the target-category videos obtained in step (1) through the cross-modal feature calibrator;
(3) randomly masking the visual features of the target-category videos obtained in step (1) through the video feature reconstructor and reconstructing them to obtain the reconstructed visual features;
(4) fusing the video features and text features obtained in step (1) through the cross-modal feature fuser to obtain the initial and final fusion features of the source-category videos and of the target-category videos;
(5) performing single-modal domain-invariant feature representation learning on the video features and text features obtained in step (1) and cross-modal domain-invariant feature representation learning on the initial fusion features obtained in step (4) through the domain discriminator;
(6) predicting the final fusion features of the source-category videos obtained in step (4) through the biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
For the system embodiment, since it basically corresponds to the method embodiment, the specific implementation manner of each step may refer to the description of the method portion, and is not described herein again. The above described system embodiments are merely illustrative, wherein the modules illustrated as separate components may or may not be physically separate, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
Embodiments of the system of the present invention may be applied to any data processing capable device, which may be a device or apparatus such as a computer. The system embodiments may be implemented by software, or by hardware, or by a combination of hardware and software. The software implementation is taken as an example, and as a logical device, the device is formed by reading corresponding computer program instructions in the nonvolatile memory into the memory for running through the processor of any device with data processing capability. In terms of hardware, as shown in fig. 5, in addition to the processor, the memory, the network interface, and the nonvolatile memory shown in fig. 5, any device with data processing capability where the system is located in the embodiment may also include other hardware according to the actual function of the any device with data processing capability, which is not described in detail herein.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the above cross-category video time positioning method based on the adaptation against the multi-modal domain.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both internal storage units and external storage devices of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing capable device, and may also be used for temporarily storing data that has been output or is to be output.
The technical effects of the present invention are verified by experiments below.
The invention uses the ActivityNet Captions dataset and the Charades-STA dataset for model training and validation. The ActivityNet Captions dataset contains 5 categories: sports, socializing, eating, household and personal care; the Charades-STA dataset contains 5 categories: kitchen, bedroom, living room, bathroom and hallway.
The experimental parameters of the invention are set as follows: the training process is optimized with an AdamW optimizer with a decay rate of 1e-6; the learning rate is 1e-3; the batch size is 64; the dropout is 0.4 for the ActivityNet Captions dataset and 0.3 for the Charades-STA dataset; λ_1, λ_2 and λ_3 are 0.5, 0.2 and 0.5, respectively; and the random masking probability β is 0.2.
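As a sketch, this training setup might be configured as follows; the stand-in model object is a placeholder, and reading the stated decay rate of 1e-6 as the AdamW weight decay is an assumption:

```python
# Sketch: optimizer and hyper-parameter setup per the experimental description (names assumed).
import torch
import torch.nn as nn

model = nn.Linear(256, 256)   # stand-in for the full AMDA model assembled from the modules above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-6)
batch_size = 64
dropout = 0.4                 # 0.4 for ActivityNet Captions, 0.3 for Charades-STA
lambda1, lambda2, lambda3 = 0.5, 0.2, 0.5
mask_beta = 0.2               # random masking probability beta
```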
First, the performance of the basic training module (AMDA-base) is verified; the statistical results are shown in Tables 1 and 2. Although AMDA-base adopts a simpler architecture, its performance on the video temporal localization task is nearly on par with other state-of-the-art models, and it also provides a strong base for the proposed model (AMDA) on the cross-category video temporal localization task.
TABLE 1 Performance of AMDA-base on the ActivityNet Captions dataset
TABLE 2 Performance of AMDA-base on the Charades-STA dataset
Next, the domain deviation between data of different categories is verified based on AMDA-base; the statistical results are shown in Tables 3 and 4. For the ActivityNet Captions dataset, sports is taken as the source-category data and the other 4 categories as the target-category data. For the Charades-STA dataset, kitchen is taken as the source-category data and the other 4 categories as the target-category data. "Source only" means the input data contain only the labeled source category, "target only" means the input data contain only the labeled target category, and for joint training the input data comprise both the labeled source category and the labeled target category. The experimental results show that obvious domain gaps exist between data of different categories, and joint training achieves the best results because it has labels for both the source-category and the target-category data.
TABLE 3 Performance with different data categories on the ActivityNet Captions dataset
TABLE 4 Performance with different data categories on the Charades-STA dataset
Finally, the performance of the AMDA model is verified sub-module by sub-module. The input data of the AMDA model are the labeled source-category data and the unlabeled target-category data; for the ActivityNet Captions dataset, sports is taken as the source-category data and the other 4 categories as the target-category data, and for the Charades-STA dataset, kitchen is taken as the source-category data and the other 4 categories as the target-category data. The statistical results are shown in Tables 5 and 6. The results show that the full AMDA model performs best, indicating that the proposed AMDA model has better generalization ability in the cross-category video temporal localization task.
TABLE 5 Performance of the AMDA model on the ActivityNet Captions dataset
TABLE 6 Performance of the AMDA model on the Charades-STA dataset
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the patent. It should be noted that several variations and modifications can be made by a person of ordinary skill in the art without departing from the concept of the present application, all of which fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A cross-category video temporal localization method based on adversarial multi-modal domain adaptation, characterized by comprising the following steps:
S1: acquiring source-category videos, target-category videos and the query text corresponding to each video, extracting the initial visual features of the videos and the initial text features of the query texts, and encoding the initial visual features and the initial text features as the final visual features and text features;
S2: performing semantic information calibration on the visual features and text features of the target-category videos obtained in step S1 through a cross-modal feature calibrator;
S3: randomly masking the visual features of the target-category videos obtained in step S1 through a video feature reconstructor and performing visual feature reconstruction to obtain reconstructed visual features;
S4: fusing the video features and text features obtained in step S1 through a cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
S5: performing single-modal domain-invariant feature representation learning on the video features and text features obtained in step S1 and cross-modal domain-invariant feature representation learning on the initial fusion features obtained in step S4 through a domain discriminator;
S6: predicting the final fusion features of the source-category videos obtained in step S4 through a biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
2. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the step S1 comprises:
S1-1: defining the input data:
{V^s, Q^s, T^s} = {(V_i^s, Q_i^s, τ_i^s)}_{i=1}^{B}
{V^t, Q^t, T^t} = {(V_i^t, Q_i^t)}_{i=1}^{B}
T^t = ∅
wherein V^s is the set of source-category videos, Q^s is the set of query texts of the source-category videos, T^s is the label set of the source-category data, (V_i^s, Q_i^s) denotes the ith source video in the source-category videos and its corresponding query text, and τ_i^s denotes the ground-truth temporal boundary matching the ith query text in the source-category videos to the ith source video; V^t is the set of target-category videos, Q^t is the set of query texts of the target-category videos, T^t is the label set of the target-category data, which is empty; (V_i^t, Q_i^t) denotes the ith target video in the target-category videos and its corresponding query text; B denotes the batch size;
S1-2: extracting the semantic information of the source-category videos V^s and the target-category videos V^t to obtain the initial visual features V̂^s and V̂^t, and extracting the semantic information of the query texts Q^s of the source-category videos and the query texts Q^t of the target-category videos to obtain the initial text features Q̂^s and Q̂^t;
S1-3: encoding the initial visual features V̂^s, V̂^t and the initial text features Q̂^s, Q̂^t obtained in step S1-2, respectively, to obtain the encoded visual features and text features as the final features; the encoded visual features and text features are expressed as:
V^s = {V_i^s}_{i=1}^{B} ∈ R^{B×n×d}, V_i^s = {V_{i,j}^s}_{j=1}^{n}
V^t = {V_i^t}_{i=1}^{B} ∈ R^{B×n×d}, V_i^t = {V_{i,j}^t}_{j=1}^{n}
Q^s = {Q_i^s}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^s = {Q_{i,j}^s}_{j=1}^{m}
Q^t = {Q_i^t}_{i=1}^{B} ∈ R^{B×m×d}, Q_i^t = {Q_{i,j}^t}_{j=1}^{m}
wherein n is the number of frames in each video, m is the query text length, and d is the hidden dimension; V^s are the visual features of the source-category videos, V_i^s are the visual features of the ith source-category video, and V_{i,j}^s are the visual features of the jth frame of the ith source-category video; V^t are the visual features of the target-category videos, V_i^t are the visual features of the ith target-category video, and V_{i,j}^t are the visual features of the jth frame of the ith target-category video; Q^s are the text features of the query texts of the source-category videos, Q_i^s are the text features of the query text of the ith source-category video, and Q_{i,j}^s are the text features of the jth word in the query text of the ith source-category video; Q^t are the text features of the query texts of the target-category videos, Q_i^t are the text features of the query text of the ith target-category video, and Q_{i,j}^t are the text features of the jth word in the query text of the ith target-category video.
3. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 2, wherein in step S3, the random masking and visual feature reconstruction are performed on the initial visual features V̂^t of the target-category videos obtained in step S1.
4. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 3, wherein the step S3 comprises:
S3-1: randomly masking the initial visual features V̂^t of the target-category videos obtained in step S1 with probability β, and encoding them through the visual encoder to obtain the encoded masked visual features V_m^t;
S3-2: fusing, through the cross-modal feature fuser, the masked visual features V_m^t obtained in step S3-1 and the text features Q^t of the query texts of the target-category videos obtained in step S1 to obtain the initial fusion features F_m of the masked videos;
S3-3: performing video feature reconstruction on the masked visual features V_m^t obtained in step S3-1 and the initial fusion features F_m obtained in step S3-2 to obtain the reconstructed visual features V_recon, wherein V_m^t and F_m are combined by element-wise addition (⊕) and passed through 1-dimensional convolutional layers (Conv1D) with a ReLU activation function;
the training loss of the video feature reconstructor adopts the mean square error loss.
5. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the step S2 comprises:
S2-1: computing the averages of the visual features and text features of the target-category videos along the time axis:
V̄_i^t = (1/n) Σ_{j=1}^{n} V_{i,j}^t
Q̄_i^t = (1/m) Σ_{j=1}^{m} Q_{i,j}^t
wherein V̄_i^t denotes the average of the visual features V_{i,j}^t of all frames of the ith target-category video, V̄^t = {V̄_i^t}_{i=1}^{B} denotes the set of these averages, Q̄_i^t denotes the average of the text features of all words in the query text of the ith target-category video, and Q̄^t = {Q̄_i^t}_{i=1}^{B} denotes the set of these averages;
S2-2: constructing positive and negative samples from V̄^t and Q̄^t: the matched pair (V̄_i^t, Q̄_i^t) is taken as a positive sample, and the corresponding negative samples V̄_z and Q̄_z are selected for V̄_i^t and Q̄_i^t from the non-matching features, forming the negative sample set Z_V of visual features and the negative sample set Z_Q of text features;
S2-3: training the cross-modal feature calibrator with the positive and negative samples to calibrate the semantic information of the visual features and text features of the target-category videos; the loss function of the cross-modal feature calibrator is:
L_cal = (1/B) Σ_{i=1}^{B} [ Σ_{V̄_z ∈ Z_V} L_tri(Q̄_i^t, V̄_i^t, V̄_z) + Σ_{Q̄_z ∈ Z_Q} L_tri(V̄_i^t, Q̄_i^t, Q̄_z) ]
L_tri(x, x⁺, x⁻) = max(0, Δ − l(x, x⁺) + l(x, x⁻))
wherein L_cal denotes the cross-modal feature calibrator loss, L_tri denotes the triplet loss, B denotes the batch size, Δ is a margin, l(·,·) computes the cosine similarity between vectors, Z_V denotes the negative sample set of visual features, and Z_Q denotes the negative sample set of text features.
6. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the step S4 comprises:
S4-1: computing the cross-modal similarity matrix S between the visual features V^k and the text features Q^k obtained in step S1, where k ∈ {s, t}; when k = s, V^s are the visual features of the source-category videos and Q^s are the text features of the query texts of the source-category videos; when k = t, V^t are the visual features of the target-category videos and Q^t are the text features of the query texts of the target-category videos;
S4-2: normalizing the cross-modal similarity matrix S obtained in step S4-1 along its rows and along its columns to obtain the similarity density matrices S_r and S_c, respectively;
S4-3: computing, from the similarity density matrices S_r and S_c obtained in step S4-2, the video-to-query-text attention matrix A_v and the query-text-to-video attention matrix A_q as:
A_v = S_r Q^k
A_q = S_c^T V^k
S4-4: computing, from the attention matrices A_v and A_q obtained in step S4-3, the initial fusion features F_k as:
F_k = [ V^k ; A_v ; V^k ⊙ A_v ; V^k ⊙ (S_r A_q) ] W_f + b_f, k ∈ {s, t}
wherein W_f and b_f are learnable parameters, [ · ; · ] denotes concatenation and ⊙ denotes element-wise multiplication; when k = s, F_s denotes the initial fusion features of the source-category videos, F_{s,i} denotes the initial fusion features of the ith source-category video, and F_{s,i,j} denotes the initial fusion features of the jth frame of the ith source-category video; when k = t, F_t denotes the initial fusion features of the target-category videos, F_{t,i} denotes the initial fusion features of the ith target-category video, and F_{t,i,j} denotes the initial fusion features of the jth frame of the ith target-category video;
S4-5: obtaining, from the initial fusion features F_k obtained in step S4-4, the final fusion features F̃_k through a multi-head attention mechanism; when k = s, F̃_{s,i} denotes the final fusion features of the ith source-category video and F̃_{s,i,j} denotes the final fusion features of the jth frame of the ith source-category video; when k = t, F̃_{t,i} denotes the final fusion features of the ith target-category video and F̃_{t,i,j} denotes the final fusion features of the jth frame of the ith target-category video.
7. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 1, wherein the domain discriminator comprises a visual feature discriminator D_v, a text feature discriminator D_q and a fusion feature discriminator D_f; the step S5 comprises:
the visual feature discriminator D_v discriminates the visual features V^s of the source-category videos and the visual features V^t of the target-category videos obtained in step S1, computed as:
D_v(V^k) = σ(MLP_v(GRL(V^k))), k ∈ {s, t}
the text feature discriminator D_q discriminates the text features Q^s of the query texts of the source-category videos and the text features Q^t of the query texts of the target-category videos obtained in step S1, computed as:
D_q(Q^k) = σ(MLP_p(GRL(Q^k))), k ∈ {s, t}
the fusion feature discriminator D_f discriminates the initial fusion features F_s of the source-category videos and the initial fusion features F_t of the target-category videos obtained in step S4, computed as:
D_f(F_k) = σ(MLP_f(GRL(F_k))), k ∈ {s, t}
wherein MLP_v, MLP_p and MLP_f are multi-layer perceptrons for the visual features, the text features and the initial fusion features, respectively, GRL is a gradient reversal layer, and σ is the sigmoid function;
the visual feature discriminator D_v, the text feature discriminator D_q and the fusion feature discriminator D_f each output a scalar in [0, 1] representing the probability that the input feature comes from the target category.
8. The cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to claim 6, wherein the step S6 comprises:
S6-1: defining the candidate segments P = (fs, fe) matching the query text of each video in the source-category videos, wherein fs is the start frame index, fe is the end frame index, fs, fe ∈ [1, n], and n is the number of frames in the video;
extracting, from the final fusion features F̃_s of each video obtained in step S4, the fusion features F̃_{s,fs} and F̃_{s,fe} corresponding to the start frame and the end frame of the candidate segment, and generating the hidden features r_fs and r_fe of the start and end frames from them through linear layers, wherein W_1, W_2, b_1 and b_2 are the learnable parameters of the linear layers;
since each video in the source-category videos consists of n video frames, all candidate segments can be represented as P = (fs, fe), fs ∈ [1, n], fe ∈ [1, n], fs ≤ fe;
S6-2: combining the hidden features r_fs and r_fe obtained in step S6-1 through a biaffine mechanism to obtain the matrix M containing the prediction probabilities of all candidate video segments P, and taking the frame pair corresponding to the maximum value of the matrix M as the final prediction result;
each entry M_P is obtained by a biaffine function of r_fs and r_fe that combines a bilinear term r_fs^T U_m r_fe with a linear term over the spliced hidden features weighted by W_m and a bias b_m, wherein U_m, W_m and b_m are learnable parameters, the superscript T denotes transposition, and M_P denotes the probability that the video segment P corresponds to the query text.
9. A cross-category video temporal localization system based on adversarial multi-modal domain adaptation, for implementing the cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to any one of claims 1 to 8, the system comprising:
a data acquisition and feature extraction module, configured to acquire the source-category videos, the target-category videos and the query text corresponding to each video, extract the initial visual features of the videos and the initial text features of the query texts, and encode the initial visual features and the initial text features as the final visual features and text features;
a cross-modal feature calibration module, configured to perform semantic information calibration on the obtained visual features and text features of the target-category videos through the cross-modal feature calibrator;
a video feature reconstruction module, configured to randomly mask the visual features of the target-category videos through the video feature reconstructor and reconstruct them to obtain the reconstructed visual features;
a cross-modal feature fusion module, configured to fuse the obtained video features and text features through the cross-modal feature fuser to obtain the initial fusion features and final fusion features of the source-category videos and the initial fusion features and final fusion features of the target-category videos;
a domain discrimination module, configured to perform single-modal domain-invariant feature representation learning on the obtained video features and text features and cross-modal domain-invariant feature representation learning on the obtained initial fusion features through the domain discriminator;
and a biaffine prediction module, configured to predict the obtained final fusion features of the source-category videos through the biaffine predictor to obtain the prediction probability of every possible result corresponding to each query text, the result with the maximum prediction probability being the final prediction result.
10. A computer-readable storage medium having a program stored thereon, wherein the program, when executed by a processor, implements the cross-category video temporal localization method based on adversarial multi-modal domain adaptation according to any one of claims 1 to 8.
CN202210707517.9A 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation Pending CN115035455A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210707517.9A CN115035455A (en) 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation
LU502690A LU502690B1 (en) 2022-06-21 2022-08-22 Cross-category video time locating method, system and storage medium based on adversarial multi-modal domain adaptation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210707517.9A CN115035455A (en) 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation

Publications (1)

Publication Number Publication Date
CN115035455A true CN115035455A (en) 2022-09-09

Family

ID=83124190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210707517.9A Pending CN115035455A (en) 2022-06-21 2022-06-21 Cross-category video temporal localization method, system and storage medium based on adversarial multi-modal domain adaptation

Country Status (2)

Country Link
CN (1) CN115035455A (en)
LU (1) LU502690B1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116052040A (en) * 2022-12-16 2023-05-02 广东工业大学 Multi-modal query vector and confidence coefficient-based reference video segmentation method


Also Published As

Publication number Publication date
LU502690B1 (en) 2023-02-23

Similar Documents

Publication Publication Date Title
Kim et al. Domain adaptation without source data
CN111680217B (en) Content recommendation method, device, equipment and storage medium
CN113792113A (en) Visual language model obtaining and task processing method, device, equipment and medium
CN110309321B (en) Knowledge representation learning method based on graph representation learning
Cascianelli et al. Full-GRU natural language video description for service robotics applications
CN110888980A (en) Implicit discourse relation identification method based on knowledge-enhanced attention neural network
CN112507039A (en) Text understanding method based on external knowledge embedding
CN112651940B (en) Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network
Chou et al. Learning to Recognize Transient Sound Events using Attentional Supervision.
CN109933682B (en) Image hash retrieval method and system based on combination of semantics and content information
CN115695950B (en) Video abstract generation method based on content perception
CN115831377A (en) Intra-hospital death risk prediction method based on ICU (intensive care unit) medical record data
CN117558270B (en) Voice recognition method and device and keyword detection model training method and device
Alshehri et al. Generative adversarial zero-shot learning for cold-start news recommendation
CN115146068A (en) Method, device and equipment for extracting relation triples and storage medium
CN111259264A (en) Time sequence scoring prediction method based on generation countermeasure network
CN115035455A (en) Cross-category video time positioning method, system and storage medium based on multi-modal domain resisting self-adaptation
Ma et al. Distilling knowledge from self-supervised teacher by embedding graph alignment
Jin et al. Sequencepar: Understanding pedestrian attributes via a sequence generation paradigm
CN115861902B (en) Unsupervised action migration and discovery method, system, device and medium
CN115964497A (en) Event extraction method integrating attention mechanism and convolutional neural network
CN113936333A (en) Action recognition algorithm based on human body skeleton sequence
Wu et al. Deep Hybrid Neural Network With Attention Mechanism for Video Hash Retrieval Method
CN111552827A (en) Labeling method and device, and behavior willingness prediction model training method and device
Duan et al. Deep Hashing Based Fusing Index Method for Large‐Scale Image Retrieval

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination