WO2023040506A1 - Model-based data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product - Google Patents

Model-based data processing method, apparatus, electronic device, computer-readable storage medium, and computer program product

Info

Publication number
WO2023040506A1
WO2023040506A1 PCT/CN2022/110247 CN2022110247W
Authority
WO
WIPO (PCT)
Prior art keywords
video
frame
temporal relationship
model
action recognition
Prior art date
Application number
PCT/CN2022/110247
Other languages
English (en)
French (fr)
Inventor
王菡子
王光格
祁仲昂
单瀛
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 filed Critical 腾讯科技(深圳)有限公司
Publication of WO2023040506A1 publication Critical patent/WO2023040506A1/zh
Priority to US18/199,528 priority Critical patent/US20230353828A1/en


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/45Management operations performed by the client for facilitating the reception of or the interaction with the content or administrating data related to the end-user or to the client device itself, e.g. learning user preferences for recommending movies, resolving scheduling conflicts
    • H04N21/462Content or additional data management, e.g. creating a master electronic program guide from data received from the Internet and a Head-end, controlling the complexity of a video stream by scaling the resolution or bit-rate based on the client capabilities
    • H04N21/4627Rights management associated to the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/71Indexing; Data structures therefor; Storage structures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/44Event detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/482End-user interface for program selection
    • H04N21/4826End-user interface for program selection using recommendation lists, e.g. of programs or channels sorted out according to their score
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23418Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics

Definitions

  • the present application relates to image processing technology in the field of video, and in particular to a model-based data processing method, device, electronic equipment, computer-readable storage medium and computer program product.
  • the embodiments of the present application provide a model-based data processing method, device, electronic equipment, computer-readable storage medium, and computer program product, which can enhance the generalization of the action recognition model and improve the accuracy of the action recognition model.
  • the embodiment of the present application provides a model-based data processing method, including:
  • extracting a first training sample set to obtain a second training sample set and a query video, wherein the first training sample set includes different types of video samples;
  • processing the second training sample set through an embedding layer network in the action recognition model to obtain a first frame feature sequence, and processing the query video through the embedding layer network to obtain a second frame feature sequence;
  • processing the first frame feature sequence through a temporal relationship network in the action recognition model to obtain a first temporal relationship descriptor, and processing the second frame feature sequence through the temporal relationship network to obtain a second temporal relationship descriptor;
  • Adjust model parameters of the action recognition model according to the first temporal relationship descriptor and the second temporal relationship descriptor, and the adjusted action recognition model is used to identify actions in the video to be recognized.
  • the embodiment of the present application also provides a model-based data processing device, including:
  • the sample acquisition module is configured to extract the first training sample set to obtain a second training sample set and query video, wherein the first training sample set includes different types of video samples;
  • the feature extraction module is configured to process the second training sample set through the embedding layer network in the action recognition model to obtain the first frame feature sequence; and process the query video through the embedding layer network to obtain The second frame feature sequence;
  • the timing processing module is configured to process the first frame feature sequence through the temporal relationship network in the action recognition model to obtain a first temporal relationship descriptor, and to process the second frame feature sequence through the temporal relationship network to obtain a second temporal relationship descriptor;
  • the model training module is configured to adjust the model parameters of the action recognition model according to the first temporal relationship descriptor and the second temporal relationship descriptor, and the adjusted action recognition model is used to recognize actions in the video to be recognized.
  • An embodiment of the present application provides an electronic device for data processing based on a model, and the electronic device includes:
  • a memory configured to store executable instructions; and a processor configured to implement the model-based data processing method provided by the embodiment of the present application when running the executable instructions stored in the memory.
  • An embodiment of the present application provides a computer-readable storage medium storing executable instructions, and when the executable instructions are executed by a processor, the model-based data processing method provided in the embodiment of the present application is implemented.
  • An embodiment of the present application provides a computer program product, including a computer program or an instruction.
  • When the computer program or instruction is executed by a processor, the model-based data processing method provided in the embodiment of the present application is implemented.
  • the embodiment of the present application has the following beneficial effects: a second training sample set and a query video are extracted as training data from a first training sample set that includes different types of video samples; a first temporal relationship descriptor is then obtained from the first frame feature sequence of the second training sample set, and a second temporal relationship descriptor is obtained from the second frame feature sequence of the query video; finally, the model parameters of the action recognition model are adjusted according to the first temporal relationship descriptor and the second temporal relationship descriptor. Since the first temporal relationship descriptor and the second temporal relationship descriptor used in the adjustment represent the temporal relationship between video frame sequences, and since an action in a video unfolds according to a certain temporal order, mining the temporal relationship between video frame sequences and adjusting the parameters of the action recognition model through the temporal relationship descriptors enables the adjusted action recognition model to accurately recognize the actions in the video, thereby enhancing the generalization of the model and improving the accuracy of the action recognition model.
  • FIG. 1 is a schematic diagram of an application scenario of an exemplary model-based data processing method provided by an embodiment of the present application
  • FIG. 2 is a schematic diagram of the composition and structure of the electronic device provided by the embodiment of the present application.
  • FIG. 3 is an optional schematic flowchart of a model-based data processing method provided in an embodiment of the present application
  • FIG. 4 is an optional schematic diagram of extracting small sample action video frames in the embodiment of the present application.
  • FIG. 5 is another optional schematic flowchart of the model-based data processing method provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of an optional process of video similarity judgment in the embodiment of the present application.
  • FIG. 7 is a schematic diagram of a usage scenario of the model-based data processing method provided by the embodiment of the present application.
  • FIG. 8 is a schematic diagram of an exemplary video recognition process provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of a process of a video object recognition method provided by an embodiment of the present application.
  • the video to be identified refers to various forms of video information available on the Internet, such as video files and multimedia information presented in the client or smart devices.
  • Client: the carrier that implements specific functions in a terminal; for example, the mobile client (APP) is the carrier of specific functions in a mobile terminal, such as the function of performing online live broadcast (video streaming) or the function of playing online video.
  • Neural network: in the field of machine learning and cognitive science, an artificial neural network is a mathematical model or computational model that imitates the structure and function of biological neural networks, and is used to estimate or approximate functions.
  • Downsampling is to sample a sequence of sample values at intervals, that is, one sample is taken every several samples of the sequence, and the new sequence obtained in this way is the downsampling result of the original sequence; for example, for an image I with a size of M*X, downsampling it by a factor of s yields an image with a size of (M/s)*(X/s), where s is a common divisor of M and X.
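  • The following is a minimal NumPy sketch of the downsampling example above; the 640*480 image size and s=8 are illustrative assumptions, not values from the application.

```python
# A minimal sketch: an M*X image is reduced by a factor s (a common divisor of M and X)
# by keeping every s-th pixel in each dimension.
import numpy as np

def downsample(image, s):
    assert image.shape[0] % s == 0 and image.shape[1] % s == 0
    return image[::s, ::s]          # keep one sample out of every s along each axis

img = np.random.rand(640, 480)
print(downsample(img, 8).shape)     # (80, 60)
```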
  • Meta-Learning also known as Learning to Learn, refers to the process of learning how to learn. Traditional machine learning is to learn a mathematical model for prediction from scratch, which is far from the process of human learning and accumulating historical experience (also known as meta-knowledge) to guide new learning tasks. Meta-learning is the learning and training process of learning different machine learning tasks, and learning how to train a model faster and better.
  • Few-Shot Learning is used to quickly and efficiently train a prediction model when only a small number (less than a specified number) of labeled samples is available.
  • Few-shot learning is an application of meta-learning in the field of supervised learning.
  • the training of the action recognition model is a small-sample learning process.
  • the training setting information (N-Way K-Shot) of small sample learning in the field of classification refers to that during the training phase, N types are extracted from the training set, each type corresponds to K samples, and there are a total of N*K samples.
  • the N*K samples constitute a meta-task, which is called the support set (Support Set) of the model.
  • a batch of samples are drawn from the remaining data sets except the support set as the prediction object of the model (Query Set).
  • the model training and testing unit (Task) of meta-learning consists of a support set and a query set; for example, when N-Way K-Shot is 5-Way 5-Shot, 5 types are randomly selected from the data set, 5 samples are then randomly selected for each type to form the support set, and a certain number (for example, 15) of further samples of the same types are drawn to form the query set; thus, the support set composed of 5*5 samples and the query set consisting of 15 samples form a model training and testing unit for meta-learning (a minimal sampling sketch is given after this definition).
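  • The following is a minimal sketch, in Python, of building one N-Way K-Shot meta-task as described above; the dataset layout (a dict from type name to a list of samples) and the helper name make_task are illustrative assumptions.

```python
# A minimal sketch: draw N types, K samples per type for the support set, and Q further
# samples of the same types for the query set.
import random

def make_task(dataset, n_way=5, k_shot=5, q_query=15):
    """dataset: {type_name: [sample, ...]} -> (support, query) lists of (sample, type)."""
    types = random.sample(list(dataset), n_way)
    support, query = [], []
    for t in types:
        picked = random.sample(dataset[t], k_shot + q_query // n_way)
        support += [(s, t) for s in picked[:k_shot]]
        query += [(s, t) for s in picked[k_shot:]]
    return support, query
```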
  • Model parameters are parameters that use general variables to establish the relationship between functions and variables; in artificial neural networks, model parameters are usually real number matrices.
  • Cloud computing is a computing model that enables various application systems to obtain computing power, storage space and information services as needed by distributing computing tasks on the resource pools of a large number of computing institutions.
  • the network that provides resources is called “cloud”
  • the resources in “cloud” can be expanded infinitely from the user's point of view, and can be obtained at any time, used on demand, expanded at any time, and paid according to use.
  • Cloud platform: as the basic capability provider of cloud computing, a cloud computing resource pool platform is established, referred to as a cloud platform, generally known as Infrastructure as a Service (IaaS).
  • various types of virtual resources are deployed in the resource pool for external customers to choose and use; the cloud computing resource pool includes: computer equipment (which can be a virtualized machine, including an operating system), storage equipment, and network equipment.
  • FIG. 1 is a schematic diagram of an exemplary application scenario of a model-based data processing method provided by an embodiment of the present application; referring to FIG. 1, a terminal (exemplarily showing a terminal 10-1 and a terminal 10-2) is provided with Clients capable of performing different functions, wherein the client on the terminal utilizes different business processes to send a video playback request to the server 200 (referred to as an electronic device for data processing based on the model) through the network 300, so as to obtain from the corresponding Obtain different videos in the server 200 to browse, and the terminal connects to the server 200 through the network 300.
  • the network 300 can be a wide area network or a local area network, or a combination of the two, and uses a wireless link to realize data transmission; the types of video that the terminal obtains from the corresponding server 200 through the network 300 can be different.
  • the terminal can obtain a video that carries video information or a corresponding video link from the corresponding server 200 through the network 300, or it can obtain a corresponding video that only includes text or images from the server 200 through the network 300.
  • Different types of videos can be stored in the server 200; in the embodiment of the present application, no distinction is made between the compilation environments of different types of videos.
  • the motion recognition model can be used to judge whether the video pushed to the user's client is a copyright-compliant video; in addition, the motion recognition model can also be used to identify the video to form an action preview barrage or an action preview in the progress bar information.
  • the action recognition model provided by this application can be applied to short video playback.
  • In short video playback, different short videos from different sources are usually processed and finally presented on the user interface (UI); if the recommended video is a pirated video or another copyright-incompatible video, the user experience will be directly affected.
  • the background database for video playback receives a large number of videos from different sources every day, and the different videos obtained from it for video recommendation to target users can also be called by other applications (for example, migrating the recommendation results of the short video recommendation process to long video recommendation process or news recommendation process), of course, the action recognition model matched with the corresponding target user can also be migrated to different video recommendation processes (for example, web page video recommendation process, small program video recommendation process or long video client video recommended process).
  • the model-based data processing method provided by the embodiment of the present application is realized based on artificial intelligence (AI); artificial intelligence uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use the knowledge to obtain optimal results.
  • artificial intelligence technology is a comprehensive subject that involves a wide range of fields, including both hardware-level technology and software-level technology.
  • the basic technologies of artificial intelligence technology generally include technologies such as sensors, special artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction system and mechatronics.
  • the software technology of artificial intelligence technology includes several major directions such as computer vision technology, speech processing technology, natural language processing technology, and machine learning (Machine learning, ML)/deep learning.
  • the artificial intelligence software technology involved includes machine learning and other directions.
  • machine learning is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and the like; it studies how computers simulate or realize human learning behavior to acquire new knowledge or skills and reorganize the existing knowledge structure so as to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its application pervades all fields of artificial intelligence.
  • Machine learning usually includes technologies such as deep learning (Deep Learning); deep learning includes artificial neural networks (Artificial Neural Network), such as the convolutional neural network (Convolutional Neural Network, CNN), the recurrent neural network (Recurrent Neural Network, RNN), and the deep neural network (Deep Neural Network, DNN).
  • FIG. 2 is a schematic diagram of the composition and structure of the electronic device provided by the embodiment of the present application. It can be understood that FIG. 2 shows an exemplary structure of the server rather than the entire structure, and part or all of the structure shown in FIG. 2 may be implemented as required.
  • the electronic device includes: at least one processor 201 , a memory 202 , a user interface 203 and at least one network interface 204 .
  • Various components in the electronic device 20 are coupled together through the bus system 205 .
  • the bus system 205 is used to realize connection and communication between these components.
  • the bus system 205 also includes a power bus, a control bus and a status signal bus.
  • the various buses are labeled as bus system 205 in FIG. 2 .
  • the user interface 203 may include a display, a keyboard, a mouse, a trackball, a click wheel, keys, buttons, a touch panel or a touch screen, and the like.
  • the memory 202 may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memories.
  • the memory 202 in the embodiment of the present application can store data to support the operation of the terminal (such as the terminal 10-1). Examples of such data include: any computer programs, such as operating systems and application programs, for operating on a terminal such as terminal 10-1.
  • the operating system includes various system programs, such as framework layer, core library layer, driver layer, etc., for realizing various basic services and processing tasks based on hardware.
  • Applications can contain various applications.
  • the model-based data processing device provided by the embodiment of the present application can be realized by combining software and hardware.
  • As an example, the model-based data processing device provided by the embodiment of the present application can be implemented by a processor in the form of a hardware decoding processor, which is programmed to execute the model-based data processing method provided by the embodiment of the present application.
  • a processor in the form of a hardware decoding processor can adopt one or more application-specific integrated circuits (ASIC, Application Specific Integrated Circuit), digital signal processor (DSP, Digital Signal Processor), programmable logic device (PLD, Programmable Logic Device), Complex Programmable Logic Device (CPLD, Complex Programmable Logic Device), Field Programmable Gate Array (FPGA, Field-Programmable Gate Array) or other electronic components.
  • the model-based data processing device provided by the embodiment of the present application can be directly embodied as a combination of software modules executed by the processor 201; the software modules may be located in a storage medium, the storage medium is located in the memory 202, and the processor 201 reads the executable instructions included in the software modules in the memory 202 and, in combination with necessary hardware (for example, including the processor 201 and other components connected to the bus 205), completes the model-based data processing method provided by the embodiment of the present application.
  • the processor 201 may be an integrated circuit chip with signal processing capabilities, such as a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, where the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the memory 202 in the embodiment of the present application is used to store various types of data to support the operation of the electronic device 20 .
  • Examples of these data include: any executable instructions for operating on the electronic device 20 , where the program for implementing the model-based data processing method of the embodiment of the present application may be included in the executable instructions.
  • the model-based data processing device provided by the embodiment of the present application can be realized by software.
  • FIG. 2 shows a model-based data processing device 2020 stored in the memory 202, which may be software in the form of a program, a plug-in, or the like, and includes a series of modules; as an example of a program stored in the memory 202, the model-based data processing device 2020 includes the following software modules: a sample acquisition module 2081, a feature extraction module 2082, a timing processing module 2083, a model training module 2084 and a model application module 2085.
  • When the software modules in the model-based data processing device 2020 are read into the RAM by the processor 201 and executed, the model-based data processing method provided by the embodiment of the present application will be implemented. The function of each software module in the model-based data processing device 2020 is introduced below.
  • the sample acquisition module 2081 is configured to extract the first training sample set to obtain a second training sample set and query video, wherein the first training sample set includes different types of video samples;
  • the feature extraction module 2082 is configured to process the second training sample set through the embedding layer network in the action recognition model to obtain the first frame feature sequence, and to process the query video through the embedding layer network to obtain the second frame feature sequence;
  • the timing processing module 2083 is configured to process the first frame feature sequence through the temporal relationship network in the action recognition model to obtain a first temporal relationship descriptor, and to process the second frame feature sequence through the temporal relationship network to obtain a second temporal relationship descriptor;
  • the model training module 2084 is configured to adjust the model parameters of the action recognition model according to the first temporal relationship descriptor and the second temporal relationship descriptor, and the adjusted action recognition model is used to recognize actions in the video to be recognized.
  • the sample acquisition module 2081 is further configured to determine the use environment identifier of the action recognition model; according to the use environment identifier, determine historical data that matches the use environment of the action recognition model ; Different types of video samples screened from the historical data are used as the first training sample set.
  • the sample acquisition module 2081 is further configured to extract N types of video information from the first training sample set, where N is a positive integer; extract from each type of video information K video samples, wherein K is a positive integer; all video samples of the N types are combined to obtain the second training sample set, wherein all video samples of the N types include N*K video samples; at least one video sample is extracted from the unextracted video information of the N types of video information, and the extracted at least one video sample is used as the query video.
  • the feature extraction module 2082 is further configured to extract each type of video frame set from the second training sample set through the embedding layer network in the action recognition model, and to extract the first frame-level feature vector corresponding to the video frame set; determine the first channel number corresponding to the first frame-level feature vector; based on the first channel number, determine the first frame-level feature vector set corresponding to the first frame-level feature vector and the similarity matrix matched with the first frame-level feature vector set; fuse the first frame-level feature vector set and the similarity matrix to obtain a second frame-level feature vector set; and obtain the first frame feature sequence by linearly transforming the second frame-level feature vector set.
  • the feature extraction module 2082 is further configured to extract a third frame-level feature vector from the query video through the embedding layer network; determine the second channel number corresponding to the third frame-level feature vector; based on the second channel number, determine a third frame-level feature vector set corresponding to the third frame-level feature vector, and perform a linear transformation on the third frame-level feature vector set to obtain the second frame feature sequence corresponding to the query video.
  • the feature extraction module 2082 is further configured to obtain the downsampling result of the video frame set; normalize the downsampling result through the fully connected layer of the embedded layer network , and perform depth decomposition on the normalized results of different image frames in the video frame set to obtain the first frame-level feature vector.
  • the feature extraction module 2082 is further configured to determine the number of video frames, the number of feature channels, the video frame height, and the video frame width corresponding to the first frame feature sequence; and, according to the number of video frames, the number of feature channels, the video frame height, and the video frame width corresponding to the first frame feature sequence, perform spatio-temporal motion enhancement on each frame of video in the first frame feature sequence, where the spatio-temporal motion enhancement is used to enhance the motion features of each frame of video in the first frame feature sequence.
  • the feature extraction module 2082 is further configured to determine the number of video frames, the number of video channels, the video frame height, and the video frame width corresponding to the second frame feature sequence; and, according to the video frame number parameter, the video channel parameter, the video frame height parameter, and the video frame width parameter corresponding to the second frame feature sequence, perform spatio-temporal motion enhancement processing on each frame of video in the second frame feature sequence, where the spatio-temporal motion enhancement is used to enhance the motion features of each frame of video in the second frame feature sequence.
  • the timing processing module 2083 is further configured to determine the first frame index parameter of the first frame feature sequence and the different subsequences of the first frame feature sequence; determine, through the temporal relationship network in the action recognition model and using the first frame index parameter, the temporal relationship descriptors corresponding respectively to the different subsequences; and combine the temporal relationship descriptors corresponding to the different subsequences to obtain the first temporal relationship descriptor.
  • the timing processing module 2083 is further configured to determine the second frame index parameter of the second frame feature sequence; through the timing relationship network and using the second frame index parameter, determine The second temporal relationship descriptor.
  • the model training module 2084 is further configured to compare the first temporal relationship descriptor with the second temporal relationship descriptor to obtain the similarity between the first temporal relationship descriptor and the second temporal relationship descriptor; determine, according to the similarity between the first temporal relationship descriptor and the second temporal relationship descriptor, the weight parameters of different types of temporal relationship descriptors in the first temporal relationship descriptor; determine the sample prototypes of different types of video samples according to the weight parameters of the temporal relationship descriptors; calculate the metric score between the query video and the sample prototype of each type of video sample; determine the type of the video sample corresponding to the maximum metric score as the small-sample action type corresponding to the query video; and adjust the model parameters of the action recognition model based on the small-sample action type.
  • the training device further includes a model application module 2085 configured to determine a video frame sequence to be recognized in the video to be recognized; perform action recognition on the video frame sequence through the adjusted action recognition model to obtain an action recognition result; determine a copyright video corresponding to the video to be identified; based on the action recognition result, determine a set of inter-frame similarity parameters corresponding to the video to be identified and the copyright video; obtain the number of video frames reaching a similarity threshold in the inter-frame similarity parameter set; and determine the similarity between the video to be identified and the copyrighted video based on the number of video frames.
  • the model application module 2085 is further configured to acquire the Copyright information of the video to be identified; obtaining a comparison result between the copyright information of the video to be identified and the copyright information of the copyrighted video, the comparison result is used to determine the compliance of the video to be identified; when the comparison result When it indicates that the copyright information of the video to be identified is inconsistent with the copyright information of the copyrighted video, a warning message is generated.
  • the model application module 2085 is further configured to, when it is determined, based on the similarity between the video to be recognized and the copyrighted video, that the video to be recognized is not similar to the copyrighted video, determine the video to be identified as a video to be recommended in the video source, wherein the video to be recommended carries a small-sample action recognition result; sort the recall order of all the videos to be recommended in the video source; and determine the recommended video corresponding to the target recall order.
  • an embodiment of the present application further provides a computer program product, where the computer program product includes computer instructions or computer programs, and the computer instructions or computer programs are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instruction or computer program from the computer-readable storage medium, and the processor executes the computer instruction or computer program, so that the electronic device executes the model-based data processing method provided by the embodiment of the present application.
  • FIG. 3 is an optional schematic flow chart of the model-based data processing method provided by the embodiment of the present application; an optional flow of the model-based data processing method is executed by the electronic device that trains the action recognition model; understandably, the steps shown in FIG. 3 can be performed by various electronic devices that run the model-based data processing device to perform model-based data processing, for example, a dedicated terminal with a video processing function, a server, or a server cluster.
  • the model-based data processing method provided by the embodiment of the present application can be used for the training of non-real-time action recognition models, such as content analysis (including various video types such as TV dramas, movies, and short videos), action recognition of target characters, etc. .
  • Step 301 Obtain a first training sample set.
  • the first training sample set includes different types of video samples acquired through historical data.
  • The use environment identifier of the small-sample action recognition model can be determined first; according to the use environment identifier, historical data that matches the use environment of the action recognition model is determined; and different types of video samples screened from the historical data are used as the first training sample set. Since the source of the videos in the first training sample set is uncertain (they can be video resources on the Internet or local video files saved by an electronic device), the acquisition of small-sample actions can be realized by obtaining historical data that matches the usage environment. FIG. 4 is an optional schematic diagram of extracting small-sample action video frames in the embodiment of the present application.
  • As shown in Figure 4, different target objects appear in the video frames displayed along the time axis during video playback; by identifying the target object in the video frames, the area where the target object is located in different video frames of the video to be recognized can be determined. The three different short videos shown in Figure 4 respectively contain action 4-1 "playing badminton", action 4-2 "playing table tennis", and action 4-3 "playing football"; through the action recognition model trained by the model-based data processing method provided in the embodiment of this application, action 4-1 "playing badminton", action 4-2 "playing table tennis", and action 4-3 "playing football" appearing in the three different short videos can be recognized respectively.
  • Based on the action recognition results of the target object, it can be determined whether the video to be recognized complies with regulations or meets the requirements of copyright information, so as to prevent videos uploaded by users from being pirated and to prevent the recommendation and playback of infringing videos.
  • Step 302 Extract the first training sample set to obtain the second training sample set and query video.
  • the number of videos and the number of video types in the second training sample set are at least one, for example, a random number can be used to determine the number of videos or the number of video types;
  • the number of query videos is at least one; here, N types of video information can be extracted from the first training sample set, and K video samples are extracted from each type of video information; all video samples of the N types are combined to obtain the second training sample set; and at least one video sample is extracted from the unextracted video information of the N types of video information, and the extracted at least one video sample is used as the query video; wherein N is a positive integer and K is a positive integer.
  • the N-Way K-Shot training method can be used to train the action recognition model.
  • N types are extracted from the video types of the training data, and K video samples are extracted for each type; the N*K video samples constitute the second sample set. Then one or more video samples are selected from the remaining video samples corresponding to the N types as the query video.
  • each video sample in the second sample set and the query video is sparsely sampled: the video sequence is divided into T segments, and one frame is extracted from each segment as the segment summary, so each video sample is represented by a sequence of T frames (a minimal sampling sketch is given after this step).
  • the T-frame sequence is input into the embedding layer network for frame feature extraction processing and motion enhancement processing, which will be described later.
  • N and K are positive integers, and all video samples of N types include N*K video samples.
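  • The following is a minimal sketch of the T-segment sparse sampling described above; the value T=8 and the helper name sample_segments are illustrative assumptions.

```python
# A minimal sketch: a video of F frames is split into T equal segments and one frame
# index is drawn from each segment, so the video is represented by a T-frame sequence.
import random

def sample_segments(num_frames, t=8):
    bounds = [round(i * num_frames / t) for i in range(t + 1)]
    return [random.randrange(bounds[i], max(bounds[i] + 1, bounds[i + 1]))
            for i in range(t)]

print(sample_segments(120))   # e.g. [7, 22, 33, 51, 64, 78, 95, 113]
```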
  • Step 303 Process the second training sample set through the embedding layer network in the action recognition model to obtain the feature sequence of the first frame.
  • processing the second training sample set (that is, feature extraction processing) to obtain the first frame feature sequence can be achieved in the following manner: through the embedding layer network in the action recognition model, extract each type of video frame set from the second training sample set, and extract the first frame-level feature vector corresponding to the video frame set; determine the first channel number corresponding to the first frame-level feature vector; based on the first channel number, determine the first frame-level feature vector set corresponding to the first frame-level feature vector, and the similarity matrix matching the first frame-level feature vector set; fuse the first frame-level feature vector set and the similarity matrix to obtain a second frame-level feature vector set; and obtain the first frame feature sequence by linearly transforming the second frame-level feature vector set.
  • For a group of video frames in the second sample set (called a T-frame sequence, including the small-sample frames corresponding to each video sample of each type), a feature extraction network can be used to extract frame-level features F on the T frames.
  • F_i ∈ F represents the frame-level feature extracted from the i-th frame. Since each feature in F has d channels (called the first channel number), each feature in F can be expanded by channel, and T*d channel-level features F^c = {F_1^c, F_2^c, ..., F_{T*d}^c} can be obtained.
  • A similarity matrix s^F of F^c is calculated to represent the apparent similarity between the features in F^c. Then, for the i-th feature F_i^c in F^c, all the features in F^c are fused according to s^F to generate the corresponding enhanced feature F_i^e.
  • The generated enhanced features can be expressed as F^e = {F_1^e, F_2^e, ..., F_{T*d}^e}, where the i-th enhanced feature in F^e is calculated by Equation 1: F_i^e = F_i^c + Σ_{j=1}^{T*d} s_{i,j}^F · φ(F_j^c).
  • φ(·) represents a linear transformation function implemented by a fully connected layer; s_{i,j}^F represents the apparent similarity between F_i^c and F_j^c and is calculated as Equation 2: s_{i,j}^F = exp(a_{i,j}) / Σ_{j=1}^{T*d} exp(a_{i,j}).
  • In Equation 2, exp is the activation function and a_{i,j} is the result of the dot product between θ(F_i^c) and ψ(F_j^c), as shown in Equation 3: a_{i,j} = θ(F_i^c) · ψ(F_j^c).
  • θ(·) and ψ(·) are two linear transformation functions that have the same function as φ(·).
  • In this way, the information in the i-th feature is propagated to the other features in F^e, so each feature in F^e can obtain frame-level information from other frames, making the obtained features rich in information.
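  • The following is a minimal PyTorch sketch of the channel-level feature fusion described by Equations 1 to 3. The module name, the 128-dimensional feature size, and the treatment of each channel-level feature as a flat vector are illustrative assumptions, not the patent's exact implementation.

```python
# A minimal sketch: expand frame-level features along the channel dimension, compute an
# apparent-similarity matrix with learned linear transforms, and enhance each feature by
# a similarity-weighted sum of all the others (Equations 1-3).
import torch
import torch.nn as nn

class ChannelLevelFusion(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.theta = nn.Linear(dim, dim)   # θ(·)
        self.psi = nn.Linear(dim, dim)     # ψ(·)
        self.phi = nn.Linear(dim, dim)     # φ(·)

    def forward(self, f):                              # f: (T*d, dim) channel-level features F^c
        a = self.theta(f) @ self.psi(f).t()            # Equation 3: pairwise dot products
        s = torch.softmax(a, dim=-1)                   # Equation 2: apparent similarity s^F
        return f + s @ self.phi(f)                     # Equation 1: enhanced features F^e

# usage with assumed sizes: T=8 frames, d=4 channel groups, 128-dim features
fusion = ChannelLevelFusion(dim=128)
f_e = fusion(torch.randn(8 * 4, 128))
print(f_e.shape)   # torch.Size([32, 128])
```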
  • Step 304 Process the query video through the embedding layer network to obtain the feature sequence of the second frame.
  • the third frame-level feature vector can be extracted from the query video through the embedding layer network; the second channel number corresponding to the third frame-level feature vector is determined; based on the second channel number, a third frame-level feature vector set corresponding to the third frame-level feature vector is determined, and a linear transformation is performed on the third frame-level feature vector set to obtain the second frame feature sequence corresponding to the query video.
  • Feature extractors such as the deep residual network ResNet can also be directly used to extract frame-level features from the video frame sequence.
  • For example, the video frame image features of a short video can be extracted by a convolutional neural network pre-trained based on the deep residual network ResNet50, which extracts the video frame image information of the short video into a 2048-dimensional feature vector.
  • ResNet is beneficial to the representation of video frame image information of short videos in image feature extraction.
  • the video frame image information of the short video has great eye appeal before users watch it, and a reasonable and appropriate video frame image of the short video can greatly improve the click-through rate of the video playback.
  • VLAD Vector of Locally Aggregated Descriptors
  • video frame information reflects the specific content and video quality of the video, and is directly related to the user's viewing time.
  • the method of obtaining the frame-level feature vector can be flexibly configured according to different usage requirements.
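  • The following is a minimal sketch, assuming torchvision is available, of turning each sampled video frame into a 2048-dimensional feature vector with a pretrained ResNet-50; the exact backbone weights and preprocessing used by the application are not specified, so these are assumptions.

```python
# A minimal sketch: drop the classifier head of a pretrained ResNet-50 and keep the
# 2048-dimensional pooled feature for each frame.
import torch
import torch.nn as nn
from torchvision import models

weights = models.ResNet50_Weights.IMAGENET1K_V2
backbone = models.resnet50(weights=weights)
backbone.fc = nn.Identity()        # keep the 2048-d pooled feature
backbone.eval()

preprocess = weights.transforms()  # resize/crop/normalize expected by the weights

@torch.no_grad()
def frame_features(frames):
    """frames: list of PIL.Image video frames -> (T, 2048) feature tensor."""
    batch = torch.stack([preprocess(f) for f in frames])
    return backbone(batch)
```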
  • Step 305 Process the feature sequence of the first frame through the temporal relationship network in the action recognition model to obtain the first temporal relationship descriptor.
  • the acquired frame-level feature vectors (called the first frame feature sequence) are processed for spatio-temporal motion enhancement.
  • the embedding layer network of the action recognition model includes a feature extractor and a spatio-temporal motion enhancement (for example, STME) module, and the embedding layer network of the action recognition model is used to map the input video into a new feature space, so that the temporal relationship network can continue processing.
  • the number of video frames, the number of feature channels, the video frame height, and the video frame width corresponding to the first frame feature sequence can be determined; according to the number of video frames, the number of feature channels, the video frame height, and the video frame width, spatio-temporal motion enhancement processing is performed on each frame of video in the first frame feature sequence, so as to enhance the motion feature of each frame of video in the first frame feature sequence.
  • the motion information can be measured by the content displacement of two consecutive frames
  • the information from all spatio-temporal content displacement positions is used to enhance the motion of each region of the sample feature information. For example, given an input feature S ∈ R^{T×C×H×W} (the first frame feature sequence), T refers to the number of video frames, C refers to the number of feature channels, and H and W refer to the video frame height and the video frame width respectively.
  • The mapped feature content displacement can be expressed as Equation 4: d(t) = conv_2(S_{t+1}) − conv_3(S_t), where d(t) ∈ R^{T×C/k×H×W} represents the content displacement information at time t, k is the reduction ratio of the number of feature channels (for example, 8), conv_2 and conv_3 are respectively two 1*1*1 spatio-temporal convolutions, S_{t+1} represents the frame feature of frame t+1 in S, and S_t represents the frame feature of frame t in S.
  • The temporal self-attention of each position in the motion matrix can be calculated by Equation 5: a_{p,ji} = softmax_i(D_{p,j} · (D_{p,i})^T), where a_{p,ji} represents the correlation between position p in D on the j-th frame and on the i-th frame, D_{p,j} represents the feature content displacement of position p in D on the j-th frame, D_{p,i} represents the feature content displacement of position p in D on the i-th frame, and (·)^T represents the transposition operation.
  • Next, the attention mechanism is applied on conv_1(S) to obtain the transformed feature map of S in the conv_1(S) feature space, where conv_1 is a 1*1*1 spatio-temporal convolution: V_{p,j} = S_{p,j} + Σ_i a_{p,ji} · conv_1(S)_{p,i}, where S_{p,i} and S_{p,j} represent the information of position p in S in the i-th frame and the j-th frame respectively, and V_{p,j} represents the information of position p in the j-th frame after enhancement. The final output of the spatio-temporal motion enhancement module is the motion-enhanced frame feature V, V ∈ R^{T×C×H×W}.
  • the number of video frames, the number of video channels, the video frame height, and the video frame width corresponding to the second frame feature sequence can also be determined; according to the number of video frames, the number of video channels, the video frame height, and the video frame width corresponding to the second frame feature sequence, spatio-temporal motion enhancement processing is performed on each frame of video in the second frame feature sequence, so as to enhance the motion feature of each frame of video in the second frame feature sequence. At this point, after the spatio-temporal motion enhancement processing, each frame feature in V has been motion-enhanced. After the motion enhancement processing, step 305 is performed based on the motion-enhanced first frame feature sequence and the motion-enhanced second frame feature sequence to calculate the temporal relationship descriptors corresponding to the segments.
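  • The following is a minimal PyTorch sketch of the spatio-temporal motion enhancement idea above: frame-to-frame content displacement via 1*1*1 convolutions, a temporal self-attention per spatial position over the displacement, and a residual add. Layer names, tensor layout, and the reduction ratio k are assumptions, not the patent's exact design.

```python
# A minimal sketch of an STME-style block under assumed shapes.
import torch
import torch.nn as nn

class STME(nn.Module):
    def __init__(self, channels, k=8):
        super().__init__()
        self.conv1 = nn.Conv3d(channels, channels, kernel_size=1)
        self.conv2 = nn.Conv3d(channels, channels // k, kernel_size=1)
        self.conv3 = nn.Conv3d(channels, channels // k, kernel_size=1)

    def forward(self, s):                      # s: (T, C, H, W)
        x = s.permute(1, 0, 2, 3).unsqueeze(0) # (1, C, T, H, W) for Conv3d
        v1 = self.conv1(x)
        d = self.conv2(x)[:, :, 1:] - self.conv3(x)[:, :, :-1]   # Equation 4 (T-1 steps)
        d = torch.cat([d, torch.zeros_like(d[:, :, :1])], dim=2) # pad back to T frames
        dq = d.flatten(3).permute(0, 3, 2, 1)                    # (1, H*W, T, C/k)
        attn = torch.softmax(dq @ dq.transpose(-1, -2), dim=-1)  # Equation 5: (1, H*W, T, T)
        v = v1.flatten(3).permute(0, 3, 2, 1)                    # (1, H*W, T, C)
        out = (attn @ v).permute(0, 3, 2, 1).reshape_as(x)
        return (x + out).squeeze(0).permute(1, 0, 2, 3)          # residual, back to (T, C, H, W)

stme = STME(channels=64)
print(stme(torch.randn(8, 64, 14, 14)).shape)   # torch.Size([8, 64, 14, 14])
```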
  • the g_φ^(n) function is used to learn the corresponding temporal relationship from n-frame subsequences.
  • the g_φ^(n) function is implemented by a fully connected layer, which maps an n-frame subsequence into a vector.
  • l groups of temporal relationships can be accumulated to obtain the final temporal relationship descriptor R_n (called the first temporal relationship descriptor). Since a temporal relationship needs to be captured from at least two frames, the minimum value of n is 2.
  • temporal relationships can be captured at multiple time scales.
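  • The following is a minimal PyTorch sketch of computing multi-scale temporal relationship descriptors as described above: for each scale n from 2 to T, l ordered n-frame subsequences are drawn and mapped by a per-scale fully connected layer g_φ^(n), then accumulated into R_n. The sizes, the number of groups l, and the random subsequence choice are illustrative assumptions.

```python
# A minimal sketch of accumulating temporal relationship descriptors R_n at scales 2..T.
import itertools
import random
import torch
import torch.nn as nn

class TemporalRelation(nn.Module):
    def __init__(self, num_frames, feat_dim, out_dim, num_groups=3):
        super().__init__()
        self.T, self.l = num_frames, num_groups
        # one g_phi^(n) per scale n = 2 .. T
        self.g = nn.ModuleDict({
            str(n): nn.Sequential(nn.Linear(n * feat_dim, out_dim), nn.ReLU())
            for n in range(2, num_frames + 1)
        })

    def forward(self, frames):                      # frames: (T, feat_dim)
        descriptors = {}
        for n in range(2, self.T + 1):
            tuples = list(itertools.combinations(range(self.T), n))
            chosen = random.sample(tuples, min(self.l, len(tuples)))
            r_n = sum(self.g[str(n)](frames[list(idx)].flatten()) for idx in chosen)
            descriptors[n] = r_n                    # R_n: accumulated n-frame relation
        return descriptors                          # {n: R_n} for n = 2 .. T

trn = TemporalRelation(num_frames=8, feat_dim=2048, out_dim=256)
descs = trn(torch.randn(8, 2048))
print({n: d.shape for n, d in descs.items()})
```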
  • Step 306 Process the second frame feature sequence through the temporal relationship network to obtain a second temporal relationship descriptor.
  • the second frame index parameter of the second frame feature sequence may be determined; and the second temporal relationship descriptor is determined through the temporal relationship network and using the second frame index parameter.
  • the acquisition process of the second temporal relationship descriptor is similar to the acquisition process of the first temporal relationship descriptor, and will not be described repeatedly in this embodiment of the present application.
  • Step 307 Adjust the model parameters of the action recognition model according to the first temporal relationship descriptor and the second temporal relationship descriptor, and the adjusted action recognition model is used to identify actions in the video to be recognized.
  • the model parameters of the action recognition model are adjusted so that actions in the video can be recognized by the adjusted action recognition model. The process of adjusting the model parameters can be realized in the following manner: compare the first temporal relationship descriptor with the second temporal relationship descriptor to obtain the similarity between the first temporal relationship descriptor and the second temporal relationship descriptor; determine, according to the similarity, the weight parameters of different types of temporal relationship descriptors in the first temporal relationship descriptor; determine the sample prototypes of the different types of video samples according to the weight parameters of the temporal relationship descriptors; calculate the metric score between the query video and the sample prototype of each type of video sample; determine the type of video sample corresponding to the maximum metric score as the small-sample action type corresponding to the query video; and adjust the model parameters of the action recognition model based on the small-sample action type.
  • each new type of learning is task-related, thus, a corresponding attention prototype can be generated for each task.
  • the discriminative power of the temporal relationship descriptor of each video sample is measured by its similarity with the second temporal relationship descriptor of the query video, calculated by the cosine similarity function g; therefore, the weighted prototype can be corrected according to the discriminative power of the temporal relationship descriptor of each video sample.
  • Assume the temporal relationship descriptors corresponding to the h-th type (1 ≤ h ≤ N) are {x_{h1}, x_{h2}, ..., x_{hK}}, where K represents the number of video samples of the h-th type; for the weight of the temporal relationship descriptor of each video sample, refer to Formula 8.
  • the set of weighted descriptors for n frames of all video samples of type h constitutes the final type prototype for n frames of type h.
  • The comparison process can be expressed by Formula 10, where P_φ(h_pre = h | q) is the similarity between the prototype q_n of the query video and the type prototype of n frames of the second training sample set.
  • the sum of the similarities between the prototype q_n of the query video and the type prototypes of each group (groups 2 to T) is the metric score of the type, and the type corresponding to the highest metric score is the predicted type.
  • When the metric score of the sample prototype of a type of video sample is the highest, the type corresponding to the highest metric score is determined as the small-sample action type corresponding to the query video, and the model parameters of the action recognition model are adjusted based on the small-sample action type corresponding to the query video, so as to complete the training of the action recognition model and to recognize the actions in the video through the trained action recognition model.
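  • The following is a minimal PyTorch sketch of the attention-weighted prototypes and metric scoring described above: per-sample temporal relationship descriptors are weighted by cosine similarity to the query descriptor, aggregated into one type prototype per scale n, and the query is assigned to the type whose summed similarity over scales 2..T is highest. The softmax weighting and the dictionary layout are assumptions.

```python
# A minimal sketch of weighted prototypes and metric-score prediction.
import torch
import torch.nn.functional as F

def predict_type(support, query):
    """support: dict {n: (N, K, D) descriptors}, query: dict {n: (D,) descriptor}."""
    scores = None
    for n, x in support.items():                                   # one scale n at a time
        q = query[n]
        w = F.softmax(F.cosine_similarity(x, q, dim=-1), dim=-1)   # (N, K) sample weights
        proto = (w.unsqueeze(-1) * x).sum(dim=1)                   # (N, D) type prototypes
        sim = F.cosine_similarity(proto, q, dim=-1)                # (N,) per-scale scores
        scores = sim if scores is None else scores + sim
    return int(scores.argmax())                                    # index of the predicted type

support = {n: torch.randn(5, 5, 256) for n in range(2, 9)}         # 5-way 5-shot, scales 2..8
query = {n: torch.randn(256) for n in range(2, 9)}
print(predict_type(support, query))
```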
  • Referring to FIG. 5, FIG. 5 is a schematic diagram of another optional flow of the model-based data processing method provided by this embodiment of the present application. It can be understood that the steps shown in FIG. 5 can be executed by various servers running video processing functions, where the video processing function is implemented by deploying the trained action recognition model in the server, so as to identify the similarity of uploaded videos and then perform compliance identification on the copyright information of the videos.
  • Of course, before the trained action recognition model is deployed, the training process of the action recognition model is also performed. The training process of the action recognition model includes steps 501 to 506, and each step is described separately below.
  • Step 501 Obtain a first training sample set, where the first training sample set consists of video samples with noise obtained from historical data.
  • Step 502 Perform denoising processing on the first training sample set to form a corresponding second training sample set.
  • Step 503 Process the second training sample set through the action recognition model to determine initial parameters of the action recognition model.
  • Step 504 In response to the initial parameters of the action recognition model, process the second training sample set through the action recognition model to obtain updated parameters of the action recognition model.
  • It should be noted that different video samples in the second training sample set can be substituted into the loss function corresponding to the action recognition model, and the updated parameters of the action recognition model are obtained when the loss function satisfies the corresponding convergence condition; the convergence condition may be reaching an accuracy threshold, a training-iteration threshold, a training-duration threshold, or a combination of these, which is not limited in this embodiment of the present application.
  • Step 505 According to the updated parameters of the action recognition model, iteratively update the network parameters of the action recognition model through the second training sample set.
  • During training of the action recognition model, a loss function such as cross entropy drives the model toward the correct predictions until the loss function satisfies the corresponding convergence condition.
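  • Steps 503 to 505 can be sketched as an episodic (N-Way K-Shot) training loop. The sketch below is only an assumed outline: model, sample_episode, the optimizer and its settings are placeholders for illustration, not the embodiment's actual implementation.

```python
import torch
import torch.nn.functional as F

def train_episodes(model, sample_episode, num_episodes=1000, lr=1e-3):
    """Hypothetical episodic training loop with a cross-entropy loss.

    model:          maps (support_videos, query_video) to per-type metric scores.
    sample_episode: callable returning (support_videos, query_video, query_label)
                    drawn from the second training sample set.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(num_episodes):
        support, query, label = sample_episode()
        scores = model(support, query)              # [N] metric scores
        loss = F.cross_entropy(scores.unsqueeze(0), # scores treated as logits
                               torch.tensor([label]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        # a convergence condition (accuracy, iteration count or training
        # duration threshold) would be checked here to stop training
    return model
```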
  • In some embodiments of the present application, the embedding layer network in the action recognition model may also use the ResNet-101 model or a lightweight network model (for example, the ResNext-101 model). The ResNext-101 model uses user-tagged images from social applications as a pre-training data set, which reduces the resource consumption of obtaining data labels and improves the efficiency of label acquisition; moreover, with fine-tuning during training, the performance of the model can exceed the state-of-the-art (SOTA) level of baseline models (for example, the ImageNet model), which broadens the scope of application of the action recognition model.
  • Step 506 Deploy the trained action recognition model (called adjusted action recognition model).
  • In this embodiment of the present application, the deployed trained action recognition model (which can, for example, be deployed in a server or cloud server of the video client operator) performs the corresponding action recognition, so as to recognize the videos uploaded by users.
  • Referring to FIG. 6, FIG. 6 is a schematic diagram of an optional process of video similarity judgment in this embodiment of the present application; as shown in FIG. 6, this optional process of video similarity judgment includes steps 601 to 607, and each step is described separately below.
  • Step 601 Determine the copyright video corresponding to the video to be identified.
  • Step 602 Perform action recognition on the video to be identified through the adjusted action recognition model to obtain an action recognition result.
  • Step 603 Based on the action recognition result, determine a set of inter-frame similarity parameters corresponding to the video to be identified and the copyrighted video.
  • Step 604 Determine the number of image frames reaching the similarity threshold based on the inter-frame similarity parameter set, and determine the similarity between the video to be identified and the copyrighted video based on the number of image frames.
  • Step 605 Based on the similarity between the video to be identified and the copyrighted video and the set similarity threshold, determine whether the video to be identified is similar to the copyrighted video; if so, perform step 606; otherwise, perform step 607.
  • Step 606 Determine that the video to be identified is similar to the copyrighted video.
  • It should be noted that when the video to be identified is determined to be similar to the copyrighted video, the copyright information of the video to be identified is obtained; the compliance of the video to be identified is determined by comparing its copyright information with the copyright information of the copyrighted video; when the two are inconsistent, a warning message is issued, and when they are consistent, the video to be identified is determined to be compliant. In this way, by identifying the regions where the video target is located in different video frames of the video to be identified, it can be judged whether the copyrighted video has been pirated.
  • Step 607 Determine that the video to be identified is different from the copyrighted video.
  • It should be noted that when the video to be identified is determined to be dissimilar to the copyrighted video, the video to be identified is added to the video source as a video to be recommended; the recall order of all videos to be recommended in the video source is sorted; and videos are recommended to the target object based on the sorted recall order. In this way, by identifying the regions where the video target is located in different video frames of the video to be identified, the corresponding copyrighted video is determined and recommended to users, enriching users' viewing options.
  • In some embodiments of the present application, identification information corresponding to the video to be identified can also be determined; based on the regions where the video target is located in different video frames of the video to be identified, the degree of matching between the video to be identified and the identification information is determined; and when this matching degree is lower than an alarm threshold, the video to be identified is determined to be compliant. The compliance of the regions where the video target is located is thus identified automatically, which reduces manual participation in the video review process, improves the efficiency of video compliance identification, reduces identification costs, and shortens users' waiting time. A code sketch of the frame-level similarity check of steps 603 to 605 is given below.
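  • The check in steps 603 to 605 amounts to counting how many inter-frame similarities exceed a frame-level threshold and comparing that share against a video-level threshold. The sketch below is a minimal illustration; the threshold figures and function names are assumptions rather than values from the embodiment.

```python
def is_similar_to_copyrighted(frame_sims, frame_threshold=0.9, video_threshold=0.8):
    """Hypothetical frame-level similarity check between a video to be
    identified and a copyrighted video.

    frame_sims: iterable of per-frame similarity values, i.e. the inter-frame
                similarity parameter set from step 603.
    Returns True when the fraction of frames reaching frame_threshold is at
    least video_threshold, so the two videos are judged similar (step 606);
    otherwise the videos are judged different (step 607).
    """
    frame_sims = list(frame_sims)
    if not frame_sims:
        return False
    matching_frames = sum(1 for s in frame_sims if s >= frame_threshold)
    video_similarity = matching_frames / len(frame_sims)
    return video_similarity >= video_threshold
```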
  • It should be noted that, since the number of videos in the video server keeps increasing, the copyright information of videos can be stored in a blockchain network or a cloud server to support the judgment of video similarity; this similarity judgment process can be implemented in combination with cloud technology (Cloud Technology) or blockchain network technology. Cloud technology refers to a hosting technology that unifies hardware, software and network resources in a wide area network or a local area network to realize the calculation, storage, processing and sharing of data; it can also be understood as a general term for network technology, information technology, integration technology, management platform technology and application technology applied on the basis of the cloud computing business model. In addition, because background services such as video websites, image websites and other portal websites require a large amount of computing and storage resources, cloud technology is supported by cloud computing.
  • The following takes the implementation environment of action preview barrages for long videos and action previews in progress bar information as an example to describe the model-based data processing method provided by this embodiment of the present application. Referring to FIG. 7, FIG. 7 is a schematic diagram of a usage scenario of the model-based data processing method provided by this embodiment of the present application. As shown in FIG. 7, terminals (for example, terminal 10-1 and terminal 10-2) are provided with clients capable of playing the corresponding long videos, for example, long-video playback clients or plug-ins; through the corresponding client, a long video carrying barrage information (obtained through a barrage information request) and progress bar information (obtained by triggering a progress bar reminder) can be obtained and displayed; the terminal is connected to the long video server 200-1 (an example of the server 200 in FIG. 1) through the network 300.
  • Of course, users can also upload videos through the terminal for other users in the network to watch; in this process, the operator's video server recognizes the provided videos through the action recognition model, and the recognized actions are used to form action preview barrages or action previews in the progress bar information.
  • Referring to FIG. 8, FIG. 8 is a schematic diagram of an exemplary video recognition process provided by an embodiment of the present application; as shown in FIG. 8, the exemplary video recognition process includes the following steps 801 to 807, and each step is described separately below.
  • Step 801 Extract a second training sample set from video frames of N segments of long videos to be recognized.
  • It should be noted that when the N long videos to be recognized are three long videos, the second training sample set includes at least video frames of action 1 "playing badminton" in the first video, action 2 "playing table tennis" in the second video, and action 3 "playing basketball" in the third video.
  • Step 802 Extract the video frame sequences of the second training sample set and of the query video, respectively, through the embedding layer network in the action recognition model.
  • It should be noted that the video frame sequences include the video frame sequences corresponding to the video samples of the N types (C_1 to C_N) and the video frame sequence of the query video.
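  • The original description notes that each video sample is loosely sampled: the video is divided into T segments and one frame is drawn from each segment as its summary, so that every sample is represented by a T-frame sequence. A minimal sketch of such sampling follows; the segment count and function name are illustrative assumptions.

```python
import random

def sparse_sample(num_frames, t_segments=8):
    """Hypothetical loose sampling: split a video of num_frames frames into
    t_segments roughly equal segments and draw one summary frame index from each."""
    indices = []
    seg_len = num_frames / t_segments
    for k in range(t_segments):
        start = int(k * seg_len)
        end = max(start + 1, int((k + 1) * seg_len))
        indices.append(random.randrange(start, min(end, num_frames)))
    return sorted(indices)
```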
  • Step 803 Using the embedding layer network in the action recognition model, perform spatio-temporal motion enhancement processing on the video frame sequences.
  • the embedding layer network includes a residual network (ResNet) and a spatio-temporal motion enhancement module (STME).
  • It should be noted that the spatio-temporal motion enhancement processing is performed to enhance the motion features of each video frame in the first frame feature sequence.
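  • According to the original description, the STME module measures motion as the content displacement between consecutive frames (Formula 4), computes temporal self-attention over those displacements at every spatial position (Formula 5), and adds the attended features, scaled by a learnable scalar λ, back to the input so that background information is preserved (Formula 6). The sketch below is a simplified reading of those formulas; the channel reduction ratio, the projection back to the original channel count and all layer names are assumptions.

```python
import torch
import torch.nn as nn

class STMESketch(nn.Module):
    """Hypothetical sketch of spatio-temporal motion enhancement (STME)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        c = channels // reduction
        self.conv1 = nn.Conv3d(channels, c, kernel_size=1)  # value projection
        self.conv2 = nn.Conv3d(channels, c, kernel_size=1)
        self.conv3 = nn.Conv3d(channels, c, kernel_size=1)
        self.proj = nn.Conv3d(c, channels, kernel_size=1)    # assumed channel restore
        self.lam = nn.Parameter(torch.zeros(1))              # scalar parameter lambda

    def forward(self, s):
        # s: [B, C, T, H, W] frame features (a batch dimension is added for clarity)
        b, c, t, h, w = s.shape
        # content displacement between consecutive frames (Formula 4); d(T) = 0
        disp = self.conv2(s[:, :, 1:]) - self.conv3(s[:, :, :-1])
        disp = torch.cat([disp, torch.zeros_like(disp[:, :, :1])], dim=2)
        v = self.conv1(s)
        # temporal self-attention at every spatial position (Formula 5)
        d = disp.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, -1)   # [BHW, T, c']
        attn = torch.softmax(d @ d.transpose(1, 2), dim=-1)         # [BHW, T, T]
        vv = v.permute(0, 3, 4, 2, 1).reshape(b * h * w, t, -1)
        out = (attn @ vv).reshape(b, h, w, t, -1).permute(0, 4, 3, 1, 2)
        # scale by lambda and keep the original features as background (Formula 6)
        return s + self.lam * self.proj(out)
```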
  • Step 804 Through the temporal relationship network in the action recognition model, process different video frame sequences to obtain corresponding temporal relationship descriptors.
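  • Formula 7 of the original description builds these descriptors by, for each time scale n (2 ≤ n ≤ T), sampling l groups of n temporally ordered frame features, mapping each group to a vector with a fully connected function g_φ(n), and summing the l vectors into R_n; the final descriptor is the set {R_2, ..., R_T}. The sketch below follows that reading; the number of groups per scale, the hidden sizes and the class name are assumptions.

```python
import itertools
import random
import torch
import torch.nn as nn

class TemporalRelationSketch(nn.Module):
    """Hypothetical sketch of multi-scale temporal relationship descriptors."""

    def __init__(self, feat_dim, out_dim, max_frames, groups_per_scale=3):
        super().__init__()
        self.groups_per_scale = groups_per_scale
        # one g_phi per time scale n = 2 .. T, each a fully connected mapping
        self.g_phi = nn.ModuleDict({
            str(n): nn.Sequential(nn.Linear(n * feat_dim, out_dim), nn.ReLU())
            for n in range(2, max_frames + 1)
        })

    def forward(self, frame_feats):
        # frame_feats: [T, D] motion-enhanced frame features of one video
        t = frame_feats.shape[0]
        descriptors = {}
        for n in range(2, t + 1):
            subsets = list(itertools.combinations(range(t), n))  # ordered n-frame subsequences
            chosen = random.sample(subsets, min(self.groups_per_scale, len(subsets)))
            r_n = 0
            for idx in chosen:                                   # Formula 7: sum over l groups
                group = frame_feats[list(idx)].reshape(-1)       # concatenate n frame features
                r_n = r_n + self.g_phi[str(n)](group)
            descriptors[n] = r_n                                 # R_n for this time scale
        return descriptors                                       # X = {R_2, ..., R_T}
```

  • Applying the same module to the support samples and the query video yields, respectively, the first and second temporal relationship descriptors that step 805 uses to adjust the model parameters.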
  • Step 805 Adjust model parameters of the action recognition model according to different temporal relationship descriptors.
  • Step 806 Recognize the actions in the video information through the adjusted action recognition model, and obtain the recognition results of small sample actions in different videos.
  • Step 807 Use the action recognition model to identify the action in the video, and form an action preview barrage or an action preview in the progress bar information based on the identified action.
  • As shown in FIG. 9, the action in the video is recognized by the adjusted action recognition model to form an action preview barrage (the barrage information 9-1 shown in FIG. 9), which can be displayed on the video playback interface.
  • the adjusted action recognition model obtained by the model-based data processing method provided in the embodiment of the present application can robustly and accurately identify small-sample actions in the video.
  • The adjusted action recognition model is tested on several data sets (for example, the MiniKinetics, UCF101 and HMDB51 data sets), and the test results are given in Table 1 and Table 2. Table 1 reports the results obtained by testing baseline model 1 to baseline model 10 and the adjusted action recognition model on the MiniKinetics data set under the one-shot to five-shot learning settings; Table 2 reports the results obtained by testing baseline model 1, baseline model 8, baseline model 10, baseline model 11 and the adjusted action recognition model on the UCF101 and HMDB51 data sets under the one-shot, three-shot and five-shot learning settings. It can be seen from Table 1 and Table 2 that, compared with baseline model 1 to baseline model 10, the adjusted action recognition model provided by this embodiment of the present application achieves the highest recognition accuracy on all three data sets. Table 1 and Table 2 are shown below.
  • In summary, this embodiment of the present application first extracts the second training sample set and the query video as training data from the first training sample set that includes different types of video samples; it then obtains the first temporal relationship descriptor from the first frame feature sequence of the second training sample set and the second temporal relationship descriptor from the second frame feature sequence of the query video; finally, it adjusts the model parameters of the action recognition model according to the first temporal relationship descriptor and the second temporal relationship descriptor. Because the first and second temporal relationship descriptors used in the adjustment characterize the temporal relationships between video frame sequences, and because an action occurs in a video according to a certain temporal order, mining the temporal relationships between video frame sequences and adjusting the parameters of the action recognition model through the temporal relationship descriptors enables the adjusted action recognition model to accurately recognize actions in videos, thereby enhancing the generalization of the model and improving the accuracy of the action recognition model.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Computer Security & Cryptography (AREA)
  • Image Analysis (AREA)

Abstract

一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品,方法包括:对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,第一训练样本集合包括不同类型的视频样本;通过动作识别模型中的嵌入层网络,对第二训练样本集合进行处理,得到第一帧特征序列;通过嵌入层网络,对查询视频进行处理,得到第二帧特征序列;通过动作识别模型中的时序关系网络,对第一帧特征序列进行处理,得到第一时序关系描述子;通过时序关系网络,对第二帧特征序列进行处理,得到第二时序关系描述子;根据第一时序关系描述子和第二时序关系描述子,对动作识别模型的模型参数进行调整。

Description

一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
相关申请的交叉引用
本申请基于申请号为202111087467.0、申请日为2021年09月16日的中国专利申请提出,并要求该中国专利申请的优先权,该中国专利申请的全部内容在此引入本申请作为参考。
技术领域
本申请涉及视频领域中的图像处理技术,尤其涉及一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品。
背景技术
基于深度学习所进行的各视频类型识别,一直以来都是各应用场景下进行大量数据分析的重要工具。例如,在图像、自然语言处理等应用场景中,对大量数据所实现的分类和识别,以此来快速准确的获得相关的分类预测结果,加速所在应用场景的功能实现。但是进行分类和识别的过程中,通常需要对大量数据实现分类和识别,以此来快速准确的获得相关的动作识别结果,但是实际应用中,针对视频中人物的动作,往往难以收集足够的标记样本以供传统机器学习提取运动模式特征,从而容易出现模型过拟合现象,影响动作识别模型的准确度。
发明内容
有鉴于此,本申请实施例提供一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品,能够增强动作识别模型的泛化性,提高动作识别模型的准确度。
本申请实施例的技术方案是这样实现的:
本申请实施例提供了一种基于模型的数据处理方法,包括:
对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,其中,所述第一训练样本集合包括不同类型的视频样本;
通过动作识别模型中的嵌入层网络,对所述第二训练样本集合进行处理,得到第一帧特征序列;
通过所述嵌入层网络,对所述查询视频进行处理,得到第二帧特征序列;
通过所述动作识别模型中的时序关系网络,对所述第一帧特征序列进 行处理,得到第一时序关系描述子;
通过所述时序关系网络,对所述第二帧特征序列进行处理,得到第二时序关系描述子;
根据所述第一时序关系描述子和所述第二时序关系描述子,对所述动作识别模型的模型参数进行调整,调整后的所述动作识别模型用于对待识别视频中的动作进行识别。
本申请实施例还提供了一种基于模型的数据处理装置,包括:
样本获取模块,配置为对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,其中,所述第一训练样本集合包括不同类型的视频样本;
特征提取模块,配置为通过动作识别模型中的嵌入层网络,对所述第二训练样本集合进行处理,得到第一帧特征序列;通过所述嵌入层网络,对所述查询视频进行处理,得到第二帧特征序列;
时序处理模块,配置为通过所述动作识别模型中的时序关系网络,对所述第一帧特征序列进行处理,得到第一时序关系描述子;通过所述时序关系网络,对所述第二帧特征序列进行处理,得到第二时序关系描述子;
模型训练模块,配置为根据所述第一时序关系描述子和所述第二时序关系描述子,对所述动作识别模型的模型参数进行调整,调整后的所述动作识别模型用于对待识别视频中的动作进行识别。
本申请实施例提供了一种用于基于模型进行数据处理的电子设备,所述电子设备包括:
存储器,用于存储可执行指令;
处理器,用于运行所述存储器存储的可执行指令时,实现本申请实施例提供的基于模型的数据处理方法。
本申请实施例提供了一种计算机可读存储介质,存储有可执行指令,所述可执行指令被处理器执行时,实现本申请实施例提供的基于模型的数据处理方法。
本申请实施例提供了一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现本申请实施例提供的基于模型的数据处理方法。
本申请实施例具有以下有益效果:本申请实施例先通过从包括不同类型视频样本的第一训练样本集合中,抽取第二训练样本集合和查询视频作为训练数据,再通过第二训练样本集合的第一帧特征序列获取第一时序关系描述子、以及通过查询视频的第二帧特征序列获取第二时序关系描述子,最后通过根据第一时序关系描述子和第二时序关系描述子,对动作识别模型的模型参数进行调整;由于调整过程中所采用的第一时序关系描述子和第二时序关系描述子表征视频帧序列之间的时序关系,又由于动作的发生在视频中对应一定时序,因此,通过挖掘视频帧序列之间的时序关系并通 过时序关系描述子调整动作识别模型的参数,使得调整后的动作识别模型能够准确地对视频中的动作进行识别,从而,能够增强模型的泛化性,提升动作识别模型的准确度。
附图说明
图1为本申请实施例提供的一种示例性的基于模型的数据处理方法的应用场景示意图;
图2为本申请实施例提供的电子设备的组成结构示意图;
图3为本申请实施例提供的基于模型的数据处理方法的一个可选的流程示意图;
图4为本申请实施例中小样本动作视频帧抽取的一个可选的示意图;
图5为本申请实施例提供的基于模型的数据处理方法的另一个可选的流程示意图;
图6为本申请实施例中视频相似判断的一个可选的过程示意图;
图7为本申请实施例提供的基于模型的数据处理方法的使用场景示意图;
图8为本申请实施例提供的一种示例性的视频识别过程的示意图;
图9为本申请实施例提供的视频目标识别方法的过程示意图。
具体实施方式
为了使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请进行详细描述,所描述的实施例不应视为对本申请的限制,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本申请保护的范围。
在以下的描述中,涉及到“一些实施例”,其描述了所有可能实施例的子集,但是可以理解,“一些实施例”可以是所有可能实施例的相同子集或不同子集,并且可以在不冲突的情况下相互结合。
对本申请实施例进行详细说明之前,对本申请实施例中涉及的名词和术语进行说明,本申请实施例中涉及的名词和术语适用于如下的解释。
1)响应于,用于表示所执行的操作所依赖的条件或者状态,当满足所依赖的条件或状态时,所执行的一个或多个操作可以是实时的,也可以具有设定的延迟;在没有特别说明的情况下,所执行的多个操作不存在执行先后顺序的限制。
2)待识别视频,互联网中可获取的各种形式的视频信息,如客户端或者智能设备中呈现的视频文件、多媒体信息等。
3)客户端,终端中实现特定功能的载体,例如,移动客户端(APP)是移动终端中特定功能的载体,例如,执行线上直播(视频推流)的功能 或者是在线视频的播放功能的客户端。
4)人工神经网络,简称神经网络(Neural Network,NN),在机器学习和认知科学领域,人工神经网络是一种模仿生物神经网络结构和功能的数学模型或计算模型,用于对函数进行估计或近似。
5)下采样,即为对样值序列进行间隔采样,也就是说,在样值序列中,间隔几个样值取样一次,如此得到的新序列就是原序列的下采样结果;例如:对于一幅图像I,尺寸为M*X,对其进行s倍下采样,能够得到尺寸为(M/s)*(X/s)的图像,其中,s是M和X的公约数。
6)元学习(Meta-Learning),也称学会学习(Learning to Learn),是指学习如何学习的过程。传统的机器学习是从头开始学习一个用于预测的数学模型,与人类学习、积累历史经验(也称为元知识)指导新的学习任务的过程相差较远。元学习则是学习不同的机器学习任务的学习训练过程,以及学习如何更快更好地训练一个模型。
7)小样本学习(Few-Shot Learning),用于在少量(低于指定数量)标记样本情况下,快速高效地训练预测模型。小样本学习是元学习在监督学习领域的应用。在本申请实施例中,动作识别模型的训练为小样本学习的过程。
8)小样本学习在分类领域的训练设置信息(N-Way K-Shot),是指在训练阶段,从训练集中抽取N个类型,每个类型对应K个样本,一共N*K个样本,该N*K个样本构成一个元任务,该元任务称为模型的支撑集(Support Set),另外,再从除支撑集之外的剩余数据集中抽取一批样本来作为模型的预测对象(Query Set)。
9)元学习的模型训练与测试单元(Task),由支撑集和查询集组成;举例来说,当N-Way K-Shot为5-Way 5-Shot时,从数据集中随机选取5个类型,再针对每个类型随机选取5个样本以组成支撑集,并相同类型再抽取一定数量(例如15个)、且相同类型的样本组成查询集;从而,5*5个样本组成的支撑集和15个样本组成的查询机组成一个元学习的模型训练与测试单元。
10)模型参数,是使用通用变量来建立函数和变量之间关系的参数;在人工神经网络中,模型参数通常是实数矩阵。
11)云计算,是一种计算模式,通过将计算任务分布在大量计算机构的资源池上,使各种应用***能够根据需要获取计算力、存储空间和信息服务。其中,提供资源的网络被称为“云”,“云”中的资源在使用者看来是可以无限扩展的,并且可以随时获取,按需使用,随时扩展,按使用付费。作为云计算的基础能力提供商,会建立云计算资源池平台,简称云平台,一般称为基础设施即服务(IaaS,Infrastructure as a Service)。另外,在资源池中部署多种类型的虚拟资源,供外部客户选择使用;云计算资源池中包括:计算机设备(可为虚拟化机器,包含操作***)、存储设备和网络 设备。
图1为本申请实施例提供的一种示例性的基于模型的数据处理方法的应用场景示意图;参见图1,终端(示例性地示出了终端10-1和终端10-2)上设置有能够执行不同功能的客户端,其中,终端上的客户端利用不同的业务进程,通过网络300向服务器200(称为用于基于模型进行数据处理的电子设备)发送视频播放请求,以从相应的服务器200中获取不同的视频进行浏览,终端通过网络300连接服务器200,网络300可以是广域网或者局域网,又或者是二者的组合,使用无线链路实现数据传输;并且,终端通过网络300从相应的服务器200中所获取的视频类型并不相同,例如:终端既可以通过网络300从相应的服务器200中获取视频(即视频中携带视频信息或相应的视频链接),也可以通过网络300从相应的服务器200中获取仅包括文字或图像的相应视频。服务器200中可以保存有不同类型的视频。其中,本申请实施例中不再对不同类型的视频的编译环境进行区分。对于数量众多(大于指定数量)的用户上传视频(包括但不限于短视频(视频时长小于指定时长的视频)和长视频(视频时长大于或等于指定时长的视频)),需要判断出相似视频,并对相似视频的版权信息进行合规识别,在这一过程中,可以通过动作识别模型判断向用户的客户端推送的视频是否为版权合规的视频;另外,也可以通过动作识别模型识别视频中的动作,以形成动作预告弹幕或者进度条信息中的动作预告。
以短视频为例,本申请所提供的动作识别模型可以应用于短视频播放,在短视频播放中通常会对不同来源的不同短视频进行处理,最终在用户界面(User Interface,UI)上呈现出与相应的用户相对应的待推荐视频,如果推荐的视频是盗播视频等版权不合规的视频,将直接影响用户体验。视频播放的后台数据库每天都会收到大量不同来源的视频,从中所得到与向目标用户进行视频推荐的不同视频还可以供其他应用程序调用(例如,将短视频推荐进程的推荐结果迁移至长视频推荐进程或者新闻推荐进程中),当然,与相应的目标用户相匹配的动作识别模型也可以迁移至不同的视频推荐进程(例如,网页视频推荐进程、小程序视频推荐进程或者长视频客户端的视频推荐进程)。
其中,本申请实施例所提供的基于模型的数据处理方法是基于人工智能(Artificial Intelligence,AI)实现的,人工智能是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用***。也就是说,人工智能是计算机科学的一个综合技术,用于获取智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。因此,人工智能是研究各种智能机器的设计原理与实现方法,以使机器具有感知、推理与决策的功能。
还需要说明的是,人工智能技术是一门综合学科,涉及领域广泛,既 有硬件层面的技术也有软件层面的技术。人工智能技术的基础技术一般包括传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互***和机电一体化等技术。人工智能技术的软件技术包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习(Machine learning,ML)/深度学习等几大方向。
在本申请实施例中,涉及的人工智能软件技术包括机器学习等方向。其中,机器学习是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析和算法复杂度理论等多门学科,用于研究计算机模拟或实现人类的学习行为的过程,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习通常包括深度学习(Deep Learning)等技术,深度学习包括人工神经网络(Artificial Neural Network),例如卷积神经网络(Convolutional Neural Network,CNN)、循环神经网络(Recurrent Neural Network,RNN)、深度神经网络(Deep Neural Network,DNN)等。
下面对本申请实施例的电子设备的结构做详细说明,电子设备可以采用各种形式来实施,可以为带有视频处理功能的专用终端,例如网关,也可以为带有视频处理功能的服务器,例如图1中的服务器200。图2为本申请实施例提供的电子设备的组成结构示意图,可以理解,图2示出了服务器的示例性结构而非全部结构,根据需要实施图2示出的部分结构或全部结构。
本申请实施例提供的电子设备包括:至少一个处理器201、存储器202、用户接口203和至少一个网络接口204。电子设备20中的各个组件通过总线***205耦合在一起。可以理解,总线***205用于实现这些组件之间的连接通信。总线***205除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。但是为了清楚说明起见,在图2中将各种总线都标为总线***205。
其中,用户接口203可以包括显示器、键盘、鼠标、轨迹球、点击轮、按键、按钮、触感板或者触摸屏等。
可以理解,存储器202可以是易失性存储器或非易失性存储器,也可包括易失性和非易失性存储器两者。本申请实施例中的存储器202能够存储数据以支持终端(如终端10-1)的操作。这些数据的示例包括:用于在终端(如终端10-1)上操作的任何计算机程序,如操作***和应用程序。其中,操作***包含各种***程序,例如框架层、核心库层、驱动层等,用于实现各种基础业务以及处理基于硬件的任务。应用程序可以包含各种应用程序。
在一些实施例中,本申请实施例提供的基于模型的数据处理装置可以采用软硬件结合的方式实现,作为示例,本申请实施例提供的基于模型的 数据处理装置可以是采用硬件译码处理器形式的处理器,其被编程以执行本申请实施例提供的基于模型的数据处理方法。例如,硬件译码处理器形式的处理器可以采用一个或多个应用专用集成电路(ASIC,Application Specific Integrated Circuit)、数字信号处理器(DSP,Digital Signal Processor)、可编程逻辑器件(PLD,Programmable Logic Device)、复杂可编程逻辑器件(CPLD,Complex Programmable Logic Device)、现场可编程门阵列(FPGA,Field-Programmable Gate Array)或其他电子元件。
作为本申请实施例提供的基于模型的数据处理装置采用软硬件结合实施的示例,本申请实施例所提供的基于模型的数据处理装置可以直接体现为由处理器201执行的软件模块组合,软件模块可以位于存储介质中,存储介质位于存储器202,处理器201读取存储器202中软件模块包括的可执行指令,结合必要的硬件(例如,包括处理器201以及连接到总线205的其他组件)完成本申请实施例提供的基于模型的数据处理方法。
作为示例,处理器201可以是一种集成电路芯片,具有信号的处理能力,例如通用处理器、DSP,或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等,其中,通用处理器可以是微处理器或者任何常规的处理器等。
本申请实施例中的存储器202用于存储各种类型的数据以支持电子设备20的操作。这些数据的示例包括:用于在电子设备20上操作的任何可执行指令,其中,用于实现本申请实施例的基于模型的数据处理方法的程序可以包含在可执行指令中。
在一些实施例中,本申请实施例提供的基于模型的数据处理装置可以采用软件方式实现,图2示出了存储在存储器202中的基于模型的数据处理装置2020,其可以是程序和插件等形式的软件,并包括一系列的模块,作为存储器202中存储的程序的示例,可以包括基于模型的数据处理装置2020,基于模型的数据处理装置2020中包括以下的软件模块:样本获取模块2081、特征提取模块2082、时序处理模块2083、模型训练模块2084和模型应用模块2085。当基于模型的数据处理装置2020中的软件模块被处理器201读取到RAM中并执行时,将实现本申请实施例提供的基于模型的数据处理方法。下面对动作识别模型的训练装置2020中各个软件模块的功能进行介绍。
样本获取模块2081,配置为对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,其中,所述第一训练样本集合包括不同类型的视频样本;
特征提取模块2082,配置为通过动作识别模型中的嵌入层网络,对所述第二训练样本集合进行处理,得到第一帧特征序列;通过所述嵌入层网络,对所述查询视频进行处理,得到第二帧特征序列;
时序处理模块2083,配置为通过所述动作识别模型中的时序关系网络, 对所述第一帧特征序列进行处理,得到第一时序关系描述子;通过所述时序关系网络,对所述第二帧特征序列进行处理,得到第二时序关系描述子;
模型训练模块2084,配置为根据所述第一时序关系描述子和所述第二时序关系描述子,对所述动作识别模型的模型参数进行调整,调整后的所述动作识别模型用于对待识别视频中的动作进行识别。
在本申请实施例中,所述样本获取模块2081,还配置为确定所述动作识别模型的使用环境标识;根据所述使用环境标识,确定与所述动作识别模型的使用环境相匹配的历史数据;将从所述历史数据中筛选出的不同类型的视频样本,作为所述第一训练样本集合。
在本申请实施例中,所述样本获取模块2081,还配置为从所述第一训练样本集合中抽取N个类型的视频信息,其中,N为正整数;从每一个类型的视频信息中抽取K个视频样本,其中,K为正整数;将所述N个类型的所有视频样本进行组合,得到所述第二训练样本集合,其中,所述N个类型中的所有视频样本包括N*K个视频样本;从所述N个类型的视频信息中未被抽取的视频信息中,抽取至少一个视频样本,并将抽取的至少一个视频样本作为所述查询视频。
在本申请实施例中,所述特征提取模块2082,还配置为通过所述动作识别模型中的所述嵌入层网络,从所述第二训练样本集合中提取每种类型的视频帧集合,并提取所述视频帧集合对应的第一帧级别特征向量;确定所述第一帧级别特征向量所对应的第一通道数量;基于所述第一通道数量,确定与所述第一帧级别特征向量对应的第一帧级别特征向量集合,以及与所述第一帧级别特征向量集合相匹配的相似度矩阵;对所述第一帧级别特征向量集合和所述相似度矩阵进行融合,得到第二帧级别特征向量集合;通过对所述第二帧级别特征向量集合进行线性转换,得到所述第一帧特征序列。
在本申请实施例中,所述特征提取模块2082,还配置为通过所述嵌入层网络,从所述查询视频中提取第三帧级别特征向量;确定所述第三帧级别特征向量所对应的第二通道数量;基于所述第二通道数量,确定与所述第三帧级别特征向量对应的第三帧级别特征向量集合,并通过对所述第三帧级别特征向量集合进行线性转换,得到所述查询视频对应的所述第二帧特征序列。
在本申请实施例中,所述特征提取模块2082,还配置为获取所述视频帧集合的降采样结果;通过所述嵌入层网络的全连接层,对所述降采样结果进行归一化处理,并对所述视频帧集合中的不同图像帧的归一化结果,进行深度分解,得到所述第一帧级别特征向量。
在本申请实施例中,所述特征提取模块2082,还配置为确定所述第一帧特征序列对应的视频帧数、特征通道数、视频帧高度和视频帧宽度;根据所述第一帧特征序列对应的视频帧数、特征通道数、视频帧高度和视频 帧宽度,对所述第一帧特征序列中的每一帧视频进行时空运动增强,所述时空运动增强用于增强所述第一帧特征序列中的每一帧视频的运动特征。
在本申请实施例中,所述特征提取模块2082,还配置为确定所述第二帧特征序列对应的视频帧数、视频通道数、视频帧高度和视频帧宽度;根据所述第二帧特征序列对应的视频帧数参数、视频通道参数、视频帧的高度参数和视频帧的宽度参数,对所述第二帧特征序列中的每一帧视频进行时空运动增强处理,所述时空运动增强用于增强所第二帧特征序列中的每一帧视频的运动特征。
在本申请实施例中,所述时序处理模块2083,还配置为确定所述第一帧特征序列的第一帧索引参数、以及所述第一帧特征序列的不同子序列;通过所述动作识别模型中的所述时序关系网络,并利用所述第一帧索引参数,确定所述不同子序列所分别对应的时序关系描述子;对所述不同子序列所分别对应的时序关系描述子进行组合,得到所述第一时序关系描述子。
在本申请实施例中,所述时序处理模块2083,还配置为确定所述第二帧特征序列的第二帧索引参数;通过所述时序关系网络,并利用所述第二帧索引参数,确定所述第二时序关系描述子。
在本申请实施例中,所述模型训练模块2084,还配置为对所述第一时序关系描述子和所述第二时序关系描述子进行比较,得到所述第一时序关系描述子和所述第二时序关系描述子的相似度;根据所述第一时序关系描述子和所述第二时序关系描述子的相似度,确定所述第一时序关系描述子中的不同类型的时序关系描述子的权重参数;根据所述时序关系描述子的权重参数,确定不同类型的视频样本的样本原型;计算所述查询视频与每一个类型的视频样本的样本原型的度量分数;将最大的度量分数所对应的视频样本的类型,确定为所述查询视频对应的小样本动作类型,并基于所述小样本动作类型调整所述动作识别模型的模型参数。
在本申请实施例中,所述训练装置还包括模型应用模块2085,配置为确定所述待识别视频中的待识别视频帧序列;通过调整后的所述动作识别模型对所述待识别视频帧序列进行动作识别,得到动作识别结果;确定与所述待识别视频相对应的版权视频;基于所述动作识别结果,确定所述待识别视频和所述版权视频对应的帧间相似度参数集合;获取所述帧间相似度参数集合中达到相似度阈值的视频帧数量;基于所述视频帧数量,确定所述待识别视频与所述版权视频的相似度。
在本申请实施例中,所述模型应用模块2085,还配置为当基于所述待识别视频与所述版权视频的相似度,确定所述待识别视频与所述版权视频相似时,获取所述待识别视频的版权信息;获取所述待识别视频的版权信息和所述版权视频的版权信息的比较结果,所述比较结果用于确定所述待识别视频的合规性;当所述比较结果表示所述待识别视频的版权信息和所述版权视频的版权信息不一致时,生成警示信息。
在本申请实施例中,所述模型应用模块2085,还配置为当基于所述待识别视频与所述版权视频的相似度,确定所述待识别视频与所述版权视频不相似时,将所述待识别视频确定为视频源中的待推荐视频,其中,所述待推荐视频携带有小样本动作识别结果;对所述视频源中的所有待推荐视频的召回顺序进行排序;基于排序结果向目标对应推荐视频。
根据图2所示的电子设备,本申请实施例还提供了一种计算机程序产品,该计算机程序产品包括计算机指令或计算机程序,该计算机指令或计算机程序存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机指令或计算机程序,处理器执行该计算机指令或计算机程序,使得该电子设备执行本申请实施例提供的基于模型的数据处理方法。
下面结合图2示出的电子设备20说明本申请实施例提供的基于模型的数据处理方法。首先对相关技术的缺陷进行说明,相关技术在实现基于帧级别的小样本动作识别时,结合深度信息进行多模态特征融合学习,并且将学习到的特征在计算机可读存储介质中进行额外存储,同时还利用游戏引擎中的虚拟人物构造虚拟动作数据集;但是实际使用中,针对视频中人物的动作信息,往往难以收集足够的标记样本以供传统机器学习从数据中提取运动模式特征,从而容易出现模型过拟合现象,数据形变等数据增强操作还容易引入新的噪声,影响动作识别模型的数据处理效果,同时虚拟动作数据集的收集,提升了训练标记成本,导致训练样本的资源消耗较大,从而训练动作识别模型的资源消耗较大。
基于此,参见图3,图3为本申请实施例提供的基于模型的数据处理方法的一个可选的流程示意图;该基于模型的数据处理方法一个可选的流程由用于训练动作识别模型的电子设备执行;可以理解地,图3所示的步骤可以由运行基于模型的数据处理装置,以基于模型进行数据处理的各种电子设备执行,例如,可以是带有视频处理功能的专用终端、服务器或者服务器集群。本申请实施例提供的基于模型的数据处理方法可以用于非实时性的动作识别模型的训练,例如(包括电视剧、电影、短视频等各种视频类型)的内容分析、目标人物的动作识别等。下面针对图3示出的步骤分别进行说明。
步骤301:获取第一训练样本集合。
在本申请实施例中,第一训练样本集合包括通过历史数据所获取的不同类型的视频样本。在获取第一训练样本集合时,可以首先确定小样本动作识别模型的使用环境标识;根据使用环境标识,确定与动作识别模型的使用环境相匹配的历史数据;将从历史数据中筛选出的不同类型的视频样本作为第一训练样本集合。由于第一训练样本集合中的***具有不确定性(可以是互联网中的视频资源,也可以是电子设备所保存的本地视频文件),通过获取与使用环境相匹配的历史数据,可以实现对小样本动作的 获取,其中,图4为本申请实施例中小样本动作视频帧抽取的一个可选的示意图。如图4所示,视频在播放过程中随着时间轴推移所显示的视频画面,如图4所示,所显示的视频画面中有不同的目标对象,通过对视频画面中的目标对象进行识别,可以确定目标对象在待识别视频的不同视频帧中的所在区域,由于图4所示的3个不同的短视频中分别出现了动作4-1“打羽毛球”、动作4-2“打乒乓球”、以及动作4-3“踢足球”,通过本申请实施例所提供的基于模型的数据处理方法所训练的动作识别模型,可以分别对3个不同的短视频中所出现的动作4-1“打羽毛球”、动作4-2“打乒乓球”、以及动作4-3“踢足球”进行识别。进而,可以通过对目标对象的动作识别结果,确定待识别视频是否合规,或者是否符合版权信息要求,避免用户上传的视频被盗播,也可以阻止侵权视频的推荐与播放。
步骤302:对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频。
在本申请实施例中,第二训练样本集合中的视频数量与视频类型数量均为至少一个,比如,可以将随机数确定视频数量或视频类型数量;查询视频的数量为至少一个;这里,可以从第一训练样本集合中抽取N个类型的视频信息;并从每一个类型的视频信息中抽取K个视频样本;将N个类型的所有视频样本进行组合,得到第二训练样本集合;以及从N个类型的视频信息中未被抽取的视频信息中抽取至少一个视频样本,并将抽取的至少一个视频样本作为查询视频;其中,N为正整数,K为正整数。
需要说明的是,可以采用N-Way K-Shot的训练方式对动作识别模型进行训练,从训练数据的视频类型里面抽取出N个类型,每个类型抽取出K个视频样本,从由N*K个视频样本构成第二样本集合。再从N个类型对应的剩余的视频样本中挑选出1个或多个视频样本作为查询视频。这里,对第二样本集合和查询视频中的每个视频样本进行松散采样,以将视频序列分为T个片段,并在每个片段中抽取出一帧作为该段的摘要,因此,每个视频样本由T帧帧序列表示。T帧帧序列被输入到嵌入层网络中,以进行帧特征提取处理和运动增强处理,后续将继续对帧特征提取处理和运动增强处理进行说明。
需要说明的是,抽取的方式可以是随机抽取方式,也可以是按指定间隔进行抽取的方式,又可以上述两者的结合,等等,本申请实施例对此不作限定。另外,N和K为正整数,N个类型的所有视频样本包括N*K个视频样本。
步骤303:通过动作识别模型中的嵌入层网络,对第二训练样本集合进行处理,得到第一帧特征序列。
在本申请的一些实施例中,对第二训练样本集合进行处理(是指特征提取处理),得到第一帧特征序列可以通过以下方式实现:通过动作识别模型中的嵌入层网络,从第二训练样本集合中提取每种类型的视频帧集合, 并提取视频帧集合对应的第一帧级别特征向量;确定第一帧级别特征向量所对应的第一通道数量;基于第一通道数量,确定与第一帧级别特征向量对应的第一帧级别特征向量集合,以及与第一帧级别特征向量集合相匹配的相似度矩阵;对第一帧级别特征向量集合和相似度矩阵进行融合,得到第二帧级别特征向量集合;通过对第二帧级别特征向量集合进行线性转换,得到第一帧特征序列。
需要说明的是,给定第二样本集合中的一组视频帧(称为T帧帧序列)时,可以利用一个特征提取网络在T帧(包括每个类型的每个视频样本对应的小样本动作的视频帧集合)上提取一系列帧级别的特征F{F 1,F 2.....F T},其中,F i∈F代表了在第i帧上提取的帧级别特征。由于在F中的每一个特征都有d个(称为第一通道数量)通道,可以将F中的每个特征都按通道展开,可以得到T*d个通道级别的特征
Figure PCTCN2022110247-appb-000001
需要说明的是,在帧级别特征的融合阶段,通过计算F c的一个相似度矩阵s F来表示F c中每个特征之间的表观相似度。然后,对于F c中的第i个特征F i c,根据s F来将F c中所有的特征都融合到
Figure PCTCN2022110247-appb-000002
中,以生成对应的增强后的特征
Figure PCTCN2022110247-appb-000003
这里,可以将生成的增强后的特征表示为
Figure PCTCN2022110247-appb-000004
其中,F e中的第i个增强后的特征
Figure PCTCN2022110247-appb-000005
是由公式1计算得到的,公式1如下所示。
Figure PCTCN2022110247-appb-000006
其中,θ(·)表示一个由全连接层实现的线性转换函数;
Figure PCTCN2022110247-appb-000007
表示
Figure PCTCN2022110247-appb-000008
Figure PCTCN2022110247-appb-000009
之间的表观相似度,计算方式如公式2。
Figure PCTCN2022110247-appb-000010
其中,exp为激活函数;a i*d,f
Figure PCTCN2022110247-appb-000011
Figure PCTCN2022110247-appb-000012
之间的点乘结果,如公式3所示。
Figure PCTCN2022110247-appb-000013
φ(·)和
Figure PCTCN2022110247-appb-000014
是两个和θ(·)拥有同样功能的线性转换函数。经过帧级别的特征融合之后,第i个特征
Figure PCTCN2022110247-appb-000015
中的信息被传播到F e中的其他特征中,因此在F e中的每个特征可以获得来自其他帧的帧级别的特征,使得所获得特征包括的信息丰富。
步骤304:通过嵌入层网络,对查询视频进行处理,得到第二帧特征序列。
在本申请的一些实施例中,可以通过嵌入层网络,从查询视频中提取第三帧级别特征向量;确定第三帧级别特征向量所对应的第二通道数量; 基于第二通道数量,确定与第三帧级别特征向量对应的第三帧级别特征向量集合,并通过对第三帧级别特征向量集合进行线性转换,得到查询视频对应的第二帧特征序列。当然,对于短视频处理环境来说,也可以直接使用特征提取器(比如,深度残差网络ResNet),将视频帧序列提取为帧级别特征,例如,短视频的视频帧图像特征可以使用基于深度残差网络ResNet50的预训练卷积神经网络进行特征抽取,把短视频的视频帧图像信息提取为2048维特征向量。ResNet在图像特征提取中有利于短视频的视频帧图像信息的表示。短视频的视频帧图像信息在用户观看前有这很大的眼球吸引力,合理贴切的短视频的视频帧图像可以很好地提升视频的的播放点击率。
在本申请的一些实施例中,还可以使用局部聚合向量(Vector of Locally Aggregated Descriptors,NetVLAD)进行特征抽取,以将视频帧图像生成128维的特征向量。在视频观看中,视频帧信息反映出视频的具体内容和视频质量,对用户观看时长有直接关联,其中,在视频服务器配置动作识别模型时,可以根据不同的使用需求灵活配置帧级别特征向量的获取方式。
步骤305:通过动作识别模型中的时序关系网络,对第一帧特征序列进行处理,得到第一时序关系描述子。
在本申请实施例中,在对第一帧特征序列进行处理(是指时序关系描述子获取),得到第一时序关系描述子之前,为了增强样本的运动特征,还可以对所获取的帧级别特征向量(称为第一帧特征序列)进行时空运动增强处理。
需要说明的是,在进行时空运动增强处理时,动作识别模型的嵌入层网络包括特征提取器和时空运动增强(比如,STME)模块,动作识别模型的嵌入层网络用于将输入视频映射到一个新的特征空间,以便于时序关系网络继续进行处理。
在本申请实施例中,可以确定第一帧特征序列对应的视频帧数、视频通道数、视频帧高度和视频帧宽度;根据第一帧特征序列对应的视频帧数、视频通道数、视频帧高度和视频帧宽度,对第一帧特征序列中的每一帧视频进行时空运动增强处理,以实现增强第一帧特征序列中的每一帧视频的运动特征。
需要说明的是,由于运动信息可以通过两个连续帧的内容位移来测量得到,因此,在进行时空运动增强处理时,利用来自所有时空内容位移位置的信息,来增强样本特征各个区域位置的运动信息。例如,给定一个输入特征S∈R T×C×H×W(第一帧特征序列),其中T指视频帧数,C指特征通道数,H和W分别指视频帧高度和视频帧宽度。
首先,分别使用不同的可学习卷积层将输入特征映射到不同的空间,同时减少特征通道数以进行高效计算,经映射后的特征内容位移可以表述为公式4,公式4如下所示。
d(t)=conv 2(S t+1)-conv 3(S t),1≤t≤T-1  公式4;
其中,d(t)∈R T×C/k×H×W,k是特征通道数的减少比,比如为8,d(t)代表t时刻的内容位移信息,conv 2和conv 3分别为两个1*1*1的时空卷积,S t+1表示S中t+1帧的帧特征,S t表示S中t帧的帧特征。设置t=T(最后时刻)的内容位移信息为0,即为d(T)=0,则所有的特征内容位移沿时序维度拼接,能够得到最终的运动矩阵D=[d(1),.....d(T)]。从而,运动矩阵中各个位置的时序自注意力可由以公式5计算得到:
Figure PCTCN2022110247-appb-000016
其中,a p,ji表示D中每个位置p在第j帧和第i帧上的相关性,D p,j表示D中每个位置p在第j帧上的特征内容位移,D p,i表示D中每个位置p在第i帧上的特征内容位移,Z表示转置处理。
然后,在conv 1(S)上应用注意力机制,得到S在conv 1(S)特征空间中的变换特征图,其中,conv 1为一个1*1*1时空卷积。
最后,将注意力机制对应的输出乘以标量参数λ,之后加上原始输入特征以保留背景信息,因此,时空运动增强处理过程可以表示为公式6,公式6如下所示。
Figure PCTCN2022110247-appb-000017
其中,S p,i和S p,j分别代表S中位置p在第i帧和第j帧上的信息,V p,j代表位置p增强后在第j帧的信息,时空运动增强模块的最终输出为时空运动增强后的帧特征V,V∈R T×C×H×W
同理,参考公式4至公式6的处理过程,还可以确定第二帧特征序列对应的视频帧数、视频通道数、视频帧高度和视频帧宽度;根据第二帧特征序列对应的视频帧数、视频通道数、视频帧高度和视频帧宽度,对第二帧特征序列中的每一帧视频进行时空运动增强处理,以实现增强所第二帧特征序列中的每一帧视频的运动特征。至此,经时空运动增强处理后,V中的每一帧特征都实现了运动增强,在实现运动增强处理后,则基于运动增强处理后的第一帧特征序列和运动增强处理后的第二帧特征序列,执行步骤305以计算分部对应的时序关系描述子。
下面说明获取时序关系描述子的过程。
首先,先确定n(称为帧索引参数,2≤n≤T)帧间的时间关系描述子,之后从帧特征序列中获取多组n帧子序列;继续从多组n帧子序列中随机抽出l组n帧子序列(称为不同子序列),并将l组n帧子序列映射为向量进行相加处理,最终得到n帧子序列的时间关系描述子,参考公式7,对于时空运动增强后的帧特征序列V,它的长度为T,可以通过公式7确定n帧子序列的时间关系描述子,公式7如下所示。
Figure PCTCN2022110247-appb-000018
其中,(V n) l={v a,v b......} l,是从V中采样的第l组n帧子序列,由n个按时间排序的帧特征组成,a和b是帧索引。gφ(n)函数用于从n帧子序列中学习到相应时序关系,这里,gφ(n)函数由一个全连接层实现,将n帧子序列映射为一个向量。为增强学习到的时序关系,可以将l组时序关系累加,得到最终的时序关系描述子R n(称为第一时序关系描述子)。由于时序关系至少需要从两帧中捕获,因此n最小可取2。
需要说明的是,为了充分地提取视频样本中的动态性,可以在多个时间尺度上捕获时序关系。对于长度为T的视频帧序列对应的帧特征序列,可以从中生成多组时序关系描述子,从而最终的样本级特征X(称为第一时序关系描述子)由所有时序关系描述子构成,即X={R 2,R 3......R n},n小于等于T。通过这种方式,能够以多时间尺度方式捕获视频中的动作信息,并将这些捕获到的动态信息编码为特征,以一种鲁棒的方式表示动作特征。
步骤306:通过时序关系网络,对第二帧特征序列进行处理,得到第二时序关系描述子。
在本申请的一些实施例中,可以确定第二帧特征序列的第二帧索引参数;通过时序关系网络,并利用第二帧索引参数,确定第二时序关系描述子。另外,第二时序关系描述子的获取过程与第一时序关系描述子的获取过程类似,本申请实施例在此不再重复描述。
步骤307:根据第一时序关系描述子和第二时序关系描述子,对动作识别模型的模型参数进行调整,调整后的动作识别模型用于对待识别视频中的动作进行识别。
在本申请的一些实施例中,对动作识别模型的模型参数进行调整,以实现通过调整后的动作识别模型对视频中的动作进行识别;其中,模型参数调整的过程可以通过以下方式实现:对第一时序关系描述子和第二时序关系描述子进行比较,得到第一时序关系描述子和第二时序关系描述子的相似度;根据第一时序关系描述子和第二时序关系描述子的相似度,确定第一时序关系描述子中的不同类型的时序关系描述子的权重参数;根据时序关系描述子的权重参数,确定不同类型的视频样本的样本原型;计算查询视频与每一个类型的视频样本的样本原型的度量分数;将最大的度量分数所对应的视频样本的类型,确定为查询视频对应的小样本动作类型,并基于小样本动作类型调整动作识别模型的模型参数。
需要说明的是,由于在同一类视频中存在动作形变,比如,在类型所提供的的视频样本的数量小于阈值的情况下,类型内的差异容易导致类间判别错误。为了减少这种情况的发生,可以确定同一类型中不同视频样本的时序关系描述子重要性,如此,可以赋予同一类型中判别力更强的视频样本的时序关系描述子更大的权重,以此得到最终的类型原型。
需要说明的是,在元学习过程下,每个新类型的学习是任务相关的,从而,可以对每一个任务都生成相应的注意力原型。每个视频样本的时序关系描述子的判别力由与查询视频的第二时序关系描述子的相似性来衡量,由余弦(Cosine)相似性函数g计算得到,如此,根据每个视频样本的时序关系描述子的判别力,可以得到校正后的加权原型。
第二训练样本集合对应的第一时序关系描述子中,第h(1≤h≤N)个类型对应的时序关系描述子为{x h1,x h2,....x hK},K代表第h个类型的视频样本的数量,每个视频样本的时序关系描述子的权重的计算参考公式8,公式8如下所示。
Figure PCTCN2022110247-appb-000019
其中,
Figure PCTCN2022110247-appb-000020
代表类型h的第r个视频样本的n帧的时序关系描述子。然后,可以计算出第h个类型的视频样本r的n帧的时序关系描述子的权重为
Figure PCTCN2022110247-appb-000021
对于类型h,对应的原型是由一系列时序关系描述子的加权求和结果(称为加权描述子)构成;类型h的n帧的加权描述子
Figure PCTCN2022110247-appb-000022
可以通过公式9表示,公式9如下所示。
Figure PCTCN2022110247-appb-000023
因此,类型h的所有视频样本的n帧的加权描述子的集合,构成了类型h的n帧的最终类型原型。将查询视频的n帧的原型q n与第二训练样本集合的n帧的类型原型
Figure PCTCN2022110247-appb-000024
(称为加权描述子)进行比较,该比较过程可以通过公式10表示,公式10如下所示。
Figure PCTCN2022110247-appb-000025
其中,P θ(h pre=h|q)为查询视频的原型q n与第二训练样本集合的n帧的类型原型
Figure PCTCN2022110247-appb-000026
的相似性。
需要说明的是,查询视频的原型q n与各组(2至T组)类型原型
Figure PCTCN2022110247-appb-000027
的相似性之和,就是该类型的度量分数,其中,最高度量分数对应的类型即为预测类型。当视频样本的样本原型的度量分数达到最高时,将最高度量分数对应的类型确定为查询视频对应的小样本动作类型,并基于查询视频对应的小样本动作类型调整动作识别模型的模型参数,以完成对动作识别模型的训练,实现通过训练后的动作识别模型对视频中的动作进行识别。
继续结合图2示出的电子设备20说明本申请实施例提供的基于模型的数据处理方法,参见图5,图5为本申请实施例提供的基于模型的数据处理方法的另一个可选的流程示意图;可以理解地,图5所示的步骤可以由运 行视频处理功能的各种服务器执行,其中,视频处理功能通过将训练后的动作识别模型部署在服务器中实现,以对上传的视频的相似性进行识别,进而对视频的版权信息进行合规识别,当然,在部署训练后的动作识别模型之前还包括对动作识别模型的训练过程,动作识别模型的训练过程包括以步骤501至步骤506,下面对各步骤分别进行说明。
步骤501:获取第一训练样本集合,其中,第一训练样本集合为通过历史数据所获取的带有噪声的视频样本。
步骤502:对第一训练样本集合进行去噪处理,以形成相应的第二训练样本集合。
步骤503:通过动作识别模型对第二训练样本集合进行处理,以确定动作识别模型的初始参数。
步骤504:响应于动作识别模型的初始参数,通过动作识别模型对第二训练样本集合进行处理,得到动作识别模型的更新参数。
需要说明的是,可以将第二训练样本集合中不同的视频样本,代入由动作识别模型所对应的损失函数;确定损失函数满足相应的收敛条件时获得动作识别模型的更新参数。其中,收敛条件可以是达到准确度指标阈值,也可以是达到训练次数阈值,还可以是达到训练时长阈值,又可以是以上的结合,等等,本申请实施例对此不作限定。
步骤505:根据动作识别模型的更新参数,通过第二训练样本集合对动作识别模型的网络参数进行迭代更新。
其中,在动作识别模型训练时,通过交叉熵等损失函数向正确趋势逼近,损失函数直至达到相应的收敛条件。
在本申请的一些实施例中,动作识别模型中的嵌入层网络还可以使用ResNet-101模型或者轻量级网络模型(比如,ResNext-101模型);其中,ResNext-101模型,利用社交应用上的用户标记图像作为预训练数据集,能够降低获取数据标签的资源消耗,提升数据标签的获取效率;而且,训练过程中通过微调,模型的性能能够超越基线模型(比如,ImageNet模型)的最高(State Of The Art,SOTA)水平,能够提升动作识别模型的适用范围。
步骤506:部署经过训练的动作识别模型(称为调整后的动作识别模型)。
在本申请实施例中,可以通过所部署的经过训练的动作识别模型(比如,可以部署在视频客户端运营商的服务器或者云服务器中)执行相应的动作识别,实现对用户所上传的视频的识别。
参见图6,图6为本申请实施例中视频相似判断的一个可选的过程示意图;如图6所示,该视频相似判断的一个可选的过程包括步骤601至步骤607,下面对各步骤分别进行说明。
步骤601:确定与待识别视频相对应的版权视频。
步骤602:通过调整后的动作识别模型对待识别视频进行动作识别,得 到动作识别结果。
步骤603:基于动作识别结果,确定待识别视频和版权视频对应的帧间相似度参数集合。
步骤604:基于帧间相似度参数集合确定达到相似度阈值的图像帧数量,并基于图像帧数量,确定待识别视频与版权视频的相似度。
步骤605:基于待识别视频与版权视频的相似度、以及所设定的相似度阈值,判断待识别视频与版权视频是否相似;如果是执行步骤606,否则执行步骤607。
步骤606:确定待识别视频与版权视频相似。
需要说明的是,当确定待识别视频与版权视频相似时,获取待识别视频的版权信息;通过待识别视频的版权信息和版权视频的版权信息,确定待识别视频的合规性;待识别视频的版权信息和版权视频的版权信息不一致时,发出警示信息;而待识别视频的版权信息和版权视频的版权信息一致时,确定待识别视频合规。由此通过识别视频目标在待识别视频的不同视频帧中的所在区域,来判断版权视频是否被盗播。
步骤607:确定待识别视频与版权视频不同。
需要说明的是,当确定待识别视频与版权视频不相似时,将待识别视频添加至视频源,以作为待推荐视频;对视频源中的所有待推荐视频的召回顺序进行排序;基于所有待推荐视频的召回顺序的排序结果向目标对象进行视频推荐。由此通过识别视频目标在待识别视频的不同视频帧中的所在区域,确定相应的版权视频,并向用户推荐,丰富用户的视频观看选择。
在本申请的一些实施例中,还可以确定与待识别视频相对应的识别信息;基于视频目标在待识别视频的不同视频帧中的所在区域,确定待识别视频和识别信息的匹配程度;当待识别视频和识别信息的匹配程度低于报警阈值时,确定待识别视频合规,以对视频目标在待识别视频的不同视频帧中的所在区域的合规性进行自动识别,由此可以减少视频审核过程中的人工参与,提升视频合规识别的效率,减少识别的成本,同时减少用户的等待时间。
需要说明的是,由于视频服务器中的视频数量是不断增加的,因此,可以将视频的版权信息保存在区块链网络或者云服务器中,实现对视频相似性的判断。其中,该相似性的判断过程可结合云技术(Cloud Technology)或区块链网络技术实现,云技术是指在广域网或局域网内将硬件、软件及网络等系列资源统一起来,实现数据的计算、储存、处理和共享的一种托管技术,也可理解为基于云计算商业模式应用的网络技术、信息技术、整合技术、管理平台技术及应用技术等的总称;另外,由于后台服务需要大量的计算、存储资源,如视频网站、图像类网站和更多的门户网站,因此云技术以云计算作为支撑。下面以对长视频的动作预告弹幕和进度条信息中的动作预告实施环境为例,对本申请实施例提供的基于模型的数据处理 方法进行说明。参见图7,图7为本申请实施例提供的基于模型的数据处理方法的使用场景示意图;如图7所示,终端(比如,终端10-1和终端10-2)上设置有能够播放相应长视频的客户端,例如,长视频播放的客户端或插件,通过相应的客户端可以获得带有弹幕信息(通过弹幕信息请求获得)和进度条信息(通过触发进度条提醒获得)的长视频并进行展示;终端通过网络300连接长视频服务器200-1(图1中服务器200的示例)。当然,用户也可以通过终端上传视频以供网络中的其他用户观看,这一过程中运营商的视频服务器通过动作识别模型对所提供的视频进行识别,以通过识别视频中的动作,并将识别出的动作形成动作预告弹幕或者进度条信息中的动作预告。
参见图8,图8为本申请实施例提供的一种示例性的视频识别过程的示意图;如图8所示,该示例性的视频识别过程包括以下步骤801至步骤807,下面对各步骤分别进行说明。
步骤801:从N段待识别的长视频的视频帧中,抽取第二训练样本集合。
需要说明的是,当N段待识别的长视频为3段待识别的长视频时,第二训练样本集合至少包括:第一视频中的动作1“打羽毛球”、第二视频中的动作2“打乒乓球”、以及第三视频中的动作3“打篮球”的视频帧。
步骤802:通过动作识别模型中的嵌入层网络分别提取第二训练样本集合和查询视频的视频帧序列。
需要说明的是,视频帧序列包括N个类型(C 1至C N)的视频样本对应的视频帧序列和查询视频的视频帧序列。
步骤803:利用动作识别模型中的嵌入层网络,对视频帧序列进行时空运动增强处理。
需要说明的是,嵌入层网络包括残差网络(ResNet)和时空运动增强模块(STME)。
需要说明的是,时空运动增强处理以实现增强第一帧特征序列中的每一帧视频的运动特征。
步骤804:通过动作识别模型中的时序关系网络,对不同视频帧序列进行处理,得到相应的时序关系描述子。
步骤805:根据不同时序关系描述子,对动作识别模型的模型参数进行调整。
步骤806:通过调整后的动作识别模型对视频信息中的动作进行识别,得到不同视频中小样本动作的识别结果。
步骤807:通过动作识别模型识别视频中的动作,并基于识别出的动作形成动作预告弹幕或者进度条信息中的动作预告。
如图9所示,通过调整后的动作识别模型识别视频中的动作,以形成动作预告弹幕(如图9示出的弹幕信息9-1),该动作预告弹幕可以在视频 播放界面显示。
本申请实施例所提供的基于模型的数据处理方法所获得的调整后的动作识别模型,能够鲁棒并精确地将视频中的的小样本动作识别出来。将调整后的动作识别模型数据集(比如,数据集MiniKinetics,数据集UCF101和数据集HMDB51)上进行测试,测试结果参考表1和表2;其中,表1为基线模型1至基线模型10、以及调整后的动作识别模型,在数据集(数据集MiniKinetics)上分别采用一次学习至五次学习的方式进行测试所获得的结果;表2为基线模型1、基线模型8、基线模型10、基线模型11、以及调整后的动作识别模型,在数据集(数据集UCF101和数据集HMDB51)上分别采用一次学习、三次学习和五次学习的方式进行测试所获得的结果。由表1和表2可知,相比于基线模型1至基线模型10,本申请实施例提供的调整后的动作识别模型的在这三个数据集上都获得了最高的识别精确度。表1和表2如下所示。
表1
Figure PCTCN2022110247-appb-000028
Figure PCTCN2022110247-appb-000029
表2
Figure PCTCN2022110247-appb-000030
有益技术效果:本申请实施例先通过从包括不同类型视频样本的第一训练样本集合中,抽取第二训练样本集合和查询视频作为训练数据,再通过第二训练样本集合的第一帧特征序列获取第一时序关系描述子、以及通过查询视频的第二帧特征序列获取第二时序关系描述子,最后通过根据第一时序关系描述子和第二时序关系描述子,对动作识别模型的模型参数进行调整;由于调整过程中所采用的第一时序关系描述子和第二时序关系描述子表征视频帧序列之间的时序关系,又由于动作的发生在视频中对应一定时序,因此,通过挖掘视频帧序列之间的时序关系并通过时序关系描述子调整动作识别模型的参数,使得调整后的动作识别模型能够准确地对视频中的动作进行识别,从而,能够增强模型的泛化性,提升动作识别模型的准确度。
可以理解的是,在本申请实施例中,涉及到视频等相关的数据,当本申请实施例运用到具体产品或技术中时,需要获得用户许可或者同意,且相关数据的收集、使用和处理需要遵守相关国家和地区的相关法律法规和标准。
以上所述,仅为本申请的实施例而已,并非用于限定本申请的保护范围,凡在本申请的精神和原则之内所作的任何修改、等同替换和改进等,均应包含在本申请的保护范围之内。

Claims (18)

  1. 一种基于模型的数据处理方法,所述方法由电子设备执行,所述方法包括:
    对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,其中,所述第一训练样本集合包括不同类型的视频样本;
    通过动作识别模型中的嵌入层网络,对所述第二训练样本集合进行处理,得到第一帧特征序列;
    通过所述嵌入层网络,对所述查询视频进行处理,得到第二帧特征序列;
    通过所述动作识别模型中的时序关系网络,对所述第一帧特征序列进行处理,得到第一时序关系描述子;
    通过所述时序关系网络,对所述第二帧特征序列进行处理,得到第二时序关系描述子;
    根据所述第一时序关系描述子和所述第二时序关系描述子,对所述动作识别模型的模型参数进行调整,调整后的所述动作识别模型用于对待识别视频中的动作进行识别。
  2. 根据权利要求1所述的方法,其中,所述对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频之前,所述方法还包括:
    确定所述动作识别模型的使用环境标识;
    根据所述使用环境标识,确定与所述动作识别模型的使用环境相匹配的历史数据;
    将从所述历史数据中筛选出的不同类型的视频样本,作为所述第一训练样本集合。
  3. 根据权利要求1所述的方法,其中,所述对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,包括:
    从所述第一训练样本集合中抽取N个类型的视频信息,其中,N为正整数;
    从每一个类型的视频信息中抽取K个视频样本,其中,K为正整数;
    将所述N个类型的所有视频样本进行组合,得到所述第二训练样本集合,其中,所述N个类型中的所有视频样本包括N*K个视频样本;
    从所述N个类型的视频信息中未被抽取的视频信息中,抽取至少一个视频样本,并将抽取的至少一个视频样本作为所述查询视频。
  4. 根据权利要求1所述的方法,其中,所述通过动作识别模型中的嵌入层网络,对所述第二训练样本集合进行处理,得到第一帧特征序列,包括:
    通过所述动作识别模型中的所述嵌入层网络,从所述第二训练样本集合中提取每种类型的视频帧集合,并提取所述视频帧集合对应的第一帧级 别特征向量;
    确定所述第一帧级别特征向量所对应的第一通道数量;
    基于所述第一通道数量,确定与所述第一帧级别特征向量对应的第一帧级别特征向量集合,以及与所述第一帧级别特征向量集合相匹配的相似度矩阵;
    对所述第一帧级别特征向量集合和所述相似度矩阵进行融合,得到第二帧级别特征向量集合;
    通过对所述第二帧级别特征向量集合进行线性转换,得到所述第一帧特征序列。
  5. 根据权利要求1所述的方法,其中,所述通过所述嵌入层网络,对所述查询视频进行处理,得到第二帧特征序列,包括:
    通过所述嵌入层网络,从所述查询视频中提取第三帧级别特征向量;
    确定所述第三帧级别特征向量所对应的第二通道数量;
    基于所述第二通道数量,确定与所述第三帧级别特征向量对应的第三帧级别特征向量集合,并通过对所述第三帧级别特征向量集合进行线性转换,得到所述查询视频对应的所述第二帧特征序列。
  6. 根据权利要求4所述的方法,其中,所述提取所述视频帧集合对应的第一帧级别特征向量,包括:
    获取所述视频帧集合的降采样结果;
    通过所述嵌入层网络的全连接层,对所述降采样结果进行归一化处理,并对所述视频帧集合中的不同图像帧的归一化结果,进行深度分解,得到所述第一帧级别特征向量。
  7. 根据权利要求1所述的方法,其中,所述方法还包括:
    确定所述第一帧特征序列对应的视频帧数、特征通道数、视频帧高度和视频帧宽度;
    根据所述第一帧特征序列对应的视频帧数、特征通道数、视频帧高度和视频帧宽度,对所述第一帧特征序列中的每一帧视频进行时空运动增强,所述时空运动增强用于增强所述第一帧特征序列中的每一帧视频的运动特征。
  8. 根据权利要求1所述的方法,其中,所述方法还包括:
    确定所述第二帧特征序列对应的视频帧数、视频通道数、视频帧高度和视频帧宽度;
    根据所述第二帧特征序列对应的视频帧数参数、视频通道参数、视频帧的高度参数和视频帧的宽度参数,对所述第二帧特征序列中的每一帧视频进行时空运动增强处理,所述时空运动增强用于增强所第二帧特征序列中的每一帧视频的运动特征。
  9. 根据权利要求1所述的方法,其中,所述通过所述动作识别模型中的时序关系网络,对所述第一帧特征序列进行处理,得到第一时序关系描 述子,包括:
    确定所述第一帧特征序列的第一帧索引参数、以及所述第一帧特征序列的不同子序列;
    通过所述动作识别模型中的所述时序关系网络,并利用所述第一帧索引参数,确定所述不同子序列所分别对应的时序关系描述子;
    对所述不同子序列所分别对应的时序关系描述子进行组合,得到所述第一时序关系描述子。
  10. 根据权利要求1所述的方法,其中,所述通过所述时序关系网络,对所述第二帧特征序列进行处理,得到第二时序关系描述子,包括:
    确定所述第二帧特征序列的第二帧索引参数;
    通过所述时序关系网络,并利用所述第二帧索引参数,确定所述第二时序关系描述子。
  11. 根据权利要求1至10任一项所述的方法,其中,所述根据所述第一时序关系描述子和所述第二时序关系描述子,对所述动作识别模型的模型参数进行调整,包括:
    对所述第一时序关系描述子和所述第二时序关系描述子进行比较,得到所述第一时序关系描述子和所述第二时序关系描述子的相似度;
    根据所述第一时序关系描述子和所述第二时序关系描述子的相似度,确定所述第一时序关系描述子中的不同类型的时序关系描述子的权重参数;
    根据所述时序关系描述子的权重参数,确定不同类型的视频样本的样本原型;
    计算所述查询视频与每一个类型的视频样本的样本原型的度量分数;
    将最大的度量分数所对应的视频样本的类型,确定为所述查询视频对应的小样本动作类型,并基于所述小样本动作类型调整所述动作识别模型的模型参数。
  12. 根据权利要求1所述的方法,其中,所述方法还包括:
    确定所述待识别视频中的待识别视频帧序列;
    通过调整后的所述动作识别模型对所述待识别视频帧序列进行动作识别,得到动作识别结果;
    确定与所述待识别视频相对应的版权视频;
    基于所述动作识别结果,确定所述待识别视频和所述版权视频对应的帧间相似度参数集合;
    获取所述帧间相似度参数集合中达到相似度阈值的视频帧数量;
    基于所述视频帧数量,确定所述待识别视频与所述版权视频的相似度。
  13. 根据权利要求12所述的方法,其中,所述方法还包括:
    当基于所述待识别视频与所述版权视频的相似度,确定所述待识别视频与所述版权视频相似时,获取所述待识别视频的版权信息;
    获取所述待识别视频的版权信息和所述版权视频的版权信息的比较结 果,所述比较结果用于确定所述待识别视频的合规性;
    当所述比较结果表示所述待识别视频的版权信息和所述版权视频的版权信息不一致时,生成警示信息。
  14. 根据权利要求12所述的方法,其中,所述方法还包括:
    当基于所述待识别视频与所述版权视频的相似度,确定所述待识别视频与所述版权视频不相似时,将所述待识别视频确定为视频源中的待推荐视频,其中,所述待推荐视频携带有小样本动作识别结果;
    对所述视频源中的所有待推荐视频的召回顺序进行排序;
    基于排序结果向目标对应推荐视频。
  15. 一种基于模型的数据处理装置,所述数据处理装置包括:
    样本获取模块,配置为对第一训练样本集合进行抽取,得到第二训练样本集合和查询视频,其中,所述第一训练样本集合包括不同类型的视频样本;
    特征提取模块,配置为通过动作识别模型中的嵌入层网络,对所述第二训练样本集合进行处理,得到第一帧特征序列;通过所述嵌入层网络,对所述查询视频进行处理,得到第二帧特征序列;
    时序处理模块,配置为通过所述动作识别模型中的时序关系网络,对所述第一帧特征序列进行处理,得到第一时序关系描述子;通过所述时序关系网络,对所述第二帧特征序列进行处理,得到第二时序关系描述子;
    模型训练模块,配置为根据所述第一时序关系描述子和所述第二时序关系描述子,对所述动作识别模型的模型参数进行调整,调整后的所述动作识别模型用于对待识别视频中的动作进行识别。
  16. 一种用于基于模型进行数据处理的电子设备,所述电子设备包括:
    存储器,用于存储可执行指令;
    处理器,用于运行所述存储器存储的可执行指令时,实现权利要求1至14任一项所述的基于模型的数据处理方法。
  17. 一种计算机可读存储介质,存储有可执行指令,所述可执行指令被处理器执行时,实现权利要求1至14任一项所述的基于模型的数据处理方法。
  18. 一种计算机程序产品,包括计算机程序或指令,所述计算机程序或指令被处理器执行时,实现权利要求1至14任一项所述的基于模型的数据处理方法。
PCT/CN2022/110247 2021-09-16 2022-08-04 一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品 WO2023040506A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/199,528 US20230353828A1 (en) 2021-09-16 2023-05-19 Model-based data processing method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111087467.0A CN114282047A (zh) 2021-09-16 2021-09-16 小样本动作识别模型训练方法、装置、电子设备及存储介质
CN202111087467.0 2021-09-16

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/199,528 Continuation US20230353828A1 (en) 2021-09-16 2023-05-19 Model-based data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023040506A1 true WO2023040506A1 (zh) 2023-03-23

Family

ID=80868596

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/110247 WO2023040506A1 (zh) 2021-09-16 2022-08-04 一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品

Country Status (3)

Country Link
US (1) US20230353828A1 (zh)
CN (1) CN114282047A (zh)
WO (1) WO2023040506A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116580343A (zh) * 2023-07-13 2023-08-11 合肥中科类脑智能技术有限公司 小样本行为识别方法、存储介质、控制器
CN117097946A (zh) * 2023-10-19 2023-11-21 广东视腾电子科技有限公司 一种视频一体机及用于视频一体机的控制方法
TWI841435B (zh) 2023-06-30 2024-05-01 國立勤益科技大學 基於卷積神經網路的時間序列資料融合演算法及居家照護系統

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114282047A (zh) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 小样本动作识别模型训练方法、装置、电子设备及存储介质
CN115527152A (zh) * 2022-11-10 2022-12-27 南京恩博科技有限公司 一种小样本视频动作分析方法、***及装置
CN115797606B (zh) * 2023-02-07 2023-04-21 合肥孪生宇宙科技有限公司 基于深度学习的3d虚拟数字人交互动作生成方法及***
CN117710777B (zh) * 2024-02-06 2024-06-04 腾讯科技(深圳)有限公司 模型训练方法、关键帧抽取方法及装置

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921087A (zh) * 2018-06-29 2018-11-30 国家计算机网络与信息安全管理中心 视频理解方法
CN110532911A (zh) * 2019-08-19 2019-12-03 南京邮电大学 协方差度量驱动小样本gif短视频情感识别方法及***
CN111831852A (zh) * 2020-07-07 2020-10-27 北京灵汐科技有限公司 一种视频检索方法、装置、设备及存储介质
CN113111842A (zh) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
US20210264261A1 (en) * 2020-02-21 2021-08-26 Caci, Inc. - Federal Systems and methods for few shot object detection
CN114282047A (zh) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 小样本动作识别模型训练方法、装置、电子设备及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108921087A (zh) * 2018-06-29 2018-11-30 国家计算机网络与信息安全管理中心 视频理解方法
CN110532911A (zh) * 2019-08-19 2019-12-03 南京邮电大学 协方差度量驱动小样本gif短视频情感识别方法及***
US20210264261A1 (en) * 2020-02-21 2021-08-26 Caci, Inc. - Federal Systems and methods for few shot object detection
CN111831852A (zh) * 2020-07-07 2020-10-27 北京灵汐科技有限公司 一种视频检索方法、装置、设备及存储介质
CN113111842A (zh) * 2021-04-26 2021-07-13 浙江商汤科技开发有限公司 一种动作识别方法、装置、设备及计算机可读存储介质
CN114282047A (zh) * 2021-09-16 2022-04-05 腾讯科技(深圳)有限公司 小样本动作识别模型训练方法、装置、电子设备及存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BEN-ARI RAMI; SHPIGEL NACSON MOR; AZULAI OPHIR; BARZELAY UDI; ROTMAN DANIEL: "TAEN: Temporal Aware Embedding Network for Few-Shot Action Recognition", 2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION WORKSHOPS (CVPRW), IEEE, 19 June 2021 (2021-06-19), pages 2780 - 2788, XP033967872, DOI: 10.1109/CVPRW53098.2021.00313 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI841435B (zh) 2023-06-30 2024-05-01 國立勤益科技大學 基於卷積神經網路的時間序列資料融合演算法及居家照護系統
CN116580343A (zh) * 2023-07-13 2023-08-11 合肥中科类脑智能技术有限公司 小样本行为识别方法、存储介质、控制器
CN117097946A (zh) * 2023-10-19 2023-11-21 广东视腾电子科技有限公司 一种视频一体机及用于视频一体机的控制方法
CN117097946B (zh) * 2023-10-19 2024-02-02 广东视腾电子科技有限公司 一种视频一体机及用于视频一体机的控制方法

Also Published As

Publication number Publication date
US20230353828A1 (en) 2023-11-02
CN114282047A (zh) 2022-04-05

Similar Documents

Publication Publication Date Title
WO2023040506A1 (zh) 一种基于模型的数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
US20210256320A1 (en) Machine learning artificialintelligence system for identifying vehicles
CN109104620B (zh) 一种短视频推荐方法、装置和可读介质
CN111400591B (zh) 资讯信息推荐方法、装置、电子设备及存储介质
CN111062871B (zh) 一种图像处理方法、装置、计算机设备及可读存储介质
CN109344884B (zh) 媒体信息分类方法、训练图片分类模型的方法及装置
US8533134B1 (en) Graph-based fusion for video classification
WO2021139191A1 (zh) 数据标注的方法以及数据标注的装置
US10685236B2 (en) Multi-model techniques to generate video metadata
CN112119388A (zh) 训练图像嵌入模型和文本嵌入模型
CN112215171B (zh) 目标检测方法、装置、设备及计算机可读存储介质
CN113761153B (zh) 基于图片的问答处理方法、装置、可读介质及电子设备
CN111859149A (zh) 资讯信息推荐方法、装置、电子设备及存储介质
WO2023273628A1 (zh) 一种视频循环识别方法、装置、计算机设备及存储介质
CN113010703A (zh) 一种信息推荐方法、装置、电子设备和存储介质
CN111046275A (zh) 基于人工智能的用户标签确定方法及装置、存储介质
CN112074828A (zh) 训练图像嵌入模型和文本嵌入模型
CN113761253A (zh) 视频标签确定方法、装置、设备及存储介质
CN111783712A (zh) 一种视频处理方法、装置、设备及介质
CN113052039A (zh) 一种交通路网行人密度检测的方法、***及服务器
CN110489613B (zh) 协同可视数据推荐方法及装置
CN116935170A (zh) 视频处理模型的处理方法、装置、计算机设备和存储介质
CN115935049A (zh) 基于人工智能的推荐处理方法、装置及电子设备
CN112801053B (zh) 视频数据处理方法、装置
CN117150053A (zh) 多媒体信息推荐模型训练方法、推荐方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22868883

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE