CN113627354A - Model training method, video processing method, device, equipment and storage medium - Google Patents


Info

Publication number
CN113627354A
Authority
CN
China
Prior art keywords
video
feature
segment
sample
video segment
Prior art date
Legal status
Granted
Application number
CN202110926860.8A
Other languages
Chinese (zh)
Other versions
CN113627354B (en)
Inventor
吴文灏
黄登
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202110926860.8A
Publication of CN113627354A
Application granted
Publication of CN113627354B
Status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/22: Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a model training method, a video processing method, an apparatus, a device and a storage medium, relates to the field of artificial intelligence, in particular to computer vision and deep learning, and can be applied to smart city and intelligent transportation scenarios. The specific implementation scheme is as follows: extracting a first video clip, a second video clip and a third video clip from a sample video set, wherein the first video clip and the second video clip are similar in appearance, and the second video clip and the third video clip have the same playing speed; extracting features of the first video clip, the second video clip and the third video clip respectively by using a target model to obtain a first feature, a second feature and a third feature; determining a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature; and training the target model according to the loss function. This implementation can improve the quality of the extracted features and the performance of downstream tasks.

Description

Model training method, video processing method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence, in particular to the fields of computer vision and deep learning, and specifically to methods, apparatuses, devices, and storage media for model training and video processing, which can be used in smart city and intelligent transportation scenarios.
Background
Video representation learning is a technique that helps a system automatically learn discriminative features from raw video. With the rise of smartphones, recording videos has become unprecedentedly easy, and video analytics has become one of the most active research hotspots today. However, obtaining high-quality video labels requires a great deal of manual annotation, which consumes considerable manpower, material and financial resources. In contrast, millions of unlabeled videos are freely available on the internet. Therefore, learning meaningful video representations from unlabeled video is crucial for video content understanding.
Disclosure of Invention
The present disclosure provides a model training method, a video processing method, a device, an apparatus and a storage medium.
According to a first aspect, there is provided a model training method comprising: extracting a first video clip, a second video clip and a third video clip from a sample video set, wherein the similarity in appearance between the first video clip and the second video clip is greater than a first preset threshold, and the playing speeds of the second video clip and the third video clip are the same; extracting features of the first video clip, the second video clip and the third video clip respectively by using a target model to obtain a first feature of the first video clip, a second feature of the second video clip and a third feature of the third video clip; determining a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature; and training the target model according to the loss function.
According to a second aspect, there is provided a video processing method comprising: acquiring a target video; extracting the features of the target video by using the target model obtained by training through the model training method described in the first aspect, and determining the target features of the target video; and processing the target video according to the target characteristics.
According to a third aspect, there is provided a model training apparatus comprising: the video clip extraction unit is configured to extract a first video clip, a second video clip and a third video clip from the sample video set, wherein the similarity of the appearances of the first video clip and the second video clip is greater than a first preset threshold value, and the playing speeds of the second video clip and the third video clip are the same; the video feature extraction unit is configured to extract features of the first video segment, the second video segment and the third video segment respectively by using the target model to obtain a first feature of the first video segment, a second feature of the second video segment and a third feature of the third video segment; a loss function determination unit configured to determine a loss function according to a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature; an object model training unit configured to train an object model according to the loss function.
According to a fourth aspect, there is provided a video processing apparatus comprising: a video acquisition unit configured to acquire a target video; a feature extraction unit configured to extract features of a target video using a target model trained by the model training method as described in the first aspect, and determine target features of the target video; and the video processing unit is configured to process the target video according to the target characteristics.
According to a fifth aspect, there is provided an electronic device comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect or the method as described in the second aspect.
According to a sixth aspect, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect or the method as described in the second aspect.
According to a seventh aspect, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method as described in the first aspect or the method as described in the second aspect.
According to the technology disclosed herein, the model can be trained in the feature space, so that more video-related information is retained, the quality of the features learned from unlabeled video data is improved, and the performance of downstream tasks is improved.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of a model training method according to the present disclosure;
FIG. 3 is a flow diagram of another embodiment of a model training method according to the present disclosure;
FIG. 4 is a schematic diagram of one embodiment of a model training method according to the present disclosure;
FIG. 5 is a flow diagram for one embodiment of a video processing method according to the present disclosure;
FIG. 6 is a schematic diagram of an application scenario of a model training method, a video processing method according to the present disclosure;
FIG. 7 is a schematic block diagram of one embodiment of a model training apparatus according to the present disclosure;
FIG. 8 is a schematic block diagram of one embodiment of a video processing apparatus according to the present disclosure;
FIG. 9 is a block diagram of an electronic device for implementing a model training method and a video processing method according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the model training method, the video processing method, the model training apparatus, or the video processing apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various communication client applications, such as a video playing application, a video processing application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices including, but not limited to, smartphones, tablet computers, in-vehicle computers, laptop portable computers, desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they may be installed in the electronic devices listed above, and may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No specific limitation is imposed here.
The server 105 may be a server providing various services, for example, a background server providing support for the models used on the terminal devices 101, 102, 103. The background server may train a model using training samples to obtain a target model, and feed the target model back to the terminal devices 101, 102, and 103.
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed server cluster composed of a plurality of servers, or may be implemented as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the model training method provided by the embodiment of the present disclosure is generally executed by the server 105, and the video processing method may be executed by the terminal devices 101, 102, and 103, or may be executed by the server 105. Accordingly, the model training apparatus is generally provided in the server 105, and the video processing apparatus may be provided in the terminal devices 101, 102, and 103, or may be provided in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a model training method according to the present disclosure is shown. The model training method of the embodiment comprises the following steps:
step 201, a first video segment, a second video segment and a third video segment are extracted from a sample video set.
In this embodiment, the execution subject of the model training method (e.g., the server 105 shown in fig. 1) may first obtain a sample video set, which may include a plurality of sample videos. The execution subject may extract a first video clip, a second video clip, and a third video clip from the sample video set. Here, the number of video frames included in each of the first, second, and third video clips may be small. The similarity in appearance between the first video clip and the second video clip is greater than a first preset threshold, and the playing speeds of the second video clip and the third video clip are the same. Video clips with similar appearance can be understood as containing substantially the same elements, with similar relative positions between those elements. The playing speed can be understood as the speed at which a video clip is played back, which is related to the number of display Frames Per Second (FPS).
Step 202, respectively extracting the features of the first video segment, the second video segment and the third video segment by using the target model to obtain the first feature of the first video segment, the second feature of the second video segment and the third feature of the third video segment.
In this embodiment, the executing entity may extract features of the first video segment, the second video segment, and the third video segment respectively by using the target model. Here, the target model may be a model to be trained, which may be used to extract features of the video segment. The execution subject may input the first video segment, the second video segment, and the third video segment into the target model, respectively, and the obtained output is the feature of each video segment. Here, the feature of the first video segment is referred to as a first feature, the feature of the second video segment is referred to as a second feature, and the feature of the third video segment is referred to as a third feature.
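As an illustration of this step, the following is a minimal sketch in PyTorch of encoding three clips into the first, second and third features; the disclosure does not specify a network architecture, so the VideoEncoder class, its layer sizes and the 128-dimensional output used here are assumptions for illustration only.

```python
# Minimal sketch (not the architecture from the disclosure): a tiny 3D-CNN that
# maps a clip tensor of shape (batch, channels, frames, height, width) to a
# fixed-length feature vector.
import torch
import torch.nn as nn

class VideoEncoder(nn.Module):
    def __init__(self, feature_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatio-temporal pooling
        )
        self.head = nn.Linear(64, feature_dim)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        x = self.backbone(clip).flatten(1)  # (batch, 64)
        return self.head(x)                 # (batch, feature_dim)

target_model = VideoEncoder()
# Three clips of 8 frames at 112x112 RGB; batch size 2 is arbitrary.
clip_1 = torch.randn(2, 3, 8, 112, 112)
clip_2 = torch.randn(2, 3, 8, 112, 112)
clip_3 = torch.randn(2, 3, 8, 112, 112)
first_feature = target_model(clip_1)
second_feature = target_model(clip_2)
third_feature = target_model(clip_3)
```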
Step 203, determining a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature.
In this embodiment, the execution subject may calculate a first distance between the first feature and the second feature and a second distance between the second feature and the third feature, respectively, and then determine a loss function according to the first distance and the second distance. Specifically, the execution subject may compute a weighted sum of the first distance and the second distance as the loss function. The weighting coefficients can be set according to the actual application scenario.
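A concrete sketch of this weighted combination is given below, assuming PyTorch and a mean squared L2 distance; the weights w_appearance and w_speed are placeholders to be set for the actual application scenario, not values taken from the disclosure.

```python
import torch.nn.functional as F

def weighted_distance_loss(first_feature, second_feature, third_feature,
                           w_appearance=1.0, w_speed=1.0):
    """Weighted sum of the two feature distances described above."""
    first_distance = F.mse_loss(first_feature, second_feature)   # appearance term
    second_distance = F.mse_loss(second_feature, third_feature)  # playback-speed term
    return w_appearance * first_distance + w_speed * second_distance
```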
And step 204, training a target model according to the loss function.
After determining the loss function, the execution subject may iteratively adjust the parameters of the target model according to the value of the loss function until a training termination condition is satisfied. It will be appreciated that the smaller the loss function value, the better the performance of the target model.
The model training method provided by this embodiment of the disclosure trains the model in the feature space, so that more video-related information is retained, the quality of the features learned from unlabeled video data is improved, and the performance of downstream tasks is improved.
With continued reference to FIG. 3, a flow 300 of another embodiment of a model training method according to the present disclosure is shown. As shown in fig. 3, the method of the present embodiment may include the following steps:
step 301, a first sample video and a second sample video are selected from a sample video set.
In this embodiment, the execution subject may first select a first sample video and a second sample video from the sample video set. Here, the first sample video and the second sample video are two different videos whose appearance similarity is greater than a second preset threshold. For example, the first sample video and the second sample video may be videos of the same object taken from different angles. Specifically, the execution subject may select the first sample video and the second sample video from the sample video set according to information such as the titles, shooting times, and elements appearing in the sample videos.
Step 302, a first video segment and a second video segment are extracted from a first sample video.
In this embodiment, the execution subject may extract the first video clip and the second video clip from the first sample video. Specifically, the execution subject may extract two runs of consecutive video frames from the first sample video as the first video clip and the second video clip, respectively. For example, the execution subject may take the 1st to 10th video frames of the first sample video as the first video clip and the 21st to 30th video frames as the second video clip.
In some optional implementations of this embodiment, the execution subject may obtain the first video segment and the second video segment by:
Step 3021, selecting a plurality of consecutive video frames from the first sample video.
Step 3022, dividing the plurality of video frames into two video segments containing the same number of frames to obtain the first video segment and the second video segment.
In this implementation, the execution subject may select a plurality of consecutive video frames from the first sample video, for example, the 1st to 16th video frames. The plurality of video frames are then divided into two video segments containing the same number of frames to obtain the first video segment and the second video segment; for example, the 1st to 8th video frames form the first video segment and the 9th to 16th video frames form the second video segment. Alternatively, the execution subject may take the odd-numbered frames among the 1st to 16th video frames as the first video segment and the even-numbered frames as the second video segment.
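The splitting described in this implementation can be sketched with plain frame indices as follows; both the contiguous split and the odd/even split are shown, and the 16-frame window is only the example length used above.

```python
def split_into_two_segments(frame_indices, interleave=False):
    """Split consecutive frame indices into two segments of equal length.

    interleave=False: first half vs. second half (e.g. frames 1-8 and 9-16).
    interleave=True:  odd-numbered frames vs. even-numbered frames.
    """
    if interleave:
        return frame_indices[0::2], frame_indices[1::2]
    half = len(frame_indices) // 2
    return frame_indices[:half], frame_indices[half:]

frames = list(range(1, 17))                                  # the 1st to 16th frames
first_segment, second_segment = split_into_two_segments(frames)
odd_segment, even_segment = split_into_two_segments(frames, interleave=True)
```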
Step 303, a third video segment is extracted from the second sample video.
In this embodiment, the execution subject may extract the third video segment from the second sample video. Specifically, the execution subject may set the number of video frames in the third video segment to be the same as the number of video frames in the first video segment or the second video segment.
In some optional implementations of this embodiment, the execution subject may obtain the third video segment by:
step 3031, determining the number of display frames per second of the second video segment.
Step 3032, sampling the second sample video at that number of display frames per second to obtain the third video segment.
In this implementation, the execution subject may first determine the number of display frames per second of the second video segment, and then sample the second sample video at that number of display frames per second to obtain the third video segment. It will be appreciated that two video segments sampled at the same number of display frames per second have the same playback speed.
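One possible reading of this sampling step is sketched below: frames are taken from the second sample video with a stride derived from its native frame rate so that the resulting clip is sampled at the same number of display frames per second as the second segment. The frame rates, clip length and stride formula are assumptions used for illustration.

```python
def sample_clip_at_fps(source_fps, clip_fps, clip_len, start_frame=0):
    """Return frame indices sampled from a source video at clip_fps frames per
    second of source time, i.e. one frame every source_fps / clip_fps source
    frames. Two clips sampled at the same clip_fps share the same playback speed."""
    step = source_fps / clip_fps
    return [start_frame + int(round(i * step)) for i in range(clip_len)]

# Example: the second segment is displayed at 8 FPS, so the third segment is
# sampled from the second sample video (assumed to be recorded at 24 FPS)
# at the same 8 FPS.
third_segment_indices = sample_clip_at_fps(source_fps=24, clip_fps=8, clip_len=8)
```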
Step 304, performing data enhancement on the first video clip and the second video clip; and performing feature extraction on the first video segment after the data enhancement, the second video segment after the data enhancement and the third video segment by using the target model to obtain a first feature, a second feature and a third feature.
After obtaining the three video segments, the execution subject performs data enhancement on the first video segment and the second video segment respectively. Here, data enhancement may include, but is not limited to: random cropping, random color jitter, random blurring, and random flipping. The execution subject may then perform feature extraction on the data-enhanced first video segment, the data-enhanced second video segment, and the third video segment by using the target model to obtain the first feature, the second feature, and the third feature. In this embodiment, performing data enhancement on the first video segment and the second video segment strengthens the model's ability to extract features.
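The sketch below illustrates clip-level data enhancement with plain tensor operations; only random cropping, a simple brightness jitter and horizontal flipping are shown, applied identically to every frame of a clip, and the crop size and jitter range are arbitrary choices rather than values from the disclosure.

```python
import torch

def augment_clip(clip: torch.Tensor, crop_size: int = 96) -> torch.Tensor:
    """Apply a random crop, brightness jitter and horizontal flip to a clip of
    shape (channels, frames, height, width), sharing the randomness across frames."""
    _, _, h, w = clip.shape
    top = torch.randint(0, h - crop_size + 1, (1,)).item()
    left = torch.randint(0, w - crop_size + 1, (1,)).item()
    clip = clip[:, :, top:top + crop_size, left:left + crop_size]  # random crop
    clip = clip * (0.8 + 0.4 * torch.rand(1).item())               # brightness jitter
    if torch.rand(1).item() < 0.5:
        clip = torch.flip(clip, dims=[-1])                         # horizontal flip
    return clip

first_clip = torch.rand(3, 8, 112, 112)
second_clip = torch.rand(3, 8, 112, 112)
first_clip_aug, second_clip_aug = augment_clip(first_clip), augment_clip(second_clip)
# The third clip is passed to the target model without enhancement.
```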
Step 305, determining a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature.
In this embodiment, the first distance represents the distance between two video segments in the appearance feature space, and the second distance represents the distance between two video segments in the playback-speed feature space. The first distance and the second distance may be L2 distances. The execution subject may reduce the distance between the first feature and the second feature in the appearance feature space by minimizing the first distance, and reduce the distance between the second feature and the third feature in the playback-speed feature space by minimizing the second distance.
And step 306, training the target model according to the loss function.
In this embodiment, the execution subject may optimize the network using stochastic gradient descent (SGD), continuously updating the network weights until the loss function converges, at which point training stops.
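A self-contained sketch of such an SGD loop is shown below; the tiny linear encoder, the random stand-in clips, the weighting of the two distances, and the hyper-parameters (learning rate 0.01, momentum 0.9, 10 iterations) are all assumptions chosen only to make the example runnable, not settings from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-in for the target model: flatten a (3, 8, 32, 32) clip and project it.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))

def loss_fn(f1, f2, f3, w_appearance=1.0, w_speed=1.0):
    return w_appearance * F.mse_loss(f1, f2) + w_speed * F.mse_loss(f2, f3)

optimizer = torch.optim.SGD(encoder.parameters(), lr=0.01, momentum=0.9)

for step in range(10):
    # In practice the clips come from the sampling procedure above; random
    # tensors stand in here so the loop runs on its own.
    clip_1, clip_2, clip_3 = (torch.rand(4, 3, 8, 32, 32) for _ in range(3))
    f1, f2, f3 = encoder(clip_1), encoder(clip_2), encoder(clip_3)
    loss = loss_fn(f1, f2, f3)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```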
Fig. 4 shows a schematic diagram of the processing of the first video segment (a), the second video segment (b), and the third video segment (c). The first video segment (a) and the second video segment (b) are similar in appearance, and the playing speeds of the second video segment (b) and the third video segment (c) are the same. The execution subject calculates the distance between the features of the first video segment (a) and the second video segment (b) and the distance between the features of the second video segment (b) and the third video segment (c), generates a loss function from these distances, and trains the target model accordingly.
And 307, fine-tuning the trained target model according to the sample data of the downstream task.
In this embodiment, the execution subject may further fine-tune the trained target model according to sample data of a downstream task. Specifically, if the downstream task is a classification task, the execution subject may add a classifier after the trained target model and optimize the network on the downstream task's data set with a smaller learning rate. Using a small learning rate avoids overly large adjustments to the parameters of the target model, which would degrade its performance.
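The classifier-based fine-tuning described here could look like the following sketch, where the pretrained encoder is a placeholder, the number of classes is made up, and the two learning rates merely illustrate giving the pretrained backbone a smaller rate than the newly added head.

```python
import torch
import torch.nn as nn

# Placeholder pretrained encoder and a new classification head for the downstream task.
pretrained_encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 32 * 32, 128))
num_classes = 10
classifier = nn.Linear(128, num_classes)
model = nn.Sequential(pretrained_encoder, nn.ReLU(), classifier)

optimizer = torch.optim.SGD(
    [
        {"params": pretrained_encoder.parameters(), "lr": 1e-4},  # small lr for pretrained weights
        {"params": classifier.parameters()},                      # default (larger) lr for the new head
    ],
    lr=1e-2,
    momentum=0.9,
)
criterion = nn.CrossEntropyLoss()

clips = torch.rand(4, 3, 8, 32, 32)           # toy labeled downstream batch
labels = torch.randint(0, num_classes, (4,))
loss = criterion(model(clips), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```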
The model training method provided by this embodiment of the disclosure requires no manually annotated video labels, which saves manpower and material resources and makes the method suitable for training on large-scale unlabeled data sets; more video-related information is retained in the feature space during training; the method does not rely on negative samples, which saves memory and reduces training cost; and fine-tuning on downstream tasks further improves the performance of the model.
With continued reference to fig. 5, a flow 500 of one embodiment of a video processing method according to the present disclosure is shown. As shown in fig. 5, the method of the present embodiment may include the following steps:
step 501, obtaining a target video.
In this embodiment, the execution subject may first obtain the target video. Here, the target video is a video to be processed.
Step 502, extracting features of the target video by using the target model obtained through training with the model training method, and determining target features of the target video.
In this embodiment, the executing subject may input the target video into the target model obtained by the model training method described in the embodiment of fig. 2 or fig. 3, to obtain the features of the target video, and record the features as the target features.
And 503, processing the target video according to the target characteristics.
After obtaining the target features, the execution subject can continue to process the target video. For example, a video search may be performed to determine whether the target video is similar to other videos. Alternatively, the target video may be classified to determine the category to which the target video belongs.
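For the retrieval example, the target feature can be compared against pre-computed features of other videos; the sketch below uses cosine similarity, and the 128-dimensional features and the gallery of 1000 candidate videos are invented for illustration.

```python
import torch
import torch.nn.functional as F

target_feature = torch.rand(128)           # feature of the target video
gallery_features = torch.rand(1000, 128)   # pre-computed features of candidate videos

similarities = F.cosine_similarity(gallery_features, target_feature.unsqueeze(0), dim=1)
top_scores, top_indices = similarities.topk(5)   # indices of the 5 most similar videos
```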
According to the video processing method provided by the embodiment of the disclosure, the trained target model can be used for extracting the high-quality features of the target video, so that the accuracy of the downstream task result can be improved.
With continued reference to fig. 6, a schematic diagram of an application scenario of the model training method, video processing method according to the present disclosure is shown. In the application scenario of fig. 6, the server 601 performs the processing of steps 201 to 204 by using a plurality of sample videos, and then obtains a trained target model. The object model is then sent to the terminal 602. The terminal 602 may perform video retrieval by using the target model to obtain a plurality of similar videos for the user to view.
With further reference to fig. 7, as an implementation of the methods shown in the above-mentioned figures, the present disclosure provides an embodiment of a model training apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 7, the model training apparatus 700 of the present embodiment includes: a video segment extraction unit 701, a video feature extraction unit 702, a loss function determination unit 703, and an object model training unit 704.
A video segment extracting unit 701 configured to extract a first video segment, a second video segment, and a third video segment from the sample video set. The similarity of the appearances of the first video clip and the second video clip is larger than a first preset threshold value, and the playing speed of the second video clip is the same as that of the third video clip.
The video feature extraction unit 702 is configured to extract features of the first video segment, the second video segment, and the third video segment respectively by using the target model, so as to obtain a first feature of the first video segment, a second feature of the second video segment, and a third feature of the third video segment.
A loss function determining unit 703 configured to determine a loss function based on a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature.
An object model training unit 704 configured to train the object model according to the loss function.
In some optional implementations of this embodiment, the video segment extracting unit 701 may be further configured to: selecting a first sample video and a second sample video from the sample video set, wherein the appearance similarity of the first sample video and the second sample video is greater than a second preset threshold value; extracting a first video clip and a second video clip from a first sample video; a third video segment is extracted from the second sample video.
In some optional implementations of this embodiment, the video segment extracting unit 701 may be further configured to: select a plurality of consecutive video frames from the first sample video; and divide the plurality of video frames into two video segments containing the same number of frames to obtain the first video segment and the second video segment.
In some optional implementations of this embodiment, the video segment extracting unit 701 may be further configured to: determining the number of display frames per second of the second video segment; and sampling the second sample video by the display frame number per second to obtain a third video segment.
In some optional implementations of this embodiment, the video feature extraction unit 702 may be further configured to: performing data enhancement on the first video segment and the second video segment; and performing feature extraction on the first video segment after the data enhancement, the second video segment after the data enhancement and the third video segment by using the target model to obtain a first feature, a second feature and a third feature.
In some optional implementations of this embodiment, the apparatus 700 may further include a fine-tuning unit, not shown in fig. 7, configured to fine-tune the trained target model according to sample data of a downstream task.
It should be understood that the units 701 to 704 recited in the model training apparatus 700 correspond to the respective steps in the method described with reference to fig. 2. Thus, the operations and features described above with respect to the model training method are equally applicable to the apparatus 700 and the units included therein, and are not described in detail here.
With further reference to fig. 8, as an implementation of the methods shown in the above figures, the present disclosure provides an embodiment of a video processing apparatus, which corresponds to the embodiment of the method shown in fig. 5, and which is particularly applicable to various electronic devices.
As shown in fig. 8, the video processing apparatus 800 of the present embodiment includes: a video acquisition unit 801, a feature extraction unit 802, and a video processing unit 803.
A video acquisition unit 801 configured to acquire a target video.
The feature extraction unit 802 is configured to extract features of the target video using the target model trained by the model training method described in the embodiment of fig. 2 or fig. 3, and determine target features of the target video.
A video processing unit 803 configured to process the target video according to the target feature.
It should be understood that units 801 to 803 recited in the video processing apparatus 800 correspond to respective steps in the method described with reference to fig. 5. Thus, the operations and features described above for the video processing method are also applicable to the apparatus 800 and the units included therein, and are not described here again.
In the technical solution of the present disclosure, the acquisition, storage, and application of the personal information of the users involved all comply with the provisions of relevant laws and regulations and do not violate public order or good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to an embodiment of the present disclosure.
FIG. 9 illustrates a block diagram of an electronic device 900 that performs a model training method, a video processing method, according to an embodiment of the disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the electronic device 900 includes a processor 901, which can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a memory 908 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data required for the operation of the electronic device 900. The processor 901, the ROM 902, and the RAM 903 are connected to each other through a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, and the like; an output unit 907 such as various types of displays, speakers, and the like; a memory 908, such as a magnetic disk, optical disk, or the like; and a communication unit 909 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
Processor 901 may be a variety of general and/or special purpose processing components with processing and computing capabilities. Some examples of processor 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, or the like. The processor 901 performs various methods and processes described above, such as a model training method, a video processing method. For example, in some embodiments, the model training method, the video processing method, may be implemented as a computer software program tangibly embodied in a machine-readable storage medium, such as the memory 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When loaded into RAM 903 and executed by processor 901, a computer program may perform one or more steps of the model training method, the video processing method described above. Alternatively, in other embodiments, the processor 901 may be configured to perform the model training method, the video processing method, by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code described above may be packaged as a computer program product. These program code or computer program products may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor 901, causes the functions/acts specified in the flowchart and/or block diagram block or blocks to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable storage medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable storage medium may be a machine-readable signal storage medium or a machine-readable storage medium. A machine-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, which is a host product in a cloud computing service system and overcomes the defects of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions of the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (17)

1. A model training method, comprising:
extracting a first video clip, a second video clip and a third video clip from a sample video set, wherein the similarity of the appearances of the first video clip and the second video clip is greater than a first preset threshold value, and the playing speeds of the second video clip and the third video clip are the same;
respectively extracting the characteristics of the first video clip, the second video clip and the third video clip by using a target model to obtain a first characteristic of the first video clip, a second characteristic of the second video clip and a third characteristic of the third video clip;
determining a loss function according to a first distance between the first feature and the second feature and a second distance between the second feature and the third feature;
and training the target model according to the loss function.
2. The method of claim 1, wherein said extracting a first video segment, a second video segment, and a third video segment from a sample video set comprises:
selecting a first sample video and a second sample video from the sample video set, wherein the appearance similarity of the first sample video and the second sample video is greater than a second preset threshold value;
extracting the first video clip and the second video clip from the first sample video;
the third video segment is extracted from the second sample video.
3. The method of claim 2, wherein said extracting said first video segment and said second video segment from said first sample video comprises:
selecting a plurality of continuous video frames from the first sample video;
and dividing the plurality of video frames into two video segments with the same quantity to obtain the first video segment and the second video segment.
4. A method as claimed in claim 2 or 3, wherein said extracting said third video segment from said second sample video comprises:
determining the number of display frames per second of the second video segment;
and sampling the second sample video by the display frame number per second to obtain the third video segment.
5. The method of claim 1, wherein the extracting features of the first video segment, the second video segment and the third video segment respectively by using the object model to obtain a first feature of the first video segment, a second feature of the second video segment and a third feature of the third video segment comprises:
performing data enhancement on the first video segment and the second video segment;
and performing feature extraction on the first video segment after data enhancement, the second video segment after data enhancement and the third video segment by using the target model to obtain the first feature, the second feature and the third feature.
6. The method of claim 1, wherein the method further comprises:
and fine-tuning the trained target model according to the sample data of the downstream task.
7. A video processing method, comprising:
acquiring a target video;
extracting the characteristics of the target video by using a target model obtained by training through the model training method of any one of claims 1-6, and determining the target characteristics of the target video;
and processing the target video according to the target characteristics.
8. A model training apparatus comprising:
the video clip extraction unit is configured to extract a first video clip, a second video clip and a third video clip from a sample video set, wherein the similarity of the appearances of the first video clip and the second video clip is greater than a first preset threshold value, and the playing speed of the second video clip and the third video clip is the same;
a video feature extraction unit configured to extract features of the first video segment, the second video segment and the third video segment respectively by using a target model, so as to obtain a first feature of the first video segment, a second feature of the second video segment and a third feature of the third video segment;
a loss function determination unit configured to determine a loss function according to a first distance between the first feature and the second feature, and a second distance between the second feature and the third feature;
an object model training unit configured to train the object model according to the loss function.
9. The apparatus of claim 8, wherein the video segment extraction unit is further configured to:
selecting a first sample video and a second sample video from the sample video set, wherein the appearance similarity of the first sample video and the second sample video is greater than a second preset threshold value;
extracting the first video clip and the second video clip from the first sample video;
the third video segment is extracted from the second sample video.
10. The apparatus of claim 9, wherein the video segment extraction unit is further configured to:
selecting a plurality of continuous video frames from the first sample video;
and dividing the plurality of video frames into two video segments with the same quantity to obtain the first video segment and the second video segment.
11. The apparatus of claim 9 or 10, wherein the video segment extraction unit is further configured to:
determining the number of display frames per second of the second video segment;
and sampling the second sample video by the display frame number per second to obtain the third video segment.
12. The apparatus of claim 8, wherein the video feature extraction unit is further configured to:
performing data enhancement on the first video segment and the second video segment;
and performing feature extraction on the first video segment after data enhancement, the second video segment after data enhancement and the third video segment by using the target model to obtain the first feature, the second feature and the third feature.
13. The apparatus of claim 8, wherein the apparatus further comprises a fine tuning unit configured to:
and fine-tuning the trained target model according to the sample data of the downstream task.
14. A video processing apparatus comprising:
a video acquisition unit configured to acquire a target video;
a feature extraction unit configured to extract features of the target video by using a target model trained by the model training method according to any one of claims 1 to 6, and determine target features of the target video;
a video processing unit configured to process the target video according to the target feature.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein the content of the first and second substances,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6 or to perform the method of claim 7.
16. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6 or the method of claim 7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-6 or the method of claim 7.
CN202110926860.8A 2021-08-12 2021-08-12 Model training method, video processing method, apparatus, device, and storage medium Active CN113627354B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110926860.8A CN113627354B (en) 2021-08-12 2021-08-12 Model training method, video processing method, apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110926860.8A CN113627354B (en) 2021-08-12 2021-08-12 Model training method, video processing method, apparatus, device, and storage medium

Publications (2)

Publication Number Publication Date
CN113627354A true CN113627354A (en) 2021-11-09
CN113627354B CN113627354B (en) 2023-08-08

Family

ID=78385142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110926860.8A Active CN113627354B (en) Model training method, video processing method, apparatus, device, and storage medium

Country Status (1)

Country Link
CN (1) CN113627354B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060053342A1 (en) * 2004-09-09 2006-03-09 Bazakos Michael E Unsupervised learning of events in a video sequence
US10541000B1 (en) * 2015-06-26 2020-01-21 Amazon Technologies, Inc. User input-based video summarization
CN108154137A (en) * 2018-01-18 2018-06-12 厦门美图之家科技有限公司 Video features learning method, device, electronic equipment and readable storage medium storing program for executing
CN108427713A (en) * 2018-02-01 2018-08-21 宁波诺丁汉大学 A kind of video summarization method and system for homemade video
WO2020019926A1 (en) * 2018-07-27 2020-01-30 腾讯科技(深圳)有限公司 Feature extraction model training method and apparatus, computer device, and computer readable storage medium
WO2020087979A1 (en) * 2018-10-30 2020-05-07 北京字节跳动网络技术有限公司 Method and apparatus for generating model
CN110769314A (en) * 2019-11-20 2020-02-07 三星电子(中国)研发中心 Video playing method and device and computer readable storage medium
CN112650885A (en) * 2021-01-22 2021-04-13 百度在线网络技术(北京)有限公司 Video classification method, device, equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
DU Pengfei et al., "A Survey of Multimodal Visual-Language Representation Learning", Journal of Software, vol. 32, no. 2, pages 327-348

Also Published As

Publication number Publication date
CN113627354B (en) 2023-08-08

Similar Documents

Publication Publication Date Title
CN113326764B (en) Method and device for training image recognition model and image recognition
CN112929744B (en) Method, apparatus, device, medium and program product for segmenting video clips
CN113159010B (en) Video classification method, device, equipment and storage medium
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
CN113627536B (en) Model training, video classification method, device, equipment and storage medium
CN113378855A (en) Method for processing multitask, related device and computer program product
CN112508115B (en) Method, apparatus, device and computer storage medium for establishing node representation model
CN113436100A (en) Method, apparatus, device, medium and product for repairing video
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
WO2024099171A1 (en) Video generation method and apparatus
CN112908297A (en) Response speed testing method, device, equipment and storage medium for vehicle-mounted equipment
CN112784734A (en) Video identification method and device, electronic equipment and storage medium
CN115358392A (en) Deep learning network training method, text detection method and text detection device
US11120460B2 (en) Effectiveness of service complexity configurations in top-down complex services design
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN113724398A (en) Augmented reality method, apparatus, device and storage medium
CN113139463B (en) Method, apparatus, device, medium and program product for training a model
CN113627354B (en) Model training method, video processing method, apparatus, device, and storage medium
CN114882334A (en) Method for generating pre-training model, model training method and device
CN114724144A (en) Text recognition method, model training method, device, equipment and medium
CN115312042A (en) Method, apparatus, device and storage medium for processing audio
CN113920404A (en) Training method, image processing method, device, electronic device and storage medium
CN113379750A (en) Semi-supervised learning method of semantic segmentation model, related device and product
CN112560987A (en) Image sample processing method, device, equipment, storage medium and program product
CN111311604A (en) Method and apparatus for segmenting an image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant