TW202201285A - Neural network training method, video recognition method, computer equipment and readable storage medium - Google Patents

Neural network training method, video recognition method, computer equipment and readable storage medium

Info

Publication number
TW202201285A
TW202201285A TW110115206A
Authority
TW
Taiwan
Prior art keywords
neural network
directed acyclic
node
feature map
edge
Prior art date
Application number
TW110115206A
Other languages
Chinese (zh)
Other versions
TWI770967B (en)
Inventor
王子豪
林宸
邵婧
盛律
閆俊傑
Original Assignee
大陸商深圳市商湯科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 大陸商深圳市商湯科技有限公司
Publication of TW202201285A
Application granted
Publication of TWI770967B

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N3/00 Computing arrangements based on biological models
            • G06N3/02 Neural networks
              • G06N3/04 Architecture, e.g. interconnection topology
                • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
                • G06N3/045 Combinations of networks
                • G06N3/047 Probabilistic or stochastic networks
              • G06N3/08 Learning methods
        • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
          • G06V10/00 Arrangements for image or video recognition or understanding
            • G06V10/40 Extraction of image or video features
              • G06V10/62 Extraction of image or video features relating to a temporal dimension, e.g. time-based feature extraction; Pattern tracking
            • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
              • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
              • G06V10/84 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using probabilistic graphical models from image or video features, e.g. Markov models or Bayesian networks
          • G06V20/00 Scenes; Scene-specific elements
            • G06V20/40 Scenes; Scene-specific elements in video content
              • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
              • G06V20/44 Event detection
              • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Probability & Statistics with Applications (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The present disclosure provides a neural network training method, a video recognition method, a computer device, and a computer-readable storage medium. The method includes: acquiring sample videos and constructing a neural network including a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; training the neural network based on the sample videos and the event label corresponding to each sample video to obtain trained weight parameters; and, based on the trained weight parameters, selecting a target operation method for each edge of the plurality of directed acyclic graphs to obtain a trained neural network.

Description

一種神經網路的訓練方法、視頻識別方法及電腦設備和電腦可讀儲存介質 A neural network training method, video recognition method, computer equipment and computer-readable storage medium

本發明關於電腦技術領域，關於一種神經網路的訓練方法、視頻識別方法及電腦設備和電腦可讀儲存介質。The present invention relates to the field of computer technology, and in particular to a neural network training method, a video recognition method, a computer device, and a computer-readable storage medium.

視頻識別是指識別視頻中所發生的事件,相關技術中,一般是對進行圖片識別的神經網路進行簡單改造後用於視頻識別。Video recognition refers to recognizing events that occur in a video. In related technologies, a neural network for image recognition is generally used for video recognition after a simple transformation.

然而,由於進行圖片識別的神經網路是在圖像維度上進行目標識別的,這樣會忽略一些從圖像維度無法提取的視頻特徵,從而影響了神經網路進行視頻識別的精度。However, since the neural network for image recognition performs target recognition on the image dimension, some video features that cannot be extracted from the image dimension will be ignored, thus affecting the accuracy of the neural network for video recognition.

本發明實施例至少提供一種神經網路的訓練方法、視頻識別方法及電腦設備和電腦可讀儲存介質。Embodiments of the present invention provide at least a neural network training method, a video recognition method, a computer device, and a computer-readable storage medium.

第一方面，本發明實施例提供了一種神經網路的訓練方法，包括：獲取樣本視頻，並構建包括多個有向無環圖的神經網路；所述多個有向無環圖中包括用於提取時間特徵的至少一個有向無環圖，和用於提取空間特徵的至少一個有向無環圖；所述有向無環圖的每條邊分別對應多個操作方法，每一所述操作方法具有對應的權重參數；基於所述樣本視頻和每一所述樣本視頻對應的事件標籤，對所述神經網路進行訓練，得到訓練後的權重參數；基於所述訓練後的權重參數，為所述多個有向無環圖的每條邊選擇目標操作方法，以得到訓練後的神經網路。In a first aspect, an embodiment of the present invention provides a neural network training method, including: acquiring sample videos and constructing a neural network including a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; training the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and, based on the trained weight parameters, selecting a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.

上述方法中，所構建的神經網路中不僅包括用於提取空間特徵的有向無環圖，還包括用於提取時間特徵的有向無環圖，有向無環圖的每條邊對應多個操作方法；這樣在利用樣本視頻對神經網路進行訓練後，可以得到訓練後的操作方法的權重參數，進一步基於訓練後的操作方法的權重參數來得到訓練後的神經網路；這種方法訓練的神經網路不僅進行了圖像維度的空間特徵識別，還進行了時間維度的時間特徵識別，訓練出的神經網路對於視頻的識別精度較高。In the above method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to multiple operation methods. After the neural network is trained on the sample videos, the trained weight parameters of the operation methods are obtained, and the trained neural network is then derived from these weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so the trained neural network achieves high accuracy in video recognition.

在一些可能的實施方式中，所述有向無環圖包括兩個輸入節點；所述神經網路的每個節點對應一個特徵圖；所述構建包括多個有向無環圖的神經網路，包括：將第N-1個有向無環圖輸出的特徵圖作為第N+1個有向無環圖的一個輸入節點的特徵圖，並將第N個有向無環圖輸出的特徵圖作為所述第N+1個有向無環圖的另一個輸入節點的特徵圖；N為大於1的整數；其中，所述神經網路的第一個有向無環圖中的目標輸入節點對應的特徵圖為對樣本視頻的採樣視頻幀進行特徵提取後的特徵圖，除所述目標輸入節點外的另一個輸入節點為空；所述神經網路的第二個有向無環圖中一個輸入節點的特徵圖為所述第一個有向無環圖輸出的特徵圖，另一個輸入節點為空。In some possible implementations, each directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map. Constructing the neural network including a plurality of directed acyclic graphs includes: taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. The feature map corresponding to the target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on the sampled video frames of a sample video, and the other input node of the first graph is empty; one input node of the second directed acyclic graph takes the feature map output by the first directed acyclic graph, and its other input node is empty.
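
The chaining described above can be sketched as follows. This is a minimal Python illustration with hypothetical helper names (not from the patent): the first graph receives the stem features and an empty second input, the second graph receives the first graph's output and an empty input, and every later graph receives the outputs of the two preceding graphs.

```python
def cell_inputs(stem_features, cell_outputs, n):
    """Return the two input feature maps of the n-th directed acyclic graph (1-indexed).

    stem_features: feature map extracted from the sampled video frames.
    cell_outputs:  outputs of graphs 1..n-1, so cell_outputs[i] is graph i+1's output.
    """
    if n == 1:
        return stem_features, None      # target input node; the other input is empty
    if n == 2:
        return cell_outputs[0], None    # output of graph 1; the other input is empty
    # graph n takes the outputs of graphs n-2 and n-1 as its two inputs
    return cell_outputs[n - 3], cell_outputs[n - 2]
```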

在一些可能的實施方式中，根據以下方法確定有向無環圖輸出的特徵圖：將所述有向無環圖中除輸入節點外的其他節點對應的特徵圖進行串聯，將串聯後的特徵圖作為所述有向無環圖輸出的特徵圖。In some possible implementations, the feature map output by a directed acyclic graph is determined as follows: the feature maps corresponding to the nodes of the directed acyclic graph other than its input nodes are concatenated, and the concatenated feature map is used as the feature map output by the directed acyclic graph.
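
A minimal sketch of this concatenation (hypothetical names; the channel-wise concatenation is represented here as plain list concatenation):

```python
def cell_output(node_feature_maps, num_input_nodes=2):
    """Concatenate the feature maps of all non-input nodes of one graph.

    node_feature_maps: one feature map per node, input nodes first.
    """
    out = []
    for fmap in node_feature_maps[num_input_nodes:]:  # skip the input nodes
        out.extend(fmap)                              # channel-wise concat, sketched
    return out
```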

在一些可能的實施方式中，所述用於提取時間特徵的有向無環圖中的每條邊對應多個第一操作方法，所述用於提取空間特徵的有向無環圖中的每條邊對應多個第二操作方法；所述多個第一操作方法中包括所述多個第二操作方法以及至少一個區別於各所述第二操作方法的其他操作方法。In some possible implementations, each edge of the directed acyclic graphs for extracting temporal features corresponds to a plurality of first operation methods, and each edge of the directed acyclic graphs for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from each of the second operation methods.
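
In other words, the temporal graphs' candidate set is a strict superset of the spatial graphs' candidate set. A sketch with hypothetical operation names (the patent does not name the concrete operations):

```python
# Candidate operations for edges of the spatial-feature graphs (hypothetical names).
SECOND_OPS = ["skip_connect", "conv_3x3", "max_pool_3x3"]

# Candidate operations for edges of the temporal-feature graphs: the same set plus
# at least one operation that differs from every second operation method.
FIRST_OPS = SECOND_OPS + ["temporal_conv_3"]
```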

在一些可能的實施方式中，所述神經網路還包括與第一個有向無環圖連接的採樣層，所述採樣層用於對樣本視頻進行採樣，得到採樣視頻幀，並對所述採樣視頻幀進行特徵提取，得到所述採樣視頻幀對應的特徵圖，將所述採樣視頻幀對應的特徵圖輸入至第一個所述有向無環圖的目標輸入節點；所述神經網路還包括與最後一個有向無環圖的輸出節點連接的全連接層；所述全連接層用於基於最後一個有向無環圖輸出的特徵圖確定所述樣本視頻對應的多種事件的發生概率；所述基於所述樣本視頻和每一所述樣本視頻對應的事件標籤，對所述神經網路進行訓練，得到訓練後的權重參數，包括：基於所述全連接層計算的所述樣本視頻對應的多種事件的發生概率，以及每一所述樣本視頻對應的事件標籤，對所述神經網路進行訓練，得到訓練後的權重參數。In some possible implementations, the neural network further includes a sampling layer connected to the first directed acyclic graph. The sampling layer is used to sample a sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the corresponding feature map, and input that feature map to the target input node of the first directed acyclic graph. The neural network further includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph. Training the neural network based on the sample videos and the event label corresponding to each sample video to obtain trained weight parameters includes: training the neural network based on the occurrence probabilities of the multiple events corresponding to the sample videos as calculated by the fully connected layer, and on the event label corresponding to each sample video, to obtain the trained weight parameters.
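
A minimal sketch of the supervision signal described above, under the common assumption (not stated in the patent) that the fully connected layer's outputs are turned into event probabilities by a softmax and trained with cross-entropy against the event label:

```python
import math

def event_probabilities(fc_logits):
    """Softmax over the fully connected layer's outputs: one probability per event."""
    exps = [math.exp(z) for z in fc_logits]
    total = sum(exps)
    return [e / total for e in exps]

def training_loss(fc_logits, label_index):
    """Cross-entropy: negative log-likelihood of the labelled event."""
    return -math.log(event_probabilities(fc_logits)[label_index])
```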

在一些可能的實施方式中，根據以下方法得到所述有向無環圖中除輸入節點外的每個節點對應的特徵圖：根據指向當前節點的每個上一級節點對應的特徵圖、以及所述當前節點與指向所述當前節點的每個上一級節點之間的邊對應的所述操作方法的權重參數，生成所述當前節點對應的特徵圖。In some possible implementations, the feature map corresponding to each node of a directed acyclic graph other than the input nodes is obtained as follows: the feature map corresponding to the current node is generated according to the feature map corresponding to each predecessor node pointing to the current node and the weight parameters of the operation methods corresponding to the edge between the current node and each such predecessor node.

通過上述方法，通過權重參數，可以控制任一節點與該任一節點的上一節點之間的邊之間的操作方法對於該任一節點的特徵圖的影響，因此可以通過控制權重參數，來控制任一節點與任一節點的上一節點之間的邊對應的操作方法，進而改變該任一節點的特徵圖的取值。With the above method, the weight parameters control how much the operation methods on the edge between any node and its predecessor influence that node's feature map. Therefore, by adjusting the weight parameters, the operation method corresponding to the edge between a node and its predecessor can be controlled, thereby changing the value of that node's feature map.

在一些可能的實施方式中，所述根據指向所述當前節點的每個上一級節點對應的特徵圖、以及所述當前節點與指向所述當前節點的每個上一級節點之間的邊對應的所述操作方法的權重參數，生成所述當前節點對應的特徵圖，包括：針對所述當前節點與指向所述當前節點的每個上一級節點之間的當前邊，基於所述當前邊對應的各所述操作方法對所述當前邊對應的上一級節點的特徵圖進行處理，得到所述當前邊對應的各所述操作方法對應的第一中間特徵圖；所述當前邊對應的各所述操作方法對應的第一中間特徵圖按照各所述操作方法對應的權重參數進行加權求和，得到所述當前邊對應的第二中間特徵圖；將所述當前節點與指向所述當前節點的各個上一級節點之間的多條邊分別對應的第二中間特徵圖進行求和運算，得到所述當前節點對應的特徵圖。In some possible implementations, generating the feature map corresponding to the current node according to the feature map corresponding to each predecessor node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between them includes: for the current edge between the current node and each predecessor node pointing to the current node, processing the feature map of the predecessor node corresponding to the current edge with each operation method corresponding to the current edge, to obtain a first intermediate feature map for each operation method of the current edge; performing a weighted sum of the first intermediate feature maps of the current edge according to the weight parameters of the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and summing the second intermediate feature maps corresponding to the multiple edges between the current node and its predecessor nodes, to obtain the feature map corresponding to the current node.
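
The steps above can be sketched as a mixed operation per edge followed by a sum over edges. In this minimal Python sketch (hypothetical names), a softmax over the edge's weight parameters is assumed, as in DARTS-style differentiable search; the paragraph itself only specifies a weighted sum.

```python
import math

def _softmax(weights):
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]

def edge_mixture(pred_fmap, ops, weights):
    """Second intermediate feature map of one edge: apply every candidate op to the
    predecessor's feature map (the first intermediate maps), then weight-sum them."""
    coeffs = _softmax(weights)
    firsts = [op(pred_fmap) for op in ops]   # one first intermediate map per op
    return [sum(c * fm[i] for c, fm in zip(coeffs, firsts))
            for i in range(len(pred_fmap))]

def node_feature_map(pred_fmaps, per_edge_ops, per_edge_weights):
    """Feature map of the current node: sum of the per-edge mixtures over all
    edges pointing at it."""
    mixes = [edge_mixture(f, ops, w)
             for f, ops, w in zip(pred_fmaps, per_edge_ops, per_edge_weights)]
    return [sum(vals) for vals in zip(*mixes)]
```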

通過這種方法，可以使得每種操作方法都在確定節點的特徵圖時加以運用，減少單一操作方法對於節點對應的特徵圖的影響，有利於提高神經網路的識別精度。With this method, every operation method contributes to determining a node's feature map, which reduces the influence of any single operation method on the feature map of a node and helps improve the recognition accuracy of the neural network.

在一些可能的實施方式中，所述基於所述訓練後的權重參數，為所述多個有向無環圖的每條邊選擇目標操作方法，包括：針對所述有向無環圖的每一所述邊，將每一所述邊對應的權重參數最大的操作方法作為每一所述邊對應的目標操作方法。In some possible implementations, selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameters includes: for each edge, taking the operation method with the largest weight parameter on that edge as the target operation method of that edge.
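
This selection is an argmax over one edge's trained weight parameters, as in this minimal sketch (hypothetical names):

```python
def select_target_op(op_names, trained_weights):
    """Pick, for one edge, the operation whose trained weight parameter is largest."""
    best = max(range(len(trained_weights)), key=lambda i: trained_weights[i])
    return op_names[best]
```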

在一些可能的實施方式中，所述基於所述訓練後的權重參數，為所述多個有向無環圖的每條邊選擇目標操作方法，以得到訓練後的神經網路，包括：針對每一所述節點，在指向所述節點的邊的個數大於目標個數的情況下，確定指向所述節點的每條邊對應的所述目標操作方法的權重參數；按照對應的所述權重參數由大到小的順序，對指向所述節點的每條邊進行排序，將除前K位的邊外的其餘邊刪除，其中，K為所述目標個數；將進行刪除處理後的神經網路作為所述訓練後的神經網路。In some possible implementations, selecting a target operation method for each edge of the plurality of directed acyclic graphs based on the trained weight parameters to obtain a trained neural network includes: for each node, when the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node; sorting the edges pointing to the node in descending order of the corresponding weight parameters and deleting all edges other than the top K edges, where K is the target number; and taking the neural network after the deletion as the trained neural network.

通過這種方法，一方面可以降低神經網路的尺寸，另一方面可以減少神經網路的計算步驟，提高神經網路的計算效率。This method reduces the size of the neural network on the one hand, and reduces its computation steps on the other, thereby improving its computational efficiency.
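
The per-node pruning rule described above keeps only the K strongest incoming edges. A minimal sketch (hypothetical representation of an edge as an (edge_id, weight) pair):

```python
def prune_incoming_edges(edges, k):
    """Keep the K incoming edges whose target operation has the largest trained
    weight; 'edges' is a list of (edge_id, weight) pairs for one node."""
    if len(edges) <= k:          # nothing to prune
        return list(edges)
    return sorted(edges, key=lambda e: e[1], reverse=True)[:k]
```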

第二方面，本發明實施例還提供了一種視頻識別方法，包括：獲取待識別視頻；將所述待識別視頻輸入至基於第一方面或第一方面的任一種可能的實施方式所述的神經網路的訓練方法訓練得到的神經網路中，確定所述待識別視頻對應的多種事件的發生概率；將對應的發生概率符合預設條件的事件作為所述待識別視頻中發生的事件。In a second aspect, an embodiment of the present invention further provides a video recognition method, including: acquiring a video to be recognized; inputting the video to be recognized into a neural network trained with the neural network training method of the first aspect or any possible implementation of the first aspect, and determining the occurrence probabilities of multiple events corresponding to the video to be recognized; and taking an event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.
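
The final step can be sketched as filtering the predicted event probabilities by the preset condition. A simple threshold is assumed here as one possible condition; the patent does not specify the condition's form.

```python
def recognized_events(event_probs, threshold=0.5):
    """Return the events whose predicted probability meets the preset condition
    (assumed here to be a fixed probability threshold)."""
    return [event for event, p in event_probs.items() if p >= threshold]
```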

第三方面，本發明實施例提供了一種神經網路的訓練裝置，包括：構建模組，被配置為獲取樣本視頻，並構建包括多個有向無環圖的神經網路；所述多個有向無環圖中包括用於提取時間特徵的至少一個有向無環圖，和用於提取空間特徵的至少一個有向無環圖；所述有向無環圖的每條邊分別對應多個操作方法，每一所述操作方法具有對應的權重參數；訓練模組，被配置為基於所述樣本視頻和每一所述樣本視頻對應的事件標籤，對所述神經網路進行訓練，得到訓練後的權重參數；選擇模組，被配置為基於所述訓練後的權重參數，為所述多個有向無環圖的每條邊選擇目標操作方法，以得到訓練後的神經網路。In a third aspect, an embodiment of the present invention provides a neural network training apparatus, including: a construction module configured to acquire sample videos and construct a neural network including a plurality of directed acyclic graphs, where the plurality of directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to a plurality of operation methods, and each operation method has a corresponding weight parameter; a training module configured to train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and a selection module configured to select, based on the trained weight parameters, a target operation method for each edge of the plurality of directed acyclic graphs, to obtain a trained neural network.

在一些可能的實施方式中，所述有向無環圖包括兩個輸入節點；所述神經網路的每個節點對應一個特徵圖；所述構建模組，還被配置為：將第N-1個有向無環圖輸出的特徵圖作為第N+1個有向無環圖的一個輸入節點的特徵圖，並將第N個有向無環圖輸出的特徵圖作為所述第N+1個有向無環圖的另一個輸入節點的特徵圖；N為大於1的整數；其中，所述神經網路的第一個有向無環圖中的目標輸入節點對應的特徵圖為對樣本視頻的採樣視頻幀進行特徵提取後的特徵圖，除所述目標輸入節點外的另一個輸入節點為空；所述神經網路的第二個有向無環圖中一個輸入節點的特徵圖為所述第一個有向無環圖輸出的特徵圖，另一個輸入節點為空。In some possible implementations, each directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map. The construction module is further configured to: take the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and take the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. The feature map corresponding to the target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on the sampled video frames of a sample video, and the other input node of the first graph is empty; one input node of the second directed acyclic graph takes the feature map output by the first directed acyclic graph, and its other input node is empty.

在一些可能的實施方式中，所述構建模組，還被配置為將所述有向無環圖中除輸入節點外的其他節點對應的特徵圖進行串聯，將串聯後的特徵圖作為所述有向無環圖輸出的特徵圖。In some possible implementations, the construction module is further configured to concatenate the feature maps corresponding to the nodes of the directed acyclic graph other than its input nodes, and use the concatenated feature map as the feature map output by the directed acyclic graph.

在一些可能的實施方式中，所述用於提取時間特徵的有向無環圖中的每條邊對應多個第一操作方法，所述用於提取空間特徵的有向無環圖中的每條邊對應多個第二操作方法；所述多個第一操作方法中包括所述多個第二操作方法以及至少一個區別於各所述第二操作方法的其他操作方法。In some possible implementations, each edge of the directed acyclic graphs for extracting temporal features corresponds to a plurality of first operation methods, and each edge of the directed acyclic graphs for extracting spatial features corresponds to a plurality of second operation methods; the plurality of first operation methods include the plurality of second operation methods and at least one other operation method different from each of the second operation methods.

在一些可能的實施方式中，所述神經網路還包括與第一個有向無環圖連接的採樣層，所述採樣層用於對樣本視頻進行採樣，得到採樣視頻幀，並對所述採樣視頻幀進行特徵提取，得到所述採樣視頻幀對應的特徵圖，將所述採樣視頻幀對應的特徵圖輸入至第一個所述有向無環圖的目標輸入節點；所述神經網路還包括與最後一個有向無環圖的輸出節點連接的全連接層；所述全連接層用於基於該輸出節點的特徵圖確定所述樣本視頻對應的多種事件的發生概率；所述訓練模組，還被配置為：基於所述全連接層計算的所述樣本視頻對應的多種事件的發生概率，以及每一所述樣本視頻對應的事件標籤，對所述神經網路進行訓練，得到訓練後的權重參數。In some possible implementations, the neural network further includes a sampling layer connected to the first directed acyclic graph. The sampling layer is used to sample a sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the corresponding feature map, and input that feature map to the target input node of the first directed acyclic graph. The neural network further includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of multiple events corresponding to the sample video based on the feature map of that output node. The training module is further configured to train the neural network based on the occurrence probabilities of the multiple events corresponding to the sample videos as calculated by the fully connected layer, and on the event label corresponding to each sample video, to obtain the trained weight parameters.

在一些可能的實施方式中，所述構建模組，還被配置為根據指向當前節點的每個上一級節點對應的特徵圖、以及所述當前節點與指向所述當前節點的每個上一級節點之間的邊對應的所述操作方法的權重參數，生成所述當前節點對應的特徵圖。In some possible implementations, the construction module is further configured to generate the feature map corresponding to the current node according to the feature map corresponding to each predecessor node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each such predecessor node.

在一些可能的實施方式中，所述構建模組，還被配置為針對所述當前節點與指向所述當前節點的每個上一級節點之間的當前邊，基於所述當前邊對應的各所述操作方法對所述當前邊對應的上一級節點的特徵圖進行處理，得到所述當前邊對應的各所述操作方法對應的第一中間特徵圖；所述當前邊對應的各所述操作方法對應的第一中間特徵圖按照各所述操作方法對應的權重參數進行加權求和，得到所述當前邊對應的第二中間特徵圖；將所述當前節點與指向所述當前節點的各個上一級節點之間的多條邊分別對應的第二中間特徵圖進行求和運算，得到所述當前節點對應的特徵圖。In some possible implementations, the construction module is further configured to: for the current edge between the current node and each predecessor node pointing to the current node, process the feature map of the predecessor node corresponding to the current edge with each operation method corresponding to the current edge, to obtain a first intermediate feature map for each operation method of the current edge; perform a weighted sum of the first intermediate feature maps of the current edge according to the weight parameters of the respective operation methods, to obtain a second intermediate feature map corresponding to the current edge; and sum the second intermediate feature maps corresponding to the multiple edges between the current node and its predecessor nodes, to obtain the feature map corresponding to the current node.

在一些可能的實施方式中，所述選擇模組，還被配置為針對所述有向無環圖的每一所述邊，將每一所述邊對應的權重參數最大的操作方法作為每一所述邊對應的目標操作方法。In some possible implementations, the selection module is further configured to, for each edge of the directed acyclic graphs, take the operation method with the largest weight parameter on that edge as the target operation method of that edge.

在一些可能的實施方式中，所述選擇模組，還被配置為針對每一所述節點，在指向所述節點的邊的個數大於目標個數的情況下，確定指向所述節點的每條邊對應的所述目標操作方法的權重參數；按照對應的所述權重參數由大到小的順序，對指向所述節點的每條邊進行排序，將除前K位的邊外的其餘邊刪除，其中，K為所述目標個數；將進行刪除處理後的神經網路作為所述訓練後的神經網路。In some possible implementations, the selection module is further configured to: for each node, when the number of edges pointing to the node is greater than a target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node in descending order of the corresponding weight parameters and delete all edges other than the top K edges, where K is the target number; and take the neural network after the deletion as the trained neural network.

第四方面，本發明實施例還提供了一種視頻識別裝置，包括：獲取模組，被配置為獲取待識別視頻；第一確定模組，被配置為將所述待識別視頻輸入至基於第一方面或第一方面任一種可能的實施方式所述的神經網路的訓練方法訓練得到的神經網路中，確定所述待識別視頻對應的多種事件的發生概率；第二確定模組，被配置為將對應的發生概率符合預設條件的事件作為所述待識別視頻中發生的事件。In a fourth aspect, an embodiment of the present invention further provides a video recognition apparatus, including: an acquisition module configured to acquire a video to be recognized; a first determination module configured to input the video to be recognized into a neural network trained with the neural network training method of the first aspect or any possible implementation of the first aspect, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; and a second determination module configured to take an event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.

第五方面，本發明實施例還提供一種電腦設備，包括：處理器、記憶體和匯流排，所述記憶體儲存有所述處理器可執行的機器可讀指令，當電腦設備運行時，所述處理器與所述記憶體之間通過匯流排通信，所述機器可讀指令被所述處理器執行時執行上述第一方面，或第一方面中任一種可能的實施方式中的步驟，或執行上述第二方面中的步驟。In a fifth aspect, an embodiment of the present invention further provides a computer device, including a processor, a memory, and a bus. The memory stores machine-readable instructions executable by the processor. When the computer device runs, the processor and the memory communicate through the bus, and when the machine-readable instructions are executed by the processor, the steps of the first aspect or any possible implementation of the first aspect, or the steps of the second aspect, are performed.

第六方面，本發明實施例還提供一種電腦可讀儲存介質，該電腦可讀儲存介質上儲存有電腦程式，該電腦程式被處理器運行時執行上述第一方面，或第一方面中任一種可能的實施方式中的步驟，或執行上述第二方面中的步驟。In a sixth aspect, an embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is run by a processor, it performs the steps of the first aspect or any possible implementation of the first aspect, or the steps of the second aspect.

第七方面，本發明實施例還提供一種電腦程式，包括電腦可讀代碼，當所述電腦可讀代碼在電子設備中運行時，所述電子設備中的處理器執行上述第一方面，或第一方面中任一種可能的實施方式中的步驟，或執行上述第二方面中的步驟。In a seventh aspect, an embodiment of the present invention further provides a computer program including computer-readable code. When the computer-readable code runs in an electronic device, a processor in the electronic device performs the steps of the first aspect or any possible implementation of the first aspect, or the steps of the second aspect.

為使本發明的上述目的、特徵和優點能更明顯易懂,下文特舉較佳實施例,並配合所附附圖,作詳細說明如下。In order to make the above-mentioned objects, features and advantages of the present invention more obvious and easy to understand, preferred embodiments are given below, and are described in detail as follows in conjunction with the accompanying drawings.

為使本發明實施例的目的、技術方案和優點更加清楚，下面將結合本發明實施例中附圖，對本發明實施例中的技術方案進行清楚、完整地描述，顯然，所描述的實施例僅僅是本發明一部分實施例，而不是全部的實施例。通常在此處附圖中描述和示出的本發明實施例的元件可以以各種不同的配置來佈置和設計。因此，以下對在附圖中提供的本發明的實施例的詳細描述並非旨在限制要求保護的本發明的範圍，而是僅僅表示本發明的選定實施例。基於本發明的實施例，本領域技術人員在沒有做出創造性勞動的前提下所獲得的所有其他實施例，都屬於本發明保護的範圍。In order to make the purposes, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. The elements of the embodiments of the present invention generally described and illustrated in the drawings herein may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely represents selected embodiments of the invention. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort fall within the protection scope of the present invention.

相關技術中，在進行視頻識別的過程中，一般是對現有的圖像識別的神經網路加以改造，然而現有的進行圖像識別的神經網路是圖像維度上進行識別的，而忽略了一些從圖像維度上無法提取的視頻特徵，影響了神經網路的識別精度。In the related art, video recognition is generally performed by adapting an existing image recognition neural network. However, an existing image recognition neural network performs recognition in the image dimension and ignores some video features that cannot be extracted from the image dimension, which affects the recognition accuracy of the neural network.

另外,相關技術中還會採用基於進化的演算法搜索進行視頻識別的神經網路,然而這種方法每次需要對多個神經網路進行訓練,完成之後選擇性能最佳的神經網路再次進行調整,在神經網路的調整過程中的計算量較大,訓練效率較低。In addition, the related art also uses evolution-based algorithms to search for a neural network for video recognition. However, this method requires training multiple neural networks in each round, selecting the best-performing one, and adjusting it again; the adjustment process involves a large amount of computation and low training efficiency.

針對以上方案所存在的缺陷,均是發明人在經過實踐並仔細研究後得出的結果,因此,上述問題的發現過程以及下文中本發明實施例針對上述問題所提出的解決方案,都應該是發明人對本發明實施例做出的貢獻。The defects of the above solutions are results obtained by the inventor through practice and careful study. Therefore, both the discovery of the above problems and the solutions proposed for them in the following embodiments of the present invention should be regarded as contributions made by the inventor to the embodiments of the present invention.

基於此,本發明實施例提供了一種神經網路的訓練方法,所構建的神經網路中不僅包括用於提取空間特徵的有向無環圖,還包括用於提取時間特徵的有向無環圖,有向無環圖的每條邊對應多個操作方法;這樣在利用樣本視頻對神經網路進行訓練後,可以得到訓練後的操作方法的權重參數,進一步基於訓練後的操作方法的權重參數來得到訓練後的神經網路;這種方法訓練的神經網路不僅進行了圖像維度的空間特徵識別,還進行了時間維度的時間特徵識別,訓練出的神經網路對於視頻的識別精度較高。Based on this, an embodiment of the present invention provides a neural network training method. The constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to multiple operation methods. In this way, after the neural network is trained with sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can then be derived from these weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so the trained neural network has higher recognition accuracy for videos.

應注意到:相似的標號和字母在下面的附圖中表示類似項,因此,一旦某一項在一個附圖中被定義,則在隨後的附圖中不需要對其進行進一步定義和解釋。It should be noted that like numerals and letters refer to like items in the following figures, so once an item is defined in one figure, it does not require further definition and explanation in subsequent figures.

為便於對本實施例進行理解,首先對本發明實施例所公開的一種神經網路的訓練方法進行詳細介紹,本發明實施例所提供的神經網路的訓練方法的執行主體一般為具有一定計算能力的電腦設備,該電腦設備例如包括:終端設備或伺服器或其它處理設備,終端設備可以為使用者設備(User Equipment,UE)、移動設備、使用者終端、個人電腦等。此外,本發明實施例提出的方法還可以通過處理器執行電腦程式代碼實現。To facilitate understanding of this embodiment, a neural network training method disclosed in an embodiment of the present invention is first introduced in detail. The execution subject of the neural network training method provided by the embodiment of the present invention is generally a computer device with certain computing capabilities, which includes, for example, a terminal device, a server, or another processing device; the terminal device may be a user equipment (UE), a mobile device, a user terminal, a personal computer, or the like. In addition, the method provided by the embodiment of the present invention may also be implemented by a processor executing computer program code.

參見圖1所示,為本發明實施例提供的一種神經網路的訓練方法的流程圖,所述方法包括步驟101至步驟103,其中: 步驟101、獲取樣本視頻,並構建包括多個有向無環圖的神經網路。 其中,所述多個有向無環圖中包括用於提取時間特徵的至少一個有向無環圖,和用於提取空間特徵的至少一個有向無環圖;所述有向無環圖的每條邊分別對應多個操作方法,每一所述操作方法具有對應的權重參數。 步驟102、基於所述樣本視頻和每一所述樣本視頻對應的事件標籤,對所述神經網路進行訓練,得到訓練後的權重參數。 步驟103、基於所述訓練後的權重參數,為所述多個有向無環圖的每條邊選擇目標操作方法,以得到訓練後的神經網路。Referring to FIG. 1, which is a flowchart of a neural network training method provided by an embodiment of the present invention, the method includes steps 101 to 103, wherein: Step 101: Obtain sample videos, and construct a neural network including multiple directed acyclic graphs. Wherein, the multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to multiple operation methods, and each operation method has a corresponding weight parameter. Step 102: Train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters. Step 103: Based on the trained weight parameters, select a target operation method for each edge of the multiple directed acyclic graphs, to obtain a trained neural network.

以下是對上述步驟101至步驟103的詳細介紹。The following is a detailed introduction to the above steps 101 to 103 .

在一些可能的實施方式中,在構建神經網路的過程中,用於提取時間特徵的有向無環圖的個數和用於提取空間特徵的有向無環圖的個數是預先設置好的。有向無環圖的節點表示特徵圖,節點之間的邊表示操作方法。In some possible implementations, in the process of constructing the neural network, the number of directed acyclic graphs for extracting temporal features and the number of directed acyclic graphs for extracting spatial features are preset. The nodes of a directed acyclic graph represent feature maps, and the edges between nodes represent operation methods.

在構建包括多個有向無環圖的神經網路的過程中,可以將第N-1個有向無環圖輸出的特徵圖作為第N+1個有向無環圖的一個輸入節點的特徵圖,並將第N個有向無環圖輸出的特徵圖作為所述第N+1個有向無環圖的另一個輸入節點的特徵圖;N為大於1的整數。In the process of constructing a neural network including multiple directed acyclic graphs, the feature map output by the (N-1)-th directed acyclic graph can be used as the feature map of one input node of the (N+1)-th directed acyclic graph, and the feature map output by the N-th directed acyclic graph can be used as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1.

在一些可能的實現方式中,每個有向無環圖包括兩個輸入節點,可以將神經網路的第一個有向無環圖的任意一個輸入節點作為目標輸入節點,目標輸入節點的輸入為對樣本視頻的採樣視頻幀進行特徵提取後的特徵圖,所述神經網路的第一個有向無環圖中除所述目標輸入節點外的另一個輸入節點為空;神經網路的第二個有向無環圖的一個輸入節點對應的特徵圖為所述第一個有向無環圖輸出的特徵圖,另一個輸入節點為空。在其他實施例中,有向無環圖也可以包括一個、三個或更多個輸入節點。In some possible implementations, each directed acyclic graph includes two input nodes. Either input node of the first directed acyclic graph of the neural network can be used as the target input node, whose input is the feature map obtained by performing feature extraction on the sampled video frames of the sample video; the input node of the first directed acyclic graph other than the target input node is empty. The feature map corresponding to one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty. In other embodiments, a directed acyclic graph may also include one, three, or more input nodes.

其中,在確定任一有向無環圖輸出的特徵圖的過程中,可以將該有向無環圖中除輸入節點外的其他節點對應的特徵圖進行串聯(concat),並將串聯後的特徵圖作為該有向無環圖輸出的特徵圖。In the process of determining the feature map output by any directed acyclic graph, the feature maps corresponding to the nodes of that directed acyclic graph other than the input nodes can be concatenated (concat), and the concatenated feature map is used as the feature map output by that directed acyclic graph.

示例性的,構建的包括有向無環圖的神經網路的網路結構可以如圖2所示,圖2中包括三個有向無環圖,白色圓點表示輸入節點,黑色圓點表示將有向無環圖中除輸入節點外的其他節點對應的特徵圖進行串聯後的特徵圖,第一個有向無環圖的一個輸入節點對應樣本視頻的採樣視頻幀的特徵圖,另一個輸入節點為空,第一個有向無環圖的輸出節點對應的特徵圖作為第二個有向無環圖的其中一個輸入節點對應的特徵圖,第二個有向無環圖的另一個輸入節點為空,第二個有向無環圖輸出的特徵圖和第一個有向無環圖輸出的特徵圖分別作為第三個有向無環圖的兩個輸入節點對應的特徵圖,以此類推。Exemplarily, the network structure of the constructed neural network including directed acyclic graphs can be as shown in Figure 2, which includes three directed acyclic graphs. White dots represent input nodes, and black dots represent the feature map obtained by concatenating the feature maps corresponding to the nodes of a directed acyclic graph other than its input nodes. One input node of the first directed acyclic graph corresponds to the feature map of the sampled video frames of the sample video, and the other input node is empty; the feature map corresponding to the output node of the first directed acyclic graph serves as the feature map of one input node of the second directed acyclic graph, and the other input node of the second directed acyclic graph is empty; the feature map output by the second directed acyclic graph and the feature map output by the first directed acyclic graph respectively serve as the feature maps corresponding to the two input nodes of the third directed acyclic graph, and so on.
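The cell-chaining rule described above — each directed acyclic graph takes the outputs of the two preceding graphs as its inputs, with the missing inputs of the first graphs left empty — can be sketched as a toy Python loop. This is an illustrative assumption of the wiring only: `run_cell` stands in for a whole directed acyclic graph, and the arithmetic inside it is a placeholder, not real feature-map computation.

```python
def run_cell(input_a, input_b):
    # Toy stand-in for one directed acyclic graph: an empty input (None)
    # contributes nothing; "+1" just marks that the cell transformed its inputs.
    a = input_a if input_a is not None else 0
    b = input_b if input_b is not None else 0
    return a + b + 1

def forward(stem_feature, num_cells):
    # The first cell has one real input (the stem feature map) and one empty input.
    prev_prev, prev = None, stem_feature
    for _ in range(num_cells):
        out = run_cell(prev_prev, prev)
        # Cell k+1 receives the outputs of cells k-1 and k.
        prev_prev, prev = prev, out
    return prev

result = forward(stem_feature=1, num_cells=3)
```

With `num_cells=3`, the wiring above mirrors the three-graph chain of Figure 2.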

在一種實施方式中,用於提取時間特徵的有向無環圖中的每條邊對應多個第一操作方法,用於提取空間特徵的有向無環圖中的每條邊對應多個第二操作方法,所述多個第一操作方法中包括所述多個第二操作方法以及至少一個區別於各所述第二操作方法的其他操作方法。In one embodiment, each edge in a directed acyclic graph for extracting temporal features corresponds to multiple first operation methods, and each edge in a directed acyclic graph for extracting spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method that differs from each of the second operation methods.

示例性的,用於提取空間特徵的有向無環圖中的每條邊對應的多個第二操作方法可以包括平均池化操作(如1×3×3的平均池化)、最大值池化操作(如1×3×3的最大值池化)、離散卷積操作(如1×3×3的離散卷積)、帶洞離散卷積(如1×3×3的帶洞離散卷積);用於提取時間特徵的有向無環圖中的每條邊對應的多個第一操作方法可以包括平均池化操作、最大值池化操作、離散卷積操作、帶洞離散卷積、以及不同的時間卷積。Exemplarily, the multiple second operation methods corresponding to each edge in a directed acyclic graph for extracting spatial features may include an average pooling operation (e.g., 1×3×3 average pooling), a max pooling operation (e.g., 1×3×3 max pooling), a discrete convolution operation (e.g., 1×3×3 discrete convolution), and a dilated discrete convolution (e.g., 1×3×3 dilated discrete convolution); the multiple first operation methods corresponding to each edge in a directed acyclic graph for extracting temporal features may include the average pooling operation, the max pooling operation, the discrete convolution operation, the dilated discrete convolution, and different temporal convolutions.

其中,所述時間卷積用於提取時間特徵。示例性的,時間卷積可以是3+3×3尺寸的時間卷積,3+3×3尺寸的時間卷積表示在時間維度上的卷積核的大小是3,在空間維度上卷積核的大小是3×3,其處理過程示例性的可以如圖3a所示,Cin 表示輸入的特徵圖,Cout 表示經過處理後輸出的特徵圖,ReLU表示啟動函數,conv1×3×3表示時間維度上卷積核大小是1、空間維度上卷積核大小是3×3的卷積操作,conv3×1×1表示時間維度上卷積核大小是3、空間維度上卷積核大小是1×1的卷積操作,BatchNorm表示歸一化操作,T、W、H分別表示時間維度和空間的兩個維度。Wherein, the temporal convolution is used to extract temporal features. Exemplarily, the temporal convolution may be a temporal convolution of size 3+3×3, which means the convolution kernel size in the time dimension is 3 and the kernel size in the spatial dimensions is 3×3. Its processing procedure can be exemplarily shown in Figure 3a, where C_in denotes the input feature map, C_out denotes the output feature map after processing, ReLU denotes the activation function, conv1×3×3 denotes a convolution whose kernel size is 1 in the time dimension and 3×3 in the spatial dimensions, conv3×1×1 denotes a convolution whose kernel size is 3 in the time dimension and 1×1 in the spatial dimensions, BatchNorm denotes a normalization operation, and T, W, H denote the time dimension and the two spatial dimensions, respectively.

示例性的,時間卷積也可以是3+1×1尺寸的時間卷積,3+1×1尺寸的時間卷積表示在時間維度上的卷積核的大小是3,在空間維度上卷積核的大小是1×1,其處理過程示例性的可以如圖3b所示,conv1×1×1表示時間維度上卷積核大小是1、空間維度上卷積核大小是1×1的卷積操作,其餘符號的含義與圖3a中的含義相同,在此將不再贅述。Exemplarily, the temporal convolution may also be a temporal convolution of size 3+1×1, which means the convolution kernel size in the time dimension is 3 and the kernel size in the spatial dimensions is 1×1. Its processing procedure can be exemplarily shown in Figure 3b, where conv1×1×1 denotes a convolution whose kernel size is 1 in the time dimension and 1×1 in the spatial dimensions; the meanings of the other symbols are the same as in Figure 3a and will not be repeated here.
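One common motivation for splitting a spatio-temporal kernel into a spatial conv1×3×3 followed by a temporal conv3×1×1, as in the figures above, is a reduced weight count compared with a single fused 3×3×3 kernel; this motivation is an assumption added here for illustration, not a claim made in the text. A minimal count per input/output channel pair, ignoring biases:

```python
def kernel_weights(t, h, w):
    # Number of weights in one t x h x w convolution kernel
    # (per input/output channel pair, bias ignored).
    return t * h * w

full_3d = kernel_weights(3, 3, 3)                               # fused 3x3x3 kernel
factorized = kernel_weights(1, 3, 3) + kernel_weights(3, 1, 1)  # conv1x3x3 + conv3x1x1
```

Here the factorized pair uses 12 weights per channel pair against 27 for the fused kernel.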

在一些可能的實施方式中,初始構建神經網路的過程中,用於提取時間特徵的各個有向無環圖的結構是相同的,但是在神經網路訓練完成之後,不同的用於提取時間特徵的有向無環圖中的邊對應的目標操作方法可能是不同的;同樣的,構建神經網路的過程中,用於提取空間特徵的各個有向無環圖的結構也是相同的,在神經網路訓練完成之後,不同的用於提取空間特徵的有向無環圖中的邊對應的目標操作方法也可能不同。In some possible implementations, during the initial construction of the neural network, the structures of the directed acyclic graphs for extracting temporal features are the same, but after the training of the neural network is completed, the target operation methods corresponding to the edges in different directed acyclic graphs for extracting temporal features may be different. Similarly, during the construction of the neural network, the structures of the directed acyclic graphs for extracting spatial features are also the same, and after the training is completed, the target operation methods corresponding to the edges in different directed acyclic graphs for extracting spatial features may also be different.

在一些可能的實施方式中,用於提取時間特徵的有向無環圖包括兩種,一種是對於輸入的特徵圖的尺寸和通道數進行改變的第一有向無環圖,一種是對於輸入的特徵圖的尺寸和通道數不進行改變的第二有向無環圖。其中,第一有向無環圖中可以包括第一預設個數的節點,第二有向無環圖中可以包括第二預設個數的節點,第一預設個數和第二預設個數可以相同。用於提取空間特徵的有向無環圖也可以包括兩種,一種是對於輸入的特徵圖的尺寸和通道數進行改變的第三有向無環圖,一種是對於輸入的特徵圖的尺寸和通道數不進行改變的第四有向無環圖,其中,第三有向無環圖中可以包括第三預設個數的節點,第四有向無環圖中可以包括第四預設個數的節點,第三預設個數和第四預設個數可以相同。In some possible implementations, the directed acyclic graphs for extracting temporal features include two kinds: a first directed acyclic graph that changes the size and number of channels of the input feature map, and a second directed acyclic graph that does not change the size and number of channels of the input feature map. The first directed acyclic graph may include a first preset number of nodes, the second directed acyclic graph may include a second preset number of nodes, and the first preset number and the second preset number may be the same. The directed acyclic graphs for extracting spatial features may also include two kinds: a third directed acyclic graph that changes the size and number of channels of the input feature map, and a fourth directed acyclic graph that does not change the size and number of channels of the input feature map. The third directed acyclic graph may include a third preset number of nodes, the fourth directed acyclic graph may include a fourth preset number of nodes, and the third preset number and the fourth preset number may be the same.

因此,在構建的神經網路中包括上述四種有向無環圖,實際應用中,每一種有向無環圖對應的預設個數的節點包括該有向無環圖中每一級的節點的個數,在確定每一級節點個數之後,可以直接確定各個節點之間的連接關係,進而確定有向無環圖。Therefore, the constructed neural network includes the above four kinds of directed acyclic graphs. In practical applications, the preset number of nodes corresponding to each kind of directed acyclic graph includes the number of nodes at each level of that directed acyclic graph; after the number of nodes at each level is determined, the connection relationships between the nodes can be determined directly, and thus the directed acyclic graph can be determined.

示例性的,包含四種有向無環圖的神經網路的網路結構可以如圖4所示,在將樣本視頻輸入至神經網路之後,可以先輸入採樣層,對樣本視頻進行採樣,然後對採樣之後的樣本視頻幀進行特徵提取,輸入至第一個有向無環圖中,最後一個有向無環圖的輸出輸入全連接層中,全連接層的輸出即為神經網路的輸出。Exemplarily, the network structure of a neural network including the four kinds of directed acyclic graphs can be as shown in Figure 4. After the sample video is input into the neural network, it first passes through a sampling layer, which samples the sample video; feature extraction is then performed on the sampled video frames, which are input into the first directed acyclic graph; the output of the last directed acyclic graph is input into the fully connected layer, and the output of the fully connected layer is the output of the neural network.

這裡需要說明的是,通過有向無環圖控制特徵圖的尺寸和通道數,一方面可以擴大神經網路的感受野,另一方面可以減少神經網路的計算量,提高計算效率。上述方法中,所構建的神經網路中不僅包括用於提取空間特徵的有向無環圖,還包括用於提取時間特徵的有向無環圖,有向無環圖的每條邊對應多個操作方法;這樣在利用樣本視頻對神經網路進行訓練後,可以得到訓練後的操作方法的權重參數,進一步基於訓練後的操作方法的權重參數來得到訓練後的神經網路;這種方法訓練的神經網路不僅進行了圖像維度的空間特徵識別,還進行了時間維度的時間特徵識別,訓練出的神經網路對於視頻的識別精度較高。It should be noted here that by controlling the size and number of channels of the feature maps through the directed acyclic graphs, the receptive field of the neural network can be expanded on the one hand, and the amount of computation of the neural network can be reduced and the computational efficiency improved on the other. In the above method, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to multiple operation methods. In this way, after the neural network is trained with sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network can then be derived from these weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so it has higher recognition accuracy for videos.

在一些可能的實施方式中,在確定有向無環圖中除輸入節點外的每個節點對應的特徵圖時,可以根據指向當前節點的每個上一級節點對應的特徵圖、以及所述當前節點與指向所述當前節點的每個上一級節點之間的邊對應的所述操作方法的權重參數,生成所述當前節點對應的特徵圖。In some possible implementations, when determining the feature map corresponding to each node other than the input nodes in a directed acyclic graph, the feature map corresponding to the current node can be generated according to the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.

示例性的,若有向無環圖如圖5所示,則在確定節點3對應的特徵圖時,指向節點3的節點為節點0、節點1和節點2,則可以根據節點0、節點1和節點2對應的特徵圖,以及節點0、節點1和節點2分別與節點3之間的邊對應的操作方法的權重參數,確定節點3對應的特徵圖。Exemplarily, if the directed acyclic graph is as shown in Figure 5, then when determining the feature map corresponding to node 3, the nodes pointing to node 3 are node 0, node 1 and node 2, so the feature map corresponding to node 3 can be determined according to the feature maps corresponding to node 0, node 1 and node 2, and the weight parameters of the operation methods corresponding to the edges between node 0, node 1, node 2 and node 3, respectively.

其中,若該有向無環圖為用於提取時間特徵的有向無環圖,則節點0、節點1和節點2分別與節點3之間的邊對應的操作方法為第一操作方法,若該有向無環圖為用於提取空間特徵的有向無環圖,則節點0、節點1和節點2分別與節點3之間的邊對應的操作方法為第二操作方法。Wherein, if the directed acyclic graph is a directed acyclic graph for extracting temporal features, the operation methods corresponding to the edges between node 0, node 1, node 2 and node 3 are first operation methods; if the directed acyclic graph is a directed acyclic graph for extracting spatial features, the operation methods corresponding to the edges between node 0, node 1, node 2 and node 3 are second operation methods.

通過上述方法,權重參數可以控制任一節點與其上一級節點之間的邊對應的操作方法對該節點的特徵圖的影響,因此可以通過控制權重參數,來控制任一節點與其上一級節點之間的邊對應的操作方法,進而改變該節點的特徵圖的取值。Through the above method, the weight parameters control how much the operation method on the edge between a node and its upper-level node influences that node's feature map; therefore, by controlling the weight parameters, the operation method corresponding to the edge between a node and its upper-level node can be controlled, thereby changing the value of that node's feature map.

在生成節點對應的特徵圖的過程中,可以參照圖6所示的方法,包括以下幾個步驟。In the process of generating the feature map corresponding to the node, the method shown in FIG. 6 may be referred to, including the following steps.

步驟601、針對所述當前節點與指向所述當前節點的每個上一級節點之間的當前邊,基於所述當前邊對應的各所述操作方法對所述當前邊對應的上一級節點的特徵圖進行處理,得到所述當前邊對應的各所述操作方法對應的第一中間特徵圖。Step 601: For the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge by each of the operation methods corresponding to the current edge, to obtain a first intermediate feature map corresponding to each operation method of the current edge.

示例性的,若當前節點所在的有向無環圖為用於進行時間特徵提取的有向無環圖,指向當前節點的有三條當前邊,每條當前邊對應六個第一操作方法,則針對任一條當前邊,可以通過該條當前邊對應的每一個操作方法對與該條當前邊連接的上一節點對應的特徵圖分別進行處理,則可以得到該條當前邊對應的六個第一中間特徵圖,指向該當前節點的有三條當前邊,則通過計算,可以得到十八個第一中間特徵圖。Exemplarily, if the directed acyclic graph where the current node is located is a directed acyclic graph for temporal feature extraction, there are three current edges pointing to the current node, and each current edge corresponds to six first operation methods. For any current edge, the feature map corresponding to the upper-level node connected by that edge can be processed by each operation method corresponding to that edge, yielding six first intermediate feature maps for that edge; since three current edges point to the current node, eighteen first intermediate feature maps can be obtained in total.

若當前節點所在的有向無環圖為用於進行空間特徵提取的有向無環圖,指向當前節點的有三條當前邊,每條當前邊對應四個第二操作方法,與上述計算方法類似,每條當前邊對應的第一中間特徵圖為四個,通過計算可以得到十二個第一中間特徵圖。If the directed acyclic graph where the current node is located is a directed acyclic graph for spatial feature extraction, there are three current edges pointing to the current node, and each current edge corresponds to four second operation methods. Similarly to the above calculation, each current edge corresponds to four first intermediate feature maps, and twelve first intermediate feature maps can be obtained in total.

步驟602、將所述當前邊對應的各所述操作方法對應的第一中間特徵圖按照各所述操作方法對應的權重參數進行加權求和,得到所述當前邊對應的第二中間特徵圖。Step 602: Perform a weighted summation of the first intermediate feature maps corresponding to the operation methods corresponding to the current edge according to the weight parameters corresponding to the operation methods to obtain a second intermediate feature map corresponding to the current edge.

所述權重參數為待訓練的模型參數,在一些可能的實施方式中,可以給權重參數隨機賦值,然後在神經網路的訓練過程中不斷調整。The weight parameter is a model parameter to be trained. In some possible implementations, the weight parameter can be randomly assigned and then adjusted continuously during the training process of the neural network.

每條指向當前節點的當前邊對應的操作方法都有對應的權重參數,在將各個操作方法對應的第一中間特徵圖按照對應的權重參數進行加權求和時,可以將第一中間特徵圖對應位置處的取值與該第一中間特徵圖對應的操作方法的權重參數相乘,然後將對應位置處的相乘結果進行相加,得到該條當前邊對應的第二中間特徵圖。Each operation method corresponding to each current edge pointing to the current node has a corresponding weight parameter. When the first intermediate feature maps corresponding to the operation methods are weighted and summed according to the corresponding weight parameters, the value at each position of a first intermediate feature map can be multiplied by the weight parameter of the operation method corresponding to that feature map, and the multiplication results at corresponding positions are then added, to obtain the second intermediate feature map corresponding to that current edge.

延續步驟601中的例子,指向當前節點的有三條邊,每條當前邊對應六個第一操作方法,每個第一操作方法都有對應的權重參數,每條當前邊可以對應六個第一中間特徵圖,然後將每條當前邊對應的六個第一中間特徵圖按照對應的權重參數進行加權求和,得到每條當前邊對應的第二中間特徵圖。Continuing the example in step 601, there are three edges pointing to the current node; each current edge corresponds to six first operation methods, each first operation method has a corresponding weight parameter, and each current edge can thus correspond to six first intermediate feature maps. The six first intermediate feature maps corresponding to each current edge are then weighted and summed according to the corresponding weight parameters, to obtain the second intermediate feature map corresponding to each current edge.

這裡需要說明的是,不同邊對應的同一種操作方法的權重參數可能不同,例如,邊1和邊2均指向當前節點,邊1和邊2對應的操作方法中均包括平均池化操作,邊1對應的平均池化操作的權重參數可能為70%,邊2對應的平均池化操作的權重參數可能為10%。It should be noted here that the weight parameters of the same operation method may differ between edges. For example, edge 1 and edge 2 both point to the current node, and the operation methods corresponding to both edges include the average pooling operation; the weight parameter of the average pooling operation on edge 1 may be 70%, while that on edge 2 may be 10%.

示例性的,在計算第 $i$ 個節點和第 $j$ 個節點之間的邊對應的第二特徵圖時,可以通過如下公式(1)進行計算: $\bar{o}^{(i,j)}(x^{(i)}) = \sum_{o \in O} \frac{\exp(\alpha^{(i,j)}_{o})}{\sum_{o' \in O} \exp(\alpha^{(i,j)}_{o'})}\, o(x^{(i)})$ 公式(1); 其中,$o$ 和 $o'$ 表示操作方法,$O$ 表示第 $i$ 個節點和第 $j$ 個節點之間的操作方法的集合,$\alpha^{(i,j)}_{o}$ 表示第 $i$ 個節點和第 $j$ 個節點之間的邊對應的操作方法 $o$ 的權重參數,$\alpha^{(i,j)}_{o'}$ 表示第 $i$ 個節點和第 $j$ 個節點之間的邊對應的操作方法 $o'$ 的權重參數,$x^{(i)}$ 表示第 $i$ 個節點對應的特徵圖,$\bar{o}^{(i,j)}(x^{(i)})$ 表示第 $i$ 個節點和第 $j$ 個節點之間的邊對應的第二特徵圖。Exemplarily, when calculating the second feature map corresponding to the edge between the $i$-th node and the $j$-th node, the calculation can be performed by the following formula (1): $\bar{o}^{(i,j)}(x^{(i)}) = \sum_{o \in O} \frac{\exp(\alpha^{(i,j)}_{o})}{\sum_{o' \in O} \exp(\alpha^{(i,j)}_{o'})}\, o(x^{(i)})$, Formula (1); where $o$ and $o'$ denote operation methods, $O$ denotes the set of operation methods between the $i$-th node and the $j$-th node, $\alpha^{(i,j)}_{o}$ denotes the weight parameter of the operation method $o$ corresponding to the edge between the $i$-th node and the $j$-th node, $\alpha^{(i,j)}_{o'}$ denotes the weight parameter of the operation method $o'$ corresponding to that edge, $x^{(i)}$ denotes the feature map corresponding to the $i$-th node, and $\bar{o}^{(i,j)}(x^{(i)})$ denotes the second feature map corresponding to the edge between the $i$-th node and the $j$-th node.
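The weighted combination over an edge's candidate operations can be sketched numerically as below. This is a toy illustration under the assumption that the raw weight parameters are normalized with a softmax before the weighted sum; the two candidate operations are scalar stand-ins, not the pooling and convolution methods themselves.

```python
import math

def mixed_edge_output(x, ops, alphas):
    # ops: candidate operations on the edge; alphas: their raw weight parameters.
    # Each operation's contribution is scaled by its softmax-normalized weight.
    denom = sum(math.exp(a) for a in alphas)
    return sum(math.exp(a) / denom * op(x) for op, a in zip(ops, alphas))

ops = [
    lambda x: x,        # stand-in for e.g. a pooling operation
    lambda x: 2.0 * x,  # stand-in for e.g. a convolution operation
]
out = mixed_edge_output(3.0, ops, alphas=[0.0, 0.0])  # equal weights -> plain average
```

With equal raw weights the edge outputs the average of the candidates; as one weight grows, the edge's output approaches that single operation, which is what step 103 exploits when picking the target operation method for each edge.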

步驟603、將所述當前節點與指向所述當前節點的各個上一級節點之間的多條邊分別對應的第二中間特徵圖進行求和運算,得到所述當前節點對應的特徵圖。Step 603: Perform a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.

其中,各個第二中間特徵圖的尺寸是相同的,在將各個第二中間特徵圖進行求和運算時,可以將各個第二中間特徵圖對應位置處的取值相加,得到當前節點對應的特徵圖。Wherein, the sizes of the second intermediate feature maps are the same. When the summation operation is performed on the second intermediate feature maps, the values at corresponding positions of the second intermediate feature maps can be added, to obtain the feature map corresponding to the current node.

另外,構建的神經網路中還包括採樣層和全連接層,所述採樣層用於對輸入神經網路的視頻進行採樣,得到採樣視頻幀,並對採樣視頻幀進行特徵提取,得到所述採樣視頻幀對應的特徵圖,然後將採樣視頻幀對應的特徵圖輸入至第一個有向無環圖的目標輸入節點。所述全連接層用於基於最後一個有向無環圖輸出的特徵圖確定所述樣本視頻對應的多種事件的發生概率,綜上,構建的神經網路的整體結構示例性的如圖7所示,圖7中包括三個有向無環圖,一個全連接層,一個採樣層,全連接層的輸出即為神經網路的輸出。In addition, the constructed neural network further includes a sampling layer and a fully connected layer. The sampling layer is used to sample the video input to the neural network to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain the feature maps corresponding to the sampled video frames, and then input the feature maps corresponding to the sampled video frames to the target input node of the first directed acyclic graph. The fully connected layer is used to determine the occurrence probabilities of various events corresponding to the sample video based on the feature map output by the last directed acyclic graph. To sum up, the overall structure of the constructed neural network is exemplarily shown in Figure 7, which includes three directed acyclic graphs, one fully connected layer and one sampling layer; the output of the fully connected layer is the output of the neural network.

通過這種方法,可以使得每種操作方法都在確定節點的特徵圖時加以運用,減少單一操作方法對於節點對應的特徵圖的影響,有利於提高神經網路的識別精度。Through this method, each operation method can be used in determining the feature map of the node, reducing the influence of a single operation method on the feature map corresponding to the node, which is beneficial to improve the recognition accuracy of the neural network.

樣本視頻對應的事件標籤用於表示樣本視頻中所發生的事件,示例性的,樣本視頻中所發生的事件可以包括人跑步、小狗玩耍、兩個人打羽毛球等。在一些可能的實施方式中,在基於樣本視頻和樣本視頻對應的事件標籤,對構建的神經網路進行訓練時,可以通過如圖8所示的方法,包括以下幾個步驟。The event tag corresponding to the sample video is used to represent the event that occurs in the sample video. Exemplarily, the event that occurs in the sample video may include a person running, a dog playing, and two people playing badminton. In some possible implementations, when training the constructed neural network based on the sample video and the event label corresponding to the sample video, the method shown in FIG. 8 may be used, including the following steps.

步驟801、將樣本視頻輸入至神經網路中,輸出得到樣本視頻對應的多種事件的發生概率。Step 801: Input the sample video into the neural network, and output the probability of occurrence of various events corresponding to the sample video.

這裡,樣本視頻對應的多種事件的個數與訓練神經網路的樣本視頻的事件標籤的種類個數相同,例如若通過400種事件標籤的樣本視頻對神經網路進行訓練,則在將任一視頻輸入至神經網路之後,神經網路可以輸出輸入的視頻對應的400種事件分別的發生概率。Here, the number of events corresponding to the sample video is the same as the number of kinds of event labels of the sample videos used to train the neural network. For example, if the neural network is trained with sample videos carrying 400 kinds of event labels, then after any video is input to the neural network, the neural network can output the occurrence probability of each of the 400 kinds of events for that video.

步驟802、基於樣本視頻對應的多種事件的發生概率,確定樣本視頻對應的預測事件。Step 802: Determine the predicted event corresponding to the sample video based on the occurrence probability of various events corresponding to the sample video.

例如,可以將對應的發生概率最大的事件確定為神經網路預測的事件,在另外一些可能的實施方式中,樣本視頻可能攜帶有多個事件標籤,例如同時攜帶有小狗玩耍、兩個人打羽毛球的事件標籤,因此在基於樣本視頻對應的多種事件的發生概率,確定樣本視頻對應的預測事件的過程中,還可以將對應的發生概率大於預設概率的事件確定為樣本視頻對應的預測事件。For example, the event with the highest occurrence probability may be determined as the event predicted by the neural network. In some other possible implementations, a sample video may carry multiple event labels, for example both the "dog playing" and "two people playing badminton" labels; therefore, in the process of determining the predicted events corresponding to the sample video based on the occurrence probabilities of the various events, the events whose occurrence probabilities are greater than a preset probability may also be determined as the predicted events corresponding to the sample video.
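The two selection rules of step 802 — highest probability for a single-label video, and a preset probability threshold when a video may carry several event labels — can be sketched as below; the event names and the 0.30 threshold are illustrative assumptions, not values from the text.

```python
def predict_events(probs, threshold=None):
    # probs: mapping from event name to predicted occurrence probability.
    if threshold is None:
        # Single-label case: take the event with the highest probability.
        return [max(probs, key=probs.get)]
    # Multi-label case: keep every event whose probability exceeds the threshold.
    return [event for event, p in probs.items() if p > threshold]

probs = {"running": 0.10, "dog_playing": 0.55, "badminton": 0.35}
single = predict_events(probs)
multi = predict_events(probs, threshold=0.30)
```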

步驟803、基於樣本視頻對應的預測事件以及樣本視頻的事件標籤,確定本次訓練過程中的損失值。Step 803: Determine the loss value in this training process based on the predicted event corresponding to the sample video and the event label of the sample video.

示例性的,可以基於樣本視頻對應的預測事件以及樣本視頻的事件標籤確定本次訓練過程中的交叉熵損失。Exemplarily, the cross-entropy loss in this training process may be determined based on the predicted event corresponding to the sample video and the event label of the sample video.

步驟804、判斷本次訓練過程中的損失值是否小於預設損失值。Step 804: Determine whether the loss value in this training process is less than the preset loss value.

在判斷結果為是的情況下,則繼續執行步驟805;在判斷結果為否的情況下,則調整本次訓練過程中的神經網路參數的參數值,並返回執行步驟801。If the judgment result is yes, step 805 is executed next; if the judgment result is no, the parameter values of the neural network parameters in this training process are adjusted, and the process returns to step 801.

這裡,調整的神經網路參數包括有向無環圖的各個邊對應的操作方法的權重參數,由於各個權重參數可以影響有向無環圖的各個邊對應的目標操作方法的選擇,因此這裡的權重參數可以作為神經網路的結構參數;調整的神經網路參數中還包括指令引數,例如可以包括各個卷積操作的卷積核的大小、權重等。Here, the adjusted neural network parameters include the weight parameters of the operation methods corresponding to the edges of the directed acyclic graphs; since each weight parameter can affect the selection of the target operation method for the corresponding edge, these weight parameters can be regarded as structure parameters of the neural network. The adjusted neural network parameters also include instruction arguments, which may include, for example, the sizes and weights of the convolution kernels of the convolution operations.

由於結構參數和指令引數的收斂速度相差較大,在指令引數處於學習的早期,學習速率較小的情況下,可能會導致結構參數的快速收斂,因此可以通過控制學習速率實現指令引數和結構參數的同步學習。Since the convergence speeds of the structure parameters and the instruction arguments differ greatly, rapid convergence of the structure parameters may occur when the instruction arguments are in the early stage of learning and the learning rate is small; therefore, synchronous learning of the instruction arguments and the structure parameters can be achieved by controlling the learning rate.

Exemplarily, a stepwise learning rate decay strategy may be adopted: a hyperparameter S is preset, indicating that the learning rate is decayed once, by a preset magnitude d, every time the instruction arguments and the structural parameters have been jointly optimized S times. The learning rate thus decays gradually, realizing synchronous learning, i.e., synchronous optimization, of the structural parameters and the instruction arguments.
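A minimal sketch of the stepwise decay schedule; treating d as a multiplicative factor is an assumption here, since the patent only says the rate is decayed by a preset magnitude d:

```python
def decayed_learning_rate(base_lr, step, S, d):
    """Learning rate after `step` joint updates of the instruction arguments
    and structural parameters: decayed once every S updates by factor d
    (the multiplicative interpretation of d is an assumption)."""
    return base_lr * (d ** (step // S))

# With S = 100 and d = 0.5, the rate halves every 100 joint optimization steps.
lr_early = decayed_learning_rate(0.1, 50, 100, 0.5)
lr_later = decayed_learning_rate(0.1, 250, 100, 0.5)
```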

In the prior art, parameter optimization is generally performed through the following formulas (2) and (3):

$\omega^{*}(\alpha) = \arg\min_{\omega} L_{train}(\omega, \alpha)$   formula (2);

$\min_{\alpha} L_{val}(\omega^{*}(\alpha), \alpha)$   formula (3);

In the above formula (2), α denotes the structural parameters and ω denotes the instruction arguments; $L_{train}(\omega, \alpha)$ is the loss value computed from ω with α fixed, and $\omega^{*}(\alpha)$ is the value of ω obtained by training ω, with α fixed, until $L_{train}(\omega, \alpha)$ is minimal, i.e., the optimized ω. In the above formula (3), with the optimized ω held fixed, α is trained so that $L_{val}(\omega^{*}(\alpha), \alpha)$, the loss value computed from α, is minimized. In this method, α needs to be adjusted continuously, and every adjustment of α requires retraining ω. Exemplarily, if each training of ω requires 100 computations and α is adjusted 100 times, 10,000 computations are ultimately required, which is a large amount of computation.

In the method provided by the embodiments of the present invention, parameter optimization is generally performed based on the following formulas:

$\omega' = \omega - \xi \nabla_{\omega} L_{train}(\omega, \alpha)$   formula (4);

$\nabla_{\alpha} L_{val}(\omega^{*}(\alpha), \alpha) \approx \nabla_{\alpha} L_{val}(\omega - \xi \nabla_{\omega} L_{train}(\omega, \alpha), \alpha)$   formula (5);

In the above formulas, ξ denotes the learning rate of the instruction arguments, and $\nabla_{\omega} L_{train}(\omega, \alpha)$ denotes the gradient of ω computed based on $L_{train}(\omega, \alpha)$. The optimized ω is computed approximately in this way, so that each time the value of α is optimized, optimizing ω requires only a single computation; α and ω can therefore be regarded as being optimized simultaneously.

Based on this method, the network parameters inside the neural network can be searched for at the same time as the neural network architecture. Compared with methods that first determine the network architecture and then determine the network parameters, this improves the efficiency of determining the neural network.

步驟805、基於訓練好的神經網路參數,確定訓練好的神經網路模型。Step 805: Determine the trained neural network model based on the trained neural network parameters.

In some possible implementations, a target operation method may be selected for each edge of the multiple directed acyclic graphs based on the trained weight parameters; the neural network model obtained after a target operation method has been determined for every edge is the trained neural network.

Exemplarily, when selecting a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters, for each edge of the directed acyclic graph, the operation method with the largest weight parameter on that edge is taken as the target operation method corresponding to that edge.
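The per-edge argmax selection can be sketched as follows; the edge and operation names are illustrative, not from the patent:

```python
def select_target_ops(edge_weights):
    """Pick, for every edge, the candidate operation method whose trained
    weight parameter is largest."""
    return {edge: max(ops, key=ops.get) for edge, ops in edge_weights.items()}

# Hypothetical trained weights: edge -> {operation name: weight parameter}.
trained = {
    ("node0", "node1"): {"conv_3x3": 0.6, "skip_connect": 0.3, "max_pool": 0.1},
    ("node0", "node2"): {"conv_3x3": 0.2, "skip_connect": 0.7, "max_pool": 0.1},
}
chosen = select_target_ops(trained)
```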

In some other possible implementations, in order to reduce the size of the neural network and increase its computation speed, after the target operation method has been selected for each edge of the multiple directed acyclic graphs, the edges of the directed acyclic graphs may additionally be pruned, and the pruned neural network is then taken as the trained neural network.

Specifically, for each node, when the number of edges pointing to the node is greater than a target number, the weight parameter of the target operation method corresponding to each edge pointing to the node is determined; the edges pointing to the node are sorted by their corresponding weight parameters in descending order, and all edges other than the top K are deleted, where K is the target number; the neural network after this deletion processing is taken as the trained neural network.

Exemplarily, if the target number is two and three edges point to a certain node, the weight parameters of the target operation methods corresponding to the three edges pointing to that node can be determined respectively, the three edges are sorted by weight parameter in descending order, the top two edges are retained, and the third-ranked edge is deleted.
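The top-K edge pruning in this example can be sketched as follows; node names and weights are illustrative:

```python
def prune_incoming_edges(edge_weights, k):
    """Keep only the K incoming edges whose target operation methods have
    the largest weight parameters; delete the rest.

    edge_weights maps predecessor node -> weight parameter of that edge's
    chosen (target) operation method."""
    if len(edge_weights) <= k:
        return dict(edge_weights)
    kept = sorted(edge_weights, key=edge_weights.get, reverse=True)[:k]
    return {node: edge_weights[node] for node in kept}

# Three edges point at a node and the target number K is 2,
# so the lowest-weight edge is removed.
incoming = {"node_a": 0.5, "node_b": 0.8, "node_c": 0.3}
pruned = prune_incoming_edges(incoming, 2)
```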

通過這種方法,一方面可以降低神經網路的尺寸,另一方面可以減少神經網路的計算步驟,提高神經網路的計算效率。Through this method, on the one hand, the size of the neural network can be reduced, and on the other hand, the computational steps of the neural network can be reduced, and the computational efficiency of the neural network can be improved.

Based on the same concept, an embodiment of the present invention further provides a video recognition method. Referring to FIG. 9, which is a schematic flowchart of a video recognition method provided by an embodiment of the present invention, the method includes the following steps:
Step 901: Acquire a video to be recognized.
Step 902: Input the video to be recognized into a pre-trained neural network, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized. Here, the neural network is trained based on the neural network training method provided in the above embodiments.
Step 903: Take an event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized. The event whose occurrence probability meets the preset condition may be the event with the highest occurrence probability, or an event whose occurrence probability is greater than a preset probability value.
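The selection in step 903 can be sketched as follows, covering both variants of the preset condition; the event names and probabilities are illustrative:

```python
def recognized_events(event_probs, threshold=None):
    """Apply the preset condition to per-event probabilities: with no
    threshold, return the single most probable event; otherwise return
    every event whose probability exceeds the threshold."""
    if threshold is None:
        return max(event_probs, key=event_probs.get)
    return [event for event, p in event_probs.items() if p > threshold]

# Hypothetical output of the trained network for one video.
probs = {"running": 0.7, "jumping": 0.2, "swimming": 0.1}
```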

In the following, the detailed processing that the neural network performs on the video to be recognized after it is input will be introduced with reference to embodiments. The neural network includes a sampling layer, a feature extraction layer, and a fully connected layer, and the feature extraction layer includes multiple directed acyclic graphs.

1) Sampling layer
After the video to be recognized is input to the neural network, it is first input to the sampling layer. The sampling layer samples the video to be recognized to obtain multiple sampled video frames, performs feature extraction on the sampled video frames to obtain the feature maps corresponding to the sampled video frames, and then inputs these feature maps to the feature extraction layer.
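As a small illustration, evenly spaced sampling indices can be computed as below; uniform sampling is an assumption here, since the patent does not fix a particular sampling scheme:

```python
def sample_frame_indices(num_frames, num_samples):
    """Evenly spaced frame indices across a video of num_frames frames."""
    step = num_frames / num_samples
    return [int(step * i) for i in range(num_samples)]

# Sampling 4 frames from a 100-frame video picks frames 0, 25, 50, 75.
indices = sample_frame_indices(100, 4)
```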

2) Feature extraction layer
The feature extraction layer includes multiple directed acyclic graphs for temporal feature extraction and multiple directed acyclic graphs for spatial feature extraction. The number of directed acyclic graphs of each type is preset, as is the number of nodes inside each type of directed acyclic graph. The differences between the directed acyclic graphs for temporal feature extraction and those for spatial feature extraction are shown in Table 1 below:

Table 1

                                          DAG for temporal          DAG for spatial
                                          feature extraction        feature extraction
  Number of nodes                         first preset number       second preset number
  Number of operation methods per edge    6 operation methods       4 operation methods

After the sampling layer inputs the feature maps corresponding to the sampled video frames to the feature extraction layer, the feature maps may be input to the target input node of the first directed acyclic graph, whose other input node is empty. One input node of the second directed acyclic graph is connected to the output node of the first directed acyclic graph, and its other input node is empty. One input node of the third directed acyclic graph is connected to the output node of the second directed acyclic graph, another input node is connected to the output node of the first directed acyclic graph, and so on. The output node of the last directed acyclic graph inputs the corresponding feature map to the fully connected layer.
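The wiring just described can be sketched as follows; the cells are treated as opaque callables and the names are illustrative:

```python
def run_feature_extraction(cells, stem_feature):
    """Wire the DAG cells as described: the first cell receives the stem
    feature map (its other input empty), the second receives the first
    cell's output (other input empty), and every later cell receives the
    outputs of the two preceding cells."""
    outputs = []
    for i, cell in enumerate(cells, start=1):
        first = outputs[-1] if i > 1 else stem_feature
        second = outputs[-2] if i > 2 else None
        outputs.append(cell(first, second))
    return outputs[-1]  # fed to the fully connected layer

# Trace the wiring with toy "cells" that just record their inputs.
trace = []
def make_cell(name):
    def cell(a, b):
        trace.append((name, a, b))
        return name
    return cell

final = run_feature_extraction([make_cell(f"cell{i}") for i in (1, 2, 3)], "stem")
```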

3) Fully connected layer
After the feature map corresponding to the output node of the directed acyclic graph is input to the fully connected layer, the fully connected layer can determine, based on the input feature map, the occurrence probabilities of the multiple events corresponding to the input video to be recognized, where the multiple events may be the event labels corresponding to the sample videos used when training the neural network.
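A softmax over the fully connected layer's per-event scores is one common way to obtain the occurrence probabilities; softmax is an assumed choice here, since the patent only says the layer determines the probabilities:

```python
import math

def event_probabilities(logits):
    """Softmax over per-event scores, with the max subtracted for
    numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]
```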

In the method provided by the above embodiments, the constructed neural network includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to multiple operation methods. In this way, after the neural network is trained with the sample videos, the trained weight parameters of the operation methods can be obtained, and the trained neural network is further derived from these trained weight parameters. A neural network trained by this method performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so the trained neural network has high recognition accuracy for videos.

Those skilled in the art can understand that, in the above methods of the specific embodiments, the order in which the steps are written does not imply a strict execution order or constitute any limitation on the implementation process; the execution order of the steps should be determined by their functions and possible internal logic.

Based on the same inventive concept, an embodiment of the present invention further provides a neural network training apparatus corresponding to the neural network training method. Since the principle by which the apparatus in the embodiments of the present invention solves the problem is similar to the above neural network training method of the embodiments of the present invention, the implementation of the apparatus may refer to the implementation of the method, and repeated descriptions are omitted.

Referring to FIG. 10, which is a schematic architecture diagram of a neural network training apparatus provided by an embodiment of the present invention, the apparatus includes a construction module 1001, a training module 1002, and a selection module 1003, wherein:
The construction module 1001 is configured to acquire sample videos and construct a neural network including multiple directed acyclic graphs; the multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of a directed acyclic graph corresponds to multiple operation methods, and each operation method has a corresponding weight parameter.
The training module 1002 is configured to train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters.
The selection module 1003 is configured to select a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters, to obtain the trained neural network.

In some possible implementations, each directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map. The construction module 1001 is further configured to: use the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and use the feature map output by the N-th directed acyclic graph as the feature map of the other input node of the (N+1)-th directed acyclic graph, where N is an integer greater than 1. The feature map corresponding to the target input node of the first directed acyclic graph of the neural network is the feature map obtained by performing feature extraction on the sampled video frames of the sample video, and the input node other than the target input node is empty; the feature map of one input node of the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty.

In some possible implementations, the construction module 1001 is further configured to concatenate the feature maps corresponding to the nodes of the directed acyclic graph other than the input nodes, and use the concatenated feature map as the feature map output by the directed acyclic graph.

In some possible implementations, each edge in a directed acyclic graph for extracting temporal features corresponds to multiple first operation methods, and each edge in a directed acyclic graph for extracting spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method different from each of the second operation methods.

In some possible implementations, the neural network further includes a sampling layer connected to the first directed acyclic graph. The sampling layer is used for sampling the sample video to obtain sampled video frames, performing feature extraction on the sampled video frames to obtain the feature maps corresponding to the sampled video frames, and inputting these feature maps to the target input node of the first directed acyclic graph. The neural network further includes a fully connected layer connected to the output node of the last directed acyclic graph; the fully connected layer is used to determine the occurrence probabilities of the multiple events corresponding to the sample video based on the feature map output by the last directed acyclic graph. The training module 1002 is further configured to train the neural network based on the occurrence probabilities of the multiple events corresponding to the sample video calculated by the fully connected layer and the event label corresponding to each sample video, to obtain the trained weight parameters.

In some possible implementations, the construction module 1001 is further configured to generate the feature map corresponding to the current node according to the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edge between the current node and each upper-level node pointing to the current node.

In some possible implementations, the construction module 1001 is further configured to: for the current edge between the current node and each upper-level node pointing to the current node, process the feature map of the upper-level node corresponding to the current edge with each operation method corresponding to the current edge, to obtain the first intermediate feature map corresponding to each operation method of the current edge; perform a weighted sum of these first intermediate feature maps according to the weight parameters of the corresponding operation methods, to obtain the second intermediate feature map corresponding to the current edge; and sum the second intermediate feature maps corresponding to the multiple edges between the current node and the upper-level nodes pointing to it, to obtain the feature map corresponding to the current node.
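The per-node computation described above can be sketched with plain numbers standing in for feature maps; the toy operations (identity and doubling) and weights are illustrative:

```python
def node_feature(incoming_edges):
    """Compute a node's feature as described: for each incoming edge, apply
    every candidate operation to the predecessor's feature (first
    intermediate maps), weight-sum them by the operation weight parameters
    (second intermediate map), then sum over all incoming edges."""
    total = 0.0
    for pred_feature, weighted_ops in incoming_edges:
        edge_out = sum(w * op(pred_feature) for op, w in weighted_ops)
        total += edge_out
    return total

# Two incoming edges; identity and doubling are toy operation methods.
edges = [
    (1.0, [(lambda x: x, 0.5), (lambda x: 2.0 * x, 0.5)]),
    (2.0, [(lambda x: x, 1.0)]),
]
```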

In some possible implementations, the selection module 1003 is further configured to, for each edge of the directed acyclic graph, take the operation method with the largest weight parameter on that edge as the target operation method corresponding to that edge.

In some possible implementations, the selection module 1003 is further configured to: for each node, when the number of edges pointing to the node is greater than the target number, determine the weight parameter of the target operation method corresponding to each edge pointing to the node; sort the edges pointing to the node by their corresponding weight parameters in descending order and delete all edges other than the top K, where K is the target number; and take the neural network after this deletion processing as the trained neural network.

關於裝置中的各部分的處理流程、以及各部分之間的交互流程的描述可以參照上述方法實施例中的相關說明,這裡不再詳述。For the description of the processing flow of each part in the apparatus and the interaction flow between the various parts, reference may be made to the relevant descriptions in the foregoing method embodiments, which will not be described in detail here.

Based on the same inventive concept, an embodiment of the present invention further provides a video recognition apparatus corresponding to the video recognition method. Referring to FIG. 11, which is a schematic architecture diagram of a video recognition apparatus provided by an embodiment of the present invention, the apparatus includes an acquisition module 1101, a first determination module 1102, and a second determination module 1103, wherein: the acquisition module 1101 is configured to acquire a video to be recognized; the first determination module 1102 is configured to input the video to be recognized into a neural network trained based on the neural network training method described in the above embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; and the second determination module 1103 is configured to take an event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.

Based on the same technical concept, an embodiment of the present invention further provides a computer device. Referring to FIG. 12, which is a schematic structural diagram of a computer device 1200 provided by an embodiment of the present invention, the device includes a processor 1201, a memory 1202, and a bus 1203. The memory 1202 is used to store execution instructions and includes an internal memory 12021 and an external memory 12022; the internal memory 12021 is configured to temporarily store operation data in the processor 1201 and data exchanged with the external memory 12022 such as a hard disk, and the processor 1201 exchanges data with the external memory 12022 through the internal memory 12021.
When the computer device 1200 is running, the processor 1201 communicates with the memory 1202 through the bus 1203, so that the processor 1201 executes the following instructions: acquire sample videos and construct a neural network including multiple directed acyclic graphs, the multiple directed acyclic graphs including at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, where each edge of a directed acyclic graph corresponds to multiple operation methods and each operation method has a corresponding weight parameter; train the neural network based on the sample videos and the event label corresponding to each sample video, to obtain trained weight parameters; and select a target operation method for each edge of the multiple directed acyclic graphs based on the trained weight parameters, to obtain the trained neural network.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the neural network training method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The computer program product of the neural network training method provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the steps of the neural network training method described in the above method embodiments, for which reference may be made to the above method embodiments, and details are not repeated here.

Based on the same technical concept, an embodiment of the present invention further provides a computer device. Referring to FIG. 13, which is a schematic structural diagram of a computer device 1300 provided by an embodiment of the present invention, the device includes a processor 1301, a memory 1302, and a bus 1303. The memory 1302 is used to store execution instructions and includes an internal memory 13021 and an external memory 13022; the internal memory 13021 is configured to temporarily store operation data in the processor 1301 and data exchanged with the external memory 13022 such as a hard disk, and the processor 1301 exchanges data with the external memory 13022 through the internal memory 13021. When the computer device 1300 is running, the processor 1301 communicates with the memory 1302 through the bus 1303, so that the processor 1301 executes the following instructions: acquire a video to be recognized; input the video to be recognized into a neural network trained based on the neural network training method described in the above embodiments, and determine the occurrence probabilities of multiple events corresponding to the video to be recognized; and take an event whose occurrence probability meets a preset condition as an event occurring in the video to be recognized.

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the video recognition method described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.

The computer program product of the video recognition method provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code can be used to execute the steps of the video recognition method described in the above method embodiments, for which reference may be made to the above method embodiments, and details are not repeated here.

An embodiment of the present invention further provides a computer program which, when executed by a processor, implements any one of the methods of the foregoing embodiments. The computer program product can be implemented in hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, it is embodied as a software product, such as a software development kit (SDK).

Those skilled in the art can clearly understand that, for convenience and brevity of description, for the working processes of the systems and apparatuses described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here. In the several embodiments provided by the present invention, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. The apparatus embodiments described above are merely illustrative; for example, the division into units is only a division by logical function, and other divisions are possible in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.

The units described as separate components may or may not be physically separate, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

另外,在本發明各個實施例中的各功能單元可以集成在一個處理單元中,也可以是各個單元單獨物理存在,也可以兩個或兩個以上單元集成在一個單元中。In addition, each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.

If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part contributing to the prior art, or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Finally, it should be noted that the above embodiments are merely implementations of the embodiments of the present invention, used to illustrate the technical solutions of the embodiments of the present invention rather than to limit them, and the protection scope of the embodiments of the present invention is not limited thereto. Although the embodiments of the present invention have been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person familiar with this technical field may, within the technical scope disclosed by the embodiments of the present invention, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the embodiments of the present invention. Therefore, the protection scope of the embodiments of the present invention shall be subject to the protection scope of the claims.

Industrial Applicability
The embodiments of the present invention obtain sample videos and construct a neural network including multiple directed acyclic graphs; the multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features; each edge of the directed acyclic graphs corresponds to multiple operation methods, and each operation method has a corresponding weight parameter. The neural network is trained based on the sample videos and the event label corresponding to each sample video to obtain trained weight parameters; based on the trained weight parameters, a target operation method is selected for each edge of the multiple directed acyclic graphs to obtain the trained neural network. The neural network constructed in the above embodiments includes not only directed acyclic graphs for extracting spatial features but also directed acyclic graphs for extracting temporal features, and each edge of a directed acyclic graph corresponds to multiple operation methods. Thus, after the neural network is trained with the sample videos, the weight parameters of the trained operation methods can be obtained, and the trained neural network can be further obtained based on these weight parameters. A neural network trained in this way performs not only spatial feature recognition in the image dimension but also temporal feature recognition in the time dimension, so the trained neural network achieves high recognition accuracy for videos.

1001: construction module
1002: training module
1003: selection module
1101: acquisition module
1102: first determination module
1103: second determination module
1200: computer device
1201: processor
1202: memory
12021: internal memory
12022: external memory
1203: bus
1300: computer device
1301: processor
1302: memory
13021: internal memory
13022: external memory
1303: bus
101~103, 801~805, 901~903: steps

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments are briefly introduced below. The drawings here are incorporated into and constitute a part of the specification; they illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the technical solutions of the present invention. It should be understood that the following drawings show only some embodiments of the present invention and therefore should not be regarded as limiting the scope; for those of ordinary skill in the art, other related drawings may be obtained from these drawings without creative effort.
FIG. 1 shows a flowchart of a neural network training method provided by an embodiment of the present invention;
FIG. 2 shows a schematic diagram of the network structure of a neural network including directed acyclic graphs provided by an embodiment of the present invention;
FIG. 3a shows a schematic diagram of a temporal convolution process provided by an embodiment of the present invention;
FIG. 3b shows a schematic diagram of another temporal convolution process provided by an embodiment of the present invention;
FIG. 4 shows a schematic diagram of a neural network structure provided by an embodiment of the present invention;
FIG. 5 shows a schematic diagram of a directed acyclic graph provided by an embodiment of the present invention;
FIG. 6 shows a flowchart of a method for generating a feature map corresponding to a node provided by an embodiment of the present invention;
FIG. 7 shows a schematic diagram of the overall structure of a constructed neural network provided by an embodiment of the present invention;
FIG. 8 shows a schematic flowchart of a neural network training method provided by an embodiment of the present invention;
FIG. 9 shows a schematic flowchart of a video recognition method provided by an embodiment of the present invention;
FIG. 10 shows a schematic architectural diagram of a neural network training apparatus provided by an embodiment of the present invention;
FIG. 11 shows a schematic architectural diagram of a video recognition apparatus provided by an embodiment of the present invention;
FIG. 12 shows a schematic structural diagram of a computer device provided by an embodiment of the present invention;
FIG. 13 shows a schematic structural diagram of another computer device provided by an embodiment of the present invention.

101~103: steps

Claims (12)

1. A neural network training method, comprising:
obtaining sample videos, and constructing a neural network including multiple directed acyclic graphs, wherein the multiple directed acyclic graphs include at least one directed acyclic graph for extracting temporal features and at least one directed acyclic graph for extracting spatial features, each edge of the directed acyclic graphs corresponds to multiple operation methods, and each of the operation methods has a corresponding weight parameter;
training the neural network based on the sample videos and an event label corresponding to each of the sample videos, to obtain trained weight parameters; and
selecting, based on the trained weight parameters, a target operation method for each edge of the multiple directed acyclic graphs, to obtain a trained neural network.
2. The method according to claim 1, wherein the directed acyclic graph includes two input nodes, and each node of the neural network corresponds to a feature map;
the constructing a neural network including multiple directed acyclic graphs includes:
taking the feature map output by the (N-1)-th directed acyclic graph as the feature map of one input node of the (N+1)-th directed acyclic graph, and taking the feature map output by the N-th directed acyclic graph as the feature map of another input node of the (N+1)-th directed acyclic graph, N being an integer greater than 1;
wherein the feature map corresponding to a target input node in the first directed acyclic graph of the neural network is a feature map obtained by performing feature extraction on sampled video frames of a sample video, and the input node other than the target input node is empty; the feature map of one input node in the second directed acyclic graph of the neural network is the feature map output by the first directed acyclic graph, and the other input node is empty.
3. The method according to claim 2, further comprising:
concatenating the feature maps corresponding to the nodes other than the input nodes in the directed acyclic graph, and taking the concatenated feature map as the feature map output by the directed acyclic graph.
4. The method according to any one of claims 1 to 3, wherein each edge in the directed acyclic graph for extracting temporal features corresponds to multiple first operation methods, and each edge in the directed acyclic graph for extracting spatial features corresponds to multiple second operation methods; the multiple first operation methods include the multiple second operation methods and at least one other operation method different from each of the second operation methods.
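Claims 2 and 3 describe how the directed acyclic graphs (cells) are chained and how a cell's output is formed. The following is a minimal illustrative sketch, not the patented implementation: feature maps are reduced to plain Python lists, each cell body is a hypothetical stub, and only the wiring (cell N+1 reads the outputs of cells N-1 and N, with the first two cells having one empty input) and the concatenation of non-input nodes follow the claims.

```python
def dag_cell(input_a, input_b):
    """Stub cell: fakes two internal nodes by element-wise combining its
    two inputs (an empty input is treated as zeros), then concatenates
    the internal nodes' feature maps as the cell output (claim 3)."""
    a = input_a or [0.0] * len(input_b or [0.0])
    b = input_b or [0.0] * len(a)
    node1 = [x + y for x, y in zip(a, b)]  # internal node 1 (toy op)
    node2 = [x - y for x, y in zip(a, b)]  # internal node 2 (toy op)
    return node1 + node2                   # claim 3: concatenation

def build_network(stem_feature_map, num_cells):
    """Wires cells as in claim 2: cell N+1 takes the outputs of cells
    N-1 and N as its two input nodes; the first cell takes the stem
    feature map plus an empty input, the second takes the first cell's
    output plus an empty input."""
    outputs = []
    for n in range(num_cells):
        if n == 0:
            inp_a, inp_b = stem_feature_map, None
        elif n == 1:
            inp_a, inp_b = outputs[0], None
        else:
            inp_a, inp_b = outputs[n - 2], outputs[n - 1]
        outputs.append(dag_cell(inp_a, inp_b))
    return outputs[-1]

out = build_network([1.0, 2.0], num_cells=3)
```

The stub arithmetic stands in for the searched operations; only the input/output topology is taken from the claims.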
5. The method according to any one of claims 1 to 3, wherein the neural network further includes a sampling layer connected to the first directed acyclic graph, the sampling layer being configured to sample a sample video to obtain sampled video frames, perform feature extraction on the sampled video frames to obtain a feature map corresponding to the sampled video frames, and input the feature map corresponding to the sampled video frames to the target input node of the first directed acyclic graph;
the neural network further includes a fully connected layer connected to the last directed acyclic graph, the fully connected layer being configured to determine, based on the feature map output by the last directed acyclic graph, the occurrence probabilities of multiple events corresponding to the sample video;
the training the neural network based on the sample videos and the event label corresponding to each of the sample videos, to obtain trained weight parameters, includes:
training the neural network based on the occurrence probabilities of the multiple events corresponding to the sample videos calculated by the fully connected layer and the event label corresponding to each of the sample videos, to obtain the trained weight parameters.
6. The method according to claim 2 or 3, further comprising:
generating a feature map corresponding to a current node according to the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node.
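Claim 5's training setup ends with a fully connected layer producing per-event occurrence probabilities that are compared with the event labels. A toy sketch follows; the softmax output and cross-entropy loss are assumptions made here for illustration, since the claim does not fix the probability mapping or the loss function.

```python
import math

def fully_connected(feature_map, weights, biases):
    """Toy fully connected layer: logits = W·x + b, followed by a
    softmax to yield per-event occurrence probabilities (assumption)."""
    logits = [sum(w * x for w, x in zip(row, feature_map)) + b
              for row, b in zip(weights, biases)]
    exps = [math.exp(l) for l in logits]
    return [e / sum(exps) for e in exps]

def cross_entropy(probs, label_index):
    """Assumed training loss: negative log-probability of the labeled event."""
    return -math.log(probs[label_index])

# Two events, a 2-dimensional feature map from the last cell (toy numbers).
probs = fully_connected([1.0, -1.0],
                        weights=[[2.0, 0.0], [0.0, 2.0]],
                        biases=[0.0, 0.0])
loss = cross_entropy(probs, label_index=0)
```

In the claimed method this loss would drive the joint update of both the network weights and the per-operation architecture weight parameters.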
7. The method according to claim 6, wherein the generating a feature map corresponding to the current node according to the feature map corresponding to each upper-level node pointing to the current node and the weight parameters of the operation methods corresponding to the edges between the current node and each upper-level node pointing to the current node includes:
for a current edge between the current node and each upper-level node pointing to the current node, processing the feature map of the upper-level node corresponding to the current edge based on each of the operation methods corresponding to the current edge, to obtain a first intermediate feature map corresponding to each of the operation methods corresponding to the current edge;
performing a weighted summation of the first intermediate feature maps corresponding to the operation methods corresponding to the current edge according to the weight parameters corresponding to the operation methods, to obtain a second intermediate feature map corresponding to the current edge; and
performing a summation operation on the second intermediate feature maps respectively corresponding to the multiple edges between the current node and the upper-level nodes pointing to the current node, to obtain the feature map corresponding to the current node.
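Claim 7's node computation can be sketched as follows. This is an illustration only: feature maps are plain lists, the candidate operation methods are toy functions, and the softmax normalization of each edge's operation weights is an assumption (the claim speaks only of a weighted summation by the weight parameters).

```python
import math

def node_feature_map(upper_feature_maps, edge_ops, edge_weights):
    """upper_feature_maps: one feature map per incoming edge.
    edge_ops:     per edge, the list of candidate operation methods.
    edge_weights: per edge, one raw weight parameter per operation."""
    total = None
    for fmap, ops, weights in zip(upper_feature_maps, edge_ops, edge_weights):
        # Softmax over this edge's operation weights (assumed normalization).
        exps = [math.exp(w) for w in weights]
        alphas = [e / sum(exps) for e in exps]
        # First intermediate feature maps: one per operation method.
        firsts = [op(fmap) for op in ops]
        # Second intermediate feature map: weighted sum over operations.
        second = [sum(a * f[i] for a, f in zip(alphas, firsts))
                  for i in range(len(fmap))]
        # Node feature map: sum of the second intermediate maps over edges.
        total = second if total is None else [t + s for t, s in zip(total, second)]
    return total

identity = lambda x: list(x)          # toy candidate operation
double = lambda x: [2 * v for v in x] # toy candidate operation

# One node with two incoming edges, each edge offering two candidate ops.
out = node_feature_map(
    upper_feature_maps=[[1.0, 2.0], [3.0, 4.0]],
    edge_ops=[[identity, double], [identity, double]],
    edge_weights=[[0.0, 0.0], [0.0, 0.0]],  # equal weights -> alphas 0.5/0.5
)
```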
8. The method according to any one of claims 1 to 3, wherein the selecting, based on the trained weight parameters, a target operation method for each edge of the multiple directed acyclic graphs includes:
for each edge of the directed acyclic graph, taking the operation method with the largest weight parameter corresponding to the edge as the target operation method corresponding to the edge.
9. The method according to claim 8, wherein the selecting, based on the trained weight parameters, a target operation method for each edge of the multiple directed acyclic graphs, to obtain a trained neural network, includes:
for each node, in a case where the number of edges pointing to the node is greater than a target number, determining the weight parameter of the target operation method corresponding to each edge pointing to the node;
sorting the edges pointing to the node in descending order of the corresponding weight parameters, and deleting the edges other than the top K edges, where K is the target number; and
taking the neural network after the deletion processing as the trained neural network.
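Claims 8 and 9 discretize the trained network: each edge keeps only its highest-weighted operation method, and each node keeps only the K incoming edges whose target operations carry the largest weights. A sketch with a hypothetical data layout (per-edge dicts mapping operation names to trained weights):

```python
def discretize(incoming_edges, k):
    """incoming_edges: list of {op_name: trained weight} dicts, one per
    edge pointing at a node. Returns the kept edges as
    (edge_index, target_op_name) pairs."""
    # Claim 8: the target operation of an edge is the op with the max weight.
    targets = [max(ops, key=ops.get) for ops in incoming_edges]
    # Claim 9: rank edges by their target op's weight, keep the top K.
    ranked = sorted(range(len(incoming_edges)),
                    key=lambda i: incoming_edges[i][targets[i]],
                    reverse=True)
    kept = sorted(ranked[:k])
    return [(i, targets[i]) for i in kept]

edges = [
    {"skip": 0.1, "conv3x3": 0.7},  # edge 0: best op conv3x3, weight 0.7
    {"skip": 0.4, "conv3x3": 0.2},  # edge 1: best op skip,    weight 0.4
    {"skip": 0.3, "conv3x3": 0.6},  # edge 2: best op conv3x3, weight 0.6
]
kept = discretize(edges, k=2)
```

The operation names here ("skip", "conv3x3") are placeholders; the patent does not enumerate its candidate operation set in these claims.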
10. A video recognition method, comprising:
obtaining a video to be recognized;
inputting the video to be recognized into a neural network trained by the neural network training method according to any one of claims 1 to 9, and determining the occurrence probabilities of multiple events corresponding to the video to be recognized; and
taking an event whose corresponding occurrence probability meets a preset condition as an event occurring in the video to be recognized.
11. A computer device, comprising a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus; and when the machine-readable instructions are executed by the processor, the steps of the neural network training method according to any one of claims 1 to 9, or the steps of the video recognition method according to claim 10, are performed.
12. A computer-readable storage medium, on which a computer program is stored, wherein when the computer program is run by a processor, the steps of the neural network training method according to any one of claims 1 to 9, or the steps of the video recognition method according to claim 10, are performed.
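Claim 10's recognition step can be illustrated as below; treating the "preset condition" as a fixed probability threshold is an assumption made here, as the claim leaves the condition open.

```python
def recognize_events(event_probs, threshold=0.5):
    """event_probs: {event_name: occurrence probability} as produced by the
    trained network's fully connected layer. Returns the events whose
    probability meets the (assumed) threshold condition, sorted by name."""
    return sorted(name for name, p in event_probs.items() if p >= threshold)

# Toy per-event probabilities for one video to be recognized.
events = recognize_events({"running": 0.82, "jumping": 0.15, "falling": 0.61})
```

Multiple events can satisfy the condition at once, matching the claim's wording that probabilities are determined for multiple events per video.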
TW110115206A 2020-06-19 2021-04-27 Neural network training method, video recognition method, computer equipment and readable storage medium TWI770967B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010567864.7 2020-06-19
CN202010567864.7A CN111767985B (en) 2020-06-19 2020-06-19 Neural network training method, video identification method and device

Publications (2)

Publication Number Publication Date
TW202201285A true TW202201285A (en) 2022-01-01
TWI770967B TWI770967B (en) 2022-07-11

Family

ID=72721043

Family Applications (1)

Application Number Title Priority Date Filing Date
TW110115206A TWI770967B (en) 2020-06-19 2021-04-27 Neural network training method, video recognition method, computer equipment and readable storage medium

Country Status (5)

Country Link
JP (1) JP7163515B2 (en)
KR (1) KR20220011208A (en)
CN (1) CN111767985B (en)
TW (1) TWI770967B (en)
WO (1) WO2021253938A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111767985B (en) * 2020-06-19 2022-07-22 深圳市商汤科技有限公司 Neural network training method, video identification method and device
CN112598021A (en) * 2020-11-27 2021-04-02 西北工业大学 Graph structure searching method based on automatic machine learning

Family Cites Families (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104281853B (en) * 2014-09-02 2017-11-17 电子科技大学 A kind of Activity recognition method based on 3D convolutional neural networks
US10515304B2 (en) * 2015-04-28 2019-12-24 Qualcomm Incorporated Filter specificity as training criterion for neural networks
WO2017070656A1 (en) * 2015-10-23 2017-04-27 Hauptmann Alexander G Video content retrieval system
US10546211B2 (en) * 2016-07-01 2020-01-28 Google Llc Convolutional neural network on programmable two dimensional image processor
EP3306528B1 (en) 2016-10-04 2019-12-25 Axis AB Using image analysis algorithms for providing training data to neural networks
CN108664849A (en) * 2017-03-30 2018-10-16 富士通株式会社 The detection device of event, method and image processing equipment in video
US11010658B2 (en) * 2017-12-22 2021-05-18 Intel Corporation System and method for learning the structure of deep convolutional neural networks
CN108228861B (en) * 2018-01-12 2020-09-01 第四范式(北京)技术有限公司 Method and system for performing feature engineering for machine learning
CN108334910B (en) * 2018-03-30 2020-11-03 国信优易数据股份有限公司 Event detection model training method and event detection method
CN108985259B (en) * 2018-08-03 2022-03-18 百度在线网络技术(北京)有限公司 Human body action recognition method and device
JP7207630B2 (en) * 2018-09-25 2023-01-18 Awl株式会社 Object recognition camera system, re-learning system, and object recognition program
CN109284820A (en) * 2018-10-26 2019-01-29 北京图森未来科技有限公司 A kind of search structure method and device of deep neural network
US20200167659A1 (en) * 2018-11-27 2020-05-28 Electronics And Telecommunications Research Institute Device and method for training neural network
CN110598598A (en) * 2019-08-30 2019-12-20 西安理工大学 Double-current convolution neural network human behavior identification method based on finite sample set
CN110705463A (en) * 2019-09-29 2020-01-17 山东大学 Video human behavior recognition method and system based on multi-mode double-flow 3D network
CN110852168A (en) * 2019-10-11 2020-02-28 西北大学 Pedestrian re-recognition model construction method and device based on neural framework search
CN111767985B (en) * 2020-06-19 2022-07-22 深圳市商汤科技有限公司 Neural network training method, video identification method and device

Also Published As

Publication number Publication date
WO2021253938A1 (en) 2021-12-23
TWI770967B (en) 2022-07-11
JP2022541712A (en) 2022-09-27
CN111767985A (en) 2020-10-13
CN111767985B (en) 2022-07-22
JP7163515B2 (en) 2022-10-31
KR20220011208A (en) 2022-01-27
