CN112668366B - Image recognition method, device, computer readable storage medium and chip

Info

Publication number
CN112668366B
Authority
CN
China
Prior art keywords
image
person
frame
processed
persons
Prior art date
Legal status
Active
Application number
CN201910980310.7A
Other languages
Chinese (zh)
Other versions
CN112668366A (en
Inventor
严锐
谢凌曦
田奇
Current Assignee
Huawei Cloud Computing Technologies Co Ltd
Original Assignee
Huawei Cloud Computing Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Cloud Computing Technologies Co Ltd filed Critical Huawei Cloud Computing Technologies Co Ltd
Priority to CN201910980310.7A priority Critical patent/CN112668366B/en
Priority to PCT/CN2020/113788 priority patent/WO2021073311A1/en
Publication of CN112668366A publication Critical patent/CN112668366A/en
Application granted granted Critical
Publication of CN112668366B publication Critical patent/CN112668366B/en


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition


Abstract

The application provides an image recognition method, an image recognition apparatus, a computer readable storage medium and a chip, relating to the field of artificial intelligence and in particular to the field of computer vision. The method comprises the following steps: extracting image features of an image to be processed; determining, for each of a plurality of persons in the image to be processed, a time-series feature and a spatial feature in each frame of the multi-frame image to be processed; determining action features according to the time-series features and the spatial features; and identifying a group action of the plurality of persons in the image to be processed according to the action features. By determining both the temporal association between each person's own actions and the association between the actions of different persons, the method can better identify the group action of the plurality of persons in the image to be processed.

Description

Image recognition method, device, computer readable storage medium and chip
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to an image recognition method, apparatus, computer readable storage medium and chip.
Background
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis and the military. It studies how to use cameras/video cameras and computers to acquire the data and information about a subject that we need. Figuratively speaking, it installs eyes (cameras/video cameras) and a brain (algorithms) on a computer so that the computer can perceive its environment, replacing human eyes in identifying, tracking and measuring targets. Because perception can be regarded as extracting information from sensory signals, computer vision can also be regarded as the science of how to make an artificial system "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems in place of the visual organs to acquire input information, and then uses a computer in place of the brain to process and interpret that information. The ultimate goal of computer vision is to enable computers to observe and understand the world visually, as humans do, and to adapt to the environment autonomously.
Recognizing and understanding the behavior of persons in an image yields some of the most valuable information in it. Motion recognition is an important research topic in the field of computer vision: through motion recognition, a computer can understand the content of a video. Motion recognition technology can be widely applied in fields such as public-place surveillance and human-computer interaction. Feature extraction is a key link in the motion recognition process; motion recognition can be performed effectively only with accurate features. For group motion recognition, the accuracy of recognition is affected both by the temporal relationship between the motions of each individual person in the video and by the relationships between the motions of different persons.
Existing schemes generally extract the time-series feature of each person through a long short-term memory (LSTM) network, where the time-series feature represents how that person's actions are related over time. An interactive motion feature is then computed for each person from the time-series features, the motion feature of each person is determined from the interactive motion features, and the group motion of the plurality of persons is inferred from the per-person motion features. The interactive motion features represent the relevance between the actions of different persons.
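For illustration only, a minimal sketch of this kind of LSTM-based temporal feature extraction is shown below; the tensor sizes and variable names are assumptions, not details of any specific prior scheme.

```python
import torch
import torch.nn as nn

# Sketch of the prior-art style pipeline: an LSTM summarizes, for one person,
# how that person's per-frame appearance features evolve over T frames.
# The dimensions (T, feat_dim, hidden_dim) are illustrative assumptions.
T, feat_dim, hidden_dim = 10, 1024, 256

lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)

person_features = torch.randn(1, T, feat_dim)   # per-frame image features of one person
outputs, (h_n, c_n) = lstm(person_features)     # outputs: (1, T, hidden_dim)

temporal_feature = outputs[:, -1, :]            # last hidden state as the person's time-series feature
```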
However, in the above scheme the interactive motion feature of each person is determined based only on the temporal association of that person's own motions, so its accuracy for group motion recognition leaves room for improvement.
Disclosure of Invention
The application provides an image recognition method, an image recognition apparatus, a computer readable storage medium and a chip, which are used to better recognize group actions of a plurality of persons in an image to be processed.
In a first aspect, an image recognition method is provided, the method comprising: extracting image features of an image to be processed, where the image to be processed comprises a multi-frame image; determining a time-series feature of each of a plurality of persons in each frame of the multi-frame image; determining a spatial feature of each of the plurality of persons in each frame of the multi-frame image; determining an action feature of each of the plurality of persons in each frame of the multi-frame image according to the time-series features and the spatial features; and identifying a group action of the plurality of persons in the image to be processed according to the action features of the plurality of persons in each frame of the multi-frame image.
Alternatively, the group action of the plurality of persons in the image to be processed may be a certain sport or activity, for example, the group action of the plurality of persons in the image to be processed may be playing basketball, playing volleyball, playing football, dancing, etc.
Wherein the image to be processed comprises a plurality of characters, and the image characteristics of the image to be processed comprise the image characteristics of the characters in each frame of images in a plurality of frames of images in the image to be processed.
In the application, when the group actions of the plurality of people are determined, not only the time sequence characteristics of the plurality of people are considered, but also the space characteristics of the plurality of people are considered, and the group actions of the plurality of people can be determined better and more accurately by integrating the time sequence characteristics and the space characteristics of the plurality of people.
When the image recognition method is performed by the image recognition apparatus, the image to be processed may be an image acquired from the image recognition apparatus, or the image to be processed may be an image received by the image recognition apparatus from another device, or the image to be processed may be captured by a camera of the image recognition apparatus.
The image to be processed may be consecutive multi-frame images in a video, or may be multi-frame images selected from a video according to a preset rule.
It should be understood that the plurality of persons in the image to be processed may include only humans, only animals, or both humans and animals.
When extracting the image features of the image to be processed, the persons in the image may first be identified to determine their bounding boxes, where the image in each bounding box corresponds to one person, and the image features of each person may then be obtained by performing feature extraction on the image in each bounding box.
Optionally, the skeleton node of the person in the bounding box corresponding to each person can be identified first, and then the image feature vector of the person is extracted according to the skeleton node of each person, so that the extracted image feature can reflect the action of the person more accurately, and the accuracy of the extracted image feature is improved.
Furthermore, skeletal nodes in the bounding box can be connected according to the character structure to obtain a connected image, and then the connected image is extracted with image feature vectors.
Alternatively, the regions where the skeletal nodes are located and the regions outside them may be displayed in different colors to obtain a processed image, and the image features may then be extracted from the processed image.
Further, a local visible image corresponding to the bounding box can be determined according to the image area where the skeletal node of the person is located, and then feature extraction is performed on the local visible image to obtain image features of the image to be processed.
The locally visible image is an image composed of an area including a skeletal node of a person in the image to be processed. Specifically, the region outside the region where the skeletal nodes of the person are located in the bounding box may be obscured to obtain the locally visible image.
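As a rough illustration of the per-person feature extraction from bounding boxes described above, the sketch below crops each person's box and feeds it through a generic CNN backbone; the choice of ResNet-18, the crop size and the function names are assumptions made only for illustration.

```python
import torch
import torchvision.models as models
import torchvision.transforms.functional as TF

# Sketch: extract an image feature vector for each detected person by cropping
# that person's bounding box and running the crop through a CNN backbone.
backbone = models.resnet18()              # randomly initialized here; pretrained weights could be loaded
backbone.fc = torch.nn.Identity()         # keep the 512-d pooled feature
backbone.eval()

def person_features(frame, boxes):
    """frame: (3, H, W) float tensor; boxes: list of (x1, y1, x2, y2) ints."""
    feats = []
    with torch.no_grad():
        for x1, y1, x2, y2 in boxes:
            crop = frame[:, y1:y2, x1:x2]
            crop = TF.resize(crop, [224, 224])          # resize to the backbone input size
            feats.append(backbone(crop.unsqueeze(0)))   # (1, 512) feature of one person
    return torch.cat(feats, dim=0)                      # (K, 512), one row per person
```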
When determining the time sequence feature of a certain person in a plurality of persons, the time association relationship between the different time actions of the person can be determined through the similarity between the image feature vectors of the person between the different actions in different frame images, and then the time sequence feature of the person is obtained.
Suppose the multi-frame image in the image to be processed consists of T frames, and the ith frame denotes the frame at the corresponding position among the T frames, where i is a positive integer less than or equal to T; suppose the plurality of persons in the image to be processed are K persons, and the jth person denotes the person at the corresponding position among the K persons, where j is a positive integer less than or equal to K.
The time-series feature of the jth person in the ith frame of the multi-frame image to be processed is determined according to the similarity between the image feature of the jth person in the ith frame and the image features of the jth person in the other frames of the multi-frame image.
It should be understood that the time-series feature of the jth person in the ith frame represents the association between the action of the jth person in the ith frame and the actions of the jth person in the other frames. The similarity between the image features of a person in two frames reflects the temporal dependence between that person's actions at the two moments.
If the similarity between the image features of a person in two frames is higher, the association between that person's actions at the two moments is closer; conversely, if the similarity is lower, the association between the actions at the two moments is weaker.
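A minimal sketch of one way such cross-frame similarities could be turned into a time-series feature is given below; the cosine similarity and the exponential weighting are illustrative assumptions rather than the method mandated by the application.

```python
import numpy as np

def temporal_feature(feats_j, i):
    """feats_j: (T, D) image features of person j in each of the T frames.
    Returns a time-series feature for person j in frame i: the other frames'
    features weighted by their similarity to the feature in frame i."""
    normed = feats_j / np.linalg.norm(feats_j, axis=1, keepdims=True)
    sims = normed @ normed[i]        # cosine similarity to frame i, shape (T,)
    weights = np.exp(sims)
    weights[i] = 0.0                 # exclude frame i itself
    weights /= weights.sum()         # normalize over the other frames
    return weights @ feats_j         # (D,) similarity-weighted aggregation
```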
When the spatial characteristics of a plurality of people are determined, the spatial association relationship between the actions of different people in the same frame image is determined through the similarity between the image characteristics of different people in the same frame image.
The spatial feature of the jth person in the ith frame of the multi-frame image to be processed is determined according to the similarity between the image feature of the jth person in the ith frame and the image features of the persons other than the jth person in the ith frame. That is, the spatial feature of the jth person in the ith frame may be determined based on how similar the image feature of the jth person is to the image features of the other persons in that frame.
It should be understood that the spatial feature of the jth person in the ith frame image is used to represent the association between the actions of the jth person in the ith frame image and the actions of other persons in the ith frame image except for the jth person.
Specifically, the similarity between the image feature vector of the jth person in the ith frame and the image feature vectors of the other persons reflects how strongly the action of the jth person in the ith frame depends on the actions of those persons. The higher the similarity between the image feature vectors of two persons, the closer the association between their actions; conversely, the lower the similarity, the weaker the association between their actions.
Alternatively, the similarities used for the time-series features and the spatial features may be computed using the Minkowski distance (for example, the Euclidean distance or the Manhattan distance), the cosine similarity, the Chebyshev distance, the Hamming distance, or the like.
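The sketch below illustrates computing a spatial feature for one person from such similarities; the exponential weighting scheme is an assumption, and any of the similarity measures listed above can be substituted for the cosine similarity used here.

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spatial_feature(feats_i, j, similarity=cosine_similarity):
    """feats_i: (K, D) image features of the K persons in frame i.
    Returns a spatial feature for person j: the other persons' features
    weighted by how similar their image features are to person j's."""
    K = feats_i.shape[0]
    sims = np.array([similarity(feats_i[j], feats_i[k]) for k in range(K)])
    weights = np.exp(sims)
    weights[j] = 0.0                 # exclude person j themself
    weights /= weights.sum()
    return weights @ feats_i         # (D,) similarity-weighted aggregation

# Any of the measures named above can replace cosine_similarity, e.g. a
# negated Minkowski distance used as a similarity score:
def neg_minkowski(a, b, p=2):        # p=2: Euclidean, p=1: Manhattan
    return -float(np.sum(np.abs(a - b) ** p) ** (1.0 / p))
```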
The spatial association between different person actions and the temporal association between the same person actions can provide important clues for the categories of the multi-person scene in the image. Therefore, in the image recognition process, the method and the device can effectively improve the recognition accuracy by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions.
Alternatively, when determining the motion feature of a person in a frame image, the time sequence feature, the space feature and the image feature corresponding to the person in the frame image may be fused, so as to obtain the motion feature of the person in the frame image.
When the time sequence features, the space features and the image features are fused, a combined fusion mode can be adopted for fusion.
For example, features corresponding to a person in a frame of image are fused to obtain motion features of the person in the frame of image.
Further, the features to be fused may be directly added or weighted added when fusing the above-described plurality of features.
Alternatively, the above features may be fused by concatenation (channel-wise fusion). Specifically, the features to be fused may be spliced together directly along the feature dimension, or spliced after being multiplied by certain coefficients, i.e., weight values.
Alternatively, the pooling layer may be used to process the plurality of features to achieve a fusion of the plurality of features.
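A minimal sketch of these fusion options (direct addition, weighted addition, concatenation and pooling) might look as follows; the function name and the placeholder weight values are assumptions.

```python
import numpy as np

def fuse_features(image_feat, temporal_feat, spatial_feat,
                  mode="sum", weights=(1.0, 1.0, 1.0)):
    """Fuse one person's per-frame image, time-series and spatial features.
    The modes mirror the options described above."""
    feats = [image_feat, temporal_feat, spatial_feat]
    if mode == "sum":                       # direct addition
        return sum(feats)
    if mode == "weighted_sum":              # weighted addition
        return sum(w * f for w, f in zip(weights, feats))
    if mode == "concat":                    # concatenation (channel-wise fusion)
        return np.concatenate(feats)
    if mode == "max_pool":                  # element-wise pooling across the features
        return np.max(np.stack(feats), axis=0)
    raise ValueError(mode)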
With reference to the first aspect, in some implementations of the first aspect, when identifying the group actions of the plurality of people in the image to be processed according to the action characteristics of each of the plurality of people in each frame of the image to be processed, the action characteristics of each of the plurality of people in the image to be processed may be classified to obtain the action of each of the plurality of people, and the group actions of the plurality of people may be determined according to the actions.
Alternatively, the action feature of each of the plurality of persons in each frame of the image to be processed may be input into a classification module to obtain a classification result for each person, i.e., the action of each person, and the action corresponding to the largest number of persons is then taken as the group action of the plurality of persons.
Optionally, a person may be selected from a plurality of persons, and the motion characteristics of the person in each frame of image may be input into the classification module, so as to obtain a classification result of the motion characteristics of the person, that is, the motion of the person, and then the obtained motion of the person is used as the group motion of the plurality of persons in the image to be processed.
With reference to the first aspect, in some implementations of the first aspect, when identifying the group actions of the multiple people in the image to be processed according to the action characteristics of each of the multiple people in each frame of the image to be processed, the action characteristics of the multiple people in each frame of the image may be fused to obtain the action characteristics of the image to be processed, and then the action characteristics of each frame of the image may be classified to obtain the action of each frame of the image, and the group actions of the multiple people in the image to be processed may be determined accordingly.
Optionally, the action features of multiple people in each frame of image may be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image is input into the classification module respectively to obtain the action classification result of each frame of image, and the classification result with the largest number of images in the corresponding images to be processed in the output category of the classification module is used as the group action of multiple people in the images to be processed.
Optionally, the motion features of multiple people in each frame of image may be fused to obtain the motion feature of the frame of image, then the average value of the motion features of each frame of image is obtained to obtain the average motion feature of each frame of image, and then the average motion feature of each frame of image is input into the classification module, and then the classification result corresponding to the average motion feature of each frame of image is used as the group motion of multiple people in the image to be processed.
Optionally, a frame of image can be selected from the images to be processed, the motion features of the frame of image obtained by fusing the motion features of a plurality of people in the frame of image are input into the classification module, so as to obtain a classification result of the frame of image, and the classification result of the frame of image is further used as a group motion of a plurality of people in the images to be processed.
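For illustration, the sketch below implements the classification strategies described above with a caller-supplied classification module; the function names and the max-pooling used for per-frame fusion are assumptions.

```python
import numpy as np
from collections import Counter

def group_action_by_person_vote(person_action_feats, classify):
    """person_action_feats: (T, K, D) action features of K persons in T frames.
    classify: maps a (D,) feature to an action label. Majority vote over all
    per-person classification results."""
    labels = [classify(f) for frame in person_action_feats for f in frame]
    return Counter(labels).most_common(1)[0][0]

def group_action_by_frame_vote(person_action_feats, classify):
    """Fuse the persons within each frame (max-pool), classify each frame,
    then take the most frequent per-frame result."""
    frame_feats = person_action_feats.max(axis=1)            # (T, D)
    labels = [classify(f) for f in frame_feats]
    return Counter(labels).most_common(1)[0][0]

def group_action_by_average(person_action_feats, classify):
    """Fuse the persons per frame, average over frames, classify once."""
    avg_feat = person_action_feats.max(axis=1).mean(axis=0)  # (D,)
    return classify(avg_feat)
```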
With reference to the first aspect, in some implementations of the first aspect, after the group actions of the plurality of people in the image to be processed are identified, tag information of the image to be processed is generated according to the group actions, where the tag information is used to indicate the group actions of the plurality of people in the image to be processed.
The method can be used to classify a video library: different videos in the library are labeled according to their corresponding group actions, which makes it convenient for users to view and search.
With reference to the first aspect, in certain implementations of the first aspect, after identifying a group action of a plurality of people in the image to be processed, determining a key person in the image to be processed according to the group action.
Optionally, determining the contribution degree of each of the plurality of people in the image to be processed to the group action, and then determining the person with the highest contribution degree as the key person.
It should be appreciated that the key persona contributes more to the group action of the plurality of personas than other personas of the plurality of personas other than the key persona.
The above approach may be used, for example, to detect key persons in a video. A video image typically contains several persons, most of whom are not important; effectively detecting the key person makes it possible to understand the video content more quickly and accurately based on the information surrounding the key person.
For example, if a video shows a ball game, the player controlling the ball has the greatest influence on everyone on the scene, including players, referees and spectators, and contributes most to the group action, so the player controlling the ball can be determined to be the key person. Identifying the key person helps viewers understand what is happening and what is about to happen in the game.
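A minimal sketch of key-person selection by contribution degree is given below; the particular scoring function (for example, the classifier's confidence for the group-action class) is an assumption, since the text only requires some measure of each person's contribution.

```python
import numpy as np

def key_person(person_action_feats, score_fn, group_action):
    """person_action_feats: (K, D) per-person action features in one frame.
    score_fn(feat, action) -> contribution score of that person to the action
    (the scoring choice is an assumption). Returns the index of the person
    with the highest contribution, i.e. the key person."""
    scores = np.array([score_fn(f, group_action) for f in person_action_feats])
    return int(np.argmax(scores))
```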
In a second aspect, there is provided an image recognition method, the method comprising: extracting image characteristics of an image to be processed; determining the spatial characteristics of a plurality of people in each frame of image to be processed; and determining action characteristics of the plurality of people in each frame of the image to be processed, and identifying group actions of the plurality of people in the image to be processed according to the action characteristics of the plurality of people in each frame of the image to be processed.
The action features of the plurality of people in the image to be processed are obtained by fusing the spatial features of the plurality of people in the image to be processed and the image features in the image to be processed.
The image to be processed may be one frame image, or may be a plurality of frames of continuous or discontinuous images.
In the application, when determining the group action of the plurality of persons, only their spatial features are considered and the time-series feature of each person does not need to be computed. This is particularly suitable when the determination of the spatial features does not depend on the time-series features, and makes it more convenient to determine the group action. It is also applicable, for example, when only a single frame is recognized, in which case there is no time-series feature of the same person at different moments.
When the image recognition method is performed by the image recognition apparatus, the image to be processed may be an image acquired from the image recognition apparatus, or the image to be processed may be an image received by the image recognition apparatus from another device, or the image to be processed may be captured by a camera of the image recognition apparatus.
The image to be processed may be one frame or consecutive multiple frames of a video, or one or more frames selected from a video according to a preset rule.
It should be understood that the plurality of persons in the image to be processed may include only humans, only animals, or both humans and animals.
When extracting the image features of the image to be processed, the person in the image may be identified, so as to determine the bounding boxes of the person, the image in each bounding box corresponds to one person in the image, and then the image features of each person may be obtained by extracting the features of the image of each bounding box.
Optionally, the skeleton node of the person in the bounding box corresponding to each person can be identified first, and then the image feature of the person is extracted according to the skeleton node of each person, so that the extracted image feature can reflect the action of the person more accurately, and the accuracy of the extracted image feature is improved.
Furthermore, the skeleton nodes in the boundary box can be connected according to the character structure to obtain a connection image, and then the connection image is extracted with the image feature vector.
Or the region where the skeleton node is located and the region outside the region where the skeleton node is located can be displayed by different colors to obtain a processed image, and then the image characteristics of the processed image are extracted.
Further, a local visible image corresponding to the bounding box can be determined according to the image area where the skeletal node of the person is located, and then feature extraction is performed on the local visible image to obtain image features of the image to be processed.
The locally visible image is an image composed of the region where the skeletal node of the person in the image to be processed is located. Specifically, the region of the bounding box outside the region where the skeletal nodes of the person are located may be obscured to obtain the locally visible image.
When the spatial characteristics of a plurality of people are determined, the spatial association relationship between the actions of different people in the same frame image is determined through the similarity between the image characteristics of different people in the same frame image.
The spatial feature of the jth person in the ith frame of the multi-frame image to be processed is determined according to the similarity between the image feature of the jth person in the ith frame and the image features of the other persons in that frame. That is, the spatial feature of the jth person in the ith frame may be determined based on how similar the image feature of the jth person is to the image features of the other persons.
It should be understood that the spatial feature of the jth person in the ith frame image is used to represent the association relationship between the actions of the jth person in the ith frame image and the actions of other persons in the ith frame image except for the jth person.
Specifically, the similarity between the image feature vector of the jth person in the ith frame and the image feature vectors of the other persons in that frame reflects how strongly the action of the jth person depends on the actions of those persons. The higher the similarity between the image feature vectors of two persons, the closer the association between their actions; conversely, the lower the similarity, the weaker the association between their actions.
Alternatively, the similarity between the above spatial features may be computed using the Minkowski distance (for example, the Euclidean distance or the Manhattan distance), the cosine similarity, the Chebyshev distance, the Hamming distance, or the like.
Alternatively, when determining the motion feature of a person in a frame of image, the spatial feature and the image feature corresponding to the person in the frame of image may be fused, so as to obtain the motion feature of the person in the frame of image.
When the spatial features and the image features are fused, a combined fusion mode can be adopted for fusion.
For example, features corresponding to a person in a frame of image are fused to obtain motion features of the person in the frame of image.
Further, when the above-described plurality of features are fused, the features to be fused may be directly added, or weighted addition may be performed.
Alternatively, the above features may be fused by concatenation (channel-wise fusion). Specifically, the features to be fused may be spliced together directly along the feature dimension, or spliced after being multiplied by certain coefficients, i.e., weight values.
Alternatively, the pooling layer may be used to process the plurality of features to achieve a fusion of the plurality of features.
With reference to the second aspect, in some implementations of the second aspect, when identifying group actions of a plurality of people in the image to be processed according to action features in each frame of the image to be processed in the plurality of people, the action features of each person in the image to be processed in each frame of the image to be processed may be classified to obtain actions of each person, and the group actions of the plurality of people may be determined according to the actions.
Alternatively, the action feature of each of the plurality of persons in each frame of the image to be processed may be input into a classification module to obtain a classification result for each person, i.e., the action of each person, and the action corresponding to the largest number of persons is then taken as the group action of the plurality of persons.
Optionally, a person may be selected from a plurality of persons, and the motion characteristics of the person in each frame of image may be input into the classification module, so as to obtain a classification result of the motion characteristics of the person, that is, the motion of the person, and then the obtained motion of the person is used as the group motion of the plurality of persons in the image to be processed.
With reference to the second aspect, in some implementations of the second aspect, when identifying group actions of a plurality of people in the image to be processed according to action features in each frame of the image to be processed in the plurality of people, the action features of the plurality of people in each frame of the image may be fused to obtain action features of the image to be processed, and then the action features of each frame of the image may be classified to obtain actions of each frame of the image, and group actions of the plurality of people in the image to be processed may be determined according to the actions.
Optionally, the action features of multiple people in each frame of image may be fused to obtain the action feature of the frame of image, and then the action feature of each frame of image is input into the classification module respectively to obtain the action classification result of each frame of image, and the classification result with the largest number of images in the corresponding images to be processed in the output category of the classification module is used as the group action of multiple people in the images to be processed.
Optionally, the motion features of multiple people in each frame of image may be fused to obtain the motion feature of the frame of image, then the average value of the motion features of each frame of image is obtained to obtain the average motion feature of each frame of image, and then the average motion feature of each frame of image is input into the classification module, and then the classification result corresponding to the average motion feature of each frame of image is used as the group motion of multiple people in the image to be processed.
Optionally, a frame of image can be selected from the images to be processed, the motion features of the frame of image obtained by fusing the motion features of a plurality of people in the frame of image are input into the classification module, so as to obtain a classification result of the frame of image, and the classification result of the frame of image is further used as a group motion of a plurality of people in the images to be processed.
With reference to the second aspect, in some implementations of the second aspect, after identifying the group actions of the plurality of people in the image to be processed, tag information of the image to be processed is generated according to the group actions, where the tag information is used to indicate the group actions of the plurality of people in the image to be processed.
The method can be used for classifying the video library, labeling different videos in the video library according to the corresponding group actions, and facilitating the user to check and search.
With reference to the second aspect, in some implementations of the second aspect, after identifying a group action of a plurality of people in the image to be processed, a key person in the image to be processed is determined according to the group action.
Optionally, determining the contribution degree of each of the plurality of people in the image to be processed to the group action, and then determining the person with the highest contribution degree as the key person.
It should be appreciated that the key persona contributes more to the group action of the plurality of personas than other personas of the plurality of personas other than the key persona.
The above approach may be used, for example, to detect key persons in a video. A video image typically contains several persons, most of whom are not important; effectively detecting the key person makes it possible to understand the video content more quickly and accurately based on the information surrounding the key person.
For example, if a video shows a ball game, the player controlling the ball has the greatest influence on everyone on the scene, including players, referees and spectators, and contributes most to the group action, so the player controlling the ball can be determined to be the key person. Identifying the key person helps viewers understand what is happening and what is about to happen in the game.
In a third aspect, there is provided an image recognition method, the method comprising: extracting image characteristics of an image to be processed; determining the dependency relationship among different persons in an image to be processed and the dependency relationship among actions of the same person at different moments; fusing the image features with the time-space feature vectors to obtain the action features of each frame of image of the image to be processed; and carrying out classification prediction on the motion characteristics of each frame of image so as to determine the group motion category of the image to be processed.
In the application, the complex reasoning process of identifying group actions is completed: when determining the group action of a plurality of persons, not only their time-series features but also their spatial features are considered, and by integrating the time-series and spatial features, the group action of the plurality of persons can be determined better and more accurately.
Optionally, when the image features of the image to be processed are extracted, object tracking may be performed on each person, a bounding box of each person in each frame of image may be determined, the image in each bounding box corresponds to one person, and feature extraction may be performed on the image of each bounding box to obtain the image features of each person.
When the image features of the image to be processed are extracted, the image features can be extracted by identifying the skeleton nodes of the person, so that the influence of redundant information of the image in the feature extraction process is reduced, and the feature extraction accuracy is improved. Specifically, the image features may be extracted from skeletal nodes using a convolutional network.
Alternatively, skeletal nodes in the bounding box may be connected according to the character structure to obtain a connected image, and then the connected image may be subjected to extraction of an image feature vector. Or the region where the skeleton node is located and the region outside the region where the skeleton node is located can be displayed by different colors, and then the image characteristics of the processed image are extracted.
Further, a local visible image corresponding to the bounding box can be determined according to the image area where the skeletal node of the person is located, and then feature extraction is performed on the local visible image to obtain image features of the image to be processed.
The locally visible image is an image composed of an area including a skeletal node of a person in the image to be processed. Specifically, the region outside the region where the skeletal nodes of the person are located in the bounding box may be obscured to obtain the locally visible image.
Alternatively, a person-action mask matrix may be computed from the image of the person and the skeletal nodes. Each element of the mask matrix corresponds to one pixel. In the mask matrix, the values inside a square region of side length l centred on each skeleton point are set to 1, and the values at all other positions are set to 0.
Further, the RGB color mode may be used for the masking. The RGB color mode uses the RGB model to assign each pixel in the image an intensity value in the range 0 to 255 for each of the RGB components. The original person-action picture is masked with the mask matrix to obtain the locally visible image.
Optionally, a region of adjustable side length l around each skeletal node is retained, and the other regions are obscured.
For each person, the local visible image is utilized to extract the image characteristics, so that redundant information in the boundary box can be reduced, the image characteristics can be extracted according to the structural information of the person, and the expressive ability of the image characteristics on the actions of the person is enhanced.
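The mask-based local visible image described above could be sketched as follows; the default side length l and the coordinate conventions are illustrative assumptions.

```python
import numpy as np

def local_visible_image(crop, skeleton_points, l=15):
    """crop: (H, W, 3) uint8 RGB image inside one person's bounding box.
    skeleton_points: list of (x, y) skeleton-node coordinates inside the crop.
    Builds the mask matrix described above (1 inside an l x l square centred
    on each skeleton point, 0 elsewhere) and masks out everything else."""
    h, w, _ = crop.shape
    mask = np.zeros((h, w), dtype=np.uint8)
    half = l // 2
    for x, y in skeleton_points:
        y0, y1 = max(0, y - half), min(h, y + half + 1)
        x0, x1 = max(0, x - half), min(w, x + half + 1)
        mask[y0:y1, x0:x1] = 1
    return crop * mask[:, :, None]           # pixels outside the squares become 0
```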
When the dependency relationship between different people in the image to be processed and the dependency relationship between actions of the same person at different moments are determined, the cross interaction module is utilized to determine the time correlation of the body gestures of the people in the multi-frame image and/or determine the space correlation of the body gestures of the people in the multi-frame image.
Optionally, the cross interaction module is used for realizing interaction of the features, and a feature interaction model is established and used for representing the association relation of the body gestures of the person in time and/or space.
Alternatively, by calculating the similarity between image features of different persons in the same frame image, the spatial dependence between the body poses of different persons in the same frame image can be determined. The spatial dependence is used to represent the dependence of the body posture of a person on the body posture of other persons in a certain frame of image, i.e. the spatial dependence between actions of the person. The spatial dependency may be represented by a spatial feature vector.
Alternatively, by calculating the similarity between image features of the same person at different times, the time dependence between the body poses of the same person at different times can be determined. The time dependence may also be referred to as a time-series dependence, which is used to represent the dependence of the body posture of the person in a certain frame of image on the body posture of the person in other video frames, i.e. the time-series dependence inherent in an action. The time dependence may be represented by a time series feature vector.
The time-space feature vector of the kth person can be calculated according to the space feature vector and the time sequence feature vector of the kth person in the image to be processed.
In the process of fusing the image features with the time-space feature vectors to obtain the action feature of each frame of the image to be processed, the set of image features of the K persons in the images at the T moments is fused with the set of time-space feature vectors of the K persons at the T moments, so as to obtain the action feature of each of the images at the T moments.
Optionally, fusing the image feature of the kth person at the moment t with the time-space feature vector to obtain a person feature vector of the kth person at the moment t; or residual connection is carried out on the image characteristic and the time-space characteristic vector so as to obtain the character characteristic vector. And determining a set of character feature vectors of the K characters at the time t according to the character feature vectors of each character in the K characters. And carrying out maximum pooling on the set of the character feature vectors to obtain the action feature vectors.
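A minimal sketch of this residual fusion followed by max-pooling over the K persons, under the assumption that the image features and time-space feature vectors share the same dimension, is:

```python
import numpy as np

def frame_action_feature(image_feats_t, st_feats_t):
    """image_feats_t, st_feats_t: (K, D) image features and time-space feature
    vectors of the K persons at time t. A residual connection per person,
    followed by max-pooling over persons, gives the action feature of frame t."""
    person_feats = image_feats_t + st_feats_t     # residual connection
    return person_feats.max(axis=0)               # (D,) max-pool over the K persons
```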
In the process of performing classification prediction according to the action characteristics to determine the group action category of the image to be processed, different modes can be adopted to obtain the classification result of the group action.
Optionally, the motion feature vector at time t is input into the classification module to obtain the classification result of that frame. The classification result produced by the classification module for the motion feature vector at any one of the T moments may be taken as the classification result of the group action in the T-frame image. The classification result of the group action in the T-frame image may also be understood as the classification result of the group action of the persons in the T-frame image, or as the classification result of the T-frame image.
Optionally, the motion feature vectors of the T-frame images are respectively input into a classification module to obtain classification results of each frame of image. The classification result of the T-frame image may belong to one or more categories. And taking one of the output categories of the classification module, which corresponds to the T frame image and has the largest number of images, as a classification result of the group action in the T frame image.
Optionally, the motion feature vectors of the T frames may be averaged to obtain an average feature vector, in which each element is the average of the corresponding elements of the motion feature vectors of the T frames. The average feature vector may then be input into the classification module to obtain the classification result of the group action in the T-frame image.
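For illustration, a minimal sketch of such a classification module (a linear layer with softmax, with an assumed feature size and class count) applied to the average feature vector:

```python
import torch
import torch.nn as nn

# Sketch of the classification module; D and num_classes are illustrative.
D, num_classes = 512, 8
classifier = nn.Sequential(nn.Linear(D, num_classes), nn.Softmax(dim=-1))

frame_action_feats = torch.randn(10, D)        # motion feature vectors of T = 10 frames
avg_feat = frame_action_feats.mean(dim=0)      # average feature vector
probs = classifier(avg_feat)                   # probability of each group-action class
group_action = int(probs.argmax())             # predicted group action of the T frames
```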
The method can complete the complex reasoning process of group action recognition: determining image characteristics of multi-frame images, determining time sequence characteristics and space characteristics according to interdependence relations between different characters and actions at different times in the images, fusing the image characteristics to obtain action characteristics of each frame of images, and deducing group actions of the multi-frame images by classifying the action characteristics of each frame of images.
In a fourth aspect, an image recognition device is provided, the image recognition device having functionality to implement the method of the first to third aspects or any possible implementation thereof.
Optionally, the image recognition apparatus comprises respective modules or units implementing the method in an implementation of any of the first to third aspects.
In a fifth aspect, a training device for a neural network is provided, the training device having a function for implementing the method in any one of the implementation manners of the first aspect to the third aspect.
Optionally, the training device comprises respective modules implementing the method in an implementation of any of the first to third aspects.
Optionally, the training device comprises means for implementing the method in an implementation of any of the first to third aspects.
In a sixth aspect, there is provided an image recognition apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of any one of the implementations of the first to third aspects described above when the program stored in the memory is executed.
In a seventh aspect, there is provided a training apparatus for a neural network, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of any one of the implementations of the first to third aspects described above when the program stored in the memory is executed.
In an eighth aspect, there is provided an electronic device including the image recognition apparatus in the fourth or sixth aspect.
The electronic device in the eighth aspect may specifically be a mobile terminal (e.g., a smart phone), a tablet computer, a notebook computer, an augmented reality/virtual reality device, an in-vehicle terminal device, or the like.
In a ninth aspect, there is provided a computer device comprising the training apparatus of the neural network in the fifth or seventh aspect.
The computer device may specifically be a computer, a server, a cloud device, or a device with a certain computing power that can implement training on a neural network.
In a tenth aspect, the present application provides a computer readable storage medium having stored therein computer instructions which, when run on a computer, cause the computer to perform the method of any one of the implementations of the first to third aspects.
In an eleventh aspect, the present application provides a computer program product comprising computer program code which, when run on a computer, causes the computer to perform the method of any one of the implementations of the first to third aspects.
In a twelfth aspect, a chip is provided, the chip comprising a processor and a data interface, the processor reading instructions stored on a memory through the data interface to perform the method of any one of the implementations of the first to third aspects.
Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, where the instructions, when executed, are configured to perform the method in any implementation manner of the first aspect to the third aspect.
The chip may be a field programmable gate array FPGA or an application specific integrated circuit ASIC.
Drawings
FIG. 1 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of an application environment provided by an embodiment of the present application;
FIG. 3 is a schematic flow chart of a method for group action recognition provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart of a method for group action recognition provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a system architecture according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a convolutional neural network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a chip hardware structure according to an embodiment of the present application;
FIG. 8 is a schematic flow chart of a training method of a neural network model provided by an embodiment of the present application;
FIG. 9 is a schematic flow chart of an image recognition method provided by an embodiment of the present application;
FIG. 10 is a schematic flow chart of an image recognition method provided by an embodiment of the present application;
FIG. 11 is a schematic flow chart of an image recognition method provided by an embodiment of the present application;
FIG. 12 is a schematic diagram of a process for acquiring a locally visible image provided by an embodiment of the present application;
FIG. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application;
FIG. 14 is a schematic illustration of the spatial relationship of different character actions provided by embodiments of the present application;
FIG. 15 is a schematic diagram of the spatial relationship of different character actions provided by embodiments of the present application;
FIG. 16 is a schematic diagram of the relationship of actions of a person over time provided by an embodiment of the present application;
FIG. 17 is a schematic diagram of the relationship of actions of a character over time provided by an embodiment of the present application;
FIG. 18 is a schematic diagram of a system architecture of an image recognition network according to an embodiment of the present application;
fig. 19 is a schematic structural view of an image recognition device according to an embodiment of the present application;
Fig. 20 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;
Fig. 21 is a schematic structural diagram of a neural network training device according to an embodiment of the present application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
The scheme of the application can be applied to the fields of video analysis, video identification, abnormal or dangerous behavior detection and the like which need to analyze the videos of multiple complex scenes. The video may be, for example, sports game video, daily surveillance video, etc. Two general application scenarios are briefly described below.
Application scenario one: video management system
With the rapid rise in mobile network speed, users have stored a large amount of short videos on electronic devices. More than one person may be included in the short video. The short videos in the video library are identified, so that the user or the system can conveniently conduct classified management on the video library, and user experience is improved.
As shown in FIG. 1, the group action recognition system provided by the application uses a given database to train a neural network structure suitable for short-video classification and to perform deployment testing. The trained neural network can classify short videos to obtain the group action category corresponding to each short video, and different labels are attached to different short videos. This makes it convenient for users to view and search, saves the time of manual classification and management, and improves management efficiency and user experience.
And (2) an application scene II: critical person detection system
Typically, several people are included in the video, most of which are not important. Effectively detecting key characters facilitates rapid understanding of scene content. As shown in FIG. 2, the group action recognition system provided by the application can recognize the key characters in the video, so that the video content can be more accurately understood according to the surrounding information of the key characters.
For easy understanding, related terms and related concepts such as neural networks related to the embodiments of the present application are described below.
(1) Neural network
The neural network may be composed of neural units. A neural unit may refer to an arithmetic unit that takes x_s and an intercept b as inputs, and the output of the arithmetic unit may be:
$h_{W,b}(x) = f(W^{T}x + b) = f\left(\sum_{s=1}^{n} W_{s} x_{s} + b\right)$
where s = 1, 2, ..., n, n is a natural number greater than 1, W_s is the weight of x_s, and b is the bias of the neural unit. f is the activation function of the neural unit, used to introduce a nonlinear characteristic into the neural network so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another. The input of each neural unit may be connected to the local receptive field of the previous layer to extract features of that local receptive field; the local receptive field may be a region composed of several neural units.
(2) Deep neural network
Deep neural networks (deep neural network, DNN), also known as multi-layer neural networks, can be understood as neural networks having many hidden layers; there is no particular metric for "many" here. According to the positions of the different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer. Typically the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. For example, in a fully connected neural network the layers are fully connected, that is, any neuron in layer i must be connected to every neuron in layer i+1. Although a DNN looks complex, the work of each layer is not complex; it is simply the following linear relational expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are also many coefficient matrices $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$, where the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
In summary, the coefficient from the kth neuron of layer L-1 to the jth neuron of layer L is defined as $W^{L}_{jk}$. It should be noted that the input layer has no $W$ parameters. In deep neural networks, more hidden layers make the network better able to characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and greater "capacity", which means it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices; the ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of the many layers).
(3) Convolutional neural network
The convolutional neural network (convolutional neuron network, CNN) is a deep neural network with a convolutional structure. The convolutional neural network comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter and the convolution process can be seen as a convolution with an input image or convolution feature plane (feature map) using a trainable filter. The convolution layer refers to a neuron layer in the convolution neural network, which performs convolution processing on an input signal. In the convolutional layer of the convolutional neural network, one neuron may be connected with only a part of adjacent layer neurons. In a convolutional layer, a number of feature planes are typically included, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The underlying principle in this is: the statistics of a certain part of the image are the same as other parts. I.e. meaning that the image information learned in one part can also be used in another part. The same learned image information can be used for all locations on the image. In the same convolution layer, a plurality of convolution kernels may be used to extract different image information, and in general, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network. In addition, the direct benefit of sharing weights is to reduce the connections between layers of the convolutional neural network, while reducing the risk of overfitting.
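The following sketch illustrates the weight sharing described above: a single convolution kernel is slid over the whole input, so the same weights extract the same kind of information at every location. The image size, kernel values and the edge-detection interpretation are assumptions for illustration only.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution (no padding, stride 1) with one shared kernel."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # the same weight matrix is applied at every position
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(36, dtype=float).reshape(6, 6)   # made-up 6x6 single-channel image
edge_kernel = np.array([[1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0],
                        [1.0, 0.0, -1.0]])         # one trainable kernel, hand-initialized here
feature_map = conv2d(image, edge_kernel)
print(feature_map.shape)                           # (4, 4): one feature plane per kernel
```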
(4) A recurrent neural network (recurrent neural network, RNN) is used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this common neural network solves many problems, it is still powerless for many others. For example, to predict the next word of a sentence, it is generally necessary to use the previous words, because the words in a sentence are not independent of each other. The RNN is called a recurrent neural network because the current output of a sequence is also related to the previous outputs. The concrete expression is that the network memorizes the previous information and applies it to the calculation of the current output; that is, the nodes between the hidden layers are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. Training an RNN is similar to training a traditional CNN or DNN: the error back-propagation algorithm is also used, but with a difference: if the RNN is unrolled, the parameters therein, such as W, are shared, which is not the case for the traditional neural networks described above. Moreover, when using a gradient descent algorithm, the output of each step depends not only on the network at the current step, but also on the network states of the previous steps. This learning algorithm is referred to as back-propagation through time (back propagation through time, BPTT).
Why is a recurrent neural network needed when the convolutional neural network already exists? The reason is simple: in a convolutional neural network there is a precondition that the elements are independent of each other, and the inputs and outputs are also independent, such as cats and dogs. In the real world, however, many elements are interconnected, such as stock prices changing over time, or a person saying: "I like travel, and my favourite place is Yunnan; I will go there when I have the chance." Here humans know that the blank should be filled with "Yunnan", because humans infer from the context. But how can a machine do this? RNNs were developed for this purpose. RNNs aim to give machines the ability to memorize as humans do. Thus, the output of an RNN needs to rely on the current input information and on historical memory information.
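A minimal sketch of the recurrence described above follows; the input and hidden sizes, weights and sequence length are illustrative assumptions. The hidden state carries the "memory", and the same parameters W are reused at every time step, which is what BPTT exploits when unrolling the network.

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_hid = 3, 4                          # made-up input and hidden sizes

# The same W_xh, W_hh and b are shared across all time steps.
W_xh = rng.standard_normal((d_hid, d_in)) * 0.1
W_hh = rng.standard_normal((d_hid, d_hid)) * 0.1
b = np.zeros(d_hid)

def rnn_step(x_t, h_prev):
    # hidden input = output of the input layer + hidden state of the previous moment
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

sequence = rng.standard_normal((5, d_in))   # a made-up sequence of 5 time steps
h = np.zeros(d_hid)
for x_t in sequence:
    h = rnn_step(x_t, h)                    # current output depends on current input and history
print(h)
```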
(5) Loss function
In training a deep neural network, since the output of the deep neural network is expected to be as close as possible to the value actually desired, the weight vectors of each layer of the neural network can be updated by comparing the predicted value of the current network with the actually desired target value and adjusting according to the difference between them (of course, there is usually an initialization process before the first update, that is, pre-configuring parameters for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, and training the deep neural network then becomes the process of reducing this loss as much as possible.
(6) Residual error network
When the depth of the neural network is continuously increased, a degradation problem occurs: the accuracy first rises as the depth of the neural network increases, then saturates, and finally decreases as the depth continues to increase. The biggest difference between an ordinary direct-connected convolutional neural network and a residual network (ResNet) is that ResNet has many branches that bypass the input directly to later layers; by passing the input information directly to the output, the integrity of the information is protected and the degradation problem is solved. The residual network includes convolutional layers and/or pooling layers.
The residual network may be understood as follows: in addition to the layer-by-layer connections among the multiple hidden layers of the deep neural network (for example, the 1st hidden layer is connected to the 2nd hidden layer, the 2nd hidden layer to the 3rd hidden layer, and the 3rd hidden layer to the 4th hidden layer; these hidden layers are the data operation paths of the neural network and may also be visually called neural network transmission), the residual network additionally has a direct-connection branch from the 1st hidden layer straight to the 4th hidden layer, that is, the processing of the 2nd and 3rd hidden layers is skipped and the data of the 1st hidden layer is transmitted directly to the 4th hidden layer for operation. A highway network may be understood as follows: in addition to the above operation path and the direct-connection branch, the deep neural network further includes a weight acquisition branch, which introduces a transmission gate to acquire a weight value, and outputs a weight value T for the subsequent operation of the above operation path and the direct-connection branch.
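The direct-connection branch can be sketched as follows; the layer sizes, weights and choice of activation are assumptions for illustration. The input of the block is added back to the output of the skipped layers, so the information of the 1st hidden layer reaches the 4th hidden layer unchanged.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

rng = np.random.default_rng(2)
d = 8                                           # made-up feature width
W2 = rng.standard_normal((d, d)) * 0.1
W3 = rng.standard_normal((d, d)) * 0.1

def residual_block(h1):
    """h1: output of the 1st hidden layer.
    The main path goes through two more hidden layers; the direct-connection
    branch skips them and is added to their output."""
    h2 = relu(W2 @ h1)          # 2nd hidden layer
    h3 = W3 @ h2                # 3rd hidden layer (before activation)
    return relu(h3 + h1)        # input to the 4th hidden layer: main path + skipped input

h1 = rng.standard_normal(d)
print(residual_block(h1).shape)   # (8,)
```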
(7) Back propagation algorithm
The convolutional neural network can adopt the back propagation (back propagation, BP) algorithm during training to correct the values of the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward movement dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrices.
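As a toy illustration only (one linear layer, a squared-error loss and a hand-derived gradient, all chosen here for simplicity rather than taken from the embodiments), the sketch below shows the forward pass producing an error loss and the backward update of the parameters so that the loss converges.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 4))            # made-up inputs
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w                               # made-up targets

w = np.zeros(4)                              # initial parameters of the model
lr = 0.1
for step in range(200):
    pred = X @ w                             # forward propagation
    err = pred - y
    loss = np.mean(err ** 2)                 # error loss at the output
    grad = 2 * X.T @ err / len(X)            # back-propagate the error to the weights
    w -= lr * grad                           # update parameters to reduce the loss
print(loss, w)                               # loss converges, w approaches true_w
```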
(8) Pixel value
The pixel value of an image may be a Red-Green-Blue (RGB) color value, and the pixel value may be a long integer representing a color. For example, a pixel value is 255×Red+100×Green+76×Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. For each color component, the smaller the value, the lower the luminance, and the larger the value, the higher the luminance. For a grayscale image, the pixel value may be a gray value.
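For instance, under the long-integer encoding mentioned above (the weights 255/100/76 are the ones given in the text; the gray value as a component average is an additional illustrative assumption), a pixel value could be computed as follows.

```python
def pixel_value(red, green, blue):
    # long-integer colour value using the example weighting from the text
    return 255 * red + 100 * green + 76 * blue

def gray_value(red, green, blue):
    # for a grayscale image the pixel value may simply be a gray level,
    # e.g. an average of the components (an illustrative choice)
    return (red + green + blue) // 3

print(pixel_value(10, 20, 30))   # 6830
print(gray_value(10, 20, 30))    # 20
```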
(9) Group action recognition
Group action recognition (group activity recognition, GAR), also referred to as group activity recognition, is used to identify what a group of people are doing in a video and is an important problem in computer vision. GAR has many potential applications, including video surveillance and sports video analysis. In contrast to conventional single-person action recognition, GAR not only needs to recognize the behavior of individual persons, but also needs to infer the potential relationships between persons.
The group action recognition may be performed in the following manner:
(1) Extracting a time sequence feature (also called character action representation) of each character from the corresponding bounding box;
(2) Inferring the spatial context (also referred to as an interaction representation) between the persons;
(3) These representations are connected as final group activity characteristics (also known as feature aggregation).
These methods are indeed effective, but ignore the concurrency of the multi-level information, resulting in unsatisfactory performance of the GAR.
A group action is composed of the different actions of a plurality of persons in the group, that is, it is equivalent to an action completed by the cooperation of multiple persons, and the actions of the persons are in turn reflected in different body postures.
In addition, conventional models often ignore the spatial dependency between different persons, while the spatial dependency between persons and the temporal dependency of each person's action can provide important clues for GAR. For example, a person must observe the condition of his teammates while spiking the ball, and at the same time must constantly adjust his posture over time to perform such a spiking action; several such persons cooperating with each other complete a group action. All of the above information, including the motion feature of each person in each frame of image (also called the human parts feature), the temporal and spatial dependency features of each person's action (also called the human actions features), the feature of each frame of image (also called the group activity feature), and the interrelations between these features, together forms an entity that affects the recognition of the group action. That is, by processing the complex information of such an entity with a stepwise method, the conventional methods cannot take full advantage of the potential temporal and spatial dependencies therein. Furthermore, these methods are also highly likely to disrupt the co-occurrence relationship between the spatial and temporal domains. Existing methods train the CNN network only for extracting temporal dependency features, so that the features extracted by the feature extraction network ignore the spatial dependency between the persons in the image. In addition, the bounding box contains much redundant information, which may reduce the accuracy of the extracted motion features of the persons.
FIG. 3 is a schematic flow chart of a method of group action recognition. See in particular "A Hierarchical Deep Temporal Model for Group Activity Recognition" (Ibrahim M S, Muralidharan S, Deng Z, et al. IEEE Conference on Computer Vision and Pattern Recognition, 2016: 1971-1980).
This existing algorithm performs target tracking on a plurality of persons in multiple video frames and determines the size and position of each person in each of the video frames. The convolution feature of each person in each video frame is extracted using a person CNN, and the convolution feature is input into a person long short-term memory (long short-term memory, LSTM) network to extract the time sequence feature of each person. The convolution feature and the time sequence feature corresponding to each person are concatenated as the person action feature of that person. The person action features of the multiple persons in the video are concatenated and max-pooled to obtain the action feature of each video frame. The action feature of each video frame is input into a group LSTM to obtain the feature corresponding to the video frame. The features corresponding to the video frames are input into a group action classifier, so as to classify the input video, that is, to determine the category to which the group action in the video belongs.
Two-step training is required to obtain a hierarchical deep temporal model (hierarchical deep temporal model, HDTM) that can identify videos containing this particular type of group action. The HDTM model includes the person CNN, the person LSTM, the group LSTM and the group action classifier.
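The two-stage hierarchy can be sketched schematically as follows; the person CNN, person LSTM and group LSTM of the cited work are replaced by stand-in functions with made-up dimensions, so this only illustrates the data flow, not the actual networks.

```python
import numpy as np

rng = np.random.default_rng(4)
T, N, D = 10, 6, 16          # made-up: T frames, N tracked persons, D-dim features

def person_cnn(crop):        # stand-in for the person CNN
    return rng.standard_normal(D)

def person_lstm(features):   # stand-in for the person LSTM: summarizes a person's frames
    return features.mean(axis=0)

def group_lstm(frame_feats): # stand-in for the group LSTM
    return frame_feats.mean(axis=0)

# 1) per-person convolution features and time sequence features
crops = [[None for _ in range(N)] for _ in range(T)]   # tracked person crops (placeholders)
conv = np.stack([[person_cnn(crops[t][n]) for n in range(N)] for t in range(T)])  # (T, N, D)
temporal = np.stack([person_lstm(conv[:, n]) for n in range(N)])                  # (N, D)

# 2) concatenate conv + temporal per person, then max-pool over persons per frame
person_feat = np.concatenate([conv, np.repeat(temporal[None], T, axis=0)], axis=2)  # (T, N, 2D)
frame_feat = person_feat.max(axis=1)                                                # (T, 2D)

# 3) group LSTM over the frame features; the result would go to the group action classifier
video_feat = group_lstm(frame_feat)
print(video_feat.shape)      # (2D,)
```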
During training, the existing algorithm performs target tracking on the plurality of persons in the multiple video frames and determines the size and position of each person in each video frame. Each person corresponds to a person action label, and each input video corresponds to a group action label.
In the first step, the person CNN, the person LSTM and a person action classifier are trained according to the person action label of each person, so as to obtain the trained person CNN and the trained person LSTM.
In the second step, the group LSTM and the group action classifier are trained according to the group action label, so as to obtain the trained group LSTM and the trained group action classifier.
The person CNN and the person LSTM obtained in the first training step are used to extract the convolution feature and the time sequence feature of each person in the input video. The second training step is then performed on the feature representation of each video frame obtained by concatenating the extracted convolution features and time sequence features of the multiple persons. After the two-step training is completed, the obtained neural network model can perform group action recognition on an input video.
The determination of the person action feature representation of each person is performed by the neural network model trained in the first step, while the fusion of the person action feature representations of the multiple persons to identify the group action is performed by the neural network model trained in the second step. There is therefore a gap between feature extraction and group action classification: the neural network model obtained in the first training step can accurately extract features for identifying a person's action, but there is no guarantee that these features are suitable for identifying the group action.
Fig. 4 is a schematic flow chart of a method of group action recognition. See in particular "Social scene understanding: End-to-end multi-person action localization and collective activity recognition" (Bagautdinov, Timur, et al. IEEE Conference on Computer Vision and Pattern Recognition, 2017: 4315-4324).
The image of the t-th frame among the video frames is fed into a fully convolutional network (fully convolutional network, FCN) to obtain a number of person features f_t. Temporal modeling is performed on the person features f_t through an RNN to obtain the time sequence feature of each person, and the time sequence feature of each person is fed into a classifier to simultaneously identify the person action p^I_t and the group action p^C_t.
One-step training is required to obtain a neural network model that can identify videos containing this particular type of group action. That is, the training images are input into the FCN, and the parameters of the FCN and the RNN are adjusted according to the person action label of each person in the training images and the group action label, so as to obtain the trained FCN and RNN.
The FCN may generate a multi-scale feature map F_t of the t-th frame image. A number of detection boxes B_t and the corresponding probabilities p_t are generated by a deep fully convolutional network (deep fully convolutional network, DFCN), and B_t and p_t are fed into a Markov random field (Markov random field, MRF) to obtain the trusted detection boxes b_t, so as to determine the features f_t corresponding to the trusted detection boxes b_t from the multi-scale feature map F_t. From the features of the persons in the trusted detection boxes b_{t-1} and b_t, the same person in b_{t-1} and b_t can be determined. The FCN may also be obtained by pre-training.
A group action consists of the different actions of several persons, and these actions are in turn reflected in the different body poses of each person. The time sequence feature of a person can reflect the temporal dependency of that person's action. The spatial dependency between the actions of different persons also provides important clues for group action recognition. A group action recognition scheme that does not consider the spatial dependency between persons therefore has its accuracy affected to some extent.
In addition, in the training process of the neural network, the person action labels of each person are usually determined manually, so that the workload is high.
In order to solve the above problems, an embodiment of the present application provides an image recognition method. When the group actions of the plurality of people are determined, not only the time sequence characteristics of the plurality of people are considered, but also the space characteristics of the plurality of people are considered, and the group actions of the plurality of people can be determined better and more accurately by integrating the time sequence characteristics and the space characteristics of the plurality of people.
A system architecture according to an embodiment of the present application will be described with reference to fig. 5.
Fig. 5 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 5, the system architecture 500 includes an execution device 510, a training device 520, a database 530, a client device 540, a data storage system 550, and a data acquisition system 560.
In addition, the execution device 510 includes a computing module 511, an I/O interface 512, a preprocessing module 513, and a preprocessing module 514. Among other things, the calculation module 511 may include a target model/rule 501 therein, with the preprocessing module 513 and preprocessing module 514 being optional.
The data acquisition device 560 is used to acquire training data. For the image recognition method according to the embodiment of the present application, the training data may include multi-frame training images (the multi-frame training images contain a plurality of persons) and corresponding labels, where a label gives the group action category of the persons in the training images. After the training data is collected, the data acquisition device 560 stores the training data in the database 530, and the training device 520 trains the target model/rule 501 based on the training data maintained in the database 530.
The training device 520 obtains the target model/rule 501 based on the training data: the training device 520 recognizes the input multi-frame training images and compares the output prediction category with the label, until the difference between the prediction category output by the training device 520 and the labeled result is smaller than a certain threshold, thereby completing the training of the target model/rule 501.
The above-mentioned target model/rule 501 can be used to implement the image recognition method according to the embodiment of the present application, that is, one or more frames of images to be processed (after the related preprocessing) are input into the target model/rule 501, so as to obtain the group action category of the person in the one or more frames of images to be processed. The target model/rule 501 in the embodiment of the present application may be specifically a neural network. In practical applications, the training data maintained in the database 530 is not necessarily acquired by the data acquisition device 560, but may be received from other devices. It should be further noted that the training device 520 is not necessarily completely based on the training data maintained by the database 530 to perform training of the target model/rule 501, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 501 obtained by training with the training device 520 may be applied to different systems or devices, such as the execution device 510 shown in fig. 5. The execution device 510 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/virtual reality (virtual reality, VR) device or a vehicle-mounted terminal, or may be a server or a cloud device. In fig. 5, the execution device 510 is configured with an input/output (input/output, I/O) interface 512 for data interaction with external devices, and a user may input data to the I/O interface 512 through the client device 540. In the embodiment of the present application, the input data may include the image to be processed input by the client device. The client device 540 here may in particular be a terminal device.
The preprocessing module 513 and the preprocessing module 514 are configured to perform preprocessing according to input data (such as an image to be processed) received by the I/O interface 512, and in an embodiment of the present application, the preprocessing module 513 and the preprocessing module 514 may be omitted or only one preprocessing module may be used. When the preprocessing module 513 and the preprocessing module 514 are not present, the calculation module 511 may be directly employed to process the input data.
In preprocessing input data by the execution device 510, or in performing processing related to computation or the like by the computation module 511 of the execution device 510, the execution device 510 may call data, codes or the like in the data storage system 550 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 550.
Finally, the I/O interface 512 presents the processing results, such as the group action categories calculated by the target model/rule 501, to the client device 540 for presentation to the user.
Specifically, the group action category processed by the target model/rule 501 in the computing module 511 may be processed by the preprocessing module 513 (and may also be processed by the preprocessing module 514) and then the processing result is sent to the I/O interface, and then the processing result is sent to the client device 540 by the I/O interface for display.
It should be appreciated that when the preprocessing module 513 and the preprocessing module 514 are not present in the system architecture 500, the computing module 511 may also transmit the processed group action category to the I/O interface, and then send the processing result to the client device 540 for display by the I/O interface.
It should be noted that the training device 520 may generate, based on different training data, a corresponding target model/rule 501 for different targets or different tasks, where the corresponding target model/rule 501 may be used to achieve the targets or to perform the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 5, the user may manually give input data, which may be manipulated through an interface provided by I/O interface 512. In another case, the client device 540 may automatically send the input data to the I/O interface 512, and if the client device 540 is required to automatically send the input data requiring authorization from the user, the user may set the corresponding permissions in the client device 540. The user may view the results output by the execution device 510 at the client device 540, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 540 may also be used as a data collection terminal to collect input data from the input I/O interface 512 and output data from the output I/O interface 512 as new sample data, and store the new sample data in the database 530. Of course, instead of being collected by the client device 540, the I/O interface 512 may directly store the input data of the I/O interface 512 and the output result of the I/O interface 512 as new sample data into the database 530.
It should be noted that fig. 5 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 5, the data storage system 550 is an external memory with respect to the execution device 510, and in other cases, the data storage system 550 may be disposed in the execution device 510.
As shown in fig. 5, the training device 520 trains the target model/rule 501, which may be a neural network in the embodiment of the present application, and specifically, the neural network provided in the embodiment of the present application may be a CNN and a deep convolutional neural network (deep convolutional neural networks, DCNN), and so on.
Since CNN is a very common neural network, the structure of CNN is described below with emphasis in conjunction with fig. 6. As described in the introduction to the basic concepts above, the convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. The deep learning architecture refers to performing multiple levels of learning at different abstraction levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network in which the individual neurons can respond to the image input into it.
Fig. 6 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present application. As shown in fig. 6, convolutional Neural Network (CNN) 600 may include an input layer 610, a convolutional layer/pooling layer 620 (where the pooling layer is optional), and a fully connected layer (fully connected layer) 630. The relevant contents of these layers are described in detail below.
Convolution layer/pooling layer 620:
Convolution layer:
The convolutional layer/pooling layer 620 shown in fig. 6 may include layers 621 to 626 as examples. For example: in one implementation, layer 621 is a convolutional layer, layer 622 is a pooling layer, layer 623 is a convolutional layer, layer 624 is a pooling layer, layer 625 is a convolutional layer, and layer 626 is a pooling layer; in another implementation, layers 621 and 622 are convolutional layers, 623 is a pooling layer, 624 and 625 are convolutional layers, and 626 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The internal principle of operation of one convolution layer will be described below using convolution layer 621 as an example.
The convolutional layer 621 may include a number of convolution operators, also known as kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During the convolution operation on an image, the weight matrix is usually processed on the input image pixel by pixel (or two pixels by two pixels, depending on the value of the stride) in the horizontal direction, so as to complete the extraction of a specific feature from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and during the convolution operation the weight matrix extends to the entire depth of the input image. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns), that is, multiple matrices of the same type, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices may be used to extract different features in the image; for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, and yet another weight matrix is used to blur unwanted noise in the image. The multiple weight matrices have the same size (rows × columns), the convolution feature maps extracted by the weight matrices of the same size also have the same size, and the extracted convolution feature maps of the same size are combined to form the output of the convolution operation.
In practical applications, the weight values in these weight matrices need to be obtained through a large amount of training, and each weight matrix formed by the trained weight values can be used to extract information from the input image, so that the convolutional neural network 600 can make correct predictions.
When convolutional neural network 600 has multiple convolutional layers, the initial convolutional layer (e.g., 621) tends to extract more general features, which may also be referred to as low-level features; as the depth of convolutional neural network 600 increases, features extracted by the later convolutional layers (e.g., 626) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, a pooling layer often needs to be periodically introduced after a convolutional layer. In the layers 621 to 626 illustrated at 620 in FIG. 6, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may calculate the average of the pixel values within a particular range of the image as the result of average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
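A small sketch of the two pooling operators follows; the window size and input values are assumptions for illustration. Each output pixel summarizes the corresponding sub-region of the input.

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    """Non-overlapping pooling with a size x size window."""
    h, w = image.shape
    image = image[:h - h % size, :w - w % size]            # crop to a multiple of the window
    blocks = image.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))

img = np.arange(16, dtype=float).reshape(4, 4)             # made-up 4x4 feature map
print(pool2d(img, 2, "max"))   # each value is the maximum of a 2x2 sub-region
print(pool2d(img, 2, "avg"))   # each value is the average of a 2x2 sub-region
```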
Full connectivity layer 630:
After processing by the convolutional layer/pooling layer 620, the convolutional neural network 600 is not yet sufficient to output the required output information. As described above, the convolutional layer/pooling layer 620 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network 600 needs to use the fully connected layer 630 to generate one output or a group of outputs of the required number of classes. Thus, the fully connected layer 630 may include multiple hidden layers (631, 632 to 63n as shown in fig. 6) and an output layer 640, where the parameters included in the multiple hidden layers may be pre-trained according to the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the fully connected layer 630, that is, as the final layer of the entire convolutional neural network 600, comes the output layer 640. The output layer 640 has a loss function similar to categorical cross-entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the entire convolutional neural network 600 is completed (e.g., propagation in the direction from 610 to 640 in fig. 6 is the forward propagation), the backward propagation (e.g., propagation in the direction from 640 to 610 in fig. 6 is the backward propagation) starts to update the weights and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 600 and the error between the result output by the convolutional neural network 600 through the output layer and the ideal result.
It should be noted that the convolutional neural network 600 shown in fig. 6 is only an example of a convolutional neural network, and the convolutional neural network may also exist in the form of other network models in a specific application.
It should be appreciated that the Convolutional Neural Network (CNN) 600 shown in fig. 6 may be used to perform the image recognition method according to the embodiment of the present application, and as shown in fig. 6, the group action category may be obtained after the image to be processed is processed by the input layer 610, the convolutional layer/pooling layer 620 and the full-connection layer 630.
Fig. 7 is a schematic diagram of a chip hardware structure according to an embodiment of the present application. As shown in fig. 7, the chip includes a neural network processor 700. The chip may be provided in an execution device 510 as shown in fig. 5 to perform the calculation of the calculation module 511. The chip may also be provided in a training device 520 as shown in fig. 5 for completing training work of the training device 520 and outputting the target model/rule 501. The algorithm of each layer in the convolutional neural network as shown in fig. 6 can be implemented in a chip as shown in fig. 7.
The neural network processor (NPU) 700 is mounted as a coprocessor on a main central processing unit (central processing unit, CPU) (host CPU), and the host CPU allocates tasks. The core part of the NPU is the arithmetic circuit 703; the controller 704 controls the arithmetic circuit 703 to extract data from the memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuitry 703 internally includes a plurality of processing units (PEs). In some implementations, the arithmetic circuit 703 is a two-dimensional systolic array. The arithmetic circuit 703 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 703 is a general purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit 703 takes the data corresponding to the matrix B from the weight memory 702 and buffers the data on each PE in the arithmetic circuit 703. The arithmetic circuit 703 takes the matrix a data from the input memory 701 and performs matrix operation with the matrix B, and the obtained partial result or the final result of the matrix is stored in an accumulator (accumulator) 708.
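Functionally, the computation performed by the arithmetic circuit and the accumulator can be pictured with the following sketch; the matrix sizes are arbitrary assumptions, and it illustrates only the accumulation of partial results of A×B, not the actual PE array or memory layout.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((4, 8))      # input matrix A (conceptually from the input memory)
B = rng.standard_normal((8, 3))      # weight matrix B (conceptually from the weight memory)

# Accumulate partial results one rank-1 contribution at a time, as an accumulator would.
C = np.zeros((4, 3))
for k in range(A.shape[1]):
    C += np.outer(A[:, k], B[k, :])  # partial result added into the accumulator

assert np.allclose(C, A @ B)         # the final result equals the full matrix product
print(C.shape)
```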
The vector calculation unit 707 may further process the output of the operation circuit 703, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, vector computation unit 707 may be used for network computation of non-convolutional/non-FC layers in a neural network, such as pooling (pooling), batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 707 can store the vector of processed outputs to the unified buffer 706. For example, the vector calculation unit 707 may apply a nonlinear function to an output of the operation circuit 703, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 707 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 703, for example for use in subsequent layers in a neural network.
The unified memory 706 is used for storing input data and output data.
Input data in the external memory is transferred to the input memory 701 and/or the unified memory 706 directly through the direct memory access controller (direct memory access controller, DMAC) 705, weight data in the external memory is stored into the weight memory 702, and data in the unified memory 706 is stored into the external memory.
A bus interface unit (bus interface unit, BIU) 710 is configured to implement interaction between the main CPU, the DMAC and the instruction fetch memory 709 via a bus.
An instruction fetch memory (instruction fetch buffer) 709 connected to the controller 704 for storing instructions used by the controller 704;
The controller 704 is configured to invoke an instruction cached in the instruction memory 709, so as to control a working process of the operation accelerator.
Typically, the unified memory 706, the input memory 701, the weight memory 702 and the instruction fetch memory 709 are on-chip (on-chip) memories, and the external memory is a memory outside the NPU; the external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), a high bandwidth memory (high bandwidth memory, HBM) or another readable and writable memory.
In addition, in the present application, the operations of the respective layers in the convolutional neural network shown in fig. 6 may be performed by the operation circuit 703 or the vector calculation unit 707.
Fig. 8 is a schematic flowchart of a training method of a neural network model according to an embodiment of the present application.
S801, training data is acquired, wherein the training data comprises T1 frame training images and annotation categories.
The T1 frame training image corresponds to one annotation class. T1 is a positive integer greater than 1. The T1 frame training image may be continuous multi-frame images in a video, or may be multi-frame images selected from a video according to a preset rule. For example, the T1 frame training image may be multi-frame images selected from a video at preset time intervals, or may be a preset number of frames selected from a video.
The T1 frame training image may include a plurality of persons, and the plurality of persons may include only persons, only animals, or both persons and animals.
The label category is used for indicating the category of the group action of the person in the T1 frame training image.
S802, processing the T1 frame training image by utilizing a neural network to obtain a training class.
The following processing is performed on the T1 frame training image by using a neural network:
S802a, extracting image features of the T1 frame training image.
At least one frame of image is selected from the T1 frame of training images, and image characteristics of a plurality of people in each frame of image in the at least one frame of image are extracted.
In a frame of training images, the image features of a person may be used to represent the body pose of the person in the frame of training images, i.e., the relative position between the different limbs of the person. The image features described above may be represented by vectors.
S802b, determining the spatial characteristics of a plurality of people in each training image in at least one training image.
The spatial characteristics of the jth person in the ith training image of the at least one training image are determined according to the similarity between the image characteristics of the jth person in the ith training image and the image characteristics of other persons except the jth person in the ith training image, and i and j are positive integers.
The spatial feature of the jth person in the ith training image is used for representing the association relation between the actions of the jth person in the ith training image and the actions of other persons except the jth person in the ith training image.
The similarity between the corresponding image features of different people in the same frame of image can reflect the spatial dependency degree of the actions of the different people. That is, the more similar the image features corresponding to two persons are, the more closely the relationship between the actions of the two persons is; conversely, the lower the similarity of the image features corresponding to the two persons, the weaker the association between the actions of the two persons.
S802c, determining time sequence characteristics of each person in the plurality of persons in at least one frame of training image in different frames of images.
The time sequence feature of the jth person in the ith training image in the at least one frame of training images is determined according to the similarity between the image feature of the jth person in the ith training image and the image feature of the jth person in the other frames of training images except the ith training image, and i and j are positive integers.
The time sequence feature of the jth person in the ith training image is used for representing the association relation between the action of the jth person in the ith training image and the actions of the jth person in other training images of at least one training image.
The similarity of the corresponding image features of a person in two frames of images can reflect the time dependence of the action of the person. The higher the similarity of the corresponding image features of a person in two frames of images, the closer the association between the actions of the person at two points in time; conversely, the lower the similarity, the weaker the association between the actions of the character at these two points in time.
S802d, determining action characteristics of a plurality of people in each training image in at least one training image.
The motion characteristics of the jth person in the ith training image are obtained by fusing the spatial characteristics of the jth person in the ith training image, the time sequence characteristics of the jth person in the ith training image and the image characteristics of the jth person in the ith training image.
S802e, identifying group actions of a plurality of people in the T1 frame training image according to action characteristics of the plurality of people in each frame training image in the at least one frame training image, so as to obtain training categories corresponding to the group actions.
The motion features of each of the plurality of people in each of the at least one training image may be fused to obtain a feature representation of each of the at least one training image.
An average value of each bit of the training feature representation of each frame of training images in the T1 training frame image may be calculated to obtain an average feature representation. Each bit of the average training feature representation is an average of corresponding bits of the feature representation of each of the T1 frame training images. The training classes may be obtained by classifying, i.e. identifying group actions of a plurality of persons in the T1 frame training image, based on the average feature representation.
To increase the amount of training data, a training class may be determined for each of the at least one training image. The training class for determining each frame of image is described as an example. The at least one training image may be all or part of the T1 frame training image.
S803, determining the loss value of the neural network according to the training category and the labeling category.
The loss value L of the neural network can be expressed as:

L = -(1/T1) · Σ_{t=1}^{T1} Σ_{k=1}^{N_Y} ŷ_k · log(p_{t,k})

where N_Y represents the number of group action categories, that is, the number of categories output by the neural network; ŷ represents the annotation category and is represented by one-hot coding, ŷ contains N_Y bits, and ŷ_k represents one of them; p_t represents the training category of the t-th frame image in the T1 frame images, p_t contains N_Y bits, and p_{t,k} represents one of them. The t-th frame image can also be understood as the image at time t.
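Assuming the cross-entropy form reconstructed above, a minimal sketch of the loss computation could look as follows; the category count, frame count and the softmax used to form the per-frame training categories are made-up assumptions.

```python
import numpy as np

N_Y, T1 = 5, 8                                   # made-up: 5 group action categories, 8 frames
rng = np.random.default_rng(6)

y_hat = np.zeros(N_Y)
y_hat[2] = 1.0                                   # annotation category, one-hot

logits = rng.standard_normal((T1, N_Y))
p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # per-frame training categories p_t

# L = -(1/T1) * sum_t sum_k  y_hat_k * log(p_{t,k})
L = -np.mean(np.sum(y_hat * np.log(p + 1e-12), axis=1))
print(L)
```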
And S804, adjusting the neural network through back propagation according to the loss value.
In the training process, the training data generally includes multiple sets of combinations of training images and labeling categories, where each set of combinations of training images and labeling categories may include one or more training images and a unique labeling category corresponding to the one or more training images.
In the process of training the neural network, an initial set of model parameters can be set for the neural network, and then the model parameters of the neural network are gradually adjusted according to the difference between the training type and the labeling type until the difference between the training type and the labeling type is within a certain preset range, or when the training times reach the preset times, the model parameters of the neural network at the moment are determined to be final parameters of the neural network model, so that the training of the neural network is completed.
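The stopping criteria described above could be organized as in the following schematic loop; `neural_network`, `loss_fn` and the data are placeholders standing in for the modules of steps S801 to S804, not an actual implementation of the embodiments.

```python
def train(neural_network, loss_fn, training_sets, max_steps=10000, loss_threshold=1e-3):
    """training_sets: iterable of (T1-frame training images, annotation category) pairs."""
    neural_network.initialize_parameters()                  # initial set of model parameters
    for step in range(max_steps):                           # stop when the preset step count is reached
        for images, label in training_sets:
            training_category = neural_network.forward(images)   # S802: obtain the training category
            loss = loss_fn(training_category, label)              # S803: loss vs. the annotation category
            if loss < loss_threshold:                        # difference within the preset range
                return neural_network
            neural_network.backward(loss)                    # S804: adjust parameters by back propagation
    return neural_network
```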
Fig. 9 is a schematic flowchart of an image recognition method according to an embodiment of the present application.
And S901, extracting image characteristics of an image to be processed.
The image to be processed comprises a plurality of characters, and the image characteristics of the image to be processed comprise the image characteristics of each character in each frame in a multi-frame image in the image to be processed.
Before step S901, an image to be processed may be acquired. The image to be processed may be retrieved from memory or may be received.
For example, when the image recognition method shown in fig. 9 is performed by the image recognition apparatus, the image to be processed may be an image obtained from the image recognition apparatus, or the image to be processed may be an image received by the image recognition apparatus from another device, or the image to be processed may be captured by a camera of the image recognition apparatus.
The image to be processed may be a continuous multi-frame image in a video, or may be a multi-frame image selected according to a preset rule in a video. For example, in a video, multiple frames of images can be selected according to a preset time interval; or multiple frames of images can be selected from a video according to a preset frame number interval.
It should be understood that among the plurality of persons in the image to be processed described above, the plurality of persons may include only persons, only animals, or both persons and animals.
In a frame of image, the image features of a person may be used to represent the body position of the person in the frame, i.e. the relative position between the different limbs of the person. The image features of a person described above may be represented by vectors, which may be referred to as image feature vectors. The extraction of the above image features may be performed by CNN.
Alternatively, when the image features of the image to be processed are extracted, the characters in the image may be identified, so as to determine the bounding boxes of the characters, the image in each bounding box corresponds to one character, and then the image features of each bounding box are extracted, so as to obtain the image features of each character.
Since more redundant information is included in the image within the bounding box, the redundant information is independent of the action of the person. In order to improve the accuracy of the image feature vectors, the impact of redundant information may be reduced by identifying skeletal nodes of the person in each bounding box.
Optionally, the skeleton node of each person in the bounding box corresponding to the person can be identified first, and then the image feature vector of the person is extracted according to the skeleton node of the person, so that the extracted image feature can reflect the action of the person more accurately, and the accuracy of the extracted image feature is improved.
Furthermore, skeleton nodes in the boundary frame can be connected according to the character structure so as to obtain a connection image; and extracting the image feature vector from the connected image.
Or the region where the skeleton node is located and the region outside the region where the skeleton node is located can be displayed by different colors to obtain a processed image, and then the image characteristics of the processed image are extracted.
Further, a local visible image corresponding to the bounding box can be determined according to the image area where the skeletal node of the person is located, and then feature extraction is performed on the local visible image to obtain image features of the image to be processed.
The locally visible image is an image composed of an area including a skeletal node of a person in the image to be processed. Specifically, the region outside the region where the skeletal nodes of the person are located in the bounding box may be obscured to obtain the locally visible image.
When masking the region outside the region where the skeleton node is located, the color of the pixel corresponding to the region outside the region where the skeleton node is located may be set to a certain preset color, for example, black. That is, the region where the skeleton node is located retains the same information as the original image, and the information of the region outside the region where the skeleton node is located is obscured. Therefore, only the image features of the partially visible image need be extracted, and the extraction operation for the masked region is not required.
The region where the bone node is located may be square, circular or other shape centered on the bone node. The side length (or radius), area, etc. of the region where the bone node is located may be a preset value.
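One possible realization of the masking described above is sketched below; the bounding-box size, skeletal node coordinates, square half-width and the black fill color are invented for illustration. Pixels outside the square regions around the skeletal nodes are set to the preset color, so only the regions around the nodes keep the original information.

```python
import numpy as np

def locally_visible(crop, skeleton_nodes, half=5, fill=0):
    """crop: HxWx3 image inside a person's bounding box.
    skeleton_nodes: list of (row, col) skeletal node coordinates.
    Square regions of side 2*half+1 around each node keep the original
    information; everything else is masked with the preset color `fill`."""
    mask = np.zeros(crop.shape[:2], dtype=bool)
    for r, c in skeleton_nodes:
        mask[max(0, r - half):r + half + 1, max(0, c - half):c + half + 1] = True
    out = np.full_like(crop, fill)
    out[mask] = crop[mask]
    return out

crop = np.random.randint(0, 256, (64, 48, 3), dtype=np.uint8)   # made-up bounding-box image
nodes = [(10, 24), (30, 20), (30, 28), (55, 18), (55, 30)]      # made-up skeletal nodes
print(locally_visible(crop, nodes).shape)                       # (64, 48, 3)
```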
According to the method for extracting the image features of the image to be processed, the features can be extracted according to the local visible image so as to obtain the image feature vector of the person corresponding to the boundary box; a mask matrix may also be determined based on the skeletal nodes, and the image masked based on the mask matrix. See in particular the description of fig. 11 and 12.
When multiple frames of images are acquired, different persons in the images can be determined through target tracking. For example, different persons in an image may be distinguished by sub-features of the persons in the image, where the sub-features may be colors, edges, motion information, texture information, and the like.
S902, determining spatial characteristics of each person in the multiple images in each frame of the multiple images.
And determining the spatial association relationship between the actions of different people in the same frame image through the similarity between the image characteristics of different people in the same frame image.
The spatial feature of the jth person in the ith frame of image in the image to be processed may be determined according to the similarity between the image feature of the jth person in the ith frame of image and the image features of other persons except the jth person in the ith frame of image, where i and j are positive integers.
It should be understood that the spatial feature of the jth person in the ith frame image is used to represent the association between the actions of the jth person in the ith frame image and the actions of other persons in the ith frame image except for the jth person.
Specifically, the similarity between the image feature vector of the jth person in the ith frame image and the image feature vectors of the other persons except for the jth person may reflect the degree of dependence of the jth person in the ith frame image on the actions of the other persons except for the jth person. That is, the higher the similarity of the image feature vectors corresponding to the two persons, the closer the association between the actions of the two persons; conversely, when the similarity of the image feature vectors corresponding to the two persons is lower, the association between the actions of the two persons is weaker. The spatial association of actions of different persons in one frame image can be described with reference to fig. 14 and 15.
S903, determining time sequence characteristics of each person in the multiple images in each frame of the multiple images.
The temporal association between the actions of the same person at different moments is determined by the similarity between the image feature vectors of the same person performing different actions in different frames of images.
The time sequence feature of the jth person in the ith frame image of the image to be processed may be determined according to the similarity between the image feature of the jth person in the ith frame image and the image features of the jth person in the frame images other than the ith frame image, where i and j are positive integers.
The time sequence feature of the jth person in the ith frame image is used to represent the association between the action of the jth person in the ith frame image and the actions of the jth person in the frame images other than the ith frame image.
The similarity of the corresponding image features of a person in two frames of images can reflect the time dependence of the action of the person. The higher the similarity of the corresponding image features of a person in two frames of images, the closer the association between the actions of the person at two points in time; conversely, the lower the similarity, the weaker the association between the actions of the character at these two points in time. The relationship of the actions of one person over time can be seen from the description of fig. 16 and 17.
In the above process, the similarity between features is involved, and the similarity may be obtained in different manners. For example, the similarity between the above features may be calculated by methods such as the Minkowski distance (Minkowski distance) (e.g., the Euclidean distance or the Manhattan distance), cosine similarity, the Chebyshev distance, or the Hamming distance.
Alternatively, the similarity may be calculated by computing the sum of the products of each bit of the two features after a linear transformation.
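As an illustration of the last option, the sketch below computes the similarity of two features as the sum of products of their bits after a linear transformation, and then forms a similarity-weighted combination of the other persons' image features, in the spirit of step S902; the feature dimension, the linear transformations and the softmax normalization are all assumptions, not the specific construction of the embodiments.

```python
import numpy as np

rng = np.random.default_rng(7)
D = 16                                     # made-up image-feature dimension
theta = rng.standard_normal((D, D)) * 0.1  # linear transformation applied to the first feature
phi = rng.standard_normal((D, D)) * 0.1    # linear transformation applied to the second feature

def similarity(f_a, f_b):
    # sum of products of each bit of the two linearly transformed features
    return float((theta @ f_a) @ (phi @ f_b))

# Example: a spatial feature for person j in frame i as a similarity-weighted
# combination of the image features of the other persons in the same frame.
features = rng.standard_normal((6, D))     # image features of 6 persons in frame i
j = 0
weights = np.array([similarity(features[j], features[k]) for k in range(6) if k != j])
weights = np.exp(weights) / np.exp(weights).sum()     # normalize the similarities
others = np.delete(features, j, axis=0)
spatial_feature_j = weights @ others
print(spatial_feature_j.shape)             # (16,)
```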
The spatial association between different person actions and the temporal association between the same person actions can provide important clues for the categories of the multi-person scene in the image. Therefore, in the image recognition process, the method and the device can effectively improve the recognition accuracy by comprehensively considering the spatial association relationship between different person actions and the time association relationship between the same person actions.
S904, determining an action feature of each of the plurality of persons in each of the plurality of frame images.
Alternatively, when determining the motion characteristics of a person in a frame of image, the time sequence characteristics, the space characteristics and the image characteristics corresponding to the person in a frame of image can be fused, so as to obtain the motion characteristics of the person in the frame of image.
For example, the spatial feature of the jth person in the ith frame image in the image to be processed, the time sequence feature of the jth person in the ith frame image, and the image feature of the jth person in the ith frame image may be fused, so as to obtain the action feature of the jth person in the ith frame image.
When the time sequence features, the space features and the image features are fused, different fusion modes can be adopted for fusion, and the fusion modes are exemplified below.
The first mode is to perform fusion by means of combination.
The features to be fused may be added directly, or weighted.
It will be appreciated that the weighted addition is performed by multiplying the features to be fused by a coefficient, i.e. a weight value.
That is, in the combination mode, the features may be linearly combined channel-wise (channel wise).
The plurality of features output from the plurality of layers of the feature extraction network may be added; for example, they may be added directly, or added according to certain weights. If T1 and T2 respectively represent features output by two layers of the feature extraction network, the fused feature may be represented by T3, where T3 = a×T1 + b×T2, a and b being the coefficients (i.e., weight values) multiplied by T1 and T2 when calculating T3, with a ≠ 0 and b ≠ 0.
In the second mode, fusion is performed by means of cascading (concatenation) and channel fusion.
Cascading and channel fusion are another way of fusing features. In this mode, the features to be fused can be concatenated directly along the channel dimension, or concatenated after each feature is multiplied by a certain coefficient, i.e., a weight value.
And thirdly, processing the characteristics by utilizing a pooling layer to realize the fusion of the characteristics.
The plurality of feature vectors may be maximally pooled to determine the target feature vector. In the target feature vector obtained through maximum pooling, each bit is the maximum value of the corresponding bits in the feature vectors. Multiple feature vectors may also be averaged pooled to determine a target feature vector. In the target feature vector obtained through average pooling, each bit is an average value of corresponding bits in the feature vectors.
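For illustration, the following is a minimal Python sketch of the three fusion modes described above (weighted addition, concatenation along the channel dimension, and max/average pooling); the tensor shapes and weight values are assumed for the example, not taken from the embodiment.

```python
import torch

t1 = torch.randn(4, 64)   # feature 1: batch of 4 samples, 64 channels (assumed shape)
t2 = torch.randn(4, 64)   # feature 2, same shape

# Mode 1: weighted addition (a and b are example weight values)
a, b = 0.7, 0.3
fused_add = a * t1 + b * t2                      # shape (4, 64)

# Mode 2: concatenation along the channel dimension
fused_cat = torch.cat([a * t1, b * t2], dim=1)   # shape (4, 128)

# Mode 3: pooling over a set of feature vectors
stacked = torch.stack([t1, t2], dim=0)           # shape (2, 4, 64)
fused_max = stacked.max(dim=0).values            # each bit is the maximum of corresponding bits
fused_avg = stacked.mean(dim=0)                  # each bit is the average of corresponding bits
```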
Alternatively, the features corresponding to a person in a frame of image may be fused in a combined manner to obtain the motion features of the person in the frame of image.
When the multi-frame image is acquired, the feature vector group corresponding to at least one person in the ith frame image may further include a time sequence feature vector corresponding to at least one person in the ith frame image.
S905, identifying group actions of the plurality of people in the image to be processed according to action characteristics of each of the plurality of people in each frame of image in the multi-frame image.
It should be understood that group actions are composed of actions of several characters in a group, i.e., actions that are commonly accomplished by multiple characters.
Alternatively, the group action of the plurality of persons in the image to be processed may be a certain sport or activity, for example, the group action of the plurality of persons in the image to be processed may be playing basketball, playing volleyball, playing football, dancing, etc.
In one implementation, the motion characteristics of each frame of image may be determined based on the motion characteristics of each of a plurality of people of each frame of image in the image to be processed. Then, group actions of a plurality of persons in the image to be processed can be identified based on the action characteristics of each frame of image.
Optionally, the motion features of multiple people in one frame of image may be fused in a mode of maximum pooling, so as to obtain the motion features of the frame of image.
Optionally, the action features of the plurality of persons in each frame image may be fused to obtain the action feature of that frame image; the action feature of each frame image is then input into the classification module to obtain an action classification result for each frame image, and the category that corresponds to the largest number of frame images of the image to be processed among the output categories of the classification module is used as the group action of the plurality of persons in the image to be processed.
Optionally, the action features of the plurality of persons in each frame image may be fused to obtain the action feature of that frame image; the action features of all frame images are then averaged to obtain an average action feature, the average action feature is input into the classification module, and the classification result corresponding to the average action feature is used as the group action of the plurality of persons in the image to be processed.
Optionally, a frame of image can be selected from the images to be processed, the motion features of the frame of image obtained by fusing the motion features of a plurality of people in the frame of image are input into the classification module, so as to obtain a classification result of the frame of image, and the classification result of the frame of image is further used as a group motion of a plurality of people in the images to be processed.
In another implementation, the action features of each of the plurality of persons in each frame image of the image to be processed may be classified to obtain the action of each person, and the group action of the plurality of persons may be determined according to the actions of the individual persons.
Optionally, the action features of each of the plurality of persons in each frame image of the image to be processed may be input into the classification module to obtain a classification result of the action features of each person, that is, the action of each person; the action that corresponds to the largest number of persons is then used as the group action of the plurality of persons.
Optionally, a person may be selected from a plurality of persons, and the motion characteristics of the person in each frame of image may be input into the classification module, so as to obtain a classification result of the motion characteristics of the person, that is, the motion of the person, and then the obtained motion of the person is used as the group motion of the plurality of persons in the image to be processed.
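As a rough sketch of two of the strategies described in this step — classifying each person's action features and taking the most common action, and classifying the average of the per-frame action features — the classifier, the feature dimension of 256 and the number of categories below are assumptions for illustration, not details of the embodiment.

```python
import torch
import torch.nn as nn
from collections import Counter

num_classes = 8                             # assumed number of group-action categories
classifier = nn.Linear(256, num_classes)    # stand-in classification module

def group_action_by_vote(person_features):
    # person_features: tensor of shape (T, K, 256) - action features of K persons in T frames
    T, K, D = person_features.shape
    logits = classifier(person_features.reshape(T * K, D))
    preds = logits.argmax(dim=1).tolist()           # per-person action classification results
    return Counter(preds).most_common(1)[0][0]      # action with the largest number of persons

def group_action_by_frame_average(frame_features):
    # frame_features: tensor of shape (T, 256) - fused action feature of each frame
    avg_feature = frame_features.mean(dim=0)        # average action feature over the frames
    return classifier(avg_feature).argmax().item()
```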
Steps S901 to S904 may be implemented by the neural network model trained in fig. 8.
It should be understood that the steps are not limited in order, for example, the timing characteristics may be determined first, and then the spatial characteristics may be determined, which is not described herein.
In the method shown in fig. 9, when determining the group actions of the plurality of people, not only the time sequence characteristics of the plurality of people are considered, but also the space characteristics of the plurality of people are considered, and the group actions of the plurality of people can be determined better and more accurately by integrating the time sequence characteristics and the space characteristics of the plurality of people.
Optionally, in the method shown in fig. 9, after the group actions of the plurality of people in the image to be processed are identified, tag information of the image to be processed is generated according to the group actions, where the tag information is used to indicate the group actions of the plurality of people in the image to be processed.
The method can be used for classifying the video library, labeling different videos in the video library according to the corresponding group actions, and facilitating the user to check and search.
Alternatively, in the method shown in fig. 9, after the group actions of the plurality of persons in the image to be processed are identified, the key persons of the image to be processed are determined according to the group actions.
Alternatively, in the above-mentioned process of determining the key person, the contribution degree of each person in the plurality of persons in the image to be processed to the group action may be determined first, and then the person with the highest contribution degree may be determined as the key person.
It should be appreciated that the key persona contributes more to the group action of the plurality of personas than other personas of the plurality of personas other than the key persona.
The above approach may be used, for example, to detect key persons in a video. A video image typically contains several persons, most of whom are not important; effective detection of the key person facilitates a more rapid and accurate understanding of the video content based on the information surrounding the key person.
For example, assuming that a video shows a ball game, the player controlling the ball has the greatest influence on all the people in the scene, including players, referees and audience, and contributes most to the group action, so the player controlling the ball can be determined to be the key person. Determining the key person can help a person watching the video understand what is happening and what is about to happen in the game.
Fig. 10 is a schematic flowchart of an image recognition method according to an embodiment of the present application.
S1001, extracting image characteristics of an image to be processed.
The image to be processed comprises at least one frame of image, and the image characteristics of the image to be processed comprise the image characteristics of a plurality of people in the image to be processed.
Prior to step S1001, an image to be processed may be acquired. The image to be processed may be retrieved from memory or may be received.
For example, when the image recognition method shown in fig. 10 is performed by the image recognition apparatus, the image to be processed may be an image obtained from the image recognition apparatus, or the image to be processed may be an image received by the image recognition apparatus from another device, or the image to be processed may be captured by a camera of the image recognition apparatus.
It should be understood that the image to be processed may be one frame image or may be multiple frames image.
When the image to be processed includes multiple frames, they can be consecutive frames of a video, or frames selected from a video according to a preset rule. For example, multiple frames can be selected from a video at a preset time interval, or at a preset frame-number interval.
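A minimal sketch of such a preset sampling rule, assuming the video is given as a sequence of frames at 25 frames per second; the function and parameter names are illustrative.

```python
def sample_frames(video_frames, time_interval=None, frame_interval=None, fps=25):
    # Select frames either at a preset time interval (seconds)
    # or at a preset frame-number interval.
    if time_interval is not None:
        step = max(1, int(time_interval * fps))
    elif frame_interval is not None:
        step = frame_interval
    else:
        return list(video_frames)          # no rule: keep the consecutive frames
    return list(video_frames)[::step]
```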
The image to be processed may include a plurality of persons, and the plurality of persons may include only humans, only animals, or both humans and animals.
Alternatively, the image features of the image to be processed described above may be extracted by the method shown in step S901 in fig. 9.
S1002, determining spatial characteristics of a plurality of people in each frame of to-be-processed image.
The spatial feature of a certain person in the plurality of persons in each frame of the image to be processed is determined according to the similarity between the image feature of the person in the frame of the image to be processed and the image feature of other persons except the person in the frame of the image to be processed.
Alternatively, the spatial characteristics of a plurality of persons in each frame of the image to be processed may be determined by the method shown in step S902 in fig. 9.
S1003, determining action characteristics of a plurality of people in each frame of to-be-processed image.
The action characteristics of a certain person in the plurality of persons in each frame of the image to be processed are obtained by fusing the spatial characteristics of the person in the frame of the image to be processed and the image characteristics of the person in the frame of the image to be processed.
Alternatively, a fusion method shown in step S904 in fig. 9 may be used to determine the action characteristics of the plurality of persons in each frame of the image to be processed.
S1004, identifying group actions of the plurality of people in the image to be processed according to action characteristics of the plurality of people in the image to be processed of each frame.
Alternatively, a method shown in step S905 in fig. 9 may be employed to identify group actions of a plurality of persons in an image to be processed.
In the method shown in fig. 10, the time series characteristics of each person are not calculated; when the determination of a person's spatial characteristics does not depend on that person's time series characteristics, the group actions of the plurality of persons can be determined more easily. The method is also more applicable when only one frame of image is identified, since a single frame has no time sequence features of the same person at different times.
Fig. 11 is a schematic flowchart of an image recognition method according to an embodiment of the present application.
S1101, extracting image characteristics of an image to be processed.
The image to be processed comprises a plurality of frames of images, and the image characteristics of the image to be processed comprise the image characteristics of a plurality of people in each frame of image in at least one frame of image selected from the plurality of frames of images.
Alternatively, the extraction of the features may be performed on images corresponding to a plurality of persons in the input multi-frame image.
In a frame of image, the image features of a person may be used to represent the body position of the person in the frame, i.e. the relative position between the different limbs of the person. The image features of a person described above may be represented by vectors, which may be referred to as image feature vectors. The extraction of the above image features may be performed by CNN.
Optionally, when the image features of the image to be processed are extracted, object tracking may be performed on each person, a bounding box of each person in each frame of image may be determined, the image in each bounding box corresponds to one person, and feature extraction may be performed on the image of each bounding box to obtain the image features of each person.
The image within the bounding box contains a large amount of redundant information that is unrelated to the action of the person. In order to improve the accuracy of the image feature vectors, the impact of the redundant information can be reduced by identifying the skeletal nodes of the person in each bounding box.
Alternatively, skeletal nodes in the bounding box may be connected according to the character structure to obtain a connected image, and then the connected image may be subjected to extraction of an image feature vector. Or the region where the skeleton node is located and the region outside the region where the skeleton node is located can be displayed by different colors, and then the image characteristics of the processed image are extracted.
Further, a local visible image corresponding to the bounding box can be determined according to the image area where the skeletal node of the person is located, and then feature extraction is performed on the local visible image to obtain image features of the image to be processed.
The locally visible image is an image composed of an area including a skeletal node of a person in the image to be processed. Specifically, the region outside the region where the skeletal nodes of the person are located in the bounding box may be obscured to obtain the locally visible image.
When masking the region outside the region where the skeleton node is located, the color of the pixel corresponding to the region outside the region where the skeleton node is located may be set to a certain preset color, for example, black. That is, the region where the skeleton node is located retains the same information as the original image, and the information of the region outside the region where the skeleton node is located is obscured. Therefore, only the image features of the partially visible image need be extracted, and the extraction operation for the masked region is not required.
The region where the bone node is located may be square, circular or other shape centered on the bone node. The side length (or radius), area, etc. of the region where the bone node is located may be a preset value.
According to the method for extracting the image features of the image to be processed, the features can be extracted according to the local visible image so as to obtain the image feature vector of the person corresponding to the boundary box; a mask matrix may also be determined based on the skeletal nodes, and the image masked based on the mask matrix.
The above method of determining a mask matrix from skeletal nodes is specifically illustrated below.
S1101 a) determining a bounding box of each person in advance.
At time t, the image of the kth person within its bounding box is denoted B_t^k.
S1101 b) extracting skeletal nodes of each person in advance.
At time t, the skeletal nodes of the kth person are extracted and denoted J_t^k.
S1101 c) calculates a mask matrix of character actions.
The character-action mask matrix M_t^k can be computed from the person image B_t^k and the skeletal nodes J_t^k. Each element of the mask matrix M_t^k corresponds to one pixel.
Optionally, in the mask matrix M_t^k, the value is set to 1 inside each square region of side length l centered on a skeletal node, and to 0 at all other positions. The calculation formula of the mask matrix M_t^k is:
M_t^k(p) = 1, if pixel p lies within a square of side length l centered on a skeletal node in J_t^k; M_t^k(p) = 0, otherwise.
In the RGB color mode, an RGB model is used to assign an intensity value in the range of 0 to 255 to the RGB components of each pixel in the image. If the RGB color mode is used, the calculation formula of the mask matrix M_t^k can be expressed in the same form with the value inside the square regions set to 1/255 instead of 1, so that the components of the masked image fall between 0 and 1.
The matrix M_t^k is used to mask the original person action image B_t^k to obtain the locally visible image P_t^k, i.e., P_t^k = M_t^k ⊙ B_t^k, where each element corresponds to one pixel, the RGB component value of each pixel in P_t^k is between 0 and 1, and the operator ⊙ denotes that each bit of M_t^k is multiplied by the corresponding bit of B_t^k.
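A minimal sketch of this masking step, assuming the person image is an H×W×3 RGB array with values in 0-255 and the skeletal nodes are given as (row, column) pixel coordinates; the function and variable names are illustrative, not taken from the patent.

```python
import numpy as np

def locally_visible_image(person_image, skeletal_nodes, side_length):
    # person_image: H x W x 3 array (RGB values 0-255) cropped by the bounding box
    # skeletal_nodes: list of (row, col) coordinates of the person's skeletal nodes
    h, w, _ = person_image.shape
    mask = np.zeros((h, w), dtype=np.float32)
    half = side_length // 2
    for r, c in skeletal_nodes:
        r0, r1 = max(0, r - half), min(h, r + half + 1)
        c0, c1 = max(0, c - half), min(w, c + half + 1)
        mask[r0:r1, c0:c1] = 1.0          # keep the square region around each skeletal node
    # element-wise multiplication; dividing by 255 keeps the RGB components in [0, 1]
    return person_image.astype(np.float32) * mask[:, :, None] / 255.0
```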
Fig. 12 is a schematic diagram of a process for acquiring a locally visible image according to an embodiment of the present application. As shown in fig. 12, the picture B_t^k is masked: the square regions of side length l around the skeletal nodes J_t^k are preserved, and the other regions are masked.
It is assumed that the number of persons is the same in each of the T frame images, that is, each of the T frame images contains K persons. Image features x_t^k are extracted from the locally visible images P_t^k corresponding to the K persons in each of the T frame images. Each image feature can be represented by a D-dimensional vector, i.e., x_t^k ∈ R^D. The extraction of the image features of the T frame images may be performed by a CNN.
The set of image features of the K persons in the T frame images may be denoted as X = {x_t^k | t = 1, …, T; k = 1, …, K}. For each person, extracting the image features from the locally visible image P_t^k reduces the redundant information within the bounding box, extracts the image features according to the structural information of the body, and enhances the expressive ability of the image features for the person's actions.
S1102, determining the dependency relationship among actions of different persons in the image to be processed and the dependency relationship among actions of the same person at different moments.
In this step, the spatial correlation of the actions of different persons in the image to be processed and the temporal correlation of the actions of the same person at different times are determined using a cross interaction module (CIM).
The cross interaction module is used for realizing interaction of the features, establishing a feature interaction model, and the feature interaction model is used for representing association relations of the body gestures of the person in time and/or space.
Spatial correlation of the body gestures of a person may be manifested by spatial dependence. Spatial dependence is used to represent the dependence of the body pose of one person in a frame of an image on the body poses of other persons in the frame of the image, i.e. the spatial dependence between the actions of the person. The above spatial dependency may be represented by a spatial feature vector.
For example, if one frame of the image to be processed corresponds to the image at time t, the spatial feature vector s_t^k of the kth person can be expressed as:
s_t^k = (1/K) · Σ_{k'=1,…,K} r(x_t^k, x_t^{k'}) · g(x_t^{k'})
where K represents the K persons in the frame image corresponding to time t, x_t^k represents the image feature of the kth person among the K persons at time t, x_t^{k'} represents the image feature of the k'th person among the K persons at time t, r(a, b) = θ(a)^T φ(b) is used to calculate the similarity between feature a and feature b, and θ(·), φ(·), g(·) represent three linear embedding functions, which may be the same as or different from one another. r(a, b) may represent the dependency of feature b on feature a.
By calculating the similarity between image features of different persons in the same frame of image, spatial dependence between the body poses of different persons in the same frame of image can be determined.
The time dependence of the body gestures of a person may be reflected by a time dependence. The time dependence may also be referred to as a time-series dependence, which is used to represent the dependence of the body posture of the person in a certain frame of image on the body posture of the person in other frames of images, i.e. the time-series dependence inherent to the action of a person. The above-described time dependence can be represented by a timing feature vector.
For example, if one frame of the image to be processed corresponds to the image at time t, the time sequence feature vector q_t^k of the kth person can be expressed as:
q_t^k = (1/T) · Σ_{t'=1,…,T} r(x_t^k, x_{t'}^k) · g(x_{t'}^k)
where T represents the total number of moments in the image to be processed, that is, the image to be processed includes T frame images, x_t^k represents the image feature of the kth person at time t, and x_{t'}^k represents the image feature of the kth person at time t'.
By calculating the similarity between image features of the same person at different times, the time dependence between the body poses of the same person at different times can be determined.
The time-space feature vector h_t^k of the kth person at time t in the image to be processed can be calculated from the spatial feature vector s_t^k and the time sequence feature vector q_t^k of the kth person at time t. The time-space feature vector h_t^k may be used to represent the "time-space" association information of the kth person, and can be expressed as the result of adding the time sequence feature vector q_t^k and the spatial feature vector s_t^k:
h_t^k = s_t^k + q_t^k
Fig. 13 is a schematic diagram of a method for calculating similarity between image features according to an embodiment of the present application. As shown in fig. 13, the vector of similarities between the image feature x_t^k of the kth person at time t and the image features of the other persons at time t, and the vector of similarities between x_t^k and the image features of the kth person at other times, are averaged (Avg) to determine the time-space feature vector h_t^k of the kth person at time t. The set of time-space feature vectors of the K persons in the T frame images can be represented as H = {h_t^k | t = 1, …, T; k = 1, …, K}.
S1103, fusing the image features and the time-space feature vectors to obtain the action features of each frame of image.
The image features X of the K persons in the images at the T moments and the set H of time-space feature vectors of the K persons at the T moments are fused to obtain the action feature of each of the images at the T moments. The action feature of each frame image can be represented by an action feature vector.
The image feature x_t^k of the kth person at time t and the time-space feature vector h_t^k can be fused to obtain the person feature vector b_t^k of the kth person at time t. Specifically, the image feature x_t^k and the time-space feature vector h_t^k are combined through a residual connection to obtain the person feature vector:
b_t^k = x_t^k + h_t^k
According to the person feature vector b_t^k of each of the K persons at time t, the set of person feature vectors of the K persons at time t can be expressed as B_t = {b_t^1, b_t^2, …, b_t^K}.
The set of person feature vectors B_t is max-pooled to obtain the action feature vector z_t; each bit of the action feature vector z_t is the maximum value of the corresponding bits of b_t^1, b_t^2, …, b_t^K.
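Purely as an illustrative sketch of the cross interaction and fusion described in steps S1102-S1103: the use of nn.Linear layers for the embedding functions θ, φ, g, the averaging normalization, and the feature dimension are assumptions, not details taken from the embodiment.

```python
import torch
import torch.nn as nn

class CrossInteraction(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.theta, self.phi, self.g = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, x):
        # x: image features of shape (T, K, D) - K persons in each of T frames
        T, K, D = x.shape
        r_spatial = torch.einsum('tkd,tmd->tkm', self.theta(x), self.phi(x))   # (T, K, K)
        s = torch.einsum('tkm,tmd->tkd', r_spatial, self.g(x)) / K             # spatial feature s_t^k
        r_temporal = torch.einsum('tkd,ukd->tku', self.theta(x), self.phi(x))  # (T, K, T)
        q = torch.einsum('tku,ukd->tkd', r_temporal, self.g(x)) / T            # time sequence feature q_t^k
        h = s + q                      # time-space feature h_t^k
        b = x + h                      # residual connection -> person feature b_t^k
        z = b.max(dim=1).values        # max-pool over persons -> frame action feature z_t, shape (T, D)
        return z
```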
S1104, classifying and predicting the action characteristics of each frame of image to determine the group actions of the images to be processed.
The classification module may be a softmax classifier. The classification result of the classification module may use one-bit valid (one-hot) coding, i.e. only one bit of the output result is valid. That is, the category corresponding to the classification result of any image feature vector is the only one of the output categories of the classification module.
The action feature vector z_t of the frame image at time t can be input into the classification module to obtain the classification result of that frame image. The classification result of z_t at any one of the T moments can be used as the classification result of the group action in the T frame images. The classification result of the group action in the T frame images may also be understood as the classification result of the group action of the persons in the T frame images, or the classification result of the T frame images.
The action feature vectors z_1, z_2, …, z_T of the T frame images can also be respectively input into the classification module to obtain the classification result of each frame image. The classification results of the T frame images may belong to one or more categories; the category that corresponds to the largest number of frame images among the output categories of the classification module is taken as the classification result of the group action in the T frame images.
Alternatively, the action feature vectors z_1, z_2, …, z_T of the T frame images can be averaged to obtain an average action feature vector, each bit of which is the average of the corresponding bits of z_1, z_2, …, z_T. The average action feature vector is input into the classification module to obtain the classification result of the group action in the T frame images.
The method can complete the complex reasoning process of group action recognition: extracting image characteristics of multi-frame images, determining time sequence characteristics and space characteristics according to interdependence relations of actions among different people and different moments of the same people in the images, fusing the time sequence characteristics, the space characteristics and the image characteristics to obtain action characteristics of each frame of images, and further deducing group actions of the multi-frame images by classifying the action characteristics of each frame of images.
In the embodiment of the application, when the group actions of the plurality of people are determined, not only the time sequence characteristics of the plurality of people are considered, but also the space characteristics of the plurality of people are considered, and the group actions of the plurality of people can be better and more accurately determined by integrating the time sequence characteristics and the space characteristics of the plurality of people.
In the embodiment of the application, when the group actions of a plurality of people are determined, only the spatial characteristics of the plurality of people are considered for identification so as to more conveniently determine the group actions of the plurality of people.
Experiments on popular benchmark datasets prove the effectiveness of the image recognition method provided by the embodiment of the application.
The trained neural network can accurately recognize group actions when used for image recognition. Table 1 shows the recognition accuracy obtained on public data sets using the trained neural network model and the image recognition method provided by the embodiment of the application. Data comprising group actions in the public data sets is input into the trained neural network. The multi-class accuracy (MCA) is the proportion of correct classification results among the classification results of the neural network on the data comprising group actions. The mean per-class accuracy (MPCA) is the average, over the classes, of the ratio of the number of correctly classified results of each class to the number of samples of that class.
TABLE 1
In the training process of the neural network, the training of the neural network can be completed without depending on character action labels.
In the training process, an end-to-end training mode is adopted, namely, the neural network is adjusted only according to the final classification result.
When the convolutional neural network AlexNet and the residual network ResNet-18 are adopted, trained with the neural network training method provided by the embodiment of the application, and used for group action recognition with the image recognition method provided by the embodiment of the application, the MCA and MPCA accuracy rates are higher and better results are achieved.
Feature interactions, i.e., determining the dependency relationship between people and the dependency relationship of the actions of the people in time. The similarity between the two image features is calculated through the function r (a, b), and the larger the calculation result of the function r (a, b), the stronger the dependency relationship of the body posture corresponding to the two image features.
And determining the spatial feature vector of each person in each frame of image through the similarity between the image features of the plurality of persons in each frame of image. The spatial feature vector of a person in a frame image is used to represent the other spatial dependencies of the person on the frame image, i.e. the dependency of the person's body posture on the body posture of other persons.
FIG. 14 is a schematic diagram of the spatial relationship of different persons' actions provided by an embodiment of the present application. For the frame image of a group action shown in fig. 14, the dependence of each person in the group action on the body postures of the other persons is represented by the spatial dependency matrix of fig. 15. Each bit in the spatial dependency matrix is represented by a square, and the shade of the square's color represents the similarity of the image features of two persons, i.e., the calculation result of the function r(a, b). The larger the calculation result of the function r(a, b), the darker the color of the square. The calculation result of the function r(a, b) may be normalized, i.e., mapped to between 0 and 1, before the spatial dependency matrix is rendered.
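A small sketch of how such a spatial dependency matrix could be computed and normalized to [0, 1] for rendering; the matrices theta and phi stand in for the linear embedding functions and are assumptions for illustration.

```python
import numpy as np

def spatial_dependency_matrix(features, theta, phi):
    # features: K x D array of image features of the K persons in one frame
    # theta, phi: D x D matrices standing in for the linear embedding functions
    r = (features @ theta.T) @ (features @ phi.T).T      # r[k, k'] = theta(x_k)^T phi(x_k')
    r_min, r_max = r.min(), r.max()
    return (r - r_min) / (r_max - r_min + 1e-8)          # normalize to [0, 1] for rendering
```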
Intuitively, player No. 10 in fig. 14 has a greater impact on her teammates' subsequent actions. Through the calculation of the function r(a, b), the tenth row and the tenth column of the spatial dependency matrix, which represent player No. 10, are darker; that is, player No. 10 is most relevant to the group action. Therefore, the function r(a, b) can reflect a high degree of correlation, i.e., a high degree of dependence, between the body posture of one person and those of other persons in one frame image. In fig. 14, the spatial dependency between the body postures of players No. 1-6 is weak, and the upper-left boxed region of the spatial dependency matrix, which represents the dependencies between the body postures of players No. 1-6, is correspondingly lighter in color. Therefore, the neural network provided by the embodiment of the application can better reflect the dependency or association relationship between the body posture of one person and the body postures of other persons in one frame image.
The time sequence feature vector of a person in a frame image is determined through the similarity between the image features of the person in the multi-frame image. The time sequence feature vector of one person in one frame image is used for representing the dependency relationship of the body posture of the person on the body postures of the persons in other frame images.
Fig. 16 shows the body postures of player No. 10 of fig. 14 in 10 frames of images arranged in chronological order; the corresponding time dependencies are represented by the time dependency matrix of fig. 17. Each bit in the time dependency matrix is represented by a square, and the shade of the square's color represents the similarity of the image features of the person at two moments, i.e., the calculation result of the function r(a, b).
The body postures of player No. 10 in the 10 frame images correspond to take-off (frames 1-3), being airborne (frames 4-8) and landing (frames 9-10). Intuitively, "take-off" and "landing" should be more discriminative. In the time dependency matrix shown in fig. 17, the similarity between the image features of player No. 10 in the 2nd and 10th frame images and those in the other images is relatively high. In the boxed region shown in fig. 17, the image features of the 4th to 8th frame images, i.e., player No. 10 in the airborne state, have a low similarity with those of the other images. Therefore, the neural network provided by the embodiment of the application can better reflect the temporal association relationship of the body postures of one person in multi-frame images.
Method embodiments of the present application are described above with reference to the accompanying drawings, and device embodiments of the present application are described below. It will be appreciated that the description of the method embodiments corresponds to the description of the apparatus embodiments, and that accordingly, non-described parts may be referred to the previous method embodiments.
Fig. 18 is a schematic diagram of a system architecture of an image recognition device according to an embodiment of the present application. The image recognition device shown in fig. 18 includes a feature extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804. The image recognition apparatus in fig. 18 can perform the image recognition method according to the embodiment of the present application, and a process of processing an input picture by the image recognition apparatus will be described below.
The feature extraction module 1801, which may also be referred to as a local feature extraction module (partial-body extractor module), is configured to extract image features of a person according to skeletal nodes of the person in the image. The functionality of the feature extraction module 1801 may be implemented using a convolutional network. The multi-frame image is input to a feature extraction module 1801. The image features of a person may be represented by vectors, and the vectors representing the image features of the person may be referred to as image feature vectors of the person.
The cross interaction module 1802 is configured to map image features of a plurality of people in each frame of image in the multi-frame image to time-space interaction features of each person. The time-space interaction feature is used to represent "time-space" associated information that identifies the persona. The time-space interaction characteristic of a person in a frame image can be obtained by fusing the time sequence characteristic and the space characteristic of the person in the frame image. The cross-interaction module 1802 may be implemented by a convolution layer and/or a full connection layer.
The feature fusion module 1803 is configured to fuse the motion feature of each person in a frame of image with the time-space interaction feature, so as to obtain an image feature vector of the frame of image. The image feature vector of the frame image may be used as a feature representation of the frame image.
The classification module 1804 is configured to classify according to the image feature vectors, thereby determining a classification of the group actions of the people in the T-frame image input to the feature extraction module 1801. The classification module 1804 may be a classifier.
The image recognition apparatus shown in fig. 18 may be used to perform the image recognition method shown in fig. 11.
Fig. 19 is a schematic structural diagram of an image recognition device according to an embodiment of the present application. The image recognition apparatus 3000 shown in fig. 19 includes an acquisition unit 3001 and a processing unit 3002.
An acquisition unit 3001 for acquiring an image to be processed;
The processing unit 3002 is configured to execute each image recognition method according to the embodiment of the present application.
Alternatively, the acquiring unit 3001 may be used to acquire an image to be processed; the processing unit 3002 may be configured to perform steps S901 to S904 or steps S1001 to S1004 described above to identify group actions of a plurality of persons in the image to be processed.
Alternatively, the acquiring unit 3001 may be used to acquire an image to be processed; the processing unit 3002 may be configured to perform the above steps S1101 to S1104 to identify the group actions of the person in the image to be processed.
The processing unit 3002 may be divided into a plurality of modules according to the processing functions.
For example, the processing unit 3002 may be divided into an extraction module 1801, a cross interaction module 1802, a feature fusion module 1803, and a classification module 1804 as shown in fig. 18. The processing unit 3002 can implement the functions of the respective modules shown in fig. 18, and further can be used to implement the image recognition method shown in fig. 11.
Fig. 20 is a schematic hardware configuration of an image recognition apparatus according to an embodiment of the present application. The image recognition apparatus 4000 shown in fig. 20 (the apparatus 4000 may be a computer device in particular) includes a memory 4001, a processor 4002, a communication interface 4003, and a bus 4004. The memory 4001, the processor 4002 and the communication interface 4003 are connected to each other by a bus 4004.
The memory 4001 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 4001 may store a program, and when the program stored in the memory 4001 is executed by the processor 4002, the processor 4002 is configured to perform the steps of the image recognition method of the embodiment of the present application.
The processor 4002 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits for executing related programs to implement the image recognition method of the present application.
The processor 4002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the image recognition method of the present application may be performed by integrated logic circuitry of hardware or instructions in software form in the processor 4002.
The processor 4002 may also be a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. The methods, steps, and logic blocks disclosed in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, electrically erasable programmable memory, registers, or another storage medium well known in the art. The storage medium is located in the memory 4001, and the processor 4002 reads the information in the memory 4001 and, in combination with its hardware, performs the functions to be executed by the units included in the image recognition apparatus, or performs the image recognition method of the method embodiment of the present application.
The communication interface 4003 enables communication between the apparatus 4000 and other devices or communication networks using a transceiving apparatus such as, but not limited to, a transceiver. For example, the image to be processed can be acquired through the communication interface 4003.
Bus 4004 may include a path for transferring information between various components of device 4000 (e.g., memory 4001, processor 4002, communication interface 4003).
Fig. 21 is a schematic hardware structure of a neural network training device according to an embodiment of the present application. Similar to the apparatus 4000 described above, the neural network training apparatus 5000 shown in fig. 21 includes a memory 5001, a processor 5002, a communication interface 5003, and a bus 5004. The memory 5001, the processor 5002, and the communication interface 5003 are communicatively connected to each other via a bus 5004.
The memory 5001 may be a ROM, a static storage device, and a RAM. The memory 5001 may store a program that, when executed by the processor 5002, the processor 5002 and the communication interface 5003 are operable to perform the respective steps of the neural network training method of embodiments of the present application.
The processor 5002 may employ a general-purpose CPU, microprocessor, ASIC, GPU, or one or more integrated circuits for performing the procedures required to implement the functions performed by the elements of the image processing apparatus of the present application or to perform the neural network training methods of the method embodiments of the present application.
The processor 5002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the neural network training method according to the embodiments of the present application may be implemented by hardware integrated logic circuits or software instructions in the processor 5002.
The processor 5002 may also be a general purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 5001, and the processor 5002 reads information in the memory 5001, and combines the hardware thereof to perform functions required to be performed by units included in the image processing apparatus of the embodiment of the present application, or to perform the training method of the neural network of the method embodiment of the present application.
The communication interface 5003 enables communication between the apparatus 5000 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver. For example, an image to be processed can be acquired through the communication interface 5003.
Bus 5004 may include a path for transferring information between various components of device 5000 (e.g., memory 5001, processor 5002, communications interface 5003).
It should be noted that although the above described apparatus 4000 and apparatus 5000 only show memory, processors, communication interfaces, in a specific implementation, it will be appreciated by those skilled in the art that the apparatus 4000 and apparatus 5000 may also include other devices necessary to achieve proper operation. Also, those skilled in the art will appreciate that the apparatus 4000 and the apparatus 5000 may also include hardware devices that implement other additional functions, as desired. Furthermore, it will be appreciated by those skilled in the art that the apparatus 4000 and the apparatus 5000 may also include only the devices necessary to implement the embodiments of the present application, and not all of the devices shown in fig. 20 and 21.
The embodiment of the application also provides an image recognition device, which comprises: at least one processor and a communication interface for information interaction of the image recognition device with other communication devices, which when executed in the at least one processor causes the image recognition device to perform the method above.
The embodiment of the application also provides a computer program storage medium, characterized in that the computer program storage medium has program instructions which, when executed directly or indirectly, enable the foregoing method to be implemented.
An embodiment of the present application also provides a chip system, where the chip system includes at least one processor, and where program instructions, when executed in the at least one processor, cause the method in the foregoing to be implemented.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (15)

1. An image recognition method, comprising:
Extracting image characteristics of an image to be processed, wherein the image to be processed comprises a plurality of characters, and the image characteristics of the image to be processed comprise the image characteristics of the characters in each frame of image in multi-frame images of the image to be processed;
Determining a time sequence feature of each person in the plurality of persons in each frame of the multi-frame images, wherein the time sequence feature of the j-th person among the plurality of persons in the i-th frame image in the image to be processed is determined according to the similarity between the image feature of the j-th person in the i-th frame image and the image features of the j-th person in frame images other than the i-th frame image among the multi-frame images, and i and j are positive integers;
determining a spatial feature of each person in the plurality of persons in each frame of the multi-frame image, wherein the spatial feature of a j-th person in the plurality of persons in an i-th frame image in the to-be-processed image is determined according to the similarity of the image feature of the j-th person in the i-th frame image and the image features of other persons in the plurality of persons except the j-th person in the i-th frame image;
Determining action characteristics of each person in the multiple images in each frame of the multiple images, wherein the action characteristics of the j person in the multiple images in the i frame of the multiple images are obtained by fusing spatial characteristics of the j person in the i frame of the multiple images, time sequence characteristics of the j person in the i frame of the multiple images and image characteristics of the j person in the i frame of the multiple images;
And identifying group actions of the plurality of people in the image to be processed according to action characteristics of the plurality of people in each frame of image in the multi-frame image.
2. The method according to claim 1, wherein the extracting image features of the image to be processed comprises:
determining an image area where a bone node of each of the plurality of people is located in each of the plurality of images;
And extracting the characteristics of the image area where the skeleton node of each person in the plurality of persons is located, so as to obtain the image characteristics of the image to be processed.
3. The method according to claim 2, wherein the feature extraction of the image area where the skeletal node of each of the plurality of people is located, to obtain the image feature of the image to be processed, includes:
Masking, in each of the plurality of frame images, an area outside an image area where the skeletal node of each of the plurality of persons is located, to obtain a locally visible image, the locally visible image being an image composed of an image area where the skeletal node of each of the plurality of persons is located;
and extracting the characteristics of the local visible image to obtain the image characteristics of the image to be processed.
4. A method according to any one of claims 1 to 3, wherein the identifying the group action of the plurality of persons in the image to be processed based on the action characteristics of the plurality of persons in each of the plurality of frames of images comprises:
Classifying action features of each person in the plurality of persons in each frame of image in the multi-frame image to obtain actions of each person in the plurality of persons;
and determining group actions of the plurality of people in the image to be processed according to the actions of each of the plurality of people.
5. The method according to any one of claims 1 to 4, further comprising:
generating tag information of the image to be processed, wherein the tag information is used for indicating group actions of the plurality of people in the image to be processed.
6. The method according to any one of claims 1 to 4, further comprising:
Determining the contribution degree of each person in the plurality of persons to the group actions of the plurality of persons according to the group actions of the plurality of persons in the image to be processed;
and determining a key person in the plurality of people according to the contribution degree of each person in the plurality of people to the group action of the plurality of people, wherein the contribution degree of the key person to the group action of the plurality of people is larger than the contribution degree of other people except for the key person in the plurality of people to the group action of the plurality of people.
7. An image recognition apparatus, comprising:
An acquisition unit configured to acquire an image to be processed;
a processing unit for:
Extracting image characteristics of an image to be processed, wherein the image to be processed comprises a plurality of characters, and the image characteristics of the image to be processed comprise the image characteristics of the characters in each frame of image in multi-frame images of the image to be processed;
Determining a time sequence feature of each person in the plurality of persons in each frame of the multi-frame images, wherein the time sequence feature of the j-th person among the plurality of persons in the i-th frame image in the image to be processed is determined according to the similarity between the image feature of the j-th person in the i-th frame image and the image features of the j-th person in frame images other than the i-th frame image among the multi-frame images, and i and j are positive integers;
determining a spatial feature of each person in the plurality of persons in each frame of the multi-frame image, wherein the spatial feature of a j-th person in the plurality of persons in an i-th frame image in the to-be-processed image is determined according to the similarity of the image feature of the j-th person in the i-th frame image and the image features of other persons in the plurality of persons except the j-th person in the i-th frame image;
Determining action characteristics of each person in the multiple images in each frame of the multiple images, wherein the action characteristics of the j person in the multiple images in the i frame of the multiple images are obtained by fusing spatial characteristics of the j person in the i frame of the multiple images, time sequence characteristics of the j person in the i frame of the multiple images and image characteristics of the j person in the i frame of the multiple images;
And identifying group actions of the plurality of people in the image to be processed according to action characteristics of the plurality of people in each frame of image in the multi-frame image.
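Claim 7 mirrors the method steps: per-person image features are compared across frames to obtain a time sequence feature, compared across persons within the same frame to obtain a spatial feature, and the two are fused with the image feature into an action feature. The NumPy sketch below illustrates that flow under assumed conventions only (dot-product similarity, softmax weighting, concatenation as the fusion); the array layout `feats[frame, person, dim]` and all function names are illustrative, not the patent's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_feature(feats, i, j):
    # Weighted sum of person j's features in the other frames; the weights are
    # the softmax of the similarity to person j's feature in frame i.
    other_frames = [t for t in range(feats.shape[0]) if t != i]
    sims = np.array([feats[i, j] @ feats[t, j] for t in other_frames])
    return softmax(sims) @ feats[other_frames, j]

def spatial_feature(feats, i, j):
    # Weighted sum of the other persons' features in frame i; the weights are
    # the softmax of the similarity to person j's feature in frame i.
    other_persons = [p for p in range(feats.shape[1]) if p != j]
    sims = np.array([feats[i, j] @ feats[i, p] for p in other_persons])
    return softmax(sims) @ feats[i, other_persons]

def action_features(feats):
    # Fuse (here: concatenate) the spatial, temporal and image features of
    # every person in every frame into one action feature per person per frame.
    T, P, D = feats.shape
    fused = np.zeros((T, P, 3 * D))
    for i in range(T):
        for j in range(P):
            fused[i, j] = np.concatenate([spatial_feature(feats, i, j),
                                          temporal_feature(feats, i, j),
                                          feats[i, j]])
    return fused

# Example: 8 frames, 5 persons, 64-dimensional per-person image features.
feats = np.random.rand(8, 5, 64)
print(action_features(feats).shape)   # (8, 5, 192)
```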
8. The apparatus of claim 7, wherein the processing unit is configured to:
determine an image area where a skeletal node of each of the plurality of persons is located in each frame of the multi-frame image; and
extract features of the image area where the skeletal node of each of the plurality of persons is located, to obtain the image features of the image to be processed.
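Claim 8 derives the image features from the image areas around each person's skeletal nodes. The rough sketch below assumes per-person keypoint coordinates are already available from some pose estimator (the patent does not name one); the fixed-size square regions and the mean-colour "feature" are placeholders for whatever region size and feature extractor an actual system would use.

```python
import numpy as np

def skeleton_node_regions(keypoints, margin, img_h, img_w):
    # One square region (x0, y0, x1, y1) per skeletal node, clipped to the image.
    boxes = []
    for x, y in keypoints:
        x0, y0 = max(0, int(x) - margin), max(0, int(y) - margin)
        x1, y1 = min(img_w, int(x) + margin), min(img_h, int(y) + margin)
        boxes.append((x0, y0, x1, y1))
    return boxes

def region_features(frame, boxes):
    # Placeholder descriptor: mean colour of each node region.
    # A real system would feed the crops to a CNN instead.
    return np.stack([frame[y0:y1, x0:x1].reshape(-1, frame.shape[-1]).mean(axis=0)
                     for x0, y0, x1, y1 in boxes])

# Example: a 720x1280 RGB frame and 17 keypoints for one person.
frame = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
keypoints = np.column_stack([np.random.uniform(0, 1280, 17),
                             np.random.uniform(0, 720, 17)])
boxes = skeleton_node_regions(keypoints, margin=8, img_h=720, img_w=1280)
print(region_features(frame, boxes).shape)   # (17, 3)
```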
9. The apparatus of claim 8, wherein the processing unit is configured to:
mask, in each frame of the multi-frame image, the area outside the image area where the skeletal node of each of the plurality of persons is located, to obtain a locally visible image, the locally visible image being an image composed of the image areas where the skeletal nodes of the plurality of persons are located; and
extract features of the locally visible image to obtain the image features of the image to be processed.
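Claim 9 builds a "locally visible image" by masking everything outside the skeletal-node regions before extracting features. A minimal sketch, reusing the hypothetical `skeleton_node_regions` boxes from the previous example; the zero fill for masked pixels is an assumption.

```python
import numpy as np

def locally_visible_image(frame, boxes):
    # Keep only the pixels inside the skeletal-node regions; zero out the rest.
    mask = np.zeros(frame.shape[:2], dtype=bool)
    for x0, y0, x1, y1 in boxes:
        mask[y0:y1, x0:x1] = True
    visible = np.zeros_like(frame)
    visible[mask] = frame[mask]
    return visible
```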
10. The apparatus according to any one of claims 7 to 9, wherein the processing unit is configured to:
classify the action feature of each of the plurality of persons in each frame of the multi-frame image to obtain the action of each of the plurality of persons; and
determine the group action of the plurality of persons in the image to be processed according to the action of each of the plurality of persons.
11. The apparatus of claim 10, wherein the processing unit is configured to:
generate tag information of the image to be processed, wherein the tag information is used to indicate the group action of the plurality of persons in the image to be processed.
12. The apparatus according to any one of claims 7 to 10, wherein the processing unit is configured to:
determine the contribution degree of each of the plurality of persons to the group action of the plurality of persons according to the group action of the plurality of persons in the image to be processed; and
determine a key person among the plurality of persons according to the contribution degree of each of the plurality of persons to the group action, wherein the contribution degree of the key person to the group action of the plurality of persons is greater than the contribution degree of each of the other persons of the plurality of persons.
13. An image recognition apparatus, comprising:
a memory for storing a program; and
a processor for executing the program stored in the memory, wherein when the program stored in the memory is executed, the processor is configured to perform the method of any one of claims 1 to 6.
14. A computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of any one of claims 1 to 6.
15. A chip, comprising a processor and a data interface, wherein the processor reads, via the data interface, instructions stored in a memory to perform the method of any one of claims 1 to 6.
CN201910980310.7A 2019-10-15 2019-10-15 Image recognition method, device, computer readable storage medium and chip Active CN112668366B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910980310.7A CN112668366B (en) 2019-10-15 2019-10-15 Image recognition method, device, computer readable storage medium and chip
PCT/CN2020/113788 WO2021073311A1 (en) 2019-10-15 2020-09-07 Image recognition method and apparatus, computer-readable storage medium and chip

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910980310.7A CN112668366B (en) 2019-10-15 2019-10-15 Image recognition method, device, computer readable storage medium and chip

Publications (2)

Publication Number Publication Date
CN112668366A CN112668366A (en) 2021-04-16
CN112668366B true CN112668366B (en) 2024-04-26

Family

ID=75400028

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910980310.7A Active CN112668366B (en) 2019-10-15 2019-10-15 Image recognition method, device, computer readable storage medium and chip

Country Status (2)

Country Link
CN (1) CN112668366B (en)
WO (1) WO2021073311A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111842B (en) * 2021-04-26 2023-06-27 浙江商汤科技开发有限公司 Action recognition method, device, equipment and computer readable storage medium
CN112969058B (en) * 2021-05-18 2021-08-03 南京拓晖信息技术有限公司 Industrial video real-time supervision platform and method with cloud storage function
CN113255518B (en) * 2021-05-25 2021-09-24 神威超算(北京)科技有限公司 Video abnormal event detection method and chip
CN113283381B (en) * 2021-06-15 2024-04-05 南京工业大学 Human body action detection method suitable for mobile robot platform
CN113344562B (en) * 2021-08-09 2021-11-02 四川大学 Method and device for detecting Etheng phishing accounts based on deep neural network
CN114494543A (en) * 2022-01-25 2022-05-13 上海商汤科技开发有限公司 Action generation method and related device, electronic equipment and storage medium

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4543675B2 (en) * 2003-12-22 2010-09-15 パナソニック電工株式会社 How to recognize characters and figures
US9721086B2 (en) * 2013-03-15 2017-08-01 Advanced Elemental Technologies, Inc. Methods and systems for secure and reliable identity-based computing
EP3566175A4 (en) * 2017-01-06 2020-07-29 Sportlogiq Inc. Systems and methods for behaviour understanding from trajectories
CN108229355B (en) * 2017-12-22 2021-03-23 北京市商汤科技开发有限公司 Behavior recognition method and apparatus, electronic device, computer storage medium
CN109101896B (en) * 2018-07-19 2022-03-25 电子科技大学 Video behavior identification method based on space-time fusion characteristics and attention mechanism
CN109299646B (en) * 2018-07-24 2021-06-25 北京旷视科技有限公司 Crowd abnormal event detection method, device, system and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019184823A1 (en) * 2018-03-26 2019-10-03 华为技术有限公司 Convolutional neural network model-based image processing method and device
CN108764019A (en) * 2018-04-03 2018-11-06 天津大学 A kind of Video Events detection method based on multi-source deep learning
CN109299657A (en) * 2018-08-14 2019-02-01 清华大学 Group behavior recognition methods and device based on semantic attention retention mechanism
CN109993707A (en) * 2019-03-01 2019-07-09 华为技术有限公司 Image de-noising method and device
CN110222717A (en) * 2019-05-09 2019-09-10 华为技术有限公司 Image processing method and device
CN110309856A (en) * 2019-05-30 2019-10-08 华为技术有限公司 Image classification method, the training method of neural network and device

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A survey of crowd behavior recognition methods based on image processing; Gao Xuan; Liu Yongkui; Wang Dafeng; Computer & Digital Engineering; 2016-08-20 (Issue 08); full text *
A survey of human action recognition based on depth images; Sun Bin; Kong Dehui; Zhang Wenhui; Jia Wenhao; Journal of Beijing University of Technology; 2018-05-29 (Issue 10); full text *
A human action recognition method based on hidden conditional random fields; Lu Kaining; Sun Qi; Liu Anan; Yang Zhaoxuan; Journal of Tianjin University (Science and Technology); 2013-10-15 (Issue 10); full text *
Person role recognition based on Bayesian causal networks in multi-camera surveillance; Ming Anlong; Ma Huadong; Fu Huiyuan; Chinese Journal of Computers; 2010-12-31; Vol. 33 (Issue 12); full text *
Action recognition using online random forest voting; Wang Shigang; Lu Fengjun; Zhao Wenting; Zhao Xiaolin; Lu Yang; Optics and Precision Engineering; 2016-12-31; Vol. 24 (Issue 8); full text *

Also Published As

Publication number Publication date
WO2021073311A1 (en) 2021-04-22
CN112668366A (en) 2021-04-16

Similar Documents

Publication Publication Date Title
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
CN110532871B (en) Image processing method and device
CN110188795B (en) Image classification method, data processing method and device
CN109902546B (en) Face recognition method, face recognition device and computer readable medium
CN109993707B (en) Image denoising method and device
US11042775B1 (en) Apparatus and methods for temporal proximity detection
CN111819568B (en) Face rotation image generation method and device
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN111291809B (en) Processing device, method and storage medium
CN110222717B (en) Image processing method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN112639828A (en) Data processing method, method and equipment for training neural network model
CN112446834A (en) Image enhancement method and device
CN113065645B (en) Twin attention network, image processing method and device
CN112529146B (en) Neural network model training method and device
CN110222718B (en) Image processing method and device
CN111797881B (en) Image classification method and device
CN113807183B (en) Model training method and related equipment
CN111310604A (en) Object detection method and device and storage medium
CN111695673B (en) Method for training neural network predictor, image processing method and device
CN113449573A (en) Dynamic gesture recognition method and device
CN113191489B (en) Training method of binary neural network model, image processing method and device
CN111797882A (en) Image classification method and device
Grigorev et al. Depth estimation from single monocular images using deep hybrid network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220210
Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province
Applicant after: Huawei Cloud Computing Technologies Co.,Ltd.
Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen
Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd.
GR01 Patent grant