WO2021135509A1 - Image processing method and apparatus, electronic device, and storage medium - Google Patents

Image processing method and apparatus, electronic device, and storage medium

Info

Publication number
WO2021135509A1
Authority
WO
WIPO (PCT)
Prior art keywords
face
expression
image
area
key points
Prior art date
Application number
PCT/CN2020/121349
Other languages
French (fr)
Chinese (zh)
Inventor
武文琦
叶泽雄
肖万鹏
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Co., Ltd.)
Publication of WO2021135509A1 publication Critical patent/WO2021135509A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174 Facial expression recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 Detection; Localisation; Normalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G06V40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships

Definitions

  • This application relates to the field of artificial intelligence, and in particular to an image processing method, device, electronic equipment, and storage medium.
  • This application proposes an image processing method, device, electronic equipment, and storage medium, which can improve the accuracy of facial expression recognition.
  • an image processing method which is executed by an electronic device, and the method includes:
  • an image processing device including:
  • the obtaining module is configured to obtain the face image to be processed
  • An extraction module configured to extract key points of the face image
  • a positioning module configured to locate an expression-sensitive area in the face image based on the key point, where the expression-sensitive area is a local area of the face with dense expression feature information
  • the recognition module is configured to perform facial expression recognition on the facial image based on the facial expression sensitive area.
  • an image processing electronic device including: a memory storing computer-readable instructions; a processor, reading the computer-readable instructions stored in the memory, to execute the image processing method.
  • a computer-readable storage medium is disclosed, and computer-readable instructions are stored thereon.
  • When the computer-readable instructions are executed by the processor of the computer, the computer is caused to execute the image processing method.
  • In the embodiments of the application, the face image to be processed is obtained, the face key points of the face image are extracted, the expression-sensitive area in the face image is then located based on the extracted face key points, and expression recognition is performed on the face image based on the expression-sensitive area.
  • The expression-sensitive area is a local area of the face with dense expression feature information, such as the eye area and the mouth area. Because the expression-sensitive area is specifically taken into account during expression recognition, the features required for expression recognition are expressed more comprehensively, thereby improving the accuracy of expression recognition.
  • FIG. 1A shows a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application.
  • Fig. 1B shows a flowchart of an image processing method according to an embodiment of the present application.
  • Fig. 2 shows a process of image processing using a pre-trained neural network according to an embodiment of the present application.
  • Fig. 3 shows the internal specific structure of the main network structure according to an embodiment of the present application.
  • Fig. 4 shows the internal specific structure of the ResBlock residual block according to an embodiment of the present application.
  • Fig. 5 shows the specific internal structure of the attention module according to an embodiment of the present application.
  • Fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present application.
  • Fig. 7 shows a hardware diagram of an image processing electronic device according to an embodiment of the present application.
  • the embodiments of the present application relate to the field of artificial intelligence, and specifically, mainly relate to computer vision technology and machine learning in the field of artificial intelligence.
  • Artificial Intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology.
  • Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and electromechanical integration.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Computer Vision (CV) is a science that studies how to make machines "see". More specifically, it refers to using cameras and computers instead of human eyes to identify, track, and measure targets, and to further process the captured images so that they become more suitable for human observation or for transmission to instruments for detection.
  • Computer vision studies related theories and technologies trying to establish an artificial intelligence system that can obtain information from images or multi-dimensional data.
  • Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, and simultaneous localization and mapping, and also includes common biometric recognition technologies such as face recognition and fingerprint recognition.
  • Machine Learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance.
  • Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence.
  • Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstrations.
  • the image processing method, device, electronic device, and storage medium of the embodiments of the present application may be used in an emotion analysis system and a human-computer interaction system, for example.
  • FIG. 1A shows a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application.
  • the execution subject of the image processing method in the embodiment of the present application may be any image processing terminal with sufficient computing capability.
  • the image processing terminal may be a cloud server 101, a local computer cluster 102, a personal computer terminal 103, a mobile terminal 104, or the aforementioned multiple cooperative terminals.
  • the face image processed by the image processing terminal may be obtained through the network 105 or obtained locally at the image processing terminal.
  • the face image can be a static image or a dynamic image in a video.
  • The embodiments of the present application may be executed by a neural network pre-trained in an image processing terminal. Specifically, after the pre-trained neural network in the image processing terminal obtains the face image to be processed, it uses the network parameters generated during pre-training to extract the face key points of the face image, locate the expression-sensitive area in the face image based on the face key points, and perform expression recognition based on the expression-sensitive area. It is understandable that a single pre-trained neural network may execute all the steps of the embodiments of this application to realize expression recognition, or multiple pre-trained neural networks may each execute part of the steps to realize expression recognition.
  • an image processing method includes:
  • Step 110 Obtain a face image to be processed
  • Step 120 Extract face key points of the face image
  • Step 130 Based on the key points of the face, locate an expression-sensitive area in the face image, where the expression-sensitive area is a local area of the face with dense expression feature information;
  • Step 140 Perform expression recognition on the face image based on the expression sensitive area.
  • In the embodiments of the application, the face image to be processed is obtained, the face key points of the face image are extracted, the expression-sensitive area in the face image is then located based on the extracted face key points, and expression recognition is performed on the face image based on the expression-sensitive area.
  • The expression-sensitive area is a local area of the face with dense expression feature information, such as the eye area and the mouth area. Because the expression-sensitive area is specifically introduced into expression recognition, the features required for expression recognition are expressed more comprehensively. In addition, compared with other areas of the face, the expression-sensitive area better reflects the differences between different expressions, so it is also an area with a high degree of expression discriminability. Introducing the expression-sensitive area therefore further improves the accuracy of expression recognition.
  • step 110 a face image to be processed is acquired.
  • acquiring the face image to be processed includes:
  • the input image is cropped based on the location of the face, and the face image to be processed in the input image is obtained.
  • A face detection algorithm may be used, for example, a face detection algorithm based on the binary wavelet transform, or a face detection algorithm based on facial binocular structure features.
  • After the image processing terminal obtains the input image to be processed, it processes the input image with the face detection algorithm and locates the position of the face in the input image (for example, the position of the rectangular area containing the face). The input image is then cropped so that the part outside the face position is removed, yielding the face image to be processed. If necessary, the cropped image can also be scaled so that the resulting face image is more convenient for subsequent processing, as illustrated in the sketch below.
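  • As a minimal sketch only (the patent does not prescribe a particular detector, library, or output size), the following Python snippet illustrates this detect-crop-scale step with OpenCV's Haar-cascade face detector; the detector choice and the 96 × 112 output size are assumptions, not values taken from the application.

```python
import cv2

def crop_face(input_path, out_size=(96, 112)):
    """Detect the largest face in an input image, crop it, and resize it.

    A minimal sketch: the Haar-cascade detector and the output size are
    illustrative assumptions, not the method mandated by the patent.
    """
    image = cv2.imread(input_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # no face found; the caller decides how to handle this
    # keep the largest detected rectangle (assumed to be the subject)
    x, y, w, h = max(faces, key=lambda box: box[2] * box[3])
    face = image[y:y + h, x:x + w]
    # scale so the cropped face is convenient for subsequent processing
    return cv2.resize(face, out_size)
```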
  • step 120 the face key points of the face image are extracted.
  • step 130 based on the key points of the human face, an expression-sensitive area in the human face image is located, and the expression-sensitive area is a local area of the human face with dense expression feature information.
  • After the image processing terminal extracts the face key points of the face image, it locates the expression-sensitive area in the face image based on those key points.
  • the expression-sensitive area is a preset partial area of the face, for example: an eye area including eyes and/or a mouth area including lips, or an area including eyes and eyebrows.
  • In one embodiment, the expression-sensitive area includes at least two face local areas, and locating the expression-sensitive area in the face image based on the face key points includes: locating, from the face key points, the area key points corresponding to the at least two face local areas; and respectively locating the at least two face local areas based on the area key points.
  • Regional key points refer to the key points of the face that compose the corresponding local area of the face.
  • For example, the face key points that compose the mouth area, that is, the mouth key points, are the left corner of the mouth, the right corner of the mouth, and the tip of the nose.
  • the expression sensitive area to be located includes at least two partial areas of the human face.
  • After the image processing terminal extracts the face key points, it locates the area key points corresponding to each of the at least two face local areas.
  • The area key points can be located based on statistics of face key points collected in advance. Under normal circumstances, the positions of the area key points within a face are relatively fixed, so statistical features of the face key points can be obtained (for example, the tip of the nose lies on the middle line of the face, the left and right corners of the mouth lie on either side of the nose tip, and the left mouth corner, the right mouth corner, and the nose tip together form an isosceles triangle).
  • The area key points can then be located among the extracted face key points on the basis of these statistical features.
  • the at least two partial areas of the human face include an eye area and a mouth area.
  • Locating the area key points corresponding to the at least two face local areas respectively includes: locating, from the face key points, the eye key points corresponding to the eye area and the mouth key points corresponding to the mouth area.
  • the expression sensitive area to be located includes the eye area and the mouth area.
  • After the image processing terminal extracts the face key points, it locates the eye key points corresponding to the eye area (for example, the outer corner of the left eye, the outer corner of the right eye, and the tip of the nose) and the mouth key points corresponding to the mouth area (for example, the left corner of the mouth, the right corner of the mouth, and the tip of the nose).
  • Based on the located key points, the eye area can be determined, for example, as a rectangle whose short side has a preset width, whose lower long side passes through the tip of the nose, and whose long side length equals the length of the line segment connecting the outer corner of the left eye and the outer corner of the right eye, with that connecting segment bisecting the rectangle; this rectangle is located as the eye area.
  • Similarly, the mouth area can be determined as a rectangle whose short side has a preset width, whose upper long side passes through the tip of the nose, and whose long side length equals the length of the line segment connecting the left and right corners of the mouth, with that connecting segment bisecting the rectangle; this rectangle is located as the mouth area (a simplified geometric sketch follows below).
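  • A simplified geometric sketch of this localization is shown below: each area is approximated by an axis-aligned rectangle whose long side spans the two relevant key points and which is bisected by the segment joining them. The landmark coordinates and the preset heights are illustrative assumptions, and the nose-tip constraint described above is omitted for brevity.

```python
import numpy as np

def keypoint_rect(p_left, p_right, height):
    """Axis-aligned rectangle built around the segment joining two key points.

    Sketch only: the long side spans the two key points and the segment bisects
    the rectangle; `height` (the preset short-side width) is an illustrative
    hyperparameter, not a value given by the patent.
    """
    p_left, p_right = np.asarray(p_left, float), np.asarray(p_right, float)
    x0, x1 = sorted([p_left[0], p_right[0]])
    y_mid = (p_left[1] + p_right[1]) / 2.0
    return (int(x0), int(y_mid - height / 2), int(x1), int(y_mid + height / 2))

# assumed 5-point layout: left/right outer eye corners, nose tip, mouth corners
landmarks = {
    "left_eye": (30, 45), "right_eye": (66, 45), "nose": (48, 62),
    "left_mouth": (36, 80), "right_mouth": (60, 80),
}
eye_area = keypoint_rect(landmarks["left_eye"], landmarks["right_eye"], height=24)
mouth_area = keypoint_rect(landmarks["left_mouth"], landmarks["right_mouth"], height=20)
```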
  • step 140 facial expression recognition is performed on the facial image based on the facial expression sensitive area.
  • In one embodiment, performing expression recognition on the face image based on the expression-sensitive area includes: extracting global features corresponding to the face image; extracting, from the expression-sensitive area, regional features corresponding to the expression-sensitive area; and, based on the global features and the regional features, performing expression recognition on the face image.
  • the image processing terminal combines the global feature corresponding to the face image and the regional feature corresponding to the expression sensitive area, and on this basis performs expression recognition on the face image.
  • the expression of the expression-sensitive area is more concentrated, that is, the expression-related features in the expression-sensitive area are more abundant.
  • the enhancement of the expression-related features in the expression-sensitive area is realized, thereby improving the ability to express features, and improving the accuracy of expression recognition on this basis.
  • In one embodiment, the expression-sensitive area includes at least two face local areas, and extracting the regional features corresponding to the expression-sensitive area from the expression-sensitive area includes: extracting, from the at least two face local areas, the regional features corresponding to each of the at least two face local areas.
  • Before performing expression recognition on the face image based on the global features and the regional features, the method further includes: splicing the regional features of the at least two face local areas to obtain a splicing feature, and fusing the splicing feature to obtain a fusion feature of the at least two face local areas.
  • Performing expression recognition on the face image based on the global feature and the regional feature includes: performing expression recognition on the face image based on the global feature and the fusion feature.
  • Global features refer to the characteristics of the overall face image, such as the overall texture feature of the face image and the distribution feature of the overall pixel gray value.
  • Regional features refer to the features that correspond to the local area of the face, such as the texture features of the eye area, the distribution feature of the pixel gray value of the eye area, and the contour feature of the eye.
  • In this embodiment, the expression-sensitive area to be located includes at least two face local areas, and after the image processing terminal locates the at least two face local areas, it extracts from them the regional features corresponding to each of the at least two face local areas.
  • For example, the expression-sensitive area to be located includes the eye area and the mouth area. After the eye area and the mouth area are located, the eye features are extracted from the eye area and the mouth features are extracted from the mouth area.
  • After the image processing terminal extracts the regional features corresponding to each of the at least two face local areas, it splices the at least two regional features to obtain the corresponding splicing feature. For example, after extracting the eye features corresponding to the eye area and the mouth features corresponding to the mouth area, the eye features and the mouth features are spliced to obtain the corresponding splicing feature.
  • Both the eye features and the mouth features can exist in the form of feature maps, that is, an eye feature map and a mouth feature map. Splicing the eye features and the mouth features means stacking the eye feature map and the mouth feature map at the same spatial positions, which is analogous to stacking the sheet of paper carrying the "eye feature map" on top of the sheet carrying the "mouth feature map".
  • After acquiring the splicing feature, the image processing terminal fuses the splicing feature to obtain the fusion feature of the at least two regional features.
  • the main purpose of fusing the spliced features to obtain the corresponding fused features is to reduce the dimensionality of the spliced features to facilitate subsequent processing. For example, after obtaining the splicing feature obtained by splicing the eye feature and the mouth feature, the splicing feature is merged, and the fusion feature of the eye feature and the mouth feature is obtained.
  • After the image processing terminal obtains the fusion feature of the at least two face local areas, it combines the fusion feature with the global features of the face image to perform expression recognition on the face image. For example, after acquiring Xiao Ming's face image, the terminal extracts the global features of Xiao Ming's face, Xiao Ming's eye features, and Xiao Ming's mouth features; it then splices the eye features and the mouth features to obtain the corresponding splicing feature, fuses the splicing feature to obtain the corresponding fusion feature, and finally combines the global features of Xiao Ming's face with the fusion feature to recognize the expression.
  • In one embodiment, writing the eye and mouth feature maps as y_eye and y_mouth, the splicing and fusion process can be expressed as y_cat = f_cat(y_eye, y_mouth) and y_conv = f * y_cat + b, where y_cat represents the feature map after feature stitching, f_cat represents the stitching operation, y_conv represents the feature map after fusion, f represents the filter used to reduce the dimensionality of the features and fuse them at the same spatial position, and b is the bias term.
  • For example, a convolution filter of size 1 × 1 × 2D × D_C can be used to reduce the dimensionality of the features and fuse them at the same spatial position, where D is the number of channels of each branch feature map and D_C represents the number of output channels (see the sketch below).
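  • A minimal PyTorch sketch of this stitching-and-fusion step is given below: the eye and mouth feature maps are stacked along the channel dimension, and a 1 × 1 convolution with 2D input channels and D_C output channels (plus a bias term) reduces the dimensionality and fuses the features at each spatial position. The channel counts and the class name StitchAndFuse are assumptions for illustration.

```python
import torch
import torch.nn as nn

class StitchAndFuse(nn.Module):
    """Stack two feature maps channel-wise, then fuse them with a 1x1 convolution."""

    def __init__(self, channels_per_branch=64, fused_channels=64):
        super().__init__()
        # 1x1 kernel over 2*D input channels producing D_C output channels,
        # i.e. the filter f plus the bias term b in the description above
        self.fuse = nn.Conv2d(2 * channels_per_branch, fused_channels,
                              kernel_size=1, bias=True)

    def forward(self, eye_feat, mouth_feat):
        y_cat = torch.cat([eye_feat, mouth_feat], dim=1)  # feature stitching
        y_conv = self.fuse(y_cat)                         # per-position fusion
        return y_conv

# usage: two D=64 channel maps of the same spatial size (values assumed)
eye = torch.randn(1, 64, 14, 12)
mouth = torch.randn(1, 64, 14, 12)
fused = StitchAndFuse()(eye, mouth)   # shape: (1, 64, 14, 12)
```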
  • the embodiments of the present application can be executed by a neural network pre-trained in an image processing terminal.
  • the pre-training of the neural network may be performed by the pre-training terminal.
  • the pre-training terminal and the image processing terminal may be the same terminal or different terminals. The following is a detailed description of the improvements made in the process of pre-training the neural network to improve the accuracy of feature expression.
  • the neural network used for image processing (especially for facial expression recognition) is pre-trained based on the center loss function L IC that introduces the distance between classes.
  • the loss function is a function that maps the value of a random event or its related random variable to a non-negative real number to express the "risk” or "loss" of the random event.
  • the loss function is usually associated with the optimization problem as a learning criterion, that is, solving and evaluating the model by minimizing the loss function.
  • The inter-class distance includes the distance between the first central expression, which corresponds to the current input feature, and the second central expressions, which are the central expressions of the other categories; the central loss function L_IC is expressed by a formula using the following notation:
  • x_i is the current input feature;
  • c_{y_i} is the first central expression;
  • c_k is the second central expression;
  • m is the number of training data items (training samples) contained in the training data set used when training the neural network;
  • the current input feature is one piece of training data in the training data set;
  • n is the number of expression categories;
  • and the remaining parameter is a preset balance factor.
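  • The formula itself appears only as a figure in the original filing and is not reproduced in this text. As an assumption, one form of L_IC consistent with the variables defined above (an intra-class term that pulls each input feature toward its own central expression, minus an inter-class term, weighted by a preset balance factor written here as β, that grows with the distances between different central expressions) is sketched below.

```latex
% Hedged reconstruction only: the published formula is given as an image in the
% filing, so this is one plausible form consistent with the textual description.
L_{IC} \;=\; \frac{1}{2}\sum_{i=1}^{m}\left\lVert x_i - c_{y_i}\right\rVert_2^2
\;-\;\frac{\beta}{2(n-1)}\sum_{i=1}^{m}\sum_{\substack{k=1\\ k\neq y_i}}^{n}
\left\lVert c_{y_i} - c_k\right\rVert_2^2
```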
  • In this embodiment, the central loss function is improved by introducing consideration of the inter-class distance, that is, consideration of the distance between the different central expressions.
  • In general, the center loss function learns a category center for each expression category and penalizes, through a penalty function, the distance between the current input feature and the category center corresponding to the current input feature, thus achieving the purpose of reducing the intra-class distance.
  • the central loss function learns a standard central expression for each type of expression, and penalizes the distance between the current input feature and the central expression corresponding to the current input feature through a penalty function , Thereby reducing the distance between expressions belonging to the same type of expression, making the same type of expression closer to the corresponding central expression.
  • the central loss function usually only considers the intra-class distances of different expression categories, while ignoring the inter-class distances between different classes. If the centers of the two categories are too close, it may cause the clustering of features to fail. In other words, in the case of expression recognition, the central loss function only considers the distance between the same type of expressions, but does not consider the distance between the central expressions.
  • For example, an expression that belongs to the same category as central expression A may be determined to be of the same category as central expression B with a probability of about 50%; if clustering fails in this way, confusion easily occurs.
  • the pre-training terminal introduces the consideration of the inter-class distance, which reduces the intra-class distance between the current input feature and the central expression while increasing the inter-class distance between different central expressions.
  • Here x_i, c_{y_i}, c_k, m, n, and the preset balance factor have the same meanings as defined above.
  • pre-training the neural network includes: performing joint supervised pre-training on the neural network based on a joint loss function L composed of a preset softmax loss function L S and the central loss function L IC.
  • the joint loss function L is expressed as the following formula:
  • L = L_S + λ · L_IC, where λ is a preset scale factor.
  • the pre-training terminal adopts a joint loss function L composed of a preset softmax loss function and a central loss function L IC that introduces the inter-class distance to perform joint supervised pre-training on the neural network.
  • In its expanded form, the joint loss function L can be expressed as the sum of two parts: the part in front of the plus sign is the softmax loss function L_S, and the part behind the plus sign is the central loss function L_IC that introduces the inter-class distance. In the softmax part, w represents the weights and b represents the bias term of the classification layer, and λ is the preset scale factor used to balance L_S and L_IC (a PyTorch sketch of this joint loss follows below).
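  • A minimal PyTorch sketch of this joint supervision is given below. The softmax part is standard cross-entropy; the inter-class penalty is an assumed form consistent with the description above (pull features toward their own center, push different centers apart), since the exact formula is provided only as a figure in the filing. The names JointExpressionLoss, lam, and beta are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointExpressionLoss(nn.Module):
    """L = L_S + lambda * L_IC (softmax loss plus center loss with an inter-class term).

    Sketch under assumptions: the inter-class penalty below is inferred from the
    textual description, not copied from the (image-only) formula in the filing.
    """

    def __init__(self, num_classes, feat_dim, lam=0.01, beta=0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.lam, self.beta = lam, beta

    def forward(self, features, logits, labels):
        softmax_loss = F.cross_entropy(logits, labels)        # L_S
        own_centers = self.centers[labels]                     # c_{y_i}
        intra = (features - own_centers).pow(2).sum(dim=1).mean()
        # mean squared distance between all pairs of class centers (inter-class term)
        inter = torch.cdist(self.centers, self.centers).pow(2).mean()
        center_loss = 0.5 * intra - 0.5 * self.beta * inter    # L_IC (assumed form)
        return softmax_loss + self.lam * center_loss
```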
  • the neural network trained by the image processing method of the embodiment of the present application can clearly distinguish the differences between different expressions, so that the division and recognition of expression features can be accurately performed.
  • It also performs better on recognizing similar facial expressions.
  • In one embodiment, pre-training the neural network includes: expanding the sample image set based on transformations of the sample images, and pre-training the neural network used for image processing based on the expanded sample image set.
  • When the pre-training terminal pre-trains the neural network, it needs to input the sample images into the neural network and then adjust the neural network parameters according to the feedback of the network.
  • If the number of sample images in the sample image set is insufficient, over-fitting is prone to occur, which can be regarded as the "knowledge" "learned" by the neural network being too narrow.
  • Therefore, the sample images are transformed, the sample image set is expanded, and the neural network is pre-trained based on the expanded sample image set to avoid over-fitting.
  • In one embodiment, expanding the sample image set based on transformations of the sample images includes: flipping a sample image to obtain a flipped image, and adding the flipped image to the sample image set to expand the sample image set.
  • the transformation performed on the sample image is to flip the sample image (for example, horizontal flip, vertical flip), and the corresponding flip image is obtained. If each sample image is flipped once, and the corresponding flipped image obtained is added to the sample image set, the sample image set is expanded to twice the original size.
  • In one embodiment, expanding the sample image set based on transformations of the sample images includes: rotating a sample image to obtain a rotated image, and adding the rotated image to the sample image set to expand the sample image set.
  • In this embodiment, the transformation performed on the sample image is to rotate the sample image to obtain the corresponding rotated image.
  • If each sample image is rotated once by a preset angle and the corresponding rotated image is added to the sample image set, the sample image set is likewise expanded to twice its original size (see the augmentation sketch below).
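  • A brief illustration of this expansion using torchvision's functional transforms is given below; the helper name expand_sample, the flip direction, and the 10-degree rotation angle are assumptions for illustration.

```python
from torchvision import transforms
from PIL import Image

def expand_sample(image: Image.Image, angle: float = 10.0):
    """Return the original sample plus a horizontally flipped and a rotated copy.

    Sketch only: each added copy grows the sample set as described above;
    the rotation angle is an illustrative assumption.
    """
    flipped = transforms.functional.hflip(image)           # horizontal flip
    rotated = transforms.functional.rotate(image, angle)   # preset-angle rotation
    return [image, flipped, rotated]
```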
  • In one embodiment, pre-training the neural network includes: performing face detection on the sample images to obtain the face images they contain, cropping and scaling the face images to a preset pixel size, and pre-training the neural network used for the image processing based on the resulting face image set.
  • Specifically, the pre-training terminal performs face detection on a sample image to obtain the face image in the sample image, performs preset cropping and scaling on the face image to obtain a face image of the preset pixel size, and then pre-trains the neural network based on the face image set composed of face images of that preset pixel size.
  • For example, the pre-training terminal performs face detection on the sample image and detects the face frame in which the face image is located; it crops off the part outside the face frame to obtain the face image, scales the face image to 122 × 96 pixels, and pre-trains the neural network based on the face image set composed of face images of 122 × 96 pixels.
  • the method includes: expanding the face image set based on the transformation of the cropped and scaled face images.
  • Pre-training the neural network used for image processing based on the face image set includes: pre-training the neural network based on the expanded face image set.
  • After the face images are cropped and scaled, the face image set can be further transformed and expanded on this basis; on top of reducing variations in face scale, this further prevents over-fitting.
  • the transformation performed on the face image can refer to the transformation performed on the sample image in the above-mentioned embodiment, so it will not be repeated here.
  • an expression sensitive area enhancement network ESAEnNet (Expression Sensitive Area Enhancement Network) is proposed to perform image processing on face images.
  • The network uses the Multitask Cascaded Convolutional Network (MTCNN) to extract the face key points in the face image (the outer corner of the left eye, the outer corner of the right eye, the tip of the nose, the left corner of the mouth, and the right corner of the mouth) in order to locate the eye area and the mouth area in the face image.
  • MTCNN Multitask Cascaded Convolutional Network
  • The main network structure of this network adopts HCNet64; refer to Fig. 3 for the specific internal structure of HCNet64.
  • The network processes the eye area and the mouth area through HCNet64 to extract the eye features of the eye area and the mouth features of the mouth area; it also processes the original face image through HCNet64 to extract the global features of the face image.
  • the extracted eye features and mouth features are stitched in the feature stitching layer, and then fine-tuned by the convolutional layer, pooling layer, and Fusion Dense Block before fusion.
  • The network structure of the Fusion Dense Block contains 6 layers, in order: a BN (Batch Normalization) layer, a ReLU (Rectified Linear Unit) activation, a convolutional layer with a 1 × 1 convolution kernel, a BN layer, a ReLU activation, and a convolutional layer with a 3 × 3 convolution kernel. Each convolutional layer contains 12 filters, and except for the global average pooling layer, all pooling layers use a 2 × 2 kernel with a stride of 2.
  • The network integrates the eye features, the mouth features, and the global features through a fully connected (FC) layer, and then performs expression recognition on this basis (a data-flow sketch follows below).
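  • The overall data flow just described can be sketched as a PyTorch forward pass, shown below. The backbone and fusion arguments are placeholders standing in for HCNet64 and for the convolution/pooling/Fusion Dense Block stage, and the channel and class counts are assumptions; the sketch only illustrates how the three branches are combined.

```python
import torch
import torch.nn as nn

class ESAEnNetSketch(nn.Module):
    """Data-flow sketch of ESAEnNet: a shared backbone applied to the whole face
    and to the two expression-sensitive crops, stitching plus fusion, then an FC
    classifier. The backbone/fusion placeholders are assumptions, not HCNet64."""

    def __init__(self, backbone, fusion, num_classes=7, feat_dim=64):
        super().__init__()
        self.backbone = backbone          # stand-in for HCNet64
        self.fusion = fusion              # stand-in for conv/pool + Fusion Dense Block
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2 * feat_dim, num_classes)  # FC layer

    def forward(self, face, eye_crop, mouth_crop):
        global_feat = self.backbone(face)
        eye_feat = self.backbone(eye_crop)
        mouth_feat = self.backbone(mouth_crop)
        stitched = torch.cat([eye_feat, mouth_feat], dim=1)   # feature stitching
        fused = self.fusion(stitched)                          # fine-tune + fuse
        # both pooled vectors are assumed to have feat_dim channels
        g = self.pool(global_feat).flatten(1)
        f = self.pool(fused).flatten(1)
        return self.classifier(torch.cat([g, f], dim=1))       # expression logits
```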
  • FIG. 3 shows the specific internal structure of HCNet64 in an embodiment of the present application.
  • HCNet64 is composed, in sequence, of 4 ResBlock residual blocks, 8 ResBlocks, and 4 ResBlocks.
  • FIG. 4 shows the specific internal structure of the ResBlock residual block in an embodiment of the present application.
  • Each ResBlock residual block is formed by connecting two convolutional layers with 1 × 1 convolution kernels and one convolutional layer with a 3 × 3 convolution kernel.
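  • Read literally, each residual block chains two 1 × 1 convolutions and one 3 × 3 convolution around an identity shortcut; a bottleneck-style PyTorch sketch consistent with that description follows, where the layer order, normalization, and channel sizes are assumptions.

```python
import torch.nn as nn

class ResBlock(nn.Module):
    """1x1 -> 3x3 -> 1x1 bottleneck with an identity shortcut (channel sizes assumed)."""

    def __init__(self, channels, mid_channels=None):
        super().__init__()
        mid = mid_channels or channels // 2
        self.body = nn.Sequential(
            nn.Conv2d(channels, mid, kernel_size=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, kernel_size=3, padding=1), nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, kernel_size=1), nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))  # residual connection
```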
  • FIG. 5 shows the specific internal structure of the convolutional attention module according to an embodiment of the present application.
  • The convolutional attention module is a CBAM (Convolutional Block Attention Module).
  • The input feature passes through the channel attention module and the spatial attention module in turn to extract the refined feature (Refined Feature).
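  • CBAM refines an input feature map by applying channel attention and then spatial attention; a compact sketch of that two-stage refinement is given below, with the reduction ratio and the 7 × 7 spatial kernel taken as commonly used defaults rather than values stated in the application.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CBAMSketch(nn.Module):
    """Channel attention followed by spatial attention, as in CBAM."""

    def __init__(self, channels, reduction=16, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(                       # shared MLP for channel attention
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
        )
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # channel attention: pooled descriptors -> shared MLP -> sigmoid gate
        avg = self.mlp(F.adaptive_avg_pool2d(x, 1))
        mx = self.mlp(F.adaptive_max_pool2d(x, 1))
        x = x * torch.sigmoid(avg + mx)
        # spatial attention: channel-wise mean/max maps -> conv -> sigmoid gate
        s = torch.cat([x.mean(dim=1, keepdim=True), x.max(dim=1, keepdim=True)[0]], dim=1)
        return x * torch.sigmoid(self.spatial(s))       # refined feature
```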
  • the following shows the experimental performance of the embodiments of the present application in practical applications.
  • The accuracy of expression recognition achieved by existing expression recognition methods, such as LBP-TOP, HOG 3D, MSR, STM-ExpLet, DTAGN-Joint, 3D-CNN, 3D-CNN-DAP, GCNet, and IDEnNet, is compared with the accuracy of expression recognition achieved by the embodiment of the present application.
  • the experimental performance of the embodiment of the present application is displayed based on the CK+ data set.
  • the CK+ data set is the most representative facial expression recognition data set, and it is also the most widely used data set today.
  • The CK+ dataset contains 593 video sequences of 123 subjects; the sample images in the CK+ dataset are labeled with 7 different expressions: contempt, anger, disgust, fear, happiness, sadness, and surprise.
  • Table 1 shows the expression recognition accuracy rates of ESAEnNet and other methods proposed in the embodiments of the present application on the CK+ data set.
  • Table 2 below shows the confusion matrix obtained by applying the embodiment of the present application on the CK+ data set.
  • The row headers of the confusion matrix represent the actually recognized expressions, the column headers represent the pre-labeled expressions, and the corresponding value indicates how many samples with a given pre-labeled expression were recognized as the corresponding actually recognized expression.
  • the experimental performance of the embodiment of the present application is displayed based on the MMI data set.
  • The MMI data set contains 312 video sequences of 30 subjects; the sample images in the MMI data set are labeled with 6 different expressions: anger, disgust, fear, happiness, sadness, and surprise.
  • Table 3 shows the expression recognition accuracy rates of the ESAEnNet and other methods proposed in the embodiments of the present application on the MMI data set.
  • Table 4 below shows the confusion matrix obtained by applying the embodiment of the present application on the MMI data set.
  • the experimental performance of the embodiment of the present application is displayed based on the VIS subset of the Oulu-CASIA data set.
  • The Oulu-CASIA dataset contains 480 video sequences of 80 subjects; the sample images in the Oulu-CASIA dataset are annotated with 6 different expressions: anger, disgust, fear, happiness, sadness, and surprise.
  • The VIS subset of the Oulu-CASIA data set refers to the video sequences captured by a visible-light (VIS) camera under strong illumination.
  • Table 5 shows the expression recognition accuracy of the ESAEnNet and other methods proposed in the embodiment of the application on the VIS subset of the Oulu-CASIA data set.
  • Table 6 below shows the confusion matrix obtained by applying the embodiment of the present application on the VIS subset of the Oulu-CASIA data set.
  • an image processing device including:
  • the obtaining module 210 is configured to obtain a face image to be processed
  • the extraction module 220 is configured to extract key points of the face image
  • the positioning module 230 is configured to locate an expression-sensitive area in the face image based on the key point, where the expression-sensitive area is a local area of the face with dense expression feature information;
  • the recognition module 240 is configured to perform facial expression recognition on the facial image based on the facial expression sensitive area.
  • the expression sensitive area includes at least two partial areas of a human face
  • the positioning module 230 is configured to: locate, from the face key points, the area key points corresponding to the at least two face local areas, and respectively locate the at least two face local areas based on the area key points.
  • the at least two human face partial areas include an eye area and a mouth area
  • the positioning module 230 is configured to: locate, from the face key points, the eye key points corresponding to the eye area and the mouth key points corresponding to the mouth area, and locate the eye area and the mouth area based on them.
  • the recognition module 240 is configured to: extract global features corresponding to the face image, extract regional features corresponding to the expression-sensitive area from the expression-sensitive area, and perform expression recognition on the face image based on the global features and the regional features.
  • In one embodiment, the expression-sensitive area includes at least two face local areas, and the recognition module 240 is configured to: extract the regional features corresponding to each of the at least two face local areas, splice the regional features to obtain a splicing feature, fuse the splicing feature to obtain a fusion feature of the at least two face local areas, and perform expression recognition on the face image based on the global features and the fusion feature.
  • In one embodiment, the device is configured to pre-train the neural network for the image processing based on the central loss function L_IC that introduces the inter-class distance, where the inter-class distance includes the distance between the first central expression corresponding to the current input feature and the second central expressions (the central expressions of the other categories), and the central loss function L_IC uses the following notation:
  • Here x_i, c_{y_i}, c_k, m, n, and the preset balance factor have the same meanings as defined above for the central loss function L_IC.
  • In one embodiment, the device is configured to perform joint supervised pre-training on the neural network based on a joint loss function L composed of a preset softmax loss function L_S and the central loss function L_IC, where the joint loss function L is expressed as L = L_S + λ · L_IC, with λ a preset scale factor.
  • In one embodiment, the device is configured to: expand the sample image set based on transformations of the sample images, and pre-train the neural network used for image processing based on the expanded sample image set.
  • In one embodiment, the device is configured to: flip a sample image to obtain a flipped image, and add the flipped image to the sample image set to expand the sample image set.
  • In one embodiment, the device is configured to: rotate a sample image to obtain a rotated image, and add the rotated image to the sample image set to expand the sample image set.
  • In one embodiment, the device is configured to: perform face detection on the sample images to obtain the face images they contain, crop and scale the face images to a preset pixel size, and pre-train the neural network used for the image processing based on the resulting face image set.
  • In one embodiment, the device is configured to: expand the face image set based on transformations of the cropped and scaled face images, and pre-train the neural network based on the expanded face image set.
  • the image processing electronic device 30 according to an embodiment of the present application will be described below with reference to FIG. 7.
  • the image processing electronic device 30 shown in FIG. 7 is only an example, and should not bring any limitation to the functions and scope of use of the embodiments of the present application.
  • the image processing electronic device 30 is represented in the form of a general-purpose computing device.
  • the components of the image processing electronic device 30 may include, but are not limited to: the aforementioned at least one processing unit 310, the aforementioned at least one storage unit 320, and a bus 330 connecting different system components (including the storage unit 320 and the processing unit 310).
  • The storage unit stores program code, and the program code can be executed by the processing unit 310, so that the processing unit 310 executes the steps of the various exemplary implementations described in the exemplary-method section of this specification. For example, the processing unit 310 may perform the steps shown in FIG. 1B.
  • the storage unit 320 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 3201 and/or a cache storage unit 3202, and may further include a read-only storage unit (ROM) 3203.
  • RAM random access storage unit
  • ROM read-only storage unit
  • the storage unit 320 may also include a program/utility tool 3204 having a set of (at least one) program modules 3205.
  • Such program modules 3205 include but are not limited to an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • The bus 330 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
  • The image processing electronic device 30 may also communicate with one or more external devices 400 (such as keyboards, pointing devices, or Bluetooth devices), with one or more devices that enable a user to interact with the image processing electronic device 30, and/or with any device (such as a router or modem) that enables the image processing electronic device 30 to communicate with one or more other computing devices. Such communication can be performed through an input/output (I/O) interface 350, which is connected to the display unit 340.
  • the image processing electronic device 30 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 360.
  • LAN local area network
  • WAN wide area network
  • public network such as the Internet
  • the network adapter 360 communicates with other modules of the image processing electronic device 30 through the bus 330.
  • other hardware and/or software modules can be used in conjunction with the image processing electronic device 30, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, Tape drives and data backup storage systems, etc.
  • The example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which can be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present application.
  • a computing device which can be a personal computer, a server, a terminal device, or a network device, etc.
  • A computer-readable storage medium is provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by the processor of the computer, the computer is caused to execute the method described in the above method embodiments.
  • A program product for implementing the method in the above method embodiments may adopt a portable compact disc read-only memory (CD-ROM), include program code, and run on terminal equipment such as a personal computer.
  • CD-ROM portable compact disk read-only memory
  • the program product of the present invention is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or combined with an instruction execution system, device, or device.
  • the program product can use any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, device, or component, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
  • the program code used to perform the operations of this application can be written in any combination of one or more programming languages.
  • The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • The remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, over the Internet through an Internet service provider).
  • LAN local area network
  • WAN wide area network
  • Although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory.
  • the features and functions of two or more modules or units described above may be embodied in one module or unit.
  • the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
  • The example embodiments described here can be implemented by software, or by software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and which includes several instructions to cause a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present application.
  • a non-volatile storage medium, which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.
  • a computing device, which can be a personal computer, a server, a mobile terminal, or a network device, etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

Provided are an image processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring a facial image to be processed (110); extracting key facial points of the facial image (120); on the basis of the key facial points, positioning a sensitive expression area in the facial image, wherein the sensitive expression area is a localized facial area with intensive expression feature information (130); and on the basis of the sensitive expression area, carrying out expression recognition on the facial image (140).

Description

Image processing method, device, electronic equipment and storage medium

This application claims priority to the Chinese patent application No. 201911398384.6, entitled "Image processing method, device, electronic equipment and storage medium" and filed on December 30, 2019.
Technical field

This application relates to the field of artificial intelligence, and in particular to an image processing method, device, electronic equipment, and storage medium.

Background

With the rapid development of Internet technology, many systems in the field of artificial intelligence need to perform facial expression recognition when processing images. The higher the accuracy of facial expression recognition, the more beneficial it is for subsequent processing, and the better the user experience.
Summary of the invention

This application proposes an image processing method, device, electronic equipment, and storage medium, which can improve the accuracy of facial expression recognition.

According to an aspect of the embodiments of the present application, an image processing method is disclosed, which is executed by an electronic device, and the method includes:

obtaining a face image to be processed;

extracting face key points of the face image;

locating an expression-sensitive area in the face image based on the face key points, the expression-sensitive area being a local area of the face with dense expression feature information;

performing expression recognition on the face image based on the expression-sensitive area.

According to an aspect of the embodiments of the present application, an image processing device is disclosed, the device including:

an obtaining module, configured to obtain a face image to be processed;

an extraction module, configured to extract key points of the face image;

a positioning module, configured to locate an expression-sensitive area in the face image based on the key points, the expression-sensitive area being a local area of the face with dense expression feature information;

a recognition module, configured to perform expression recognition on the face image based on the expression-sensitive area.

According to an aspect of the embodiments of the present application, an image processing electronic device is disclosed, including: a memory storing computer-readable instructions; and a processor that reads the computer-readable instructions stored in the memory to execute the image processing method.

According to an aspect of the embodiments of the present application, a computer-readable storage medium is disclosed, on which computer-readable instructions are stored; when the computer-readable instructions are executed by the processor of a computer, the computer is caused to execute the image processing method.

In the embodiments of the application, the face image to be processed is obtained, the face key points of the face image are extracted, the expression-sensitive area in the face image is then located based on the extracted face key points, and expression recognition is performed on the face image based on the expression-sensitive area. The expression-sensitive area is a local area of the face with dense expression feature information, such as the eye area and the mouth area. Because the expression-sensitive area is specifically taken into account during expression recognition, the features required for expression recognition are expressed more comprehensively, thereby improving the accuracy of expression recognition.

It should be understood that the above general description and the following detailed description are only exemplary and do not limit the application.
附图说明Description of the drawings
通过参考附图详细描述其示例实施例,本申请的上述和其它目标、特征及优点将变得更加显而易见。By describing its exemplary embodiments in detail with reference to the accompanying drawings, the above and other objectives, features, and advantages of the present application will become more apparent.
图1A示出了本申请实施例的图像处理方法实施环境示意图。FIG. 1A shows a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application.
图1B示出了根据本申请一个实施例的图像处理方法的流程图。Fig. 1B shows a flowchart of an image processing method according to an embodiment of the present application.
图2示出了根据本申请一个实施例的使用预训练的神经网络进行图像处理的过程。Fig. 2 shows a process of image processing using a pre-trained neural network according to an embodiment of the present application.
图3示出了根据本申请一个实施例的主网络结构的内部具体结构。Fig. 3 shows the internal specific structure of the main network structure according to an embodiment of the present application.
图4示出了根据本申请一个实施例的ResBlock残差块的内部具体结构。Fig. 4 shows the internal specific structure of the ResBlock residual block according to an embodiment of the present application.
图5示出了根据本申请一个实施例的注意力模块的内部具体结构。Fig. 5 shows the specific internal structure of the attention module according to an embodiment of the present application.
图6示出了根据本申请一个实施例的图像处理装置的方框图。Fig. 6 shows a block diagram of an image processing apparatus according to an embodiment of the present application.
图7示出了根据本申请一个实施例的图像处理电子设备的硬件图。Fig. 7 shows a hardware diagram of an image processing electronic device according to an embodiment of the present application.
具体实施方式Detailed ways
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些示例实施方式使得本申请的描述将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。附图仅为本申请的示意性图解,并非一定是按比例绘制。图中相同的附图标记表示相同或类似的部分,因而将省略对它们的重复描述。Example embodiments will now be described more fully with reference to the accompanying drawings. However, the example embodiments can be implemented in various forms, and should not be construed as being limited to the examples set forth herein; on the contrary, these example embodiments are provided so that the description of this application will be more comprehensive and complete, and the concept of the example embodiments Comprehensively communicate to those skilled in the art. The drawings are only schematic illustrations of the application and are not necessarily drawn to scale. The same reference numerals in the figures denote the same or similar parts, and thus their repeated description will be omitted.
此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多示例实施方式中。在下面的描述中,提供许多具体细节从而给出对本申请的示例实施方式的充分理解。然而,本领域技术人员将意识到,可以实践本申请的技术方案而省略所述特定细节中的一个或更多,或者可以采用其它的方法、组元、步骤等。在其它情况下,不详细示出或描述公知结构、方法、实现或者操作以避免喧宾夺主而使得本申请的各方面变得模糊。In addition, the described features, structures, or characteristics may be combined in one or more example embodiments in any suitable manner. In the following description, many specific details are provided to give a sufficient understanding of the exemplary embodiments of the present application. However, those skilled in the art will realize that the technical solutions of the present application can be practiced without one or more of the specific details, or other methods, components, steps, etc. can be used. In other cases, well-known structures, methods, implementations, or operations are not shown or described in detail to avoid overwhelming people and obscuring all aspects of the present application.
附图中所示的一些方框图是功能实体,不一定必须与物理或逻辑上独立的实体相对应。可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。Some of the block diagrams shown in the drawings are functional entities and do not necessarily correspond to physically or logically independent entities. These functional entities may be implemented in the form of software, or implemented in one or more hardware modules or integrated circuits, or implemented in different networks and/or processor devices and/or microcontroller devices.
本申请实施例涉及人工智能领域,具体地,主要涉及到人工智能领域中的计算机视觉技术、机器学习。The embodiments of the present application relate to the field of artificial intelligence, and specifically, mainly relate to computer vision technology and machine learning in the field of artificial intelligence.
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用***。换句话说,人工智能是计算机科学的一个综合技术,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式做出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science, which attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
人工智能技术是一门综合学科,涉及领域广泛,既有硬件层面的技术也有软件层面的技术。人工智能基础技术一般包括如传感器、专用人工智能芯片、云计算、分布式存储、大数据处理技术、操作/交互***、机电一 体化等技术。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。Artificial intelligence technology is a comprehensive discipline, covering a wide range of fields, including both hardware-level technology and software-level technology. Basic artificial intelligence technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, and electromechanical integration. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
计算机视觉技术(Computer Vision,CV)计算机视觉是一门研究如何使机器“看”的科学,更进一步的说,就是指用摄影机和电脑代替人眼对目标进行识别、跟踪和测量等机器视觉,并进一步做图形处理,使电脑处理成为更适合人眼观察或传送给仪器检测的图像。作为一个科学学科,计算机视觉研究相关的理论和技术,试图建立能够从图像或者多维数据中获取信息的人工智能***。计算机视觉技术通常包括图像处理、图像识别、图像语义理解、图像检索、OCR、视频处理、视频语义理解、视频内容/行为识别、三维物体重建、3D技术、虚拟现实、增强现实、同步定位与地图构建等技术,还包括常见的人脸识别、指纹识别等生物特征识别技术。Computer Vision (CV) Computer Vision is a science that studies how to make machines "see". Furthermore, it refers to the use of cameras and computers instead of human eyes to identify, track, and measure targets. And further graphics processing, so that computer processing becomes more suitable for human eyes to observe or send to the instrument to detect the image. As a scientific discipline, computer vision studies related theories and technologies, trying to establish an artificial intelligence system that can obtain information from images or multi-dimensional data. Computer vision technology usually includes image processing, image recognition, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, synchronous positioning and mapping Construction and other technologies also include common face recognition, fingerprint recognition and other biometric recognition technologies.
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。Machine Learning (ML) is a multi-field interdisciplinary subject, involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other subjects. Specializing in the study of how computers simulate or realize human learning behaviors in order to acquire new knowledge or skills, and reorganize the existing knowledge structure to continuously improve its own performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and style teaching learning.
本申请实施例的图像处理方法、装置、电子设备及存储介质例如可以用于情感分析***和人机交互***中。The image processing method, device, electronic device, and storage medium of the embodiments of the present application may be used in an emotion analysis system and a human-computer interaction system, for example.
图1A示出了本申请实施例的图像处理方法实施环境示意图。本申请实施例的图像处理方法的执行主体可以为任一具有足够运算能力的图像处理终端。如图1A所示,图像处理终端可以为云端服务器101、本地计算机集群102、个人电脑终端103、移动终端104或前述多个相互协作的终端。所述图像处理终端处理的人脸图像可以是通过网络105获得,或者是在图像处理终端本地获得。人脸图像可以是静态的图像,或者是视频中动态的图像。FIG. 1A shows a schematic diagram of an implementation environment of an image processing method according to an embodiment of the present application. The execution subject of the image processing method in the embodiment of the present application may be any image processing terminal with sufficient computing capability. As shown in FIG. 1A, the image processing terminal may be a cloud server 101, a local computer cluster 102, a personal computer terminal 103, a mobile terminal 104, or the aforementioned multiple cooperative terminals. The face image processed by the image processing terminal may be obtained through the network 105 or obtained locally at the image processing terminal. The face image can be a static image or a dynamic image in a video.
需要说明的是,本申请实施例可由图像处理终端中预训练的神经网络执行。具体地,图像处理终端中预训练的神经网络获取到待处理的人脸图 像后,基于预训练生成的神经网络参数——提取该人脸图像的人脸关键点;基于人脸关键点定位人脸图像中的表情敏感区域;基于表情敏感区域对人脸图像进行表情识别。可以理解,可以由单独一个预训练的神经网络执行本申请实施例的所有步骤,以实现表情识别;也可以由多个预训练的神经网络分别执行本申请实施例的部分步骤,以实现表情识别,例如:预训练三个神经网络——用于提取人脸关键点的神经网络1,用于定位表情敏感区域的神经网络2,用于进行表情识别的神经网络3。从而神经网络1执行获取人脸图像以及提取人脸关键点的步骤、神经网络2执行定位表情敏感区域的步骤、神经网络3执行基于表情敏感区域进行表情识别的步骤,从而实现表情识别。It should be noted that the embodiments of the present application may be executed by a neural network pre-trained in an image processing terminal. Specifically, after the pre-trained neural network in the image processing terminal obtains the face image to be processed, the neural network parameters generated based on the pre-training-extract the face key points of the face image; locate the person based on the face key points The expression-sensitive area in the face image; facial expression recognition is based on the expression-sensitive area. It is understandable that a single pre-trained neural network can execute all the steps of the embodiments of this application to realize expression recognition; or multiple pre-trained neural networks can execute part of the steps of the embodiments of this application to realize expression recognition. For example: pre-training three neural networks-neural network 1 for extracting key points of human faces, neural network 2 for locating sensitive areas of expression, neural network 3 for expression recognition. Therefore, the neural network 1 performs the steps of acquiring the face image and extracting the key points of the human face, the neural network 2 performs the step of locating the expression sensitive area, and the neural network 3 performs the step of expression recognition based on the expression sensitive area, thereby realizing expression recognition.
下面对本申请的具体实施过程进行详细描述。The specific implementation process of this application will be described in detail below.
参考图1B所示,一种图像处理方法,包括:Referring to FIG. 1B, an image processing method includes:
步骤110、获取待处理的人脸图像;Step 110: Obtain a face image to be processed;
步骤120、提取所述人脸图像的人脸关键点;Step 120: Extract face key points of the face image;
步骤130、基于所述人脸关键点,定位所述人脸图像中的表情敏感区域,所述表情敏感区域为表情特征信息密集的人脸局部区域;Step 130: Based on the key points of the face, locate an expression-sensitive area in the face image, where the expression-sensitive area is a local area of the face with dense expression feature information;
步骤140、基于所述表情敏感区域,对所述人脸图像进行表情识别。Step 140: Perform expression recognition on the face image based on the expression sensitive area.
本申请实施例获取到待处理的人脸图像，提取出该人脸图像的人脸关键点，进而基于提取出的人脸关键点定位该人脸图像中的表情敏感区域，基于表情敏感区域对该人脸图像进行表情识别。其中，表情敏感区域为表情特征信息密集的人脸局部区域，例如：眼部区域、嘴部区域。由于在进行表情识别时特别引入了对表情敏感区域的考虑，使得进行表情识别所需的特征表达更为全面，另外，表情敏感区域与人脸其他区域相比，更能体现不同表情之间的差异，因此也是具有高度表情差异性的敏感区域，通过引入对表情敏感区域的考虑，也提高了表情识别的精准度。In the embodiments of the present application, a face image to be processed is obtained, face key points of the face image are extracted, an expression-sensitive area in the face image is then located based on the extracted face key points, and expression recognition is performed on the face image based on the expression-sensitive area. The expression-sensitive area is a local area of the face in which expression feature information is dense, for example, the eye area and the mouth area. Because the expression-sensitive area is specifically taken into account during expression recognition, the feature expression required for expression recognition is more comprehensive. In addition, compared with other areas of the face, the expression-sensitive area better reflects the differences between different expressions and is therefore also a sensitive area with a high degree of expression variability; taking it into account likewise improves the accuracy of expression recognition.
在步骤110中,获取待处理的人脸图像。In step 110, a face image to be processed is acquired.
在一实施例中,获取待处理的人脸图像,包括:In an embodiment, acquiring the face image to be processed includes:
获取待处理的输入图像;Obtain the input image to be processed;
基于预设的人脸检测算法对该输入图像进行处理,定位该输入图像中人脸所在位置;Process the input image based on a preset face detection algorithm, and locate the position of the face in the input image;
基于该人脸所在位置对该输入图像进行裁剪,获取该输入图像中待处理的人脸图像。The input image is cropped based on the location of the face, and the face image to be processed in the input image is obtained.
本申请实施例中,图像处理终端中预设有人脸检测算法(例如:基于二进小波变换的人脸检测算法、基于面部双眼结构特征的人脸检测算法等),以进行人脸检测。图像处理终端获取到待处理的输入图像后,基于该人脸检测算法对输入图像进行处理,于该输入图像中定位出人脸所在位置(例如:输入图像中,包含有人脸的矩形区域的位置)。进而对该输入图像进行裁剪——将该人脸所在位置之外的部分裁去,得到待处理的人脸图像。必要的话,还可以对裁剪后的输入图像进行缩放,以使得得到的待处理的人脸图像更便于后续的图像处理。In the embodiment of the present application, a face detection algorithm (for example, a face detection algorithm based on binary wavelet transform, a face detection algorithm based on facial binocular structure features, etc.) is preset in the image processing terminal to perform face detection. After the image processing terminal obtains the input image to be processed, it processes the input image based on the face detection algorithm, and locates the position of the human face in the input image (for example, the position of the rectangular area containing the human face in the input image ). Then the input image is cropped-the part outside the position of the face is cropped to obtain the face image to be processed. If necessary, the cropped input image can also be scaled, so that the obtained face image to be processed is more convenient for subsequent image processing.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
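As a rough illustration of this detect-crop-scale step, the sketch below uses an OpenCV Haar-cascade detector and a fixed output size; both are assumptions made for the example only, since the embodiment does not prescribe a particular face detection algorithm or image size.

```python
import cv2

# A minimal sketch of the detect-crop-scale step described above. The Haar
# cascade detector and the fixed target size are illustrative assumptions.
def crop_face(input_path, target_size=(96, 122)):
    image = cv2.imread(input_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None                      # no face found in the input image
    x, y, w, h = faces[0]                # rectangular area containing the face
    face = image[y:y + h, x:x + w]       # crop away everything outside it
    return cv2.resize(face, target_size) # scale for subsequent processing
```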
在步骤120中,提取该人脸图像的人脸关键点。In step 120, the face key points of the face image are extracted.
在步骤130中,基于人脸关键点,定位该人脸图像中的表情敏感区域,该表情敏感区域为表情特征信息密集的人脸局部区域。In step 130, based on the key points of the human face, an expression-sensitive area in the human face image is located, and the expression-sensitive area is a local area of the human face with dense expression feature information.
本申请实施例中,图像处理终端提取到人脸图像的人脸关键点后,基于人脸关键点,定位该人脸图像中的表情敏感区域。In the embodiment of the present application, after the image processing terminal extracts the face key points of the face image, based on the face key points, it locates the expression sensitive area in the face image.
在一实施例中,表情敏感区域为预设的一个人脸局部区域,例如:包含双眼的眼部区域和/或者包含嘴唇的嘴部区域、或者包含双眼以及双眉的区域。In one embodiment, the expression-sensitive area is a preset partial area of the face, for example: an eye area including eyes and/or a mouth area including lips, or an area including eyes and eyebrows.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
在一实施例中,该表情敏感区域包括至少两个人脸局部区域,基于该人脸关键点,定位该人脸图像中的表情敏感区域,包括:In an embodiment, the expression-sensitive area includes at least two partial areas of a human face, and locating the expression-sensitive area in the face image based on the key points of the human face includes:
从该人脸关键点中,定位该至少两个人脸局部区域分别对应的区域关键点;From the key points of the face, locate the area key points corresponding to the at least two face local areas respectively;
基于该区域关键点,分别定位该至少两个人脸局部区域。Based on the key points of the area, the at least two face local areas are respectively located.
区域关键点指的是组成对应人脸局部区域的人脸关键点,例如:组成嘴部区域的人脸关键点,即,嘴部关键点为——左嘴角、右嘴角、鼻尖。Regional key points refer to the key points of the face that compose the corresponding local area of the face. For example, the key points of the face that compose the mouth area, that is, the key points of the mouth are-left corner of the mouth, right corner of the mouth, and tip of the nose.
该实施例中,待定位的表情敏感区域包括至少两个人脸局部区域。图 像处理终端提取到人脸关键点后,定位该至少两个人脸局部区域分别对应的区域关键点。其中,区域关键点的定位可以基于预先对人脸关键点的统计进行。具体地,可以理解,正常情况下,区域关键点在人脸中的位置是较为固定的,对人脸关键点进行统计,可以得到区域关键点的统计特征(例如:鼻尖处于人脸的中间线上;左嘴角与右嘴角分别位于鼻尖的两侧,且左嘴角、右嘴角、鼻尖相互连线,可以组成一等腰三角形)。通过预先对人脸关键点的统计,即可在得到的统计特征的基础上,对提取到的人脸关键点进行定位。In this embodiment, the expression sensitive area to be located includes at least two partial areas of the human face. After the image processing terminal extracts the key points of the human face, it locates the area key points corresponding to the at least two local areas of the human face respectively. Among them, the location of the key points of the area can be performed based on the statistics of the key points of the human face in advance. Specifically, it can be understood that under normal circumstances, the position of the key points of the region in the face is relatively fixed, and the statistical features of the key points of the face can be obtained (for example, the tip of the nose is at the middle line of the face). Top; the left and right corners of the mouth are located on both sides of the tip of the nose, and the left corner, the right corner of the mouth, and the tip of the nose are connected to each other to form an isosceles triangle. By pre-calculating the key points of the face, the extracted key points of the face can be located on the basis of the obtained statistical features.
定位到人脸局部区域分别对应的区域关键点后,即可根据对应于同一人脸局部区域的各区域关键点的坐标位置,确定各区域关键点所围成的区域,从而定位到该人脸局部区域。After locating the area key points corresponding to the local areas of the face, you can determine the area enclosed by the key points of each area according to the coordinate positions of the key points of the areas corresponding to the same local area of the face, so as to locate the face Partial area.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
在一实施例中,至少两个人脸局部区域包括眼部区域、嘴部区域。In an embodiment, the at least two partial areas of the human face include an eye area and a mouth area.
从该人脸关键点中,定位该至少两个人脸局部区域分别对应的区域关键点,包括:从该人脸关键点中,定位该眼部区域对应的眼部关键点、该嘴部区域对应的嘴部关键点。From the face key points, locating the area key points corresponding to the at least two face local areas respectively includes: locating the eye key points corresponding to the eye area and the mouth area from the face key points The key point of the mouth.
基于该区域关键点,分别定位该至少两个人脸局部区域,包括:Based on the key points of the area, respectively locating the at least two face local areas, including:
基于该眼部关键点,定位该眼部区域;Based on the key points of the eye, locate the eye area;
基于该嘴部关键点,定位该嘴部区域。Based on the key points of the mouth, locate the mouth area.
该实施例中,待定位的表情敏感区域包括眼部区域、嘴部区域。图像处理终端提取到人脸关键点后,从中定位到眼部区域对应的眼部关键点(例如:左眼外眼角、右眼外眼角、鼻尖)、嘴部区域对应的嘴部关键点(例如:左嘴角、右嘴角、鼻尖)。进而基于定位到的眼部关键点,定位该眼部区域(例如:将预设宽边宽度、下长边经过鼻尖、以左眼外眼角与右眼外眼角的连接线段长度为长边长度、以该连接线段为平分对折线的矩形区域,定位为该眼部区域);基于定位到的嘴部关键点,定位该嘴部区域(例如:将预设宽边宽度、上长边经过鼻尖、以左嘴角与右嘴角的连接线段长度为长边长度、以该连接线段为平分对折线的矩形区域,定位为该嘴部区域)。In this embodiment, the expression sensitive area to be located includes the eye area and the mouth area. After the image processing terminal extracts the key points of the face, it locates the key points of the eyes corresponding to the eye area (for example: the outer corner of the left eye, the outer corner of the right eye, the tip of the nose), and the key point of the mouth corresponding to the mouth area (for example, : Left corner of mouth, right corner of mouth, tip of nose). Then, based on the key points of the eye, locate the eye area (for example: preset the width of the wide side, the lower long side through the tip of the nose, and the length of the connecting line segment between the outer corner of the left eye and the outer corner of the right eye is the long side length, Take the connecting line segment as the rectangular area that bisects the fold line, and locate it as the eye area; based on the key points of the mouth, locate the mouth area (for example: preset the width of the wide side, the upper long side through the tip of the nose, The length of the connecting line segment between the left and right corners of the mouth is taken as the length of the long side, and the connecting line segment is taken as the rectangular area bisecting the fold line, and it is positioned as the mouth area).
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
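The following sketch shows one way to turn the extracted key points into rectangular eye and mouth regions. The fixed strip heights, the "connecting segment bisects the rectangle" simplification, and the omission of the nose-tip constraint are all assumptions for illustration; the embodiment only requires a preset short-side width.

```python
import numpy as np

# A simplified sketch of locating the eye and mouth regions from five face
# key points (outer eye corners, nose tip, mouth corners). Strip heights are
# illustrative assumptions.
def locate_regions(landmarks, eye_h=40, mouth_h=40):
    le = np.array(landmarks["left_eye"])     # outer corner of the left eye
    re = np.array(landmarks["right_eye"])    # outer corner of the right eye
    lm = np.array(landmarks["left_mouth"])   # left mouth corner
    rm = np.array(landmarks["right_mouth"])  # right mouth corner

    def strip(p1, p2, height):
        # rectangle whose long side follows the segment p1-p2 and which is
        # vertically centered on that segment (the segment bisects it)
        x1, x2 = sorted([p1[0], p2[0]])
        yc = (p1[1] + p2[1]) / 2.0
        return int(x1), int(yc - height / 2), int(x2), int(yc + height / 2)

    eye_box = strip(le, re, eye_h)       # eye region from the two eye corners
    mouth_box = strip(lm, rm, mouth_h)   # mouth region from the two mouth corners
    return eye_box, mouth_box
```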
在步骤140中,基于该表情敏感区域,对该人脸图像进行表情识别。In step 140, facial expression recognition is performed on the facial image based on the facial expression sensitive area.
在一实施例中,基于该表情敏感区域,对该人脸图像进行表情识别,包括:In an embodiment, performing expression recognition on the face image based on the expression sensitive area includes:
提取该人脸图像对应的全局特征;Extract the global features corresponding to the face image;
从该表情敏感区域中提取该表情敏感区域对应的区域特征;Extracting regional features corresponding to the expression-sensitive area from the expression-sensitive area;
基于该全局特征以及该区域特征,对该人脸图像进行表情识别。Based on the global feature and the regional feature, facial expression recognition is performed on the face image.
该实施例中,图像处理终端将人脸图像对应的全局特征以及表情敏感区域对应的区域特征相结合,在此基础上对人脸图像进行表情识别。由于相比起人脸图像中的其他区域,表情敏感区域的表情表达更为集中,也就是说,表情敏感区域中表情相关特征更为丰富。通过单独提取出表情敏感区域对应的区域特征,实现了对表情敏感区域中的表情相关特征的增强,从而提高了对特征的表达能力,使得在此基础上进行的表情识别的精准度得到提高。In this embodiment, the image processing terminal combines the global feature corresponding to the face image and the regional feature corresponding to the expression sensitive area, and on this basis performs expression recognition on the face image. Compared with other areas in the face image, the expression of the expression-sensitive area is more concentrated, that is, the expression-related features in the expression-sensitive area are more abundant. By separately extracting the regional features corresponding to the expression-sensitive area, the enhancement of the expression-related features in the expression-sensitive area is realized, thereby improving the ability to express features, and improving the accuracy of expression recognition on this basis.
在一实施例中,该表情敏感区域包括至少两个人脸局部区域,从该表情敏感区域中提取该表情敏感区域对应的区域特征,包括:从该至少两个人脸局部区域中分别提取该至少两个人脸局部区域分别对应的区域特征。In one embodiment, the expression-sensitive area includes at least two partial areas of a human face, and extracting the regional features corresponding to the expression-sensitive area from the expression-sensitive area includes: extracting the at least two facial partial areas from the at least two facial partial areas. The regional features corresponding to the local areas of the face.
在基于该全局特征以及该区域特征,对该人脸图像进行表情识别之前,还包括:Before performing expression recognition on the face image based on the global feature and the regional feature, the method further includes:
对该至少两个人脸局部区域分别对应的区域特征进行拼接,获取该至少两个人脸局部区域的拼接特征;Stitching the regional features corresponding to the at least two face local areas respectively, to obtain the stitching features of the at least two face local areas;
对该拼接特征进行融合,获取该至少两个人脸局部区域的融合特征。The splicing feature is fused to obtain the fusion feature of the at least two face local areas.
基于该全局特征以及该区域特征,对该人脸图像进行表情识别,包括:基于该全局特征以及该融合特征,对该人脸图像进行表情识别。Performing expression recognition on the face image based on the global feature and the regional feature includes: performing expression recognition on the face image based on the global feature and the fusion feature.
全局特征指的是人脸图像整体所表现的特征,例如:人脸图像整体的纹理特征、整体像素灰度值的分布特征。Global features refer to the characteristics of the overall face image, such as the overall texture feature of the face image and the distribution feature of the overall pixel gray value.
区域特征指的是对应人脸局部区域所表现的特征,例如:眼部区域的纹理特征、眼部区域像素灰度值的分布特征、眼部的轮廓特征。Regional features refer to the features that correspond to the local area of the face, such as the texture features of the eye area, the distribution feature of the pixel gray value of the eye area, and the contour feature of the eye.
该实施例中，待定位的表情敏感区域包括至少两个人脸局部区域，图像处理终端分别定位到该至少两个人脸局部区域后，从该至少两个人脸局部区域中分别提取到该至少两个人脸局部区域分别对应的区域特征。例如：待定位的表情敏感区域包括眼部区域、嘴部区域，分别定位到该眼部区域、该嘴部区域后，从该眼部区域中提取到眼部特征，从该嘴部区域中提取到嘴部特征。In this embodiment, the expression-sensitive area to be located includes at least two face local areas. After locating the at least two face local areas, the image processing terminal extracts, from each of them, the regional features corresponding to that area. For example, if the expression-sensitive area to be located includes the eye area and the mouth area, then after the eye area and the mouth area are located, eye features are extracted from the eye area and mouth features are extracted from the mouth area.
图像处理终端提取到该至少两个人脸局部区域分别对应的区域特征后,对这至少两个区域特征进行拼接,获取到对应的拼接特征。例如:提取到眼部区域对应的眼部特征、嘴部区域对应的嘴部特征后,将该眼部特征与该嘴部特征进行拼接,得到对应的拼接特征。具体地,在神经网络的处理中,眼部特征、嘴部特征均可以以特征图(feature map)的形式存在——眼部特征图、嘴部特征图。对眼部特征与嘴部特征进行拼接,即为对该眼部特征图与该嘴部特征图在相同的空间位置进行堆放,相当于将“眼部特征图”这张纸与“嘴部特征图”这张纸进行叠放。After the image processing terminal extracts the respective regional features corresponding to the at least two human face partial regions, it stitches the at least two regional features to obtain the corresponding stitched features. For example, after extracting the eye feature corresponding to the eye area and the mouth feature corresponding to the mouth area, the eye feature and the mouth feature are spliced to obtain the corresponding splicing feature. Specifically, in the processing of the neural network, both eye features and mouth features can exist in the form of feature maps-eye feature maps and mouth feature maps. The splicing of the eye feature and the mouth feature is to stack the eye feature map and the mouth feature map in the same spatial position, which is equivalent to combining the "eye feature map" with the "mouth feature map" Picture" this piece of paper is stacked.
图像处理终端获取到拼接特征后,对该拼接特征进行融合,获取到这至少两个区域特征的融合特征。其中,对拼接特征进行融合得到对应的融合特征的主要目的在于对拼接特征进行降维,以方便后续的处理。例如:获取到对眼部特征与嘴部特征进行拼接得到的拼接特征后,对该拼接特征进行融合,获取到该眼部特征与该嘴部特征的融合特征。具体地,在神经网络的处理中,若将对眼部特征与嘴部特征进行拼接看作将“眼部特征图”这张纸与“嘴部特征图”这张纸进行叠放,则可以将融合过程看作将叠放着的“眼部特征图”这张纸与“嘴部特征图”这张纸融为同一张纸。After acquiring the splicing feature, the image processing terminal fuses the splicing feature to obtain the fusion feature of the at least two regional features. Among them, the main purpose of fusing the spliced features to obtain the corresponding fused features is to reduce the dimensionality of the spliced features to facilitate subsequent processing. For example, after obtaining the splicing feature obtained by splicing the eye feature and the mouth feature, the splicing feature is merged, and the fusion feature of the eye feature and the mouth feature is obtained. Specifically, in the processing of the neural network, if the splicing of eye features and mouth features is regarded as stacking the paper "eye feature map" and the paper "mouth feature map", you can Think of the fusion process as fusing the stacked paper "eye feature map" and "mouth feature map" into the same paper.
图像处理终端获取到这至少两个人脸局部区域的融合特征后,与人脸图像的全局特征进行结合,进而对人脸图像进行表情识别。例如:获取到小明的人脸图像后——提取出小明人脸的全局特征;提取出小明的眼部特征;提取出小明的嘴部特征。进而将小明的眼部特征与嘴部特征进行拼接,得到对应的拼接特征;对该拼接特征进行融合,得到对应的融合特征;再结合小明人脸的全局特征与该融合特征,对小明的表情进行识别。After the image processing terminal obtains the fusion features of the at least two face local areas, they are combined with the global features of the face image to perform expression recognition on the face image. For example: after acquiring Xiao Ming's face image-extracting the global features of Xiao Ming's face; extracting Xiao Ming's eye features; extracting Xiao Ming's mouth features. Furthermore, Xiaoming’s eye features and mouth features are spliced to obtain the corresponding splicing feature; the splicing feature is merged to obtain the corresponding fusion feature; and then the global feature of Xiaoming’s face is combined with the fusion feature to obtain the corresponding splicing feature. Identify it.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
在一实施例中，对眼部特征图与嘴部特征图的拼接过程可以表达为：y_cat = f_cat(y_eye, y_mouth)。在此之后的融合过程可以表达为：y_conv = y_cat * f + b。其中，y_cat表示特征拼接后的特征图；f_cat表示拼接过程；y_eye表示眼部特征图；y_mouth表示嘴部特征图；y_conv表示融合后的特征图；f表示滤波器，用于降低特征的维度，并在相同的空间位置融合y_eye与y_mouth；b为偏置项。具体地，可以采用尺寸为1×1×2D×D_C的卷积滤波器来降低特征的维度，并在相同的空间位置融合y_eye与y_mouth，其中，D_C表示输出通道数。
In an embodiment, the concatenation of the eye feature map and the mouth feature map can be expressed as y_cat = f_cat(y_eye, y_mouth), and the subsequent fusion can be expressed as y_conv = y_cat * f + b. Here, y_cat denotes the feature map after concatenation; f_cat denotes the concatenation operation; y_eye denotes the eye feature map; y_mouth denotes the mouth feature map; y_conv denotes the fused feature map; f denotes a filter used to reduce the feature dimensionality and to fuse y_eye and y_mouth at the same spatial position; and b is a bias term. Specifically, a convolution filter of size 1×1×2D×D_C can be used to reduce the feature dimensionality and fuse y_eye and y_mouth at the same spatial position, where D_C denotes the number of output channels.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
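A minimal PyTorch sketch of this concatenate-then-fuse step is given below. The (N, C, H, W) tensor layout and the module and parameter names are assumptions made for illustration.

```python
import torch
import torch.nn as nn

# A sketch of the concatenation-plus-fusion step: stack the eye and mouth
# feature maps along the channel axis, then fuse them with a 1x1 convolution.
class ConcatFusion(nn.Module):
    def __init__(self, channels_per_branch, out_channels):
        super().__init__()
        # 1x1 convolution acting as the filter f: it reduces the stacked 2D
        # channels and fuses the two maps at each spatial position
        self.fuse = nn.Conv2d(2 * channels_per_branch, out_channels,
                              kernel_size=1, bias=True)

    def forward(self, eye_feat, mouth_feat):
        y_cat = torch.cat([eye_feat, mouth_feat], dim=1)  # stack along channels
        return self.fuse(y_cat)                           # y_conv = y_cat * f + b
```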
通过上述说明可知,本申请实施例可由图像处理终端中预训练的神经网络执行。具体地,神经网络的预训练可以由预训练终端执行。可以理解的,该预训练终端可以与图像处理终端为同一终端,也可以为不同终端。下面对预训练神经网络的过程中,作出的用以提高特征表达精准度的改进作出详细说明。It can be seen from the above description that the embodiments of the present application can be executed by a neural network pre-trained in an image processing terminal. Specifically, the pre-training of the neural network may be performed by the pre-training terminal. It is understandable that the pre-training terminal and the image processing terminal may be the same terminal or different terminals. The following is a detailed description of the improvements made in the process of pre-training the neural network to improve the accuracy of feature expression.
下面对预训练神经网络所使用的损失函数作出的改进进行详细说明。The improvement of the loss function used by the pre-training neural network will be described in detail below.
在一实施例中,基于引入类间距离的中心损失函数L IC,对用于该图像处理(特别是进行人脸图像的表情识别)的神经网络进行预训练。损失函数是将随机事件或其有关随机变量的取值映射为非负实数,以表示该随机事件的“风险”或“损失”的函数。损失函数通常作为学习准则与优化问题相联系,即通过最小化损失函数求解和评估模型。 In one embodiment, the neural network used for image processing (especially for facial expression recognition) is pre-trained based on the center loss function L IC that introduces the distance between classes. The loss function is a function that maps the value of a random event or its related random variable to a non-negative real number to express the "risk" or "loss" of the random event. The loss function is usually associated with the optimization problem as a learning criterion, that is, solving and evaluating the model by minimizing the loss function.
其中,该类间距离包括当前输入特征对应的第一中心表情与该当前输入特征对应的第二中心表情之间的距离,该中心损失函数L IC表达为如下公式: Wherein, the inter-class distance includes the distance between the first central expression corresponding to the current input feature and the second central expression corresponding to the current input feature, and the central loss function L IC is expressed as the following formula:
（中心损失函数L IC的具体表达式由原申请附图PCTCN2020121349-appb-000008给出。）(The specific expression of the center loss function L IC is given in Figure PCTCN2020121349-appb-000008 of the original application.)
其中，x i为该当前输入特征，c yi为该第一中心表情，c k为该第二中心表情，m为训练该神经网络时所使用的训练数据集所包含的训练数据（或训练样本）的数量，该当前输入特征为该训练数据集中的一训练数据，n为表情的类别数，β为预设的平衡因子。Here, x i is the current input feature, c yi is the first center expression, c k is the second center expression, m is the number of training data (training samples) contained in the training data set used to train the neural network, the current input feature is one training datum in that training data set, n is the number of expression categories, and β is a preset balance factor.
该实施例中,对中心损失函数进行了改进——引入对类间距离的考虑,即,引入对各中心表情之间距离的考虑。In this embodiment, the central loss function is improved-the consideration of the distance between classes is introduced, that is, the consideration of the distance between each central expression is introduced.
在一些神经网络的训练方法中,中心损失函数对于每一个表情类别, 学习出一个类别中心,并通过一个惩罚函数对当前输入特征和该当前输入特征对应的类别中心之间的距离进行惩罚,从而实现减小类内距离的目的。换言之,在表情识别的情况下,中心损失函数对于每一类表情学习出一个作为标准的中心表情,并通过一个惩罚函数对当前输入特征和该当前输入特征对应的中心表情之间的距离进行惩罚,从而减小属于同一类表情的各表情间的距离,使得同一类表情更加向对应的中心表情进行靠拢。In some neural network training methods, the center loss function learns a category center for each expression category, and penalizes the distance between the current input feature and the category center corresponding to the current input feature through a penalty function, thus To achieve the purpose of reducing the distance within the class. In other words, in the case of expression recognition, the central loss function learns a standard central expression for each type of expression, and penalizes the distance between the current input feature and the central expression corresponding to the current input feature through a penalty function , Thereby reducing the distance between expressions belonging to the same type of expression, making the same type of expression closer to the corresponding central expression.
然而,在这些训练方法中,中心损失函数通常只考虑了不同表情类别的类内距离,而忽略了不同类之间的类间距离。如果两个类别中心距离过近时则可能会导致特征的聚类失败。换言之,在表情识别的情况下,中心损失函数只考虑同一类表情之间的距离,而未考虑各中心表情之间的距离。即使与中心表情A属于同一类表情的各表情均向中心表情A靠拢、与中心表情B属于同一类表情的各表情均向中心表情B靠拢,若中心表情A与中心表情B之间的距离过近,则可能会导致这两类表情覆盖的范围会出现部分重叠的情况(类似于两个圆发生部分重叠),这就会导致重叠范围内的表情既有50%左右的可能被判定为与中心表情A属于同一类表情,也有50%左右的可能被判定为与中心表情B属于同一类表情。聚类失败,极易发生混淆的情况。However, in these training methods, the central loss function usually only considers the intra-class distances of different expression categories, while ignoring the inter-class distances between different classes. If the centers of the two categories are too close, it may cause the clustering of features to fail. In other words, in the case of expression recognition, the central loss function only considers the distance between the same type of expressions, but does not consider the distance between the central expressions. Even if all the expressions belonging to the same type of expression as the center expression A move closer to the center expression A, and all the expressions belonging to the same type of expression as the center expression B move closer to the center expression B, if the distance between the center expression A and the center expression B exceeds Close, it may cause the two types of expressions to partially overlap (similar to the partial overlap of two circles), which will cause the expressions in the overlapped range to be judged as the same as 50%. The central expression A belongs to the same type of expression, and it may be determined as the same type of expression as the central expression B about 50%. If clustering fails, confusion can easily occur.
该实施例中,预训练终端引入对类间距离的考虑,在减小当前输入特征与中心表情的类内距离的同时,增大不同中心表情之间的类间距离。得到如下所示的中心损失函数L ICIn this embodiment, the pre-training terminal introduces the consideration of the inter-class distance, which reduces the intra-class distance between the current input feature and the central expression while increasing the inter-class distance between different central expressions. Obtain the central loss function L IC as shown below:
（该引入类间距离的中心损失函数L IC的具体表达式由原申请附图PCTCN2020121349-appb-000009给出。）(The specific expression of this center loss function L IC with the inter-class distance is given in Figure PCTCN2020121349-appb-000009 of the original application.)
其中,x i为所述当前输入特征,c yi为所述第一中心表情,c k为所述第二中心表情,m为训练所述神经网络时所使用的训练数据集所包含的训练数据的数量,所述当前输入特征为所述训练数据集中的一训练数据,n为表情的类别数,β为预设的平衡因子。 Where x i is the current input feature, c yi is the first central expression, c k is the second central expression, and m is the training data contained in the training data set used when training the neural network The current input feature is a piece of training data in the training data set, n is the number of expression categories, and β is a preset balance factor.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
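The sketch below shows one plausible PyTorch formulation of such a loss: it pulls each feature toward its own class center (intra-class term) while pushing that center away from the other class centers, balanced by β. The exact expression used in this application is the one given by the referenced figure, so this code is only an interpretation consistent with the variable definitions above.

```python
import torch
import torch.nn as nn

# A hedged sketch of a center loss with an added inter-class term; the
# intra-class part is the standard center loss, the inter-class part is one
# plausible reading of "increase the distance between different centers".
class InterClassCenterLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, beta=0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.beta = beta

    def forward(self, features, labels):
        centers_batch = self.centers[labels]                  # c_{y_i}
        intra = ((features - centers_batch) ** 2).sum(dim=1)  # ||x_i - c_{y_i}||^2
        # squared distances from each sample's own center to every class center
        d = torch.cdist(centers_batch, self.centers) ** 2     # shape (m, n)
        mask = torch.ones_like(d)
        mask[torch.arange(features.size(0)), labels] = 0.0    # drop k == y_i
        inter = (d * mask).sum(dim=1) / mask.sum(dim=1)       # mean over k != y_i
        return (intra - self.beta * inter).mean()
```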
在一实施例中,对该神经网络进行预训练,包括:基于预设的softmax损失函数L S与该中心损失函数L IC组成的联合损失函数L,对该神经网络进行联合监督预训练。 In one embodiment, pre-training the neural network includes: performing joint supervised pre-training on the neural network based on a joint loss function L composed of a preset softmax loss function L S and the central loss function L IC.
其中,该联合损失函数L表达为如下公式:Among them, the joint loss function L is expressed as the following formula:
L=L S+λL IC,其中,λ为预设的尺度因子。 L = L S + λL IC , where λ is a preset scale factor.
该实施例中,预训练终端采用预设的softmax损失函数与引入了类间距离的中心损失函数L IC组成的联合损失函数L,对神经网络进行联合监督预训练。具体地,该联合损失函数L可以表达为: In this embodiment, the pre-training terminal adopts a joint loss function L composed of a preset softmax loss function and a central loss function L IC that introduces the inter-class distance to perform joint supervised pre-training on the neural network. Specifically, the joint loss function L can be expressed as:
L = −∑_{i=1}^{m} log( exp(W_{y_i}^T·x_i + b_{y_i}) / ∑_{j=1}^{n} exp(W_j^T·x_i + b_j) ) + λ·L IC
其中,加号的前面部分即为softmax损失函数L S,加号的后面部分即为引入了类间距离的中心损失函数L IC。w表示权重;b表示偏置项;λ为预设的尺度因子,用于平衡L S与L IC。当λ被设置为0时,联合监督的损失函数变成传统的softmax损失函数。 Among them, the front part of the plus sign is the softmax loss function L S , and the back part of the plus sign is the center loss function L IC that introduces the distance between classes. w represents the weight; b represents the bias term; λ is the preset scale factor, used to balance L S and L IC . When λ is set to 0, the loss function of joint supervision becomes the traditional softmax loss function.
通过本申请实施例的图像处理方法所训练的神经网络,可以明确地区分不同表情之间的差异性,从而可以准确地进行表情特征的划分和识别。另外,针对于表情识别中的类内差异性问题,同样具有较好的同类表情识别效果。The neural network trained by the image processing method of the embodiment of the present application can clearly distinguish the differences between different expressions, so that the division and recognition of expression features can be accurately performed. In addition, for the problem of intra-class differences in facial expression recognition, it also has a better effect of similar facial expression recognition.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
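Assuming the InterClassCenterLoss sketch given earlier, joint supervision with the softmax loss and the scale factor λ could look like the following; the default λ and β values are placeholders for illustration and are not values taken from this application.

```python
import torch.nn as nn

# A sketch of joint supervision: softmax (cross-entropy) loss plus the
# inter-class center loss, weighted by lambda. With lam = 0 this reduces to
# ordinary softmax training, as noted in the text.
class JointLoss(nn.Module):
    def __init__(self, num_classes, feat_dim, lam=0.01, beta=0.5):
        super().__init__()
        self.softmax_loss = nn.CrossEntropyLoss()
        self.center_loss = InterClassCenterLoss(num_classes, feat_dim, beta)
        self.lam = lam

    def forward(self, logits, features, labels):
        return self.softmax_loss(logits, labels) + \
               self.lam * self.center_loss(features, labels)
```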
下面对预训练神经网络所使用的样本图像集作出的改进进行详细说明。The following is a detailed description of the improvements made to the sample image set used by the pre-training neural network.
在一实施例中,对神经网络进行预训练,包括:In one embodiment, pre-training the neural network includes:
获取包含样本图像的样本图像集;Obtain a sample image set containing sample images;
基于对该样本图像的变换,扩充该样本图像集;Based on the transformation of the sample image, expand the sample image set;
基于该扩充后的样本图像集,对用于该图像处理的神经网络进行预训练。Based on the expanded sample image set, the neural network used for image processing is pre-trained.
预训练终端在对神经网络进行预训练时,需要将样本图像输入神经网络,进而根据神经网络反馈的结果对神经网络参数进行调整。当样本图像集中样本图像的数量不足时,易出现过拟合的现象——可以看作神经网络“学习”到的“知识”过于狭隘。该实施例中,对样本图像进行变换,扩充样本图像集,进而基于扩充后的样本图像集对神经网络进行预训练,以避免过拟合现象的发生。When the pre-training terminal pre-trains the neural network, it needs to input the sample image into the neural network, and then adjust the neural network parameters according to the feedback result of the neural network. When the number of sample images in the sample image set is insufficient, over-fitting is prone to occur-it can be regarded as the "knowledge" "learned" by the neural network is too narrow. In this embodiment, the sample image is transformed, the sample image set is expanded, and the neural network is pre-trained based on the expanded sample image set to avoid the occurrence of overfitting.
在一实施例中,基于对该样本图像的变换,扩充该样本图像集,包括:In an embodiment, expanding the sample image set based on the transformation of the sample image includes:
对该样本图像进行翻转,获取该样本图像对应的翻转图像;Flip the sample image to obtain a flip image corresponding to the sample image;
将该翻转图像加入该样本图像集中,以扩充该样本图像集。The flipped image is added to the sample image set to expand the sample image set.
该实施例中,预训练终端扩充样本图像集时,对样本图像进行的变换为对样本图像进行翻转(例如:水平翻转、垂直翻转),获取到对应的翻转图像。若对每一样本图像都进行一次翻转,并将得到的对应的翻转图像加入样本图像集中,则该样本图像集被扩充为原先的2倍大小。In this embodiment, when the pre-training terminal expands the sample image set, the transformation performed on the sample image is to flip the sample image (for example, horizontal flip, vertical flip), and the corresponding flip image is obtained. If each sample image is flipped once, and the corresponding flipped image obtained is added to the sample image set, the sample image set is expanded to twice the original size.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
在一实施例中,基于对该样本图像的变换,扩充该样本图像集,包括:In an embodiment, expanding the sample image set based on the transformation of the sample image includes:
对该样本图像进行预设角度的旋转,获取该样本图像对应的旋转图像;Rotate the sample image by a preset angle, and obtain a rotated image corresponding to the sample image;
将该旋转图像加入该样本图像集中,以扩充该样本图像集。The rotated image is added to the sample image set to expand the sample image set.
该实施例中，预训练终端扩充样本图像集时，对样本图像进行的变换为对样本图像进行旋转，获取到对应的旋转图像。每对每一样本均进行一次预设角度的旋转，并将得到的对应的旋转图像加入样本图像集中，则该样本图像集都会被多扩充1倍。In this embodiment, when the pre-training terminal expands the sample image set, the transformation performed on the sample images is rotation, so as to obtain the corresponding rotated images. Each time every sample is rotated once by a preset angle and the resulting rotated images are added to the sample image set, the sample image set grows by another multiple of its original size.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
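A small sketch covering both augmentation variants described above (horizontal flipping and preset-angle rotation) is shown below, using OpenCV; the 10-degree angle is an illustrative choice only, since the embodiment only speaks of a preset rotation angle.

```python
import cv2

# A sketch of expanding the sample image set by flipping and rotating each
# sample image; every pass over the set adds one flipped and one rotated copy.
def expand_samples(images, angle=10.0):
    expanded = list(images)
    for img in images:
        expanded.append(cv2.flip(img, 1))  # horizontal flip doubles the set
        h, w = img.shape[:2]
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        expanded.append(cv2.warpAffine(img, m, (w, h)))  # preset-angle rotation
    return expanded
```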
在一实施例中,对神经网络进行预训练,包括:In one embodiment, pre-training the neural network includes:
获取包含样本图像的样本图像集;Obtain a sample image set containing sample images;
对该样本图像进行人脸检测,获取该样本图像中的人脸图像;Perform face detection on the sample image, and obtain a face image in the sample image;
对该人脸图像进行预设的裁剪、缩放,获取包含该裁剪、缩放后的人脸图像的人脸图像集;Perform preset cropping and scaling on the face image, and obtain a face image set containing the cropped and scaled face image;
基于该人脸图像集,对用于该图像处理的神经网络进行预训练。Based on the face image set, the neural network used for the image processing is pre-trained.
该实施例中,为了降低人脸尺度的变化——预训练终端对样本图像进行人脸检测,获取样本图像中的人脸图像;对该人脸图像进行预设的裁剪、缩放,获取到预设像素大小的人脸图像;进而基于由该像素大小的人脸图像组成的人脸图像集,对神经网络进行预训练。In this embodiment, in order to reduce the change in the scale of the face—the pre-training terminal performs face detection on the sample image to obtain the face image in the sample image; preset cropping and scaling of the face image are performed to obtain the pre-trained face image. Set the face image of the pixel size; and then pre-train the neural network based on the face image set composed of the face image of the pixel size.
例如:预训练终端对样本图像进行人脸检测,检测出样本图像中人脸图像所在的人脸框;将人脸框之外的部分裁掉,得到人脸图像;将人脸图 像缩放至122×96像素;进而基于由122×96像素大小的人脸图像组成的人脸图像集,对神经网络进行预训练。For example: the pre-training terminal performs face detection on the sample image, and detects the face frame in the sample image where the face image is located; cuts off the part outside the face frame to obtain the face image; scales the face image to 122 ×96 pixels; and based on the face image set composed of face images with the size of 122 × 96 pixels, the neural network is pre-trained.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
在一实施例中,在获取包含该裁剪、缩放后的人脸图像的人脸图像集后,包括:基于对该裁剪、缩放后的人脸图像的变换,扩充该人脸图像集。In one embodiment, after acquiring the face image set containing the cropped and scaled face images, the method includes: expanding the face image set based on the transformation of the cropped and scaled face images.
基于该人脸图像集,对用于该图像处理的神经网络进行预训练,包括:基于该扩充后的人脸图像集,对该神经网络进行预训练。Pre-training the neural network used for image processing based on the face image set includes: pre-training the neural network based on the expanded face image set.
该实施例中,预训练终端对人脸图像进行预设的裁剪、缩放后,得到人脸图像集后,还可以在此基础上进一步对该人脸图像进行变换,以扩充该人脸图像集,在降低了人脸尺度的变化的基础上,进一步地防止过拟合现象的出现。可以理解的,对该人脸图像进行的变换可以参考上述实施例中对样本图像进行的变换,故在此不再赘述。In this embodiment, after the pre-training terminal performs preset cropping and scaling on the face image to obtain the face image set, the face image set can be further transformed on this basis to expand the face image set , On the basis of reducing the change of face scale, further preventing the appearance of over-fitting phenomenon. It is understandable that the transformation performed on the face image can refer to the transformation performed on the sample image in the above-mentioned embodiment, so it will not be repeated here.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
下面参考图2对本申请一实施例中使用预训练的神经网络进行图像处理的过程进行示例性的详细描述。In the following, referring to FIG. 2, the process of image processing using a pre-trained neural network in an embodiment of the present application will be exemplarily described in detail.
参考图2所示,在一实施例中提出了一种表情敏感区域增强网络ESAEnNet(Expression Sensitive Area Enhancement Network)对人脸图像进行图像处理。该网络通过多任务级联卷积网络MTCNN(Multitask Cascaded Convolutional Network)提取人脸图像中的人脸关键点——左眼外眼角、右眼外眼角、鼻尖、左嘴角、右嘴角,从而定位到人脸图像中的眼部区域、嘴部区域。Referring to FIG. 2, in one embodiment, an expression sensitive area enhancement network ESAEnNet (Expression Sensitive Area Enhancement Network) is proposed to perform image processing on face images. The network uses the multitask cascaded convolutional network MTCNN (Multitask Cascaded Convolutional Network) to extract the key points of the face in the face image-the outer corner of the left eye, the outer corner of the right eye, the tip of the nose, the corner of the left mouth, and the corner of the right mouth to locate The eye area and mouth area in the face image.
该网络的主网络结构采用HCNet64,其中,HCNet64的内部具体结构参考图3。该网络通过HCNet64对眼部区域、嘴部区域进行处理,提取出眼部区域的眼部特征、嘴部区域的嘴部特征;该网络通过HCNet64对人物图像的原始图像进行处理,提取出人物图像的全局特征。The main network structure of this network adopts HCNet64, among which, refer to Fig. 3 for the specific internal structure of HCNet64. The network processes the eye area and mouth area through HCNet64, and extracts the eye features of the eye area and the mouth feature of the mouth area; the network processes the original image of the person image through HCNet64 to extract the person image Global characteristics.
提取出的眼部特征、嘴部特征于特征拼接层进行拼接,然后经过卷积层、池化层和Fusion Dense Block的微调后,进行融合。其中,Fusion Dense  Block的网络结构包含6层,依次为:BN(Batch Normalization)层、ReLU(Rectified linear unit)函数、1×1卷积核的卷积层、BN层、ReLU函数、3×3卷积核的卷积层。每个卷积层包含12个滤波器,除了全局平均池化层以外所有池化层的核大小均为2×2,并且步长为2。The extracted eye features and mouth features are stitched in the feature stitching layer, and then fine-tuned by the convolutional layer, pooling layer, and Fusion Dense Block before fusion. Among them, the network structure of Fusion Dense Block contains 6 layers, in order: BN (Batch Normalization) layer, ReLU (Rectified linear unit) function, 1×1 convolution kernel convolution layer, BN layer, ReLU function, 3×3 The convolutional layer of the convolution kernel. Each convolutional layer contains 12 filters, except for the global average pooling layer, the core size of all pooling layers is 2×2, and the step size is 2.
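A PyTorch sketch of the six-layer Fusion Dense Block described above follows. The 12 filters per convolution come from the text; the input channel count is left as a parameter and is an assumption of the example.

```python
import torch.nn as nn

# A sketch of the Fusion Dense Block: BN -> ReLU -> 1x1 conv -> BN -> ReLU ->
# 3x3 conv, with 12 filters per convolutional layer.
def fusion_dense_block(in_channels, growth=12):
    return nn.Sequential(
        nn.BatchNorm2d(in_channels),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_channels, growth, kernel_size=1, bias=False),
        nn.BatchNorm2d(growth),
        nn.ReLU(inplace=True),
        nn.Conv2d(growth, growth, kernel_size=3, padding=1, bias=False),
    )
```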
全局特征在提取过程中会经过三组注意力模块(Attention Block)——注意力模块1、注意力模块2、注意力模块3,注意力模块的内部具体结构参考图5。三组注意力模块的引入可以有效提高表情特征的表达能力。In the extraction process, the global features will pass through three groups of Attention Blocks—Attention Module 1, Attention Module 2, Attention Module 3. Refer to Figure 5 for the specific internal structure of the attention module. The introduction of three groups of attention modules can effectively improve the expression ability of expression features.
最后，该网络通过全连接层，即FC层，将眼部特征、嘴部特征、全局特征进行融合，进而在此基础上进行表情识别。Finally, the network fuses the eye features, the mouth features, and the global features through a fully connected layer, that is, the FC layer, and then performs expression recognition on this basis.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
图3示出了本申请一实施例HCNet64的内部具体结构。该实施例中,HCNet64由4ResBlock(残差块)、8ResBlock、4ResBlock组成。Figure 3 shows the specific internal structure of HCNet64 in an embodiment of the present application. In this embodiment, HCNet64 is composed of 4ResBlock (residual block), 8ResBlock, and 4ResBlock.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
图4示出了本申请一实施例的ResBlock残差块的内部具体结构。其中,每个ResBlock残差块由两个1×1卷积核的卷积层、一个3×3卷积核的卷积层连接而成。FIG. 4 shows the specific internal structure of the ResBlock residual block in an embodiment of the present application. Among them, each ResBlock residual block is formed by connecting two convolutional layers of 1×1 convolution kernels and a convolutional layer of 3×3 convolution kernels.
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
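The sketch below shows one way to realize such a ResBlock in PyTorch. The bottleneck ordering (1×1, then 3×3, then 1×1) and the BN/ReLU placement are assumptions, since the text only states that two 1×1 convolutions and one 3×3 convolution are connected.

```python
import torch.nn as nn

# A sketch of one ResBlock: two 1x1 convolutions and one 3x3 convolution with
# a residual (skip) connection around them.
class ResBlock(nn.Module):
    def __init__(self, channels, bottleneck):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, bottleneck, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(bottleneck), nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(x + self.body(x))  # residual connection
```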
图5示出了本申请一实施例卷积注意力模块的内部具体结构。该实施例中,卷积注意力模块CBAM(Convolutional Block Attention Module)主要由频道注意力模块(Channel Attention Module)、空间注意力模块(Spatial Attention Module)组成。输入特征(Input Feature)输入注意力模块后,依次经由频道注意力模块、空间注意力模块的处理后,提取出精炼特征(Refined Feature)。FIG. 5 shows the specific internal structure of the convolutional attention module according to an embodiment of the present application. In this embodiment, the convolutional attention module CBAM (Convolutional Block Attention Module) is mainly composed of a channel attention module (Channel Attention Module) and a spatial attention module (Spatial Attention Module). After the input feature is input to the attention module, the channel attention module and the spatial attention module are processed in turn to extract the refined feature (Refined Feature).
需要说明的是,该实施例只是示例性的说明,不应对本申请的功能和使用范围造成限制。It should be noted that this embodiment is only an exemplary description, and should not limit the function and scope of use of the present application.
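A compact PyTorch sketch of a CBAM-style block (channel attention followed by spatial attention, producing the refined feature) is given below; the reduction ratio and the 7×7 spatial kernel follow common CBAM practice rather than anything stated in this application.

```python
import torch
import torch.nn as nn

# A sketch of a CBAM-style attention block: channel attention, then spatial
# attention, applied to refine the input feature map.
class CBAM(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        b, c, _, _ = x.shape
        # channel attention: shared MLP over average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).view(b, c, 1, 1)
        # spatial attention: convolution over channel-wise average and max maps
        sp = torch.cat([x.mean(dim=1, keepdim=True),
                        x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(sp))  # refined feature
```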
下面对本申请实施例在实际应用中的实验表现进行展示。具体地,在 相同数据集的基础上,对现有的可进行表情识别的方法:LBP-TOP、HOG 3D、MSR、STM-ExpLet、DTAGN-Joint、3D-CNN、3D-CNN-DAP、GCNetS 1R1、IDEnNet进行表情识别的准确率,与本申请实施例进行表情识别的准确率进行对比。The following shows the experimental performance of the embodiments of the present application in practical applications. Specifically, on the basis of the same data set, the existing expression recognition methods: LBP-TOP, HOG 3D, MSR, STM-ExpLet, DTAGN-Joint, 3D-CNN, 3D-CNN-DAP, GCNetS The accuracy rate of expression recognition performed by 1R1 and IDEnNet is compared with the accuracy rate of expression recognition performed by the embodiment of the present application.
在一实施例中,基于CK+数据集对本申请实施例的实验表现进行展示。其中,CK+数据集是一个最具有代表性的表情识别数据集,也是现如今最被广泛应用的数据集。CK+数据集中包含了123各用户的593个视频序列;CK+数据集中的样本图像被标注为7种不同的表情:不屑、生气、厌恶、害怕、高兴、悲伤、吃惊。In an embodiment, the experimental performance of the embodiment of the present application is displayed based on the CK+ data set. Among them, the CK+ data set is the most representative facial expression recognition data set, and it is also the most widely used data set today. The CK+ dataset contains 593 video sequences of 123 users; the sample images in the CK+ dataset are labeled with 7 different expressions: disdain, angry, disgusted, afraid, happy, sad, surprised.
下表1示出了本申请实施例所提出的ESAEnNet与其他方法在CK+数据集上的表情识别准确率。Table 1 below shows the expression recognition accuracy rates of ESAEnNet and other methods proposed in the embodiments of the present application on the CK+ data set.
方法method 准确率(%)Accuracy(%)
3D-CNN3D-CNN 85.9085.90
LBP-TOPLBP-TOP 88.9988.99
HOG 3DHOG 3D 91.4091.40
MSRMSR 91.4491.44
3D-CNN-DAP3D-CNN-DAP 92.4092.40
STM-ExpLetSTM-ExpLet 94.1994.19
DTAGN-JointDTAGN-Joint 97.2597.25
GCNetS1R1GCNetS1R1 97.9397.93
IDEnNetIDEnNet 98.2398.23
ESAEnNetESAEnNet 99.0699.06
表1Table 1
下表2示出了本申请实施例在CK+数据集上进行应用,所得到的混淆矩阵。Table 2 below shows the confusion matrix obtained by applying the embodiment of the present application on the CK+ data set.
 To 生气pissed off 不屑Disdain 厌恶disgust 害怕Scared 高兴happy 悲伤sad 吃惊be surprised
生气pissed off 100%100% 0%0% 0%0% 0%0% 0%0% 0%0% 0%0%
不屑Disdain 0%0% 98.8%98.8% 0%0% 0.8%0.8% 0.4%0.4% 0%0% 0%0%
厌恶disgust 0%0% 0.1%0.1% 98.6%98.6% 0.3%0.3% 0%0% 1.2%1.2% 0%0%
害怕Scared 0%0% 0%0% 0%0% 98.8%98.8% 0%0% 0.5%0.5% 0.7%0.7%
高兴happy 0%0% 0%0% 0%0% 0%0% 100%100% 0%0% 0%0%
悲伤sad 0%0% 0%0% 1.1%1.1% 0.7%0.7% 0%0% 98.2%98.2% 0%0%
吃惊be surprised 0%0% 0%0% 0%0% 0%0% 0%0% 1.1%1.1% 98.9%98.9%
表2Table 2
其中,混淆矩阵的行表头表示实际识别的表情,列表头表示预先标注的表情,对应的数值则表示对应预先标注的表情有多少被识别为对应实际识别的表情。以表2中第2行(不计入表头)的数据为例:第2行第2列的数值为98.8%,则说明本申请实施例在将98.8%的预先标注为“不屑”的表情正确识别为“不屑”;第2行第4列的数值为0.8%,则说明本申请实施例将0.8%的预先标注为“不屑”的表情错误识别为“害怕”;第2行第5列的数值为0.4%,则说明本申请实施例将0.4%的预先标注为“不屑”的表情错误识别为“高兴”。Among them, the row header of the confusion matrix represents the actually recognized expressions, the list header represents the pre-labeled expressions, and the corresponding value indicates how many of the pre-labeled expressions are recognized as corresponding to the actually recognized expressions. Take the data in the second row of Table 2 (not included in the header) as an example: the value in the second row and the second column is 98.8%, which means that 98.8% of the emoticons are pre-marked as "disdain" in the embodiment of this application. Correctly recognized as "disdain"; the value in the second row and the fourth column is 0.8%, which means that 0.8% of the emoticons pre-marked as "disdain" are incorrectly recognized as "fear" in the embodiment of the present application; the second row and fifth column The value of is 0.4%, which means that the embodiment of the present application mistakenly recognizes 0.4% of the expressions pre-marked as "disdain" as "happy".
在一实施例中,基于MMI数据集对本申请实施例的实验表现进行展示。MMI数据集中包含了30个用户的312个视频序列;MMI数据集中的样本图像被标注了6种不同的表情——生气、厌恶、害怕、高兴、悲伤、吃惊。In an embodiment, the experimental performance of the embodiment of the present application is displayed based on the MMI data set. The MMI data set contains 312 video sequences of 30 users; the sample images in the MMI data set are labeled with 6 different expressions-angry, disgusted, afraid, happy, sad, surprised.
下表3示出了本申请实施例所提出的ESAEnNet与其他方法在MMI数据集上的表情识别准确率。Table 3 below shows the expression recognition accuracy rates of the ESAEnNet and other methods proposed in the embodiments of the present application on the MMI data set.
方法method 准确率(%)Accuracy(%)
3D-CNN3D-CNN 53.2053.20
LBP-TOPLBP-TOP 59.5159.51
HOG 3DHOG 3D 60.8960.89
3D-CNN-DAP3D-CNN-DAP 63.4063.40
DTAGN-JointDTAGN-Joint 70.2470.24
CSPLCSPL 73.5373.53
STM-ExpLetSTM-ExpLet 75.1275.12
GCNetS1R1GCNetS1R1 81.5381.53
IDEnNetIDEnNet 91.9791.97
ESAEnNetESAEnNet 93.4193.41
表3table 3
下表4示出了本申请实施例在MMI数据集上进行应用,所得到的混淆矩阵。Table 4 below shows the confusion matrix obtained by applying the embodiment of the present application on the MMI data set.
 	生气 Angry	厌恶 Disgust	害怕 Afraid	高兴 Happy	悲伤 Sad	吃惊 Surprised
生气 Angry	98.8%	0%	1.0%	0%	0.2%	0%
厌恶 Disgust	0.5%	94.0%	0.3%	1.0%	4.2%	0%
害怕 Afraid	0%	3.1%	94.6%	0.9%	1.4%	0%
高兴 Happy	0%	0%	0%	89.9%	0%	0%
悲伤 Sad	12.9%	3.6%	0.7%	1.9%	80.9%	0%
吃惊 Surprised	0%	0%	0.4%	0%	0.6%	99.0%
表4 Table 4
在一实施例中，基于Oulu-CASIA数据集的VIS子集对本申请实施例的实验表现进行展示。Oulu-CASIA数据集中包含了80个用户的480个视频序列；Oulu-CASIA数据集中的样本图像被标注了6种不同的表情——生气、厌恶、害怕、高兴、悲伤、吃惊。其中，Oulu-CASIA数据集的VIS子集，是指通过VIS相机在强光条件下捕捉到的视频序列。In an embodiment, the experimental performance of the embodiments of the present application is demonstrated on the VIS subset of the Oulu-CASIA data set. The Oulu-CASIA data set contains 480 video sequences from 80 subjects; the sample images in the Oulu-CASIA data set are labeled with 6 different expressions: angry, disgust, afraid, happy, sad, and surprised. The VIS subset of the Oulu-CASIA data set refers to the video sequences captured by a visible-light (VIS) camera under strong illumination.
下表5示出了本申请实施例所提出的ESAEnNet与其他方法在Oulu-CASIA数据集的VIS子集上的表情识别准确率。Table 5 below shows the expression recognition accuracy, on the VIS subset of the Oulu-CASIA data set, of the ESAEnNet proposed in the embodiments of the present application and of other methods.
方法 Method	准确率 Accuracy (%)
HOG 3D	70.60
AdaLBP	73.54
STM-ExpLet	74.59
Atlases	75.52
DTAGN-Joint	81.46
PPDN	84.59
GCNetS1R1	86.39
FN2EN	87.71
IDEnNet	87.18
ESAEnNet	91.08
表5 Table 5
下表6示出了本申请实施例在Oulu-CASIA数据集的VIS子集上进行应用,所得到的混淆矩阵。Table 6 below shows the confusion matrix obtained by applying the embodiment of the present application on the VIS subset of the Oulu-CASIA data set.
 	生气 Angry	厌恶 Disgust	害怕 Afraid	高兴 Happy	悲伤 Sad	吃惊 Surprised
生气 Angry	78.9%	7.9%	0.9%	0%	12.3%	0%
厌恶 Disgust	5.0%	82.6%	0.9%	0%	11.5%	0%
害怕 Afraid	0%	0%	86.8%	3.3%	0.8%	9.1%
高兴 Happy	0%	0%	0%	99.1%	0%	0.9%
悲伤 Sad	0%	0.5%	0%	0%	99.5%	0%
吃惊 Surprised	0%	0%	0.3%	0%	0%	99.7%
表6 Table 6
通过以上实验数据可见：本申请实施例的表情识别的准确率，无论是在CK+数据集上的表现、在MMI数据集上的表现、还是在强光条件下的Oulu-CASIA数据集的VIS子集上的表现，相较现有的图像处理方法的表情识别的准确率，均存在明显的提升。From the above experimental data it can be seen that the expression recognition accuracy of the embodiments of the present application is significantly higher than that of existing image processing methods, whether on the CK+ data set, on the MMI data set, or on the VIS subset of the Oulu-CASIA data set captured under strong light conditions.
根据本申请一实施例,如图6所示,还提供了一种图像处理装置,所述装置包括:According to an embodiment of the present application, as shown in FIG. 6, an image processing device is also provided, the device including:
获取模块210,配置为获取待处理的人脸图像;The obtaining module 210 is configured to obtain a face image to be processed;
提取模块220,配置为提取所述人脸图像的关键点;The extraction module 220 is configured to extract key points of the face image;
定位模块230,配置为基于所述关键点,定位所述人脸图像中的表情敏感区域,所述表情敏感区域为表情特征信息密集的人脸局部区域;The positioning module 230 is configured to locate an expression-sensitive area in the face image based on the key point, where the expression-sensitive area is a local area of the face with dense expression feature information;
识别模块240,配置为基于所述表情敏感区域,对所述人脸图像进行表情识别。The recognition module 240 is configured to perform facial expression recognition on the facial image based on the facial expression sensitive area.
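As a non-limiting illustration of how the four modules above could be composed in software, the following Python skeleton mirrors the acquisition → key-point extraction → region localization → recognition flow. The class and method names are the editor's assumptions and do not appear in the publication; a concrete localization example is sketched after the module descriptions below.

```python
class ExpressionRecognitionDevice:
    """Illustrative skeleton of the device: acquisition module 210, extraction
    module 220, positioning module 230 and recognition module 240."""

    def __init__(self, landmark_detector, region_locator, expression_classifier):
        self.landmark_detector = landmark_detector        # any face key-point model
        self.region_locator = region_locator              # locates expression-sensitive regions
        self.expression_classifier = expression_classifier

    def process(self, face_image):
        keypoints = self.landmark_detector(face_image)            # module 220
        regions = self.region_locator(face_image, keypoints)      # module 230
        return self.expression_classifier(face_image, regions)    # module 240
```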
在本申请的一示例性实施例中,所述表情敏感区域包括至少两个人脸局部区域,所述定位模块230配置为:In an exemplary embodiment of the present application, the expression sensitive area includes at least two partial areas of a human face, and the positioning module 230 is configured to:
从所述人脸关键点中,定位所述至少两个人脸局部区域分别对应的区域关键点;From the key points of the face, locate the area key points corresponding to the at least two face local areas respectively;
基于所述区域关键点,分别定位所述至少两个人脸局部区域。Based on the key points of the area, the at least two face local areas are respectively located.
在本申请的一示例性实施例中，所述至少两个人脸局部区域包括眼部区域、嘴部区域，所述定位模块230配置为：In an exemplary embodiment of the present application, the at least two human face partial areas include an eye area and a mouth area, and the positioning module 230 is configured to:
从所述人脸关键点中,定位所述眼部区域对应的眼部关键点、所述嘴部区域对应的嘴部关键点;From the key points of the face, locate the key points of the eyes corresponding to the eye area and the key points of the mouth corresponding to the mouth area;
基于所述眼部关键点,定位所述眼部区域;Locating the eye area based on the key points of the eye;
基于所述嘴部关键点,定位所述嘴部区域。Based on the key points of the mouth, the mouth area is located.
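A minimal sketch of this eye/mouth localization is given below. It assumes 68-point landmarks in the common dlib-style ordering (indices 36–47 for the eyes, 48–67 for the mouth) and a small expansion margin; the landmark convention and the margin value are assumptions of the sketch, since the publication does not fix a particular key-point scheme.

```python
import numpy as np

def crop_region(image, points, margin=0.2):
    """Crop a local face region from the bounding box of its key points,
    expanded by a small margin and clipped to the image borders."""
    x0, y0 = points.min(axis=0)
    x1, y1 = points.max(axis=0)
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    h, w = image.shape[:2]
    x0, y0 = max(int(x0 - dx), 0), max(int(y0 - dy), 0)
    x1, y1 = min(int(x1 + dx), w), min(int(y1 + dy), h)
    return image[y0:y1, x0:x1]

def locate_eye_and_mouth(image, landmarks):
    """landmarks: (68, 2) array in the assumed 68-point convention."""
    eye_region = crop_region(image, landmarks[36:48])    # both eyes
    mouth_region = crop_region(image, landmarks[48:68])  # mouth
    return eye_region, mouth_region
```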
在本申请的一示例性实施例中,所述识别模块240配置为:提取所述人脸图像对应的全局特征;In an exemplary embodiment of the present application, the recognition module 240 is configured to: extract global features corresponding to the face image;
从所述表情敏感区域中提取所述表情敏感区域对应的区域特征;Extracting regional features corresponding to the expression-sensitive area from the expression-sensitive area;
基于所述全局特征以及所述区域特征,对所述人脸图像进行表情识别。Perform expression recognition on the face image based on the global feature and the regional feature.
在本申请的一示例性实施例中,所述表情敏感区域包括至少两个人脸局部区域,所述识别模块240配置为:In an exemplary embodiment of the present application, the expression sensitive area includes at least two partial areas of a human face, and the recognition module 240 is configured to:
从所述至少两个人脸局部区域中分别提取所述至少两个人脸局部区域分别对应的区域特征;Extracting, respectively, regional features corresponding to the at least two human face local regions from the at least two human face local regions;
对所述至少两个人脸局部区域分别对应的区域特征进行拼接,获取所述至少两个人脸局部区域的拼接特征;Splicing regional features corresponding to the at least two human face local regions respectively, to obtain the splicing features of the at least two human face local regions;
对所述拼接特征进行融合,获取所述至少两个人脸局部区域的融合特征;Fusing the splicing features to obtain the fusion features of the at least two face local areas;
基于所述全局特征以及所述融合特征,对所述人脸图像进行表情识别。Perform expression recognition on the face image based on the global feature and the fusion feature.
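The sketch below illustrates one way to realize the extraction, splicing (concatenation) and fusion of the regional features together with the global feature, in PyTorch. The backbone networks, feature dimension and the fully connected fusion layer are assumptions of the sketch; the publication does not limit the fusion to this particular form.

```python
import torch
import torch.nn as nn

class GlobalLocalFusionNet(nn.Module):
    """Illustrative sketch: a global branch over the whole face plus one branch per
    expression-sensitive region; region features are spliced (concatenated), fused,
    and then combined with the global feature for expression classification."""

    def __init__(self, global_backbone, region_backbones, feat_dim, num_classes):
        super().__init__()
        self.global_backbone = global_backbone             # e.g. a CNN over the full face
        self.region_backbones = nn.ModuleList(region_backbones)
        self.fuse = nn.Linear(feat_dim * len(region_backbones), feat_dim)  # fusion layer
        self.classify = nn.Linear(feat_dim * 2, num_classes)

    def forward(self, face, regions):
        g = self.global_backbone(face)                                    # global feature
        r = [net(x) for net, x in zip(self.region_backbones, regions)]    # per-region features
        spliced = torch.cat(r, dim=1)                                     # splicing
        fused = torch.relu(self.fuse(spliced))                            # fusion
        return self.classify(torch.cat([g, fused], dim=1))                # recognition
```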
在本申请的一示例性实施例中，所述装置配置为：基于引入类间距离的中心损失函数L_IC，对用于所述图像处理的神经网络进行预训练，其中，所述类间距离包括当前输入特征对应的第一中心表情、与所述当前输入特征对应的第二中心表情之间的距离，所述中心损失函数L_IC表达为如下公式：In an exemplary embodiment of the present application, the device is configured to pre-train the neural network used for the image processing based on a center loss function L_IC that introduces an inter-class distance, where the inter-class distance includes the distance between a first center expression corresponding to the current input feature and a second center expression corresponding to the current input feature, and the center loss function L_IC is expressed as the following formula:
Figure PCTCN2020121349-appb-000011
其中，x_i为所述当前输入特征，c_yi为所述第一中心表情，c_k为所述第二中心表情，m为训练所述神经网络时所使用的训练数据集所包含的训练数据（或训练样本）的数量，所述当前输入特征为所述训练数据集中的一训练数据，n为表情的类别数，β为预设的平衡因子。Where x_i is the current input feature, c_yi is the first center expression, c_k is the second center expression, m is the number of training data (or training samples) contained in the training data set used when training the neural network, the current input feature is one piece of training data in the training data set, n is the number of expression categories, and β is a preset balance factor.
在本申请的一示例性实施例中，所述装置配置为：基于预设的softmax损失函数L_S与所述中心损失函数L_IC组成的联合损失函数L，对所述神经网络进行联合监督预训练，其中，所述联合损失函数L表达为如下公式：In an exemplary embodiment of the present application, the device is configured to perform jointly supervised pre-training of the neural network based on a joint loss function L composed of a preset softmax loss function L_S and the center loss function L_IC, where the joint loss function L is expressed as the following formula:
L = L_S + λL_IC，其中，λ为预设的尺度因子。关于联合损失函数L的描述可以参见方法实施例部分。L = L_S + λL_IC, where λ is a preset scale factor. For a description of the joint loss function L, reference may be made to the method embodiments.
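In the published document the exact expression of L_IC is available only as an image (Figure PCTCN2020121349-appb-000011), so the following PyTorch sketch necessarily assumes one common formulation: a center loss pulling each feature toward its own expression center c_yi, with a β-weighted term pushing it away from the other centers c_k, jointly supervised with the softmax loss as L = L_S + λL_IC. It is an illustrative sketch under that assumption, not the patented definition; the class and function names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InterClassCenterLoss(nn.Module):
    """Assumed stand-in for L_IC: ||x_i - c_yi||^2 minus a beta-weighted mean of the
    distances to the other n-1 expression centers c_k (k != y_i). The exact formula
    in the publication is only shown as an image, so this form is an assumption."""
    def __init__(self, num_classes: int, feat_dim: int, beta: float = 0.5):
        super().__init__()
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.beta = beta
        self.num_classes = num_classes

    def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        dists = torch.cdist(features, self.centers).pow(2)            # (m, n) squared distances
        intra = dists.gather(1, labels.unsqueeze(1)).squeeze(1)        # ||x_i - c_yi||^2
        inter = (dists.sum(dim=1) - intra) / (self.num_classes - 1)    # mean over k != y_i
        return 0.5 * (intra - self.beta * inter).mean()

def joint_loss(logits, features, labels, center_loss, lam=0.01):
    """Joint supervision L = L_S + lambda * L_IC as described above."""
    l_softmax = F.cross_entropy(logits, labels)   # L_S
    l_ic = center_loss(features, labels)          # assumed L_IC
    return l_softmax + lam * l_ic
```

During training, the features (penultimate-layer output) and the logits would both be produced by the recognition network, and λ would balance the two supervision signals, in line with the preset scale factor λ described above.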
在本申请的一示例性实施例中,所述装置配置为:In an exemplary embodiment of the present application, the device is configured as:
获取包含样本图像的样本图像集;Obtain a sample image set containing sample images;
基于对所述样本图像的变换,扩充所述样本图像集;Expanding the sample image set based on the transformation of the sample image;
基于所述扩充后的样本图像集,对用于所述图像处理的神经网络进行预训练。Based on the expanded sample image set, the neural network used for image processing is pre-trained.
在本申请的一示例性实施例中,所述装置配置为:In an exemplary embodiment of the present application, the device is configured as:
对所述样本图像进行翻转,获取所述样本图像对应的翻转图像;Flipping the sample image to obtain a flipped image corresponding to the sample image;
将所述翻转图像加入所述样本图像集中,以扩充所述样本图像集。The flipped image is added to the sample image set to expand the sample image set.
在本申请的一示例性实施例中,所述装置配置为:In an exemplary embodiment of the present application, the device is configured as:
对所述样本图像进行预设角度的旋转,获取所述样本图像对应的旋转图像;Rotate the sample image by a preset angle, and obtain a rotated image corresponding to the sample image;
将所述旋转图像加入所述样本图像集中,以扩充所述样本图像集。The rotated image is added to the sample image set to expand the sample image set.
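As an illustration of the flipping and rotation augmentations described in the two embodiments above, the following sketch expands one sample image with PIL; the specific rotation angles are the editor's assumption, since the publication only refers to a preset angle.

```python
from PIL import Image

def augment_sample(path: str, angles=(-10, -5, 5, 10)):
    """Expand the sample image set with a horizontally flipped copy and several
    rotated copies of one sample image (illustrative choices of library and angles)."""
    image = Image.open(path).convert("RGB")
    augmented = [image.transpose(Image.FLIP_LEFT_RIGHT)]           # flipped image
    augmented += [image.rotate(a, expand=False) for a in angles]    # rotated images
    return augmented
```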
在本申请的一示例性实施例中,所述装置配置为:In an exemplary embodiment of the present application, the device is configured as:
获取包含样本图像的样本图像集;Obtain a sample image set containing sample images;
对所述样本图像进行人脸检测,获取所述样本图像中的人脸图像;Performing face detection on the sample image, and obtaining a face image in the sample image;
对所述人脸图像进行预设的裁剪、缩放,获取包含所述裁剪、缩放后的人脸图像的人脸图像集;Performing preset cropping and scaling on the face image, and obtaining a face image set containing the cropped and scaled face images;
基于所述人脸图像集,对用于所述图像处理的神经网络进行预训练。Pre-training the neural network used for the image processing based on the face image set.
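The following sketch illustrates the face detection, cropping and scaling steps used to build the face image set; the Haar-cascade detector and the 224×224 target size are assumptions of this sketch, as the publication does not tie the method to a particular detector or resolution.

```python
import cv2

def build_face_set(sample_paths, size=(224, 224)):
    """Detect the face in each sample image, crop it, and rescale it to a fixed size
    (illustrative pre-processing only; detector and size are assumed)."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    faces = []
    for path in sample_paths:
        img = cv2.imread(path)
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
        for (x, y, w, h) in boxes:
            faces.append(cv2.resize(img[y:y + h, x:x + w], size))
    return faces
```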
在本申请的一示例性实施例中,所述装置配置为:In an exemplary embodiment of the present application, the device is configured as:
基于对所述裁剪、缩放后的人脸图像的变换,扩充所述人脸图像集;Expanding the face image set based on the transformation of the cropped and zoomed face images;
基于所述扩充后的人脸图像集,对所述神经网络进行预训练。Pre-training the neural network based on the expanded face image set.
下面参考图7来描述根据本申请实施例的图像处理电子设备30。图7显示的图像处理电子设备30仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。The image processing electronic device 30 according to an embodiment of the present application will be described below with reference to FIG. 7. The image processing electronic device 30 shown in FIG. 7 is only an example, and should not bring any limitation to the functions and scope of use of the embodiments of the present application.
如图7所示,图像处理电子设备30以通用计算设备的形式表现。图像处理电子设备30的组件可以包括但不限于:上述至少一个处理单元310、上述至少一个存储单元320、连接不同***组件(包括存储单元320和处理 单元310)的总线330。As shown in FIG. 7, the image processing electronic device 30 is represented in the form of a general-purpose computing device. The components of the image processing electronic device 30 may include, but are not limited to: the aforementioned at least one processing unit 310, the aforementioned at least one storage unit 320, and a bus 330 connecting different system components (including the storage unit 320 and the processing unit 310).
其中，所述存储单元存储有程序代码，所述程序代码可以被所述处理单元310执行，使得所述处理单元310执行本说明书上述示例性方法的描述部分中描述的根据本发明各种示例性实施方式的步骤。例如，所述处理单元310可以执行如图1B中所示的各个步骤。The storage unit stores program code that can be executed by the processing unit 310, so that the processing unit 310 performs the steps of the various exemplary embodiments of the present invention described in the exemplary-method section of this specification. For example, the processing unit 310 may perform the steps shown in FIG. 1B.
存储单元320可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)3201和/或高速缓存存储单元3202,还可以进一步包括只读存储单元(ROM)3203。The storage unit 320 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 3201 and/or a cache storage unit 3202, and may further include a read-only storage unit (ROM) 3203.
存储单元320还可以包括具有一组(至少一个)程序模块3205的程序/实用工具3204,这样的程序模块3205包括但不限于:操作***、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。The storage unit 320 may also include a program/utility tool 3204 having a set of (at least one) program modules 3205. Such program modules 3205 include but are not limited to: an operating system, one or more application programs, other program modules, and program data, Each of these examples or some combination may include the implementation of a network environment.
总线330可以为表示几类总线结构中的一种或多种，包括存储单元总线或者存储单元控制器、***总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。The bus 330 may represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
图像处理电子设备30也可以与一个或多个外部设备400(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该图像处理电子设备30交互的设备通信,和/或与使得该图像处理电子设备30能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口350进行。输入/输出(I/O)接口350与显示单元340相连。并且,图像处理电子设备30还可以通过网络适配器360与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器360通过总线330与图像处理电子设备30的其它模块通信。应当明白,尽管图中未示出,可以结合图像处理电子设备30使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID***、磁带驱动器以及数据备份存储***等。The image processing electronic device 30 may also communicate with one or more external devices 400 (such as keyboards, pointing devices, Bluetooth devices, etc.), and may also communicate with one or more devices that enable a user to interact with the image processing electronic device 30. And/or communicate with any device (such as a router, modem, etc.) that enables the image processing electronic device 30 to communicate with one or more other computing devices. This communication can be performed through an input/output (I/O) interface 350. An input/output (I/O) interface 350 is connected to the display unit 340. In addition, the image processing electronic device 30 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 360. As shown in the figure, the network adapter 360 communicates with other modules of the image processing electronic device 30 through the bus 330. It should be understood that although not shown in the figure, other hardware and/or software modules can be used in conjunction with the image processing electronic device 30, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, Tape drives and data backup storage systems, etc.
通过以上的实施方式的描述，本领域的技术人员易于理解，这里描述的示例实施方式可以通过软件实现，也可以通过软件结合必要的硬件的方式来实现。因此，根据本申请实施方式的技术方案可以以软件产品的形式体现出来，该软件产品可以存储在一个非易失性存储介质（可以是CD-ROM，U盘，移动硬盘等）中或网络上，包括若干指令以使得一台计算设备（可以是个人计算机、服务器、终端装置、或者网络设备等）执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, or the like) to execute the method according to the embodiments of the present application.
在本申请的示例性实施例中，还提供了一种计算机可读存储介质，其上存储有计算机可读指令，当所述计算机可读指令被计算机的处理器执行时，使计算机执行上述方法实施例部分描述的方法。In the exemplary embodiments of the present application, a computer-readable storage medium is also provided, on which computer-readable instructions are stored; when the computer-readable instructions are executed by the processor of a computer, the computer is caused to execute the method described in the above method embodiments.
根据本申请的一个实施例，还提供了一种用于实现上述方法实施例中的方法的程序产品，其可以采用便携式紧凑盘只读存储器（CD-ROM）并包括程序代码，并可以在终端设备，例如个人电脑上运行。然而，本发明的程序产品不限于此，在本文件中，可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行***、装置或者器件使用或者与其结合使用。According to an embodiment of the present application, a program product for implementing the method in the above method embodiments is also provided, which may adopt a portable compact disk read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer. However, the program product of the present invention is not limited to this. In this document, the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
所述程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的***、装置或器件，或者任意以上的组合。可读存储介质的更具体的例子（非穷举的列表）包括：具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器（RAM）、只读存储器（ROM）、可擦式可编程只读存储器（EPROM或闪存）、光纤、便携式紧凑盘只读存储器（CD-ROM）、光存储器件、磁存储器件、或者上述的任意合适的组合。The program product can use any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行***、装置或者器件使用或者与其结合使用的程序。The computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing. The readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。The program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the foregoing.
可以以一种或多种程序设计语言的任意组合来编写用于执行本申请操作的程序代码，所述程序设计语言包括面向对象的程序设计语言——诸如JAVA、C++等，还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中，远程计算设备可以通过任意种类的网络，包括局域网（LAN）或广域网（WAN），连接到用户计算设备，或者，可以连接到外部计算设备（例如利用因特网服务提供商来通过因特网连接）。The program code used to perform the operations of this application can be written in any combination of one or more programming languages, including object-oriented programming languages such as JAVA and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In the case of a remote computing device, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computing device (for example, via the Internet using an Internet service provider).
应当注意,尽管在上文详细描述中提及了用于动作执行的设备的若干模块或者单元,但是这种划分并非强制性的。实际上,根据本申请的实施方式,上文描述的两个或更多模块或者单元的特征和功能可以在一个模块或者单元中具体化。反之,上文描述的一个模块或者单元的特征和功能可以进一步划分为由多个模块或者单元来具体化。It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, this division is not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of a module or unit described above can be further divided into multiple modules or units to be embodied.
此外,尽管在附图中以特定顺序描述了本申请中方法的各个步骤,但是,这并非要求或者暗示必须按照该特定顺序来执行这些步骤,或是必须执行全部所示的步骤才能实现期望的结果。附加的或备选的,可以省略某些步骤,将多个步骤合并为一个步骤执行,以及/或者将一个步骤分解为多个步骤执行等。In addition, although the various steps of the method in the present application are described in a specific order in the drawings, this does not require or imply that these steps must be performed in the specific order, or that all the steps shown must be performed to achieve the desired result. Additionally or alternatively, some steps may be omitted, multiple steps may be combined into one step for execution, and/or one step may be decomposed into multiple steps for execution, etc.
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、移动终端、或者网络设备等)执行根据本申请实施方式的方法。Through the description of the above embodiments, those skilled in the art can easily understand that the example embodiments described here can be implemented by software, or can be implemented by combining software with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, U disk, mobile hard disk, etc.) or on the network , Including several instructions to make a computing device (which can be a personal computer, a server, a mobile terminal, or a network device, etc.) execute the method according to the embodiment of the present application.
本领域技术人员在考虑说明书及实践这里公开的发明后，将容易想到本申请的其它实施方案。本申请旨在涵盖本申请的任何变型、用途或者适应性变化，这些变型、用途或者适应性变化遵循本申请的一般性原理并包括本申请未公开的本技术领域中的公知常识或惯用技术手段。说明书和实施例仅被视为示例性的，本申请的真正范围和精神由所附的权利要求指出。After considering the specification and practicing the invention disclosed herein, those skilled in the art will easily conceive of other embodiments of the present application. This application is intended to cover any variations, uses, or adaptive changes of this application that follow its general principles and include common knowledge or customary technical means in the technical field not disclosed in this application. The description and embodiments are regarded as exemplary only, and the true scope and spirit of the application are pointed out by the appended claims.

Claims (15)

  1. 一种图像处理方法,由电子设备执行,其特征在于,所述方法包括:An image processing method executed by an electronic device, characterized in that the method includes:
    获取待处理的人脸图像;Obtain the face image to be processed;
    提取所述人脸图像的人脸关键点;Extracting face key points of the face image;
    基于所述人脸关键点,定位所述人脸图像中的表情敏感区域,所述表情敏感区域为表情特征信息密集的人脸局部区域;Locating an expression-sensitive area in the face image based on the key points of the human face, where the expression-sensitive area is a local area of the human face with dense expression feature information;
    基于所述表情敏感区域,对所述人脸图像进行表情识别。Based on the expression sensitive area, perform expression recognition on the face image.
  2. 根据权利要求1所述的方法,其特征在于,所述表情敏感区域包括至少两个人脸局部区域,基于所述人脸关键点,定位所述人脸图像中的表情敏感区域,包括:The method according to claim 1, wherein the expression-sensitive area includes at least two partial areas of a human face, and locating the expression-sensitive area in the face image based on the key points of the human face comprises:
    从所述人脸关键点中,获取所述至少两个人脸局部区域分别对应的区域关键点;Acquiring, from the key points of the face, the area key points corresponding to the at least two face local areas respectively;
    基于所述区域关键点,分别定位所述至少两个人脸局部区域。Based on the key points of the area, the at least two face local areas are respectively located.
  3. 根据权利要求2所述的方法,其特征在于,所述至少两个人脸局部区域包括眼部区域和嘴部区域,The method according to claim 2, wherein the at least two partial areas of the human face include an eye area and a mouth area,
    从所述人脸关键点中，获取所述至少两个人脸局部区域分别对应的区域关键点，包括：从所述人脸关键点中，获取所述眼部区域对应的眼部关键点、所述嘴部区域对应的嘴部关键点；From the face key points, obtaining the area key points corresponding to the at least two face local areas respectively includes: obtaining, from the face key points, the eye key points corresponding to the eye area and the mouth key points corresponding to the mouth area;
    基于所述区域关键点,分别定位所述至少两个人脸局部区域,包括:Based on the region key points, respectively locating the at least two human face local regions includes:
    基于所述眼部关键点,定位所述眼部区域;Locating the eye area based on the key points of the eye;
    基于所述嘴部关键点,定位所述嘴部区域。Based on the key points of the mouth, the mouth area is located.
  4. 根据权利要求1所述的方法,其特征在于,基于所述表情敏感区域,对所述人脸图像进行表情识别,包括:The method according to claim 1, wherein performing expression recognition on the face image based on the expression sensitive area comprises:
    提取所述人脸图像对应的全局特征;Extracting global features corresponding to the face image;
    从所述表情敏感区域中提取所述表情敏感区域对应的区域特征;Extracting regional features corresponding to the expression-sensitive area from the expression-sensitive area;
    基于所述全局特征以及所述区域特征,对所述人脸图像进行表情识别。Perform expression recognition on the face image based on the global feature and the regional feature.
  5. 根据权利要求4所述的方法，其特征在于，所述表情敏感区域包括至少两个人脸局部区域，从所述表情敏感区域中提取所述表情敏感区域对应的区域特征，包括：从所述至少两个人脸局部区域中分别提取所述至少两个人脸局部区域分别对应的区域特征；The method according to claim 4, wherein the expression-sensitive area includes at least two partial areas of a human face, and extracting the regional features corresponding to the expression-sensitive area from the expression-sensitive area includes: extracting, from the at least two human face local regions, the regional features respectively corresponding to the at least two human face local regions;
    在基于所述全局特征以及所述区域特征,对所述人脸图像进行表情识别之前,还包括:Before performing expression recognition on the face image based on the global feature and the regional feature, the method further includes:
    对所述至少两个人脸局部区域分别对应的区域特征进行拼接,获取所述至少两个人脸局部区域的拼接特征;Splicing regional features corresponding to the at least two human face local regions respectively, to obtain the splicing features of the at least two human face local regions;
    对所述拼接特征进行融合,获取所述至少两个人脸局部区域的融合特征;Fusing the splicing features to obtain the fusion features of the at least two face local areas;
    基于所述全局特征以及所述区域特征,对所述人脸图像进行表情识别,包括:基于所述全局特征以及所述融合特征,对所述人脸图像进行表情识别。Performing expression recognition on the face image based on the global feature and the regional feature includes: performing expression recognition on the face image based on the global feature and the fusion feature.
  6. 根据权利要求1至5任一项权利要求所述的方法，其特征在于，所述方法还包括：基于引入类间距离的中心损失函数L_IC，对用于所述图像处理的神经网络进行预训练，The method according to any one of claims 1 to 5, wherein the method further comprises: pre-training the neural network used for the image processing based on a center loss function L_IC that introduces an inter-class distance,
    其中，所述类间距离包括当前输入特征对应的第一中心表情与所述当前输入特征对应的第二中心表情之间的距离，所述中心损失函数L_IC表达为如下公式：wherein the inter-class distance includes the distance between a first center expression corresponding to the current input feature and a second center expression corresponding to the current input feature, and the center loss function L_IC is expressed as the following formula:
    Figure PCTCN2020121349-appb-100001
    其中，x_i为所述当前输入特征，c_yi为所述第一中心表情，c_k为所述第二中心表情，m为训练所述神经网络时所使用的训练数据集所包含的训练数据的数量，所述当前输入特征为所述训练数据集中的一训练数据，n为表情的类别数，β为预设的平衡因子。Where x_i is the current input feature, c_yi is the first center expression, c_k is the second center expression, m is the number of training data contained in the training data set used when training the neural network, the current input feature is one piece of training data in the training data set, n is the number of expression categories, and β is a preset balance factor.
  7. 根据权利要求6所述的方法，其特征在于，对所述神经网络进行预训练，包括：基于预设的softmax损失函数L_S与所述中心损失函数L_IC组成的联合损失函数L，对所述神经网络进行联合监督预训练，The method according to claim 6, wherein pre-training the neural network comprises: performing jointly supervised pre-training of the neural network based on a joint loss function L composed of a preset softmax loss function L_S and the center loss function L_IC,
    其中，所述联合损失函数L表达为如下公式：wherein the joint loss function L is expressed as the following formula:
    L = L_S + λL_IC，其中，λ为预设的尺度因子。L = L_S + λL_IC, where λ is a preset scale factor.
  8. 根据权利要求1所述的方法,其特征在于,所述方法还包括:The method according to claim 1, wherein the method further comprises:
    获取包含样本图像的样本图像集;Obtain a sample image set containing sample images;
    基于对所述样本图像的变换,扩充所述样本图像集;Expanding the sample image set based on the transformation of the sample image;
    基于所述扩充后的样本图像集,对用于所述图像处理的神经网络进行预训练。Based on the expanded sample image set, the neural network used for image processing is pre-trained.
  9. 根据权利要求8所述的方法,其特征在于,基于对所述样本图像的 变换,扩充所述样本图像集,包括:The method according to claim 8, characterized in that, based on the transformation of the sample image, expanding the sample image set comprises:
    对所述样本图像进行翻转,获取所述样本图像对应的翻转图像;Flipping the sample image to obtain a flipped image corresponding to the sample image;
    将所述翻转图像加入所述样本图像集中,以扩充所述样本图像集。The flipped image is added to the sample image set to expand the sample image set.
  10. 根据权利要求8所述的方法,其特征在于,基于对所述样本图像的变换,扩充所述样本图像集,包括:The method according to claim 8, characterized in that, based on the transformation of the sample image, expanding the sample image set comprises:
    对所述样本图像进行预设角度的旋转,获取所述样本图像对应的旋转图像;Rotate the sample image by a preset angle, and obtain a rotated image corresponding to the sample image;
    将所述旋转图像加入所述样本图像集中,以扩充所述样本图像集。The rotated image is added to the sample image set to expand the sample image set.
  11. 根据权利要求1所述的方法,其特征在于,所述方法还包括:The method according to claim 1, wherein the method further comprises:
    获取包含样本图像的样本图像集;Obtain a sample image set containing sample images;
    对所述样本图像进行人脸检测,获取所述样本图像中的人脸图像;Performing face detection on the sample image, and obtaining a face image in the sample image;
    对所述人脸图像进行预设的裁剪、缩放,获取包含所述裁剪、缩放后的人脸图像的人脸图像集;Performing preset cropping and scaling on the face image, and obtaining a face image set containing the cropped and scaled face images;
    基于所述人脸图像集,对用于所述图像处理的神经网络进行预训练。Pre-training the neural network used for the image processing based on the face image set.
  12. 根据权利要求11所述的方法,其特征在于,在获取包含所述裁剪、缩放后的人脸图像的人脸图像集后,包括:基于对所述裁剪、缩放后的人脸图像的变换,扩充所述人脸图像集;The method according to claim 11, characterized in that, after acquiring a face image set containing the cropped and scaled face images, it comprises: based on the transformation of the cropped and scaled face images, Expanding the face image set;
    基于所述人脸图像集,对用于所述图像处理的神经网络进行预训练,包括:基于所述扩充后的人脸图像集,对所述神经网络进行预训练。Pre-training the neural network used for image processing based on the face image set includes: pre-training the neural network based on the expanded face image set.
  13. 一种图像处理装置,其特征在于,所述装置包括:An image processing device, characterized in that the device includes:
    获取模块,配置为获取待处理的人脸图像;The obtaining module is configured to obtain the face image to be processed;
    提取模块,配置为提取所述人脸图像的关键点;An extraction module, configured to extract key points of the face image;
    定位模块,配置为基于所述关键点,定位所述人脸图像中的表情敏感区域,所述表情敏感区域为表情特征信息密集的人脸局部区域;A positioning module configured to locate an expression-sensitive area in the face image based on the key point, where the expression-sensitive area is a local area of the face with dense expression feature information;
    识别模块,配置为基于所述表情敏感区域,对所述人脸图像进行表情识别。The recognition module is configured to perform facial expression recognition on the facial image based on the facial expression sensitive area.
  14. 一种图像处理电子设备,其特征在于,包括:An image processing electronic device, characterized in that it comprises:
    存储器,存储有计算机可读指令;The memory stores computer-readable instructions;
    处理器,读取存储器存储的计算机可读指令,以执行权利要求1-12中的任一项所述的方法。The processor reads computer-readable instructions stored in the memory to execute the method according to any one of claims 1-12.
  15. 一种计算机可读存储介质,其特征在于,其上存储有计算机可读指令,当所述计算机可读指令被计算机的处理器执行时,使计算机执行权利要求1-12中的任一项所述的方法。A computer-readable storage medium, characterized in that computer-readable instructions are stored thereon, and when the computer-readable instructions are executed by the processor of the computer, the computer is caused to execute any one of claims 1-12. The method described.
PCT/CN2020/121349 2019-12-30 2020-10-16 Image processing method and apparatus, electronic device, and storage medium WO2021135509A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911398384.6 2019-12-30
CN201911398384.6A CN111144348A (en) 2019-12-30 2019-12-30 Image processing method, image processing device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2021135509A1 true WO2021135509A1 (en) 2021-07-08

Family

ID=70521940

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/121349 WO2021135509A1 (en) 2019-12-30 2020-10-16 Image processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN111144348A (en)
WO (1) WO2021135509A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658582A (en) * 2021-07-15 2021-11-16 中国科学院计算技术研究所 Voice-video cooperative lip language identification method and system

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium
CN113228045A (en) * 2020-05-18 2021-08-06 深圳市大疆创新科技有限公司 Image processing method, apparatus, removable platform, and storage medium
CN111967353B (en) * 2020-07-31 2024-05-14 北京金山云网络技术有限公司 Picture identification method, device, electronic equipment and medium
CN112085035A (en) * 2020-09-14 2020-12-15 北京字节跳动网络技术有限公司 Image processing method, image processing device, electronic equipment and computer readable medium
CN112257635A (en) * 2020-10-30 2021-01-22 杭州魔点科技有限公司 Method, system, electronic device and storage medium for filtering face false detection
CN112329683B (en) * 2020-11-16 2024-01-26 常州大学 Multi-channel convolutional neural network facial expression recognition method
CN112651301A (en) * 2020-12-08 2021-04-13 浙江工业大学 Expression recognition method integrating global and local features of human face
CN112614213B (en) * 2020-12-14 2024-01-23 杭州网易云音乐科技有限公司 Facial expression determining method, expression parameter determining model, medium and equipment
CN113111789B (en) * 2021-04-15 2022-12-20 山东大学 Facial expression recognition method and system based on video stream
CN113486867B (en) * 2021-09-07 2021-12-14 北京世纪好未来教育科技有限公司 Face micro-expression recognition method and device, electronic equipment and storage medium
CN114612987A (en) * 2022-03-17 2022-06-10 深圳集智数字科技有限公司 Expression recognition method and device
CN115035581A (en) * 2022-06-27 2022-09-09 闻泰通讯股份有限公司 Facial expression recognition method, terminal device and storage medium
CN115661909A (en) * 2022-12-14 2023-01-31 深圳大学 Face image processing method, device and computer readable storage medium
CN115937372B (en) * 2022-12-19 2023-10-03 北京字跳网络技术有限公司 Facial expression simulation method, device, equipment and storage medium
CN116912924B (en) * 2023-09-12 2024-01-05 深圳须弥云图空间科技有限公司 Target image recognition method and device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095827A (en) * 2014-04-18 2015-11-25 汉王科技股份有限公司 Facial expression recognition device and facial expression recognition method
CN105825192A (en) * 2016-03-24 2016-08-03 深圳大学 Facial expression identification method and system
CN106295566A (en) * 2016-08-10 2017-01-04 北京小米移动软件有限公司 Facial expression recognizing method and device
CN108256450A (en) * 2018-01-04 2018-07-06 天津大学 A kind of supervised learning method of recognition of face and face verification based on deep learning
CN109344693A (en) * 2018-08-13 2019-02-15 华南理工大学 A kind of face multizone fusion expression recognition method based on deep learning
WO2019143962A1 (en) * 2018-01-19 2019-07-25 Board Of Regents, The University Of Texas System Systems and methods for evaluating individual, group, and crowd emotion engagement and attention
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107292256B (en) * 2017-06-14 2019-12-24 西安电子科技大学 Auxiliary task-based deep convolution wavelet neural network expression recognition method
CN108573232B (en) * 2018-04-17 2021-07-23 中国民航大学 Human body action recognition method based on convolutional neural network
GB2586260B (en) * 2019-08-15 2021-09-15 Huawei Tech Co Ltd Facial image processing

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095827A (en) * 2014-04-18 2015-11-25 汉王科技股份有限公司 Facial expression recognition device and facial expression recognition method
CN105825192A (en) * 2016-03-24 2016-08-03 深圳大学 Facial expression identification method and system
CN106295566A (en) * 2016-08-10 2017-01-04 北京小米移动软件有限公司 Facial expression recognizing method and device
CN108256450A (en) * 2018-01-04 2018-07-06 天津大学 A kind of supervised learning method of recognition of face and face verification based on deep learning
WO2019143962A1 (en) * 2018-01-19 2019-07-25 Board Of Regents, The University Of Texas System Systems and methods for evaluating individual, group, and crowd emotion engagement and attention
CN109344693A (en) * 2018-08-13 2019-02-15 华南理工大学 A kind of face multizone fusion expression recognition method based on deep learning
CN111144348A (en) * 2019-12-30 2020-05-12 腾讯科技(深圳)有限公司 Image processing method, image processing device, electronic equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113658582A (en) * 2021-07-15 2021-11-16 中国科学院计算技术研究所 Voice-video cooperative lip language identification method and system
CN113658582B (en) * 2021-07-15 2024-05-07 中国科学院计算技术研究所 Lip language identification method and system for audio-visual collaboration

Also Published As

Publication number Publication date
CN111144348A (en) 2020-05-12

Similar Documents

Publication Publication Date Title
WO2021135509A1 (en) Image processing method and apparatus, electronic device, and storage medium
US20210279503A1 (en) Image processing method, apparatus, and device, and storage medium
US11810377B2 (en) Point cloud segmentation method, computer-readable storage medium, and computer device
US20220189142A1 (en) Ai-based object classification method and apparatus, and medical imaging device and storage medium
CN111563502B (en) Image text recognition method and device, electronic equipment and computer storage medium
EP3757905A1 (en) Deep neural network training method and apparatus
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
WO2020182121A1 (en) Expression recognition method and related device
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN114119638A (en) Medical image segmentation method integrating multi-scale features and attention mechanism
WO2021196389A1 (en) Facial action unit recognition method and apparatus, electronic device, and storage medium
WO2023098128A1 (en) Living body detection method and apparatus, and training method and apparatus for living body detection system
CN107609466A (en) Face cluster method, apparatus, equipment and storage medium
CN108830237B (en) Facial expression recognition method
WO2021203865A9 (en) Molecular binding site detection method and apparatus, electronic device and storage medium
Sun et al. Facial expression recognition using optimized active regions
CN112800903A (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
WO2021047587A1 (en) Gesture recognition method, electronic device, computer-readable storage medium, and chip
WO2021218238A1 (en) Image processing method and image processing apparatus
US20230095182A1 (en) Method and apparatus for extracting biological features, device, medium, and program product
WO2023284182A1 (en) Training method for recognizing moving target, method and device for recognizing moving target
US20220207913A1 (en) Method and device for training multi-task recognition model and computer-readable storage medium
WO2021127916A1 (en) Facial emotion recognition method, smart device and computer-readabel storage medium
CN110837777A (en) Partial occlusion facial expression recognition method based on improved VGG-Net
CN113269089A (en) Real-time gesture recognition method and system based on deep learning

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911135

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911135

Country of ref document: EP

Kind code of ref document: A1