CN111260697A - Target object identification method, system, device and medium - Google Patents

Target object identification method, system, device and medium Download PDF

Info

Publication number
CN111260697A
Authority
CN
China
Prior art keywords
image
human
face
frame
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010058535.XA
Other languages
Chinese (zh)
Inventor
周曦
姚志强
吴学纯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Yunconghuilin Artificial Intelligence Technology Co Ltd
Original Assignee
Shanghai Yunconghuilin Artificial Intelligence Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Yunconghuilin Artificial Intelligence Technology Co Ltd filed Critical Shanghai Yunconghuilin Artificial Intelligence Technology Co Ltd
Priority to CN202010058535.XA priority Critical patent/CN111260697A/en
Publication of CN111260697A publication Critical patent/CN111260697A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/292 Multi-camera tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/28 Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/246 Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30196 Human being; Person
    • G06T 2207/30201 Face
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30241 Trajectory

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method, a system, a device and a medium for identifying a target object, comprising the following steps: acquiring a single-frame or multi-frame image containing one or more human faces or human bodies; inputting the image into a layered vectorization model to obtain the human face or human body feature vector of the image; and identifying whether the image contains the human face or human body of one or more target objects according to that feature vector. The invention can identify whether a single-frame or multi-frame image contains the human face or human body of one or more target objects, determine which image acquisition device the image came from, and generate the motion information of the one or more target objects according to the geographic position corresponding to that device, thereby performing cross-camera tracking of the one or more target objects.

Description

Target object identification method, system, device and medium
Technical Field
The present invention relates to image recognition technologies, and in particular, to a method, a system, a device, and a medium for recognizing a target object.
Background
In recent years, face recognition technology has been widely applied in building smart cities, safe cities and the like. However, more than 80% of existing cameras cannot capture a clear face under all circumstances. In addition, as the counter-surveillance awareness of criminal suspects grows, suspects may deliberately avoid cameras, which makes it difficult to capture face information and raise an alarm in time, so such cases are hard to handle. Moreover, in a real scene a single camera often cannot cover every area, and the fields of view of multiple cameras generally do not overlap. Therefore, the present application provides a method, a system, a device and a medium for identifying a target object, which construct a complete movement trajectory of a pedestrian by recognizing faces in video, thereby realizing cross-camera tracking of pedestrians.
Disclosure of Invention
In view of the above shortcomings of the prior art, it is an object of the present invention to provide a method, a system, a device and a medium for identifying a target object, so as to solve the problem in the prior art that pedestrians cannot be tracked across cameras with non-overlapping fields of view.
To achieve the above and other related objects, the present invention provides a method for identifying a target object, comprising the steps of:
acquiring an image containing one or more human faces or human bodies;
inputting the image containing one or more human faces or human bodies into a layered vectorization model, and acquiring a human face or human body feature vector of the image;
and identifying whether the image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the image.
Optionally, the image comprises a single-frame or multi-frame image, the multi-frame image comprising one or more sequences of continuous frames or a plurality of single-frame images;
acquiring a single-frame or multi-frame image containing one or more human faces or human bodies;
inputting a certain frame of image containing one or more human faces or human bodies into a layered vectorization model to obtain a human face or human body feature vector of the certain frame of image;
and identifying whether the certain frame of image contains the human face or human body of one or more target objects according to the human face or human body feature vector of the certain frame of image.
Optionally, inputting the certain frame of image containing one or more human faces or human bodies into the layered vectorization model;
dividing the certain frame of image containing one or more human faces or human bodies into one or more image blocks;
extracting local features of each image block, and acquiring a local feature descriptor of each image block according to the local features;
quantizing the local feature descriptors of each image block to generate an image block feature dictionary;
according to the mapping between the image block feature dictionary and the certain frame image, encoding to form a human face or human body feature vector of the certain frame image;
and acquiring the face or human body feature vector of the certain frame of image.
Optionally, if a certain frame of image is identified as containing the human face or human body of the one or more target objects;
and acquiring each frame of image of the human face or the human body containing the one or more target objects, and determining the motion information of the one or more target objects according to the acquired each frame of image of the human face or the human body containing the one or more target objects.
Optionally, the motion information comprises at least one of: time of movement, geographical location of movement.
Optionally, each layer in the layered vectorization model includes one or more trained deep neural networks.
Optionally, the local features comprise at least one of: eye shape, nose shape, mouth shape, eye separation distance, positions of the facial features, face contour.
Optionally, the shape of the eyes, the shape of the nose and the shape of the mouth are taken as one layer, and the separation distance of the eyes, the positions of the facial features and the outline of the face are taken as another layer.
Optionally, the face or human body feature vector is not affected by interference factors, and the interference factors include at least one of: illumination, occlusion, angle, age, race.
Optionally, the image including one or more human faces or human bodies is acquired by one or more image acquisition devices.
Optionally, the geographical locations at which the one or more image capturing devices are installed comprise at least one of: residential areas, schools, stations, airports, shopping malls and hospitals.
The invention also provides a system for identifying the target object, which comprises the following components:
the image module is used for acquiring images containing one or more human faces or human bodies;
the characteristic module is used for inputting the image containing the one or more human faces or the human bodies into a layered vectorization model to obtain the human face or human body characteristic vector of the image;
and the identification module is used for identifying whether the image contains the human face or the human body of one or more target objects according to the human face or the human body characteristic vector of the image.
Optionally, the image comprises a single-frame or multi-frame image, and the multi-frame image comprises one or more sequences of continuous frames or a plurality of single-frame images.
Optionally, the feature module is specifically configured to:
inputting the certain frame of image containing one or more human faces or human bodies into the layered vectorization model;
dividing the certain frame of image containing one or more human faces or human bodies into one or more image blocks;
extracting local features of each image block, and acquiring a local feature descriptor of each image block according to the local features;
quantizing the local feature descriptors of each image block to generate an image block feature dictionary;
according to the mapping between the image block feature dictionary and the certain frame image, encoding to form a human face or human body feature vector of the certain frame image;
and acquiring the face or human body feature vector of the certain frame of image.
Optionally, if the recognition module identifies that a certain frame of image contains the human face or human body of the one or more target objects;
and acquiring each frame of image of the human face or the human body containing the one or more target objects, and determining the motion information of the one or more target objects according to the acquired each frame of image of the human face or the human body containing the one or more target objects.
Optionally, the motion information comprises at least one of: time of movement, geographical location of movement.
Optionally, each layer in the layered vectorization model includes one or more trained deep neural networks.
Optionally, the local features comprise at least one of: eye shape, nose shape, mouth shape, eye separation distance, positions of the facial features, face contour.
Optionally, the shape of the eyes, the shape of the nose and the shape of the mouth are taken as one layer, and the separation distance of the eyes, the positions of the facial features and the outline of the face are taken as another layer.
Optionally, the face or human body feature vector is not affected by interference factors, and the interference factors include at least one of: illumination, occlusion, angle, age, race.
Optionally, one or more consecutive frame images containing one or more faces are acquired by one or more image acquisition devices.
Optionally, the geographical locations at which the one or more image capturing devices are installed comprise at least one of: residential areas, schools, stations, airports, shopping malls and hospitals.
The invention also provides a target object recognition device, which comprises:
acquiring an image containing one or more human faces or human bodies;
inputting the image containing one or more human faces or human bodies into a layered vectorization model, and acquiring a human face or human body feature vector of the image;
and identifying whether the image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the image.
The present invention also provides an apparatus comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform a method as described in one or more of the above.
The present invention also provides one or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the methods as described in one or more of the above.
As described above, the method, system, device and medium for identifying a target object provided by the present invention have the following advantages: an image containing one or more human faces or human bodies is acquired; the image is input into a layered vectorization model to obtain the human face or human body feature vector of the image; and whether the image contains the human face or human body of one or more target objects is identified according to that feature vector. The invention can identify whether a single-frame or multi-frame image contains the human face or human body of one or more target objects, then determine which image acquisition device the image came from, and generate the motion information of the one or more target objects according to the geographic position corresponding to that device, thereby performing cross-camera tracking of the one or more target objects.
Drawings
Fig. 1 is a schematic flowchart of a target object identification method according to an embodiment;
fig. 2 is a schematic flowchart of a target object identification method according to another embodiment;
fig. 3 is a schematic diagram of a connection structure of a layered vectorization model according to an embodiment;
fig. 4 is a schematic hardware structure diagram of a target object recognition system according to an embodiment;
fig. 5 is a schematic hardware structure diagram of a terminal device according to an embodiment;
fig. 6 is a schematic diagram of a hardware structure of a terminal device according to another embodiment.
Description of the element reference numerals
M10 image module
M20 feature Module
M30 identification module
1100 input device
1101 first processor
1102 output device
1103 first memory
1104 communication bus
1200 processing component
1201 second processor
1202 second memory
1203 communication component
1204 power supply component
1205 multimedia component
1206 voice component
1207 input/output interface
1208 sensor component
Detailed Description
The embodiments of the present invention are described below with reference to specific embodiments, and other advantages and effects of the present invention will be easily understood by those skilled in the art from the disclosure of the present specification. The invention is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention. It is to be noted that the features in the following embodiments and examples may be combined with each other without conflict.
It should be noted that the drawings provided in the following embodiments are only intended to illustrate the basic idea of the invention in a schematic way. The drawings show only the components related to the invention, rather than the number, shape and size of the components in an actual implementation; in practice the type, quantity and proportion of each component may vary, and the component layout may be more complicated.
The invention provides a target object identification method, which comprises the following steps:
acquiring an image containing one or more human faces or human bodies; the image comprises a single-frame image or a multi-frame image, and the multi-frame image comprises one or more sequences of continuous frames or a plurality of single-frame images; as an example, a plurality of single-frame images may be combined into one multi-frame image.
Inputting the image containing one or more human faces or human bodies into a layered vectorization model, and acquiring a human face or human body feature vector of the image;
and identifying whether the image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the image.
Specifically, as shown in fig. 1, in one embodiment,
S100, acquiring a single-frame or multi-frame image containing one or more human faces or human bodies;
S200, inputting a certain frame of image containing one or more human faces or human bodies into a layered vectorization model, and acquiring the human face or human body feature vector of the certain frame of image;
S300, identifying whether the certain frame of image contains the human face or human body of one or more target objects according to the human face or human body feature vector of the certain frame of image.
The method obtains a single-frame or multi-frame image containing a human face or human body, inputs one frame of the single-frame or multi-frame image into the layered vectorization model, and obtains the human face or human body feature vector of that frame; it then identifies, according to the human face or human body feature vector of that frame, whether the one or more human faces or human bodies in the frame include the human face or human body of one or more target objects, and thereby judges whether the target objects appear in the single-frame or multi-frame image.
The layered vectorization model is, in essence, a multi-layer feature-coding process. Single-layer feature coding consists of the following steps: first, every image containing a human face or human body in the picture library is partitioned into blocks; second, local features (such as LBP and SIFT) are extracted from each block to form a local feature descriptor; then, all local feature descriptors are quantized to form a dictionary; finally, according to the mapping between the dictionary and the face or human body image, a face or human body feature vector of that image is formed by encoding, and is defined as the face or human body DNA.
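For concreteness, the single-layer feature coding just described can be sketched in code. The following is a minimal sketch, assuming LBP block descriptors and k-means quantization; the patent does not fix the block size, the descriptor type (LBP and SIFT are given only as examples), or the dictionary size, and all function names below are illustrative rather than taken from the patent.

```python
# Minimal sketch of one layer of the feature-coding pipeline: block the image,
# describe each block with an LBP histogram, quantize descriptors against a
# k-means dictionary, and encode the image as a histogram of visual words.
# Block size, descriptor, and dictionary size are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def lbp_histogram(block: np.ndarray) -> np.ndarray:
    """Normalized 8-neighbour LBP histogram of a grayscale block."""
    center = block[1:-1, 1:-1]
    code = np.zeros_like(center, dtype=np.uint8)
    shifts = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
              (1, 1), (1, 0), (1, -1), (0, -1)]
    h, w = block.shape
    for bit, (dy, dx) in enumerate(shifts):
        neighbour = block[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        code |= (neighbour >= center).astype(np.uint8) << bit
    hist = np.bincount(code.ravel(), minlength=256).astype(np.float32)
    return hist / max(hist.sum(), 1.0)

def image_blocks(img: np.ndarray, size: int = 16):
    """Partition a grayscale image into non-overlapping size x size blocks."""
    h, w = img.shape
    for y in range(0, h - size + 1, size):
        for x in range(0, w - size + 1, size):
            yield img[y:y + size, x:x + size]

def build_dictionary(gallery: list[np.ndarray], n_words: int = 128) -> KMeans:
    """Quantize all local descriptors in a picture library into a dictionary."""
    descriptors = np.vstack([lbp_histogram(b)
                             for img in gallery for b in image_blocks(img)])
    return KMeans(n_clusters=n_words, n_init=10).fit(descriptors)

def encode(img: np.ndarray, dictionary: KMeans) -> np.ndarray:
    """Single-layer feature vector: L2-normalized histogram of visual words."""
    descriptors = np.vstack([lbp_histogram(b) for b in image_blocks(img)])
    words = dictionary.predict(descriptors)
    vec = np.bincount(words, minlength=dictionary.n_clusters).astype(np.float32)
    return vec / max(float(np.linalg.norm(vec)), 1e-12)
```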
In an exemplary embodiment, in particular:
S210, inputting the certain frame of image containing one or more human faces or human bodies into the layered vectorization model; each layer in the layered vectorization model includes one or more trained deep neural networks.
S220, segmenting the certain frame of image containing one or more human faces or human bodies into one or more image blocks;
S230, extracting local features of each image block, and obtaining a local feature descriptor of each image block from the local features; the local features in the embodiments of the present application include at least one of the following: eye shape, nose shape, mouth shape, eye separation distance, positions of the facial features, face contour. As an example, in computer face recognition the apparent features of a human face may be layered: the size and shape of the eyes (e.g., phoenix eyes, large round eyes), the shape of the nose (e.g., aquiline nose, flat nose), and the size and shape of the mouth (e.g., small cherry mouth) form the first layer; the separation distance of the eyes, the positions of the facial features, the contour of the face, and the like form the second layer.
S240, quantizing the local feature descriptors of each image block to generate an image block feature dictionary;
S250, encoding to form the human face or human body feature vector of the certain frame of image according to the mapping between the image block feature dictionary and the certain frame of image;
S260, obtaining the human face or human body feature vector of the certain frame of image, which is defined as the face or human body DNA. The human face or human body feature vector is not affected by interference factors, and the interference factors include at least one of the following: illumination, occlusion, angle, age, race.
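To illustrate the layered structure noted in S210 (each layer containing one or more trained deep neural networks), a skeletal composition is sketched below. How the per-layer outputs are combined is not specified by the patent; the concatenation scheme, the class name, and the encoder signature here are assumptions made purely for illustration.

```python
# Hedged sketch of a layered vectorization model: each layer holds one or
# more trained encoders (arbitrary callables mapping an image to a vector),
# and the final face/body feature vector ("DNA") concatenates all outputs.
# The concatenation scheme is an assumption, not a detail from the patent.
from typing import Callable, Sequence
import numpy as np

Encoder = Callable[[np.ndarray], np.ndarray]

class LayeredVectorizationModel:
    def __init__(self, layers: Sequence[Sequence[Encoder]]):
        # e.g. layer 1: networks for eye/nose/mouth shape;
        #      layer 2: networks for eye distance, feature positions, contour
        self.layers = layers

    def __call__(self, image: np.ndarray) -> np.ndarray:
        parts = [net(image) for layer in self.layers for net in layer]
        vec = np.concatenate(parts)
        return vec / max(float(np.linalg.norm(vec)), 1e-12)
```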
In some exemplary embodiments, the single-frame or multi-frame image containing one or more human faces or human bodies is acquired by one or more image acquisition devices. As an example, the image acquisition device in the present application may be a camera; for instance, network cameras that have already been installed may be reused, and the surveillance video may be obtained from them. Places such as residential areas, schools, stations, airports, shopping malls and hospitals usually have heavy foot traffic and therefore cover a large number of people; the geographical locations at which the one or more image acquisition devices are installed in the embodiments of the present application include at least one of: residential areas, schools, stations, airports, shopping malls and hospitals.
In an exemplary embodiment, if the certain frame of image is identified as containing the face or human body of the one or more target objects, each frame of image containing the face or human body of the one or more target objects is obtained from the single-frame or multi-frame image, and the motion information of the one or more target objects is determined from those frames. The motion information includes at least one of: time of movement, geographic location of movement. Specifically, one or more videos shot by one or more cameras are obtained, and it is then determined whether the picture presented by each frame of image in the videos contains the face or human body of one or more target objects. If frames of some videos do contain the face or human body of one or more target objects, the movement time and movement location of the one or more target objects are determined from those frames. As an example, suppose videos shot by 5 cameras in a residential area are obtained, one video per camera. The 5 videos are manually reviewed for the presence of a face or human body, the segments in which a face or human body appears are cut out, those segments are split into individual frames containing a face or human body, each such frame is input into the layered vectorization model, and the face or human body feature vector of each frame is obtained; whether each frame contains the face or human body of one or more target objects is then identified from its feature vector. Each layer of the layered vectorization model includes one or more trained deep neural networks, which are trained on images containing the faces or human bodies of target objects. If the face or human body of one or more target objects is identified in some video segments, the movement time of the one or more target objects is read directly from those segments; it is then determined which cameras the segments came from, and the approximate movement location of the one or more target objects is obtained from the installation positions of those cameras. Cross-camera tracking of the one or more target objects can thus be achieved. The target object in the embodiments of the present application is a person, such as a lost child or a suspect in a certain case.
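The identification and motion-information steps described in this paragraph can be sketched as follows, assuming cosine similarity for matching against a target's feature vector and a fixed camera-to-location table. The threshold, the FrameRecord fields, and the location table are illustrative assumptions, not details fixed by the patent.

```python
# Sketch of cross-camera tracking: keep the frames whose feature vector
# matches the target, order them by time, and map each camera id to the
# geographic position where that camera is installed.
from dataclasses import dataclass
import numpy as np

@dataclass
class FrameRecord:
    camera_id: str
    timestamp: float      # seconds, read from the source video
    feature: np.ndarray   # face/body feature vector of this frame

# Hypothetical installation positions of the cameras.
CAMERA_LOCATIONS = {"cam-01": "residential area, north gate",
                    "cam-02": "residential area, car park"}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) /
                 (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def motion_info(frames: list[FrameRecord], target_vec: np.ndarray,
                threshold: float = 0.8) -> list[tuple[float, str]]:
    """Time-ordered (movement time, movement location) pairs for one target."""
    hits = sorted((f for f in frames
                   if cosine(f.feature, target_vec) >= threshold),
                  key=lambda f: f.timestamp)
    return [(f.timestamp, CAMERA_LOCATIONS.get(f.camera_id, "unknown"))
            for f in hits]
```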
The method obtains a single-frame or multi-frame image containing one or more human faces or human bodies; inputs a certain frame of image containing one or more human faces or human bodies into the layered vectorization model to obtain the human face or human body feature vector of that frame; and identifies whether that frame contains the human face or human body of one or more target objects according to its human face or human body feature vector. The method can identify whether a single-frame or multi-frame image contains the human face or human body of one or more target objects, then determine which image acquisition device the image came from, and generate the motion information of the one or more target objects according to the geographic position corresponding to that device, thereby performing cross-camera tracking of the one or more target objects.
As shown in fig. 3 and 4, the present invention further provides a target object recognition system, which includes:
an image module M10, configured to obtain an image including one or more human faces or human bodies; the image comprises a single-frame image or a multi-frame image, and the multi-frame image comprises one or more continuous frame images and a plurality of single-frame images; as an example, a plurality of single-frame images may be synthesized into one multi-frame image.
The feature module M20 is configured to input an image including the one or more faces or human bodies into a hierarchical vectorization model, and obtain a face or human body feature vector of the image;
and the identification module M30 is configured to identify whether the image contains a human face or a human body of one or more target objects according to the human face or human body feature vector of the image.
In particular, in one embodiment,
an image module M10, configured to obtain a single-frame or multi-frame image containing one or more human faces or human bodies;
a feature module M20, configured to input a certain frame of image including the one or more faces or human bodies into a layered vectorization model, and obtain a face or human body feature vector of the certain frame of image;
and the identification module M30 is configured to identify whether the certain frame image includes a face or a human body of one or more target objects according to the face or human body feature vector of the certain frame image.
The system obtains a single-frame or multi-frame image containing a human face or human body, inputs a certain frame of the single-frame or multi-frame image into the layered vectorization model, and obtains the human face or human body feature vector of that frame; it then identifies, according to the human face or human body feature vector of that frame, whether the one or more human faces or human bodies in the frame include the human face or human body of one or more target objects, and thereby judges whether the target objects appear in the single-frame or multi-frame image.
The layered vectorization model is, in essence, a multi-layer feature-coding process. Single-layer feature coding consists of the following steps: first, every image containing a human face or human body in the picture library is partitioned into blocks; second, local features (such as LBP and SIFT) are extracted from each block to form a local feature descriptor; then, all local feature descriptors are quantized to form a dictionary; finally, according to the mapping between the dictionary and the face or human body image, a face or human body feature vector of that image is formed by encoding, and is defined as the face or human body DNA.
As shown in fig. 3, in an exemplary embodiment, the feature module is specifically configured to:
inputting the certain frame of image containing one or more human faces or human bodies into the layered vectorization model; each layer in the layered vectorization model comprises one or more deep neural networks after training is completed.
Dividing the certain frame of image containing one or more human faces or human bodies into one or more image blocks;
extracting local features of each image block, and obtaining a local feature descriptor of each image block from the local features; the local features in the embodiments of the present application include at least one of the following: eye shape, nose shape, mouth shape, eye separation distance, positions of the facial features, face contour. As an example, in computer face recognition the apparent features of a human face may be layered: the size and shape of the eyes (e.g., phoenix eyes, large round eyes), the shape of the nose (e.g., aquiline nose, flat nose), and the size and shape of the mouth (e.g., small cherry mouth) form the first layer; the separation distance of the eyes, the positions of the facial features, the contour of the face, and the like form the second layer.
Quantizing the local feature descriptors of each image block to generate an image block feature dictionary;
according to the mapping between the image block feature dictionary and the certain frame image, encoding to form a human face or human body feature vector of the certain frame image;
and acquiring the face or human body feature vector of the certain frame of image; as an example, the face or human body feature vector of the certain frame of image is defined as the face or human body DNA. The human face or human body feature vector is not affected by interference factors, and the interference factors include at least one of the following: illumination, occlusion, angle, age, race.
In some exemplary embodiments, the single-frame or multi-frame image containing one or more human faces or human bodies is acquired by one or more image acquisition devices. As an example, the image acquisition device in the present application may be a camera; for instance, network cameras that have already been installed may be reused, and the surveillance video may be obtained from them. Places such as residential areas, schools, stations, airports, shopping malls and hospitals usually have heavy foot traffic and therefore cover a large number of people. The geographical locations at which the one or more image acquisition devices are installed in the embodiments of the present application include at least one of: residential areas, schools, stations, airports, shopping malls and hospitals.
In an exemplary embodiment, if the certain frame of image is identified as containing the face or human body of the one or more target objects, each frame of image containing the face or human body of the one or more target objects is obtained from the single-frame or multi-frame image, and the motion information of the one or more target objects is determined from those frames. The motion information includes at least one of: time of movement, geographic location of movement. Specifically, one or more videos shot by one or more cameras are obtained, and it is then determined whether the picture presented by each frame of image in the videos contains the face or human body of one or more target objects. If frames of some videos do contain the face or human body of one or more target objects, the movement time and movement location of the one or more target objects are determined from those frames. As an example, suppose videos shot by 10 cameras in a certain hospital are obtained, one video per camera. The 10 videos are manually reviewed for the presence of a face or human body, the segments in which a face or human body appears are cut out, those segments are split into individual frames containing a face or human body, each such frame is input into the layered vectorization model, and the face or human body feature vector of each frame is obtained; whether each frame contains the face or human body of one or more target objects is then identified from its feature vector. Each layer of the layered vectorization model includes one or more trained deep neural networks, which are trained on images containing the faces or human bodies of target objects. If the face or human body of one or more target objects is identified in some video segments, the movement time of the one or more target objects is read directly from those segments; it is then determined which cameras the segments came from, and the approximate movement location of the one or more target objects is obtained from the installation positions of those cameras. Cross-camera tracking of the one or more target objects can thus be achieved. The target object in the embodiments of the present application is a person, such as a doctor, a patient, or a ticket vendor.
The system acquires a single-frame or multi-frame image containing one or more human faces or human bodies through the image module; inputs a certain frame of image containing one or more human faces or human bodies into the layered vectorization model through the feature module, and acquires the human face or human body feature vector of that frame; and identifies, through the identification module, whether that frame contains the human face or human body of one or more target objects according to its human face or human body feature vector. The system can identify whether a single-frame or multi-frame image contains the human face or human body of one or more target objects, then determine which image acquisition device the image came from, and generate the motion information of the one or more target objects according to the geographic position corresponding to that device, thereby performing cross-camera tracking of the one or more target objects.
The embodiment of the present application further provides a target object identification device, including:
acquiring an image containing one or more human faces or human bodies; the image comprises a single-frame image or a multi-frame image, and the multi-frame image comprises one or more sequences of continuous frames or a plurality of single-frame images; as an example, a plurality of single-frame images may be combined into one multi-frame image.
Inputting a certain frame of image containing one or more human faces or human bodies into a layered vectorization model to obtain a human face or human body feature vector of the certain frame of image;
and identifying whether the certain frame image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the certain frame image.
In this embodiment, the target object recognition device executes the system or the method described above; for specific functions and technical effects, reference may be made to the above embodiments, which are not repeated here.
An embodiment of the present application further provides an apparatus, which may include: one or more processors; and one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method of fig. 1. In practical applications, the apparatus may be used as a terminal device or as a server. Examples of the terminal device may include: a smart phone, a tablet computer, an e-book reader, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, a mobile computer, a vehicle-mounted computer, a desktop computer, a set-top box, a smart television, a wearable device, and the like; the embodiments of the present application do not limit the specific form of the device.
Embodiments of the present application also provide a non-transitory readable storage medium, where one or more modules (programs) are stored in the storage medium, and when the one or more modules are applied to a device, the device may execute instructions (instructions) included in the method in fig. 1 according to the embodiments of the present application.
Fig. 5 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103, and at least one communication bus 1104. The communication bus 1104 is used to implement communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may also include a non-volatile storage NVM, such as at least one disk memory, and the first memory 1103 may store various programs for performing various processing functions and implementing the method steps of the present embodiment.
Alternatively, the first processor 1101 may be, for example, a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Optionally, the input device 1100 may include a variety of input devices, such as at least one of a user-oriented user interface, a device-oriented device interface, a software programmable interface, a camera, and a sensor. Optionally, the device interface facing the device may be a wired interface for data transmission between devices, or may be a hardware plug-in interface (e.g., a USB interface, a serial port, etc.) for data transmission between devices; optionally, the user-facing user interface may be, for example, a user-facing control key, a voice input device for receiving voice input, and a touch sensing device (e.g., a touch screen with a touch sensing function, a touch pad, etc.) for receiving user touch input; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, such as an input pin interface or an input interface of a chip; the output devices 1102 may include output devices such as a display, audio, and the like.
In this embodiment, the processor of the terminal device includes functions for executing each module of the target object recognition system described above; for specific functions and technical effects, reference may be made to the above embodiments, which are not repeated here.
Fig. 6 is a schematic hardware structure diagram of a terminal device according to an embodiment of the present application. FIG. 6 is a specific embodiment of the implementation of FIG. 5. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the method described in fig. 1 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, videos, and so forth. The second memory 1202 may include a Random Access Memory (RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, the second processor 1201 is provided in the processing component 1200. The terminal device may further include: a communication component 1203, a power supply component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207, and/or a sensor component 1208. The specific components included in the terminal device are set according to actual requirements, which is not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing component 1200 may include one or more second processors 1201 to execute instructions, so as to perform all or part of the steps of the data processing method described above. Further, the processing component 1200 may include one or more modules that facilitate interaction between the processing component 1200 and other components. For example, the processing component 1200 may include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. The power components 1204 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the terminal device.
The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received speech signal may further be stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the speech component 1206 further comprises a speaker for outputting speech signals.
The input/output interface 1207 provides an interface between the processing component 1200 and peripheral interface modules, which may be click wheels, buttons, etc. These buttons may include, but are not limited to: a volume button, a start button, and a lock button.
The sensor component 1208 includes one or more sensors for providing status assessments of various aspects of the terminal device. For example, the sensor component 1208 may detect the open/closed state of the terminal device, the relative positioning of components, and the presence or absence of user contact with the terminal device. The sensor component 1208 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor component 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communications between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card therein, so that the terminal device may log onto a GPRS network to establish communication with the server via the internet.
As can be seen from the above, the communication component 1203, the voice component 1206, the input/output interface 1207 and the sensor component 1208 referred to in the embodiment of fig. 6 can be implemented as the input device in the embodiment of fig. 5.
The foregoing embodiments merely illustrate the principles and utilities of the present invention and are not intended to limit it. Any person skilled in the art may modify or change the above embodiments without departing from the spirit and scope of the present invention. Accordingly, all equivalent modifications or changes made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed in the present invention shall still be covered by the claims of the present invention.

Claims (25)

1. A method for identifying a target object, comprising the steps of:
acquiring an image containing one or more human faces or human bodies;
inputting the image containing one or more human faces or human bodies into a layered vectorization model, and acquiring a human face or human body feature vector of the image;
and identifying whether the image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the image.
2. The method according to claim 1, wherein the image comprises a single-frame or multi-frame image, the multi-frame image comprising one or more sequences of continuous frames or a plurality of single-frame images;
acquiring a single-frame or multi-frame image containing one or more human faces or human bodies;
inputting a certain frame of image containing one or more human faces or human bodies into a layered vectorization model to obtain a human face or human body feature vector of the certain frame of image;
and identifying whether the certain frame image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the certain frame image.
3. The method for identifying a target object according to claim 1 or 2, wherein the certain frame of image containing one or more human faces or human bodies is input into the layered vectorization model;
dividing the certain frame of image containing one or more human faces or human bodies into one or more image blocks;
extracting local features of each image block, and acquiring a local feature descriptor of each image block according to the local features;
quantizing the local feature descriptors of each image block to generate an image block feature dictionary;
according to the mapping between the image block feature dictionary and the certain frame image, encoding to form a human face or human body feature vector of the certain frame image;
and acquiring the face or human body feature vector of the certain frame of image.
4. The method according to claim 1 or 2, wherein if a certain frame of image is recognized to contain the human face or human body of the one or more target objects;
and acquiring each frame of image of the human face or the human body containing the one or more target objects, and determining the motion information of the one or more target objects according to the acquired each frame of image of the human face or the human body containing the one or more target objects.
5. The method of claim 4, wherein the motion information comprises at least one of: time of movement, geographical location of movement.
6. The method for identifying a target object according to claim 1 or 2, wherein each layer in the layered vectorization model comprises one or more deep neural networks after training.
7. The method of claim 3, wherein the local features comprise at least one of: eye shape, nose shape, mouth shape, eye separation distance, positions of the facial features, face contour.
8. The method according to claim 7, wherein the shape of the eyes, the shape of the nose and the shape of the mouth are taken as one layer, and the separation distance of the eyes, the positions of the facial features and the outline of the face are taken as another layer.
9. The method according to any one of claims 1 to 3, wherein the face or human body feature vectors are not affected by interference factors, the interference factors including at least one of: illumination, occlusion, angle, age, race.
10. The method according to claim 1 or 2, wherein the image containing one or more human faces or bodies is acquired by one or more image acquisition devices.
11. The method of claim 10, wherein the geographical locations at which the one or more image capture devices are installed comprise at least one of: residential areas, schools, stations, airports, shopping malls and hospitals.
12. A system for identifying a target object, comprising:
the image module is used for acquiring images containing one or more human faces or human bodies;
the characteristic module is used for inputting the image containing the one or more human faces or the human bodies into a layered vectorization model to obtain the human face or human body characteristic vector of the image;
and the identification module is used for identifying whether the image contains the human face or the human body of one or more target objects according to the human face or the human body characteristic vector of the image.
13. The system of claim 12, wherein the image comprises a single-frame or multi-frame image, the multi-frame image comprising one or more sequences of continuous frames or a plurality of single-frame images.
14. The system for identifying a target object of claim 12 or 13, wherein the feature module is specifically configured to:
inputting the certain frame of image containing one or more human faces or human bodies into the layered vectorization model;
dividing the certain frame of image containing one or more human faces or human bodies into one or more image blocks;
extracting local features of each image block, and acquiring a local feature descriptor of each image block according to the local features;
quantizing the local feature descriptors of each image block to generate an image block feature dictionary;
according to the mapping between the image block feature dictionary and the certain frame image, encoding to form a human face or human body feature vector of the certain frame image;
and acquiring the face or human body feature vector of the certain frame of image.
15. The system for identifying a target object according to claim 12 or 13, wherein if the identification module identifies that a frame of image contains the human face or the human body of the one or more target objects;
and acquiring each frame of image of the human face or the human body containing the one or more target objects, and determining the motion information of the one or more target objects according to the acquired each frame of image of the human face or the human body containing the one or more target objects.
16. The system of claim 15, wherein the motion information comprises at least one of: time of movement, geographical location of movement.
17. A target object recognition system according to claim 12 or 13, wherein each layer of the layered vectorized model comprises one or more trained deep neural networks.
18. The target object identification system of claim 14, wherein the local features comprise at least one of: eye shape, nose shape, mouth shape, eye separation distance, positions of the facial features, face contour.
19. The system for identifying a target object according to claim 18, wherein the shape of the eyes, the shape of the nose and the shape of the mouth are taken as one layer, and the separation distance of the eyes, the positions of the facial features and the outline of the face are taken as another layer.
20. The system of any one of claims 12 to 14, wherein the face or human body feature vectors are not affected by interference factors, the interference factors including at least one of: illumination, occlusion, angle, age, race.
21. A target object recognition system according to claim 12 or 13, wherein one or more successive frame images containing one or more human faces are acquired by one or more image acquisition devices.
22. The system of claim 21, wherein the geographical locations at which the one or more image capture devices are installed comprise at least one of: residential areas, schools, stations, airports, shopping malls and hospitals.
23. An apparatus for identifying a target object, comprising:
acquiring an image containing one or more human faces or human bodies;
inputting the image containing one or more human faces or human bodies into a layered vectorization model, and acquiring a human face or human body feature vector of the image;
and identifying whether the image contains the human face or the human body of one or more target objects or not according to the human face or the human body feature vector of the image.
24. An apparatus, comprising:
one or more processors; and
one or more machine-readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the method recited by one or more of claims 1-11.
25. One or more machine-readable media having instructions stored thereon, which when executed by one or more processors, cause an apparatus to perform the method recited by one or more of claims 1-11.
CN202010058535.XA 2020-01-19 2020-01-19 Target object identification method, system, device and medium Pending CN111260697A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010058535.XA CN111260697A (en) 2020-01-19 2020-01-19 Target object identification method, system, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010058535.XA CN111260697A (en) 2020-01-19 2020-01-19 Target object identification method, system, device and medium

Publications (1)

Publication Number Publication Date
CN111260697A true CN111260697A (en) 2020-06-09

Family

ID=70949295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010058535.XA Pending CN111260697A (en) 2020-01-19 2020-01-19 Target object identification method, system, device and medium

Country Status (1)

Country Link
CN (1) CN111260697A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813996A (en) * 2020-07-22 2020-10-23 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN112052792A (en) * 2020-09-04 2020-12-08 恒睿(重庆)人工智能技术研究院有限公司 Cross-model face recognition method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923646A (en) * 2010-07-07 2010-12-22 周曦 Layered vectorization-based image information expression system
CN105354543A (en) * 2015-10-29 2016-02-24 小米科技有限责任公司 Video processing method and apparatus
CN110287818A (en) * 2019-06-05 2019-09-27 广州市森锐科技股份有限公司 Face feature vector optimization method based on layered vectorization

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101923646A (en) * 2010-07-07 2010-12-22 周曦 Layered vectorization-based image information expression system
CN105354543A (en) * 2015-10-29 2016-02-24 小米科技有限责任公司 Video processing method and apparatus
CN110287818A (en) * 2019-06-05 2019-09-27 广州市森锐科技股份有限公司 Face feature vector optimization method based on layered vectorization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI JIN, ZHOU XI ET AL.: "Research on key technologies of face recognition based on a double-layer heterogeneous deep neural network model", Telecom Engineering Technics and Standardization *
DENG NAN: "Combined application of video surveillance systems and face recognition technology", Computer CD Software and Applications *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813996A (en) * 2020-07-22 2020-10-23 四川长虹电器股份有限公司 Video searching method based on sampling parallelism of single frame and continuous multi-frame
CN112052792A (en) * 2020-09-04 2020-12-08 恒睿(重庆)人工智能技术研究院有限公司 Cross-model face recognition method, device, equipment and medium
CN112052792B (en) * 2020-09-04 2022-04-26 恒睿(重庆)人工智能技术研究院有限公司 Cross-model face recognition method, device, equipment and medium

Similar Documents

Publication Publication Date Title
CN110163048B (en) Hand key point recognition model training method, hand key point recognition method and hand key point recognition equipment
CN108960209B (en) Identity recognition method, identity recognition device and computer readable storage medium
CN110929770A (en) Intelligent tracking method, system and equipment based on image processing and readable medium
CN106295515B (en) Determine the method and device of the human face region in image
CN110807361A (en) Human body recognition method and device, computer equipment and storage medium
CN111340848A (en) Object tracking method, system, device and medium for target area
CN112200187A (en) Target detection method, device, machine readable medium and equipment
CN112749613B (en) Video data processing method, device, computer equipment and storage medium
CN105654039A (en) Image processing method and device
CN110929619A (en) Target object tracking method, system and device based on image processing and readable medium
CN112529939A (en) Target track matching method and device, machine readable medium and equipment
CN107977636B (en) Face detection method and device, terminal and storage medium
CN111027490A (en) Face attribute recognition method and device and storage medium
CN111291638A (en) Object comparison method, system, equipment and medium
CN111488774A (en) Image processing method and device for image processing
CN111339943A (en) Object management method, system, platform, equipment and medium
CN111310725A (en) Object identification method, system, machine readable medium and device
CN111260697A (en) Target object identification method, system, device and medium
CN112818807A (en) Tumble detection method, tumble detection device, tumble detection apparatus, and storage medium
CN111199169A (en) Image processing method and device
CN111783674A (en) Face recognition method and system based on AR glasses
CN111401206A (en) Panorama sharing method, system, device and medium
CN110363187B (en) Face recognition method, face recognition device, machine readable medium and equipment
CN112580472A (en) Rapid and lightweight face recognition method and device, machine readable medium and equipment
CN110222576B (en) Boxing action recognition method and device and electronic equipment

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200609)