WO2024001617A1 - Mobile phone playing behavior recognition method and device (玩手机行为识别方法及装置) - Google Patents

玩手机行为识别方法及装置 (Mobile phone playing behavior recognition method and device)

Info

Publication number
WO2024001617A1
Authority
WO
WIPO (PCT)
Prior art keywords
mobile phone
target person
behavior
behavior recognition
person
Application number
PCT/CN2023/095778
Other languages
English (en)
French (fr)
Inventor
徐志红 (Xu Zhihong)
Original Assignee
京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
Application filed by 京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
Publication of WO2024001617A1


Classifications

    • G06V20/597 Scenes inside of a vehicle: recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G06N3/08 Computing arrangements based on biological models: neural networks; learning methods
    • G06T7/62 Image analysis: analysis of geometric attributes of area, perimeter, diameter or volume
    • G06T7/66 Image analysis: analysis of geometric attributes of image moments or centre of gravity
    • G06T7/73 Image analysis: determining position or orientation of objects or cameras using feature-based methods
    • G06V10/25 Image preprocessing: determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/44 Feature extraction: local feature extraction by analysis of parts of the pattern, e.g. edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V10/46 Feature extraction: descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; salient regional features
    • G06V10/764 Recognition using pattern recognition or machine learning: classification, e.g. of video objects
    • G06V10/82 Recognition using pattern recognition or machine learning: neural networks
    • G06V40/107 Human bodies or body parts: static hand or arm
    • G06V40/197 Eye characteristics: matching; classification
    • G06V40/28 Movements or behaviour: recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G06T2207/20084 Image analysis indexing scheme: artificial neural networks [ANN]
    • G06T2207/20088 Image analysis indexing scheme: trinocular vision calculations; trifocal tensor
    • G06T2207/30196 Subject of image: human being; person
    • G06T2207/30268 Subject of image: vehicle interior

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular to a method and device for identifying mobile phone playing behavior.
  • a mobile phone playing behavior recognition method includes: obtaining an image to be recognized, and extracting an image of a region of interest containing a target person from the image to be recognized.
  • the image of the area of interest is input into the first behavior recognition model to obtain the first behavior recognition result of the target person.
  • the first behavior recognition result is used to indicate whether the target person has the behavior of playing with a mobile phone.
  • the image of the area of interest is input into the second behavior recognition model to obtain the second behavior recognition result of the target person.
  • the second behavior recognition result is used to indicate whether the target person is playing with a mobile phone. The first behavior recognition result and the second behavior recognition result are compared; if they are inconsistent, behavior recognition processing is performed on the target person based on the region-of-interest image to determine whether the target person is playing with a mobile phone (see the sketch below).
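As an illustration only, not the patent's implementation, here is a minimal Python sketch of the dual-model comparison just described; the model callables and the detection-based fallback are assumed names:

```python
def recognize_phone_behavior(roi_image, first_model, second_model, fallback):
    """Dual-model recognition with a detection-based fallback.

    `first_model` / `second_model` stand in for the trained inception-type
    and residual-type classifiers; each returns True ("playing with a
    mobile phone") or False. `fallback` is the detection-based behavior
    recognition processing described in the later embodiments.
    """
    result_1 = first_model(roi_image)   # first behavior recognition result
    result_2 = second_model(roi_image)  # second behavior recognition result
    if result_1 == result_2:
        # Consistent results from two different models: use either one.
        return result_1
    # Inconsistent results: re-examine the ROI with detection models.
    return fallback(roi_image)
```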
  • the above method further includes: if the first behavior recognition result is consistent with the second behavior recognition result, determining whether the target person has mobile phone playing behavior based on the first behavior recognition result or the second behavior recognition result.
  • the above-mentioned behavior recognition processing performed on the target person based on the region-of-interest image to determine whether the target person is playing with a mobile phone includes: inputting the region-of-interest image into the mobile phone detection model and into the person detection model; if no mobile phone is detected in the region-of-interest image, determining that the target person is not playing with a mobile phone; if a mobile phone is detected in the region-of-interest image, determining whether the target person is playing with a mobile phone based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model.
  • determining whether the target person is playing with a mobile phone based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model includes: determining the degree of coincidence between the mobile phone frame and the person frame; if the degree of coincidence is greater than or equal to a preset coincidence threshold, determining that the target person is playing with a mobile phone; if the degree of coincidence is less than the preset coincidence threshold, determining that the target person is not playing with a mobile phone.
  • the above-mentioned determination of the degree of coincidence between the mobile phone frame and the person frame includes: determining the area of the region where the mobile phone frame and the person frame overlap in the region-of-interest image, and taking the ratio between this overlap area and the area occupied by the mobile phone frame as the degree of coincidence.
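As a worked illustration of this ratio, a small Python sketch; the (x1, y1, x2, y2) box format is an assumption, not specified by the patent:

```python
def coincidence_degree(phone_box, person_box):
    """Overlap area of the two frames divided by the phone-frame area.

    Boxes are (x1, y1, x2, y2) pixel corners (an assumed format).
    """
    ix1 = max(phone_box[0], person_box[0])
    iy1 = max(phone_box[1], person_box[1])
    ix2 = min(phone_box[2], person_box[2])
    iy2 = min(phone_box[3], person_box[3])
    overlap = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    phone_area = (phone_box[2] - phone_box[0]) * (phone_box[3] - phone_box[1])
    return overlap / phone_area if phone_area > 0 else 0.0
```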
  • before determining the degree of coincidence between the mobile phone frame and the person frame, the above method further includes: determining the distance between the target person and the mobile phone based on the mobile phone frame and the person frame, and determining that the target person is not playing with a mobile phone when this distance is greater than a preset distance threshold. The degree of coincidence between the mobile phone frame and the person frame is then determined only when the distance between the target person and the mobile phone is less than or equal to the preset distance threshold, as the sketch below shows.
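Putting this embodiment's two checks in order (the cheap distance gate first, then the coincidence degree), a sketch reusing `coincidence_degree` from above; both threshold values are illustrative assumptions, the patent does not fix them:

```python
def detection_based_decision(phone_box, person_box, distance,
                             dist_threshold=200.0, overlap_threshold=0.5):
    """Frame-based decision rule: distance gate, then coincidence degree."""
    # A phone far from the target person cannot be in use by them.
    if distance > dist_threshold:
        return False
    # Otherwise fall through to the coincidence-degree comparison.
    return coincidence_degree(phone_box, person_box) >= overlap_threshold
```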
  • determining whether the target person is playing with a mobile phone based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model includes: determining the distance between the target person and the mobile phone based on the target person's person frame, the mobile phone frame, and the region-of-interest image; determining the distance between each non-target person and the mobile phone based on the non-target person's person frame, the mobile phone frame, and the region-of-interest image; determining that the target person is playing with a mobile phone when the distance between the target person and the mobile phone is less than the distances between all non-target persons and the mobile phone; and determining that the target person is not playing with a mobile phone when the distance between the target person and the mobile phone is greater than or equal to the distance between any non-target person and the mobile phone.
  • the above-mentioned determination of the distance between the target person and the mobile phone based on the target person's person frame, the mobile phone frame, and the region-of-interest image includes: performing hand recognition on the target person based on the target person's person frame and the region-of-interest image to determine the center position of the target person's hand; determining the center position of the mobile phone based on the mobile phone frame and the region-of-interest image; and determining the distance between the target person and the mobile phone based on the center position of the target person's hand and the center position of the mobile phone.
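A sketch of the distance computation from the last two paragraphs: centers as (x, y) pixel coordinates, Euclidean distance between them, and the "target person must be strictly closest" rule for multi-person ROIs. The function names and the box format are illustrative assumptions:

```python
import math

def box_center(box):
    """Center of an (x1, y1, x2, y2) box, e.g. the mobile phone frame."""
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def center_distance(hand_center, phone_center):
    """Euclidean distance between the hand center and the phone center."""
    return math.hypot(hand_center[0] - phone_center[0],
                      hand_center[1] - phone_center[1])

def target_is_playing(target_dist, non_target_dists):
    # The target person is judged to be playing with the phone only when
    # strictly closer to the phone than every non-target person.
    return all(target_dist < d for d in non_target_dists)
```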
  • the first behavior recognition model is an inception network model.
  • the second behavior recognition model is a residual network model.
  • in another aspect, a behavior recognition device includes: a communication unit for obtaining an image to be recognized; and a processing unit for extracting a region-of-interest image containing a target person from the image to be recognized and inputting the region-of-interest image into the first behavior recognition model to obtain the first behavior recognition result of the target person.
  • the first behavior recognition result is used to indicate whether the target person has the behavior of playing with a mobile phone; the image of the region of interest is input to the second behavior recognition model.
  • the second behavior recognition model outputs the second behavior recognition result of the target person, which is used to indicate whether the target person is playing with a mobile phone; if the first behavior recognition result is inconsistent with the second behavior recognition result, behavior recognition processing is performed on the target person based on the region-of-interest image to determine whether the target person is playing with a mobile phone.
  • the above-mentioned processing unit is also used to determine whether the target person has mobile phone playing behavior based on the first behavior recognition result or the second behavior recognition result if the first behavior recognition result is consistent with the second behavior recognition result.
  • the above-mentioned processing unit is specifically used to: input the region-of-interest image into the mobile phone detection model and into the person detection model; if no mobile phone is detected in the region-of-interest image, determine that the target person is not playing with a mobile phone; if a mobile phone is detected in the region-of-interest image, determine whether the target person is playing with a mobile phone based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model.
  • the above-mentioned processing unit is specifically used to: determine the degree of coincidence between the mobile phone frame and the person frame; if the degree of coincidence is greater than or equal to the preset coincidence threshold, determine that the target person is playing with a mobile phone; if the degree of coincidence is less than the preset coincidence threshold, determine that the target person is not playing with a mobile phone.
  • the above processing unit is specifically used to: determine the area of the region where the mobile phone frame and the person frame overlap in the region-of-interest image, and take the ratio between this overlap area and the area occupied by the mobile phone frame as the degree of coincidence.
  • the above processing unit is also used to: determine the distance between the target person and the mobile phone based on the mobile phone frame and the person frame, and determine that the target person is not playing with a mobile phone when this distance is greater than the preset distance threshold; the processing unit determines the degree of coincidence between the mobile phone frame and the person frame when the distance between the target person and the mobile phone is less than or equal to the preset distance threshold.
  • the above-mentioned processing unit is specifically used to: determine the person frames of the target person and of the non-target persons from the multiple person frames.
  • the non-target persons are the persons other than the target person in the region-of-interest image; the distance between the target person and the mobile phone is determined based on the target person's person frame, the mobile phone frame and the region-of-interest image; the distance between each non-target person and the mobile phone is determined based on the non-target person's person frame, the mobile phone frame and the region-of-interest image; the target person is determined to be playing with a mobile phone when the distance between the target person and the mobile phone is less than the distances between all non-target persons and the mobile phone; and the target person is determined not to be playing with a mobile phone when the distance between the target person and the mobile phone is greater than or equal to the distance between any non-target person and the mobile phone.
  • the above-mentioned processing unit is specifically used to: perform hand recognition on the target person based on the target person's person frame and the region-of-interest image to determine the center position of the target person's hand; determine the center position of the mobile phone based on the mobile phone frame and the region-of-interest image; and determine the distance between the target person and the mobile phone based on the center position of the target person's hand and the center position of the mobile phone.
  • the first behavior recognition model is an inception network model
  • the second behavior recognition model is a residual network model
  • in another aspect, a behavior recognition device includes a memory and a processor; the memory is coupled to the processor; the memory is used to store computer program code, and the computer program code includes computer instructions.
  • when the behavior recognition device executes the computer instructions, the behavior recognition device is caused to execute the mobile phone playing behavior recognition method described in any of the above embodiments.
  • a non-transitory computer-readable storage medium stores computer program instructions.
  • the processor executes one or more steps of the mobile phone playing behavior recognition method described in any of the above embodiments.
  • a computer program product includes computer program instructions.
  • when the computer program instructions are executed on a computer, they cause the computer to perform one or more steps of the mobile phone playing behavior recognition method described in any of the above embodiments.
  • when the computer program is executed on a computer, it causes the computer to perform one or more steps of the mobile phone playing behavior recognition method described in any of the above embodiments.
  • Figure 1 is a composition diagram of a mobile phone playing behavior recognition system according to some embodiments.
  • Figure 2 is a hardware structure diagram of a behavior recognition device according to some embodiments.
  • Figure 3 is flowchart 1 of a mobile phone playing behavior recognition method according to some embodiments.
  • Figure 4 is architecture diagram 1 of the inception structure according to some embodiments.
  • Figure 5 is architecture diagram 2 of the inception structure according to some embodiments.
  • Figure 6 is architecture diagram 1 of the resnet18 model according to some embodiments.
  • Figure 7 is architecture diagram 2 of the resnet18 model according to some embodiments.
  • Figure 8 is flowchart 2 of a mobile phone playing behavior recognition method according to some embodiments.
  • Figure 9 is flowchart 3 of a mobile phone playing behavior recognition method according to some embodiments.
  • Figure 10 is flowchart 4 of a mobile phone playing behavior recognition method according to some embodiments.
  • Figure 11 is a schematic diagram of a region-of-interest image according to some embodiments.
  • Figure 12 is flowchart 5 of a mobile phone playing behavior recognition method according to some embodiments.
  • Figure 13 is flowchart 6 of a mobile phone playing behavior recognition method according to some embodiments.
  • Figure 14 is a flow chart of a mobile phone playing behavior recognition process according to some embodiments.
  • Figure 15 is a structural diagram of a behavior recognition device according to some embodiments.
  • the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or implicitly indicating the quantity of the indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of such features.
  • plurality means two or more.
  • "at least one of A, B and C" has the same meaning as "at least one of A, B or C", and both include the following combinations of A, B and C: A only, B only, C only, the combination of A and B, the combination of A and C, the combination of B and C, and the combination of A, B and C.
  • "A and/or B" includes the following three combinations: A only, B only, and the combination of A and B.
  • the term “if” is optionally interpreted to mean “when” or “in response to” or “in response to determining” or “in response to detecting,” depending on the context.
  • the phrase “if it is determined" or “if [stated condition or event] is detected” is optionally interpreted to mean “when it is determined" or “in response to the determination" or “on detection of [stated condition or event]” or “in response to detection of [stated condition or event]”.
  • an example of the adverse effects caused by playing with mobile phones is that a vehicle driver who plays with a mobile phone while driving increases the probability of a traffic accident.
  • in the related art, mobile phone playing behavior is recognized by inputting an image of a person's entire environment into a behavior recognition model. Because such an image contains a large amount of redundant data and only a single recognition is performed by the behavior recognition model, the accuracy of mobile phone playing behavior recognition is low; it is then impossible to identify in time whether a person is playing with a mobile phone, and no reminder can be issued when the behavior occurs.
  • embodiments of the present disclosure provide a mobile phone playing behavior recognition method.
  • This method obtains the image to be recognized and extracts from it a region-of-interest image containing the target person.
  • the region-of-interest image, rather than the full image to be recognized with its large amount of redundant data, is then used to determine whether the target person is playing with a mobile phone, which reduces the interference of redundant data and improves the accuracy of mobile phone playing behavior recognition.
  • the mobile phone playing behavior recognition method first uses the first behavior recognition model and the second behavior recognition model to identify whether the target person is playing with a mobile phone. When the first behavior recognition result of the first model is consistent with the second behavior recognition result of the second model, either result is used as the final result. This dual recognition improves the accuracy of mobile phone playing behavior recognition.
  • when the two results are inconsistent, behavior recognition processing is performed again on the target person based on the region-of-interest image containing the target person in the image to be recognized, to determine whether the target person is playing with a mobile phone.
  • in short, the mobile phone playing behavior recognition method provided by the embodiments of the present disclosure performs multiple recognitions on the region-of-interest image containing the target person, which improves the accuracy of mobile phone playing behavior recognition and makes it possible to issue timely reminders when the target person plays with a mobile phone, avoiding the adverse effects caused by that behavior.
  • the mobile phone playing behavior recognition method provided by the embodiments of the present disclosure can be applied to scenarios such as vehicle driving, guard booths, office areas, and classrooms.
  • for example, in a vehicle driving scenario, when it is recognized that the driver is playing with a mobile phone, the behavior recognition device can upload the image and the recognition result to the background management server of the vehicle terminal for management personnel to view.
  • the behavior recognition device can also control the vehicle terminal to issue an alarm message reminding the vehicle driver not to use the mobile phone and to pay attention to driving safety.
  • similarly, in a classroom scenario, when it is recognized that a student is playing with a mobile phone, the behavior recognition device can upload the image and the recognition result to the teacher's terminal device for the teacher to view, so that the teacher can maintain the classroom teaching environment based on the recognition results displayed on the terminal device.
  • an embodiment of the present disclosure provides a mobile phone playing behavior recognition system, whose composition is described below.
  • the mobile phone playing behavior recognition system includes: a behavior recognition device 10 and a photographing device 20 .
  • the behavior recognition device 10 and the photographing device 20 may be connected in a wired or wireless manner.
  • the photographing device 20 may be installed near the supervision area.
  • for example, if the supervision area is a vehicle cab, the photographing device 20 can be installed on the top of the vehicle cab.
  • the embodiment of the present disclosure does not limit the specific installation method and specific installation location of the shooting device 20 .
  • the photographing device 20 may be used to photograph images of the supervision area to be identified.
  • the photographing device 20 may use a color camera to capture color images.
  • the color camera may be an RGB camera.
  • the RGB camera uses the RGB color model, obtaining a variety of colors through changes in and superposition of the three color channels red (R), green (G) and blue (B).
  • RGB cameras use three different cables to provide three basic color components, and use three independent charge coupled device (CCD) sensors to acquire the three color signals.
  • the shooting device may use a depth camera to capture depth images.
  • the depth camera can be a time of flight (TOF) camera.
  • TOF camera uses TOF technology.
  • the imaging principle of the TOF camera is as follows: a laser light source emits modulated pulsed infrared light, which is reflected when it encounters an object; a light detector receives the light reflected by the object, and the time difference or phase difference between emission and reflection is calculated and converted into the distance between the TOF camera and the photographed object, from which the depth value of each point in the scene is obtained.
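The time-difference variant of this principle reduces to distance = (speed of light x round-trip time) / 2; a short illustrative computation:

```python
C = 299_792_458.0  # speed of light in m/s

def tof_distance(round_trip_seconds):
    """One-way distance from the measured round-trip time of the pulse."""
    return C * round_trip_seconds / 2.0

# Example: a 10 ns round trip corresponds to roughly 1.5 m.
print(tof_distance(10e-9))  # ~1.499 m
```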
  • the behavior recognition device 10 is used to obtain the image to be recognized captured by the photography device 20 , and based on the image to be recognized captured by the photography device 20 , determine whether the person in the supervision area has the behavior of playing with a mobile phone.
  • the behavior recognition device 10 may be an independent server, a server cluster or distributed system composed of multiple servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud storage, network services, cloud communications, middleware services, domain name services, security services, content delivery networks and big data services.
  • the behavior recognition device 10 may be a mobile phone, a tablet computer, a desktop computer, a laptop, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) or virtual reality (VR) device, etc.
  • the behavior recognition device 10 may be a vehicle terminal.
  • Vehicle terminals are front-end devices used for vehicle communication and management and can be installed in various vehicles.
  • the behavior recognition device 10 can communicate with other terminal devices in a wired or wireless manner, for example with a vehicle administrator's terminal device in a vehicle driving scenario, or with a teacher's terminal device in a classroom scenario.
  • after the behavior recognition device 10 determines the recognition result of mobile phone playing behavior in the classroom based on the image to be recognized captured by the photographing device 20, the recognition result can be sent in voice, text or video form to the teacher's terminal device for the teacher to view.
  • the behavior recognition device 10 may be integrated with the photographing device 20 .
  • Figure 2 is a hardware structure diagram of a behavior recognition device provided by an embodiment of the present disclosure.
  • the behavior recognition device may include a processor 41 , a memory 42 , a communication interface 43 , and a bus 44 .
  • the processor 41, the memory 42 and the communication interface 43 may be connected through a bus 44.
  • the processor 41 is the control center of the behavior recognition device, and may be a processor or a collective name for multiple processing elements.
  • the processor 41 may be a general-purpose CPU or other general-purpose processor.
  • the general processor may be a microprocessor or any conventional processor.
  • the processor 41 may include one or more CPUs, such as CPU 0 and CPU 1 shown in FIG. 2 .
  • Memory 42 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store instructions or data structures.
  • the memory 42 may exist independently of the processor 41, and the memory 42 may be connected to the processor 41 through the bus 44 for storing instructions or program codes.
  • the processor 41 calls and executes the instructions or program codes stored in the memory 42, it can implement the mobile phone playing behavior identification method provided by the following embodiments of the present disclosure.
  • the memory 42 can also be integrated with the processor 41 .
  • the communication interface 43 is used to connect the behavior recognition device with other devices through a communication network.
  • the communication network can be an Ethernet network, a radio access network (RAN), a wireless local area network (WLAN), etc.
  • the communication interface 43 may include a receiving unit for receiving data, and a transmitting unit for transmitting data.
  • Bus 44 may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, etc.
  • the bus can be divided into address bus, data bus, control bus, etc. For ease of presentation, only one thick line is used in Figure 2, but it does not mean that there is only one bus or one type of bus.
  • the structure shown in Figure 2 does not constitute a limitation on the behavior recognition device.
  • the behavior recognition device may include more or fewer components than shown in the figure, may combine certain components, or may arrange the components differently.
  • the embodiment of the present disclosure provides a mobile phone playing behavior recognition method, which method is applied to a behavior recognition device.
  • the behavior recognition device may be the behavior recognition device 10 in the above mobile phone playing behavior recognition system, or the processor of the behavior recognition device 10 .
  • the method includes the following steps:
  • the image to be recognized is an image obtained by photographing the supervision area by a photographing device.
  • the supervision area is an area where it is necessary to supervise whether people play with mobile phones, for example vehicle cabs, classrooms, office areas and guard booths.
  • the supervision area may be determined by a behavior recognition device. For example, if there are multiple photographing devices connected to the behavior recognition device, the behavior recognition device may regard the area where each of the multiple photographing devices is located as a supervision area.
  • the supervision area may be determined by the user in a direct or indirect manner.
  • for example, in a classroom scenario, a school has M classrooms, and each classroom is equipped with a corresponding photographing device. If there are no students in N of the M classrooms, the user can choose to turn off the photographing devices of those N classrooms, and the behavior recognition device can take each of the remaining M-N classrooms as a supervision area. In this way, the behavior recognition device does not need to perform mobile phone playing behavior recognition for the N classrooms, which saves computing resources.
  • M and N are both positive integers.
  • the image to be recognized is used to record the images of K people contained in the supervision area at the current moment.
  • K is a positive integer.
  • after turning on the mobile phone playing behavior recognition function, the behavior recognition device executes the mobile phone playing behavior recognition method provided by the embodiments of the present disclosure. Correspondingly, if the behavior recognition device turns off the mobile phone playing behavior recognition function, it does not execute, or stops executing, the method.
  • the behavior recognition device turns on the mobile phone behavior recognition function by default.
  • the behavior recognition device periodically turns on the mobile phone playing behavior recognition function. For example, in a classroom scenario, the behavior recognition device automatically turns on the function between 8:00 and 17:30, and automatically turns it off between 17:30 and 8:00.
  • the behavior recognition device determines to turn on/off the mobile phone behavior recognition function according to instructions from the terminal device.
  • for example, in a vehicle driving scenario, while the driver is driving the vehicle, the vehicle manager issues an instruction to the behavior recognition device through the terminal device to enable the mobile phone playing behavior recognition function, and the behavior recognition device turns the function on in response. Alternatively, after the driver stops driving the vehicle, the vehicle manager issues an instruction through the terminal device to disable the function, and the behavior recognition device turns it off in response.
  • when the preset conditions are met, the behavior recognition device obtains the image to be recognized of the supervision area through the photographing device.
  • the above preset conditions include: the shooting device detects the presence of a person in the vehicle cab.
  • the behavior recognition device only needs to recognize mobile phone playing behavior when there are people in the vehicle cab, and does not need to do so when the cab is empty, which helps to reduce the computation load of the behavior recognition device.
  • obtaining the image to be recognized of the supervision area through the photographing device can be implemented as follows: the behavior recognition device sends a shooting instruction to the photographing device, the shooting instruction instructing the photographing device to photograph the supervision area; the behavior recognition device then receives the image to be recognized of the supervision area from the photographing device.
  • the image to be recognized may be photographed by the photographing device before receiving the photographing instruction, or may be photographed by the photographing device after receiving the photographing instruction.
  • human body recognition can be performed on the image to be recognized, and the target person is then determined from the K persons in the image.
  • the target person can be any one of the K persons, or a specific person among the K persons; K is a positive integer.
  • the behavior recognition device may recognize only specific persons in the supervision area, rather than every person in it. For example, in a vehicle driving scenario, the behavior recognition device only needs to recognize whether the vehicle driver is playing with a mobile phone, and does not need to recognize whether other passengers in the vehicle are, which reduces the computation load of the behavior recognition device.
  • the behavior recognition device can perform identity recognition processing on the image to be recognized to identify each of the K persons contained in the supervision area, and then send the identity recognition results to the terminal device for its user to view.
  • if, based on the identity recognition results, the user of the terminal device selects one of the K persons for mobile phone playing behavior recognition, the behavior recognition device takes that person as the target person. If the user selects all K persons, the behavior recognition device determines any one of the K persons as the target person.
  • the behavior recognition device performs identity recognition processing on the image to be recognized to identify each of the K persons contained in the supervision area. This can be implemented as follows: input the image to be recognized into the identity recognition model to obtain the identity recognition result of each person.
  • a trained identity recognition model is pre-stored in the memory of the behavior recognition device. After acquiring the image to be recognized, the behavior recognition device can input it into the identity recognition model to obtain the identity recognition result of each of the K persons contained in the supervision area.
  • the above-mentioned identity recognition model can be a convolutional neural network (CNN) model, for example one implemented using the VGG-16 model structure.
  • the behavior recognition device can perform image segmentation processing on the image to be recognized, and then extract the region-of-interest image containing the target person from it.
  • the target person appears in the image to be recognized in the form of a detection frame; the detection frame of the target person is enlarged and expanded, and the image of the resulting area is taken as the region-of-interest image containing the target person.
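One plausible reading of "enlarged and expanded" is a margin-padded crop around the detection frame; a sketch under that assumption (the 20% margin and the numpy-style image layout are illustrative, not values from the patent):

```python
def extract_roi(image, det_box, margin=0.2):
    """Crop a region-of-interest image around the target person's frame.

    `image` is an H x W x C array (e.g. numpy); `det_box` is
    (x1, y1, x2, y2) in pixels. Both conventions are assumptions.
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = det_box
    dx = (x2 - x1) * margin   # horizontal expansion of the frame
    dy = (y2 - y1) * margin   # vertical expansion of the frame
    x1 = max(0, int(x1 - dx))
    y1 = max(0, int(y1 - dy))
    x2 = min(w, int(x2 + dx))
    y2 = min(h, int(y2 + dy))
    return image[y1:y2, x1:x2]
```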
  • extracting the region of interest image containing the target person from the image to be recognized can be specifically implemented as follows: input the image to be recognized into the image segmentation model to obtain the region of interest image corresponding to each person.
  • the trained image segmentation model is pre-stored in the memory of the behavior recognition device. After acquiring the image to be recognized, the behavior recognition device can input it into the trained image segmentation model to obtain the region-of-interest image corresponding to each of the K persons contained in the supervision area.
  • the above-mentioned image segmentation model may be a deep neural network (DNN) model.
  • deep neural networks can automatically extract and learn the more essential features in images from massive training data. Applying deep neural networks to image segmentation significantly enhances the classification effect and further improves the accuracy of subsequent mobile phone playing behavior recognition.
  • the above image segmentation model may be built based on the Deeplab v3+ semantic segmentation algorithm.
  • the above-mentioned region-of-interest image of the target person can be one that has undergone image repair, so as to ensure that the subsequent recognition of the target person's mobile phone playing behavior based on it is accurate.
  • the trained first behavior recognition model is pre-stored in the memory of the behavior recognition device.
  • the image of the area of interest of the target person can be input into the first behavior recognition model to obtain the first behavior recognition result of the target person.
  • the first behavior recognition result is used to indicate whether the target person has the behavior of playing with mobile phones.
  • the behavior of playing with mobile phones includes the target person holding the mobile phone with his hand to send text messages and voice messages, placing the mobile phone on a table or other objects to send text messages and voice messages, and holding the mobile phone to his ear to make calls and listen to voice messages, etc.
  • the above-mentioned first behavior recognition model is an inception network model, which may be, for example, the inception-v3 model.
  • the inception-v3 model can include multiple inception structures.
  • the inception structure in the inception-v3 model connects different convolutional layers in parallel.
  • the first behavior recognition model can adopt the inception structure in the related art (such as the inception structure shown in Figure 4), or the improved inception structure provided by the embodiments of the present application (such as the inception structure shown in Figure 5).
  • Figure 4 shows a schematic diagram of an inception structure.
  • the inception structure in the related art includes an input layer, a fully connected layer, and four learning paths between the input layer and the fully connected layer.
  • the first learning path includes, connected in sequence, a 1*1 convolution kernel, a 3*3 convolution kernel and a 3*3 convolution kernel.
  • the second learning path includes, connected in sequence, a 1*1 convolution kernel and a 3*3 convolution kernel.
  • the third learning path includes a pooling layer (pool) and a 1*1 convolution kernel.
  • the fourth learning path includes a 1*1 convolution kernel.
  • in the improved inception structure, this application uses a 1*7 convolution kernel and a 7*1 convolution kernel to replace the originally used 3*3 convolution kernel.
  • this embodiment of the present disclosure provides a schematic diagram of an improved inception structure.
  • the improved inception structure includes an input layer, a fully connected layer, and ten learning paths between the input layer and the fully connected layer.
  • the first learning path includes, connected in sequence, a 1*7 convolution kernel, a 7*7 convolution kernel and a 1*7 convolution kernel.
  • the second learning path includes, connected in sequence, a 7*1 convolution kernel, a 7*7 convolution kernel and a 7*1 convolution kernel.
  • the third learning path includes, connected in sequence, a 1*1 convolution kernel and a 1*7 convolution kernel.
  • the fourth learning path includes, connected in sequence, a 1*1 convolution kernel and a 7*1 convolution kernel.
  • the fifth learning path includes, connected in sequence, a pooling layer (pool) and a 1*7 convolution kernel.
  • the sixth learning path includes, connected in sequence, a pooling layer (pool) and a 7*1 convolution kernel.
  • the seventh learning path includes a 1*7 convolution kernel.
  • the eighth learning path includes a 7*1 convolution kernel.
  • the ninth learning path includes, connected in sequence, a pooling layer (pool) and a 1*7 convolution kernel.
  • the tenth learning path includes, connected in sequence, a pooling layer (pool) and a 7*1 convolution kernel.
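To make the asymmetric-kernel idea concrete, a PyTorch-style sketch of just the first two of the ten learning paths; channel counts, padding, and the channel-wise concatenation are illustrative assumptions, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class TwoAsymmetricPaths(nn.Module):
    """First two learning paths of the improved inception structure."""
    def __init__(self, c_in, c_out):
        super().__init__()
        # Path 1: 1*7 conv -> 7*7 conv -> 1*7 conv, connected in sequence.
        self.path1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=(1, 7), padding=(0, 3)),
            nn.Conv2d(c_out, c_out, kernel_size=7, padding=3),
            nn.Conv2d(c_out, c_out, kernel_size=(1, 7), padding=(0, 3)),
        )
        # Path 2: 7*1 conv -> 7*7 conv -> 7*1 conv, connected in sequence.
        self.path2 = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=(7, 1), padding=(3, 0)),
            nn.Conv2d(c_out, c_out, kernel_size=7, padding=3),
            nn.Conv2d(c_out, c_out, kernel_size=(7, 1), padding=(3, 0)),
        )

    def forward(self, x):
        # Parallel paths, concatenated along the channel dimension.
        return torch.cat([self.path1(x), self.path2(x)], dim=1)
```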
  • if the first behavior recognition result is yes, it means that the first behavior recognition model, based on the target person's region-of-interest image, recognizes that the target person is playing with a mobile phone; if the first behavior recognition result is no, it means that the first behavior recognition model recognizes that the target person is not playing with a mobile phone.
  • S104 Input the region of interest image to the second behavior recognition model to obtain the second behavior recognition result of the target person.
  • the trained second behavior recognition model is pre-stored in the memory of the behavior recognition device.
  • the image of the area of interest of the target person can be input into the second behavior recognition model to obtain the second behavior recognition result of the target person.
  • the second behavior recognition result is used to indicate whether the target person is playing with a mobile phone.
  • the above-mentioned second behavior recognition model is a residual network model, which may be a resnet18 model, for example.
  • the resnet18 model is a serial network structure based on basic blocks, which uses shortcut connections to mitigate model degradation in deep networks. It should be understood that the above-mentioned second behavior recognition model can adopt the resnet18 model in the related art (such as the resnet18 model shown in Figure 6), or the improved resnet18 model provided by the embodiments of the present application (such as the resnet18 model shown in Figure 7).
  • Figure 6 shows the architecture diagram of a resnet18 model.
  • the resnet18 model in the related art includes, connected in sequence, an input layer, a 7*7 convolution layer, a maximum pooling (maxpool) layer, several 3*3 convolution layers, an average pooling (avgpool) layer and an output layer.
  • in order to speed up training and convergence, the improved resnet18 model adds at least one batch normalization (BN) layer.
  • the new BN layer can be located between two 3*3 convolutional layers.
  • the improved resnet18 model includes, connected in sequence, an input layer, a 7*7 convolution layer, a maximum pooling layer, a 3*3 convolution layer, a 3*3 convolution layer, a BN layer, and a 3*3 convolution layer.
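A PyTorch-style sketch of this modification: a residual basic block with the extra BN layer between its two 3*3 convolutions. The shortcut, channel counts, and activations are illustrative assumptions:

```python
import torch.nn as nn

class BasicBlockWithExtraBN(nn.Module):
    """Residual basic block with a BN layer added between the 3*3 convs."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(channels)  # the added normalization layer
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.bn(self.conv1(x))))
        return self.relu(out + x)  # shortcut connection around the block
```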
  • if the second behavior recognition result is yes, it means that the second behavior recognition model, based on the target person's region-of-interest image, recognizes that the target person is playing with a mobile phone; if the second behavior recognition result is no, it means that the second behavior recognition model recognizes that the target person is not playing with a mobile phone.
  • step S103 may be executed first, and then step S104 may be executed; or step S104 may be executed first, and then step S103 may be executed; or step S103 and step S104 may be executed simultaneously.
  • the inception-v3 model introduces the method of splitting a larger two-dimensional convolution into two smaller one-dimensional convolutions.
  • a 7 ⁇ 7 convolution kernel can be split into a 1 ⁇ 7 convolution kernel and a 7 ⁇ 1 convolution kernel.
  • the 3×3 convolution kernel can likewise be split into a 1×3 convolution kernel and a 3×1 convolution kernel. This is called the idea of factorization into small convolutions.
  • this asymmetric splitting of the convolution structure handles more and richer spatial features and increases feature diversity better than a symmetric split, while reducing the amount of computation. For example, replacing one 5×5 convolution with two 3×3 convolutions reduces the computation by 28%.
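A quick check of these savings, counting weights per output position and channel pair (bias terms ignored):

```python
# One 5x5 conv vs. two stacked 3x3 convs (same receptive field):
w_5x5 = 5 * 5              # 25 weights
w_two_3x3 = 2 * (3 * 3)    # 18 weights
print(1 - w_two_3x3 / w_5x5)   # 0.28 -> the 28% reduction cited above

# One 7x7 conv vs. the asymmetric 1x7 + 7x1 pair:
w_7x7 = 7 * 7              # 49 weights
w_asym = 1 * 7 + 7 * 1     # 14 weights
print(1 - w_asym / w_7x7)      # ~0.714 -> roughly 71% fewer weights
```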
  • the advantages of choosing the resnet18 model as the second behavior recognition model are as follows: compared with the traditional VGG model, the resnet18 model has lower complexity and requires fewer parameters while having a deeper network; it avoids the vanishing gradient phenomenon, solves the degradation problem of deep networks, accelerates network convergence, and prevents overfitting.
  • the first behavior recognition model and the second behavior recognition model are two different recognition models, so they may produce different behavior recognition results for the same target person's region-of-interest image.
  • for example, the first behavior recognition result may indicate that the target person is playing with a mobile phone while the second behavior recognition result indicates that the target person is not; or the first behavior recognition result may indicate that the target person is not playing with a mobile phone while the second behavior recognition result indicates that the target person is.
  • the mobile phone playing behavior recognition method provided by the embodiments of the present disclosure performs multiple behavior recognitions of whether the target person is playing with a mobile phone, which improves the accuracy of mobile phone playing behavior recognition, so that when it is recognized that the target person is playing with a mobile phone, a reminder can be issued in time to avoid the adverse effects caused by that behavior.
  • after step S104, the method further includes the following steps:
  • if the first behavior recognition result is consistent with the second behavior recognition result, the first behavior recognition model and the second behavior recognition model agree on whether the target person is playing with a mobile phone. Since the two models are based on different algorithms, consistent recognition results from them have high accuracy, and either the first behavior recognition result or the second behavior recognition result can be used to determine whether the target person is playing with a mobile phone.
  • for example, if the first behavior recognition result indicates that the target person is playing with a mobile phone and the second behavior recognition result also indicates that the target person is playing with a mobile phone, it is determined that the target person is playing with a mobile phone.
  • step S105 can be specifically implemented as the following steps:
  • for the target person to play with a mobile phone, a mobile phone must be present in the area where the target person is located, that is, in the target person's region-of-interest image. If there is no mobile phone in the region-of-interest image, that is, no mobile phone in the area where the target person is located, then there is no possibility that the target person is playing with a mobile phone.
  • the trained mobile phone detection model is pre-stored in the memory of the behavior recognition device.
  • the image of the area of interest can be input into the mobile phone detection model to detect whether there is a mobile phone in the image of the area of interest.
  • if the mobile phone detection model outputs at least one mobile phone frame, it means that there is a mobile phone in the region-of-interest image and the target person may be playing with it. If the mobile phone detection model outputs zero mobile phone frames, it means that there is no mobile phone in the region-of-interest image and there is no possibility that the target person is playing with a mobile phone.
  • the area of interest image of the target person is the image of the area where the detection frame of the target person is located.
  • the region-of-interest image of the target person may contain more than the target person: due to the shooting angle of the photographing device, it may also include other persons (called non-target persons) and objects (such as walls, mobile phones, etc.) besides the target person. For example, when a non-target person stands close to the target person, that non-target person may also be included in the target person's region-of-interest image.
  • when the target person's region-of-interest image contains non-target persons, those non-target persons may interfere with the result of recognizing whether the target person is playing with a mobile phone.
  • the trained person detection model is pre-stored in the memory of the behavior recognition device.
  • the area of interest image can be input to the person detection model to detect whether there are non-target people in the area of interest image.
  • Specifically, after the target person's region of interest image is input into the person detection model, if the model outputs only one person frame, that frame is the target person's person frame, meaning there is no non-target person in the region of interest image, i.e., no non-target person in the area where the target person is located. If the model outputs more than one person frame, a non-target person is present in the region of interest image, i.e., in the area where the target person is located.
  • In some embodiments, the above-mentioned mobile phone detection model may be, for example, a yolov5 model or a yolox model.
  • In some embodiments, the above-mentioned person detection model may be, for example, a yolov5, yolov4, or yolov3 model, or a mobilenetv1_ssd, mobilenetv2_ssd, or mobilenetv3_ssd model.
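  • As an illustration of this detection stage, the following sketch assumes the publicly available pretrained yolov5 model loaded through torch.hub, whose COCO classes include "person" (class 0) and "cell phone" (class 67); an actual deployment would use the trained mobile phone and person detection models described above:

```python
import torch

# Load a small pretrained yolov5 detector (assumes network access to the
# ultralytics/yolov5 hub repository; for illustration only).
model = torch.hub.load('ultralytics/yolov5', 'yolov5s', pretrained=True)

def detect_frames(roi_image):
    """Return (person_boxes, phone_boxes) as lists of (x1, y1, x2, y2)."""
    results = model(roi_image)
    person_boxes, phone_boxes = [], []
    for x1, y1, x2, y2, conf, cls in results.xyxy[0].tolist():
        if int(cls) == 0:        # COCO "person"
            person_boxes.append((x1, y1, x2, y2))
        elif int(cls) == 67:     # COCO "cell phone"
            phone_boxes.append((x1, y1, x2, y2))
    return person_boxes, phone_boxes
```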
  • S1052. If no mobile phone is detected from the region of interest image, determine that the target person is not playing with a mobile phone.
  • It can be understood that the region of interest image reflects the area where the target person is located. If no mobile phone is detected in it, then to a large extent there is no mobile phone in that area, and a target person without a mobile phone nearby cannot be playing with one. Therefore, if no mobile phone is detected from the region of interest image, it is determined that the target person is not playing with a mobile phone.
  • The advantage of step S1052 is that whether the target person is playing with a mobile phone is determined directly from whether a mobile phone is present in the region of interest image. The behavior recognition device does not need to perform any further cumbersome calculations, which improves the accuracy of the recognition while reducing the device's computational load.
  • S1053. If a mobile phone is detected from the region of interest image, determine whether the target person is playing with a mobile phone based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model.
  • It can be understood that, if a mobile phone is detected from the region of interest image, a mobile phone is present in the area where the target person is located, so the target person may be playing with it. In some embodiments, after a mobile phone is detected from the region of interest image, whether the target person is playing with it can be determined based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model.
  • For example, determining whether the target person is playing with a mobile phone based on these two kinds of frames may involve the following situations.
  • Situation 1: the person detection model outputs only one person frame.
  • As noted in S1051 above, when the person detection model outputs only one person frame, there is no non-target person in the area where the target person is located. In Situation 1, step S1053 can be specifically implemented as the following steps:
  • S201. Determine the degree of overlap between the mobile phone frame and the person frame.
  • The degree of overlap between the mobile phone frame and the person frame is positively correlated with the likelihood that the target person is playing with a mobile phone: the higher the overlap, the more likely the behavior.
  • It can be understood that, normally, if the target person is playing with a mobile phone, the phone should be near the target person, and the closer the phone, the more likely the behavior. In the image, both the target person and the mobile phone exist in the form of detection frames, and the degree of overlap between the mobile phone frame and the person frame reflects the distance between the target person and the phone. The degree of overlap is therefore positively correlated with the likelihood that the target person is playing with a mobile phone.
  • For example, the degree of overlap between the mobile phone frame and the person frame can be determined as follows.
  • Step 1: determine the area of the overlapping region between the mobile phone frame and the person frame in the region of interest image. It is easy to see that, when the distance between the phone and the target person is within a certain range, the phone's frame and the person's frame have an overlapping region.
  • As shown in Figure 11, the shape and coordinates of the pixel region corresponding to the mobile phone frame in the region of interest image can be determined from the frame's upper, lower, left, and right boundaries. This pixel region is a rectangle whose coordinates are (Xa_min, Ya_min, Xa_max, Ya_max), where Xa_min and Xa_max are the minimum and maximum abscissa of the mobile phone frame in the pixel region, and Ya_min and Ya_max are the minimum and maximum ordinate. The area occupied by the mobile phone frame in the region of interest image is then obtained from these coordinates.
  • Similarly, the shape and coordinates of the pixel region corresponding to the person frame in the region of interest image can be determined from the person frame's upper, lower, left, and right boundaries. This pixel region is also a rectangle, with coordinates (Xb_min, Yb_min, Xb_max, Yb_max), where Xb_min and Xb_max are the minimum and maximum abscissa of the person frame in the pixel region, and Yb_min and Yb_max are the minimum and maximum ordinate. The area occupied by the person frame in the region of interest image is then obtained from these coordinates.
  • In Figure 11, the dotted frame shown on the left is the mobile phone frame, and the dotted frame shown on the right is the person frame.
  • After the coordinates of the pixel regions corresponding to the mobile phone frame and the person frame are obtained, they can be used to determine the overlapping region between the two frames in the region of interest image, and then the area of that overlapping region.
  • For example, the relationship between the coordinates of the two frames and their overlapping region can be expressed as Formula (1):
  • A = renwu ∩ shouji        Formula (1)
  • where A denotes the overlapping region, renwu denotes the rectangle defined by the person frame's coordinates in the region of interest image, and shouji denotes the rectangle defined by the mobile phone frame's coordinates in the region of interest image.
  • Step 2: use the ratio between the area of the overlapping region and the area occupied by the mobile phone frame in the region of interest image as the degree of overlap. This relationship can be expressed as Formula (2):
  • B = A_sq / shouji_sq        Formula (2)
  • where B denotes the degree of overlap, A_sq denotes the area of the overlapping region, and shouji_sq denotes the area occupied by the mobile phone frame in the region of interest image.
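  • A minimal sketch of Formulas (1) and (2), assuming each frame is given as (x_min, y_min, x_max, y_max) pixel coordinates in the region of interest image:

```python
def overlap_degree(phone_box, person_box):
    """Degree of overlap B = area(phone ∩ person) / area(phone frame)."""
    xa_min, ya_min, xa_max, ya_max = phone_box
    xb_min, yb_min, xb_max, yb_max = person_box
    # Formula (1): the overlapping region is the intersection of the rectangles.
    inter_w = max(0.0, min(xa_max, xb_max) - max(xa_min, xb_min))
    inter_h = max(0.0, min(ya_max, yb_max) - max(ya_min, yb_min))
    a_sq = inter_w * inter_h                            # area of overlap region
    shouji_sq = (xa_max - xa_min) * (ya_max - ya_min)   # area of phone frame
    # Formula (2): the degree of overlap.
    return a_sq / shouji_sq if shouji_sq > 0 else 0.0

# With a preset overlap threshold of 80%, a phone frame lying entirely inside
# the person frame yields B = 1.0 >= 0.80, so playing behavior is determined.
is_playing = overlap_degree((40, 60, 60, 90), (30, 10, 120, 200)) >= 0.80
```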
  • S202. If the degree of overlap is greater than or equal to the preset overlap threshold, determine that the target person is playing with a mobile phone.
  • The preset overlap threshold may be set in advance by administrators based on manual experience. For example, with a preset overlap threshold of 80%, when the ratio of the area of the overlapping region between the mobile phone frame and the person frame to the area of the mobile phone frame is greater than or equal to 80%, it is determined that the target person is playing with a mobile phone.
  • It should be understood that the target person can only be playing with a mobile phone when a phone is nearby, but a phone being nearby does not by itself mean the person is playing with it. By determining that the target person is playing with a mobile phone only when the degree of overlap is greater than or equal to the preset overlap threshold, the method provided by the embodiments of the present disclosure improves the accuracy of the recognition.
  • S203. If the degree of overlap is less than the preset overlap threshold, determine that the target person is not playing with a mobile phone. It can be understood that an overlap below the threshold means the likelihood of the behavior is low, so it can be determined that the target person is not playing with a mobile phone.
  • As one possible implementation, to reduce the computational load of the behavior recognition device, the above method may further include step S301 before step S201, and step S201 may be specifically implemented as step S303.
  • S301. Determine the distance between the target person and the mobile phone based on the mobile phone frame and the person frame.
  • Steps S201 to S203 above assume by default that the mobile phone frame and the person frame have an overlapping region. It can be understood that, if the target person is not playing with a mobile phone, the two frames may not overlap at all; computing the degree of overlap in that case only increases the computational load of the behavior recognition device and wastes its computing resources.
  • On this basis, before determining the degree of overlap, the behavior recognition device can derive, from the coordinates of the pixel region corresponding to the mobile phone frame in the region of interest image, the coordinates of the center position of the mobile phone frame (referred to simply as the coordinates of the phone frame's center). Similarly, from the coordinates of the pixel region corresponding to the person frame, it can derive the coordinates of the center position of the person frame (referred to simply as the coordinates of the person frame's center).
  • From these two sets of center coordinates, the distance between the center of the mobile phone frame and the center of the person frame can be obtained, and this distance is taken as the distance between the target person and the mobile phone.
  • S302. When the distance between the target person and the mobile phone is greater than the preset distance threshold, determine that the target person is not playing with a mobile phone. It can be understood that, when the distance exceeds the preset distance threshold, the mobile phone frame and the person frame do not intersect, that is, there is no overlapping region between them.
  • When there is no overlapping region between the mobile phone frame and the person frame, the phone is far from the target person and the likelihood of the behavior is low, so it can be directly determined that the target person is not playing with a mobile phone, without computing the degree of overlap. This both reduces the computational load of the behavior recognition device and avoids wasting its computing resources.
  • Here, the preset distance threshold indicates the distance at or beyond which the mobile phone frame and the person frame do not intersect.
  • As one possible implementation, the preset distance threshold may be computed in real time by the behavior recognition device based on the resolution of the region of interest image. For example, the behavior recognition device may determine the distance from the center of the mobile phone frame to any one of its corners (upper left, upper right, lower left, or lower right), determine the corresponding center-to-corner distance of the person frame, and use the sum of these two distances as the preset distance threshold.
  • As another possible implementation, the preset distance threshold may be set in advance by administrators based on manual experience.
  • S303. When the distance between the target person and the mobile phone is less than or equal to the preset distance threshold, determine the degree of overlap between the mobile phone frame and the person frame. That is, step S201 above is executed only in this case.
  • It can be understood that, when the distance between the target person and the mobile phone is less than or equal to the preset distance threshold, the mobile phone frame and the person frame intersect, that is, there is an overlapping region between them. In that case a mobile phone is present in the area of the target person represented by the person frame, so the target person may be playing with it, and the degree of overlap between the person frame and the mobile phone frame can then be used to determine whether the target person is playing with a mobile phone. A sketch of this pre-check appears below.
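  • A sketch of this distance pre-check, assuming the preset distance threshold is the sum of the two center-to-corner distances as in the example above:

```python
import math

def box_center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def half_diagonal(box):
    # Distance from a frame's center to any one of its four corners.
    x_min, y_min, x_max, y_max = box
    return math.hypot(x_max - x_min, y_max - y_min) / 2.0

def frames_may_overlap(phone_box, person_box):
    """Steps S301-S303: only when the center distance is within the threshold
    can the frames intersect, so only then is the overlap degree computed."""
    (px, py), (qx, qy) = box_center(phone_box), box_center(person_box)
    center_distance = math.hypot(qx - px, qy - py)
    threshold = half_diagonal(phone_box) + half_diagonal(person_box)
    return center_distance <= threshold  # False: no overlap, not playing
```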
  • For the specific implementation of determining the degree of overlap between the mobile phone frame and the person frame, reference may be made to the description of step S201 above, which will not be repeated here.
  • The above embodiments focus on the situation in which the person detection model outputs only one person frame. In some embodiments, the mobile phone playing behavior recognition method provided by the embodiments of the present disclosure also covers the following situation.
  • Situation 2: the person detection model outputs multiple person frames.
  • As noted in S1051 above, when the person detection model outputs multiple person frames, non-target people are present in the area where the target person is located. In Situation 2, step S1053 can also be specifically implemented as the following steps:
  • S401. Determine the target person's person frame and the non-target people's person frames from the multiple person frames.
  • In some embodiments, when the behavior recognition device segments the image to be recognized with the image segmentation model in step S102 to obtain each person's region of interest image, it also establishes an identity identifier for each person, where one identity identifier uniquely indicates one person. When the person detection model outputs multiple person frames, the behavior recognition device can therefore determine the target person's person frame and the non-target people's person frames from the multiple frames based on the identity identifier of the person corresponding to each frame.
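  • The embodiments do not spell out how detected frames are matched to identity identifiers; one hypothetical scheme, sketched below under that assumption, assigns each detected person frame the identity whose recorded segmentation box it overlaps most:

```python
def box_intersection_area(a, b):
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    return w * h

def assign_identities(person_boxes, identity_boxes):
    """Hypothetical matching of detected person frames to the identity
    identifiers created in step S102. `identity_boxes` maps an identity
    identifier to the (x_min, y_min, x_max, y_max) box recorded for it."""
    return {
        i: max(identity_boxes,
               key=lambda pid: box_intersection_area(box, identity_boxes[pid]))
        for i, box in enumerate(person_boxes)
    }
```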
  • S402. Determine the distance between the target person and the mobile phone based on the target person's person frame, the mobile phone frame, and the region of interest image.
  • Optionally, this determination may use one or more of the following methods:
  • Method 1: the behavior recognition device determines the distance between the target person and the mobile phone based on the center position of the target person and the center position of the mobile phone.
  • For example, the behavior recognition device can determine the shape and coordinates of the pixel region corresponding to the target person's person frame in the region of interest image from that frame's upper, lower, left, and right boundaries; this pixel region is a rectangle. Likewise, it can determine the shape and coordinates of the rectangular pixel region corresponding to the mobile phone frame from that frame's boundaries.
  • After obtaining the coordinates of the pixel region corresponding to the target person's person frame, the behavior recognition device can obtain the coordinates of the center position of the target person in the region of interest image; similarly, after obtaining the coordinates of the pixel region corresponding to the mobile phone frame, it can obtain the coordinates of the center position of the mobile phone. From these two sets of coordinates, the distance between the center of the mobile phone and the center of the target person is computed and used as the distance between the target person and the mobile phone.
  • Method 2: the behavior recognition device determines the distance between the target person and the mobile phone based on the center position of the target person's hand and the center position of the mobile phone.
  • Method 1 above uses the distance between the target person's center and the phone's center. It can be understood that, normally, a person playing with a mobile phone operates it with the hands. To improve the accuracy of the recognition, the embodiments of the present disclosure therefore propose using the distance between the center of the target person's hand and the center of the phone as the distance between the target person and the phone.
  • Specifically, Method 2 may include the following steps.
  • S1. Perform hand recognition on the target person based on the target person's person frame and the region of interest image, and determine the center position of the target person's hand. In some embodiments, a trained hand recognition model is pre-stored in the memory of the behavior recognition device, and the region of interest image containing the target person's person frame can be input into the hand recognition model to obtain the target person's hand frame.
  • The shape and coordinates of the pixel region corresponding to the target person's hand frame in the region of interest image can be determined from the hand frame's upper, lower, left, and right boundaries; this pixel region is a rectangle. The center position of the target person's hand can then be obtained from these coordinates.
  • In some embodiments, the above-mentioned hand recognition model may be a hand recognition model based on the Faster R-CNN algorithm.
  • S2. Determine the center position of the mobile phone based on the mobile phone frame and the region of interest image; this may follow the same procedure as in Method 1. S3. Determine the distance between the target person and the mobile phone based on the center position of the target person's hand and the center position of the phone: the distance between the two center positions, computed from their pixel coordinates in the region of interest image, is used as the distance between the target person and the mobile phone.
  • Method 3: the behavior recognition device determines the distance between the target person and the mobile phone based on the center position of the target person's eyes and the center position of the mobile phone.
  • It should be understood that, normally, a person playing with a mobile phone looks at it, so the embodiments of the present disclosure may use the distance between the center position of the target person's eyes and the center position of the phone as the distance between the target person and the phone.
  • Specifically, Method 3 may include the following steps.
  • P1. Perform eye recognition on the target person based on the target person's person frame and the region of interest image, and determine the center position of the target person's eyes. In some embodiments, a trained eye recognition model is pre-stored in the memory of the behavior recognition device, and the region of interest image containing the target person's person frame can be input into the eye recognition model to obtain the target person's eye frame.
  • The method of obtaining the center position of the target person's eyes from the eye frame may refer to the method of obtaining the center position of the hand from the hand frame in S1 above, which will not be repeated here.
  • In some embodiments, the above-mentioned eye recognition model may be an eye recognition model based on the scale-invariant feature transform (SIFT) algorithm.
  • P2. Determine the center position of the mobile phone based on the mobile phone frame and the region of interest image. P3. Determine the distance between the target person and the mobile phone based on the center position of the target person's eyes and the center position of the phone. For P2 and P3, reference may be made to the descriptions of S2 and S3 above, which will not be repeated here.
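  • A minimal sketch combining steps S1-S3 of Method 2; `hand_model` is a hypothetical callable that returns the hand frame of the person inside `person_box`, and an eye detector could be substituted to obtain Method 3 (P1-P3):

```python
import math

def center(box):
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)

def hand_to_phone_distance(roi_image, person_box, phone_box, hand_model):
    """Distance between the target person's hand center and the phone center."""
    hand_box = hand_model(roi_image, person_box)  # S1: hand recognition
    hx, hy = center(hand_box)
    px, py = center(phone_box)                    # S2: center of the phone
    return math.hypot(px - hx, py - hy)           # S3: Euclidean distance
```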
  • S403. Determine the distance between each non-target person and the mobile phone based on the non-target person's person frame, the mobile phone frame, and the region of interest image. For step S403, reference may be made to the description of step S402 above, which will not be repeated here.
  • In some embodiments, when there are multiple non-target people in the target person's region of interest image, the behavior recognition device performs the above computation for each of them to obtain the distance between each non-target person and the mobile phone.
  • It should be noted that, to keep the recognition consistent, if the behavior recognition device uses Method 1 of S402 to determine the distance between the target person and the mobile phone, it also uses Method 1 to determine the distances between the non-target people and the phone. Similarly, if it uses Method 2 of S402 for the target person, it also uses Method 2 for the non-target people.
  • The embodiments of the present disclosure do not restrict the execution order of steps S402 and S403. For example, step S402 may be executed first and then step S403; or step S403 first and then step S402; or steps S402 and S403 may be executed at the same time.
  • S404. When the distance between the target person and the mobile phone is smaller than the distances between all non-target people and the phone, determine that the target person is playing with a mobile phone. It can be understood that, in this case, the target person is closest to the phone and is therefore, among all the people present, the one most likely to be playing with it.
  • S405. When the distance between the target person and the mobile phone is greater than or equal to the distance between any non-target person and the phone, determine that the target person is not playing with a mobile phone. In this case the target person is not the person closest to the phone, the likelihood of the behavior is low, and, to avoid misidentification, the behavior recognition device determines that the target person is not playing with a mobile phone.
  • This embodiment brings at least the following beneficial effect: when the person detection model outputs multiple person frames, non-target people as well as the target person are present in the area where the target person is located. By comparing each person's distance to the mobile phone and determining that the target person is playing with a mobile phone only when the target person's distance is the shortest, the method eliminates the influence of non-target people on the recognition and improves its accuracy, as the sketch below illustrates.
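  • A minimal sketch of the nearest-person rule of steps S404 and S405, assuming the relevant center positions (person, hand, or eye centers, computed the same way for everyone) have already been extracted:

```python
import math

def distance(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def target_is_playing(phone_center, target_center, non_target_centers):
    """S404/S405: the target person is judged to be playing with the phone
    only if strictly closer to it than every non-target person."""
    target_dist = distance(target_center, phone_center)
    return all(target_dist < distance(c, phone_center)
               for c in non_target_centers)

# Example: the target's hand center is ~12 px from the phone while the
# nearest non-target person is ~89 px away, so the target is judged playing.
playing = target_is_playing((100, 100), (108, 91), [(180, 60), (30, 190)])
```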
  • The following specific example illustrates the mobile phone playing behavior recognition method provided by the embodiments of the present disclosure. As shown in Figure 14, suppose the image in Figure 14 is the image to be recognized and contains Person 1 and Person 2.
  • First, the image to be recognized is segmented to obtain the region of interest image of Person 1 and the region of interest image of Person 2, and each region of interest image is input into the first and second behavior recognition models. Suppose Person 1's first and second behavior recognition results are consistent and indicate playing with a mobile phone; it is then determined that Person 1 is playing with a mobile phone. Suppose Person 2's first and second behavior recognition results are inconsistent, so whether Person 2 is playing with a mobile phone cannot yet be confirmed. Person 2's region of interest image can then be input into the person detection model and the mobile phone detection model to detect the people present in Person 2's area and whether a mobile phone is present there.
  • Suppose the person detection model outputs only one person frame and the mobile phone detection model outputs one mobile phone frame, meaning that only one person is in Person 2's area and a mobile phone is present there. Whether Person 2 is playing with the mobile phone can then be determined from the degree of overlap between the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model.
  • Suppose the preset overlap threshold is 80%; if the overlap between the mobile phone frame and the person frame is 85%, it is determined that Person 2 is playing with a mobile phone. The behavior recognition device then outputs the final recognition result: Person 1 is playing with a mobile phone, and Person 2 is playing with a mobile phone.
  • An embodiment of the present disclosure further provides a behavior recognition device. As shown in Figure 15, the behavior recognition device 300 may include a communication unit 301 and a processing unit 302. In some embodiments, the behavior recognition device 300 may further include a storage unit 303.
  • The communication unit 301 is configured to obtain the image to be recognized. The processing unit 302 is configured to: extract the region of interest image containing the target person from the image to be recognized; input the region of interest image into the first behavior recognition model to obtain the target person's first behavior recognition result, which indicates whether the target person is playing with a mobile phone; input the region of interest image into the second behavior recognition model to obtain the target person's second behavior recognition result, which likewise indicates whether the target person is playing with a mobile phone; and, if the first and second behavior recognition results are inconsistent, perform behavior recognition processing on the target person based on the region of interest image to determine whether the target person is playing with a mobile phone.
  • In other embodiments, the processing unit 302 is further configured to determine, if the first behavior recognition result is consistent with the second behavior recognition result, whether the target person is playing with a mobile phone based on the first or the second behavior recognition result.
  • In other embodiments, the processing unit 302 is specifically configured to: input the region of interest image into the mobile phone detection model and into the person detection model; if no mobile phone is detected from the region of interest image, determine that the target person is not playing with a mobile phone; and, if a mobile phone is detected, determine whether the target person is playing with it based on the mobile phone frame output by the mobile phone detection model and the person frame output by the person detection model.
  • In other embodiments, when the person detection model outputs only one person frame, the processing unit 302 is specifically configured to: determine the degree of overlap between the mobile phone frame and the person frame; if the overlap is greater than or equal to the preset overlap threshold, determine that the target person is playing with a mobile phone; and, if the overlap is less than the preset overlap threshold, determine that the target person is not playing with a mobile phone.
  • In other embodiments, the processing unit 302 is specifically configured to determine the area of the overlapping region between the mobile phone frame and the person frame in the region of interest image, and to use the ratio between that area and the area occupied by the mobile phone frame in the region of interest image as the degree of overlap.
  • In other embodiments, the processing unit 302 is further configured to: determine the distance between the target person and the mobile phone based on the mobile phone frame and the person frame, and determine that the target person is not playing with a mobile phone when that distance is greater than the preset distance threshold. It is specifically configured to determine the degree of overlap between the mobile phone frame and the person frame when that distance is less than or equal to the preset distance threshold.
  • In other embodiments, when the person detection model outputs multiple person frames, the processing unit 302 is specifically configured to: determine the target person's person frame and the non-target people's person frames from the multiple frames, the non-target people being the people in the region of interest image other than the target person; determine the distance between the target person and the mobile phone based on the target person's person frame, the mobile phone frame, and the region of interest image; determine the distance between each non-target person and the mobile phone based on the non-target person's person frame, the mobile phone frame, and the region of interest image; determine that the target person is playing with a mobile phone when the target person's distance to the phone is smaller than the distances of all non-target people; and determine that the target person is not playing with a mobile phone when the target person's distance is greater than or equal to the distance of any non-target person.
  • In other embodiments, the processing unit 302 is specifically configured to: perform hand recognition on the target person based on the target person's person frame and the region of interest image to determine the center position of the target person's hand; determine the center position of the mobile phone based on the mobile phone frame and the region of interest image; and determine the distance between the target person and the phone based on the center position of the target person's hand and the center position of the phone.
  • In other embodiments, the storage unit 303 is configured to store the image to be recognized, as well as the first behavior recognition model, the second behavior recognition model, the person detection model, the mobile phone detection model, the hand recognition model, the identity recognition model, and the image segmentation model.
  • The units in Figure 15 may also be referred to as modules; for example, the processing unit may be called a processing module.
  • If the units in Figure 15 are implemented as software function modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the embodiments of the present application may be embodied in the form of a software product stored in a storage medium, which includes several instructions that cause a computer device (which may be a personal computer, a behavior recognition device, a network device, etc.) or a processor to execute all or part of the steps of the methods of the various embodiments of the present application. Storage media for storing such a computer software product include: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or any other medium that can store program code.
  • Some embodiments of the present disclosure provide a computer-readable storage medium (e.g., a non-transitory computer-readable storage medium) storing computer program instructions that, when run on a processor of a computer, cause the processor to execute the mobile phone playing behavior recognition method described in any of the above embodiments.
  • For example, the above computer-readable storage media may include, but are not limited to: magnetic storage devices (e.g., hard disks, floppy disks, or tapes), optical disks (e.g., CD (Compact Disk), DVD (Digital Versatile Disk)), smart cards, and flash memory devices (e.g., EPROM (Erasable Programmable Read-Only Memory), cards, sticks, or key drives).
  • The various computer-readable storage media described in this disclosure may represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.
  • Some embodiments of the present disclosure also provide a computer program product, for example, stored on a non-transitory computer-readable storage medium. The computer program product includes computer program instructions that, when executed on a computer, cause the computer to execute the mobile phone playing behavior recognition method described in the above embodiments.
  • Some embodiments of the present disclosure also provide a computer program. When executed on a computer, the computer program causes the computer to perform the mobile phone playing behavior recognition method described in the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Human Computer Interaction (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Social Psychology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Psychiatry (AREA)
  • Image Analysis (AREA)
  • Maintenance And Inspection Apparatuses For Elevators (AREA)
  • Alarm Systems (AREA)
  • Telephone Function (AREA)

Abstract

A mobile phone playing behavior recognition method and device. The method includes: obtaining an image to be recognized, and extracting a region of interest image containing a target person from the image to be recognized; inputting the region of interest image into a first behavior recognition model to obtain a first behavior recognition result of the target person, the first behavior recognition result indicating whether the target person is playing with a mobile phone; inputting the region of interest image into a second behavior recognition model to obtain a second behavior recognition result of the target person, the second behavior recognition result indicating whether the target person is playing with a mobile phone; and comparing the first and second behavior recognition results, and, if they are inconsistent, performing behavior recognition processing on the target person based on the region of interest image to determine whether the target person is playing with a mobile phone.

Description

玩手机行为识别方法及装置
本申请要求于2022年6月30日提交的、申请号为202210764212.1的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及计算机视觉技术领域,尤其涉及一种玩手机行为识别方法及装置。
背景技术
随着手机的普及,手机已经成为人们日常生活中不可缺少的一部分,手机给人们的生活带来便捷的同时,人们对于手机的依赖程度愈发严重。在某些场景下玩手机,易给人们的生活带来一定的影响。例如在驾驶场景下,若驾驶员在驾驶途中玩手机,可能导致车祸概率增加。因此,在某些场景下需要准确识别出人们是否存在玩手机行为以进行实时预警。然而相关技术中对于玩手机行为识别的准确度较低。
发明内容
一方面,提供一种玩手机行为识别方法,该方法包括:获取待识别图像,从待识别图像中提取出包含目标人物的感兴趣区域图像。将感兴趣区域图像输入至第一行为识别模型,得到目标人物的第一行为识别结果,第一行为识别结果用于指示目标人物是否存在玩手机行为。将感兴趣区域图像输入至第二行为识别模型,得到目标人物的第二行为识别结果,第二行为识别结果用于指示目标人物是否存在玩手机行为。比较第一行为识别结果和第二行为识别结果,若第一行为结果与第二行为结果不一致,则基于感兴趣区域图像,对目标人物进行行为识别处理,确定目标人物是否存在玩手机行为。
在一些实施例中,上述方法还包括:若第一行为识别结果与第二行为识别结果一致,则基于第一行为识别结果或者第二行为识别结果,确定目标人物是否存在玩手机行为。
另一些实施例中,上述基于感兴趣区域图像,对目标人物进行行为识别处理,确定目标人物是否存在玩手机行为,包括:将感兴趣区域图像输入手机检测模型,以及将感兴趣区域图像输入人物检测模型;若未从感兴趣区域图像检测到手机,确定目标人物不存在玩手机行为;若从 感兴趣区域图像检测到手机,则根据手机检测模型输出的手机框,以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为。
另一些实施例中,在人物检测模型仅输出一个人物框时,上述根据手机检测模型输出的手机框,以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为,包括:确定手机框与人物框之间的重合度;若重合度大于或等于预设重合度阈值,则确定目标人物存在玩手机行为;若重合度小于预设重合度阈值,则确定目标人物不存在玩手机行为。
另一些实施例中,上述确定手机框与人物框之间的重合度,包括:确定手机框与人物框在感兴趣区域图像中重合区域的面积;以重合区域的面积与手机框在感兴趣区域所占的区域的面积之间的比值,作为重合度。
另一些实施例中,在上述确定手机框与人物框之间的重合度之前,上述方法还包括:基于手机框和人物框,确定目标人物与手机之间的距离;在目标人物与手机之间的距离大于预设距离阈值时,确定目标人物不存在玩手机行为;上述确定手机框与人物框之间的重合度,包括:在目标人物与手机之间的距离小于或等于预设距离阈值时,确定手机框与人物框之间的重合度。
另一些实施例中,在人物检测模型输出多个人物框时,上述根据手机检测模型输出的手机框,以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为,包括:基于目标人物的人物框、手机框以及感兴趣区域图像,确定目标人物与手机之间的距离;基于非目标人物的人物框、手机框以及感兴趣区域图像,确定非目标人物与手机之间的距离;在目标人物与手机之间的距离小于所有非目标人物与手机之间的距离时,确定目标人物存在玩手机行为;在目标人物与手机之间的距离大于或等于任意一个非目标人物与手机之间的距离时,确定目标人物不存在玩手机行为。
另一些实施例中,上述基于目标人物的人物框、手机框以及感兴趣区域图像,确定目标人物与手机之间的距离,包括:基于目标人物的人物框和感兴趣区域图像,对目标人物进行手部识别,确定目标人物的手部的中心位置;基于手机框和感兴趣区域图像,确定手机的中心位置;根据目标人物的手部的中心位置和手机的中心位置,确定目标人物与手机之间的距离。
另一些实施例中,上述第一行为识别模型为inception网络模型,上 述第二行为识别模型为残差网络模型。
又一方面,提供一种行为识别装置,该行为识别装置包括:通信单元,用于获取待识别图像;处理单元,用于:从待识别图像中提取出包含目标人物的感兴趣区域图像;将感兴趣区域图像输入至第一行为识别模型,得到目标人物的第一行为识别结果,第一行为识别结果用于指示目标人物是否存在玩手机行为;将感兴趣区域图像输入至第二行为识别模型,得到目标人物的第二行为识别结果,第二行为识别结果用于指示目标人物是否存在玩手机行为;若第一行为识别结果与第二行为识别结果不一致,则基于感兴趣区域图像,对目标人物进行行为识别处理,确定目标人物是否存在玩手机行为。
在一些实施例中,上述处理单元,还用于若第一行为识别结果与第二行为识别结果一致,则基于第一行为识别结果或者第二行为识别结果,确定目标人物是否存在玩手机行为。
另一些实施例中,上述处理单元,具体用于:将感兴趣区域图像输入手机检测模型,以及将感兴趣区域图像输入人物检测模型;若未从感兴趣区域图像检测到手机,确定目标人物不存在玩手机行为;若从感兴趣区域图像检测到手机,则根据手机检测模型输出的手机框,以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为。
另一些实施例中,在人物检测模型仅输出一个人物框时,上述处理单元,具体用于:确定手机框与人物框之间的重合度;若重合度大于或等于预设重合度阈值,则确定目标人物存在玩手机行为;若重合度小于预设重合度阈值,则确定目标人物不存在玩手机行为。
另一些实施例中,上述处理单元,具体用于:确定手机框与人物框在感兴趣区域图像中重合区域的面积;以重合区域的面积与手机框在感兴趣区域所占的区域的面积之间的比值,作为重合度。
另一些实施例中,上述处理单元,还用于:基于手机框和人物框,确定目标人物与手机之间的距离;在目标人物与手机之间的距离大于预设距离阈值时,确定目标人物不存在玩手机行为;上述处理单元,具体用于在目标人物与手机之间的距离小于或等于预设距离阈值时,确定手机框与人物框之间的重合度。
另一些实施例中,在人物检测模型输出多个人物框时,上述处理单元,具体用于:从多个人物框中确定目标人物的人物框,以及非目标人 物的人物框,非目标人物为感兴趣区域图像中除目标人物之外的其他人物;基于目标人物的人物框、手机框以及感兴趣区域图像,确定目标人物与手机之间的距离;基于非目标人物的人物框、手机框以及感兴趣区域图像,确定非目标人物与手机之间的距离;在目标人物与手机之间的距离小于所有非目标人物与手机之间的距离时,确定目标人物存在玩手机行为;在目标人物与手机之间的距离大于或等于任意一个非目标人物与手机之间的距离时,确定目标人物不存在玩手机行为。
另一些实施例中,上述处理单元,具体用于基于目标人物的人物框和感兴趣区域图像,对目标人物进行手部识别,确定目标人物的手部的中心位置;基于手机框和感兴趣区域图像,确定手机的中心位置;根据目标人物的手部的中心位置和手机的中心位置,确定目标人物与手机之间的距离。
另一些实施例中,上述第一行为识别模型为inception网络模型,上述第二行为识别模型为残差网络模型。
再一方面,提供一行为识别装置,该行为识别装置包括存储器和处理器;存储器和处理器耦合;存储器用于存储计算机程序代码,计算机程序代码包括计算机指令。其中,当处理器执行计算机指令时,使得该行为识别装置执行如上述任一实施例中所述的玩手机行为识别方法。
又一方面,提供一种非瞬态的计算机可读存储介质。所述计算机可读存储介质存储有计算机程序指令,所述计算机程序指令在处理器上运行时,使得所述处理器执行如上述任一实施例所述的玩手机行为识别方法中的一个或多个步骤。
又一方面,提供一种计算机程序产品。所述计算机程序产品包括计算机程序指令,在计算机上执行所述计算机程序指令时,所述计算机程序指令使计算机执行如上述任一实施例所述的玩手机行为识别方法中的一个或多个步骤。
又一方面,提供一种计算机程序。当所述计算机程序在计算机上执行时,所述计算机程序使计算机执行如上述任一实施例所述的玩手机行为识别方法中的一个或多个步骤。
附图说明
为了更清楚地说明本公开中的技术方案,下面将对本公开一些实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本公开的一些实施例的附图,对于本领域普通技术人员来讲,还可以根据这些附图获得其他的附图。此外,以下描述中的附图可以视作示意图,并非对本公开实施例所涉及的产品的实际尺寸、方法的实际流 程、信号的实际时序等的限制。
图1为根据一些实施例的一种玩手机行为识别***的组成图;
图2为根据一些实施例的一种行为识别装置的硬件结构图;
图3为根据一些实施例的一种玩手机行为识别方法的流程图一;
图4为根据一些实施例的inception结构的架构图一;
图5为根据一些实施例的inception结构的架构图二;
图6为根据一些实施例的resnet18模型的架构图一;
图7为根据一些实施例的resnet18模型的架构图二;
图8为根据一些实施例的一种玩手机行为识别方法的流程图二;
图9为根据一些实施例的一种玩手机行为识别方法的流程图三;
图10为根据一些实施例的一种玩手机行为识别方法的流程图四;
图11为根据一些实施例的一种感兴趣区域图像的示意图;
图12为根据一些实施例的一种玩手机行为识别方法的流程图五;
图13为根据一些实施例的一种玩手机行为识别方法的流程图六;
图14为根据一些实施例的一种玩手机行为识别过程的流程图;
图15为根据一些实施例的一种行为识别装置的结构图。
具体实施方式
下面将结合附图,对本公开一些实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。基于本公开所提供的实施例,本领域普通技术人员所获得的所有其他实施例,都属于本公开保护的范围。
除非上下文另有要求,否则,在整个说明书和权利要求书中,术语“包括(comprise)”及其其他形式例如第三人称单数形式“包括(comprises)”和现在分词形式“包括(comprising)”被解释为开放、包含的意思,即为“包含,但不限于”。在说明书的描述中,术语“一个实施例(one embodiment)”、“一些实施例(some embodiments)”、“示例性实施例(exemplary embodiments)”、“示例(example)”、“特定示例(specific example)”或“一些示例(some examples)”等旨在表明与该实施例或示例相关的特定特征、结构、材料或特性包括在本公开的至少一个实施例或示例中。上述术语的示意性表示不一定是指同一实施例或示例。此外,所述的特定特征、结构、材料或特点可以以任何适当方式包括在任何一个或多个实施例或示例中。
以下,术语“第一”、“第二”仅用于描述目的,而不能理解为指示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、 “第二”的特征可以明示或者隐含地包括一个或者更多个该特征。在本公开实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。
“A、B和C中的至少一个”与“A、B或C中的至少一个”具有相同含义,均包括以下A、B和C的组合:仅A,仅B,仅C,A和B的组合,A和C的组合,B和C的组合,及A、B和C的组合。
“A和/或B”,包括以下三种组合:仅A,仅B,及A和B的组合。
如本文中所使用,根据上下文,术语“如果”任选地被解释为意思是“当……时”或“在……时”或“响应于确定”或“响应于检测到”。类似地,根据上下文,短语“如果确定……”或“如果检测到[所陈述的条件或事件]”任选地被解释为是指“在确定……时”或“响应于确定……”或“在检测到[所陈述的条件或事件]时”或“响应于检测到[所陈述的条件或事件]”。
本文中“适用于”或“被配置为”的使用意味着开放和包容性的语言,其不排除适用于或被配置为执行额外任务或步骤的设备。
另外,“基于”的使用意味着开放和包容性,因为“基于”一个或多个所述条件或值的过程、步骤、计算或其他动作在实践中可以基于额外条件或超出所述的值。
如本文所使用的那样,“约”、“大致”或“近似”包括所阐述的值以及处于特定值的可接受偏差范围内的平均值,其中所述可接受偏差范围如由本领域普通技术人员考虑到正在讨论的测量以及与特定量的测量相关的误差(即,测量***的局限性)所确定。
随着手机智能化程度的提高,在衣食住行等方面人们对于手机的依赖程度越来越高。为了避免由于人们在某些场景下玩手机而引起的不良影响,需要对人们进行玩手机行为识别,以及时提醒人们在某些场景下勿玩手机,避免不良影响的情况发生。以车辆驾驶场景为例,人们存在玩手机行为而引起的不良影响可以是车辆驾驶员在驾驶车辆过程中存在玩手机行为会增加车祸发生的概率。
而相关技术提供的玩手机行为识别方法中针对人们玩手机行为的识别,是通过将人们所处环境的图像作为整体输入至行为识别模型中进行玩手机行为识别,由于人们所处环境的图像包含的冗余数据较多且仅通过行为识别模型进行了一次识别,导致玩手机行为识别的准确度较低,无法及时的识别出人们是否存在玩手机行为,进而无法在人们存在玩手机行为时做到及时提醒。
基于此,本公开实施例提供了一种玩手机行为识别方法,该方法通过获取待识别图像,从图像中提取出包含目标人物的感兴趣区域图像,根据包含目标 人物的感兴趣区域图像来确定目标人物是否存玩手机行为,而并非是以包含大量冗余数据的待识别图像来确定目标人物是否存在玩手机行为,减少了待识别图像中冗余数据对与玩手机行为识别的干扰,提升了玩手机行为识别的准确度。
另外,相对于相关技术中提供的玩手机行为识别方法中通过行为识别模型进行一次玩手机行为识别所造成的玩手机行为识别的准确度较低的问题,本公开实施例提供的一种玩手机行为识别方法,首先通过第一行为识别模型和第二行为识别模型分别对目标人物是否存在玩手机行为进行识别,在第一行为识别模型的第一行为识别结果和第二行为识别模型的第二行为识别结果一致的情况下,以第一行为识别结果或第二行为识别结果作为目标人物是否存在玩手机行为的结果,由此对目标人物是否存在玩手机行为进行了双重识别,提升了玩手机行为识别的准确度。且在第一行为识别模型的第一行为识别结果和第二行为识别模型的第二行为识别结果不一致的情况下,再次根据待识别图像中包含目标人物的感兴趣区域图像对目标人物进行行为识别处理,以此来确定目标人物是否存在玩手机行为。可见,本公开实施例提供的玩手机行为识别方法,对包含目标人物的感兴趣区域图像进行了多次玩手机行为识别,用户玩手机行为识别的识别结果的准确度更高,提升了玩手机行为识别的准确度,进而以便于在目标人物存在玩手机行为时,能够及时发出提醒,避免由于目标人物存在玩手机行为而引起不良影响的情况发生。
本公开实施例提供的玩手机行为识别方法可以应用于车辆驾驶、岗亭站岗、办公区域和教室等场景。
以玩手机行为识别方法应用于车辆驾驶场景为例,在行为识别装置基于本公开实施例提供的玩手机行为识别方法,确定车辆驾驶员存在玩手机行为之后,行为识别装置可以将此时车辆终端内的图像以及玩手机行为识别结果上传至车辆终端的后台管理服务器,供管理人员查看。进一步的,在行为识别装置确定车辆驾驶员存在玩手机行为一段时间后,行为识别装置可以控制车辆终端发出报警信息,以提示车辆驾驶员禁止玩手机、注意驾驶安全。
以玩手机行为识别方法应用于教室场景为例,在行为识别装置基于本公开实施例提供的玩手机行为识别方法,确定教室内有学生存在玩手机行为之后,行为识别装置可以将此时教室内的图像以及玩手机行为识别结果上传至老师的终端设备,以供老师查看,以便于老师根据终端设备显示的玩手机行为识别结果维护教室教学环境。
如图1所示,本公开实施例提供了一种玩手机行为识别***的组成 图。该玩手机行为识别***包括:行为识别装置10和拍摄装置20。其中,行为识别装置10和拍摄装置20之间可以通过有线或者无线的方式进行连接。
拍摄装置20可以设置于监督区域附近。例如,以监督区域为车辆驾驶室为例,拍摄装置20可以安装与该车辆驾驶室内的顶部。本公开实施例不限制拍摄装置20的具体安装方式以及具体安装位置。
拍摄装置20可用于拍摄监督区域的待识别图像。
在一些实施例中,拍摄装置20可以采用彩色摄像头来拍摄彩色图像。
示例性的,彩色摄像头可以为RGB摄像头。其中,RGB摄像头采用RGB色彩模式,通过红(red,R)、绿(greed,G)、蓝(blue,B)三个颜色通道的变化以及它们相互之间的叠加来得到各式各样的颜色。通常,RGB摄像头由三根不同的线缆给出了三个基本彩色成分,用三个独立的电荷耦合器件(charge coupled device,CCD)传感器来获取三种彩色信号。
在一些实施例中,拍摄装置可以采用深度摄像头来拍摄深度图像。
示例性的,深度摄像头可以为飞行时间(time of flight,TOF)摄像头。TOF摄像头采用TOF技术,TOF摄像头的成像原理如下:根据激光光源发出经调制的脉冲红外光,遇到物体后反射,光源探测器接收经物体反射的光源,通过计算光源发射和反射的时间差或相位差,来换算TOF摄像头与被拍摄物体之间的距离,进而根据TOF摄像头与被拍摄物体之间的距离,得到场景中各个点的深度值。
行为识别装置10用于获取拍摄装置20所拍摄到的待识别图像,并基于拍摄装置20所拍摄到的待识别图像,确定监督区域的人物是否存在玩手机行为。
在一些实施例中,行为识别装置10可以是独立的服务器,也可以是多个服务器构成的服务器集群或者分布式***,还可以是提供云服务、云数据库、云计算、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、内容分发网络、大数据服务网等基础云计算服务的云服务器。
在一些实施例中,行为识别装置10可以是手机、平板电脑、桌面型、膝上型、手持计算机、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本,以及蜂窝电话、个人数字助理(personal digital assistant,PDA)、增强现实(augmented reality,AR) \虚拟现实(virtual reality,VR)设备等。或者,行为识别装置10可以是车辆终端。车辆终端是用于车辆通信和管理的前端设备,可以安装在各种车辆内。
在一些实施例中,行为识别装置10可以通过有线或无线的方式与其他终端设备进行通信,例如在车辆驾驶场景下与车辆管理员的终端设备进行通信,又例如在教室场景下与老师的终端设备进行通信。
示例性的,基于教室场景下,在行为识别装置10基于拍摄装置20所拍摄到的待识别图像确定教室的玩手机行为识别的结果后,可以将玩手机行为识别的结果以语音、文字或视频的方式发送至老师的终端设备,以供老师查看。
在一些实施例中,行为识别装置10可以和拍摄装置20集成在一起。
图2为本公开实施例所提供的一种行为识别装置的硬件结构图。参见图2,行为识别装置可以包括处理器41、存储器42、通信接口43、总线44。处理器41,存储器42以及通信接口43之间可以通过总线44连接。
处理器41是行为识别装置的控制中心,可以是一个处理器,也可以是多个处理元件的统称。例如,处理器41可以是一个通用CPU,也可以是其他通用处理器等。其中,通用处理器可以是微处理器或者是任何常规的处理器等。
作为一种实施例,处理器41可以包括一个或多个CPU,例如图2中所示的CPU 0和CPU 1。
存储器42可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(electrically erasable programmable read-only memory,EEPROM)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。
一种可能的实现方式中,存储器42可以独立于处理器41存在,存储器42可以通过总线44与处理器41相连接,用于存储指令或者程序代码。处理器41调用并执行存储器42中存储的指令或程序代码时,能够实现本公开下述实施例提供的玩手机行为识别方法。
另一种可能的实现方式中,存储器42也可以和处理器41集成在一起。
通信接口43,用于行为识别装置与其他设备通过通信网络连接,所述通信网络可以是以太网,无线接入网(radio access network,RAN),无线局域网(wireless local area networks,WLAN)等。通信接口43可以包括用于接收数据的接收单元,以及用于发送数据的发送单元。
总线44,可以是工业标准体系结构(Industry Standard Architecture,ISA)总线、外部设备互连(Peripheral Component Interconnect,PCI)总线或扩展工业标准体系结构(Extended Industry Standard Architecture,EISA)总线等。该总线可以分为地址总线、数据总线、控制总线等。为便于表示,图2中仅用一条粗线表示,但并不表示仅有一根总线或一种类型的总线。
需要指出的是,图2中示出的结构并不构成对该行为识别装置的限定,除图2所示部件之外,该行为识别装置可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合说明书附图,对本公开提供的实施例进行具体介绍。
本公开实施例提供的玩手机行为识别方法,该方法应用于行为识别装置,该行为识别装置可以是上述玩手机行为识别***中的行为识别装置10,或者行为识别装置10的处理器。如图3所示,该方法包括如下步骤:
S101、获取待识别图像。
其中,待识别图像为拍摄装置对监督区域进行拍摄而得到的图像。监督区域为需要监督用户是否存在玩手机行为的区域。例如车辆驾驶室、教室、办公区域和岗亭等。
在一些实施例中,监督区域可以由行为识别装置来确定。例如,与行为识别装置连接的有多个拍摄装置,行为识别装置可以将多个拍摄装置中每个拍摄装置所在区域认为是监督区域。
在一些实施例中,监督区域可以由用户以直接或间接的方式来确定。例如,在应用于教室场景下,一个学校具有M个教室,每个教室均安装有对应的拍摄装置,M个教室中N个教室未存在学生的情况下,用户可以选择关闭N个教室的拍摄装置,则行为识别装置可以选择这M-N个教室中每一个教室作为监督区域。这样,行为识别装置可以不用对N个教室进行玩手机行为识别,以节省计算资源。其中,M和N均为正整数。
待识别图像用于记录在当前时刻下监督区域中包含的K个人物的图像。其中,K为正整数。
在一些实施例中,行为识别装置在开启玩手机行为识别功能之后,执行本 公开实施例提供的玩手机行为识别方法。相应的,若行为识别装置关闭玩手机行为识别功能之后,则行为识别装置不执行或停止执行本公开实施例提供的玩手机行为识别方法。
一种可选的实现方式中,行为识别装置默认开启玩手机行为识别功能。
另一种可选的实现方式中,行为识别装置周期性开启玩手机行为识别功能。例如在教室场景下,行为识别装置在早上8:00-下午17:30之间自动开启玩手机行为识别功能,在下午17:30-早上8:00之间自动关闭玩手机行为识别功能。
另一种可选的实现方式中,行为识别装置根据终端设备的指令,确定开启/关闭玩手机行为识别功能。
例如,应用在车辆驾驶场景下,在驾驶人员驾驶车辆过程中,车辆管理人员通过终端设备向行为识别装置下发开启玩手机行为识别功能的指令。响应于该指令,行为识别装置开启玩手机行为识别功能。或者,在驾驶人员驾驶车辆停止后,车辆管理人员通过终端设备向行为识别装置下发关闭玩手机行为识别功能的指令。响应于该指令,行为识别装置关闭玩手机行为识别功能。
在一些实施例中,在满足预设条件的情况下,行为识别装置通过拍摄装置获取监督区域的待识别图像。
可选的,在应用于车辆驾驶场景下,上述预设条件包括:拍摄装置检测到车辆驾驶室存在人物。这样,行为识别装置仅需要在车辆驾驶室存在人物的情况下进行玩手机行为识别,而无需再车辆驾驶室不存在人物的情况下进行玩手机行为识别,有助于减少行为识别装置的计算量。
在一些实施例中,行为识别装置通过拍摄装置获取监督区域的待识别图像,可以具体实现为:行为识别装置向拍摄装置发送拍摄指令,该拍摄指令用于指示拍摄装置拍摄监督区域的图像;之后,行为识别装置接收到来自拍摄装置的监督区域的待识别图像。
可选的,待识别图像可以是拍摄装置在接收到拍摄指令之前拍摄的,也可以是拍摄装置在接收到拍摄指令之后拍摄的。
S102、从待识别图像中提取出包含目标人物的感兴趣区域图像。
在一些实施例中,在行为识别装置接收到拍摄装置发送的待识别图像之后,可以对待识别图像进行人体识别处理,进而从待识别图像中的K个人物中确定出目标人物。其中,目标人物可以是K个人物中的任一个人物,也可以K个人物中的特定的人物,K为正整数。
可以理解的,在一些场景下,行为识别装置可以只需对监督区域中的特定 的人物进行玩手机行为识别,无需对监督区域中的每一个人物进行玩手机行为识别。例如在车辆驾驶场景下,行为识别装置只需对车辆驾驶员进行玩手机行为识别即可,无需对车辆中的其他乘客进行玩手机行为识别,能够降低行为识别装置的计算量。
在一些实施例中,在行为识别装置接收到待识别图像之后,行为识别装置可以对待识别图像进行身份识别处理,以识别出监督区域包含的K个人物中每一个人物的身份,进而将K个人物的身份识别结果发送至终端设备,以供终端设备的用户查看。若终端设备的用户根据K个人物的身份识别结果,选定对K个人物中的某个人物进行玩手机行为识别,则行为识别装置确定此人物即为目标人物。若终端设备的用户根据K个人物的身份识别结果,选定对K个人物进行玩手机行为识别,则行为识别装置确定K个人物中的任一个人物为目标人物。
可选的,行为识别装置对待识别图像进行身份识别处理,以识别出监督区域包含的K个人物中每一个人物的身份,可以具体实现为:将待识别图像输入至身份识别模型中,得到每一个人物的身份识别结果。
在一些实施例中,行为识别装置的存储器中预先存储有训练完成的身份识别模型,行为识别装置在获取待识别图像后,可以将待识别图像输入至身份识别模型,以得到监督区域包含的K个人物中每一个人物的身份识别结果。
在一些实施例中,上述身份识别模型可以是卷积神经网络模型(convolutional neural networks,CNN),例如可以采用VGG-16的模型结构来实现。
在一些实施例中,在行为识别装置确定了目标人物之后,为了去除待识别图像中冗余信息对于后续玩手机行为识别的准确度的影响,行为识别装置可以对待识别图像进行图像分割处理,进而从待识别图像中提取出包含目标人物的感兴趣区域图像。
可以理解的,在对待识别图像进行图像分割后,目标人物在待识别图像中是以检测框的形式呈现,将目标人物在待识别图像中的检测框等比例放大、扩展后形成的区域的图像作为包含目标人物的感兴趣区域图像。
可选的,从待识别图像中提取出包含目标人物的感兴趣区域图像,可以具体实现为:将待识别图像输入至图像分割模型中,得到每一个人物对应的感兴趣区域图像。
在一些实施例中,行为识别装置的存储器中预先存储有训练完成的图像分割模型,行为识别装置在获取到待识别图像后,可以将待识别图像输入至训 练完成的图像分割模型中,以得到监督区域包含的K个人物中每一个人物对应的感兴趣区域图像。
在一些实施例中,上述图像分割模型可以是深度神经网络(deep neural network,DNN)模型。
容易理解的是,深层次的神经网络可以在海量的训练数据中自动提取和学习图像中更本质的特征,将深度神经网络应用于图像分割中,将显著增强分类效果,并进一步提升后续对于玩手机行为识别的准确度。
在一些实施例中,上述图像分割模型可以是基于于Deeplab v3+语义分割算法来构建。
可选的,上述目标人物的感兴趣区域图像可以是经过修复处理后的感兴趣区域图像,以保证后续根据目标人物的感兴趣区域图像对目标人物进行玩手机行为识别的结果是准确的。
S103、将感兴趣区域图像输入至第一行为识别模型,得到目标人物的第一行为识别结果。
在一些实施例中,行为识别装置的存储器中预先存储有训练完成的第一行为识别模型。为了识别目标人物是否存在玩手机行为,在得到目标人物的感兴趣区域图像后,可以将目标人物的感兴趣区域图像输入至第一行为识别模型中,得到目标人物的第一行为识别结果。其中,第一行为识别结果用于指示目标人物是否存在玩手机行为。
其中,玩手机行为包括目标人物用手拿着手机发短信、发语音,以及将手机放在桌子等物体上发短信、发语音,以及把手机靠在耳边打电话、听语音等。
可选的,上述第一行为识别模型为inception网络模型,例如可以是inception-v3模型。其中,inception-v3模型可以包括多个inception结构。inception-v3模型中的inception结构是将不同的卷积层通过井联的方式结合在一起。应理解的是,第一行为识别模式可以采用相关技术中的inception结构(例如图4所示的inception结构),也可以采用本申请实施例提供的改进的inception结构(例如图5所示的inception结构)。
图4示出一种inception结构的示意图。如图4所示,相关技术中inception结构包括输出层、全连接层以及位于输出层以及全连接层之间的4条学习路径,第一条学习路径包括依次连接的1*1卷积核、3*3卷积核以及3*3卷积核。第二条学习路径包括依次连接的1*1卷积核和3*3卷积核。第三条学习路径包括pool和1*1卷积核。第四条学习路径包括1*1卷积核。
在一些实施例中,为了加快inception-v3模型的训练和收敛速度,本申请 实施例提供的改进的inception结构采用1*7卷积核以及7*1卷积核来替换原先采用的3*3卷积核。
示例性的,如图5所示,本公开实施例提供了一种改进的inception结构的示意图。改进的inception结构包括输出层、全连接层以及位于输出层以及全连接层之间的10条学习路径,第一条学习路径包括依次连接的1*7卷积核、7*7卷积核和1*7卷积核。第二条学习路径包括依次连接的7*1卷积核、7*7卷积核和7*1卷积核。第三条学习路径包括依次连接的1*1卷积核和1*7卷积核。第四条学习路径包括依次连接的1*1卷积核和7*1卷积核。第五条学习路径包括依次连接的Pool和1*7卷积核。第六条学习路径包括依次连接的Pool和7*1卷积核。第七条学习路径包括1*7卷积核。第八条学习路径包括7*1卷积核。第九条学习路径包括依次连接的Pool和1*7卷积核。第十条学习路径包括依次连接的Pool和7*1卷积核。
示例性的,若第一行为识别结果为是,则代表第一行为识别模型基于目标人物的感兴趣区域图像对目标人物的行为识别结果为目标人物存在玩手机行为;若第一行为识别结果为否,则代表第一行为识别模型基于目标人物的感兴趣区域图像对目标人物的行为识别结果为目标人物不存在玩手机行为。
S104、将感兴趣区域图像输入至第二行为识别模型,得到目标人物的第二行为识别结果。
在一些实施例中,行为识别装置的存储器中预先存储有训练完成的第二行为识别模型。为了识别目标人物是否存在玩手机行为,在得到目标人物的感兴趣区域图像后,可以将目标人物的感兴趣区域图像输入至第二行为识别模型中,得到目标人物的第二行为识别结果。其中,第二行为识别结果用于指示目标用户是否存在玩手机行为。
可选的,上述第二行为识别模型为残差网络模型,例如可以是resnet18模型。resnet18模型是一种基于basicblock的串行网络结构,巧妙地利用了shortcut连接,解决了深度网络中模型退化的问题。应理解,上述第二行为识别模型可以采用相关技术中的resnet18模型(例如图6所示的resnet18模型),或者可以采用本申请实施例提供的改进的resnet18模型(例如图7所示的resnet18模型)。
图6示出一种resnet18模型的架构图。如图6所示,相关技术中的resnet18模型包括依次连接的输出层、7*7卷积层、最大池化(maximum pooling,Maxpool)层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷 积层、3*3卷积层、3*3卷积层、平均池化层(average pooling,Argpool)和输出层。
在一些实施例中,为了加快resnet18模型的训练和收敛速度,本申请实施例提供的改进的resnet18模型增加了至少一个归一化(batch normalization,BN)层。可选的,新增的BN层可以位于两个3*3卷积层之间。
如图7所示,本公开实施例提供了一种改进的resnet18模型的架构图。参见图7,改进的resnet18模型包括依次连接的输出层、7*7卷积层、最大池化层、3*3卷积层、3*3卷积层、BN层、3*3卷积层、3*3卷积层、BN层、3*3卷积层、3*3卷积层、3*3卷积层、BN层、3*3卷积层、3*3卷积层、BN层、3*3卷积层、3*3卷积层、3*3卷积层、3*3卷积层、BN层、平均池化层和输出层。
示例性的,若第二行为识别结果为是,则代表第二行为识别模型基于目标人物的感兴趣区域图像对目标人物的行为识别结果为目标人物存在玩手机行为;若第二行为识别结果为否,则代表第二行为识别模型基于目标人物的感兴趣区域图像对目标人物的行为识别结果为目标人物不存在玩手机行为。
需要说明的是,本公开实施例不限制步骤S103和步骤S104之间的执行顺序。例如,可以先执行步骤S103,再执行步骤S104;或者,先执行步骤S104,再执行步骤S103;又或者,同时执行步骤S103和步骤S104。
应理解,选择inception-v3模型作为第一行为识别模型进行玩手机行为识别的优点在于:inception-v3模型引入了将一个较大的二维卷积拆成两个较小的一维卷积的做法。例如,7×7卷积核可以拆成1×7卷积核和7×l卷积核。当然3x3卷积核也可以拆成1×3卷积核和3×l卷积核,这被称为factorizationinto small convolutions思想。这种非对称的卷积结构拆分在处理更多、更丰富的空间特征以及增加特征多样性等方面的效果能够比对称的卷积结构拆分更好,同时能减少计算量。例如,2个3×3卷积代替1个5×5卷积能够减少28%的计算量。
同样的,选择resnet18模型作为第二行为识别模型进行玩手机行为识别的优点在于:相对于传统的VGG模型,resnet18模型的复杂度降低,所需的参数量下降,且网络深度更深,不会出现梯度消失现象,解决了深层次的网络退化问题,能够加速网络收敛,防止过度拟合。
S105、若第一行为识别结果与第二行为识别结果不一致,则基于感兴趣区域图像,对目标人物进行行为识别处理,确定目标人物是否存在玩手机行为。
可以理解的,第一行为识别模型与第二行为识别模型为两种不同的识别 模型,故针对同一个目标用户的感兴趣区域图像可能有不同的行为识别结果。例如,第一行为识别结果指示目标人物存在玩手机行为,第二行为识别结果指示目标不存在玩手机行为;或者,第一行为识别结果指示目标人物不存在玩手机行为,第二行为识别结果指示目标人物存在玩手机行为。
基于图3所示的实施例,至少带来以下有益效果:基于包含目标人物的感兴趣区域图像,通过第一行为识别模型和第二行为识别模型对目标人物是否存在玩手机行为进行双重识别,提升了玩手机行为识别的准确度。且在第一行为识别模型输出的第一行为识别结果与第二行为识别模型的第二行为识别结果不一致的情况下,再次基于包含目标人物的感兴趣区域图像对目标人物进行行为识别处理,以此来确定目标人物是否存在玩手机行为。可见,本公开实施例提供的一种玩手机行为识别方法,对用户是否存在玩手机行为进行了多次行为识别,提升了玩手机行为识别的准确度。以便于在识别出目标人物存在玩手机行为时,及时发出提醒信息,避免目标人物由于存在玩手机行为而引起的不良影响的发生。
在一些实施例中,如图8所示,在步骤S104之后,该方法还包括如下步骤:
S106、若第一行为识别结果与第二行为识别结果一致,则基于第一行为识别结果或第二行为识别结果,确定目标人物是否存在玩手机行为。
可以理解的,若第一行为识别结果与第二行为识别结果一致,代表第一行为识别模型和第二行为识别模型对于目标人物是否存在玩手机行为存在一致的识别结果。由于第一行为识别模型和第二行为识别模型为基于不同算法的行为识别模型,基于不同算法的行为识别模型输出了一致的识别结果,识别结果的准确度高,则可以基于第一行为识别结果或第二识别结果确定目标人物是否存在玩手机行为。
示例性的,若第一行为识别结果指示目标人物存在玩手机行为,第二行为识别结果指示目标人物存在玩手机行为,则确定目标人物存在玩手机行为。若第一行为识别结果指示目标人物不存在玩手机行为,第二行为识别结果指示目标人物不存在玩手机行为,则确定目标人物不存在玩手机行为。
在一些实施例中,如图9所示,上述步骤S105可以具体实现为以下步骤:
S1051、将感兴趣区域图像输入手机检测模型,以及将感兴趣区域图像输入人物检测模型。
可以理解的,若目标人物存在玩手机行为,需要目标人物所在区域需要存在手机,也就是目标人物的感兴趣区域图像中存在手机。若目标人物的感兴趣 区域图像中不存在手机,也就是目标人物所在区域不存在手机,那么目标人物也就不具有玩手机的可能性。
在一些实施例中,行为识别装置的存储器中预先存储有训练完成的手机检测模型。为了识别出目标人物是否具有玩手机的可能性,可以将感兴趣区域图像输入手机检测模型,来检测感兴趣区域图像中是否存在手机。
具体的,将目标人物的感兴趣区域图像输入至手机检测模型后,若手机检测模型输出了至少一个手机框,则代表感兴趣区域图像中存在手机,目标人物具有玩手机的可能性。若手机检测模型输出了0个手机框,则代表感兴趣区域图像中不存在手机,目标人物不具有玩手机的可能性。
由上述可知,目标人物的感兴趣区域图像是目标人物的检测框所在区域的图像,目标人物的感兴趣区域图像不仅可以包含目标人物,由于拍摄装置拍摄角度的原因,目标人物的感兴趣区域图像中还可以包含除目标人物之外的其他人物(也可以称作非目标人物)和物品(例如墙体、手机等),例如在一个非目标人物与目标人物站位较为接近的情况下,目标人物的感兴趣区域图像中还可以包括此非目标人物。
可以理解的,在目标人物的感兴趣区域图像包含除目标人物之外的非目标人物的情况下,感兴趣区域图像中除目标人物之外的非目标人物会对目标人物是否存在玩手机行为的识别结果造成干扰。
在一些实施例中,行为识别装置的存储器中预先存储有训练完成的人物检测模型。为了识别感兴趣区域图像中是否存在除目标人物之外的非目标人物,可以将感兴趣区域图像输入至人物检测模型,来检测感兴趣区域图像中是否存在非目标人物。
具体的,将目标人物的感兴趣区域图像输入至人物检测模型后,若人物检测模型仅输出了一个人物框,则此人物框也就是目标人物的人物框,代表感兴趣区域图像中未存在非目标人物,也就是目标人物所在区域未存在非目标人物。若人物检测模型输出了至少一个人物框,则代表感兴趣区域图像中存在非目标人物,也就是目标人物所在区域存在非目标人物。
在一些实施例中,上述手机检测模型包括:yolov5模型、yolox模型。
在一些实施例中,上述行人检测模型包括:yolov5模型、yolov4模型、yolov3模型、mobilenetv1_ssd模型、mobilenetv2_ssd模型和mobilenetv3_ssd模型。
S1052、若未从感兴趣区域图像检测到手机,确定目标人物不存在玩手机行为。
可以理解的,感兴趣区域图像反映的是目标人物所在区域,若未从感兴趣区域图像中检测到手机,代表一定程度上目标人物所在区域不存在手机。若目标人物所在区域不存在手机,则目标人物也就不存在玩手机行为的可能性。故若未从感兴趣区域图像中检测到手机,则确定目标人物不存在玩手机行为。
应理解,步骤S1052的优点在于:根据感兴趣区域图像中是否存在手机来直接确定目标人物是否存在玩手机行为,行为识别装置无需再进行繁琐的计算,能够在提升目标用户玩手机行为识别的准确度的同时,降低行为识别装置的计算量。
S1053、若从感兴趣区域图像检测到手机,则根据手机检测模型输出的手机框,以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为。
可以理解的,若从感兴趣区域图像中检测到手机,则代表目标人物所在区域存在手机,也就是目标人物存在玩手机行为的可能性。
在一些实施例中,从感兴趣区域图像中检测到手机后,可以根据手机检测模型输出的手机框以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为。
示例性的,根据手机检测模型输出的手机框以及人物检测模型输出的人物框,确定目标人物是否存在玩手机行为,可以具体包括以下几种情形。
情形1,人物检测模型仅输出一个人物框。
由上述S1051可知,在人物检测模型仅输出一个人物框时,代表目标人物所在区域未存在非目标人物。在情形1的情况下,如图10所示,步骤S1053可以具体实现为以下步骤:
S201、确定手机框与人物框之间的重合度。
其中,手机框与人物框之间的重合度与目标人物存在玩手机行为的可能性呈正相关,即重合度越高,目标人物存在玩手机行为的可能性越高。
可以理解的,通常情况下,若目标人物存在玩手机行为,则手机应存在于目标人物周边。手机距离目标人物越近,目标人物存在玩手机行为的可能性越高。而在图像中,目标人物和手机均是以检测框的形式存在,手机框与人物框之间的重合度能够反映目标人物与手机之间的距离,故手机框与人物框之间的重合度与目标人物存在玩手机行为的可能性呈正相关。
示例性的,确定手机框与人物框之间的重合度的过程如下:
步骤1、确定手机框与人物框在感兴趣区域图像中重合区域的面积。
容易理解的,在手机与目标人物之间的距离在一定范围时,手机相应的手机框与目标人物相应的人物框之间存在重合区域。
如图11所示,根据手机框的上边界、下边界、左边界和右边界可以确定手机框在感兴趣区域图像中对应的像素区域的形状和坐标。其中,手机框在感兴趣区域图像中对应的像素区域的形状为矩形,手机框在感兴趣区域图像中对应的像素区域的坐标为(Xamin,Yamin,Xamax,Yamax),其中,Xamin为手机框在像素区域中横坐标最小值,Yamin为手机框在像素区域中纵坐标最小值,Xamax为手机框在像素区域中横坐标最大值,Yamax为手机框在像素区域中纵坐标最大值。进而根据手机框在感兴趣区域图像中对应的像素区域的坐标得到手机框在感兴趣区域图像所占的面积。
同样的,根据人物框的上边界、下边界、左边界和右边界可以确定人物框在感兴趣区域图像中对应的像素区域的形状和坐标。其中,人物框在感兴趣区域图像中对应的像素区域的形状为矩形,人物框在感兴趣区域图像中对应的像素区域的坐标为(Xbmin,Ybmin,Xbmax,Ybmax),其中,Xbmin为人物框在像素区域中横坐标最小值,Ybmin为人物框在像素区域中纵坐标最小值,Xbmax为人物框在像素区域中横坐标最大值,Ybmax为人物框在像素区域中纵坐标最大值。进而根据人物框在感兴趣区域图像中对应的像素区域的坐标得到人物框在感兴趣区域图像所占的面积。其中,图11中左侧所示的虚线框为手机框,右侧所示的虚线框为人物框。
在得到手机框在感兴趣区域图像中对应的像素区域的坐标以及人物框在感兴趣区域图像中对应的像素区域的坐标后,可以根据手机框在感兴趣区域图像中对应的像素区域的坐标以及人物框在感兴趣区域图像中对应的像素区域的坐标,得到手机框与人物框在感兴趣区域图像中重合区域,进而能够得到重合区域的面积。
示例性的,手机框在感兴趣区域图像的坐标、人物框在感兴趣区域图像的坐标与重合区域之间的关系可以如下述公式(1)所示:
A=renwu∩shouji        公式(1)
其中,A用于表示重合区域,renwu用于表示人物框在感兴趣区域图像的坐标,shouji用于表示手机框在感兴趣区域图像的坐标。
步骤2、以重合区域的面积与手机框在感兴趣区域所占的区域的面积之间的比值,作为重合度。
示例性的,重合度、重合区域的面积与手机框在感兴趣区域图像所占的区域的面积之间的关系可以如下述公式(2)所示:
其中,B用于表示重合度,Asq用于表示重合区域的面积,shoujisq用于表示手机框在感兴趣区域图像所占的区域的面积。
S202、若重合度大于或等于预设重合度阈值,确定目标人物存在玩手机行为。
其中,预设重合度阈值可以是管理人员根据人工经验预先设置的,例如,预设重合度阈值为80%。也就是手机框与人物框之间的重合区域的面积与手机框的面积的比值大于或等于80%时,确定目标人物存在玩手机行为。
应理解,通常情况下,手机存在于目标人物周边时,目标人物才存在玩手机行为的可能性。但即使手机存在与目标人物周边时,目标人物不一定具有玩手机行为。故本公开实施例提供的一种玩手机行为识别方法,基于重合度大于或等于预设重合度阈值的情况下,确定目标人物存在玩手机行为,提升了玩手机行为识别的准确度。
S203、若重合度小于预设重合度阈值,确定目标人物不存在玩手机行为。
可以理解的,若重合度小于预设重合度阈值,代表目标人物存在玩手机行为的可能性较低,故可以确定目标人物不存在玩手机行为。
作为一种可能的实现方式,为了降低行为识别装置的计算量,如图12所示,上述玩手机行为识别方法在步骤S201之前还可以包括步骤S301,并且步骤S201可以具体实现为步骤S303。
S301、基于手机框和人物框,确定目标人物与手机之间的距离。
上述步骤S201至步骤S203是默认以手机框与人物框之间存在重合区域的情况下的说明。可以理解的,若目标人物不存在玩手机行为,则手机框与人物框之间不存在重合区域。若在手机框与人物框之间不存在重合区域的情况下,继续计算手机框与人物框之间的重合度,会增加行为识别装置的计算量,且造成行为识别装置的计算资源的浪费。
基于此,在确定手机框与人物框之间的重合度之前,行为识别装置可以根据手机框在感兴趣区域图像中对应的像素区域的坐标,得到手机框的中心位置在感兴趣区域图像中对应的像素区域的坐标简称手机框的中心位置的坐标。以及根据人物框在感兴趣区域图像中对应的像素区域的坐标,得到人物框的中心位置在感兴趣区域图像中对应的像素区域的坐标简称人物框的中心位置的坐标。
进而根据手机框的中心位置的坐标和人物框的中心位置的坐标,能够得到手机框的中心位置与人物框的中心位置之间的距离。将手机框的中心位置与人物框的中心位置之间的距离作为目标人物与手机之间的距离。
S302、在目标人物与手机之间的距离大于预设距离阈值时,确定目标人物 不存在玩手机行为。
可以理解的,在手机与目标人物之间的距离大于预设距离阈值的情况下,代表手机框与人物框之间不存在交集,也即手机框与人物框之间不存在重合区域。在手机框与人物框之间不存在重合区域的情况下,代表手机距离目标人物较远,目标人物存在玩手机行为的可能性较低,可以直接确定目标人物不存在玩手机行为,进而无需计算手机框与人物框之间的重合度,降低了行为识别装置的计算量的同时,减少了行为识别装置的计算资源的浪费。
其中,距离阈值用于指示手机框与人物框在不相交的情况下的距离门限值。
作为一种可能的实现方式,预设距离阈值可以是行为识别装置基于感兴趣区域图像的分辨率实时计算的。
示例性的,本公开实施例提供一种预设距离阈值的确定方式,行为识别装置根据手机框的中心位置到手机框的左上角、右上角、左下角、右下角中任一角的距离,以及人物框的中心位置到人物框的左上角、右上角、左下角、右下角中任一角的距离,以两者距离之和作为预设距离阈值。
作为另一种可能的实现方式,预设距离阈值可以是管理人员根据人工经验预先设置的。
S303、在目标人物与手机之间的距离小于或等于预设距离阈值时,确定手机框与人物框之间的重合度。
作为一种可能的实现方式,上述步骤S201可以具体实现为:在目标人物与手机之间的距离小于或等于预设距离阈值时,确定手机框与人物框之间的重合度。
可以理解的,在目标人物与手机之间的距离小于或等于预设距离阈值的情况下,代表手机框与人物框之间存在交集,也即手机框与人物框之间存在重合区域。在手机框与人物框之间存在重合区域的情况下,代表人物框代表的目标人物所在区域存在手机,也就是目标人物存在玩手机行为的可能性,可以进一步根据人物框与手机框之间的重合度来确定目标人物是否存在玩手机行为。
关于确定手机框与人物框之间的重合度的具体实现,可以参照上述步骤S201的描述,在此不予赘述。
上述实施例着重介绍了人物检测模型仅输出一个人物框时的情形,在一些实施例中,本公开实施例提供的玩手机行为识别方法还包括下述情形:
情形2、人物检测模型输多个人物框。
由上述S1051可知,在人物检测模型输出多个人物框时,代表目标人物 所在区域存在非目标人物。在情形2的情况下,如图13所示,步骤S1053还可以具体实现为以下步骤:
S401、从多个人物框中确定目标人物的人物框、以及非目标人物的人物框。
在一些实施例中,在上述步骤S102中行为识别装置基于图像分割模型对待识别图像进行图像分割,得到每一个人物对应的感兴趣区域图像时,行为识别装置给每一个人物建立了每一个人物对应的身份标识,一个身份标识用于唯一指示一个人物。
在人物检测模型输出多个人物框的情况下,行为识别装置可以基于多个人物框中每一个人物框对应的人物的身份标识,从多个人物框中确定目标人物的人物框,以及非目标人物的人物框。
S402、基于目标人物的人物框、手机框、以及感兴趣区域图像,确定目标人物与手机之间的距离。
可选的,基于目标人物的人物框、手机框、以及感兴趣区域图像,确定目标人物与手机之间的距离,可以包括如下方式中的一种或多种:
方式1、行为识别装置基于目标人物的中心位置和手机的中心位置,确定目标人物与手机之间的距离。
示例性的,行为识别装置可以根据目标人物的人物框的上边界、下边界、左边界和右边界可以确定目标人物的人物框在感兴趣区域图像中对应的像素区域的形状和坐标。其中,目标人物的人物框在感兴趣区域图像中对应的像素区域的形状为矩形。
同样的,行为识别装置可以根据手机框的上边界、下边界、左边界和右边界可以确定手机框在感兴趣区域图像中对应的像素区域的形状和坐标。其中,手机框在感兴趣区域图像中对应的像素区域的形状为矩形。
行为识别装置在得到目标人物的人物框在感兴趣区域图像中对应的像素区域的坐标后,可以得到目标人物的中心位置在感兴趣区域图像中对应的像素区域的坐标。
同样的,行为识别装置在得到手机框在感兴趣区域图像中对应的像素区域的坐标后,也可以得到手机的中心位置在感兴趣区域图像中对应的像素区域的坐标。
根据目标人物的中心位置在感兴趣区域图像中对应的像素区域的坐标和手机的中心位置在感兴趣区域图像中对应的像素区域的坐标,可以得到手机的中心位置与目标人物的中心位置的距离。进而将手机的中心位置与目标人 物的中心位置的距离,作为目标人物与手机之间的距离。
方式2、行为识别装置基于目标人物的手部的中心位置与手机的中心位置,确定目标人物与手机之间的距离。
上述方式1中是以目标人物的中心位置与手机的中心位置之间的距离作为目标人物与手机之间的距离。可以理解的,通常情况下若目标人物存在玩手机行为,目标人物是通过手来玩手机的,为了提升玩手机行为识别的准确度,本公开实施例提出以目标人物的手部的中心位置与手机的中心位置的距离作为目标人物与手机之间的距离。
具体的,上述方式2可以包括如下步骤:
S1、基于目标人物的人物框和感兴趣区域图像,对目标人物进行手部识别,确定目标人物的手部的中心位置。
在一些实施例中,服务区的存储器中预先存储有训练完成的手部识别模型,服务区可以将包含目标人物的人物框的感兴趣区域图像输入手部识别模型中,得到目标人物的手部框。
根据目标人物的手部框的上边界、下边界、左边界和右边界可以确定目标人物的手部框在感兴趣区域图像中对应的像素区域的形状和坐标。其中,目标人物在感兴趣区域图像中对应的像素区域的形状为矩形。进而,根据目标人物的手部框在感兴趣区域图像中对应的像素区域的坐标,可以得到目标人物的手部的中心位置。
在一些实施例中,上述手部识别模型可以是基于Faster R-CNN算法的手部识别模型。
S2、基于手机框和感兴趣区域图像,确定手机的中心位置。
关于基于手机框和感情去区域图像,确定手机的中心位置,可以参照上述方式1中对于手机的中心位置的确认方式,在此不予赘述。
S3、基于目标人物的手部的中心位置和手机的中心位置,确定目标人物与手机之间的距离。
可选的,可以根据目标人物的手部的中心位置在感兴趣区域图像中对应的像素区域的坐标,以及手机的中心位置在感兴趣区域图像中对应的像素区域的坐标,得到目标人物的手部的中心位置与手机的中心位置之间的距离。进而将目标人物的手部的中心位置与手机的中心位置之间的距离作为目标人物与手机之间的距离。
方式3、行为识别装置基于目标人物的眼部的中心位置与手机的中心位置,确定目标人物与手机之间的距离。
应理解,通常情况下,目标人物存在玩手机行为时,目标人物的眼部会观看手机,故本公开实施例以目标人物的眼部的中心位置与手机的中心位置之间的距离,作为目标人物与手机之间的距离。
具体的,上述方式3可以包括如下步骤:
P1、基于目标人物的人物框和感兴趣区域图像,对目标人物进行眼部识别,确定目标人物的眼部的中心位置。
在一些实施例中,行为识别装置的存储器中预先存储有眼部识别模型。可以将包含目标人物的人物框的感兴趣区域图像,输入至眼部识别模型中,得到目标人物的眼部框。
根据目标人物的眼部框得到目标人物的眼部的中心位置的方式可以参照上述S1中关于根据目标人物的手部框得到目标人物的手部的中心位置的方式,在此不予赘述。
在一些实施例中,上述眼部识别模型可以是基于尺度不变特征转换(scale-invariant feature transform、SIFT)算法的眼部识别模型。
P2: Determine the center position of the phone based on the phone box and the region-of-interest image.

P3: Determine the distance between the target person and the mobile phone based on the center position of the target person's eyes and the center position of the phone.

For the descriptions of P2 and P3, reference may be made to the descriptions of S2 and S3 above, which are not repeated here.

S403: Determine the distance between each non-target person and the mobile phone based on the non-target person's person box, the phone box, and the region-of-interest image.

For the description of step S403, reference may be made to the description of step S402 above, which is not repeated here.

In some embodiments, when multiple non-target persons are present in the target person's region-of-interest image, the behavior recognition device performs the above computation for each of the non-target persons to obtain the distance between each non-target person and the phone.

It should be noted that, to ensure the accuracy of mobile phone use behavior recognition, if the behavior recognition device uses Approach 1 of S402 above to determine the distance between the target person and the phone, it also uses Approach 1 of S402 above to determine the distances between the non-target persons and the phone. Similarly, if it uses Approach 2 of S402 above for the target person, it also uses Approach 2 of S402 above for the non-target persons.

The embodiments of the present disclosure do not restrict the execution order of steps S402 and S403. For example, step S402 may be executed first and then step S403; or step S403 may be executed first and then step S402; or steps S402 and S403 may be executed simultaneously.
S404: When the distance between the target person and the mobile phone is smaller than the distance between every non-target person and the phone, determine that the target person exhibits mobile phone use behavior.

It can be understood that, if the distance between the target person and the phone is smaller than the distance between every non-target person and the phone, the target person is the closest to the phone, that is, the target person is the one among the multiple persons most likely to be using the phone, so it is determined that the target person exhibits mobile phone use behavior.

S405: When the distance between the target person and the mobile phone is greater than or equal to the distance between any non-target person and the phone, determine that the target person does not exhibit mobile phone use behavior.

It can be understood that, if the distance between the target person and the phone is greater than or equal to the distance between any non-target person and the phone, the target person is not the person closest to the phone and the probability that the target person is using the phone is low; to avoid misrecognition, the behavior recognition device determines that the target person does not exhibit mobile phone use behavior.

The embodiment shown in Fig. 13 brings at least the following beneficial effects: when the person detection model outputs multiple person boxes, not only the target person but also non-target persons are present in the region where the target person is located. To eliminate the influence of the non-target persons on recognizing whether the target person is using a phone, the target person is determined to exhibit mobile phone use behavior, based on the distance between each person and the phone, only when the distance between the target person and the phone is the shortest. This eliminates the influence of the non-target persons and improves the accuracy of mobile phone use behavior recognition.
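Condensed into a single rule (a strict inequality against every non-target distance, as S404 and S405 require), the decision of these two steps might read as follows; the function name is illustrative:

    def target_is_using_phone(target_distance, non_target_distances):
        # S404/S405: the target person is judged to be using the phone only if
        # strictly closer to it than every non-target person.
        return all(target_distance < d for d in non_target_distances)

    # Example with distances produced by any one of Approaches 1 to 3:
    target_is_using_phone(52.0, [88.5, 130.2])  # True  -> S404 applies
    target_is_using_phone(90.0, [88.5, 130.2])  # False -> S405 applies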
A mobile phone use behavior recognition method provided by the embodiments of the present disclosure is illustrated below with a specific example.

As shown in Fig. 14, assume that the image shown in Fig. 14 is the image to be recognized, and that the image to be recognized includes person 1 and person 2.

First, image segmentation is performed on the image to be recognized to obtain the region-of-interest image of person 1 and the region-of-interest image of person 2.

The region-of-interest image of person 1 is input into the first behavior recognition model and the second behavior recognition model respectively, yielding the first and second behavior recognition results for person 1; the region-of-interest image of person 2 is input into the first behavior recognition model and the second behavior recognition model respectively, yielding the first and second behavior recognition results for person 2.

Assume that the first and second behavior recognition results for person 1 are consistent, and that the first behavior recognition result indicates that person 1 is using a phone; it is then determined that person 1 exhibits mobile phone use behavior.

Assume that the first and second behavior recognition results for person 2 are inconsistent, meaning that it cannot be confirmed whether person 2 is using a phone. The region-of-interest image of person 2 may then be input into the person detection model and the phone detection model to detect the persons present in the region where person 2 is located and whether a phone is present in that region.

Assume that the person detection model outputs only one person box and the phone detection model outputs one phone box, meaning that only one person is present in the region where person 2 is located and a phone is present in that region. Whether person 2 exhibits mobile phone use behavior can then be determined from the overlap between the phone box output by the phone detection model and the person box output by the person detection model.

Assume the preset overlap threshold is 80%: if the overlap between the phone box and the person box is 85%, it is determined that person 2 exhibits mobile phone use behavior. The behavior recognition device then outputs the final recognition result: person 1 exhibits mobile phone use behavior, and person 2 exhibits mobile phone use behavior.
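Tying the example together, a hedged top-level sketch of this flow follows: two recognition models, an agreement check, and the detection fallback. It reuses the illustrative overlap_ratio above; the model callables, the detect_boxes helper, and the 0.8 threshold from the example are all assumptions for illustration, not the disclosure's interface.

    OVERLAP_THRESHOLD = 0.8  # the 80% preset overlap threshold of the example

    def recognize(roi_image, first_model, second_model, detect_boxes):
        # first_model/second_model: callables returning True (phone use) or False.
        # detect_boxes: callable returning (phone_box or None, list of person boxes).
        first, second = first_model(roi_image), second_model(roi_image)
        if first == second:
            return first  # consistent results are accepted directly
        phone_box, person_boxes = detect_boxes(roi_image)
        if phone_box is None:
            return False  # no phone in the region: no phone use behavior
        if len(person_boxes) == 1:
            # Single person box: decide by the overlap ratio, as for person 2.
            return overlap_ratio(phone_box, person_boxes[0]) >= OVERLAP_THRESHOLD
        ...  # multiple person boxes: fall back to the distance comparison of Fig. 13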
The solution provided by the embodiments of the present disclosure has been described above mainly from the perspective of the method. To implement the above functions, it includes the hardware structures and/or software modules for performing the respective functions. Those skilled in the art should readily appreciate that, in combination with the units and algorithm steps of the examples described in the embodiments disclosed herein, the present disclosure can be implemented in hardware or in a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the particular application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present disclosure.
An embodiment of the present disclosure further provides a behavior recognition device. As shown in Fig. 15, the behavior recognition device 300 may include a communication unit 301 and a processing unit 302. In some embodiments, the behavior recognition device 300 may further include a storage unit 303.

In some embodiments, the communication unit 301 is configured to acquire an image to be recognized.

The processing unit 302 is configured to: extract, from the image to be recognized, a region-of-interest image containing a target person; input the region-of-interest image into a first behavior recognition model to obtain a first behavior recognition result of the target person, the first behavior recognition result being used to indicate whether the target person exhibits mobile phone use behavior; input the region-of-interest image into a second behavior recognition model to obtain a second behavior recognition result of the target person, the second behavior recognition result being used to indicate whether the target person exhibits mobile phone use behavior; and, if the first behavior recognition result and the second behavior recognition result are inconsistent, perform behavior recognition processing on the target person based on the region-of-interest image to determine whether the target person exhibits mobile phone use behavior.

In other embodiments, the processing unit 302 is further configured to: if the first behavior recognition result and the second behavior recognition result are consistent, determine whether the target person exhibits mobile phone use behavior based on the first behavior recognition result or the second behavior recognition result.

In other embodiments, the processing unit 302 is specifically configured to: input the region-of-interest image into a phone detection model, and input the region-of-interest image into a person detection model; if no phone is detected from the region-of-interest image, determine that the target person does not exhibit mobile phone use behavior; and, if a phone is detected from the region-of-interest image, determine whether the target person exhibits mobile phone use behavior according to the phone box output by the phone detection model and the person box output by the person detection model.

In other embodiments, when the person detection model outputs only one person box, the processing unit 302 is specifically configured to: determine the overlap between the phone box and the person box; if the overlap is greater than or equal to a preset overlap threshold, determine that the target person exhibits mobile phone use behavior; and, if the overlap is less than the preset overlap threshold, determine that the target person does not exhibit mobile phone use behavior.

In other embodiments, the processing unit 302 is specifically configured to: determine the area of the overlapping region of the phone box and the person box in the region-of-interest image, and take the ratio of the area of the overlapping region to the area of the region occupied by the phone box in the region of interest as the overlap.

In other embodiments, the processing unit is further configured to: determine the distance between the target person and the phone based on the phone box and the person box; and, when the distance between the target person and the phone is greater than a preset distance threshold, determine that the target person does not exhibit mobile phone use behavior. The processing unit is specifically configured to determine the overlap between the phone box and the person box when the distance between the target person and the phone is less than or equal to the preset distance threshold.

In other embodiments, when the person detection model outputs multiple person boxes, the processing unit 302 is specifically configured to: determine, from the multiple person boxes, the person box of the target person and the person boxes of non-target persons, a non-target person being a person in the region-of-interest image other than the target person; determine the distance between the target person and the phone based on the target person's person box, the phone box, and the region-of-interest image; determine the distance between each non-target person and the phone based on the non-target person's person box, the phone box, and the region-of-interest image; when the distance between the target person and the phone is smaller than the distances between all non-target persons and the phone, determine that the target person exhibits mobile phone use behavior; and, when the distance between the target person and the phone is greater than or equal to the distance between any non-target person and the phone, determine that the target person does not exhibit mobile phone use behavior.

In other embodiments, the processing unit 302 is specifically configured to: perform hand recognition on the target person based on the target person's person box and the region-of-interest image to determine the center position of the target person's hand; determine the center position of the phone based on the phone box and the region-of-interest image; and determine the distance between the target person and the phone according to the center position of the target person's hand and the center position of the phone.

In other embodiments, the storage unit 303 is configured to store the image to be recognized.

In other embodiments, the storage unit 303 is configured to store the first behavior recognition model, the second behavior recognition model, the person detection model, the phone detection model, the hand recognition model, the identity recognition model, and the image segmentation model.

The units in Fig. 15 may also be referred to as modules; for example, the processing unit may be referred to as a processing module.

If the units in Fig. 15 are implemented in the form of software functional modules and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present disclosure, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a behavior recognition device, a network device, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present disclosure. The storage medium storing the computer software product includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Some embodiments of the present disclosure provide a computer-readable storage medium (for example, a non-transitory computer-readable storage medium) storing computer program instructions that, when run on a processor of a computer, cause the processor to perform the mobile phone use behavior recognition method described in any one of the above embodiments.

Illustratively, the computer-readable storage medium may include, but is not limited to: magnetic storage devices (e.g., hard disks, floppy disks, or magnetic tapes), optical discs (e.g., CD (Compact Disc), DVD (Digital Versatile Disc)), smart cards, and flash memory devices (e.g., EPROM (Erasable Programmable Read-Only Memory), cards, sticks, or key drives). The various computer-readable storage media described in the present disclosure may represent one or more devices and/or other machine-readable storage media for storing information. The term "machine-readable storage medium" may include, but is not limited to, wireless channels and various other media capable of storing, containing, and/or carrying instructions and/or data.

Some embodiments of the present disclosure further provide a computer program product, for example, stored on a non-transitory computer-readable storage medium. The computer program product includes computer program instructions that, when executed on a computer, cause the computer to perform the mobile phone use behavior recognition method described in the above embodiments.

Some embodiments of the present disclosure further provide a computer program. When executed on a computer, the computer program causes the computer to perform the mobile phone use behavior recognition method described in the above embodiments.

The beneficial effects of the above computer-readable storage medium, computer program product, and computer program are the same as those of the mobile phone use behavior recognition method described in some of the above embodiments, and are not repeated here.

The above are only specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto. Any variation or replacement conceived by a person skilled in the art within the technical scope disclosed by the present disclosure shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (12)

  1. A mobile phone use behavior recognition method, wherein the method comprises:
    acquiring an image to be recognized;
    extracting, from the image to be recognized, a region-of-interest image containing a target person;
    inputting the region-of-interest image into a first behavior recognition model to obtain a first behavior recognition result of the target person, the first behavior recognition result being used to indicate whether the target person exhibits mobile phone use behavior;
    inputting the region-of-interest image into a second behavior recognition model to obtain a second behavior recognition result of the target person, the second behavior recognition result being used to indicate whether the target person exhibits mobile phone use behavior;
    if the first behavior recognition result and the second behavior recognition result are inconsistent, performing behavior recognition processing on the target person based on the region-of-interest image to determine whether the target person exhibits mobile phone use behavior.
  2. The method according to claim 1, wherein the method further comprises:
    if the first behavior recognition result and the second behavior recognition result are consistent, determining whether the target person exhibits mobile phone use behavior based on the first behavior recognition result or the second behavior recognition result.
  3. The method according to claim 2, wherein the performing behavior recognition processing on the target person based on the region-of-interest image to determine whether the target person exhibits mobile phone use behavior comprises:
    inputting the region-of-interest image into a phone detection model, and inputting the region-of-interest image into a person detection model;
    if no phone is detected from the region-of-interest image, determining that the target person does not exhibit mobile phone use behavior;
    if a phone is detected from the region-of-interest image, determining whether the target person exhibits mobile phone use behavior according to a phone box output by the phone detection model and a person box output by the person detection model.
  4. The method according to claim 3, wherein, when the person detection model outputs only one person box, the determining whether the target person exhibits mobile phone use behavior according to the phone box output by the phone detection model and the person box output by the person detection model comprises:
    determining an overlap between the phone box and the person box;
    if the overlap is greater than or equal to a preset overlap threshold, determining that the target person exhibits mobile phone use behavior;
    if the overlap is less than the preset overlap threshold, determining that the target person does not exhibit mobile phone use behavior.
  5. The method according to claim 4, wherein the determining the overlap between the phone box and the person box comprises:
    determining an area of an overlapping region of the phone box and the person box in the region-of-interest image;
    taking a ratio of the area of the overlapping region to an area of a region occupied by the phone box in the region of interest as the overlap.
  6. The method according to claim 4, wherein, before the determining the overlap between the phone box and the person box, the method further comprises:
    determining a distance between the target person and the phone based on the phone box and the person box;
    when the distance between the target person and the phone is greater than a preset distance threshold, determining that the target person does not exhibit mobile phone use behavior;
    the determining the overlap between the phone box and the person box comprises:
    when the distance between the target person and the phone is less than or equal to the preset distance threshold, determining the overlap between the phone box and the person box.
  7. The method according to claim 3, wherein, when the person detection model outputs multiple person boxes, the determining whether the target person exhibits mobile phone use behavior according to the phone box output by the phone detection model and the person boxes output by the person detection model comprises:
    determining, from the multiple person boxes, a person box of the target person and person boxes of non-target persons, a non-target person being a person in the region-of-interest image other than the target person;
    determining a distance between the target person and the phone based on the person box of the target person, the phone box, and the region-of-interest image;
    determining a distance between each non-target person and the phone based on the person box of the non-target person, the phone box, and the region-of-interest image;
    when the distance between the target person and the phone is smaller than the distances between all non-target persons and the phone, determining that the target person exhibits mobile phone use behavior;
    when the distance between the target person and the phone is greater than or equal to the distance between any one of the non-target persons and the phone, determining that the target person does not exhibit mobile phone use behavior.
  8. The method according to claim 7, wherein the determining the distance between the target person and the phone based on the person box of the target person, the phone box, and the region-of-interest image comprises:
    performing hand recognition on the target person based on the person box of the target person and the region-of-interest image, and determining a center position of a hand of the target person;
    determining a center position of the phone based on the phone box and the region-of-interest image;
    determining the distance between the target person and the phone according to the center position of the hand of the target person and the center position of the phone.
  9. The method according to any one of claims 1 to 8, wherein the first behavior recognition model is an Inception network model, and the second behavior recognition model is a residual network model.
  10. A behavior recognition device, wherein the behavior recognition device comprises:
    a communication unit configured to acquire an image to be recognized;
    a processing unit configured to: extract, from the image to be recognized, a region-of-interest image containing a target person; input the region-of-interest image into a first behavior recognition model to obtain a first behavior recognition result of the target person, the first behavior recognition result being used to indicate whether the target person exhibits mobile phone use behavior; input the region-of-interest image into a second behavior recognition model to obtain a second behavior recognition result of the target person, the second behavior recognition result being used to indicate whether the target person exhibits mobile phone use behavior; and, if the first behavior recognition result and the second behavior recognition result are inconsistent, perform behavior recognition processing on the target person based on the region-of-interest image to determine whether the target person exhibits mobile phone use behavior.
  11. A behavior recognition device, wherein the behavior recognition device comprises a memory and a processor;
    the memory is coupled to the processor; the memory is configured to store computer program code, the computer program code comprising computer instructions;
    wherein, when the processor executes the computer instructions, the behavior recognition device is caused to perform the mobile phone use behavior recognition method according to any one of claims 1 to 9.
  12. A non-transitory computer-readable storage medium storing a computer program; wherein, when run on a behavior recognition device, the computer program causes the behavior recognition device to implement the mobile phone use behavior recognition method according to any one of claims 1 to 9.
PCT/CN2023/095778 2022-06-30 2023-05-23 Mobile phone use behavior recognition method and device WO2024001617A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210764212.1 2022-06-30
CN202210764212.1A CN115147818A (zh) 2022-06-30 2022-06-30 Mobile phone use behavior recognition method and device

Publications (1)

Publication Number Publication Date
WO2024001617A1 true WO2024001617A1 (zh) 2024-01-04

Family

ID=83409872

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/095778 WO2024001617A1 (zh) Mobile phone use behavior recognition method and device

Country Status (2)

Country Link
CN (1) CN115147818A (zh)
WO (1) WO2024001617A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115147818A (zh) Mobile phone use behavior recognition method and device

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150286884A1 (en) * 2014-04-04 2015-10-08 Xerox Corporation Machine learning approach for detecting mobile phone usage by a driver
CN109871799A * 2019-02-02 2019-06-11 Zhejiang Wanli University Deep-learning-based method for detecting a driver's mobile phone use behavior
CN111723602A * 2019-03-19 2020-09-29 Hangzhou Hikvision Digital Technology Co., Ltd. Driver behavior recognition method, apparatus, device, and storage medium
CN113158842A * 2021-03-31 2021-07-23 Industrial and Commercial Bank of China Recognition method, ***, device, and medium
CN113255606A * 2021-06-30 2021-08-13 Shenzhen SenseTime Technology Co., Ltd. Behavior recognition method and apparatus, computer device, and storage medium
CN114187666A * 2021-12-23 2022-03-15 CNOOC Information Technology Co., Ltd. Method for recognizing looking at a mobile phone while walking, and *** therefor
CN114445710A * 2022-01-29 2022-05-06 Beijing Baidu Netcom Science and Technology Co., Ltd. Image recognition method and apparatus, electronic device, and storage medium
CN115147818A * 2022-06-30 2022-10-04 BOE Technology Group Co., Ltd. Mobile phone use behavior recognition method and device

Also Published As

Publication number Publication date
CN115147818A (zh) 2022-10-04


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23829802

Country of ref document: EP

Kind code of ref document: A1