CN111862159A - Improved target tracking and segmentation method, system and medium for twin convolutional network


Info

Publication number
CN111862159A
Authority
CN
China
Prior art keywords
target
feature map
row
map
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010716280.1A
Other languages
Chinese (zh)
Inventor
盛校麟
李凡平
石柱国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Issa Data Technology Co ltd
Beijing Yisa Technology Co ltd
Qingdao Yisa Data Technology Co Ltd
Original Assignee
Anhui Issa Data Technology Co ltd
Beijing Yisa Technology Co ltd
Qingdao Yisa Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Issa Data Technology Co ltd, Beijing Yisa Technology Co ltd, Qingdao Yisa Data Technology Co Ltd filed Critical Anhui Issa Data Technology Co ltd
Priority to CN202010716280.1A priority Critical patent/CN111862159A/en
Publication of CN111862159A publication Critical patent/CN111862159A/en
Withdrawn legal-status Critical Current

Classifications

    • G06T 7/246 — Image analysis; analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/045 — Neural networks; architecture, e.g. interconnection topology; combinations of networks
    • G06T 7/215 — Image analysis; analysis of motion; motion-based segmentation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an improved target tracking and segmentation method for a twin convolutional network, comprising the following steps: extracting input image features with a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map; performing a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map; convolving the output feature map and sending it to a semantic segmentation branch and a score map branch respectively to obtain a first feature map and a score map, where each pixel in the first feature map together with its corresponding channels is defined as a ROW and each pixel in the score map is the confidence of the corresponding ROW in the first feature map; and selecting, on the first feature map, the ROW corresponding to the pixel with the highest confidence in the score map as the ROW used for final generation of the mask matrix, and generating the target mask from that ROW with a Refinement module. The method makes full use of the low-level feature information of the feature extraction network and improves the accuracy of semantic segmentation and semantic tracking.

Description

Improved target tracking and segmentation method, system and medium for twin convolutional network
Technical Field
The invention relates to the technical field of computer vision, and in particular to an improved target tracking and segmentation method, system, terminal and medium for a twin convolutional network.
Background
Target tracking is an important branch of computer vision and is widely applied in fields such as autonomous driving, video surveillance and robotics. Traditional target tracking algorithms mainly rely on hand-crafted features and correlation filtering (such as KCF and TLD); they achieve high frame rates, but their accuracy and robustness are low, making it difficult to meet the requirements of practical applications. With the rise of artificial intelligence and deep learning in recent years, convolutional neural network algorithms have gradually entered the field of target tracking and achieved remarkable performance, among which algorithm frameworks based on the twin (Siamese) convolutional network have attracted great attention at international computer vision conferences and tracking challenges by virtue of their good performance and simple network structure.
To facilitate representation of the tracking result, early tracking algorithms returned the target tracking result as an axis-aligned rectangular box. However, as tracking accuracy improved and dataset difficulty increased, a rotated rectangular box was introduced as the annotation in VOT2015, and an automatic method of generating the rotated box from a mask was proposed in VOT2016; these still cannot meet the requirements of diversified target tracking tasks.
Disclosure of Invention
Aiming at the defects in the prior art, the embodiments of the present invention provide an improved target tracking and segmentation method, system, terminal and medium for a twin convolutional network.
In a first aspect, an embodiment of the present invention provides an improved target tracking and segmenting method for a twin convolutional network, including:
acquiring input image information;
extracting input image features by adopting a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map;
performing a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map;
after convolving the output feature map, sending it to a semantic segmentation branch and a score map branch respectively to obtain a first feature map and a score map, each pixel in the first feature map together with its corresponding channels being defined as a ROW, and each pixel in the score map being the confidence of the corresponding ROW in the first feature map;
and selecting, on the first feature map, the ROW corresponding to the pixel with the highest confidence in the score map as the ROW used for final generation of the mask matrix, and generating the target mask from the ROW with a Refinement module.
In a second aspect, an embodiment of the present invention provides an improved target tracking and segmenting system for a twin convolutional network, including: an image acquisition module, an image feature extraction module, a cross-correlation module, a first analysis module and a second analysis module,
the image acquisition module is used for acquiring input image information;
the image feature extraction module adopts a densely connected convolutional neural network to extract the features of an input image to obtain a target feature map and a tracking area feature map;
the cross-correlation module performs a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map;
the first analysis module is used for convolving the output feature map and sending it to the semantic segmentation branch and the score map branch respectively to obtain a first feature map and a score map, each pixel in the first feature map together with its corresponding channels being defined as a ROW, and each pixel in the score map being the confidence of the corresponding ROW in the first feature map;
and the second analysis module is used for selecting, on the first feature map, the ROW corresponding to the pixel with the highest confidence in the score map as the ROW used for final generation of the mask matrix, the target mask being generated from the ROW by a Refinement module.
In a third aspect, an intelligent terminal provided in an embodiment of the present invention includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, the memory is used to store a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method steps described in the foregoing embodiment.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions, which, when executed by a processor, cause the processor to perform the method steps described in the above embodiments.
The invention has the beneficial effects that:
the embodiment of the invention provides an improved target tracking and segmenting method, a system, a terminal and a medium of a twin convolutional network, which adopt a densely connected convolutional neural network to extract image features, improve the extraction capability of the network features, add a semantic segmentation branch and a score map branch into a basic twin convolutional neural network, improve the target tracking precision, realize the pixel-level tracking of a target, adopt a Refinement module to obtain a binary mask of the target in an image, fully utilize the bottom layer feature information of the feature extraction network, improve the semantic segmentation precision, and can realize high-precision semantic tracking.
Drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
FIG. 1 is a flow chart illustrating a method for object tracking and segmentation of an improved twin convolutional network according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of a target tracking and segmenting system of an improved twin convolutional network according to another embodiment of the present invention;
fig. 3 shows a schematic structural diagram of an intelligent terminal according to another embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should be further understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
Fig. 1 shows a flowchart of an improved target tracking and segmentation method for a twin convolutional network provided by a first embodiment of the present invention; the method includes the following steps:
s1: input image information is acquired.
S2: and extracting the features of the input image by adopting a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map.
S3: performing cross-correlation operation on the target characteristic diagram and the tracking area characteristic diagram to obtain an output characteristic diagram;
s4: and after convolution, the output feature map is respectively sent to a semantic segmentation branch and a score map branch to obtain a first feature map and a score map, each pixel in the first feature map and a corresponding channel thereof are set as ROW, and each pixel in the score map is a confidence corresponding to each ROW in the first feature map.
S5: and selecting the ROW corresponding to the pixel point with the highest confidence in the score map on the first characteristic map as the ROW used when the mask matrix is finally generated, and generating the target mask according to the ROW by adopting a Refinement module.
Specifically, the method for generating the target mask from the ROW using the Refinement module comprises the following steps:
deconvolving the ROW used for final generation of the mask matrix to obtain a second feature map, and additively coupling the second feature map with the intermediate feature maps obtained when the feature extraction network extracts the tracking target;
reducing the dimension of each intermediate feature map through a three-layer model so that it has the same number of channels, width and height as the ROW path, and adding the feature maps;
and passing the summed feature map through a final convolutional layer to obtain a first matrix, applying binary classification to the first matrix to obtain a mask matrix whose element values lie between 0 and 1, mapping the mask matrix back to the original image by affine transformation, binarizing the values between 0 and 1 in the mask matrix with a preset segmentation threshold to obtain the target mask of the tracked target in the original image, and obtaining the bounding box of the tracked target as the minimum bounding rectangle of the target mask.
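A sketch of one possible Refinement module of this shape is given below, under stated assumptions: the kernel sizes, adapter channel widths and skip-connection spatial sizes (31, 63 and 127 pixels) are illustrative choices made so that the arithmetic reproduces the dimensions quoted in the example that follows (15 × 15 × 32 up to a 127 × 127 × 1 first matrix); they are not specified by the patent text.

```python
import torch
import torch.nn as nn

def three_layer_adapter(in_ch: int, out_ch: int) -> nn.Sequential:
    """Three-layer model reducing a skip feature map to match the ROW path."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(out_ch, out_ch, 3, padding=1),
    )

class Refinement(nn.Module):
    def __init__(self, skip_channels=(256, 128, 64)):  # assumed encoder widths
        super().__init__()
        # Deconvolve the selected ROW (63*63 values at one grid cell) to 15x15x32.
        self.deconv = nn.ConvTranspose2d(63 * 63, 32, kernel_size=15)
        # kernel 5 / stride 2 / padding 1 maps 15 -> 31 -> 63 -> 127 exactly.
        self.up1 = nn.ConvTranspose2d(32, 16, 5, stride=2, padding=1)
        self.up2 = nn.ConvTranspose2d(16, 8, 5, stride=2, padding=1)
        self.up3 = nn.ConvTranspose2d(8, 4, 5, stride=2, padding=1)
        self.ad1 = three_layer_adapter(skip_channels[0], 16)
        self.ad2 = three_layer_adapter(skip_channels[1], 8)
        self.ad3 = three_layer_adapter(skip_channels[2], 4)
        self.head = nn.Conv2d(4, 1, 3, padding=1)  # 127x127x4 -> 127x127x1

    def forward(self, row, skips):
        """row: (1, 63*63, 1, 1); skips: intermediate maps at 31, 63, 127 px."""
        x = self.deconv(row)                  # second feature map, (1, 32, 15, 15)
        x = self.up1(x) + self.ad1(skips[0])  # additive coupling, (1, 16, 31, 31)
        x = self.up2(x) + self.ad2(skips[1])  # (1, 8, 63, 63)
        x = self.up3(x) + self.ad3(skips[2])  # third feature map path, (1, 4, 127, 127)
        return self.head(x)                   # first matrix, (1, 1, 127, 127)
```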
The above technical solution is described in detail below using a specific example.
Two input images are acquired, one of dimension 127 × 127 × 3 and the other of dimension 255 × 255 × 3, and both are input into the densely connected convolutional neural network for feature extraction. The feature extraction network is split into two paths that extract the target features and the tracking area features respectively: one path of the fully convolutional network processes the target image (of size 127 × 127) and the other the tracking area image (of size 255 × 255). The feature extraction process is represented by the following expression:
x_l = H_l([x_0, x_1, ..., x_{l-1}])
where H_l denotes the feature extraction operation of the l-th layer, [x_0, x_1, ..., x_{l-1}] denotes the channel-wise concatenation of the feature maps from the first layer to the (l-1)-th layer, and x_l is the output of the feature extraction network. A target feature map of dimension 15 × 15 × 256 and a tracking area feature map of dimension 31 × 31 × 256 are obtained. The correlation operation is performed on the 15 × 15 × 256 target feature map and the 31 × 31 × 256 tracking area feature map with padding 0 and stride 1, yielding an output feature map of dimension 17 × 17 × 256. This output feature map is sent to a semantic segmentation branch and a score map branch respectively, each formed by a 1 × 1 convolution; after the 1 × 1 convolutions, a first feature map (fmask) of dimension 17 × 17 × (63 × 63) and a score map of dimension 17 × 17 × 1 are obtained. Each pixel in fmask together with its corresponding channels is called a RoW, i.e. the response of a candidate window, so fmask contains 17 × 17 RoWs in total and each RoW has dimension 1 × 1 × (63 × 63). Each pixel in the score map is the confidence of the corresponding RoW in fmask, and the RoW on fmask corresponding to the highest-confidence pixel in the score map is selected as the RoW used for final mask generation. A target mask is generated from the RoW using a Refinement module.
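As an illustration of the correlation step just described, the following sketch (assuming PyTorch) reproduces these dimensions; using a depthwise, per-channel correlation realized with grouped convolution is an assumption made so that the output keeps 256 channels as stated:

```python
import torch
import torch.nn.functional as F

target = torch.randn(1, 256, 15, 15)   # target feature map, 15 x 15 x 256
search = torch.randn(1, 256, 31, 31)   # tracking area feature map, 31 x 31 x 256

# Slide the target map over the tracking area map with padding 0 and stride 1.
# groups=256 correlates each channel independently, keeping 256 output channels.
out = F.conv2d(search, target.view(256, 1, 15, 15), padding=0, stride=1, groups=256)
print(out.shape)  # torch.Size([1, 256, 17, 17]) -- the 17 x 17 x 256 output map
```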
The specific method for generating the target mask from the RoW using the Refinement module is as follows: (1) deconvolve (deconv) the selected RoW to obtain a second feature map of dimension 15 × 15 × 32; (2) additively couple the second feature map obtained by deconvolution with the intermediate feature maps obtained when the feature extraction network extracts the tracking target (template branch); (3) reduce the dimension of each intermediate feature map through a three-layer model so that it has the same number of channels, width and height as the RoW path, and add the feature maps; (4) finally obtain a third feature map of dimension 127 × 127 × 4 and pass it through the last convolutional layer to obtain a first matrix of dimension 127 × 127 × 1; (5) apply a sigmoid (binary classification) operation to the first matrix to obtain a mask matrix whose element values lie between 0 and 1 and indicate whether each pixel of the matrix generated from the RoW belongs to the mask, then map the mask matrix back to the original image by affine transformation, i.e. map the matrix back to the search area in the original image; (6) set a segmentation threshold (0.35 is chosen as the mask segmentation threshold in this embodiment) and convert the sigmoid matrix into a binary matrix, which yields the semantic segmentation information of the target, i.e. the target mask of the tracked target in the original image; the bounding box of the tracked target is then obtained as the minimum bounding rectangle of the target mask.
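Steps (5) and (6) can be sketched as follows, assuming PyTorch and OpenCV; mask_and_box is a hypothetical helper, the affine mapping back to the original frame is omitted, and cv2.minAreaRect supplies the minimum (rotated) bounding rectangle:

```python
import cv2
import torch

def mask_and_box(first_matrix: torch.Tensor, threshold: float = 0.35):
    """first_matrix: (1, 1, 127, 127) logits from the last convolutional layer."""
    prob = torch.sigmoid(first_matrix).detach()[0, 0]   # elements in (0, 1)
    mask = (prob > threshold).to(torch.uint8).numpy()   # binary target mask
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:                                    # no target found
        return mask, None
    largest = max(contours, key=cv2.contourArea)        # main connected region
    box = cv2.boxPoints(cv2.minAreaRect(largest))       # 4 corners of the bbox
    return mask, box
```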
The method described in this embodiment was evaluated on the DAVIS2016 dataset and compared with other state-of-the-art tracking algorithms (including the conventional twin-network-based tracking algorithm) on various performance metrics, with the results shown in Table 1:
[Table 1: performance comparison on DAVIS2016 — table image not extracted]
The method described in this embodiment was also evaluated on the DAVIS2017 dataset and compared with other state-of-the-art tracking algorithms (including the conventional twin-network-based tracking algorithm) on various performance metrics, with the results shown in Table 2:
[Table 2: performance comparison on DAVIS2017 — table image not extracted]
The data in Tables 1 and 2 show that the improved target tracking and segmentation method for a twin convolutional network provided by the embodiment of the invention performs significantly better than the prior-art methods.
The improved target tracking and segmentation method for a twin convolutional network provided by the embodiment of the invention uses a densely connected convolutional neural network to extract image features, which improves the feature extraction capability of the network; it adds a semantic segmentation branch and a score map branch to the basic twin convolutional neural network, which improves target tracking accuracy and realizes pixel-level tracking of the target; and it uses a Refinement module to obtain a binary mask of the target in the image, which makes full use of the low-level feature information of the feature extraction network, improves semantic segmentation accuracy, and enables high-precision semantic tracking.
In the first embodiment described above, an improved target tracking and segmenting method of a twin convolutional network is provided, and correspondingly, the present application also provides an improved target tracking and segmenting system of a twin convolutional network. Please refer to fig. 2, which is a schematic diagram of an improved target tracking and segmenting system of a twin convolutional network according to a second embodiment of the present invention. Since the apparatus embodiments are substantially similar to the method embodiments, they are described in a relatively simple manner, and reference may be made to some of the descriptions of the method embodiments for relevant points.
As shown in fig. 2, a schematic structural diagram of an improved target tracking and segmentation system for a twin convolutional network according to another embodiment of the present invention is presented. The system includes an image acquisition module, an image feature extraction module, a cross-correlation module, a first analysis module and a second analysis module. The image acquisition module is used for acquiring input image information; the image feature extraction module uses a densely connected convolutional neural network to extract the features of the input image to obtain a target feature map and a tracking area feature map; the cross-correlation module performs a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map; the first analysis module convolves the output feature map and sends it to the semantic segmentation branch and the score map branch respectively to obtain a first feature map and a score map, each pixel in the first feature map together with its corresponding channels being defined as a ROW and each pixel in the score map being the confidence of the corresponding ROW in the first feature map; and the second analysis module selects, on the first feature map, the ROW corresponding to the pixel with the highest confidence in the score map as the ROW used for final generation of the mask matrix, a Refinement module generating the target mask from the ROW.
The specific method for generating the target mask from the ROW using the Refinement module comprises the following steps:
deconvolving the ROW used for final generation of the mask matrix to obtain a second feature map, and additively coupling the second feature map with the intermediate feature maps obtained when the feature extraction network extracts the tracking target; reducing the dimension of each intermediate feature map through a three-layer model so that it has the same number of channels, width and height as the ROW path, and adding the feature maps; and passing the summed feature map through a final convolutional layer to obtain a first matrix, applying binary classification to the first matrix to obtain a mask matrix whose element values lie between 0 and 1, mapping the mask matrix back to the original image by affine transformation, binarizing the values between 0 and 1 in the mask matrix with a preset segmentation threshold to obtain the target mask of the tracked target in the original image, and obtaining the bounding box of the tracked target as the minimum bounding rectangle of the target mask.
The above technical solution is described in detail below using a specific example.
The image acquisition module acquires two input images, one of dimension 127 × 127 × 3 and the other of dimension 255 × 255 × 3, and inputs them into the image feature extraction module. The image feature extraction module uses a densely connected convolutional neural network to extract features; the feature extraction network is split into two paths that extract the target features and the tracking area features respectively, one path of the fully convolutional network processing the target image (of size 127 × 127) and the other the tracking area image (of size 255 × 255). The feature extraction process of the image feature extraction module is represented by the following expression:
x_l = H_l([x_0, x_1, ..., x_{l-1}])
where H_l denotes the feature extraction operation of the l-th layer, [x_0, x_1, ..., x_{l-1}] denotes the channel-wise concatenation of the feature maps from the first layer to the (l-1)-th layer, and x_l is the output of the feature extraction network. A target feature map of dimension 15 × 15 × 256 and a tracking area feature map of dimension 31 × 31 × 256 are obtained. The correlation operation is performed on the 15 × 15 × 256 target feature map and the 31 × 31 × 256 tracking area feature map with padding 0 and stride 1, yielding an output feature map of dimension 17 × 17 × 256. This output feature map is sent to a semantic segmentation branch and a score map branch respectively, each formed by a 1 × 1 convolution; after the 1 × 1 convolutions, a first feature map (fmask) of dimension 17 × 17 × (63 × 63) and a score map of dimension 17 × 17 × 1 are obtained. Each pixel in fmask together with its corresponding channels is called a RoW, i.e. the response of a candidate window, so fmask contains 17 × 17 RoWs in total and each RoW has dimension 1 × 1 × (63 × 63). Each pixel in the score map is the confidence of the corresponding RoW in fmask, and the RoW on fmask corresponding to the highest-confidence pixel in the score map is selected as the RoW used for final mask generation. A target mask is generated from the RoW using the Refinement module. The specific method for generating the target mask from the RoW using the Refinement module is as follows: (1) deconvolve (deconv) the selected RoW to obtain a second feature map of dimension 15 × 15 × 32; (2) additively couple the second feature map obtained by deconvolution with the intermediate feature maps obtained when the feature extraction network extracts the tracking target (template branch); (3) reduce the dimension of each intermediate feature map through a three-layer model so that it has the same number of channels, width and height as the RoW path, and add the feature maps; (4) finally obtain a third feature map of dimension 127 × 127 × 4 and pass it through the last convolutional layer to obtain a first matrix of dimension 127 × 127 × 1; (5) apply a sigmoid (binary classification) operation to the first matrix to obtain a mask matrix whose element values lie between 0 and 1 and indicate whether each pixel of the matrix generated from the RoW belongs to the mask, then map the mask matrix back to the original image by affine transformation, i.e. map the matrix back to the search area in the original image; (6) set a segmentation threshold (0.35 is chosen as the mask segmentation threshold in this embodiment) and convert the sigmoid matrix into a binary matrix, which yields the semantic segmentation information of the target, i.e. the target mask of the tracked target in the original image; the bounding box of the tracked target is then obtained as the minimum bounding rectangle of the target mask.
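For completeness, a minimal sketch of one densely connected layer realizing x_l = H_l([x_0, x_1, ..., x_{l-1}]) is given here, assuming a DenseNet-style composite function; the layer structure and growth rate are illustrative assumptions rather than the patent's exact network:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer H_l applied to the concatenation of all earlier feature maps."""
    def __init__(self, in_channels: int, growth_rate: int = 32):
        super().__init__()
        self.h = nn.Sequential(
            nn.BatchNorm2d(in_channels), nn.ReLU(),
            nn.Conv2d(in_channels, growth_rate, kernel_size=3, padding=1),
        )

    def forward(self, features):
        # features: list [x_0, x_1, ..., x_{l-1}] of earlier feature maps.
        return self.h(torch.cat(features, dim=1))  # x_l

# Usage: each new output is appended and fed to the next layer.
feats = [torch.randn(1, 64, 31, 31)]
layer = DenseLayer(in_channels=64)
feats.append(layer(feats))  # feats now holds [x_0, x_1]
```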
In the improved target tracking and segmentation system for a twin convolutional network provided by the embodiment of the invention, image features are extracted with a densely connected convolutional neural network, which improves the feature extraction capability of the network; a semantic segmentation branch and a score map branch are added to the basic twin convolutional neural network, which improves target tracking accuracy and realizes pixel-level tracking of the target; and a Refinement module is used to obtain a binary mask of the target in the image, which makes full use of the low-level feature information of the feature extraction network, improves semantic segmentation accuracy, and enables high-precision semantic tracking.
As shown in fig. 3, a schematic diagram of an intelligent terminal according to a third embodiment of the present invention is provided, where the terminal includes a processor, an input device, an output device, and a memory, where the processor, the input device, the output device, and the memory are connected to each other, the memory is used for storing a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method described in the first embodiment.
It should be understood that in the embodiments of the present invention, the processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input devices may include a touch pad, microphone, etc., and the output devices may include a display (LCD, etc.), speakers, etc.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. The portion of memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In a specific implementation, the processor, the input device, and the output device described in the embodiments of the present invention may execute the implementation described in the method embodiments provided in the embodiments of the present invention, and may also execute the implementation described in the system embodiments in the embodiments of the present invention, which is not described herein again.
The invention also provides an embodiment of a computer-readable storage medium, in which a computer program is stored, which computer program comprises program instructions that, when executed by a processor, cause the processor to carry out the method described in the above embodiment.
The computer readable storage medium may be an internal storage unit of the terminal described in the foregoing embodiment, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both; to clearly illustrate the interchangeability of hardware and software, the components and steps of the examples have been described above in general functional terms. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the terminal and the unit described above may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed terminal and method can be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present invention and not to limit it. While the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention and shall be covered by the claims and the description.

Claims (10)

1. An improved target tracking and segmentation method for a twin convolutional network, comprising:
acquiring input image information;
extracting input image features by adopting a densely connected convolutional neural network to obtain a target feature map and a tracking area feature map;
performing a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map;
after convolving the output feature map, sending it to a semantic segmentation branch and a score map branch respectively to obtain a first feature map and a score map, each pixel in the first feature map together with its corresponding channels being defined as a ROW, and each pixel in the score map being the confidence of the corresponding ROW in the first feature map;
and selecting, on the first feature map, the ROW corresponding to the pixel with the highest confidence in the score map as the ROW used for final generation of the mask matrix, and generating the target mask from the ROW with a Refinement module.
2. The improved target tracking and segmentation method for a twin convolutional network as set forth in claim 1, wherein the specific method for generating the target mask from the ROW using the Refinement module comprises the following steps:
deconvolving the ROW used for final generation of the mask matrix to obtain a second feature map, and additively coupling the second feature map with the intermediate feature maps obtained when the feature extraction network extracts the tracking target;
reducing the dimension of each intermediate feature map through a three-layer model so that it has the same number of channels, width and height as the ROW path, and adding the feature maps;
and passing the summed feature map through a final convolutional layer to obtain a first matrix, applying binary classification to the first matrix to obtain a mask matrix whose element values lie between 0 and 1, mapping the mask matrix back to the original image by affine transformation, binarizing the values between 0 and 1 in the mask matrix with a preset segmentation threshold to obtain the target mask of the tracked target in the original image, and obtaining the bounding box of the tracked target as the minimum bounding rectangle of the target mask.
3. The improved target tracking and segmentation method for a twin convolutional network of claim 1, wherein the semantic segmentation branch consists of a 1 × 1 convolutional layer and the score map branch consists of a 1 × 1 convolutional layer.
4. The improved target tracking and segmentation method for a twin convolutional network of claim 2 wherein said segmentation threshold is 0.35.
5. An improved target tracking and segmentation system for twin convolutional networks, comprising: an image acquisition module, an image feature extraction module, a cross-correlation module, a first analysis module and a second analysis module,
the image acquisition module is used for acquiring input image information;
the image feature extraction module adopts a densely connected convolutional neural network to extract the features of an input image to obtain a target feature map and a tracking area feature map;
the cross-correlation module performs a cross-correlation operation on the target feature map and the tracking area feature map to obtain an output feature map;
the first analysis module is used for convolving the output feature map and sending it to the semantic segmentation branch and the score map branch respectively to obtain a first feature map and a score map, each pixel in the first feature map together with its corresponding channels being defined as a ROW, and each pixel in the score map being the confidence of the corresponding ROW in the first feature map;
and the second analysis module is used for selecting, on the first feature map, the ROW corresponding to the pixel with the highest confidence in the score map as the ROW used for final generation of the mask matrix, a Refinement module being adopted to generate the target mask from the ROW.
6. The improved target tracking and segmentation system for a twin convolutional network of claim 5, wherein the specific method for generating the target mask from the ROW using the Refinement module comprises:
deconvolving the ROW used for final generation of the mask matrix to obtain a second feature map, and additively coupling the second feature map with the intermediate feature maps obtained when the feature extraction network extracts the tracking target;
reducing the dimension of each intermediate feature map through a three-layer model so that it has the same number of channels, width and height as the ROW path, and adding the feature maps;
and passing the summed feature map through a final convolutional layer to obtain a first matrix, applying binary classification to the first matrix to obtain a mask matrix whose element values lie between 0 and 1, mapping the mask matrix back to the original image by affine transformation, binarizing the values between 0 and 1 in the mask matrix with a preset segmentation threshold to obtain the target mask of the tracked target in the original image, and obtaining the bounding box of the tracked target as the minimum bounding rectangle of the target mask.
7. The improved target tracking and segmentation system for a twin convolutional network of claim 5, wherein the semantic segmentation branch consists of a 1 × 1 convolutional layer and the score map branch consists of a 1 × 1 convolutional layer.
8. The improved twin convolutional network target tracking and segmentation system of claim 6 wherein the segmentation threshold is 0.35.
9. An intelligent terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, the memory being adapted to store a computer program, the computer program comprising program instructions, characterized in that the processor is configured to invoke the program instructions to perform the method steps according to any of claims 1 to 4.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the method steps according to any one of claims 1 to 4.
CN202010716280.1A 2020-07-23 2020-07-23 Improved target tracking and segmentation method, system and medium for twin convolutional network Withdrawn CN111862159A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010716280.1A CN111862159A (en) 2020-07-23 2020-07-23 Improved target tracking and segmentation method, system and medium for twin convolutional network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010716280.1A CN111862159A (en) 2020-07-23 2020-07-23 Improved target tracking and segmentation method, system and medium for twin convolutional network

Publications (1)

Publication Number Publication Date
CN111862159A true CN111862159A (en) 2020-10-30

Family

ID=72949357

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010716280.1A Withdrawn CN111862159A (en) 2020-07-23 2020-07-23 Improved target tracking and segmentation method, system and medium for twin convolutional network

Country Status (1)

Country Link
CN (1) CN111862159A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541944A (en) * 2020-12-10 2021-03-23 山东师范大学 Probability twin target tracking method and system based on conditional variational encoder



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20201030)