CN112306675B - Data processing method, related device and computer readable storage medium - Google Patents

Data processing method, related device and computer readable storage medium

Info

Publication number
CN112306675B
Authority
CN
China
Prior art keywords
memory
memory blocks
multiplexing
blocks
operator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011084416.8A
Other languages
Chinese (zh)
Other versions
CN112306675A (en)
Inventor
陈国海
马海波
黄永明
尤肖虎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Network Communication and Security Zijinshan Laboratory
Original Assignee
Network Communication and Security Zijinshan Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Network Communication and Security Zijinshan Laboratory filed Critical Network Communication and Security Zijinshan Laboratory
Priority to CN202011084416.8A
Publication of CN112306675A
Application granted
Publication of CN112306675B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5011Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals
    • G06F9/5016Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resources being hardware resources other than CPUs, Servers and Terminals the resource being the memory

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method, related device, and computer readable storage medium in the field of data processing. The data processing method comprises the following steps: a memory multiplexing client cuts the output memory of an operator and establishes the correspondence between the plurality of memory blocks after cutting and the memory block before cutting; the memory multiplexing client sends a memory multiplexing request to a memory multiplexing server; the memory multiplexing client receives a response message from the memory multiplexing server; the memory multiplexing client sets one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory block before cutting and the relative offsets of the memory blocks in the response message. Because the input and output caches of operators are divided into smaller memory requirements, memory holes that arise during memory multiplexing are more easily filled, which reduces the total memory required by the input and output caches of the model operators in deep learning.

Description

Data processing method, related device and computer readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular, to a data processing method, related device, and computer readable storage medium capable of reducing memory occupation in edge computing.
Background
In recent years, with the rise of artificial intelligence, many deep learning frameworks have emerged (such as TensorFlow, PyTorch, Caffe, etc.) and various deep learning models are continuously being developed; deep learning has brought great convenience to people's lives in many industries, for example license plate recognition and online translation among various languages.
A deep learning neural network model is shown in fig. 1: the leftmost part is the input layer, the rightmost part is the output layer, and the middle parts can be understood as computing nodes organized into layers. The more nodes per layer and the deeper the hierarchy, the more complex the logic the deep learning model can express.
The data to be input to a computing node (also called an operator) must be stored before the operator computes on it; these buffers are called the operator's input caches. After the operator finishes computing, it must output data, and the buffers that store the output data are called the operator's output caches. The number of input and output caches is defined by the operator's function: for example, adding two matrices requires two input caches, while adding three input matrices requires three input caches. The size of a cache is determined by the amount of input data and the data type.
Because a large number of computing nodes take part in the computation and the amount of data to be processed is large, the storage required for the whole computation is also large, usually on the order of GB, which places strict demands on the computing system. To adapt deep learning to mobile phones, Google released TensorFlow Lite, which reduces the stored model size and the memory needed during computation through weight quantization (representing 32-bit weights with 8 bits). Likewise, input/output memory multiplexing of the deep learning model is needed to reduce the total memory requirement.
Meaning of memory multiplexing: because operators execute sequentially (there may be several parallel sequences of execution), one block of memory can serve as the input/output cache of multiple operators whose life cycles do not overlap; this is memory multiplexing.
Apache MXNet is a deep learning framework designed for efficiency and flexibility. It provides a simple and effective heuristic: first determine the number of memory blocks required, then calculate the size of each memory block, as follows.
The core idea is: variables (input/output caches) are allowed to share memory when their life cycles do not overlap. A reference counter is maintained for each memory block and each block is colored (the same color indicates the same memory block); when the reference count reaches 0, the memory block can be recycled into the memory block resource pool and used by other operators.
Referring to fig. 2, the process of determining the number of memory blocks needed is as follows. A is the input operator and needs independent memory (its output multiplexes its input); the output cache of each operator is the input cache of its successor operators, so the whole process of counting memory blocks only needs to consider the allocation of the operators' output caches.
Step 1: operator B needs one output cache, which will serve as the input cache of operators C and F; the memory pool is empty, so a new memory block (red) is allocated with a reference count of 2 (operators C and F will use it as their input cache);
Step 2: operator C is executed; it needs one output cache and the memory pool is empty, so a green memory block is allocated with a reference count of 1; meanwhile, the reference count of the red memory block is decreased by one (operator C has finished executing);
Step 3: operator F is executed; it needs one output cache and the memory pool is still empty, so a blue memory block is allocated with a reference count of 1;
Step 4: operator E is executed; the red memory block can serve as its output cache, so no new memory block needs to be requested. After operator E finishes executing, the reference count of operator C's output memory block (green) is decreased by one; the count is now 0, so the green block is put back into the memory block pool;
Step 5: operator G is executed; according to the operator's characteristics, its output memory can multiplex its input memory (namely the red memory block), so no new memory block needs to be allocated from the memory pool. After operator G finishes executing, the reference count of operator F's output cache block (blue) is decreased by one.
Therefore, leaving aside the initial input memory block, it can be inferred that executing the entire model requires 3 memory blocks: red, green, and blue.
Then, for each colored memory block, its size is set to the maximum output cache size among the operators that use it; the sum over all memory blocks is the total memory required during operator execution.
From the above it can be seen that there is a 1 MB gap between the green and red blocks, while the blue and violet life cycles do not overlap; part of the storage space is obviously wasted.
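To make the prior-art heuristic concrete, the following is a minimal C++ sketch of the reference-counting idea described above. The toy graph, the structure names and the allocation order (the output block is allocated before the inputs are released) are assumptions made only for illustration; this is not MXNet's actual code, and the in-place reuse of an input block by an operator's output (as operator G does above) is not modeled.

#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Count how many reusable memory blocks a sequential execution order needs,
// using per-block reference counts and a pool of freed blocks.
struct Node {
    std::string name;
    std::vector<std::string> inputs;  // producers whose outputs this operator reads
    int consumers;                    // how many later operators read this output
};

int main() {
    // Toy graph, roughly the B/C/F/E/G example of fig. 2 (assumed for illustration).
    std::vector<Node> order = {
        {"B", {}, 2}, {"C", {"B"}, 1}, {"F", {"B"}, 1}, {"E", {"C"}, 0}, {"G", {"F"}, 0},
    };
    std::vector<int> pool;               // ids of blocks whose reference count reached 0
    std::map<std::string, int> blockOf;  // operator name -> id of its output block
    std::map<int, int> refCount;         // block id -> remaining readers
    int blocksAllocated = 0;

    for (const Node& op : order) {
        // Allocate the output block first, reusing a recycled block if one exists.
        int blk;
        if (!pool.empty()) { blk = pool.back(); pool.pop_back(); }
        else               { blk = blocksAllocated++; }
        blockOf[op.name] = blk;
        refCount[blk] = op.consumers;
        // Executing the operator consumes one pending read of each input block.
        for (const std::string& in : op.inputs) {
            int ib = blockOf.at(in);
            if (--refCount.at(ib) == 0) pool.push_back(ib);  // recycle into the pool
        }
    }
    std::printf("memory blocks required: %d\n", blocksAllocated);  // prints 3
    return 0;
}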
Disclosure of Invention
In order to solve the problem that existing data processing occupies a large amount of memory during operation, the invention provides a method and related device capable of reducing memory occupation in edge computing.
To achieve the above object, a first aspect of the present invention provides a data processing method, including:
the memory multiplexing client cuts the output memory of the operator and establishes the corresponding relation between a plurality of memory blocks after cutting and the memory blocks before cutting;
the memory multiplexing client sends a memory multiplexing request to a memory multiplexing server; wherein the memory multiplexing request comprises a memory block list to be multiplexed;
The memory multiplexing client receives a response message of the memory multiplexing server; the response message comprises the relative offset of the memory blocks to be arranged;
The memory multiplexing client sets one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and sets the offset and size of one or more memory blocks of the successor operator.
Optionally, the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
Optionally, the memory multiplexing request further includes a message type and a request identifier; the response message also includes a message type and a request identification.
In the above data processing method, optionally, the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server arranging the memory blocks in the memory block list according to the information in the memory block list.
In a second aspect, the present invention provides a memory multiplexing client, including:
The first processing unit is used for cutting the output memory of the operator and establishing the corresponding relation between a plurality of memory blocks after cutting and the memory blocks before cutting;
the first sending unit is used for sending a memory multiplexing request to the memory multiplexing server; wherein the memory multiplexing request comprises a memory block list to be multiplexed;
the first receiving unit is used for receiving the response message of the memory multiplexing server; the response message comprises the relative offset of the memory blocks to be arranged;
the second processing unit is used for setting one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and setting the offset and size of one or more memory blocks of the successor operator.
In the memory multiplexing client, optionally, the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
In the above memory multiplexing client, optionally, the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server arranging the memory blocks in the memory block list according to the information in the memory block list.
In a third aspect, the present invention provides a memory multiplexing server, including:
The second receiving unit is used for receiving the memory multiplexing request sent by the memory multiplexing client; wherein the memory multiplexing request comprises a memory block list to be multiplexed;
The third processing unit is used for arranging the memory blocks in the memory block list according to the information in the memory block list, and ensuring that the memory blocks are not overlapped so as to determine the relative offset of the memory blocks to be arranged;
And the second sending unit is used for sending a response message, wherein the response message comprises the relative offset of the memory blocks to be arranged.
In the above memory multiplexing server, optionally, the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
In the above memory multiplexing server, optionally, the memory multiplexing client is configured to cut an output memory of the operator, and establish a correspondence between a plurality of memory blocks after cutting and a memory block before cutting.
In the above memory multiplexing server, optionally, the memory multiplexing client is further configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and to set the offset and size of one or more memory blocks of the successor operator.
Compared with the prior art, the beneficial effects of the invention are as follows: the input and output caches of operators are divided into smaller memory requirements, so the memory holes that arise during memory multiplexing are more easily filled, which reduces the total memory required by the input and output caches of the model operators in deep learning. Memory multiplexing leaves fewer holes and is more efficient, less total memory is needed, less data is exchanged between the cache and DDR main memory, and the CPU spends less time waiting, which helps improve CPU utilization.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a model diagram of a deep neural network;
FIG. 2 is a flow chart of a prior art deep neural network data processing;
FIG. 3 is a flow chart of a data processing method of the present invention;
FIG. 4 is a schematic diagram of memory multiplexing in the present invention;
FIG. 5 is a schematic diagram of a positional constraint relationship;
FIGS. 6 and 7 are schematic diagrams of Concat operator data processing;
FIG. 8 is a schematic diagram of a non-overlapping algorithm;
FIG. 9 is a schematic diagram of address mapping;
FIGS. 10 and 11 are data processing flow diagrams for an edge computing device in an offline mode;
FIG. 12 is a flow chart of the data processing method of the present invention as applied in the APP of the MEP or MEC or in dedicated hardware;
FIG. 13 is a schematic diagram of cutting the output memory of an operator;
FIG. 14 is a flow chart of cutting the output memory of an operator;
FIG. 15 is a layout of a diced memory in accordance with the present invention;
FIG. 16 is a flow chart of the application of the data processing method in the mobile device of the present invention;
FIG. 17 is a flow chart of the data processing method of the present invention as applied on a (cloud) server;
FIG. 18 is a block diagram of a memory multiplexing client according to the present invention;
fig. 19 is a block diagram of a memory multiplexing server according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 3, the present embodiment provides a data processing method, including the steps of:
Step 1, the memory multiplexing client (hereinafter referred to as the client) cuts the output memory of the operator: memory that can be cut is cut according to the minimum cuttable memory block, and the correspondence between the plurality of memory blocks after cutting and the memory block before cutting is established. The client outputs a memory block list to be multiplexed, which includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged;
For example, as shown in fig. 4, operator B fetches a 1k x 1k matrix of 8-byte data from the input cache each time, i.e. 1MB of data per fetch, so the required 4MB cache can be divided into 4 memory blocks to be arranged of 1MB each. Cache B (before cutting) corresponds to 4 memory blocks after cutting: cache B1, cache B2, cache B3 and cache B4.
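A minimal C++ sketch of this cutting step is given below; the structure and function names (CutBlock, cutBuffer) are assumptions used only for illustration and are not part of the patent.

#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Cut one operator output buffer into blocks of at most minCutSize bytes and
// record the correspondence between each cut block and the buffer before cutting.
struct CutBlock {
    std::string id;              // identifier of the block to be arranged, e.g. "B1"
    std::size_t size;            // block size in bytes
    std::string parent;          // identifier of the buffer before cutting, e.g. "B"
    std::size_t offsetInParent;  // position of this block inside the parent buffer
};

std::vector<CutBlock> cutBuffer(const std::string& parent,
                                std::size_t totalSize,
                                std::size_t minCutSize) {
    std::vector<CutBlock> blocks;
    std::size_t off = 0;
    for (int i = 1; off < totalSize; ++i) {
        std::size_t sz = std::min(minCutSize, totalSize - off);
        blocks.push_back({parent + std::to_string(i), sz, parent, off});
        off += sz;
    }
    return blocks;
}

// cutBuffer("B", std::size_t(4) << 20, std::size_t(1) << 20) yields B1..B4 of
// 1MB each, matching the cache B example of fig. 4.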
It should be noted that, as shown in fig. 5, the position constraint relation refers to the requirement that certain memory blocks must be stored contiguously. This generally arises when model operators are optimized; for example, a Concat operator combines the data of 2 or more memory blocks into one large block of data, and if its inputs are stored contiguously the Concat operator can be optimized away, saving time during deep learning inference.
As shown in fig. 6, the Concat operator needs to read the data in cache A and cache B and write it into a new cache C while keeping the relative positions of the data unchanged. If cache A and cache B are stored directly one after the other, as shown in fig. 7, cache C can be omitted: no data needs to be read and rewritten, which saves a large number of operations and improves efficiency.
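The following C++ fragment illustrates why this position constraint pays off; it is an illustrative sketch under assumed types, not framework code. When the layout guarantees that cache B immediately follows cache A, the Concat result is simply a view over the two inputs and the copy disappears.

#include <cstddef>
#include <cstring>
#include <vector>

struct Buf { float* data; std::size_t count; };

// Naive Concat: allocate a new cache C and copy A, then B, into it.
std::vector<float> concatCopy(const Buf& a, const Buf& b) {
    std::vector<float> c(a.count + b.count);
    std::memcpy(c.data(), a.data, a.count * sizeof(float));
    std::memcpy(c.data() + a.count, b.data, b.count * sizeof(float));
    return c;
}

// Copy-free Concat: valid only when the layout placed B directly after A
// (the position constraint carried in the memory block list).
Buf concatView(const Buf& a, const Buf& b) {
    // precondition: b.data == a.data + a.count, guaranteed by the memory layout
    return Buf{a.data, a.count + b.count};
}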
Step 2, the client sends a memory multiplexing request to the memory multiplexing server (hereinafter referred to as the server). The request includes a memory block list to be multiplexed, a message type and a request identifier; the memory block list is mandatory and the remaining fields are optional. Each list item includes the memory block ID, the memory block size, the life cycle start and end values, the position constraints relative to other memory blocks, and so on.
It should be appreciated that the memory block identifier, the request identifier and the message type may each be numeric, text, or numeric + text.
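One possible way to represent the request of step 2 and the response of step 3 is sketched below in C++. The field and type names are assumptions chosen for illustration, since the patent only specifies which pieces of information are carried, not their exact encoding.

#include <cstdint>
#include <string>
#include <vector>

// One memory block to be arranged, as carried in the request's block list.
struct BlockToArrange {
    std::string id;            // memory block identifier, e.g. "B1"
    std::uint64_t size;        // size in bytes after cutting
    int lifeBegin;             // first execution step in which the block is live
    int lifeEnd;               // last execution step in which the block is live
    // Position constraint: ids of blocks that must be stored contiguously with
    // this one (e.g. the inputs of an optimized Concat operator).
    std::vector<std::string> contiguousWith;
};

struct MemReuseRequest {                   // client -> server
    int messageType;                       // optional, e.g. an enum value for "reuse request"
    std::uint32_t requestId;               // optional
    std::vector<BlockToArrange> blocks;    // mandatory memory block list
};

struct PlacedBlock {
    std::string id;
    std::uint64_t relativeOffset;          // offset of the block inside the shared arena
};

struct MemReuseResponse {                  // server -> client
    int messageType;
    std::uint32_t requestId;
    std::vector<PlacedBlock> placements;   // relative offsets of the arranged blocks
};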
Step 3, after receiving the request, the server determines the offset and size of each memory block in the memory block list using a non-overlapping algorithm, and then sends a response message to the client. The response message includes the relative offsets of the memory blocks to be arranged, together with the message type and message identifier (optional).
The non-overlapping algorithm is illustrated in fig. 8, where 1 is memory that has already been laid out and 2 is the block being placed, shown in 4 candidate positions A/B/C/D: A and C do not overlap the laid-out memory in memory space, while B and D do not overlap it in life cycle.
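A simple first-fit variant of such a non-overlapping layout is sketched below in C++. It only enforces the two conditions illustrated in fig. 8 (blocks may share an address range only if their life cycles do not overlap) and omits the contiguity constraints for brevity; the server's actual algorithm may differ.

#include <cstdint>
#include <vector>

// One block to be placed: size, lifetime, and the resulting relative offset.
struct Blk {
    std::uint64_t size;
    int lifeBegin, lifeEnd;       // inclusive lifetime in execution steps
    std::uint64_t offset = 0;     // result: relative offset in the shared arena
};

static bool livesOverlap(const Blk& a, const Blk& b) {
    return a.lifeBegin <= b.lifeEnd && b.lifeBegin <= a.lifeEnd;
}

static bool addrsOverlap(std::uint64_t o1, std::uint64_t s1,
                         std::uint64_t o2, std::uint64_t s2) {
    return o1 < o2 + s2 && o2 < o1 + s1;
}

// First-fit layout: each block starts at offset 0 and is pushed upward past any
// already placed block that conflicts with it in both lifetime and address range.
void layout(std::vector<Blk>& blocks) {
    std::vector<const Blk*> placed;
    for (Blk& b : blocks) {
        std::uint64_t off = 0;
        bool moved = true;
        while (moved) {
            moved = false;
            for (const Blk* p : placed) {
                if (livesOverlap(b, *p) && addrsOverlap(off, b.size, p->offset, p->size)) {
                    off = p->offset + p->size;   // place just above the conflicting block
                    moved = true;
                }
            }
        }
        b.offset = off;
        placed.push_back(&b);
    }
}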
Step 4, after receiving the response message, the client sets one or more offsets of the operator output memory by referring to the correspondence between the memory blocks before and after cutting and the offsets of the memory blocks in the response message, and sets the offset and size of one or more memory blocks of the successor operator.
An example is as follows: the response message contains the offsets of 4 memory blocks { {B1,0x00000000}, {B2,0x00100000}, {B3,0x00200000}, {B4,0x00300000} }. The client looks up the correspondence and finds that it belongs to cache B, each block after division is 1MB, cache B corresponds to the first output of operator A, and its successor is the second input of operator B.
First setting method: the first output of operator A and the second input of operator B are each given 4 offsets and sizes { {1,0x00000000,0x100000}, {2,0x00100000,0x100000}, {3,0x00200000,0x100000}, {4,0x00300000,0x100000} }.
Second setting method: since the 4 memory blocks happen to be contiguous, only one offset and size needs to be set; the first output of operator A and the second input of operator B are each given 1 offset and size { {1,0x00000000,0x400000} }. If the divided memory blocks are not placed at consecutively increasing addresses, a read/write address mapping table must be established: each read/write address is examined and the appropriate offset value is added to it, as in the following example.
Suppose the data in the server's response is as follows: { {1,0x00800000,0x100000}, {2,0x00700000,0x100000}, {3,0x00600000,0x100000}, {4,0x00A00000,0x100000} }.
Operator data read/write: because reads and writes are spread over several memory blocks, the operator must perform an address conversion when reading and writing data, which is easy to do. Taking the 4 split memory blocks above as an example, the conversion relationship shown in fig. 9 is established: accesses in the range [0, 0x100000) map to the block starting at 0x00800000, accesses in the range [0x100000, 0x200000) map to the block starting at 0x00700000, and so on; the address of each memory access is its offset within the 1MB slice plus the start address of the corresponding memory block.
Adding this offset value is a very low-overhead operation, negligible compared with the multiplication and addition operations of matrix-type operators.
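A small C++ sketch of such a read/write address mapping table is given below; the structure names are illustrative assumptions, and the example table in the comment reproduces the response values above.

#include <cstdint>
#include <vector>

// One entry of the read/write address mapping table: a logical slice of the
// operator's buffer and the physical start address assigned to that slice.
struct SliceMap {
    std::uint64_t logicalBegin;   // start of the slice in the operator's logical addressing
    std::uint64_t logicalEnd;     // one past the end of the slice
    std::uint64_t physicalBegin;  // where the slice actually starts in memory
};

// Translate a logical offset into the physical address to access.
std::uint64_t translate(const std::vector<SliceMap>& table, std::uint64_t logical) {
    for (const SliceMap& s : table)
        if (logical >= s.logicalBegin && logical < s.logicalEnd)
            return s.physicalBegin + (logical - s.logicalBegin);
    return logical;  // not a remapped range
}

// Table for the 4 x 1MB split above, reproducing the example response:
// { {0x000000, 0x100000, 0x00800000},
//   {0x100000, 0x200000, 0x00700000},
//   {0x200000, 0x300000, 0x00600000},
//   {0x300000, 0x400000, 0x00A00000} }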
Referring to fig. 10, in actual use an offline execution model can be generated by the method of the present invention outside the edge computing device (the relative offsets of the memory blocks of each input and output memory have already been calculated and stored in the offline model file).
Referring to fig. 11, the size and number of memory blocks into which a buffer can be divided can be set according to the data type, the number of accesses the operator needs to make, and the amount of data involved in one basic operation. The memory multiplexing client obtains a reasonable division instruction from the operator corresponding to the request, divides the memory blocks accordingly, and then calculates the relative offsets with the memory multiplexing algorithm.
With this data processing method, the input and output caches of operators are divided into smaller memory requirements, so the memory holes that arise during memory multiplexing are more easily filled, which reduces the total memory required by the input and output caches of the model operators in deep learning. Compared with the prior art, memory multiplexing leaves fewer holes and is more efficient, less total memory is needed, less data is exchanged between the cache and DDR main memory, and the CPU spends less time waiting, which helps improve CPU utilization.
The foregoing describes the principle of the data processing method of the present application; the method is further described below with reference to specific applications. It should be noted that the client and the server in the present application may each be any of the following: a cloud server, a virtual machine, a container, a process, a thread, a function, or a code block. The method may be implemented on a personal mobile device, a computing cloud, or a multi-access edge computing device; alternatively, an offline computing model may be generated on one device and the model run on a second device, in which case the input and output caches of the operators in the model are segmented and the memory multiplexing algorithm is applied when the offline computing model is generated.
Case 1: the invention is illustrated on multi-access edge computing.
This embodiment uses the invention in an APP of the MEP or MEC (multi-access/mobile edge computing) platform or in dedicated hardware.
As shown in fig. 12, the client and the server in the APP may be two threads or two functions within one thread; they are described below as two functions.
Referring to fig. 13, function A cuts the input memory when called; assume the cut granularity is 1 MB. Function A needs to allocate storage space for 4 memory blocks, M1 (2 MB), M2 (1 MB), M3 (1 MB) and M4 (2 MB), where M2, M3 and M4 must be stored contiguously and the offset of M2 must be greater than the offset of M3.
Referring to fig. 14, assume at the same time that M1 is the output of operator A and the second input of operator B.
Step 101: function A divides M1 into M1_1 (1 MB) and M1_2 (1 MB); the M1 memory block corresponds to M1_1 and M1_2. Function A constructs the request message from its function parameters, where the memory block list is shown in Table 1:
TABLE 1
Step 102: function A calls function B. The memory block list can be stored in a C++ Vector or List (see Table 1 for the specific content); the optional parameter request identifier is omitted, and the optional parameter message type is carried (an enumeration value or the like representing a memory multiplexing request).
Step 103: function B lays out the memory blocks according to the memory block list in the parameters, ensuring that the memory blocks do not overlap, and then returns the layout result in MEMRESVEC. For the memory block list above, one possible layout result is shown on the right side of fig. 15.
It can be seen that M1_1 and M1_2 are not contiguous and memory block M2 is placed between them; a total of 6MB of memory is required. If the M1 memory were not divided, the layout result might require 7MB of memory, as shown on the left side of the figure.
The result returned for the right-hand layout is as follows: {{M4,0x00000000,0x00200000},{M3,0x00200000,0x00100000},{M1_1,0x00200000,0x00100000},{M2,0x00300000,0x00200000},{M1_2,0x00500000,0x00100000}}.
Step 104: after function A's call to function B returns, function A sets the offsets and sizes of the memory blocks of operator A's output cache and operator B's second input cache by referring to the memory correspondence before and after segmentation. For simplicity, only the output cache of operator A and the second input cache of operator B are described here.
The output cache of operator A is set to { {out,1,0x00200000,0x00100000}, {out,1,0x00500000,0x00100000} }, and the second input cache of operator B is set to { {in,2,0x00200000,0x00100000}, {in,2,0x00500000,0x00100000} }. Here in/out indicates whether the entry is an operator input or an output, and the second number indicates which input or output of the operator it is. After the operator configuration is complete, data is written to or read from the designated memory block by adding the corresponding offset value to the data's own offset.
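For illustration, the configuration entries above can be modeled with a small structure such as the following; the names are assumptions, not part of the patent.

#include <cstdint>
#include <vector>

struct IoSegment {
    bool isInput;          // true for an operator input, false for an output
    int index;             // which input or output of the operator (1-based)
    std::uint64_t offset;  // relative offset assigned by the memory multiplexing server
    std::uint64_t size;    // segment size in bytes
};

// Operator A, first output (out,1): two 1MB segments at 0x00200000 and 0x00500000.
const std::vector<IoSegment> outA = { {false, 1, 0x00200000, 0x00100000},
                                      {false, 1, 0x00500000, 0x00100000} };
// Operator B, second input (in,2): the same two segments, so B reads what A wrote.
const std::vector<IoSegment> inB  = { {true, 2, 0x00200000, 0x00100000},
                                      {true, 2, 0x00500000, 0x00100000} };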
Specific case two: the invention is illustrated on a mobile device.
The invention can be used in consumer products, such as notebook computers, tablet computers or mobile phones.
Referring to fig. 16, the client and server in the consumer product may be two threads or two functions within one thread; they are described below as two threads.
Referring to fig. 13, thread A cuts the input memory when called; assume the cut granularity is 1MB. Thread A has 4 memory blocks that require space allocation, M1 (2 MB), M2 (1 MB), M3 (1 MB) and M4 (2 MB), where M2, M3 and M4 must be stored contiguously and the offset of M2 must be greater than the offset of M3.
Referring to fig. 14, assume at the same time that M1 is the output of operator A and the second input of operator B.
Step 201: thread A divides M1 into M1_1 (1 MB) and M1_2 (1 MB); the M1 memory block corresponds to M1_1 and M1_2. Thread A sends a memory multiplexing request message via inter-thread communication, where the memory block list is shown in Table 2:
TABLE 2
Step 202: thread a communicates a request message to thread B via inter-thread communication. The memory block List can be stored by a Vector or a List in C++, the specific content refers to a table 2, the optional parameter is a request identifier omitting, and the optional parameter message type (request memory multiplexing is represented by an enumeration value and the like). The structured data is formed into byte stream to be transmitted by a serialization method, and then the data before serialization is obtained by deserialization at a receiving side.
Step 203: thread B lays out the memory blocks according to the memory block list in the message, ensuring that the memory blocks do not overlap, serializes the layout result with MEMRESVEC, and then sends it to thread A via inter-thread communication. For the memory block list above, one possible layout result is shown on the right side of fig. 15.
It can be seen that M1_1 and M1_2 are not contiguous and memory block M2 is placed between them; a total of 6MB of memory is required. If the M1 memory were not divided, the layout result might require 7MB of memory, as shown on the left side of the figure.
The result returned for the right-hand layout is as follows: {{M4,0x00000000,0x00200000},{M3,0x00200000,0x00100000},{M1_1,0x00200000,0x00100000},{M2,0x00300000,0x00200000},{M1_2,0x00500000,0x00100000}}.
Step 204: after thread A obtains thread B's response message, it sets the offsets and sizes of the memory blocks of operator A's output cache and operator B's second input cache by referring to the memory correspondence before and after segmentation. For simplicity, only the output cache of operator A and the second input cache of operator B are described here.
The output cache of operator A is set to { {out,1,0x00200000,0x00100000}, {out,1,0x00500000,0x00100000} }, and the second input cache of operator B is set to { {in,2,0x00200000,0x00100000}, {in,2,0x00500000,0x00100000} }. Here in/out indicates whether the entry is an operator input or an output, and the second number indicates which input or output of the operator it is. After the operator configuration is complete, data is written to or read from the designated memory block by adding the corresponding offset value to the data's own offset.
Specific case three: the invention is illustrated on a (cloud) server.
The invention is used on (cloud) servers, including virtual machines, containers, or non-virtualized servers.
Referring to fig. 17, the client and the server may be two threads (or processes) or two functions within one thread; they are described below as two processes.
Referring to fig. 13, process A cuts the input memory when invoked; assume the cut granularity is 1 MB. Process A has 4 memory blocks that require space allocation, M1 (2 MB), M2 (1 MB), M3 (1 MB) and M4 (2 MB), where M2, M3 and M4 must be stored contiguously and the offset of M2 must be greater than the offset of M3.
As shown in fig. 14, assume at the same time that M1 is the output of operator A and the second input of operator B.
Step 301: process A divides M1 into M1_1 (1 MB) and M1_2 (1 MB); the M1 memory block corresponds to M1_1 and M1_2. Process A sends a memory multiplexing request message via inter-process communication, where the memory block list is shown in Table 3:
TABLE 3 Table 3
Step 302: process a communicates a request message to thread B via inter-process communication. The memory block List can be stored by a Vector or a List in C++, the specific content refers to a table 3, the optional parameter is a request identifier omitting, and the optional parameter message type (request memory multiplexing is represented by an enumeration value and the like). The structured data is formed into byte stream to be transmitted by a serialization method, and then the data before serialization is obtained by deserialization at a receiving side.
Step 303: and the process B lays out the memory blocks according to the memory block list in the message to ensure that the memory blocks are not overlapped, uses MEMRESVEC the layout result to sequence, and sends the result to the process A through inter-process communication. One possible placement result is shown on the right side of fig. 15 for the memory block list described above.
It can be seen that m1_1 and m1_2 are discontinuous, placing memory block M2 between them, requiring a total of 6MB of memory; if the M1 memory is not partitioned, the layout result may require 7MB of memory as shown on the left side of the figure.
The results returned to the right are as follows {{M4,0x00000000,0x00200000},{M3,0x00200000,0x00100000},{M1_1,0x00200000,0x00100000},{M2,0x00300000,0x00200000},{M1_2,0x00500000,0x00100000}}.
Step 304: after the process A obtains the process B response message, the offset and the size of a plurality of memory blocks of the output buffer of the operator A and the second input buffer of the operator B are set by referring to the memory correspondence before and after segmentation. Only the output buffer of operator a and the second input buffer of operator B are described here for simplicity.
The output buffer of operator a { { { out,1,0x00200000,0x00100000}, { out,1,0x00500000,0x00100000}, the second input buffer of operator B sets { { { in,2,0x00200000,0x00100000}, { in,2,0x00500000,0x00100000 }. Where in/out indicates whether the operator is input or not, and the second number indicates which number of inputs and outputs the operator is. After operator configuration is completed, data is written into the appointed memory block or read out from the appointed memory block for processing according to the offset of the data plus the corresponding offset value.
With the data processing method of this embodiment, the input and output caches of operators are divided into smaller memory requirements, so the memory holes that arise during memory multiplexing are more easily filled, which reduces the total memory required by the input and output caches of the model operators in deep learning. Compared with the prior art, memory multiplexing leaves fewer holes and is more efficient, less total memory is needed, less data is exchanged between the cache and DDR main memory, and the CPU spends less time waiting, which helps improve CPU utilization.
In some embodiments, the present invention further provides a memory multiplexing client, as shown in fig. 18, where the memory multiplexing client includes:
The first processing unit 101 is configured to cut an output memory of the operator, and establish a correspondence between a plurality of memory blocks after cutting and a memory block before cutting; the specific processing procedure is already described in detail in the step 1 of the data processing method, and will not be described herein.
A first sending unit 102, configured to send a memory multiplexing request to the memory multiplexing server; wherein the memory multiplexing request includes a memory block list to be multiplexed, and the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged. The specific processing procedure has already been described in detail in step 2 of the data processing method and is not repeated here.
A first receiving unit 103, configured to receive a response message of the memory multiplexing server; the response message comprises the relative offset of the memory blocks to be arranged; and the relative offset of the memory blocks to be arranged is determined by arranging the memory blocks in the memory block list by the memory multiplexing server and ensuring that the memory blocks are not overlapped. The specific processing procedure is already described in detail in the step 3 of the data processing method, and will not be described herein.
The second processing unit 104 is configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and to set the offset and size of one or more memory blocks of the successor operator. The specific processing procedure has already been described in detail in step 4 of the data processing method and is not repeated here.
In still other embodiments, the present invention provides a memory multiplexing server, as shown in fig. 19, which includes:
A second receiving unit 201, configured to receive the memory multiplexing request sent by the memory multiplexing client; wherein the memory multiplexing request includes a memory block list to be multiplexed, and the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
A third processing unit 202, configured to allocate memory blocks in the memory block list according to the information in the memory block list, and ensure that the memory blocks are not overlapped so as to determine a relative offset of the memory blocks to be allocated; the specific data processing procedure is already described in detail in step 3 of the data processing method, and will not be described here again.
A second sending unit 203, configured to send a response message, where the response message includes a relative offset of the memory blocks to be arranged.
In addition, the memory multiplexing client is used for cutting the output memory of the operator and establishing the corresponding relation between the plurality of memory blocks after cutting and the memory blocks before cutting.
In addition, the memory multiplexing client is further configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and to set the offset and size of one or more memory blocks of the successor operator.
In addition, an embodiment of the present invention further provides a computer readable storage medium, where the computer readable storage medium may store a program, where the program when executed includes some or all of the steps of any one of the data processing methods described in the foregoing method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present invention may be embodied essentially or partly in the form of a software product, or all or part of the technical solution, which is stored in a memory, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned memory includes: a usb disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be implemented by a program that instructs associated hardware, and the program may be stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), magnetic disk or optical disk.
An exemplary flow of the data processing method according to an embodiment of the present invention is described above with reference to the accompanying drawings. It should be noted that the numerous details included in the above description are merely illustrative of the invention and not limiting of the invention. In other embodiments of the invention, the method may have more, fewer, or different steps, and the order, inclusion, and functional relationship between the steps may differ from what is described and illustrated.

Claims (12)

1. A method of data processing, comprising:
the memory multiplexing client cuts the output memory of the operator and establishes the corresponding relation between a plurality of memory blocks after cutting and the memory blocks before cutting;
the memory multiplexing client sends a memory multiplexing request to a memory multiplexing server; wherein the memory multiplexing request comprises a memory block list to be multiplexed;
The memory multiplexing client receives a response message of the memory multiplexing server; the response message comprises the relative offset of the memory blocks to be arranged;
The memory multiplexing client sets one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and sets the offset and size of one or more memory blocks of the successor operator.
2. The data processing method according to claim 1, characterized in that: the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
3. A data processing method according to claim 1, characterized in that: the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server according to the information in the memory block list to arrange the memory blocks in the memory block list.
4. A data processing method according to claim 1, characterized in that: the memory multiplexing request also comprises a message type and a request identifier; the response message also includes a message type and a request identification.
5. A memory multiplexing client, comprising:
The first processing unit is used for cutting the output memory of the operator and establishing the corresponding relation between a plurality of memory blocks after cutting and the memory blocks before cutting;
the first sending unit is used for sending a memory multiplexing request to the memory multiplexing server; wherein the memory multiplexing request comprises a memory block list to be multiplexed;
the first receiving unit is used for receiving the response message of the memory multiplexing server; the response message comprises the relative offset of the memory blocks to be arranged;
the second processing unit is used for setting one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and setting the offset and size of one or more memory blocks of the successor operator.
6. The memory multiplexing client of claim 5, wherein: the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
7. The memory multiplexing client of claim 5, wherein: the relative offset of the memory blocks to be arranged is determined by the memory multiplexing server according to the information in the memory block list to arrange the memory blocks in the memory block list.
8. A memory multiplexing server, comprising:
The second receiving unit is used for receiving the memory multiplexing request sent by the memory multiplexing client; wherein the memory multiplexing request comprises a memory block list to be multiplexed;
The third processing unit is used for arranging the memory blocks in the memory block list according to the information in the memory block list, and ensuring that the memory blocks are not overlapped so as to determine the relative offset of the memory blocks to be arranged;
And the second sending unit is used for sending a response message, wherein the response message comprises the relative offset of the memory blocks to be arranged.
9. The memory multiplexing server of claim 8, wherein: the memory block list includes the sizes of the memory blocks to be arranged after cutting, the identifiers of the memory blocks to be arranged, the lifetime start and end times of the memory blocks to be arranged, and the position constraint relations of the memory blocks to be arranged.
10. The memory multiplexing server of claim 8, wherein: the memory multiplexing client is used for cutting the output memory of the operator and establishing the corresponding relation between a plurality of memory blocks after cutting and the memory blocks before cutting.
11. The memory multiplexing server of claim 8, wherein: the memory multiplexing client is further configured to set one or more offsets of the operator output memory according to the correspondence between the memory blocks after cutting and the memory blocks before cutting and the relative offsets of the memory blocks in the response message, and to set the offset and size of one or more memory blocks of the successor operator.
12. A computer-readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of a data processing method according to any one of claims 1 to 3.
CN202011084416.8A 2020-10-12 2020-10-12 Data processing method, related device and computer readable storage medium Active CN112306675B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011084416.8A CN112306675B (en) 2020-10-12 2020-10-12 Data processing method, related device and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011084416.8A CN112306675B (en) 2020-10-12 2020-10-12 Data processing method, related device and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN112306675A CN112306675A (en) 2021-02-02
CN112306675B true CN112306675B (en) 2024-06-04

Family

ID=74488411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011084416.8A Active CN112306675B (en) 2020-10-12 2020-10-12 Data processing method, related device and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN112306675B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113608881B (en) * 2021-10-09 2022-02-25 腾讯科技(深圳)有限公司 Memory allocation method, device, equipment, readable storage medium and program product

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309767A (en) * 2012-03-08 2013-09-18 阿里巴巴集团控股有限公司 Method and device for processing client log
CN108780656A (en) * 2016-03-10 2018-11-09 美光科技公司 Device and method for logic/memory device
CN110597616A (en) * 2018-06-13 2019-12-20 华为技术有限公司 Memory allocation method and device for neural network
CN110766135A (en) * 2019-10-15 2020-02-07 北京芯启科技有限公司 Method for storing required data when optimizing operation function of neural network in any depth
CN111105018A (en) * 2019-10-21 2020-05-05 深圳云天励飞技术有限公司 Data processing method and device
CN111401532A (en) * 2020-04-28 2020-07-10 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator and acceleration method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10936487B2 (en) * 2018-03-12 2021-03-02 Beijing Horizon Information Technology Co., Ltd. Methods and apparatus for using circular addressing in convolutional operation

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309767A (en) * 2012-03-08 2013-09-18 阿里巴巴集团控股有限公司 Method and device for processing client log
CN108780656A (en) * 2016-03-10 2018-11-09 美光科技公司 Device and method for logic/memory device
CN110597616A (en) * 2018-06-13 2019-12-20 华为技术有限公司 Memory allocation method and device for neural network
CN110766135A (en) * 2019-10-15 2020-02-07 北京芯启科技有限公司 Method for storing required data when optimizing operation function of neural network in any depth
CN111105018A (en) * 2019-10-21 2020-05-05 深圳云天励飞技术有限公司 Data processing method and device
CN111401532A (en) * 2020-04-28 2020-07-10 南京宁麒智能计算芯片研究院有限公司 Convolutional neural network reasoning accelerator and acceleration method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
面向组合逻辑的DNA计算 (DNA Computing for Combinational Logic); 张川 et al.; 《中国科学》 (Scientia Sinica); Vol. 49, No. 7; pp. 819-837 *

Also Published As

Publication number Publication date
CN112306675A (en) 2021-02-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant