CN116010301B - Mapping method and device from data stream to DMA configuration, storage medium and DLA - Google Patents


Info

Publication number
CN116010301B
Authority
CN
China
Prior art keywords
data
dma
data stream
dimension
level
Prior art date
Legal status
Active
Application number
CN202211517576.6A
Other languages
Chinese (zh)
Other versions
CN116010301A (en)
Inventor
潘佳诚
孙铁力
张亚林
Current Assignee
Shanghai Suiyuan Technology Co ltd
Original Assignee
Shanghai Enflame Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shanghai Enflame Technology Co ltd
Priority to CN202211517576.6A
Publication of CN116010301A
Application granted
Publication of CN116010301B
Legal status: Active
Anticipated expiration

Classifications

    • Y: General tagging of new technological developments; general tagging of cross-sectional technologies spanning over several sections of the IPC; technical subjects covered by former USPC cross-reference art collections [XRACs] and digests
    • Y02: Technologies or applications for mitigation or adaptation against climate change
    • Y02D: Climate change mitigation technologies in information and communication technologies [ICT], i.e. information and communication technologies aiming at the reduction of their own energy use
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention discloses a mapping method and apparatus from a data stream to a DMA configuration, a storage medium, and a DLA. The method comprises the following steps: determining whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware, and checking whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware. If the memory access limited by the DMA hardware is met and the synchronization limited by the DMA hardware is met, checking whether the scan description in the N-th level data stream meets the specification limited by the DMA hardware. If the specification limited by the DMA hardware is not met, splitting the N-th level data stream into at least two N+1-th level data streams according to the scan description; if the specification limited by the DMA hardware is met, outputting to the DMA hardware a DMA configuration limited by the DMA hardware and carrying a repeat-execution count W, according to the description of the N-th level data stream and the interface information of the DMA hardware. In this way, the utilization rate of the computing core is improved, so that the DLA gains performance and reduces power consumption.

Description

Mapping method and device from data stream to DMA configuration, storage medium and DLA
Technical Field
The present invention relates to the field of deep learning computing technology, and in particular, to a mapping method and apparatus from a data stream to a DMA configuration, a storage medium, and a DLA.
Background
With the maturation and popularization of deep learning techniques, the demand for energy-efficient algorithms keeps increasing. Deep learning computation, as an emerging technical field, exhibits very strong regularity in its common computing patterns and memory-access forms, so DLA (Deep Learning Accelerator) architectures optimized for deep learning computation have developed rapidly in recent years.
The traditional CPU architecture typically features multi-level caches, complex control units and rich instruction support, and is suitable for general-purpose, low-latency application scenarios with moderate compute requirements. The GPU architecture has evolved rapidly in recent years; thanks to its scalable high compute capability, rich cache and register resources, simplified control units and mature software ecosystem, its SIMD (Single Instruction, Multiple Data) and SIMT (Single Instruction, Multiple Threads) microarchitectures also improve the energy-efficiency ratio, giving it a great advantage in high-throughput application scenarios.
A popular DLA architecture adopts multi-level buffer caches. Compared with the multi-level caches used by a traditional CPU architecture, this is simpler to implement, more energy efficient, and provides more usable on-chip memory per unit area, making it highly competitive for deep learning computing patterns. Under such an architecture, in order to improve the compute utilization of a computing core (also referred to as a computing unit) and reduce idling of the computing core caused by logic other than core computation, a dedicated DMA (Direct Memory Access) hardware unit (DMA hardware for short) is often used to move data between buffer caches.
In deep learning computation, calculation and memory access are usually performed on matrices and tensors, but the on-chip cache often cannot hold the entire matrix or tensor required by an operator (such as matrix multiplication or convolution). DMA is therefore needed to move data stored in low-bandwidth off-chip memory (such as DDR memory), far from the computing core, to near-end high-bandwidth on-chip memory (such as the buffer cache) in a regular slicing or tiling manner, repeatedly synchronizing with the computing core and with other DMAs in the DMA hardware during this process to guarantee the integrity and correctness of the data.
To reduce the DLA efficiency loss (in both performance and power consumption) caused by configuring the DMA hardware while fully exploiting the regularity described above, the DMA hardware supports a function in which a single configuration is automatically and repeatedly executed according to certain requirements. However, this function is subject to hardware limitations, including the number of dimensions supported by a configuration, the number of repetitions, and the access order. How to map a general data stream onto a DMA configuration limited by the DMA hardware, so that the DLA gains performance and reduces power consumption, is therefore a technical problem to be solved.
Disclosure of Invention
The invention provides a mapping method and apparatus from a data stream to a DMA configuration, a storage medium and a DLA, which realize the mapping from a general data stream to a DMA configuration limited by the DMA hardware, thereby shortening the idle time of the computing core, improving the utilization rate of the computing core, and in turn improving the performance of the DLA and reducing its power consumption.
According to an aspect of the present invention, there is provided a mapping method from a data stream to a DMA configuration, comprising:
receiving an N-th level data stream;
determining whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware, and checking whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware;
if the memory access limited by the DMA hardware is not met, or the synchronization limited by the DMA hardware is not met, outputting the N-th level data stream to an initial mapping device; if the memory access limited by the DMA hardware is met and the synchronization limited by the DMA hardware is met, checking whether the scan description in the N-th level data stream meets the specification limited by the DMA hardware;
if the specification limited by the DMA hardware is not met, splitting the N-th level data stream into at least two N+1-th level data streams according to the scan description, wherein the N+1-th level data stream is the next-level data stream of the N-th level data stream, N is a positive integer, and N ≥ 1; if the specification limited by the DMA hardware is met, outputting to the DMA hardware a DMA configuration limited by the DMA hardware and carrying a repeat-execution count W, according to the description of the N-th level data stream and the interface information of the DMA hardware, wherein W is an integer and W ≥ 0 (an illustrative sketch of this flow is given below).
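For readability, the flow of these steps can be summarized in the following minimal Python sketch. It is an illustration only: the DataStream fields and the emit_config / fallback callbacks are hypothetical placeholders standing in for the hardware-limit checks, the DMA configuration output and the initial mapping device, not an actual interface of the invention.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DataStream:
    """Hypothetical N-th level data stream, reduced to its three check results."""
    level: int                   # N (N >= 1); level 1 is the original, unsplit stream
    scan_ok: bool                # scanning method conforms to DMA-limited memory access
    sync_ok: bool                # synchronization description conforms to DMA-limited sync
    spec_ok: bool                # scan description conforms to the DMA-limited specification
    children: List["DataStream"] = field(default_factory=list)  # (N+1)-th level sub-streams

def map_stream(stream: DataStream,
               emit_config: Callable[[DataStream], None],
               fallback: Callable[[DataStream], None]) -> None:
    """Check access/sync, then the specification; emit one repeatable DMA
    configuration, split into sub-streams, or fall back to the initial mapping."""
    if not (stream.scan_ok and stream.sync_ok):
        fallback(stream)             # output the N-th level stream to the initial mapping device
    elif stream.spec_ok:
        emit_config(stream)          # one DMA configuration with repeat-execution count W
    else:
        for sub in stream.children:  # split into at least two (N+1)-th level streams
            map_stream(sub, emit_config, fallback)

# Example: a level-1 stream that must be split once before each half fits the spec.
half = DataStream(level=2, scan_ok=True, sync_ok=True, spec_ok=True)
root = DataStream(level=1, scan_ok=True, sync_ok=True, spec_ok=False, children=[half, half])
map_stream(root,
           emit_config=lambda s: print("emit DMA config for level", s.level),
           fallback=lambda s: print("fall back to initial mapping for level", s.level))
```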
Optionally, the determining whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware includes:
if the scanning behavior in at least one dimension in the N-th level data stream is out-of-range scanning, judging whether the memory arrangement of all dimensions in the N-th level data stream needs to be transposed; if so, judging that the scanning method does not conform to the memory access limited by the DMA hardware, and if not, judging that the scanning method conforms to the memory access limited by the DMA hardware;
if the scanning behavior in none of the dimensions in the N-th level data stream is out-of-range scanning, judging that the scanning method conforms to the memory access limited by the DMA hardware.
Optionally, before the judging, when the scanning behavior in at least one dimension in the N-th level data stream is out-of-range scanning, whether the memory arrangement of all dimensions in the N-th level data stream needs to be transposed, the method further includes:
in a single dimension, when the size of the source data is greater than or equal to the size of the target data, if stride×(times-1)+dst_dim_size ≤ src_dim_size, determining that the scanning behavior in that dimension does not produce out-of-range scanning, and if stride×(times-1)+dst_dim_size > src_dim_size, determining that the scanning behavior in that dimension is out-of-range scanning;
in a single dimension, when the size of the source data is smaller than the size of the target data, if stride×(times-1)+src_dim_size ≤ dst_dim_size, determining that the scanning behavior in that dimension does not produce out-of-range scanning, and if stride×(times-1)+src_dim_size > dst_dim_size, determining that the scanning behavior in that dimension is out-of-range scanning;
wherein src_dim_size is the size of the source data, dst_dim_size is the size of the target data, stride is the span between two accesses in the current dimension, and times is the number of scans corresponding to the scanning behavior in the current dimension.
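As an illustration of the two inequalities above, a minimal Python sketch of the per-dimension check is given below; the function name and signature are assumptions made for this example and are not part of the DMA hardware interface.

```python
def scan_out_of_range(stride: int, times: int,
                      src_dim_size: int, dst_dim_size: int) -> bool:
    """Return True if the scanning behavior in one dimension crosses the boundary,
    following the two inequalities above (illustrative sketch only)."""
    if src_dim_size >= dst_dim_size:
        # Source size >= target size: compare against the source size.
        return stride * (times - 1) + dst_dim_size > src_dim_size
    # Source size < target size: compare against the target size.
    return stride * (times - 1) + src_dim_size > dst_dim_size
```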
Optionally, the checking whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware includes:
if the number of upstream synchronous execution bodies, the number of downstream synchronous execution bodies and/or the synchronization rule in the synchronization description do not conform to the synchronization limited by the DMA hardware, determining that the synchronization description in the N-th level data stream does not conform to the synchronization limited by the DMA hardware.
Optionally, after the receiving the N-th level data stream, the method further includes:
if, in the N-th level data stream, the Z-1-th level low dimension of the source data and the Z-1-th level high dimension adjacent to that low dimension are two contiguous dimensions in the memory arrangement, and the corresponding Z-1-th level low dimension of the target data and the Z-1-th level high dimension adjacent to that low dimension are also two contiguous dimensions in the memory arrangement, then, when the size of the target data in the Z-1-th level high dimension equals 1, or the target data and the source data have the same size in the Z-1-th level low dimension, judging whether the Z-th level dimension data description obtained by merging the Z-1-th level low-dimension data description of the source data with the Z-1-th level high-dimension data description of the source data, and the Z-th level dimension data description obtained by merging the Z-1-th level low-dimension data description of the target data with the Z-1-th level high-dimension data description of the target data, meet the specification limited by the DMA hardware;
if the Z-th level dimension data description of the source data after merging and the Z-th level dimension data description of the target data after merging meet the specification limited by the DMA hardware, performing the Z-1-th merge of the Z-1-th level low-dimension data description of the source data with the Z-1-th level high-dimension data description of the source data, and the Z-1-th merge of the Z-1-th level low-dimension data description of the target data with the Z-1-th level high-dimension data description of the target data, so as to simplify the N-th level data stream, wherein Z is a positive integer, Z ≥ 2, and Z is less than the total number of dimensions in the N-th level data stream.
Optionally, the Z-1-th merge of the Z-1-th level low-dimension data description of the source data with the Z-1-th level high-dimension data description of the source data, and the Z-1-th merge of the Z-1-th level low-dimension data description of the target data with the Z-1-th level high-dimension data description of the target data, include the following (an illustrative sketch is given after this list):
multiplying the Z-1-th level low-dimension data size of the source data by the Z-1-th level high-dimension data size of the source data to merge them, and multiplying the Z-1-th level low-dimension data size of the target data by the Z-1-th level high-dimension data size of the target data to merge them;
multiplying the Z-1-th level low-dimension span size by the Z-1-th level high-dimension span size to merge them;
multiplying the Z-1-th level low-dimension repetition count by the Z-1-th level high-dimension repetition count to merge them;
merging the Z-1-th level low-dimension memory arrangement of the source data with the Z-1-th level high-dimension memory arrangement of the source data, and merging the Z-1-th level low-dimension memory arrangement of the target data with the Z-1-th level high-dimension memory arrangement of the target data;
and merging the Z-1-th level low-dimension access order with the Z-1-th level high-dimension access order.
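As flagged above, the following is an illustrative Python sketch of one such merge; the DimDesc record is hypothetical, and the memory-arrangement and access-order fields, which the list above says are merged analogously, are omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class DimDesc:
    """Hypothetical description of one dimension of the source or target data."""
    size: int    # data size in this dimension
    stride: int  # span size between two accesses in this dimension
    times: int   # repetition count in this dimension

def merge_dims(low: DimDesc, high: DimDesc) -> DimDesc:
    """Z-1-th merge of an adjacent low/high dimension pair into one Z-th level
    dimension: sizes, span sizes and repetition counts are multiplied, as listed above."""
    return DimDesc(size=low.size * high.size,
                   stride=low.stride * high.stride,
                   times=low.times * high.times)
```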
According to another aspect of the present invention, there is provided a mapping apparatus from a data stream to a DMA configuration for performing a mapping method from a data stream to a DMA configuration according to any embodiment of the present invention, the apparatus comprising:
a receiving module, configured to receive an N-th level data stream;
an access and synchronization checking module, configured to determine whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware, and to check whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware;
a specification checking module, configured to output the N-th level data stream to an initial mapping device if it is determined that the memory access limited by the DMA hardware is not met or the synchronization limited by the DMA hardware is not met, and, if it is determined that the memory access limited by the DMA hardware and the synchronization limited by the DMA hardware are both met, to check whether the scan description in the N-th level data stream meets the specification limited by the DMA hardware; and a DMA configuration output module, configured to split the N-th level data stream into at least two N+1-th level data streams according to the scan description if it is determined that the specification limited by the DMA hardware is not met, wherein the N+1-th level data stream is the next-level data stream of the N-th level data stream, N is a positive integer, and N ≥ 1, and, if it is determined that the specification limited by the DMA hardware is met, to output to the DMA hardware a DMA configuration limited by the DMA hardware and carrying a repeat-execution count W, according to the description of the N-th level data stream and the interface information of the DMA hardware, wherein W is an integer and W ≥ 0.
According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the mapping method from data streams to DMA configurations of any of the embodiments of the present invention when executed.
According to another aspect of the present invention, there is provided a DLA comprising: DMA hardware, a computing unit, and a mapping apparatus from data stream to DMA configuration according to any embodiment of the present invention; wherein the mapping apparatus from data stream to DMA configuration communicates with the DMA hardware, and the DMA hardware communicates with the computing unit.
According to the technical solution of the embodiments of the present invention, after the N-th level data stream is received, it is first determined whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware, and it is checked whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware. If the memory access limited by the DMA hardware is not met or the synchronization limited by the DMA hardware is not met, the N-th level data stream is output to the initial mapping device; if the memory access limited by the DMA hardware and the synchronization limited by the DMA hardware are both met, it is checked whether the scan description in the N-th level data stream meets the specification limited by the DMA hardware. Next, if the specification limited by the DMA hardware is not met, the N-th level data stream is split into at least two N+1-th level data streams according to the scan description, where the N+1-th level data stream is the next-level data stream of the N-th level data stream, N is a positive integer, and N ≥ 1; if the specification limited by the DMA hardware is met, a DMA configuration limited by the DMA hardware and carrying a repeat-execution count W is output to the DMA hardware according to the description of the N-th level data stream and the interface information of the DMA hardware, where W is an integer and W ≥ 0. In this way, the function of the DMA hardware by which a single configuration is automatically and repeatedly executed according to certain requirements is exploited to map a general data stream to a DMA configuration limited by the DMA hardware, which reduces the time overhead of configuring the DMA hardware, shortens the idle time of the computing core, improves the utilization rate of the computing core, and thus improves the performance of the DLA and reduces its power consumption.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart of a mapping method from data stream to DMA configuration provided in an embodiment of the present invention;
FIG. 2a is a schematic diagram of a memory arrangement of data according to an embodiment of the present invention;
FIG. 2b is a schematic diagram of another memory arrangement of data according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of a span and access sequence of data according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of another span and access sequence provided by an embodiment of the present invention;
FIG. 3c is a schematic diagram of another span and access sequence provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of the repetition count of data provided by an embodiment of the invention;
FIG. 5a is a schematic diagram of the shape of data provided by an embodiment of the present invention;
FIG. 5b is a schematic diagram of the shape of another data provided by an embodiment of the present invention;
FIG. 6a is a schematic diagram of a scanning behavior according to an embodiment of the present invention;
FIG. 6b is a schematic diagram of another scanning behavior provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram of a mapping apparatus from data stream to DMA configuration according to an embodiment of the present invention;
fig. 8 is a schematic structural diagram of a DLA architecture according to an embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Fig. 1 is a flow chart of a mapping method from a data stream to a DMA configuration according to an embodiment of the present invention. Referring to fig. 1, the mapping method from a data stream to a DMA configuration includes:
s110, receiving the N-th data stream.
In particular, the data stream may be a general data stream in the art, and a general data stream may be understood as a description of general data-access requirements; in other embodiments, the description of general data-access requirements may also be abbreviated as the description of the data stream. In general, a data stream may include the access requirements for a group of data and the synchronization requirements during the access process. Illustratively, for each data item in the group, the access requirements may include: the name of the storage space of the source data, the name of the storage space of the target data, the address information of the source data and of the target data, the shape information of the source data and of the target data, the size information of the source data and of the target data, the memory arrangement information of the source data and of the target data, and the like, as well as the repetition count in each dimension, the span size in each dimension, the access order in each dimension, and the like.
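Purely for illustration, such a description could be carried in records like the following Python sketch; every field name here is a hypothetical stand-in for the items listed above, not an interface defined by the invention.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class TensorDesc:
    """Hypothetical description of the source or target side of one data item."""
    space_name: str            # name of the storage space (e.g. off-chip or on-chip)
    address: int               # address information
    shape: Tuple[int, ...]     # shape information
    sizes: Tuple[int, ...]     # size information per dimension
    layout: Tuple[int, ...]    # memory arrangement, e.g. (0, 1) or (1, 0)

@dataclass
class DataStreamItem:
    """Hypothetical description of one data item in a data stream."""
    src: TensorDesc
    dst: TensorDesc
    repeats: List[int]         # repetition count in each dimension
    strides: List[int]         # span size in each dimension
    access_order: List[int]    # access order in each dimension
    sync: List[str] = field(default_factory=list)  # synchronization requirements
```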
Fig. 2a is a schematic diagram of a memory arrangement of data according to an embodiment of the present invention, and fig. 2b is a schematic diagram of a memory arrangement of another data according to an embodiment of the present invention. That is, in FIG. 2a, one memory arrangement of one data of a group of data is schematically illustrated as layout [0,1], and elements E of the data are consecutively arranged along the x-axis, while in FIG. 2b, another memory arrangement of the data is schematically illustrated as layout [1,0], and elements E of the data are consecutively arranged along the y-axis.
Exemplary, fig. 3a is a schematic diagram of a span and an access sequence of data according to an embodiment of the present invention, fig. 3b is a schematic diagram of another span and an access sequence according to an embodiment of the present invention, and fig. 3c is a schematic diagram of another span and an access sequence according to an embodiment of the present invention. That is, in fig. 3a, a slice span size and a slice access order (wherein a slice Q may include a plurality of elements E) of one data of a set of data are exemplarily illustrated, the slice span size is [ y, x ] = [4,4] elements, and the slice access order is to scan along the x-axis in a "Z" shape; in fig. 3b, another slice span size and slice access order of the data is schematically illustrated, the slice span size being [ y, x ] = [2,2] elements and the slice access order being scanning in a "Z" shape trend along the x-axis; in fig. 3c, another slice span size and slice access order of the data is illustrated, the slice span size being [ y, x ] = [4,4] elements and the access order being scanning in an inverse "N" shaped trend along the y-axis. Wherein the x-axis may be one dimension of the data and the y-axis may be another dimension of the data; of course, the dimension of the data may also be a z-axis or a time axis, and the number of dimensions of the data is not particularly limited in this embodiment.
Fig. 4 is a schematic diagram illustrating the number of repetitions of data according to an embodiment of the present invention. That is, in fig. 4, the number of repetitions of one data (i.e., the number of repetitions of slice access of one data) in a set of data is exemplarily illustrated, and after repeating scanning 3 times along the x-axis, scanning along the x-axis is continued after line feed along the y-axis.
Fig. 5a is a schematic diagram of the shape of one data according to an embodiment of the present invention, and fig. 5b is a schematic diagram of the shape of another data according to an embodiment of the present invention. That is, in fig. 5a, the left graph is the shape of one source data of one data in a group of data, and the right graph is the shape of one target data of the data, where the case may be that one source data is sliced to obtain multiple target data; in fig. 5b, the left diagram is the shape of another source data of the data, and the right diagram is the shape of another target data of the data, where multiple source data may be combined to obtain one target data.
S120, determining whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware, and checking whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware. If the memory access limited by the DMA hardware is not met or the synchronization limited by the DMA hardware is not met, executing step S130; if the memory access limited by the DMA hardware is met and the synchronization limited by the DMA hardware is met, executing step S140.
In particular, the DMA hardware may be understood as a DMA hardware unit; the DMA hardware unit may include a plurality of DMAs. One setting of the plurality of DMAs corresponds to one set of limitations of the DMA hardware, and another setting corresponds to another set of limitations; the limitations may be limitations on memory access, limitations on synchronization, limitations on specification, and so on.
On this basis, the memory access limited by the DMA hardware may cover slice size, access order, repetition count, stride size, transpose requirements, out-of-range scanning, and the like. The scanning method in this embodiment may include transpose requirements, out-of-range scanning, and the like. The synchronization limited by the DMA hardware may include the synchronization relationship between DMAs within the DMA hardware and the synchronization relationship between a DMA and the computing core (also referred to as the computing unit). The synchronization description in this embodiment may include the synchronization relationship between DMAs and the synchronization relationship between a DMA and the computing core.
In this embodiment, after the N-th level data stream is received, the scanning rule and the synchronization rule in the N-th level data stream are checked by determining whether the scanning method in the N-th level data stream belongs to the memory access limited (i.e. supported) by the DMA hardware, or a subset of it, and by checking whether the synchronization description in the N-th level data stream belongs to the synchronization supported by the DMA hardware, or a subset of it; this makes it possible to exploit the function of the DMA hardware by which a single configuration is automatically and repeatedly executed according to certain requirements. In the related art, by contrast, after a data stream is received, its scanning rule and synchronization rule are not checked; instead the data stream is expanded, and each access is then checked, configured, executed and synchronized in turn, so the auto-repeat function of the DMA hardware cannot be exploited. This leads to a huge time overhead for configuring the DMA hardware, more idle time of the computing core and lower utilization of the computing core, and therefore to a loss of DLA efficiency (in both performance and power consumption).
Illustratively, suppose the source data in the N-th level data stream has the shape y×x = 64×1024, the target data has the shape y×x = 64×64, the scanning method is to scan along the x-axis with a span of 64×64, and the synchronization description is that the DMA sends a synchronization signal to the computing unit after each execution. In this step, the scanning method and the synchronization description are checked to determine whether they belong to the memory access supported by the DMA hardware, or a subset of it, and to the synchronization supported by the DMA hardware, or a subset of it, so that the scanning rule and the synchronization in the N-th level data stream are verified and the auto-repeat function of the DMA hardware can subsequently be exploited. In the related art, by contrast, after the N-th level data stream is received, the scanning method and the synchronization description are both expanded, which yields 16 accesses to the 64×1024 source data, sliced according to the 64×64 target data, and each of the 16 accesses involves its own checking, configuration, execution and synchronization. The one-configuration auto-repeat function of the DMA hardware therefore cannot be used, leading to a huge time overhead for configuring the DMA hardware, more idle time of the computing core and lower utilization of the computing core, and hence to a loss of DLA efficiency (in both performance and power consumption).
In addition, in this embodiment, the step of determining whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware may be performed after or before the step of checking whether the synchronization description in the N-th level data stream conforms to the synchronization limited by the DMA hardware; that is, the two steps may be performed in either order.
S130, outputting the N-th level data stream to the initial mapping device.
Specifically, when it is found that the scanning method in the N-th level data stream does not conform to the memory access limited by the DMA hardware, and/or that the synchronization description in the N-th level data stream does not conform to the synchronization limited by the DMA hardware, the N-th level data stream may be mapped to a DMA configuration and output to the DMA hardware according to the existing related art mentioned in step S120. That is, the initial mapping device mentioned in this embodiment can be understood as a mapping device of the related art mentioned in step S120.
S140, checking whether the scan description in the N-th level data stream meets the specification limited by the DMA hardware. If the specification limited by the DMA hardware is not met, executing step S150; if the specification limited by the DMA hardware is met, executing step S160.
Specifically, the specification limited by the DMA hardware may include the shape information of the source data, the shape information of the target data, the repetition count in each dimension, the span size in each dimension, the size information of the source data, and the size information of the target data. The scan description in this embodiment may include the repetition count in each dimension and the span size in each dimension.
In this embodiment, the specification in the N-th level data stream is checked only after the scanning rule and the synchronization rule in the N-th level data stream have been checked and found to conform; this enables the function of the DMA hardware by which a single configuration is automatically and repeatedly executed according to certain requirements to be exploited, and the mapping from a general data stream to a DMA configuration limited by the DMA hardware to be realized by means of that function.
S150, splitting the N-th level data stream into at least two N+1-th level data streams according to the scan description, wherein the N+1-th level data stream is the next-level data stream of the N-th level data stream, N is a positive integer, and N ≥ 1.
Specifically, when it is found that the scan description in the N-th level data stream does not meet the specification limited by the DMA hardware, the N-th level data stream cannot directly use the function of the DMA hardware by which a single configuration is automatically and repeatedly executed according to certain requirements (the automatic repeat execution function for short) to realize the mapping to a DMA configuration, so the N-th level data stream may be split into at least two N+1-th level data streams. The N-th level data stream serves as the previous-level data stream of the N+1-th level data stream, the N+1-th level data stream serves as the next-level data stream of the N-th level data stream, and the N+1-th level data stream can also be understood as a sub data stream of the N-th level data stream. Similarly, when it is found that the scan description in an N+1-th level data stream does not meet the specification limited by the DMA hardware, the N+1-th level data stream may be split into at least two (N+1)+1-th level data streams, each of which is a sub data stream of the N+1-th level data stream. The 1st level data stream is the original, unsplit data stream.
In this embodiment, when the original data stream can directly use the automatic repeat execution function to realize the mapping to a DMA configuration, the mapping is realized directly by step S160 without splitting the original data stream. When the original data stream cannot directly use the automatic repeat execution function, the original data stream is split into 2nd level data streams, which serve as sub data streams of the original data stream, and the mapping to DMA configurations is carried out with the 2nd level data streams. When a 2nd level data stream cannot directly use the automatic repeat execution function either, it is split into 3rd level data streams, which serve as sub data streams of the 2nd level data stream, and the mapping is carried out with the 3rd level data streams, and so on, until the DMA mapping of the data stream is completed.
Illustratively, assume that the source data in the N-th level data stream has the shape y×x = 64×1024 and the target data has the shape y×x = 32×256, and that the scanning method conforms to the memory access limited by the DMA hardware and the synchronization description conforms to the synchronization limited by the DMA hardware. However, the scan description is to repeat 2 times along the x-axis (i.e. the repetition count on the x-axis is 2) and then line-feed along the y-axis, and this scan description does not meet the specification limited by the DMA hardware. In this case the present embodiment may, for example, split the N-th level data stream into one N+1-th level data stream (the first N+1-th level data stream) and another N+1-th level data stream (the second N+1-th level data stream).
Further, access, synchronization and specification checks are performed on the first N+1-th level data stream and the second N+1-th level data stream, respectively, according to step S120 and step S140. If it is found that the first N+1-th level data stream can directly use the automatic repeat execution function to realize the mapping to a DMA configuration, the mapping is realized directly by step S160 without further splitting the first N+1-th level data stream. If it is found that the second N+1-th level data stream cannot directly use the automatic repeat execution function, the second N+1-th level data stream is split into at least two (N+1)+1-th level data streams, and the mapping to DMA configurations is then carried out with these (N+1)+1-th level data streams.
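To make the split concrete, the following Python sketch halves the 64×1024 / 32×256 example above along the y-axis; the Stream record and split_along_y are assumptions made for this illustration and do not describe the actual mapping device.

```python
from dataclasses import dataclass, replace
from typing import List, Tuple

@dataclass
class Stream:
    """Hypothetical N-th level data stream: the y-range of the source it covers,
    plus the (unchanged) source x extent and target tile shape."""
    level: int
    y_range: Tuple[int, int]    # source rows covered, e.g. (1, 64)
    src_x: int                  # source extent along x, e.g. 1024
    dst_shape: Tuple[int, int]  # target tile shape, e.g. (32, 256)

def split_along_y(stream: Stream) -> List[Stream]:
    """Split one N-th level stream into two (N+1)-th level streams, each covering
    half of the y-range, mirroring the first/second sub-stream example above."""
    lo, hi = stream.y_range
    mid = lo + (hi - lo + 1) // 2
    return [replace(stream, level=stream.level + 1, y_range=(lo, mid - 1)),
            replace(stream, level=stream.level + 1, y_range=(mid, hi))]

# Rows 1..64 are split into rows 1..32 and rows 33..64, as described in the text.
first, second = split_along_y(Stream(level=1, y_range=(1, 64),
                                     src_x=1024, dst_shape=(32, 256)))
```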
S160, outputting to the DMA hardware a DMA configuration limited by the DMA hardware and carrying a repeat-execution count W, according to the description of the N-th level data stream and the interface information of the DMA hardware, wherein W is an integer and W ≥ 0.
Specifically, the interface information of the DMA hardware refers to the information of the interfaces of the individual DMAs in the DMA hardware unit. The description of the data stream is the description of the data-access requirements mentioned in step S110. In this step, when the scan description in the N-th level data stream is found to meet the specification limited by the DMA hardware, the N-th level data stream can directly use the automatic repeat execution function to realize the mapping to a DMA configuration; a DMA configuration limited by the DMA hardware and carrying a repeat-execution count W can therefore be generated from the description of the N-th level data stream and the interface information of the DMA hardware and output to the DMA hardware. The repeat-execution count W is the number of automatic repetitions performed by the automatic repeat execution function.
Illustratively, assume that the N-th level data stream comprises two sub data streams (a first sub data stream and a second sub data stream) obtained by splitting the N-1-th level data stream; the source data of both sub data streams has the shape y×x = 64×1024, the target data has the shape y×x = 32×256, the scanning method conforms to the memory access limited by the DMA hardware, and the synchronization description conforms to the synchronization limited by the DMA hardware. The scan of the first sub data stream is described as repeating a slice scan access 2 times along the x-axis (i.e. the repetition count on the x-axis is 2), starting from row 1 of the y-axis; the scan of the second sub data stream is described as repeating a slice scan access 2 times along the x-axis, starting from row 33 of the y-axis (i.e. the first sub data stream scans rows 1 through 32, inclusive, and the second sub data stream scans rows 33 through 64, inclusive). The scan descriptions of both sub data streams thus meet the specification limited by the DMA hardware.
Under the above assumption, the present embodiment may, for example, output to the DMA hardware for the first sub data stream a DMA configuration limited by the DMA hardware with a repeat-execution count W = 2×(1024/256) - 1 = 7; that is, the automatic repeat execution function of the DMA hardware is configured once, executed once, and then repeated 7 more times, for a total of 8 = 7+1 executions. Similarly, for the second sub data stream a DMA configuration with a repeat-execution count W = 2×(1024/256) - 1 = 7 is output to the DMA hardware, i.e. one configuration, one execution, and 7 repetitions. In total, therefore, only 2 DMA hardware configurations need to be performed, and the DMA hardware repeats 14 times, for 16 executions in all (i.e. the total number of executions is 16). Here, by default, one configuration triggers one execution, and the repeat-execution count is the number of additional repeated executions (which may also be called the number of repetitions), for example 7.
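The repeat-execution count of this example can be checked with a few lines of arithmetic; the variable names below are illustrative only and simply restate the numbers used above.

```python
src_y, src_x = 64, 1024        # source shape y*x
dst_y, dst_x = 32, 256         # target shape y*x of each sub data stream
x_repeats = 2                  # scan description: repeat 2 times along the x-axis

# Executions needed per sub data stream: 2 repeats for each of the 1024/256 slices.
execs_per_substream = x_repeats * (src_x // dst_x)   # 2 * 4 = 8
W = execs_per_substream - 1                          # 1 configured execution + 7 repeats

substreams = src_y // dst_y                          # 64 / 32 = 2 configurations
total_execs = substreams * execs_per_substream       # 16 executions from 2 configurations
print(W, substreams, total_execs)                    # 7 2 16
```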
In the related art, by contrast, for the N-1-th level data stream in the above assumption, both the scanning method and the synchronization description are expanded when the DMA configuration is generated, so that the number of accesses is 16 (i.e. the total number of executions is 16), but each of the 16 accesses involves its own checking, configuration, execution and synchronization; that is, there are 16 configurations for 16 executions, every execution requires one configuration, and the automatic repeat execution function cannot be used. This leads to a huge time overhead for configuring the DMA hardware, more idle time of the computing core and lower utilization of the computing core, and therefore to a loss of DLA efficiency (in both performance and power consumption).
As this comparison shows, the technical solution of the embodiment of the present invention reduces the time overhead of configuring the DMA hardware by exploiting the automatic repeat execution function, so that the idle time of the computing core is shortened, the utilization rate of the computing core is improved, and in turn the performance of the DLA is improved and its power consumption is reduced.
On the basis of the above technical solution, as an embodiment of the present invention, optionally, if the total number of executions when performing the DMA configuration for the N-th level data stream is M, then when N ≥ floor(M/2) it may be chosen not to continue mapping the N-th level data stream with the automatic repeat execution function, but instead to output the N-th level data stream to the initial mapping device and let the initial mapping device perform the DMA configuration mapping for it, where floor is the round-down function and M is a positive integer.
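A one-function sketch of this fallback criterion is given below; the threshold is exactly the N ≥ floor(M/2) rule stated above, and the function name is an assumption for illustration.

```python
from math import floor

def should_use_initial_mapping(level_n: int, total_execs_m: int) -> bool:
    """Stop splitting and hand the N-th level stream to the initial mapping device
    once the level N reaches floor(M/2), where M is the total number of executions."""
    return level_n >= floor(total_execs_m / 2)

print(should_use_initial_mapping(level_n=3, total_execs_m=16))  # False: keep splitting
print(should_use_initial_mapping(level_n=8, total_execs_m=16))  # True: fall back
```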
On the basis of the above embodiment, as an implementation of the present invention, optionally, determining in step S120 whether the scanning method in the N-th level data stream conforms to the memory access limited by the DMA hardware may include the following cases.
case one: if the scanning behaviors in all the dimensions in the N-level data stream do not cross the boundary scanning, the scanning method is judged to be in accordance with the access memory limited by the DMA hardware.
Fig. 6a is a schematic diagram of one scanning behavior according to an embodiment of the present invention, and fig. 6b is a schematic diagram of another scanning behavior according to an embodiment of the present invention. In fig. 6a, the source data has the shape y×x = 8×8, the target data has the shape y×x = 4×4, and the span size of the scan is 4×4; as can be seen from the left and right parts of fig. 6a, the scanning behavior does not produce out-of-range scanning in the dimension represented by the x-axis, and likewise does not produce out-of-range scanning in the dimension represented by the y-axis. In fig. 6b, the source data has the shape y×x = 8×8, the target data has the shape y×x = 5×5, and the span size of the scan is 5×5; as can be seen from the left and right parts of fig. 6b, the scanning behavior produces out-of-range scanning in the dimension represented by the x-axis.
Case two: if the scanning behavior in at least one dimension in the N-th level data stream is out-of-range scanning, judging whether the memory arrangement of all dimensions in the N-th level data stream needs to be transposed; if so, judging that the scanning method does not conform to the memory access limited by the DMA hardware, and if not, judging that the scanning method conforms to the memory access limited by the DMA hardware.
Specifically, when it is found that the scanning behavior in at least one dimension of the N-th level data stream is out-of-range scanning, the DMA hardware may zero-fill the out-of-range accesses, thereby ensuring that the data of the out-of-range accesses takes a determined, valid value rather than some other uncertain random value. When out-of-range scanning is detected in at least one dimension of the N-th level data stream, zero filling is needed; in that case, the technical solution of this embodiment judges whether the memory arrangement in all dimensions of the N-th level data stream needs to be transposed, and if so, judges that the scanning method does not conform to the memory access limited by the DMA hardware, while if not, judges that the scanning method conforms to the memory access limited by the DMA hardware.
The embodiment of the present invention further provides a method for determining out-of-range scanning. That is, on the basis of the foregoing embodiment, as an implementation of the present invention, optionally, before judging whether the memory arrangements of all dimensions in the N-th level data stream need to be transposed when the scanning behavior in at least one dimension in the N-th level data stream is out-of-range scanning, the method further includes:
in a single dimension, when the size of the source data is greater than or equal to the size of the target data, if stride×(times-1)+dst_dim_size ≤ src_dim_size, it is determined that the scanning behavior in that dimension does not produce out-of-range scanning, and if stride×(times-1)+dst_dim_size > src_dim_size, it is determined that the scanning behavior in that dimension is out-of-range scanning, where src_dim_size is the size of the source data, dst_dim_size is the size of the target data, stride is the span between two accesses in the current dimension, and times is the number of scans corresponding to the scanning behavior in the current dimension. By way of example, the number of scans in fig. 6a and fig. 6b is 2 scans from left to right in each case; the span size corresponding to fig. 6a is 4×4, and the span size corresponding to fig. 6b is 5×5.
Illustratively, if the source data size is 8×8, the target data size is 4×4, and the span size is 4×4, then the number of scans in a single dimension is times = ceil(8/4) = 2 (ceil is the round-up function), and 4×(2-1)+4 = 8, i.e. stride×(times-1)+dst_dim_size = src_dim_size; it is therefore determined that no out-of-range scanning occurs.
Illustratively, if the source data size is 8×8, the target data size is 5×5, and the span size is 4×4, then the number of scans in a single dimension is times = ceil(8/4) = 2, and 4×(2-1)+5 = 9 > 8, i.e. stride×(times-1)+dst_dim_size > src_dim_size; it is therefore determined that out-of-range scanning occurs.
In addition, in a single dimension, when the size of the source data is smaller than the size of the target data, if stride×(times-1)+src_dim_size ≤ dst_dim_size, it is determined that the scanning behavior in that dimension does not produce out-of-range scanning, and if stride×(times-1)+src_dim_size > dst_dim_size, it is determined that the scanning behavior in that dimension is out-of-range scanning.
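Purely as a check of the arithmetic in the two worked examples above (span 4, 8×8 source, 4×4 and 5×5 targets), the following lines reproduce the numbers; this is illustration only, not hardware behaviour.

```python
from math import ceil

stride, src = 4, 8                   # span size 4 and source size 8 in one dimension
times = ceil(src / stride)           # = 2 scans in that dimension

dst_a = 4                            # first example: 4x4 target
print(stride * (times - 1) + dst_a)  # 8  -> equals src_dim_size, no out-of-range scan

dst_b = 5                            # second example: 5x5 target
print(stride * (times - 1) + dst_b)  # 9  -> greater than src_dim_size, out of range
```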
on the basis of the above embodiment, as an implementation manner of the present invention, optionally, checking whether the synchronization description in the nth data stream conforms to the synchronization limited by the DMA hardware in step S120 may include:
if the number of upstream synchronous execution bodies, the number of downstream synchronous execution bodies and/or the synchronization rule in the synchronous description do not accord with the synchronization limited by the DMA hardware, determining that the synchronous description in the N-th data stream does not accord with the synchronization limited by the DMA hardware.
Specifically, the upstream synchronous execution body may be a DMA in the DMA hardware or a computing unit in the DLA architecture, and the downstream synchronous execution body may likewise be a DMA in the DMA hardware or a computing unit in the DLA architecture. The upstream synchronous execution body is upstream of the currently executing DMA in the DMA hardware: the currently executing DMA starts its execution only after receiving the synchronization signal sent by the upstream synchronous execution body. The downstream synchronous execution body is downstream of the currently executing DMA: it starts its execution only after receiving the synchronization signal sent by the currently executing DMA.
Illustratively, suppose the number of currently executing DMAs is 1 and the number of downstream synchronous execution bodies is 6, and the synchronization description in the N-th level data stream specifies that the currently executing DMA broadcasts the same synchronization signal to the 6 downstream synchronous execution bodies simultaneously. If the synchronization limited by the DMA hardware only allows 1 currently executing DMA to broadcast the same synchronization signal to 5 downstream synchronous execution bodies simultaneously, it is determined that the number of downstream synchronous execution bodies in the synchronization description does not conform to the synchronization limited by the DMA hardware. If, instead, the synchronization limited by the DMA hardware allows 1 currently executing DMA to broadcast the same synchronization signal to 10 downstream synchronous execution bodies simultaneously, it is determined that the number of downstream synchronous execution bodies in the synchronization description conforms to the synchronization limited by the DMA hardware.
For example, suppose the number of currently executing DMAs is 1 and the number of downstream synchronous execution bodies is 3, and the synchronization description in the N-th level data stream specifies that the currently executing DMA sends three synchronization signals, in turn, to the corresponding 3 downstream synchronous execution bodies. If the synchronization limited by the DMA hardware only allows 1 currently executing DMA to send two synchronization signals, in turn, to 2 corresponding downstream synchronous execution bodies, it is determined that the number of downstream synchronous execution bodies in the synchronization description does not conform to the synchronization limited by the DMA hardware. If, instead, the synchronization limited by the DMA hardware allows 1 currently executing DMA to send five synchronization signals, in turn, to 5 corresponding downstream synchronous execution bodies, it is determined that the number of downstream synchronous execution bodies in the synchronization description conforms to the synchronization limited by the DMA hardware.
Illustratively, suppose the synchronization rule in the synchronization description in the N-th level data stream is that the currently executing DMA sends a first synchronization signal to a first downstream synchronous execution body, a second synchronization signal to a second downstream synchronous execution body, a third synchronization signal to a third downstream synchronous execution body, and then a fourth, fifth and sixth synchronization signal to the first, second and third downstream synchronous execution bodies, respectively. If the synchronization limited by the DMA hardware does not support such a synchronization rule, it is determined that the synchronization rule in the synchronization description does not conform to the synchronization limited by the DMA hardware.
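The checks in these examples can be summarized in the following Python sketch; the SyncLimits values and all field names are hypothetical examples chosen to mirror the text above, not actual DMA hardware parameters.

```python
from dataclasses import dataclass

@dataclass
class SyncLimits:
    """Hypothetical synchronization limits of the DMA hardware."""
    max_broadcast: int   # max downstream bodies reachable by one broadcast signal
    max_sequential: int  # max downstream bodies addressed by separate signals in turn

@dataclass
class SyncDesc:
    """Hypothetical synchronization description taken from the N-th level data stream."""
    downstream: int      # number of downstream synchronous execution bodies
    broadcast: bool      # True: one signal is broadcast to all downstream bodies

def downstream_count_conforms(desc: SyncDesc, lim: SyncLimits) -> bool:
    """Check only the downstream-body count; a full check would also cover the
    upstream-body count and the synchronization rule itself, as described above."""
    limit = lim.max_broadcast if desc.broadcast else lim.max_sequential
    return desc.downstream <= limit

# First example above: broadcasting to 6 bodies exceeds a broadcast limit of 5.
print(downstream_count_conforms(SyncDesc(downstream=6, broadcast=True),
                                SyncLimits(max_broadcast=5, max_sequential=2)))  # False
```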
On the basis of the foregoing embodiment, as an implementation of the present invention, optionally, the mapping method from data stream to DMA configuration provided by the embodiment of the present invention may further include, after step S110 and before step S120:
if, in the N-th level data stream, the Z-1-th level low dimension of the source data and the Z-1-th level high dimension adjacent to that low dimension are two contiguous dimensions in the memory arrangement, and the corresponding Z-1-th level low dimension of the target data and the Z-1-th level high dimension adjacent to that low dimension are also two contiguous dimensions in the memory arrangement, then, when the size of the target data in the Z-1-th level high dimension equals 1, or the target data and the source data have the same size in the Z-1-th level low dimension, judging whether the Z-th level dimension data description obtained by merging the Z-1-th level low-dimension data description of the source data with the Z-1-th level high-dimension data description of the source data, and the Z-th level dimension data description obtained by merging the Z-1-th level low-dimension data description of the target data with the Z-1-th level high-dimension data description of the target data, meet the specification limited by the DMA hardware;
if the Z-th level dimension data description of the source data after merging and the Z-th level dimension data description of the target data after merging meet the specification limited by the DMA hardware, performing the Z-1-th merge of the Z-1-th level low-dimension data description of the source data with the Z-1-th level high-dimension data description of the source data, and the Z-1-th merge of the Z-1-th level low-dimension data description of the target data with the Z-1-th level high-dimension data description of the target data, so as to simplify the N-th level data stream, wherein Z is a positive integer, Z ≥ 2, and Z is less than the total number of dimensions in the N-th level data stream.
For example, if the source data has a shape of y×x = 4×1024, that is, the source data has 4 rows and each row contains 1024 elements, then in the memory arrangement two adjacent elements along the x-axis dimension are spaced 1 element apart, while two adjacent elements along the y-axis dimension are spaced 1024 elements apart. Since the element spacing along the y-axis dimension (1024) is greater than the element spacing along the x-axis dimension (1), the y-axis dimension is determined to be the high dimension and the x-axis dimension the low dimension.
For example, if the source data has a shape of z×y×x = 2048×4×1024, then in the memory arrangement two adjacent elements along the x-axis dimension are spaced 1 element apart, two adjacent elements along the y-axis dimension are spaced 1024 elements apart, and two adjacent elements along the z-axis dimension are spaced 4×1024 = 4096 elements apart. Sorting these spacings (1, 1024 and 4096) from large to small gives: the spacing along the z-axis dimension is greater than the spacing along the y-axis dimension, and the spacing along the y-axis dimension is greater than the spacing along the x-axis dimension. Accordingly, the z-axis dimension is determined to be the high dimension of the y-axis dimension, the y-axis dimension the low dimension of the z-axis dimension, and the y-axis dimension is adjacent to the z-axis dimension; likewise, the y-axis dimension is determined to be the high dimension of the x-axis dimension, the x-axis dimension the low dimension of the y-axis dimension, and the x-axis dimension is adjacent to the y-axis dimension.
For example, if the source data has a shape of t×z×y×x = 2×4×8×16, then two adjacent elements along the x-axis dimension are spaced 1 element apart, two adjacent elements along the y-axis dimension are spaced 16 elements apart, two adjacent elements along the z-axis dimension are spaced 8×16 = 128 elements apart, and two adjacent elements along the t-axis dimension are spaced 4×8×16 = 512 elements apart. After sorting the spacings of the dimensions from large to small, the result is: the t-axis and z-axis dimensions form a high-low pair and are contiguous, the z-axis and y-axis dimensions form a high-low pair and are contiguous, and the y-axis and x-axis dimensions form a high-low pair and are contiguous. In this embodiment, two adjacent dimensions are not necessarily two contiguous dimensions in the memory arrangement; only after the dimensions are sorted by element spacing from large to small (or from small to large) are two dimensions that are adjacent in the resulting order regarded as two contiguous dimensions.
On this basis, if the y-axis and x-axis dimensions are merged to obtain a dimension y', then y' is a new dimension different from both the y-axis and x-axis dimensions, and the shape of the source data becomes t×z×y' = 2×4×128. In the memory arrangement, two adjacent elements along the y'-axis dimension are spaced 1 element apart, two adjacent elements along the z-axis dimension are spaced 128 elements apart, and two adjacent elements along the t-axis dimension are spaced 512 elements apart. After sorting the spacings from large to small, the t-axis and z-axis dimensions form a high-low pair and are contiguous, and the z-axis and y'-axis dimensions form a high-low pair and are contiguous.
Before the y-axis and x-axis dimensions are merged, the data descriptions of the t-axis, z-axis, y-axis and x-axis dimensions constitute the 1st-level dimension data description; after the y-axis and x-axis dimensions are merged into the y'-axis dimension, the data descriptions of the t-axis, z-axis and y'-axis dimensions constitute the 2nd-level dimension data description. That is, each time dimensions are merged to obtain a new dimension, the level of the dimension data description increases by one.
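The element spacings used in the examples above follow directly from the shape: for a dense row-major layout, each dimension's spacing is the product of the sizes of all dimensions below it. A minimal sketch (the function name and the high-to-low ordering convention are illustrative assumptions):

```python
def element_spacings(shape_high_to_low):
    """Given a shape listed from the highest dimension to the lowest (e.g. (t, z, y, x)),
    return the spacing of each dimension, i.e. the number of elements between two
    adjacent elements along that dimension, assuming a dense row-major layout."""
    spacings = []
    step = 1
    for size in reversed(shape_high_to_low):   # walk from the lowest dimension upward
        spacings.append(step)
        step *= size
    return list(reversed(spacings))            # report in the same high-to-low order

# Examples from the text:
print(element_spacings((2, 4, 8, 16)))     # [512, 128, 16, 1]  for t, z, y, x
print(element_spacings((2048, 4, 1024)))   # [4096, 1024, 1]    for z, y, x
print(element_spacings((2, 4, 128)))       # [512, 128, 1]      for t, z, y' after merging
```

Sorting the dimensions by these spacings, from large to small, yields the high-to-low ordering in which adjacent entries are treated as contiguous dimension pairs.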
Illustratively, suppose the shape of the source data is z×y×x = 2×4×1024 and the shape of the target data is z×y×x = 2×1×256. The 1st-level low dimension of the source data, the x-axis (i.e., the x-axis dimension is the 1st-level low dimension), and its adjacent 1st-level high dimension, the y-axis (i.e., the y-axis dimension is the 1st-level high dimension), are two contiguous dimensions in the memory arrangement; correspondingly, the 1st-level low dimension x-axis of the target data and its adjacent 1st-level high dimension y-axis are also two contiguous dimensions in the memory arrangement, and the size of the target data in the 1st-level high dimension (the y-axis) equals 1. In this case, if it is determined that the new shape z×y' = 2×4096 obtained by merging the shape (i.e., the dimension data description) of the source data does not conform to the specification limited by the DMA hardware, or that the new shape z×y' = 2×256 obtained by merging the shape of the target data does not conform to the specification limited by the DMA hardware, dimension merging is not performed; if the new shape z×y' = 2×4096 obtained by merging the shape of the source data conforms to the specification limited by the DMA hardware and the new shape z×y' = 2×256 obtained by merging the shape of the target data also conforms, dimension merging is performed.
Dimension merging here means: merging the x-axis and y-axis dimensions of the 1st-level dimension data description of the source data to obtain the 2nd-level dimension data description z×y' = 2×4096, and merging the x-axis and y-axis dimensions of the 1st-level dimension data description of the target data to obtain the 2nd-level dimension data description z×y' = 2×256. By performing only those dimension mergings that conform to the specification limited by the DMA hardware, the embodiment of the invention simplifies the dimension data descriptions (i.e., the shapes) of the source data and the target data. This reduces the probability that the Nth-level data stream has to be split into next-level data streams (sub-data streams) and that many sub-data streams are generated, reduces the number of configurations needed to map the Nth-level data stream to DMA configurations, and thereby reduces the time overhead of configuring the DMA hardware, shortens the idle time of the computing core, and improves the utilization of the computing core. At the same time, the dimension data descriptions in the DMA hardware configuration are simplified and the efficiency with which the DMA hardware executes the configuration is improved, yielding a more preferable DMA configuration, that is, a DMA configuration with higher execution efficiency.
For another example, suppose the shape of the source data is z×y×x = 2×4×1024 and the shape of the target data is z×y×x = 2×2×1024. The 1st-level low dimension x-axis of the source data and its adjacent 1st-level high dimension y-axis are two contiguous dimensions in the memory arrangement, and correspondingly the 1st-level low dimension x-axis of the target data and its adjacent 1st-level high dimension y-axis are two contiguous dimensions in the memory arrangement; moreover, the target data and the source data have the same size (1024) in the 1st-level low dimension (the x-axis). In this case, if it is determined that the new shape z×y' = 2×4096 obtained by merging the shape (i.e., the dimension data description) of the source data does not conform to the specification limited by the DMA hardware, or that the new shape z×y' = 2×2048 obtained by merging the shape of the target data does not conform to the specification limited by the DMA hardware, dimension merging is not performed; if both new shapes conform to the specification limited by the DMA hardware, dimension merging is performed.
Dimension merging here means: merging the x-axis and y-axis dimensions of the 1st-level dimension data description of the source data to obtain the 2nd-level dimension data description z×y' = 2×4096, and merging the x-axis and y-axis dimensions of the 1st-level dimension data description of the target data to obtain the 2nd-level dimension data description z×y' = 2×2048. The benefits are the same as described for the previous example: fewer splits into sub-data streams, fewer DMA configurations, lower configuration overhead, shorter idle time and higher utilization of the computing core, together with simpler dimension data descriptions in the DMA configuration and higher execution efficiency of the DMA hardware.
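A minimal sketch of the spec check applied to the merged shapes in the two examples above; the hardware limit used here (a maximum transfer size per dimension) is an assumed, illustrative constraint, since the actual specification limited by the DMA hardware depends on the concrete device.

```python
def merged_shape(shape_high_to_low):
    """Merge the two lowest dimensions of a shape, e.g. (z, y, x) -> (z, y*x)."""
    *rest, high, low = shape_high_to_low
    return tuple(rest) + (high * low,)

def conforms_to_spec(shape, max_dim_size=65536):
    """Illustrative spec check: every dimension size must fit the assumed hardware limit."""
    return all(1 <= size <= max_dim_size for size in shape)

src, dst = (2, 4, 1024), (2, 2, 1024)
src_merged, dst_merged = merged_shape(src), merged_shape(dst)   # (2, 4096), (2, 2048)
# Merge only if both merged descriptions conform to the (assumed) hardware specification.
do_merge = conforms_to_spec(src_merged) and conforms_to_spec(dst_merged)
print(src_merged, dst_merged, do_merge)   # (2, 4096) (2, 2048) True
```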
On the basis of the above embodiment, as an optional implementation of the present invention, carrying out the (Z-1)th merging on the Z-1 level low dimension data description of the source data and the Z-1 level high dimension data description of the source data, and carrying out the (Z-1)th merging on the Z-1 level low dimension data description of the target data and the Z-1 level high dimension data description of the target data, includes the following steps (a minimal code sketch covering all five steps follows the list):
(1) Multiplying the Z-1 level low dimension data size of the source data by its Z-1 level high dimension data size to merge the two, and multiplying the Z-1 level low dimension data size of the target data by its Z-1 level high dimension data size to merge the two. Illustratively, if the shape of the source data is z×y×x = 2×4×1024 and the shape of the target data is z×y×x = 2×2×1024, merging the x-axis and y-axis dimensions of the 1st-level dimension data description of the source data yields the 2nd-level dimension data description z×y' = 2×4096, and merging the x-axis and y-axis dimensions of the 1st-level dimension data description of the target data yields the 2nd-level dimension data description z×y' = 2×2048.
(2) Multiplying the Z-1 level low dimension span size by the Z-1 level high dimension span size to merge the two. Illustratively, if the span sizes are [y, x] = [4, 4], multiplying the low-dimension x-axis span size 4 by the high-dimension y-axis span size 4 gives [y'] = [16].
(3) Multiplying the Z-1 level low dimension repetition count by the Z-1 level high dimension repetition count to merge the two. Illustratively, if the repetition counts are (y, x) = (1, 4), multiplying the x-axis repetition count 4 by the y-axis repetition count 1 gives (y') = (4).
(4) Merging the Z-1 level low dimension memory arrangement of the source data with its Z-1 level high dimension memory arrangement, and merging the Z-1 level low dimension memory arrangement of the target data with its Z-1 level high dimension memory arrangement. For example, if the memory arrangement of the source data is layout[0, 1], the merged memory arrangement is layout[0]; likewise, if the memory arrangement of the target data is layout[0, 1], the merged memory arrangement is layout[0].
(5) Merging the Z-1 level low dimension access order with the Z-1 level high dimension access order. Illustratively, if the access order is dim_order = [0, 1], the merged access order is dim_order = [0].
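A minimal sketch that applies the five merge steps above to one data description; the container and field names (sizes, spans, repeats, layout, dim_order) are illustrative holders for the quantities named in steps (1)-(5), not the patent's own configuration fields.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataDescription:
    """Per-dimension description fields, ordered from the low dimension to the high dimension."""
    sizes: List[int]      # data size in each dimension
    spans: List[int]      # span size between two accesses in each dimension
    repeats: List[int]    # repetition count in each dimension
    layout: List[int]     # memory-arrangement identifiers
    dim_order: List[int]  # access-order identifiers

def merge_lowest_two(desc: DataDescription) -> DataDescription:
    """(Z-1)th merge of the lowest dimension (index 0) with its adjacent high dimension (index 1)."""
    return DataDescription(
        sizes=[desc.sizes[0] * desc.sizes[1]] + desc.sizes[2:],          # step (1): multiply sizes
        spans=[desc.spans[0] * desc.spans[1]] + desc.spans[2:],          # step (2): multiply spans
        repeats=[desc.repeats[0] * desc.repeats[1]] + desc.repeats[2:],  # step (3): multiply repeats
        layout=[desc.layout[0]] + desc.layout[2:],                       # step (4): merge layouts
        dim_order=[desc.dim_order[0]] + desc.dim_order[2:],              # step (5): merge access order
    )

# Example using the numbers in the text (x listed first as the low dimension):
src = DataDescription(sizes=[1024, 4, 2], spans=[4, 4, 1], repeats=[4, 1, 1],
                      layout=[0, 1, 2], dim_order=[0, 1, 2])
print(merge_lowest_two(src))
# sizes [4096, 2], spans [16, 1], repeats [4, 1], layout [0, 2], dim_order [0, 2]
```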
Based on the foregoing embodiments, the embodiments of the present invention further provide a mapping apparatus from a data stream to a DMA configuration, for performing the mapping method from a data stream to a DMA configuration according to any embodiment of the present invention. Fig. 7 is a schematic structural diagram of such a mapping apparatus. Referring to Fig. 7, the apparatus includes: a receiving module, configured to receive the Nth-level data stream; an access and synchronization checking module, configured to determine whether the scanning method in the Nth-level data stream conforms to the memory access limited by the DMA hardware and to check whether the synchronization description in the Nth-level data stream conforms to the synchronization limited by the DMA hardware; a specification checking module, configured to output the Nth-level data stream to the initial mapping device if it is determined that the memory access limited by the DMA hardware or the synchronization limited by the DMA hardware is not conformed to, and, if both are conformed to, to check whether the scan description in the Nth-level data stream conforms to the specification limited by the DMA hardware; and a DMA configuration output module, configured to split the Nth-level data stream into at least two (N+1)th-level data streams according to the scan description if the specification limited by the DMA hardware is not conformed to, wherein the (N+1)th-level data stream is the next-level data stream of the Nth-level data stream, N is a positive integer, and N is more than or equal to 1, and, if the specification limited by the DMA hardware is conformed to, to output to the DMA hardware a DMA configuration limited by the DMA hardware and having W repeated executions, according to the description of the Nth-level data stream and the interface information of the DMA hardware, wherein W is an integer and W is more than or equal to 0.
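The module structure above amounts to a recursive check-and-split loop over data-stream levels. The following is a minimal sketch of that control flow under the stated checks; all function and field names, and the splitting interface, are illustrative assumptions rather than the apparatus's actual interfaces.

```python
def map_data_stream(stream, hw, level=1):
    """Map an Nth-level data stream to DMA configurations, splitting into (N+1)th-level
    sub-streams whenever the scan description exceeds the hardware-limited specification."""
    # Memory-access and synchronization checks come first.
    if not hw.access_ok(stream.scan_method) or not hw.sync_ok(stream.sync_description):
        return hw.send_back_to_initial_mapping(stream)

    # Specification check on the scan description.
    if hw.spec_ok(stream.scan_description):
        # Emit a DMA configuration with W repeated executions for this level.
        config = hw.build_config(stream.description, repeat=stream.repeat_count)
        hw.submit(config)
        return

    # Otherwise split into at least two next-level data streams and recurse.
    for sub_stream in stream.split_by_scan_description():
        map_data_stream(sub_stream, hw, level + 1)
```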
The mapping apparatus from data stream to DMA configuration and the mapping method from data stream to DMA configuration provided by the embodiments of the present invention belong to the same inventive concept and can achieve the same technical effects, which are not repeated here.
Based on the above embodiments, the present invention further provides a DLA architecture. Fig. 8 is a schematic structural diagram of the DLA architecture provided by the present invention. Referring to Fig. 8, the DLA architecture includes DMA hardware, a computing unit, and a mapping device from data stream to DMA configuration, wherein the mapping device from data stream to DMA configuration communicates with the DMA hardware, and the DMA hardware communicates with the computing unit.
Based on the above embodiments, the present invention further provides a computer readable storage medium, where the computer readable storage medium stores computer instructions for causing a processor to implement the mapping method from data stream to DMA configuration according to any of the embodiments of the present invention when executed.
In some embodiments, the mapping method from the data stream to the DMA configuration may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as a storage unit. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device via the ROM and/or the communication unit. When the computer program is loaded into RAM and executed by the processor, one or more of the steps of the mapping method from data streams to DMA configuration described above may be performed. Alternatively, in other embodiments, the processor may be configured to perform the mapping method from the data stream to the DMA configuration in any other suitable manner (e.g., by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuitry, integrated circuit systems, field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor capable of receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain, or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer readable storage medium may be a machine readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.
The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims (9)

1. A method of mapping from a data stream to a DMA configuration, comprising:
receiving an nth data stream;
determining whether a scanning method in the nth data stream accords with access memory limited by DMA hardware, and checking whether synchronization descriptions in the nth data stream accord with synchronization limited by the DMA hardware;
if the access memory limited by the DMA hardware is not met or the synchronization limited by the DMA hardware is not met, outputting the N-th data stream to an initial mapping device; if the access memory limited by the DMA hardware is met and the synchronization limited by the DMA hardware is met, checking whether the scanning description in the N-stage data stream meets the specification limited by the DMA hardware or not;
if the data flow does not meet the specification limited by DMA hardware, splitting the N-th data flow into at least two N+1th data flows according to the scanning description, wherein the N+1th data flow is the next stage data flow of the N-th data flow, N is a positive integer, and N is more than or equal to 1; if the specification limited by the DMA hardware is met, outputting DMA configuration limited by the DMA hardware and provided with W times of repeated execution times to the DMA hardware according to the description of the N-level data stream and interface information of the DMA hardware, wherein W is an integer and is more than or equal to 0.
2. The method of mapping a data stream to a DMA configuration according to claim 1, wherein said determining whether the scanning method in the nth data stream meets DMA hardware limited access memory comprises:
if the scanning behavior in at least one dimension in the nth data stream is out of range, judging whether the memory arrangement of all dimensions in the nth data stream needs to be transposed or not; if yes, judging that the scanning method does not accord with the access memory limited by the DMA hardware, and if not, judging that the scanning method accords with the access memory limited by the DMA hardware;
and if the scanning behaviors in all the dimensions in the nth data stream do not cross the boundary scanning, judging that the scanning method accords with the access limited by the DMA hardware.
3. The mapping method from a data stream to a DMA configuration according to claim 2, further comprising, before determining whether the memory arrangements of all dimensions in the nth data stream need to be transposed if the scanning behavior in at least one dimension in the nth data stream is out of range:
in a single dimension, when the size of the source data is greater than or equal to the size of the target data, if [stride×(times−1)+dst_dim_size] ≤ src_dim_size, determining that the scanning behavior in the dimension does not cause out-of-range scanning, and if [stride×(times−1)+dst_dim_size] > src_dim_size, determining that the scanning behavior in the dimension is out of range;
in a single dimension, when the size of the source data is smaller than the size of the target data, if [stride×(times−1)+src_dim_size] ≤ dst_dim_size, determining that the scanning behavior in the dimension does not cause out-of-range scanning, and if [stride×(times−1)+src_dim_size] > dst_dim_size, determining that the scanning behavior in the dimension is out of range;
wherein src_dim_size is the size of the source data, dst_dim_size is the size of the target data, stride is the span size between two accesses in the current dimension, and times is the number of scans corresponding to the scanning behavior in the current dimension.
4. The method of mapping a data stream to a DMA configuration according to claim 1, wherein said checking whether a synchronization description in said nth data stream complies with DMA hardware limited synchronization comprises:
if the number of the upstream synchronous execution bodies, the number of the downstream synchronous execution bodies and/or the synchronization rule in the synchronous description do not accord with the synchronization limited by the DMA hardware, determining that the synchronous description in the Nth data stream does not accord with the synchronization limited by the DMA hardware.
5. The method of mapping a data stream to a DMA configuration of claim 1, further comprising, after said receiving an nth data stream:
if the Z-1 level low dimension of the source data and the Z-1 level high dimension adjacent to it are two contiguous dimensions in the memory arrangement, and the corresponding Z-1 level low dimension of the target data and the Z-1 level high dimension adjacent to it are two contiguous dimensions in the memory arrangement, then, when the size of the target data in the Z-1 level high dimension is equal to 1 or the target data and the source data have the same size in the Z-1 level low dimension, judging whether the Z level dimension data description of the source data, obtained by the (Z-1)th merging of the Z-1 level low dimension data description and the Z-1 level high dimension data description of the source data, and the Z level dimension data description of the target data, obtained by the (Z-1)th merging of the Z-1 level low dimension data description and the Z-1 level high dimension data description of the target data, meet the specification limited by the DMA hardware;
if the Z level dimension data description of the source data after the (Z-1)th merging and the Z level dimension data description of the target data after the (Z-1)th merging meet the specification limited by the DMA hardware, carrying out the (Z-1)th merging on the Z-1 level low dimension data description and the Z-1 level high dimension data description of the source data, and carrying out the (Z-1)th merging on the Z-1 level low dimension data description and the Z-1 level high dimension data description of the target data, so as to simplify the N-level data stream, wherein Z is a positive integer, Z is more than or equal to 2, and Z is less than the total number of dimensions in the N-level data stream.
6. The method of mapping a data stream to a DMA configuration according to claim 5, wherein carrying out the (Z-1)th merging on the Z-1 level low dimension data description of the source data and the Z-1 level high dimension data description of the source data, and carrying out the (Z-1)th merging on the Z-1 level low dimension data description of the target data and the Z-1 level high dimension data description of the target data, comprises:
multiplying the Z-1 level low dimensional data size of the source data and the Z-1 level high dimensional data size of the source data to merge the Z-1 level low dimensional data size of the source data and the Z-1 level high dimensional data size of the source data, and multiplying the Z-1 level low dimensional data size of the target data and the Z-1 level high dimensional data size of the target data to merge the Z-1 level low dimensional data size of the target data and the Z-1 level high dimensional data size of the target data;
multiplying the Z-1 stage low dimensional span size and the Z-1 stage high dimensional span size to combine the Z-1 stage low dimensional span size and the Z-1 stage high dimensional span size;
multiplying the Z-1 level low-dimensional repetition number and the Z-1 level high-dimensional repetition number to combine the Z-1 level low-dimensional repetition number and the Z-1 level high-dimensional repetition number;
Combining the Z-1 level low-dimensional memory arrangement of the source data and the Z-1 level high-dimensional memory arrangement of the source data, and combining the Z-1 level low-dimensional memory arrangement of the target data and the Z-1 level high-dimensional memory arrangement of the target data;
and merging the Z-1 level low-dimensional access sequence and the Z-1 level high-dimensional access sequence.
7. Mapping apparatus from a data stream to a DMA configuration for performing a mapping method from a data stream to a DMA configuration according to any of claims 1-6, said apparatus comprising:
a receiving module, configured to receive an nth data stream;
an access and synchronization checking module, configured to determine whether the scanning method in the N-th data stream conforms to the memory access limited by the DMA hardware and to check whether the synchronization description in the N-th data stream conforms to the synchronization limited by the DMA hardware;
a specification checking module, configured to output the N-th data stream to the initial mapping device if it is determined that the memory access limited by the DMA hardware or the synchronization limited by the DMA hardware is not conformed to, and, if it is determined that the memory access limited by the DMA hardware and the synchronization limited by the DMA hardware are both conformed to, to check whether the scan description in the N-th data stream conforms to the specification limited by the DMA hardware; and a DMA configuration output module, configured to split the N-th data stream into at least two N+1th data streams according to the scan description if it is determined that the specification limited by the DMA hardware is not conformed to, wherein the N+1th data stream is the next-level data stream of the N-th data stream, N is a positive integer, and N is more than or equal to 1, and, if it is determined that the specification limited by the DMA hardware is conformed to, to output to the DMA hardware a DMA configuration limited by the DMA hardware and having W repeated executions, according to the description of the N-th data stream and the interface information of the DMA hardware, wherein W is an integer and W is more than or equal to 0.
8. A computer readable storage medium storing computer instructions for causing a processor to implement the mapping method from data streams to DMA configurations of any of claims 1-6 when executed.
9. A DLA comprising DMA hardware, a computing unit and mapping means from a data stream to a DMA configuration as claimed in claim 7; wherein the mapping means from data flow to DMA configuration is in communication with the DMA hardware, which is in communication with the computing unit.
CN202211517576.6A 2022-11-29 2022-11-29 Mapping method and device from data stream to DMA configuration, storage medium and DLA Active CN116010301B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211517576.6A CN116010301B (en) 2022-11-29 2022-11-29 Mapping method and device from data stream to DMA configuration, storage medium and DLA

Publications (2)

Publication Number Publication Date
CN116010301A CN116010301A (en) 2023-04-25
CN116010301B true CN116010301B (en) 2023-11-24

Family

ID=86025633

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211517576.6A Active CN116010301B (en) 2022-11-29 2022-11-29 Mapping method and device from data stream to DMA configuration, storage medium and DLA

Country Status (1)

Country Link
CN (1) CN116010301B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008043564A1 (en) * 2006-10-11 2008-04-17 Rhf Gbr - Robelly, Herhold, Fettweis Synchronization and concurrent execution of control flow and data flow at task level
US10284645B1 (en) * 2014-05-06 2019-05-07 Veritas Technologies Llc Backup from network attached storage to sequential access media in network data management protocol environments
CN113254374A (en) * 2021-05-07 2021-08-13 黑芝麻智能科技(上海)有限公司 Method and processor for Direct Memory Access (DMA) access data
US11301295B1 (en) * 2019-05-23 2022-04-12 Xilinx, Inc. Implementing an application specified as a data flow graph in an array of data processing engines
CN114399035A (en) * 2021-12-30 2022-04-26 北京奕斯伟计算技术有限公司 Method for transferring data, direct memory access device and computer system
CN115202808A (en) * 2022-06-20 2022-10-18 中国科学院计算技术研究所 DMA method and system for system on chip in virtualization environment

Also Published As

Publication number Publication date
CN116010301A (en) 2023-04-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306
Patentee after: Shanghai Suiyuan Technology Co.,Ltd.
Country or region after: China
Address before: Room a-522, 188 Yesheng Road, Lingang New District, China (Shanghai) pilot Free Trade Zone, Pudong New Area, Shanghai, 201306
Patentee before: SHANGHAI ENFLAME TECHNOLOGY Co.,Ltd.
Country or region before: China