CN115237373A - Computing device, data processing method and related product - Google Patents


Info

Publication number: CN115237373A
Application number: CN202110449502.2A
Authority: CN (China)
Prior art keywords: data, circuit, histogram, detected, slave processing
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventor: Inventor not disclosed
Current and original assignee: Shanghai Cambricon Information Technology Co Ltd (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Filing: Application filed by Shanghai Cambricon Information Technology Co Ltd; priority to CN202110449502.2A (the priority date is an assumption and is not a legal conclusion)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 — Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/38 — Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F 7/48 — Methods or arrangements for performing computations using exclusively denominational number representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F 7/57 — Arithmetic logic units [ALU], i.e. arrangements or devices for performing two or more of the operations covered by groups G06F7/483–G06F7/556 or for performing logical operations


Abstract

The present disclosure discloses a computing device, a data processing method, and related products. The computing device may be included in a combined processing device, which may also include an interface device and other processing devices. The computing device interacts with the other processing devices to jointly complete computing operations specified by a user. The combined processing device may further include a storage device connected to the computing device and the other processing devices, respectively, for storing data of the computing device and the other processing devices. The disclosed scheme provides a hardware architecture for executing a histogram calculation instruction, which can simplify processing and improve the processing efficiency of the machine.

Description

Computing device, data processing method and related product
Technical Field
The present disclosure relates generally to the field of processors. More particularly, the present disclosure relates to computing devices, data processing methods, chips, and boards.
Background
A histogram is a common statistical tool suited to organizing and processing large amounts of measured data in order to find statistical regularities and infer the distribution characteristics of the data. Histograms are widely used in various engineering fields. For example, in image processing, histograms can be used to count the distribution of image pixel data, such as the number of pixels of each color. As another example, in computer vision, histograms can be used to implement target detection and tracking.
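For instance, the pixel-counting use of a histogram can be sketched in a few lines of Python. This is only an illustration of the statistical tool itself; the function and variable names are ours, not part of the disclosure:

```python
from collections import Counter

def pixel_histogram(pixels, levels=256):
    """Count how many pixels fall on each intensity level 0..levels-1."""
    counts = Counter(pixels)
    return [counts.get(level, 0) for level in range(levels)]

# A tiny 3x3 grayscale "image", flattened to one dimension.
image = [0, 255, 128, 128, 0, 0, 255, 128, 128]
hist = pixel_histogram(image)
```

Here `hist[v]` is the number of pixels with intensity `v`, and the counts over all levels sum to the pixel count of the image.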
Meanwhile, with the rapid development of deep learning, hardware architectures such as chips and processors suited to deep-learning processing have also advanced by leaps and bounds. If histogram operations could be migrated onto a hardware architecture suited to deep-learning processing, the histogram operation could be accelerated, improving the efficiency of processing such as image processing and target tracking. However, existing hardware and/or instruction sets do not efficiently support histogram-related operations.
Disclosure of Invention
To at least partially solve one or more technical problems mentioned in the background, an aspect of the present disclosure provides a computing device, a data processing method, a chip, and a board.
In a first aspect, the present disclosure discloses a computing device comprising a control circuit and an operation circuit, wherein: the control circuit is configured to parse a histogram calculation instruction, the operands of which include data to be detected and target data, the instruction indicating that the number of occurrences of the target data in the data to be detected is to be counted as output data; and the operation circuit comprises a plurality of slave processing circuits and is configured to schedule a corresponding number of slave processing circuits to execute the histogram operation task according to the histogram calculation instruction, wherein each scheduled slave processing circuit executes the histogram operation on a portion of the data to be detected and/or the target data.
In a second aspect, the present disclosure provides a chip comprising the computing device of any of the embodiments of the first aspect described above.
In a third aspect, the present disclosure provides a board card comprising the chip of any of the embodiments of the second aspect.
In a fourth aspect, the present disclosure provides a data processing method implemented by a computing device comprising a control circuit and an operation circuit, the operation circuit comprising a plurality of slave processing circuits, the method comprising: the control circuit parses a histogram calculation instruction, the operands of which include data to be detected and target data, the instruction indicating that the number of occurrences of the target data in the data to be detected is to be counted as output data; and the operation circuit schedules a corresponding number of slave processing circuits to execute the histogram operation task according to the histogram calculation instruction, wherein each scheduled slave processing circuit executes the histogram operation on a portion of the data to be detected and/or the target data.
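The functional semantics of the histogram calculation instruction described in these aspects — counting, for each target element, its occurrences in the data to be detected — can be sketched in Python (names are illustrative only, not part of the disclosure):

```python
def histogram_instruction(data_to_detect, target_data):
    """For each target element, count its occurrences in the data to be detected."""
    return [sum(1 for x in data_to_detect if x == t) for t in target_data]

data = [3, 1, 3, 7, 1, 3]       # data to be detected
targets = [1, 3, 9]             # histogram statistic elements
output = histogram_instruction(data, targets)
```

The hardware scheme described below parallelizes exactly this computation by splitting `data` and/or `targets` across slave processing circuits.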
With the computing device, data processing method, integrated circuit chip, and board card provided above, the disclosed embodiments provide a scheme for executing histogram calculation instructions on a hardware architecture that includes a plurality of slave processing circuits. The histogram operation task can be distributed across the plurality of slave processing circuits, thereby fully exploiting the computational advantages of a deep-learning hardware architecture and improving the processing efficiency of the machine through parallel operation.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which like reference numerals refer to like or corresponding parts and in which:
fig. 1 shows a block diagram of a board card of an embodiment of the present disclosure;
FIG. 2 illustrates a block diagram of a combined processing device of an embodiment of the disclosure;
FIG. 3 illustrates an internal structural diagram of a processor core of a single or multi-core computing device of an embodiment of the present disclosure;
FIG. 4 illustrates an exemplary structural diagram of a computing device of an embodiment of the present disclosure;
FIG. 5 is a schematic diagram illustrating a portion of an operational circuit in a computing device according to an embodiment of the disclosure;
FIG. 6 illustrates a partial block diagram of a slave processing circuit in accordance with an embodiment of the present disclosure; and
FIG. 7 illustrates an exemplary flow diagram of a data processing method 700 according to an embodiment of the disclosure.
Detailed Description
The technical solutions in the embodiments of the present disclosure will be described below clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is to be understood that the described embodiments are only some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be derived by one skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the scope of protection of the present disclosure.
It should be understood that the terms "first," "second," "third," and "fourth," etc., as may appear in the claims, the specification, and the drawings of the present disclosure, are used to distinguish between different objects, and are not used to describe a particular order. The terms "comprises" and "comprising," when used in the specification and claims of this disclosure, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the disclosure is for the purpose of describing particular embodiments only, and is not intended to be limiting of the disclosure. As used in the specification and claims of this disclosure, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the term "and/or" as used in the specification and claims of this disclosure refers to any and all possible combinations of one or more of the associated listed items and includes such combinations.
As used in this specification and claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection".
Specific embodiments of the present disclosure are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic structural diagram of a board card 10 according to an embodiment of the disclosure. As shown in fig. 1, the board card 10 includes a chip 101, which is a system-on-chip (SoC) integrated with one or more combined processing devices. A combined processing device is an artificial-intelligence computing unit that supports various deep-learning and machine-learning algorithms and meets intelligent processing requirements in complex scenarios in fields such as computer vision, speech, natural language processing, and data mining. Deep-learning technology in particular is widely applied in the cloud-intelligence field, one notable characteristic of which is a large input data volume, placing high demands on the storage and computing capacity of the platform.
The chip 101 is connected to an external device 103 through an external interface device 102. The external device 103 is, for example, a server, a computer, a camera, a display, a mouse, a keyboard, a network card, or a Wi-Fi interface. Data to be processed may be transferred from the external device 103 to the chip 101 through the external interface device 102, and the calculation results of the chip 101 may be transmitted back to the external device 103 via the external interface device 102. The external interface device 102 may take different interface forms, such as a PCIe interface, according to the application scenario.
The board card 10 also includes a storage device 104 for storing data, which includes one or more storage units 105. The storage device 104 is connected to, and exchanges data with, the control device 106 and the chip 101 through a bus. The control device 106 on the board card 10 is configured to regulate the state of the chip 101. To this end, in one application scenario, the control device 106 may include a microcontroller unit (MCU).
Fig. 2 is a structural diagram showing a combined processing device in the chip 101 of this embodiment. As shown in fig. 2, the combination processing device 20 includes a computing device 201, an interface device 202, a processing device 203, and a storage device 204.
The computing device 201 is configured to perform user-specified operations. It is mainly implemented as a single-core or multi-core intelligent processor performing deep-learning or machine-learning computations, and may interact with the processing device 203 through the interface device 202 to jointly complete the user-specified operations.
The interface device 202 is used for transmitting data and control instructions between the computing device 201 and the processing device 203. For example, the computing device 201 may obtain input data from the processing device 203 via the interface device 202 and write it to a storage device on the computing device 201. Further, the computing device 201 may obtain control instructions from the processing device 203 via the interface device 202 and write them into a control cache on the computing device 201. Alternatively or additionally, the interface device 202 may also read data from a storage device of the computing device 201 and transmit the data to the processing device 203.
The processing device 203, as a general-purpose processing device, performs basic control including, but not limited to, data transfer and starting and/or stopping the computing device 201. Depending on the implementation, the processing device 203 may be one or more types of central processing unit (CPU), graphics processing unit (GPU), or other general-purpose and/or special-purpose processor, including but not limited to a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc., and their number may be determined according to actual needs. As previously mentioned, the computing device 201 of the present disclosure, considered on its own, may be viewed as having a single-core structure or a homogeneous multi-core structure. However, when considered together, the computing device 201 and the processing device 203 form a heterogeneous multi-core structure.
The storage device 204 is used to store data to be processed. It may be a DRAM or DDR memory, typically 16 GB or larger in size, and stores data of the computing device 201 and/or the processing device 203.
Fig. 3 shows a schematic diagram of an internal structure of a processing core when the computing device 201 is a single-core or multi-core device. The computing device 301 is used for processing input data such as computer vision, voice, natural language, data mining, and the like, and the computing device 301 includes three major modules: a control module 31, an operation module 32 and a storage module 33.
The control module 31 is used for coordinating and controlling the operations of the operation module 32 and the storage module 33 to complete the task of deep learning, and includes an Instruction Fetch Unit (IFU) 311 and an Instruction Decode Unit (IDU) 312. The instruction fetch unit 311 is used for obtaining an instruction from the processing device 203, and the instruction decoding unit 312 decodes the obtained instruction and sends the decoded result as control information to the operation module 32 and the storage module 33.
The operation module 32 includes a vector operation unit 321 and a matrix operation unit 322. The vector operation unit 321 is used for performing vector operations, and can support complex operations such as vector multiplication, addition, nonlinear transformation, and the like; the matrix operation unit 322 is responsible for the core calculation of the deep learning algorithm, i.e., matrix multiplication and convolution.
The storage module 33 is used to store or transport related data, and includes a neuron storage unit (neuron RAM, NRAM) 331, a weight storage unit (weight RAM, WRAM) 332, and a direct memory access module (DMA) 333. The NRAM 331 is used to store input neurons, output neurons, and intermediate results of computation; the WRAM 332 is used to store the convolution kernels, i.e., the weights, of the deep-learning network; and the DMA 333 is connected to the DRAM 204 via the bus 34 and is responsible for data transfer between the computing device 301 and the DRAM 204.
Based on the foregoing hardware environment, in one aspect, the disclosed embodiments provide a computing device that performs histogram operation tasks according to histogram calculation instructions. A histogram operation task is generally understood as counting the number of occurrences of target data in the data to be detected. There may be more than one target data, i.e., more than one histogram statistic element to count. Furthermore, the data to be detected can be very large. Therefore, if the hardware environment can be fully exploited to find opportunities for parallel processing of the data, the computation can be accelerated and efficiency improved.
FIG. 4 shows a schematic block diagram of a computing device 400 according to an embodiment of the present disclosure. It will be appreciated that this structure may be viewed as an internal structural refinement of a single processing core as in fig. 3, or as a functionally partitioned block diagram that is federated over multiple processing cores as shown in fig. 3. As shown in fig. 4, the computing device 400 of the present disclosure may be used to perform histogram operation tasks, and may include a storage circuit 40, a control circuit 41 and an operation circuit 42, which are connected to each other to transmit various data and instructions.
The control circuit 41 functions similarly to the control module 31 in fig. 3, and may also include an Instruction Fetch Unit (IFU) 411 and an Instruction Decode Unit (IDU) 412, for example. In performing various arithmetic operations, such as a calculation operation, the control circuit 41 may be configured to obtain a calculation instruction and parse the calculation instruction to obtain an operation instruction, and then send the operation instruction to the operation circuit 42 and the storage circuit 40. The computation instructions may be a form of hardware instructions and include one or more opcodes, each of which may represent one or more specific operations to be performed by the arithmetic circuitry 42. The operations may include different types of operations according to application scenarios, and may include, for example, arithmetic operations such as addition operations or multiplication operations, logical operations, comparison operations, or table lookup operations, or any combination of the foregoing operations. Accordingly, the operation instruction may be one or more microinstructions executed within the operation circuit parsed from the computation instruction.
Further, depending on the application scenario, the operation instruction obtained after parsing the calculation instruction may be an operation instruction decoded by the control circuit 41 or may be an operation instruction that is not decoded by the control circuit 41. When the operation instruction is an operation instruction that is not decoded by the control circuit 41, a corresponding decoding circuit may be included in the operation circuit 42 to decode the operation instruction, for example, to obtain a plurality of microinstructions.
The operation circuit 42 may include a master processing circuit 421 and a plurality of slave processing circuits 422. The master processing circuit and the slave processing circuits, as well as the slave processing circuits among themselves, may communicate with each other through various connections.
The master processing circuit and the slave processing circuits may cooperate with each other to realize parallel arithmetic processing. In such a configuration, the master processing circuit may, for example, perform pre-processing on the input data, such as splitting the data, and may receive intermediate results from the plurality of slave processing circuits and perform subsequent processing to obtain the final operation result of the operation instruction. Each slave processing circuit may perform intermediate operations on its corresponding data (e.g., the split data) in parallel according to the operation instruction, obtaining intermediate results that are transmitted back to the master processing circuit.
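As a rough functional sketch of this division of labor applied to the histogram task (all names hypothetical; real hardware would run the slave calls in parallel rather than in a Python loop):

```python
def slave_count(piece, targets):
    """One slave's intermediate operation: partial counts on its data piece."""
    return [sum(1 for x in piece if x == t) for t in targets]

def master_run(data, targets, num_slaves):
    # Pre-processing: the master splits the data to be detected into pieces.
    chunk = (len(data) + num_slaves - 1) // num_slaves
    pieces = [data[i:i + chunk] for i in range(0, len(data), chunk)]
    # Each scheduled slave produces an intermediate result (conceptually parallel).
    partials = [slave_count(piece, targets) for piece in pieces]
    # Subsequent processing: the master accumulates the partial results.
    return [sum(p[j] for p in partials) for j in range(len(targets))]

result = master_run([5, 2, 5, 5, 2, 9, 5, 2], targets=[2, 5], num_slaves=4)
```

The master's accumulation step works because occurrence counts over disjoint pieces simply add up.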
In different application scenarios, the connection manner between the multiple slave processing circuits may be a hard connection manner arranged by a hard wire, or a logic connection manner configured according to, for example, a microinstruction, so as to form a topology of multiple slave processing circuit arrays. The disclosed embodiments are not limited in this respect.
By configuring the operation circuit 42 in a master-slave structure (e.g., one master and multiple slaves, or multiple masters and multiple slaves; the disclosure is not limited in this respect), data can be split according to the computation instruction of a forward operation, so that the computation-heavy portion is operated on in parallel by a plurality of slave processing circuits, increasing computation speed, saving computation time, and in turn reducing power consumption.
To support the arithmetic function, the master processing circuit and the slave processing circuit may include various computation circuits, for example, may include a vector operation unit and a matrix operation unit. The vector operation unit is used for executing vector operation and can support complex operations such as vector multiplication, addition, nonlinear transformation and the like; the matrix operation unit is responsible for core calculation of the deep learning algorithm, such as matrix multiplication and convolution.
The storage circuit 40 is used for storing or transporting related data. In deep learning, the memory circuit may be used to store neurons, weight values, and other operation data, or may store operation results. The Memory circuit may include, for example, one or any combination of a cache 402, a register 404, and a Direct Memory Access (DMA) module 406. The direct memory access module DMA 406 may be used for data interaction with an off-chip memory (not shown).
The foregoing describes an exemplary computing device in which the master processing circuit and the slave processing circuit may belong to different modules of the same processor or chip, or may belong to different processors, according to embodiments of the present disclosure, which is not limited in this respect.
In one embodiment, the control circuit 41 may be configured to parse the histogram calculation instruction, for example, using the instruction decode unit 412. The operands of the histogram calculation instruction include data to be detected and target data, and the instruction indicates that the number of occurrences of the target data in the data to be detected is to be counted as output data. Further, in this embodiment, the operation circuit 42 may be configured to schedule a corresponding number of slave processing circuits 422 to perform the histogram operation task according to the histogram calculation instruction, wherein each scheduled slave processing circuit performs the histogram operation on a portion of the data to be detected and/or the target data.
In some embodiments, in order to fully utilize the parallel computing resources of the computing device, the master processing circuit 421 in the operation circuit 42 may be configured to determine a splitting scheme of the histogram operation task according to a splitting rule according to the histogram calculation instruction and the processing capability of the slave processing circuit.
The amount of operations to be performed may be determined from the histogram calculation instruction. The histogram calculation instruction may indicate information such as the size of the data to be detected, the size of the target data, and the like. For example, the data to be detected may be one-dimensional, two-dimensional, or higher-dimensional data; the target data may include one or more histogram statistical elements, each of which may be a numerical value.
The processing power of the slave processing circuits represents the available computational resources, which may include, for example, the number of available slave processing circuits, the amount of data that each slave processing circuit can process at one time, and so on.
Thus, based on the amount of computation to be done and the amount of available computational resources, the main processing circuitry may determine a splitting scheme for the histogram computation task in accordance with the splitting rules.
In particular, the splitting scheme may include one or more of: the transmission mode of the data to be detected and the target data, the splitting mode of the data to be detected and the target data, the number of multiplexing times R_N of the data to be detected, the number of multiplexing times R_S of the target data, and so on.
Fig. 5 shows a partial structural schematic diagram of an arithmetic circuit in an arithmetic device according to an embodiment of the present disclosure, in order to describe various splitting schemes by showing distribution, transmission, and processing of various data in the arithmetic circuit.
As shown, the arithmetic circuit 500 (e.g., the arithmetic circuit 42 in fig. 4) may include a first storage circuit 530 and a second storage circuit 540 in addition to the master processing circuit 510, the plurality of slave processing circuits 520 described above.
The first memory circuit 530 may be used to store multicast data, i.e. the data in the first memory circuit will be transmitted via the broadcast bus to a plurality of slave processing circuits, which receive the same data. It will be appreciated that broadcast and multicast may be implemented via a broadcast bus. Multicast refers to a communication mode in which a piece of data is transmitted to a plurality of slave processing circuits; broadcast is a communication mode for transmitting a piece of data to all slave processing circuits, and is a special case of multicast. Since multicast and broadcast both correspond to one-to-many transmission modes, and are not specifically distinguished herein, broadcast and multicast may be collectively referred to as multicast, and a person skilled in the art may make their meanings clear from context. The second memory circuit 540 may be used for storing distribution data, i.e. data in the second memory circuit will be transmitted to different slave processing circuits, respectively, each receiving different data. By providing the first storage circuit and the second storage circuit separately, it is possible to support transmission in different transmission manners for data to be operated on, thereby reducing data throughput by multiplexing multicast data among a plurality of slave processing circuits.
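To see why multiplexing the multicast data reduces throughput, consider a toy traffic model (ours, not the patent's; the byte figures are arbitrary): sending the multicast data once over the broadcast bus versus naively copying it to every slave.

```python
def traffic_bytes(data_bytes, target_bytes, num_slaves, broadcast_data=True):
    """Illustrative bus-traffic estimate. Multicast data crosses the bus once;
    distribution data is split so each slave loads only its own share."""
    if broadcast_data:
        # Detection data broadcast once; target data split across slaves.
        return data_bytes + target_bytes
    # Naive alternative: a private copy of the detection data per slave.
    return data_bytes * num_slaves + target_bytes

# Broadcasting 1 MiB of detection data to 64 slaves instead of copying it:
naive = traffic_bytes(1 << 20, 64 * 4, 64, broadcast_data=False)
mcast = traffic_bytes(1 << 20, 64 * 4, 64, broadcast_data=True)
saved = naive - mcast
```

Under these assumptions the broadcast path moves 63 MiB less data, which is the motivation for storing the large operand as multicast data in the first storage circuit.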
In some embodiments, the main processing circuit 510 may be further configured to: determining one of the data to be detected and the target data as multicast data, and determining the other one of the data to be detected and the target data as distribution data; storing the multicast data in the first storage circuit 530 for transmission to the scheduled plurality of slave processing circuits 520 over the broadcast bus during operation; and storing the split distribution data in the second storage circuit 540 for the scheduled slave processing circuit 520 to load the corresponding part of the distribution data for operation.
In a first scenario, the target data contains many histogram statistic elements; for example, the number of histogram statistic elements is comparable to the number of slave processing circuits. Here the histogram calculation task can be split so that each slave processing circuit counts one histogram statistic element. In such an embodiment, the main processing circuit is further configured to determine the data to be detected as multicast data and the target data as distribution data. Thus, the number of multiplexing times R_N of the data to be detected corresponds to the number of scheduled slave processing circuits minus 1 (i.e., subtracting each circuit's own single use), for example 64 - 1 = 63. In this case, the histogram statistic elements processed by the slave processing circuits all differ from one another, so the target data is not multiplexed, and its number of multiplexing times is R_S = 0. It can be seen that (R_N + 1) * (R_S + 1) = Ns, where Ns is the number of scheduled slave processing circuits.
Further, the data to be detected may be so large that a slave processing circuit cannot process it in one pass. In this case, the data to be detected can be split into detection data blocks, each of whose size does not exceed the one-pass throughput of a single slave processing circuit. Because counting the occurrences of histogram statistic elements is independent of the dimension and/or direction of the data to be detected (i.e., the final result is unaffected by which dimension of the data is counted first), any suitable dimension can be chosen for splitting the data.
Although the multidimensional data has a plurality of dimensions, because the layout of the memory circuits (e.g., the first memory circuit and the second memory circuit) is always one-dimensional, there is a correspondence between the multidimensional data and the storage order on the memory circuits. Multidimensional data is usually allocated in a continuous storage space, i.e. multidimensional data can be one-dimensionally expanded and stored on a storage circuit in sequence. The data is typically spread in order of low dimension to high dimension, i.e. a low dimension first mode, whereby the low dimension data is continuous over the memory circuit.
Thus, in some embodiments, the data to be detected may be split into L_N detection data blocks in order from the low storage dimension to the high storage dimension, so that the blocks can be split and read directly in storage order without readjusting the storage order of the data. After splitting into L_N detection data blocks, the histogram operation task can be completed in L_N rounds of operation, one detection data block being transmitted to the scheduled slave processing circuits over the broadcast bus in each round.
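An illustrative sketch (not from the patent itself) of this splitting: data stored contiguously in low-dimension-first order can simply be flattened and chunked along the storage order, and because the histogram count is order-independent, the per-block counts sum to the same result as counting the whole data.

```python
def split_into_blocks(data_flat, block_size):
    """Split one-dimensionally stored data into detection data blocks,
    following the storage order with no rearrangement."""
    return [data_flat[i:i + block_size]
            for i in range(0, len(data_flat), block_size)]

data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
blocks = split_into_blocks(data, 4)   # L_N = 3 blocks -> 3 rounds of operation
# Order-independence: summing per-block counts equals counting the whole data.
count_of_5 = sum(b.count(5) for b in blocks)
```

The helper name `split_into_blocks` is hypothetical; the point is only that splitting along the existing storage order preserves the final statistic.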
Further, the target data determined as distribution data may be split according to the histogram statistic elements it contains, i.e., into a plurality of histogram statistic elements, to be allocated to different slave processing circuits, so that each slave processing circuit counts a different histogram statistic element.
In such an embodiment, each slave processing circuit may retrieve its assigned target data, e.g., one histogram statistic element, from the second storage circuit. Then, for the detection data block broadcast in each round of operation, each slave processing circuit counts the occurrences of the histogram statistic element assigned to it and obtains the statistical result of the current round. This result covers only part of the data to be detected and may therefore be called a partial statistic. Each slave processing circuit accumulates the partial statistics of all rounds to obtain the final statistic of its statistic element over the whole data to be detected.
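A behavioral sketch of this first scenario, with hypothetical helper names: each slave processing circuit holds one statistic element, counts it in the broadcast block of each round, and accumulates the partial statistics over the rounds.

```python
def count_element(element, block):
    """Partial statistic of one round: occurrences of element in the block."""
    return sum(1 for v in block if v == element)

def histogram_scenario1(blocks, elements):
    """One slave per statistic element; each block is broadcast in one round."""
    totals = {e: 0 for e in elements}      # one accumulator per slave circuit
    for block in blocks:                   # one round per detection data block
        for e in elements:                 # the slaves count in parallel
            totals[e] += count_element(e, block)
    return totals

blocks = [[1, 2, 2, 3], [2, 3, 3, 3]]
result = histogram_scenario1(blocks, elements=[1, 2, 3])
```

The nested loop is a sequential stand-in for what the scheduled slave processing circuits do concurrently.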
In a second scenario, the target data contains fewer histogram statistic elements, for example only half as many as there are slave processing circuits; the histogram operation task may then be split such that every Ng (e.g., Ng may be 2, 4, etc.) slave processing circuits count the same histogram statistic element. In such an embodiment, the main processing circuit may determine the data to be detected as multicast data and the target data as distribution data. The number of multiplexing times R_N of the data to be detected then equals Ns/Ng-1; for example, with Ng=2 and Ns=64 scheduled slave processing circuits, R_N=64/2-1=31. Every Ng slave processing circuits process the same histogram statistic element, so the number of multiplexing times of the target data is R_S=Ng-1. Again, (R_N+1)*(R_S+1)=Ns, where Ns is the number of scheduled slave processing circuits.
Generally, the data to be detected is large and a slave processing circuit cannot process it in one pass. Thus, likewise, the data to be detected can be split into detection data blocks, each of whose size does not exceed the single-pass throughput of one slave processing circuit. The splitting method can be the same as before and is not repeated here.
Since in this scenario multiple (e.g., Ng) slave processing circuits count the same histogram statistic element, Ng=R_S+1 detection data blocks need to be transmitted in each round of operation, each block being transmitted by multicast to a corresponding one of the R_S+1 groups into which the scheduled slave processing circuits are divided. Taking Ng=2 as an example, in each round one detection data block is multicast to a first group, e.g., the first half, of the scheduled slave processing circuits, and the other detection data block is multicast to a second group, e.g., the other half.
Further, the target data determined as distribution data is likewise split by histogram statistic element, i.e., into a plurality of histogram statistic elements. Now, however, the same histogram statistic element will be assigned to Ng slave processing circuits.
In such an embodiment, each slave processing circuit may retrieve its assigned target data, e.g., one histogram statistic element, from the second storage circuit. Then, for the detection data block multicast in each round of operation, each slave processing circuit counts the occurrences of its assigned histogram statistic element in the block and obtains the statistical result of the current round, which is likewise a partial statistic. Accumulating the partial statistics of all rounds, each slave processing circuit obtains the statistic of its element over 1/Ng of the whole data to be detected. Therefore, in this embodiment, the statistics of the slave processing circuits assigned the same histogram statistic element must also be accumulated to obtain the final statistic of that element over the whole data to be detected.
As mentioned before, the master processing circuit may, for example, receive intermediate results from the plurality of slave processing circuits and perform subsequent processing to obtain the final result of the operation instruction. In particular, in the above embodiment, the master processing circuit 510 may be configured to accumulate, for each group of Ng slave processing circuits corresponding to the same histogram statistic element, their statistics, thereby obtaining the final statistics for all histogram statistic elements.
In other embodiments, the slave processing circuits assigned the same histogram statistic element may accumulate their respective partial statistics before returning them to the master processing circuit. For example, every Ng slave processing circuits may sum pairwise in a multi-level tree structure and then return the final statistic to the master processing circuit. The disclosed embodiments are not limited in the particular manner of accumulation.
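A minimal sketch of such a multi-level tree accumulation (an illustration, not the patent's hardware design): the Ng partial statistics for one statistic element are summed pairwise, level by level, until a single value remains for return to the master processing circuit.

```python
def tree_accumulate(partials):
    """Pairwise sum in a multi-level tree until one value remains."""
    level = list(partials)
    while len(level) > 1:
        level = [level[i] + level[i + 1] if i + 1 < len(level) else level[i]
                 for i in range(0, len(level), 2)]
    return level[0]

# Ng = 4 slave circuits share one statistic element; their partial statistics
# are merged in two tree levels: [5,3,7,1] -> [8,8] -> [16].
final = tree_accumulate([5, 3, 7, 1])
```

An odd group size is handled by carrying the unpaired value up to the next level unchanged.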
As can be seen from the above description, the first scenario can be regarded as a special case of the second scenario with Ng=1. When Ng becomes larger, i.e., the target data contains very few histogram statistic elements, e.g., 4 elements against 64 slave processing circuits (only 1/16 of their number), the splitting of the second scenario may still be used, i.e., every Ng (here Ng=16) slave processing circuits count the same histogram statistic element. However, in this case the transmission modes of the data to be detected and the target data can be adjusted to further reduce data accesses and improve transmission efficiency.
Thus, in a third scenario, the main processing circuit may determine the data to be detected as distribution data and the target data as multicast data. The number of multiplexing times R_N of the data to be detected is still Ns/Ng-1; for example, with Ng=16 and Ns=64 scheduled slave processing circuits, R_N=64/16-1=3. Every Ng slave processing circuits process the same histogram statistic element, so the number of multiplexing times of the target data is R_S=Ng-1=15. Again, (R_N+1)*(R_S+1)=Ns, where Ns is the number of scheduled slave processing circuits.
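The multiplexing bookkeeping shared by all three scenarios can be sketched as follows (a plain restatement of the formulas above, with hypothetical function names): given Ns scheduled slave processing circuits in groups of Ng sharing one statistic element, R_N = Ns/Ng - 1 and R_S = Ng - 1, so (R_N+1)*(R_S+1) = Ns always holds.

```python
def multiplex_counts(num_slaves, group_size):
    """Multiplexing counts for data to be detected (r_n) and target data (r_s)."""
    r_n = num_slaves // group_size - 1   # reuse count of the data to be detected
    r_s = group_size - 1                 # reuse count of the target data
    assert (r_n + 1) * (r_s + 1) == num_slaves
    return r_n, r_s

# First scenario (Ng=1), second (Ng=2) and third (Ng=16), all with Ns=64:
cases = [multiplex_counts(64, g) for g in (1, 2, 16)]
```

The assertion encodes the invariant stated in each scenario; it assumes Ng divides Ns evenly, as in the patent's examples.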
Similarly, the data to be detected can be split into detection data blocks, each of whose size does not exceed the single-pass throughput of one slave processing circuit, with the same splitting pattern as before. Since in this scenario multiple (e.g., 16) slave processing circuits count the same histogram statistic element, 16 detection data blocks need to be transmitted in each round of operation. These detection data blocks are stored in the second storage circuit, from which each slave processing circuit can read the block assigned to it. For example, each group of 16 slave processing circuits reads the 16 detection data blocks; since there are 4 groups in total, each detection data block is used 4 times, i.e., multiplexed 3 times.
The target data, being multicast data, is transmitted over the broadcast bus to the corresponding group among the scheduled slave processing circuits. Taking Ng=16 as an example, i.e., 16 slave processing circuits count one histogram statistic element on different detection data blocks, 4 histogram statistic elements may be transmitted in each round of operation to 4 groups of slave processing circuits, each group comprising 16 slave processing circuits.
Then, for the detection data block distributed to it in each round of operation, each slave processing circuit counts the occurrences of the histogram statistic element transmitted to it by multicast, obtaining the partial statistic of the current round. Accumulating the partial statistics of all rounds, each slave processing circuit obtains the statistic of its element over 1/Ng of the whole data to be detected. Since every Ng slave processing circuits process the same histogram statistic element, the statistics of the slave processing circuits assigned the same element need to be accumulated, as in the second scenario, to obtain the final statistic of that element over the whole data to be detected. This further accumulation may proceed as in the second scenario and is not repeated here.
It will be appreciated by those skilled in the art that although the respective processing and memory circuits are shown in fig. 5 as separate modules, the memory and processing circuits may be combined into one module according to different configurations. For example, the first memory circuit 530 may be incorporated with the master processing circuit 510, and the second memory circuit 540 may be shared by a plurality of slave processing circuits 520, and each slave processing circuit may be assigned a separate memory region to speed up access. The disclosed embodiments are not limited in this respect.
The splitting approach of the histogram operation is described above in connection with fig. 5. As can be seen from the above examples, the splitting scheme of the histogram operation task may follow at least one of the following rules: splitting the data to be detected on the lowest dimension according to the alignment requirement; preferentially letting the number of multiplexing times R_N of the multicast data reach its maximum, and selectively satisfying the number of multiplexing times R_S of the distribution data; preferentially splitting the multicast data into multiplexing rounds according to the dimension of the output data, and, when the dimension of the output data is smaller than a preset threshold, jointly splitting the distribution data into multiplexing rounds; and keeping the number of multiplexing times of the multicast data in each round of operation greater than the minimum number of multicast-data multiplexes M and not exceeding the maximum number N.
In one implementation, the data to be detected is split on its lowest dimension according to the alignment requirement. The foregoing describes splitting the data to be detected into a plurality of detection data blocks, preferably in order from the low storage dimension to the high storage dimension. The "alignment requirement" here refers to a requirement on data size during operation, for example to make the most of the hardware by filling up the operators. In one example, the alignment requirement is 64 bytes, so the data to be detected can be split into 64-byte detection data blocks so that each block fully utilizes the scheduled slave processing circuit.
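A hedged sketch of this alignment-driven splitting (the element size below is an assumption for illustration, not given by the patent): under a 64-byte alignment requirement, the lowest dimension is split into blocks whose byte size matches the alignment.

```python
def blocks_for_alignment(lowest_dim_len, elem_bytes, align_bytes=64):
    """Elements per detection data block and number of blocks on the lowest
    dimension, under a byte alignment requirement."""
    elems_per_block = align_bytes // elem_bytes
    num_blocks = -(-lowest_dim_len // elems_per_block)   # ceiling division
    return elems_per_block, num_blocks

# e.g. a lowest dimension of 1000 two-byte elements under 64-byte alignment:
per_block, n_blocks = blocks_for_alignment(1000, elem_bytes=2)
```

The last block may be smaller than the alignment unit; ceiling division accounts for it.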
Alternatively or additionally, in one implementation, to take full advantage of the broadcast bus and reduce data accesses, the splitting scheme preferentially lets the number of multiplexing times R_N of the multicast data reach its maximum, and selectively satisfies the number of multiplexing times R_S of the distribution data. As the foregoing examples show, (R_N+1)*(R_S+1)=Ns, where Ns is the number of scheduled slave processing circuits, so for a fixed number of slave processing circuits the respective multiplexing times of the multicast data and the distribution data can be allocated according to this rule.
Alternatively or additionally, in one implementation, the multicast data is preferentially split into multiplexing rounds according to the dimension of the output data. In the histogram operation of the embodiments of the present disclosure, the output data is the number of occurrences of each histogram statistic element in the data to be detected, so the dimension of the output data corresponds to the number of histogram statistic elements, and the splitting can be determined accordingly. For example, when the number of histogram statistic elements N_hist is greater than or equal to the number Ns of schedulable slave processing circuits, the elements can be split in units of Ns: each round counts Ns histogram statistic elements, one per slave processing circuit.
In some cases, after splitting N_hist in units of Ns, the last remainder may be much smaller than Ns; if the same scheme were still used, hardware efficiency would be low because too much hardware would sit idle in the last round of operation. To this end, alternatively or additionally, in one implementation, the number of multicast-data multiplexes per round of operation may be kept greater than the minimum number of multiplexes M and not exceeding the maximum number N. In the above example, N=Ns.
The two rules above may be jointly expressed as the following logic:

Mode 0: if N_hist >= N, enter S0_1;

S0_1: on the histogram-statistic-element dimension, search in units of 2N while splitting in units of N; if N_hist > 2N, enter S0_2; if N+M < N_hist <= 2N, enter S0_3; if N < N_hist <= N+M, enter S0_4; otherwise enter S0_5;

S0_2: multiplex N times, set N_hist = N_hist - N, and return to S0_1;

S0_3: multiplex in two rounds, using N times and N_hist - N times respectively;

S0_4: multiplex in two rounds, using N_hist - M times and M times respectively;

S0_5: use directly N_hist times.
For example, assume N_hist=136, N=64, M=4; the operation is then split into 3 rounds, and the multicast data is used 64, 64, and 8 times in the three rounds, respectively. If M=16 with the other parameters unchanged, the operation is again split into 3 rounds, with the multicast data used 64, 56, and 16 times, respectively.
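The round-splitting logic of states S0_1 through S0_5 can be sketched as plain code (an illustrative restatement, not hardware):

```python
def split_rounds(n_hist, n_max, m_min):
    """Multicast-data use count per round of operation, following S0_1..S0_5:
    n_hist statistic elements, maximum n_max and minimum m_min uses per round."""
    rounds = []
    while True:
        if n_hist > 2 * n_max:                         # S0_2: peel a full round
            rounds.append(n_max)
            n_hist -= n_max
        elif n_max + m_min < n_hist <= 2 * n_max:      # S0_3: rounds N, n_hist-N
            rounds += [n_max, n_hist - n_max]
            return rounds
        elif n_max < n_hist <= n_max + m_min:          # S0_4: rounds n_hist-M, M
            rounds += [n_hist - m_min, m_min]
            return rounds
        else:                                          # S0_5: one round suffices
            rounds.append(n_hist)
            return rounds
```

Running this on the worked examples above reproduces the stated round counts: (136, 64, 4) gives rounds of 64, 64, 8, and (136, 64, 16) gives 64, 56, 16.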
When the number of histogram statistic elements N_hist is less than the number Ns of schedulable slave processing circuits, the histogram statistic elements may be multiplexed as described in the second and third scenarios above.
For example, assume N_hist=40 and Ns=64; the multicast data is then used 40 times directly. If N_hist=32 and Ns=64, the slave processing circuits are divided into 2 groups, the multicast data is used 32 times per round of operation, and the distribution data is used 2 times. If N_hist=20 and Ns=64, the slave processing circuits are divided into 3 groups of 20, the multicast data is used 20 times per round, and the distribution data is used 3 times.
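A small sketch of this grouping rule, under the assumption (consistent with the examples above) that the group count is the number of slaves divided by the number of statistic elements, rounded down:

```python
def group_usage(n_hist, num_slaves):
    """Use counts when N_hist < Ns: multicast data is used once per statistic
    element each round; distribution data is used once per group of slaves."""
    groups = max(1, num_slaves // n_hist)
    multicast_uses = n_hist        # uses of multicast data per round
    distribution_uses = groups     # uses of distribution data
    return multicast_uses, distribution_uses

examples = [group_usage(n, 64) for n in (40, 32, 20)]
```

The three cases reproduce the figures in the paragraph above: 40 elements give a single group, 32 give 2 groups, and 20 give 3 groups.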
FIG. 6 shows a partial block schematic diagram of a slave processing circuit according to an embodiment of the present disclosure. In embodiments of the present disclosure, each slave processing circuit 600 (e.g., slave processing circuit 520 in fig. 5) may include a comparison circuit 610 and an accumulation circuit 620 based on the requirements of the histogram calculation task.
In each round of operation, the comparison circuit 610 may be configured to perform bit comparison between the target data (i.e., histogram statistic elements) allocated to the current slave processing circuit and the detection data block of the current round, and output a comparison result.
In some implementations, the comparison circuit 610 may include a plurality of comparators 611, each for comparing the histogram statistic element assigned to the current slave processing circuit with one datum to be detected in the detection data block and outputting a comparison result indicating whether the two are identical. For example, a "1" may be output when the histogram statistic element equals the datum in the detection data block, and a "0" otherwise.
In each round of operation, the accumulation circuit 620 may be configured to accumulate the comparison result of the comparison circuit to obtain a partial statistic, and accumulate the partial statistic of the current round and the partial statistic of the previous round. The comparison result may indicate whether the histogram statistic element is present in the detection data block, and thus the number of occurrences may be counted by accumulating the comparison result. When data to be detected is split into a plurality of detection data blocks, part of statistics for each detection data block needs to be accumulated.
In some implementations, the accumulation circuit 620 may include an addition circuit 621 and a register 622. The addition circuit 621 is configured to add up the comparison results of the plurality of comparators to obtain the partial statistic of the current round. Further, the addition circuit 621 may also add the partial statistic of the current round to the partial statistic of the previous rounds to update the partial statistic. The addition circuit 621 may include, for example, a Wallace tree compressor group arranged in a multi-stage tree structure, or a multi-stage adder group arranged in a multi-stage tree structure; the embodiments of the present disclosure are not limited in this respect.
A register 622 may be used in conjunction with the addition circuit 621 to hold the updated partial statistic, i.e., the register holds the most recent accumulated result.
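A behavioral sketch of one slave processing circuit's comparison and accumulation path (an illustration with hypothetical names, not the hardware design): each comparator outputs 1 when its datum equals the assigned statistic element, the addition circuit sums those bits, and the register keeps the running partial statistic across rounds.

```python
class SlaveProcessingCircuit:
    def __init__(self, statistic_element):
        self.element = statistic_element
        self.register = 0                  # holds the latest accumulated result

    def process_block(self, block):
        """One round of operation on one detection data block."""
        compare_bits = [1 if v == self.element else 0 for v in block]  # comparators
        partial = sum(compare_bits)        # addition circuit (e.g., adder tree)
        self.register += partial           # accumulate with previous rounds
        return self.register

slave = SlaveProcessingCircuit(statistic_element=7)
slave.process_block([7, 1, 7, 3])     # round 1: partial statistic 2
total = slave.process_block([7, 7])   # round 2: register now holds 4
```

The list of compare bits models the parallel comparators 611; the running sum models the addition circuit 621 together with register 622.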
In some embodiments, for example, in the foregoing scenario where statistics are performed for one histogram statistic element every Ng slave processing circuits, instead of performing accumulation of statistics values by the master processing circuit, statistics values may also be accumulated by the slave processing circuit and then output to the master processing circuit. For example, every Ng slave processing circuits may accumulate and sum up two by two in a multi-level tree structure manner, and then return the final statistics to the master processing circuit. In such an implementation, ng slave processing circuits may design an appropriate cumulative aggregation manner according to the topology of the interconnect circuit to save computation time. The disclosed embodiments are not limited in the particular manner of accumulation.
The embodiment of the disclosure also provides a method for executing data processing by using the computing device. FIG. 7 illustrates an exemplary flow diagram of a data processing method 700 according to an embodiment of the disclosure.
As shown in fig. 7, in step 710, a histogram calculation instruction is parsed; its operation objects include data to be detected and target data, and it indicates that the number of occurrences of the target data in the data to be detected is to be counted as output data. This step may be performed, for example, by the control circuit 41 of fig. 4.
Next, in step 720, according to the histogram calculation instruction, a corresponding number of slave processing circuits are scheduled to perform the histogram operation task, wherein each of the scheduled slave processing circuits performs the histogram operation on a part of the data to be detected and/or the target data. This step may be performed, for example, by the arithmetic circuitry 42 of fig. 4.
In some embodiments, the arithmetic circuit includes a master processing circuit, a plurality of slave processing circuits, a first storage circuit, and a second storage circuit. Step 720 may thus further include sub-step 721, in which the master processing circuit determines one of the data to be detected and the target data as multicast data and the other as distribution data. The master processing circuit may determine a splitting scheme for the histogram operation task according to the histogram calculation instruction, the processing capability of the slave processing circuits, and the splitting rules. The splitting scheme includes one or more of: the transmission mode of the data to be detected and the target data (e.g., broadcast, multicast, distribution, etc.), the splitting mode of the data to be detected and the target data, the number of multiplexing times R_N of the data to be detected, and the number of multiplexing times R_S of the target data.
Step 720 may further include sub-step 722, the master processing circuit storing the multicast data in the first storage circuit for transmission to the scheduled plurality of slave processing circuits over the broadcast bus during the operation; and a sub-step 723, the master processing circuit stores the split distribution data in the second storage circuit, so that the scheduled slave processing circuit loads the corresponding part of distribution data for operation.
Step 720 may further include sub-step 724, in which the slave processing circuits perform the histogram operation on the transmitted data to be detected and target data and return the results. In particular, the histogram operation may include bit comparison and accumulation operations, as described above.
Step 720 may further include sub-step 725 of accumulating the partial statistics computed by the slave processing circuits according to the multiplexing scheme. This step may be performed by the slave processing circuits or by the master processing circuit. For example, when the number of multiplexing times R_S of the target data is 0, i.e., each slave processing circuit counts a different histogram statistic element, each slave processing circuit only needs to accumulate its partial statistics over the rounds of operation, and the accumulated result over all rounds is the final statistic. For another example, when R_S is greater than 0, i.e., every R_S+1 slave processing circuits count the same histogram statistic element, the all-round partial statistics of every R_S+1 slave processing circuits need to be accumulated again to obtain the final statistic.
Those skilled in the art will appreciate that the steps described in the method flow diagrams correspond to the various circuits of the computing device described above in connection with fig. 4-6, and therefore the features described above apply equally to the method steps and are not repeated here.
The disclosed embodiments also provide a chip that may include the computing device of any of the embodiments described above in connection with the figures. Further, the disclosure also provides a board card, which may include the foregoing chip.
According to different application scenarios, the electronic device or apparatus of the present disclosure may include a server, a cloud server, a server cluster, a data processing apparatus, a robot, a computer, a printer, a scanner, a tablet computer, a smart terminal, a PC device, a terminal of the internet of things, a mobile terminal, a mobile phone, a vehicle recorder, a navigator, a sensor, a camera, a video camera, a projector, a watch, an earphone, a mobile storage, a wearable device, a visual terminal, an autopilot terminal, a vehicle, a household appliance, and/or a medical device. The vehicle comprises an airplane, a ship and/or a vehicle; the household appliances comprise a television, an air conditioner, a microwave oven, a refrigerator, an electric cooker, a humidifier, a washing machine, an electric lamp, a gas stove and a range hood; the medical equipment comprises a nuclear magnetic resonance instrument, a B ultrasonic instrument and/or an electrocardiograph. The electronic device or apparatus of the present disclosure may also be applied to the fields of the internet, the internet of things, data centers, energy, transportation, public management, manufacturing, education, power grid, telecommunications, finance, retail, construction site, medical, and the like. Further, the electronic device or apparatus disclosed herein may also be used in application scenarios related to artificial intelligence, big data, and/or cloud computing, such as a cloud, an edge, and a terminal. In one or more embodiments, a computationally powerful electronic device or apparatus according to the present disclosure may be applied to a cloud device (e.g., a cloud server), while a less power-consuming electronic device or apparatus may be applied to a terminal device and/or an edge-end device (e.g., a smartphone or a camera). 
In one or more embodiments, the hardware information of the cloud device and the hardware information of the terminal device and/or the edge device are compatible with each other, so that appropriate hardware resources can be matched from the hardware resources of the cloud device according to the hardware information of the terminal device and/or the edge device to simulate the hardware resources of the terminal device and/or the edge device, so as to complete unified management, scheduling and cooperative work of end-cloud integration or cloud-edge-end integration.
It is noted that for the sake of brevity, the present disclosure describes some methods and embodiments thereof as a series of acts and combinations thereof, but those skilled in the art will appreciate that the aspects of the present disclosure are not limited by the order of the acts described. Accordingly, one of ordinary skill in the art will appreciate that certain steps may be performed in other sequences or simultaneously, in accordance with the disclosure or teachings of the present disclosure. Further, those skilled in the art will appreciate that the embodiments described in this disclosure are capable of alternative embodiments, in which acts or modules are involved, which are not necessarily required to practice one or more aspects of the disclosure. In addition, the present disclosure may focus on the description of some embodiments, depending on the solution. In view of the above, those skilled in the art will understand that portions of the disclosure that are not described in detail in one embodiment may also be referred to in the description of other embodiments.
In particular implementation, based on the disclosure and teachings of the present disclosure, one skilled in the art will appreciate that the several embodiments disclosed in the present disclosure may be implemented in other ways not disclosed herein. For example, as for the units in the foregoing embodiments of the electronic device or apparatus, the units are split based on the logic function, and there may be another splitting manner in the actual implementation. Also for example, multiple units or components may be combined or integrated with another system or some features or functions in a unit or component may be selectively disabled. The connections discussed above in connection with the figures may be direct or indirect couplings between the units or components in terms of connectivity between the different units or components. In some scenarios, the aforementioned direct or indirect coupling involves a communication connection utilizing an interface, where the communication interface may support electrical, optical, acoustic, magnetic, or other forms of signal transmission.
In the present disclosure, units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units. The aforementioned components or units may be co-located or distributed across multiple network elements. In addition, according to actual needs, part or all of the units can be selected to achieve the purpose of the solution of the embodiment of the present disclosure. In addition, in some scenarios, multiple units in embodiments of the present disclosure may be integrated into one unit or each unit may exist physically separately.
In other implementation scenarios, the integrated unit may also be implemented in hardware, that is, a specific hardware circuit, which may include a digital circuit and/or an analog circuit, etc. The physical implementation of the hardware structure of the circuit may include, but is not limited to, physical devices, which may include, but are not limited to, transistors or memristors, among other devices. In this regard, the various devices described herein (e.g., computing devices or other processing devices) may be implemented by suitable hardware processors, such as central processing units, GPUs, FPGAs, DSPs, ASICs, and the like. Further, the aforementioned storage unit or storage device may be any suitable storage medium (including magnetic storage medium or magneto-optical storage medium, etc.), and may be, for example, a variable Resistive Memory (RRAM), a Dynamic Random Access Memory (DRAM), a Static Random Access Memory (SRAM), an Enhanced Dynamic Random Access Memory (EDRAM), a High Bandwidth Memory (HBM), a Hybrid Memory Cube (HMC), a ROM, a RAM, or the like.
The foregoing may be better understood in light of the following clauses:
clause 1, a computing device comprising control circuitry and arithmetic circuitry, wherein:
the control circuit is used for analyzing a histogram calculation instruction, an operation object of the histogram calculation instruction comprises data to be detected and target data, and the histogram calculation instruction indicates that the occurrence frequency of the target data in the data to be detected is counted to serve as output data; and
the operation circuit comprises a plurality of slave processing circuits, and the operation circuit is used for scheduling a corresponding number of slave processing circuits to execute histogram operation tasks according to the histogram calculation instruction, wherein each scheduled slave processing circuit executes the histogram operation aiming at a part of the data to be detected and/or target data.
Clause 2, the computing device of clause 1, wherein the arithmetic circuitry further comprises main processing circuitry for:
determining a splitting scheme for the histogram operation task according to the histogram calculation instruction, the processing capacity of the slave processing circuits, and a splitting rule;
wherein the splitting scheme comprises one or more of: the transmission mode of the data to be detected and the target data, the splitting mode of the data to be detected and the target data, the number of multiplexing times R_N of the data to be detected, and the number of multiplexing times R_S of the target data.
Clause 3, the computing device of clause 2, wherein the arithmetic circuitry further comprises first storage circuitry and second storage circuitry, the main processing circuitry further to:
determining one of the data to be detected and the target data as multicast data, and determining the other one of the data to be detected and the target data as distribution data;
storing the multicast data in the first storage circuit for transmission to the scheduled plurality of slave processing circuits over a broadcast bus during operation; and
and storing the split distribution data in the second storage circuit so that the scheduled slave processing circuit can load the distribution data of the corresponding part for operation.
Clause 4, the computing apparatus of clause 3, wherein the split rule comprises at least one of:
splitting the data to be detected in the lowest dimension according to an alignment requirement;
preferentially satisfying that the number of multiplexing times R_N of the multicast data reaches its maximum value, and optionally satisfying the number of multiplexing times R_S of the distribution data;
preferentially performing round-multiplexing splitting on the multicast data according to the dimension of the output data, and performing round-multiplexing splitting on the merged distribution data when the dimension of the output data is smaller than a predetermined threshold; and
the number of multiplexing times of the multicast data in each round of operation being greater than the minimum number of multiplexing times M of the multicast data and not greater than the maximum number of multiplexing times N of the multicast data.
Clause 5, the computing device of clause 3, wherein the main processing circuit is further to:
determining the data to be detected as multicast data, and determining the target data as distribution data;
splitting the data to be detected into L_N detection data blocks in order from the lowest to the highest storage dimension, so as to implement the histogram operation through multiple rounds of operation, wherein in each round of operation, according to the number of multiplexing times R_S of the target data, R_S+1 detection data blocks are transmitted over a broadcast bus to corresponding R_S+1 groups among the scheduled slave processing circuits; and
splitting the target data into a plurality of histogram statistical elements for distribution to different slave processing circuits for counting the corresponding histogram statistical elements.
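The splitting scheme of clause 5 can be sketched as below. The concrete `block_size`, slave-group layout, and round-robin slicing policy are illustrative assumptions, since the patent does not fix them:

```python
def split_work(data_to_detect, target_data, block_size, num_slaves, r_s):
    """Sketch of the clause-5 splitting scheme (illustrative layout).

    The data to be detected (multicast data) is cut into L_N detection
    data blocks of `block_size`; with target-data multiplexing factor
    r_s, the scheduled slaves form r_s + 1 groups, and the target data
    (distribution data) is cut into per-slave histogram statistical
    elements within each group.
    """
    # L_N detection data blocks, lowest storage dimension first.
    blocks = [data_to_detect[i:i + block_size]
              for i in range(0, len(data_to_detect), block_size)]
    # Each of the r_s + 1 groups holds num_slaves // (r_s + 1) slaves.
    slaves_per_group = num_slaves // (r_s + 1)
    # Distribute the histogram statistical elements round-robin over
    # the slaves of one group (every group receives the same elements).
    elements = [target_data[i::slaves_per_group]
                for i in range(slaves_per_group)]
    return blocks, elements

# 8 detection values, 8 slaves, r_s = 1 → 2 groups of 4 slaves each;
# each round broadcasts 2 detection blocks, one per group.
blocks, elements = split_work(list(range(8)), [10, 20, 30, 40],
                              block_size=4, num_slaves=8, r_s=1)
```

Each slave then only needs to count its own histogram statistical elements against the block broadcast to its group.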
Clause 6, the computing device of clause 5, wherein each of the slave processing circuits comprises a comparison circuit and an accumulation circuit, and in each round of operation:
the comparison circuit is configured to perform an aligned, element-wise comparison between the target data distributed to the current slave processing circuit and the detection data block of the current round, and to output a comparison result; and
the accumulation circuit is configured to accumulate the comparison results of the comparison circuit to obtain a partial statistical value, and to accumulate that partial statistical value with the partial statistical value of the previous round.
Clause 7, the computing device of clause 6, wherein the comparison circuit comprises a plurality of comparators, each configured to compare the histogram statistical element assigned to the slave processing circuit with one piece of data to be detected in the detection data block, and to output a comparison result indicating whether the two are identical.
Clause 8, the computing device of clause 7, wherein the accumulation circuit comprises an addition circuit and a register, wherein:
the addition circuit is configured to add the comparison results of the plurality of comparators to obtain the partial statistical value of the current round, and to add the partial statistical value of the current round to that of the previous round so as to update the partial statistical value; and
the register is configured to store the updated partial statistical value.
Clause 9, the computing device of any of clauses 5-8, wherein the arithmetic circuitry is further configured to:
according to the number of multiplexing times R_S of the target data, accumulating the partial statistical values of the R_S+1 slave processing circuits directed to the same histogram statistical element, so as to obtain a final statistical value corresponding to that histogram statistical element.
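Clauses 6 through 9 together describe a compare-and-accumulate pipeline: comparators produce 0/1 match results, an addition circuit sums them into a per-round partial statistical value held in a register, and the R_S+1 partial values for the same histogram statistical element are finally summed. A minimal Python sketch of that data flow (names are illustrative, not from the patent):

```python
def slave_round(element, detect_block, prev_partial):
    """One slave circuit in one round (clauses 6-8, sketch): the
    comparators test the assigned histogram statistical element
    against each datum of this round's detection data block, the
    addition circuit sums the 0/1 comparison results, and the sum
    is accumulated with the previous round's partial value, which
    models the register update.
    """
    matches = sum(1 for x in detect_block if x == element)
    return prev_partial + matches

def final_statistic(partial_values):
    """Clause 9: accumulate the R_S + 1 partial statistical values that
    different slave groups produced for the same element."""
    return sum(partial_values)

# Two rounds on one slave, counting the element 3:
partial = slave_round(3, [3, 1, 3], 0)   # round 1 → 2 matches
partial = slave_round(3, [2, 3], partial)  # round 2 → 3 total
```

The final reduction across slave groups is then just the sum of the per-group partials for that element.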
Clause 10, a chip comprising the computing device of any of clauses 1-9.
Clause 11, a board comprising the chip of clause 10.
Clause 12, a data processing method implemented by a computing device comprising a control circuit and an operational circuit, the operational circuit comprising a plurality of slave processing circuits, the method comprising:
the control circuit parsing a histogram calculation instruction, wherein the operands of the histogram calculation instruction comprise data to be detected and target data, and the histogram calculation instruction instructs counting the number of occurrences of the target data in the data to be detected as output data; and
the operation circuit scheduling a corresponding number of the slave processing circuits to execute histogram operation tasks according to the histogram calculation instruction, wherein each scheduled slave processing circuit performs the histogram operation on a part of the data to be detected and/or the target data.
Clause 13, the method of clause 12, wherein the arithmetic circuitry further comprises main processing circuitry, and the method further comprises:
the main processing circuit determining a splitting scheme for the histogram operation task according to the histogram calculation instruction, the processing capacity of the slave processing circuits, and a splitting rule;
wherein the splitting scheme comprises one or more of: the transmission mode of the data to be detected and the target data, the splitting mode of the data to be detected and the target data, the number of multiplexing times R_N of the data to be detected, and the number of multiplexing times R_S of the target data.
Clause 14, the method of clause 13, wherein the arithmetic circuitry further comprises first storage circuitry and second storage circuitry, and the method further comprises:
the main processing circuit determines one of the data to be detected and the target data as multicast data, and determines the other of the data to be detected and the target data as distribution data;
storing the multicast data in the first storage circuit for transmission to the scheduled plurality of slave processing circuits over a broadcast bus during operation; and
and storing the split distribution data in the second storage circuit so that the scheduled slave processing circuit can load the distribution data of the corresponding part for operation.
Clause 15, the method of clause 14, wherein the splitting rule comprises at least one of:
splitting the data to be detected in the lowest dimension according to an alignment requirement;
preferentially satisfying that the number of multiplexing times R_N of the multicast data reaches its maximum value, and optionally satisfying the number of multiplexing times R_S of the distribution data;
preferentially performing round-multiplexing splitting on the multicast data according to the dimension of the output data, and performing round-multiplexing splitting on the merged distribution data when the dimension of the output data is smaller than a predetermined threshold; and
the number of multiplexing times of the multicast data in each round of operation being greater than the minimum number of multiplexing times M of the multicast data and not greater than the maximum number of multiplexing times N of the multicast data.
Clause 16, the method of clause 14, wherein the method further comprises:
the main processing circuit determining the data to be detected as multicast data and the target data as distribution data;
splitting the data to be detected into L_N detection data blocks in order from the lowest to the highest storage dimension, so as to implement the histogram operation through multiple rounds of operation, wherein in each round of operation, according to the number of multiplexing times R_S of the target data, R_S+1 detection data blocks are transmitted over a broadcast bus to corresponding R_S+1 groups among the scheduled slave processing circuits; and
the target data is split into a plurality of histogram statistical elements to be distributed to different slave processing circuits to carry out statistics on the corresponding histogram statistical elements.
Clause 17, the method of clause 16, wherein each of the slave processing circuits comprises a comparison circuit and an accumulation circuit, and the method comprises, in each round of operation:
the comparison circuit performing an aligned, element-wise comparison between the target data distributed to the current slave processing circuit and the detection data block of the current round, and outputting a comparison result; and
the accumulation circuit accumulating the comparison results of the comparison circuit to obtain a partial statistical value, and accumulating that partial statistical value with the partial statistical value of the previous round so as to update the partial statistical value.
Clause 18, the method of any of clauses 16-17, wherein the method further comprises:
according to the number of multiplexing times R_S of the target data, accumulating the partial statistical values of the R_S+1 slave processing circuits directed to the same histogram statistical element, so as to obtain a final statistical value corresponding to that histogram statistical element.
The foregoing detailed description of the embodiments of the present disclosure is provided for purposes of illustration and description; it is exemplary only and is not intended to be exhaustive or to limit the disclosure to the precise forms described. Those skilled in the art may, in light of the ideas of the present disclosure, vary the specific embodiments and their scope of application. In summary, nothing in this description should be construed as limiting the present disclosure.

Claims (18)

1. A computing device comprising control circuitry and arithmetic circuitry, wherein:
the control circuit is configured to parse a histogram calculation instruction, wherein the operands of the histogram calculation instruction comprise data to be detected and target data, and the histogram calculation instruction instructs counting the number of occurrences of the target data in the data to be detected as output data; and
the operation circuit comprises a plurality of slave processing circuits and is configured to schedule a corresponding number of the slave processing circuits to execute histogram operation tasks according to the histogram calculation instruction, wherein each scheduled slave processing circuit performs the histogram operation on a part of the data to be detected and/or the target data.
2. The computing device of claim 1, wherein the operational circuitry further comprises main processing circuitry to:
determining a splitting scheme for the histogram operation task according to the histogram calculation instruction, the processing capacity of the slave processing circuits, and a splitting rule;
wherein the splitting scheme comprises one or more of: the transmission mode of the data to be detected and the target data, the splitting mode of the data to be detected and the target data, the number of multiplexing times R_N of the data to be detected, and the number of multiplexing times R_S of the target data.
3. The computing device of claim 2, wherein the operational circuitry further comprises first storage circuitry and second storage circuitry, the main processing circuitry further to:
determining one of the data to be detected and the target data as multicast data, and determining the other one of the data to be detected and the target data as distribution data;
storing the multicast data in the first storage circuit for transmission to the scheduled plurality of slave processing circuits over a broadcast bus during operation; and
and storing the split distribution data in the second storage circuit so that the scheduled slave processing circuit can load the distribution data of the corresponding part for operation.
4. The computing device of claim 3, wherein the split rule comprises at least one of:
splitting the data to be detected in the lowest dimension according to an alignment requirement;
preferentially satisfying that the number of multiplexing times R_N of the multicast data reaches its maximum value, and optionally satisfying the number of multiplexing times R_S of the distribution data;
preferentially performing round-multiplexing splitting on the multicast data according to the dimension of the output data, and performing round-multiplexing splitting on the merged distribution data when the dimension of the output data is smaller than a predetermined threshold; and
the number of multiplexing times of the multicast data in each round of operation being greater than the minimum number of multiplexing times M of the multicast data and not greater than the maximum number of multiplexing times N of the multicast data.
5. The computing device of claim 3, wherein the primary processing circuit is further to:
determining the data to be detected as multicast data, and determining the target data as distribution data;
splitting the data to be detected into L_N detection data blocks in order from the lowest to the highest storage dimension, so as to implement the histogram operation through multiple rounds of operation, wherein in each round of operation, according to the number of multiplexing times R_S of the target data, R_S+1 detection data blocks are transmitted over a broadcast bus to corresponding R_S+1 groups among the scheduled slave processing circuits; and
the target data is split into a plurality of histogram statistical elements to be distributed to different slave processing circuits to carry out statistics on the corresponding histogram statistical elements.
6. The computing device of claim 5, wherein each of the slave processing circuits comprises a compare circuit and an accumulate circuit, and in each round of operation:
the comparison circuit is configured to perform an aligned, element-wise comparison between the target data distributed to the current slave processing circuit and the detection data block of the current round, and to output a comparison result; and
the accumulation circuit is configured to accumulate the comparison results of the comparison circuit to obtain a partial statistical value, and to accumulate that partial statistical value with the partial statistical value of the previous round.
7. The computing device of claim 6, wherein the comparison circuit comprises a plurality of comparators, each configured to compare the histogram statistical element assigned to the slave processing circuit with one piece of data to be detected in the detection data block, and to output a comparison result indicating whether the two are identical.
8. The computing device of claim 7, wherein the accumulation circuit comprises an addition circuit and a register, wherein:
the addition circuit is configured to add the comparison results of the plurality of comparators to obtain the partial statistical value of the current round, and to add the partial statistical value of the current round to that of the previous round so as to update the partial statistical value; and
the register is configured to store the updated partial statistical value.
9. The computing device of any of claims 5-8, wherein the operational circuitry is further to:
according to the number of multiplexing times R_S of the target data, accumulating the partial statistical values of the R_S+1 slave processing circuits directed to the same histogram statistical element, so as to obtain a final statistical value corresponding to that histogram statistical element.
10. A chip comprising a computing device according to any one of claims 1-9.
11. A board card comprising the chip of claim 10.
12. A data processing method implemented by a computing device comprising control circuitry and operational circuitry comprising a plurality of slave processing circuits, the method comprising:
the control circuit parsing a histogram calculation instruction, wherein the operands of the histogram calculation instruction comprise data to be detected and target data, and the histogram calculation instruction instructs counting the number of occurrences of the target data in the data to be detected as output data; and
the operation circuit scheduling a corresponding number of the slave processing circuits to execute histogram operation tasks according to the histogram calculation instruction, wherein each scheduled slave processing circuit performs the histogram operation on a part of the data to be detected and/or the target data.
13. The method of claim 12, wherein the operational circuitry further comprises main processing circuitry, and the method further comprises:
the main processing circuit determining a splitting scheme for the histogram operation task according to the histogram calculation instruction, the processing capacity of the slave processing circuits, and a splitting rule;
wherein the splitting scheme comprises one or more of: the transmission mode of the data to be detected and the target data, the splitting mode of the data to be detected and the target data, the number of multiplexing times R_N of the data to be detected, and the number of multiplexing times R_S of the target data.
14. The method of claim 13, wherein the operational circuit further comprises a first storage circuit and a second storage circuit, and the method further comprises:
the main processing circuit determines one of the data to be detected and the target data as multicast data, and determines the other of the data to be detected and the target data as distribution data;
storing the multicast data in the first storage circuit for transmission to the scheduled plurality of slave processing circuits over a broadcast bus during operation; and
and storing the split distribution data in the second storage circuit so that the scheduled slave processing circuit can load the distribution data of the corresponding part for operation.
15. The method of claim 14, wherein the split rule comprises at least one of:
splitting the data to be detected in the lowest dimension according to an alignment requirement;
preferentially satisfying that the number of multiplexing times R_N of the multicast data reaches its maximum value, and optionally satisfying the number of multiplexing times R_S of the distribution data;
preferentially performing round-multiplexing splitting on the multicast data according to the dimension of the output data, and performing round-multiplexing splitting on the merged distribution data when the dimension of the output data is smaller than a predetermined threshold; and
the number of multiplexing times of the multicast data in each round of operation being greater than the minimum number of multiplexing times M of the multicast data and not greater than the maximum number of multiplexing times N of the multicast data.
16. The method of claim 14, wherein the method further comprises:
the main processing circuit determining the data to be detected as multicast data and the target data as distribution data;
splitting the data to be detected into L_N detection data blocks in order from the lowest to the highest storage dimension, so as to implement the histogram operation through multiple rounds of operation, wherein in each round of operation, according to the number of multiplexing times R_S of the target data, R_S+1 detection data blocks are transmitted over a broadcast bus to corresponding R_S+1 groups among the scheduled slave processing circuits; and
the target data is split into a plurality of histogram statistical elements to be distributed to different slave processing circuits to carry out statistics on the corresponding histogram statistical elements.
17. The method of claim 16, wherein each of the slave processing circuits comprises a comparison circuit and an accumulation circuit, and the method comprises, in each round of operation:
the comparison circuit performing an aligned, element-wise comparison between the target data distributed to the current slave processing circuit and the detection data block of the current round, and outputting a comparison result; and
the accumulation circuit accumulating the comparison results of the comparison circuit to obtain a partial statistical value, and accumulating that partial statistical value with the partial statistical value of the previous round so as to update the partial statistical value.
18. The method according to any of claims 16-17, wherein the method further comprises:
according to the number of multiplexing times R_S of the target data, accumulating the partial statistical values of the R_S+1 slave processing circuits directed to the same histogram statistical element, so as to obtain a final statistical value corresponding to that histogram statistical element.
CN202110449502.2A 2021-04-25 2021-04-25 Computing device, data processing method and related product Pending CN115237373A (en)

Publications (1)

Publication Number Publication Date
CN115237373A 2022-10-25



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination