CN112099959B - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN112099959B
CN112099959B CN202011310504.5A
Authority
CN
China
Prior art keywords
target
chromosomes
chromosome
processing
strategy
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011310504.5A
Other languages
Chinese (zh)
Other versions
CN112099959A (en)
Inventor
金跃
张尧
赵瑞
陈勇
刘永超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202011310504.5A priority Critical patent/CN112099959B/en
Publication of CN112099959A publication Critical patent/CN112099959A/en
Application granted granted Critical
Publication of CN112099959B publication Critical patent/CN112099959B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F 9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/5017 Task decomposition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 2209/00 Indexing scheme relating to G06F9/00
    • G06F 2209/50 Indexing scheme relating to G06F9/50
    • G06F 2209/5018 Thread allocation

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of this specification provide a data processing method and device. In the data processing method, a target array to be processed is obtained, where the target array includes a first number of elements. A multi-dimensional policy space, whose size is determined based on the first number, is constructed. The dimensions include at least a first dimension corresponding to the number of parallel computing units used to process the target array. In the policy space, the target point requiring the shortest time to process the target array is searched for. The value of the first dimension of the target point is taken as a target number, and the target array is split according to the target number. The split target array is then processed in parallel by invoking the target number of parallel computing units.

Description

Data processing method and device
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a data processing method and apparatus.
Background
Machine learning models are widely used at present, and demands on their processing speed have risen accordingly. In conventional technology, the processing speed of a machine learning model can be optimized only manually, which is extremely time consuming and makes optimization inefficient.
Disclosure of Invention
One or more embodiments of the present disclosure describe a data processing method and apparatus that can automatically optimize the data processing procedure, thereby greatly increasing data processing speed.
In a first aspect, a data processing method is provided, including:
acquiring a target array to be processed, wherein the target array comprises a first number of elements;
constructing a multi-dimensional policy space whose size is determined based on the first number, the plurality of dimensions including at least a first dimension corresponding to a number of parallel computing units used to process the target array;
searching the policy space for a target point with the shortest time for processing the target array;
taking the value of the first dimension of the target point as a target number, and splitting the target array according to the target number;
and processing the split target array in parallel by invoking the target number of parallel computing units.
In a second aspect, there is provided a data processing apparatus comprising:
an acquisition unit, configured to acquire a target array to be processed, wherein the target array comprises a first number of elements;
a construction unit, configured to construct a multi-dimensional policy space whose size is determined based on the first number, the plurality of dimensions including at least a first dimension corresponding to a number of parallel computing units used to process the target array;
a search unit, configured to search the policy space for a target point with the shortest time for processing the target array;
a splitting unit, configured to take the value of the first dimension of the target point as a target number and split the target array according to the target number;
and a processing unit, configured to process the split target array in parallel by invoking the target number of parallel computing units.
In a third aspect, there is provided a computer storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method of the first aspect.
In a fourth aspect, there is provided a computing device comprising a memory having stored therein executable code and a processor that, when executing the executable code, implements the method of the first aspect.
In the data processing method and apparatus provided in one or more embodiments of this specification, a multi-dimensional policy space whose size is determined based on a first number is first constructed for a target array to be processed. The policy space is searched for the target point with the shortest time for processing the target array. The value of the first dimension of the target point is taken as a target number, the target array is split according to the target number, and the split target array is processed in parallel by invoking the target number of parallel computing units. In other words, when an array is processed in this solution, the target point requiring the shortest processing time is automatically searched for in the policy space, and that target point represents a policy for processing the target array. Put differently: the optimal policy is automatically searched for in the policy space, and the target array is then processed according to that optimal policy, which can greatly increase the data processing speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present disclosure, and those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram illustrating a conventional array operation process provided herein;
FIG. 2 is a schematic diagram of the optimized array operation process provided in the present specification;
FIG. 3 is a schematic diagram of a data processing method provided herein;
FIG. 4 is a flow diagram of a data processing method provided by one embodiment of the present description;
FIG. 5 is a schematic diagram of array slicing provided herein;
FIG. 6 is a schematic diagram of a data processing apparatus according to an embodiment of the present disclosure.
Detailed Description
The scheme provided by the specification is described below with reference to the accompanying drawings.
Before describing the solution provided in the present specification, the inventive concept of the present solution will be explained below.
Current machine learning models typically involve a variety of complex operations, such as reduction operations and matrix multiplication. A reduction operation reduces the elements of an array to a single result, for example by summation, taking the maximum, or taking the minimum. It will be appreciated that the time required to perform such operations typically directly affects the performance of the machine learning model. For this reason, the present application starts by reducing the time required for reduction operations in order to improve the performance of the machine learning model.
First, conventional reduction operations are time consuming because they are typically performed serially, element by element. For example, consider the array a[6] = {1,2,3,4,5,6}. To find its maximum, the current approach is: first take the maximum of 1 and 2, obtaining 2; then take the maximum of 2 and 3, obtaining 3; and so on, until the maximum with 6 is taken, giving the final result: 6.
When each element in the array is represented as a rounded square and each max operation as a rectangular block, the conventional array operation process is as shown in FIG. 1. As can be seen from FIG. 1, finding the maximum of an array of 6 elements requires 5 max operations.
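As an illustrative sketch (not code from the patent; the function name and step counter are assumptions for illustration), the serial reduction described above can be expressed as follows:

```python
# Hypothetical sketch of the serial reduction described above: each step
# combines the running result with the next element, so an array of 6
# elements needs 5 pairwise max operations.
def serial_reduce(arr, op):
    result = arr[0]
    steps = 0
    for x in arr[1:]:
        result = op(result, x)  # one pairwise reduction operation
        steps += 1
    return result, steps

value, steps = serial_reduce([1, 2, 3, 4, 5, 6], max)
print(value, steps)  # 6 5
```

The step count grows linearly with the array length, which is exactly the cost the optimized scheme below reduces.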
In fact, the order in which the elements of the array are combined does not affect the final result of a reduction operation. Based on this, some optimized reduction schemes perform the reduction on parts of the array in parallel and then reduce the resulting intermediate results. For example, the array in the example above may be split into {1,2}, {3,4}, and {5,6}; taking the maximum of each sub-array in parallel yields three intermediate results: 2, 4, and 6. Then the maximum of 2 and 4 is taken, giving the intermediate result 4. Finally, the maximum of 4 and 6 is taken, giving the final result: 6.
Similarly, when each element is represented as a rounded square and each max operation as a rectangular block, the optimized reduction process is as shown in FIG. 2. As can be seen from FIG. 2, finding the maximum of an array of 6 elements now requires 3 max operations. The count is 3 because the max operations represented by rectangular blocks in the same layer are performed in parallel and can be regarded as a single operation; since FIG. 2 contains 3 layers of rectangular blocks in total, the optimized process requires 3 max operations in total.
Of course, FIG. 2 shows only one way to split the array; in practice the array may also be split as {1,2,3} and {4,5,6}, among other possibilities.
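The layered reduction described above can be sketched as follows (an illustrative sketch, not the patent's implementation; the chunk-by-chunk loop stands in for reductions that would actually run in parallel):

```python
import functools

# Hypothetical sketch of the optimized (tree-style) reduction: the array is
# split into sub-arrays of size `chunk`, each sub-array is reduced (these
# could run in parallel), and the intermediate results are reduced again,
# layer by layer, until one value remains. All reductions in one layer are
# counted as a single operation, matching the layer count in the text.
def layered_reduce(arr, op, chunk):
    layers = 0
    while len(arr) > 1:
        arr = [functools.reduce(op, arr[i:i + chunk])
               for i in range(0, len(arr), chunk)]
        layers += 1
    return arr[0], layers

value, layers = layered_reduce([1, 2, 3, 4, 5, 6], max, 2)
print(value, layers)  # 6 3
```

With chunk 2 the six-element array needs 3 layers, as in FIG. 2; with the alternative split {1,2,3}, {4,5,6} (chunk 3) only 2 layers are needed.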
It should be noted that the conventional optimization of reduction operations is usually done manually: the number of parallel computing units is set by hand according to empirical values, and the array is then split according to that number. Such manual optimization adapts poorly; the number of parallel computing units cannot be flexibly adjusted to the number of elements in the array currently being processed, so the data processing speed cannot be effectively increased.
Therefore, this application provides a data processing method in which, for the array currently to be processed, a target point with the shortest time for processing the target array is automatically searched for in a policy space; the target point represents a policy for processing the target array. In other words, the optimal policy is automatically searched for in the policy space, and the target array is then processed according to that optimal policy, which can greatly increase the data processing speed.
The foregoing is the inventive concept of the present solution; based on this inventive concept, the solution is described in detail below.
Fig. 3 is a schematic diagram of a data processing method provided in this specification. In FIG. 3, a target array to be processed is obtained, which includes a first number of elements. A multi-dimensional policy space whose size is determined based on the first number is constructed; the dimensions include at least a first dimension corresponding to the number of parallel computing units used to process the target array. In the policy space, the target point with the shortest time for processing the target array is searched for. The value of the first dimension of the target point is taken as the target number, and the target array is split according to the target number into a target number of sub-arrays. The target number of parallel computing units are invoked to process the sub-arrays in parallel, yielding a target number of intermediate results. Finally, the intermediate results are processed to obtain the final result.
Fig. 4 is a flowchart of a data processing method according to an embodiment of the present disclosure. The method may be executed by any entity with processing capability, such as a server, a system, or a device. As shown in fig. 4, the method may specifically include:
step 402, a target array to be processed is obtained, wherein the target array comprises a first number of elements.
In one example, the target array may be expressed as items[K]. In this example, the target array includes K elements, i.e., the first number is K. Specifically, the K elements may be represented as items[0], items[1], …, items[K-1], where K is a positive integer.
Step 404, construct a multi-dimensional policy space of a size determined based on the first number.
The plurality of dimensions here includes at least a first dimension corresponding to the number of parallel computing units used to process the target array. The parallel computing units here may be threads, for example.
For example, taking threads as the parallel computing units, the number of parallel computing units for processing the target array may be determined as the product of the number of allocated GPU blocks (BlockX) and the number of threads allocated per GPU block (ThreadX). In this example, there may be two first dimensions: one corresponding to the number of GPU blocks and the other to the number of threads per GPU block.
In another example, the plurality of dimensions may further include a second dimension corresponding to the number of elements processed at a time (UnrollX). This value indicates the unit in which a single parallel computing unit processes array elements, i.e., the number of elements processed (unrolled) in one for-loop iteration. The dimensions may additionally include a third dimension corresponding to whether the GPU's shared memory (SharedMemory) is used.
It should be noted that the values of the dimensions of each policy point in the policy space constitute a policy for processing the target array. For example, suppose the dimensions include two first dimensions corresponding to BlockX and ThreadX and a second dimension corresponding to UnrollX, with values (8, 8, 4). The corresponding policy is then: the number of parallel computing units is 8 × 8 = 64, and the number of elements processed at a time is 4. Concretely, the target array is first split according to 64, the split target array is processed in parallel by invoking 64 parallel computing units, and each parallel computing unit processes its part of the split array 4 elements at a time.
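A hypothetical helper (the function name and return format are assumptions for illustration) makes the mapping from a policy point to a concrete strategy explicit:

```python
# Hypothetical sketch: a policy point such as (8, 8, 4) from the example
# above maps to BlockX * ThreadX parallel computing units, each unrolling
# UnrollX elements per loop iteration.
def interpret_policy(point):
    block_x, thread_x, unroll_x = point
    return {"parallel_units": block_x * thread_x, "unroll": unroll_x}

print(interpret_policy((8, 8, 4)))  # {'parallel_units': 64, 'unroll': 4}
```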
Furthermore, in the above example, none of BlockX, ThreadX, and UnrollX exceeds the number of elements of the target array, i.e., the first number K, so the size of the constructed policy space is on the order of K × K × K. In addition, SharedMemory takes only the value 0 or 1, so when the policy space also includes the third dimension, its size doubles. That is, the size of the constructed policy space is determined by, and positively correlated with, the first number.
Step 406, search the policy space for the target point with the shortest time for processing the target array.
Here, the target point may be found with a conventional traversal algorithm or with an optimal-solution search algorithm, which may include, but is not limited to, any of the following: genetic algorithm, ant colony algorithm, simulated annealing, hill climbing, particle swarm optimization, and the like.
Taking the genetic algorithm as an example, searching for the target point in the policy space may specifically include:
Step a: select N policy points in the policy space for which the product of the values of all dimensions does not exceed the first number.
In one example, Q (e.g., 1000) policy points whose product of dimension values does not exceed the first number may first be selected in the policy space, each serving as a candidate chromosome. Then N (e.g., 64) policy points are randomly selected from the Q points, where Q and N are positive integers and Q ≥ N.
When the policy space has two first dimensions and one second dimension, each of the Q policy points has values in three dimensions, so the corresponding candidate chromosome contains three genes. In one example, one of the Q candidate chromosomes may be represented as (5, 5, 2): the first "5" is the first gene, representing BlockX; the second "5" is the second gene, representing ThreadX; and the "2" is the third gene, representing UnrollX.
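Step a can be sketched as follows (an illustrative sketch under assumed names; the patent does not specify how candidate points are enumerated, so brute-force enumeration is used here for clarity):

```python
import itertools
import random

# Hypothetical sketch of step a: enumerate policy points
# (BlockX, ThreadX, UnrollX) whose product of dimension values does not
# exceed the first number K, keep at most Q as candidate chromosomes, then
# randomly pick N initial-generation chromosomes from them.
def select_candidates(K, Q, N, seed=0):
    candidates = [p for p in itertools.product(range(1, K + 1), repeat=3)
                  if p[0] * p[1] * p[2] <= K][:Q]
    return candidates, random.Random(seed).sample(candidates, N)

candidates, initial = select_candidates(K=6, Q=1000, N=4)
print(len(initial))  # 4
```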
Step b: using the N policy points as N initial-generation chromosomes, perform multiple chromosome iterations. Each iteration includes selectively combining the N current-generation chromosomes, based on their respective fitness, to generate N next-generation chromosomes.
In this specification, the fitness of a chromosome is used to evaluate how good the chromosome is.
It should be noted that in the first iteration, the N current-generation chromosomes are the N initial-generation chromosomes, and their fitness values are obtained by initialization. In subsequent iterations, the N current-generation chromosomes are the N next-generation chromosomes produced by the previous iteration, and each chromosome's fitness is inversely related to its processing time from that iteration. In one example, the reciprocal of each chromosome's processing time is used as its fitness. Taking an arbitrary first chromosome among the N next-generation chromosomes as an example, its processing time is the time required to process the target array according to the policy represented by the policy point onto which that chromosome maps in the policy space.
Selectively combining the N current-generation chromosomes based on their respective fitness to generate N next-generation chromosomes may specifically include:
and (4) sequencing the N current generation chromosomes from high to low according to the fitness, and taking the M current generation chromosomes which are ranked at the top as M next generation chromosomes. N-M chromosome combinations are performed on the remaining N-M current generation chromosomes to generate N-M next generation chromosomes. Each chromosome combination comprises that two current generation chromosomes are selected to be combined and corrected by adopting a random algorithm based on the respective fitness of N-M current generation chromosomes to obtain an initial next generation chromosome. The similarity between the initial next generation chromosome and the Q candidate chromosomes is calculated, and the candidate chromosome corresponding to the maximum similarity is taken as a final next generation chromosome. M is a positive integer and M is less than or equal to N.
It will be appreciated that after the N − M next-generation chromosomes are generated, adding the M previously selected next-generation chromosomes yields the full set of N next-generation chromosomes. As in the previous example, each of the N generated chromosomes may contain three genes.
The random algorithm may include, but is not limited to, a roulette-wheel algorithm, a voting method, or the like. For the roulette-wheel algorithm, before performing the N − M combinations, a random-number sequence may be generated consisting of (N − M) × 2 random numbers, each taking a value between 0 and 1. In addition, for any first chromosome among the N − M current-generation chromosomes, the ratio of its fitness to the sum of the fitness values of all the chromosomes is computed as its selection probability; the selection probability of each other chromosome is computed in the same way. Then, for the first chromosome, its selection probability is accumulated with the selection probabilities of the chromosomes ranked before it to obtain its cumulative probability; the cumulative probability of each other chromosome is obtained similarly.
Then, in the i-th chromosome combination (1 ≤ i ≤ N − M), the two random numbers corresponding to the i-th combination are taken from the random-number sequence. Each random number is compared in turn with the cumulative probabilities, and the chromosome corresponding to the first cumulative probability greater than the random number is selected as one of the two current-generation chromosomes; the other chromosome is selected in the same way.
It should be understood that, because the cumulative probability of each chromosome in the roulette-wheel algorithm is computed from its fitness, chromosomes with greater fitness occupy larger slices of the wheel and are therefore more likely to be selected to generate next-generation chromosomes.
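The roulette-wheel selection just described can be sketched as follows (an illustrative sketch; the function name and fallback return are assumptions):

```python
# Hypothetical sketch of roulette-wheel selection: each chromosome's
# selection probability is its fitness over the total fitness, cumulative
# probabilities are formed in order, and the chromosome at the first
# cumulative probability exceeding the random number r is selected.
def roulette_pick(fitness, r):
    total = sum(fitness)
    cumulative = 0.0
    for idx, f in enumerate(fitness):
        cumulative += f / total
        if r < cumulative:
            return idx
    return len(fitness) - 1  # guard against floating-point round-off

# With fitness [1, 2, 3, 4] the cumulative probabilities are 0.1, 0.3, 0.6, 1.0.
print(roulette_pick([1, 2, 3, 4], 0.95))  # 3
```

Note how the fittest chromosome (fitness 4) covers the largest interval of the wheel, 0.6 to 1.0, and is therefore selected most often.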
Returning to step b: each chromosome iteration may further include obtaining the processing time of each of the N next-generation chromosomes, using the reciprocal of each processing time as that chromosome's updated fitness, and using the N next-generation chromosomes as the updated N current-generation chromosomes.
Taking an arbitrary first chromosome among the N next-generation chromosomes as an example: before its processing time is obtained, the first policy point onto which it maps in the policy space may be determined, and the value of the first dimension of that policy point is taken as the number of parallel computing units used to process the target array, referred to as the target number. It should be understood that if the first policy point has two first dimensions (one corresponding to the number of GPU blocks, the other to the number of threads per GPU block), the product of their values is taken as the target number. The target array may then be split according to the target number; the specific splitting process is described later.
After the target array is split, the processing time of the first chromosome can be measured: the split target array is processed in parallel by invoking the target number of parallel computing units, and the processing time is recorded.
In one example, the processing time may be calculated by the formula: number of processing operations × time per operation, where the time per operation may be preset. The processing operations may be max operations, min operations, summations, or the like. Taking FIG. 2 as an example, the number of processing operations is 3.
Further, as described above, chromosomes with higher fitness are more likely to be selected to generate next-generation chromosomes, and using the reciprocal of each chromosome's processing time as its updated fitness means that chromosomes with shorter processing times are more likely to be selected, which matches the search criterion for the target point.
Step c: take as the target point the policy point onto which the chromosome with the greatest fitness, among the N next-generation chromosomes obtained after the multiple iterations, maps in the policy space.
In one example, the number of chromosome iterations may be, say, 128.
In addition, as described above, each next-generation chromosome is selected from the Q candidate chromosomes via similarity computation, and each candidate chromosome maps onto a policy point in the policy space (i.e., a point whose product of dimension values does not exceed the first number). Therefore, once the chromosome with the greatest fitness is determined, a policy point is uniquely determined in the policy space.
Step 408, take the value of the first dimension of the target point as the target number, and split the target array according to the target number.
Specifically, the target array may be partitioned into a target number of sub-arrays. The number of elements included in each subarray may be the same or different.
Fig. 5 is a schematic diagram of array splitting provided in this specification. In FIG. 5, the target array to be split includes K elements and the target number is m (i.e., m parallel computing units are used to process the target array), so splitting yields m sub-arrays. Assuming the m sub-arrays contain the same number of elements, say n, then K, m, and n satisfy K = m × n, where n and m are positive integers.
It should be understood that FIG. 5 is only illustrative; in practice, when K is not an integer multiple of m, the m sub-arrays may contain different numbers of elements, for example the first m − 1 sub-arrays each containing n elements and the last sub-array containing fewer than n.
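The splitting just described can be sketched as follows (an illustrative sketch; ceiling division is one assumed way to realize the "last sub-array gets fewer elements" case):

```python
# Hypothetical sketch of step 408: split the target array into m sub-arrays.
# When len(arr) is not an integer multiple of m, each earlier sub-array
# receives n = ceil(len(arr) / m) elements and the last receives the rest.
def slice_array(arr, m):
    n = -(-len(arr) // m)  # ceiling division: elements per sub-array
    return [arr[i * n:(i + 1) * n] for i in range(m)]

print(slice_array([1, 2, 3, 4, 5, 6], 3))  # [[1, 2], [3, 4], [5, 6]]
```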
It should be noted that when the dimensions of the policy space also include the second dimension, the number of elements processed at a time may likewise be determined from the value of the second dimension of the target point.
Step 410, process the split target array in parallel by invoking the target number of parallel computing units.
After the target array is split, the number of sub-arrays obtained equals the target number (i.e., the number of parallel computing units used to process the target array). Each parallel computing unit processes one sub-array, so the sub-arrays are processed in parallel.
In one example, the processing here may be a reduction operation, i.e., each parallel computing unit maximizes, minimizes, sums, etc. the elements in each subarray.
It should be understood that when the number of elements processed at a time has also been determined, each parallel computing unit processes the elements of its sub-array in units of that number, i.e., the number of elements processed (unrolled) in one for-loop iteration is the number of elements processed at a time described above.
For example, suppose the target array includes 10000 elements and the target number is 100 (i.e., 100 threads are used to process the target array); splitting then yields 100 sub-arrays of 100 elements each. Suppose further that the number of elements processed at a time is 10. Then the 100 sub-arrays can be processed in parallel by invoking 100 threads, and each thread, when processing the 100 elements of its sub-array, needs to execute 10 for-loop iterations (100 elements ÷ 10 elements per iteration), with 10 elements unrolled per iteration.
It should be noted that, in the embodiments of the present specification, each thread unrolls several elements per for-loop iteration, which greatly reduces the number of loop-condition checks. In the foregoing example, without unrolling, 100 iterations would be executed and thus 100 checks made; with 10 elements unrolled per iteration, only 10 iterations are executed, requiring only 10 checks. As the number of checks decreases, the time consumed by a single pass of a thread decreases accordingly, thereby increasing the data processing speed.
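The effect of unrolling on the number of loop-condition checks can be illustrated with a small sketch (written in Python for readability; in practice the unrolling is applied to generated code, e.g. by a compiler such as Halide). The name `reduce_unrolled` is hypothetical:

```python
def reduce_unrolled(sub_array):
    # Sum 100 elements with the loop body unrolled 10x:
    # the loop condition is checked only 10 times instead of 100.
    assert len(sub_array) == 100
    total, checks, i = 0, 0, 0
    while i < 100:  # this condition is evaluated only 10 times
        checks += 1
        total += (sub_array[i]     + sub_array[i + 1] + sub_array[i + 2]
                  + sub_array[i + 3] + sub_array[i + 4] + sub_array[i + 5]
                  + sub_array[i + 6] + sub_array[i + 7] + sub_array[i + 8]
                  + sub_array[i + 9])
        i += 10
    return total, checks

total, checks = reduce_unrolled(list(range(100)))
print(total, checks)  # 4950 10
```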
It should be appreciated that, after the 100 threads are invoked to process the 100 sub-arrays in parallel, 100 intermediate results are obtained. The corresponding processing count can be recorded as 100, i.e., the number of elements in one sub-array.
After the 100 intermediate results are obtained, their number is small enough that a single thread can be called directly to process them and obtain the final result. The processing count here is likewise recorded as 100 (the number of intermediate results), so the total processing count for the target array is 100 + 100 = 200.
Of course, in practical applications, after the 100 intermediate results are obtained, they can also be processed in parallel by calling several threads. For example, 10 threads (a number obtained by searching) may be called in parallel, each processing 10 intermediate results, yielding 10 second-level intermediate results. The processing count here is recorded as 10, the number of intermediate results processed by a single thread.
The 10 second-level intermediate results can then be processed by calling 1 thread to obtain the final result; the processing count here is again recorded as 10. Thus, the total processing count for the target array is 100 + 10 + 10 = 120.
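The processing counts of the two schedules above (200 for the single-stage combine, 120 for the two-stage combine) can be checked with a small helper; `total_processing_count` and its arguments are illustrative names introduced here:

```python
def total_processing_count(first_number, stage_threads):
    # Each stage runs its threads in parallel, so the stage's cost is the
    # per-thread work (elements / threads), and it leaves one intermediate
    # result per thread for the next stage to consume.
    remaining = first_number
    total = 0
    for threads in stage_threads:
        total += remaining // threads
        remaining = threads
    return total

print(total_processing_count(10000, [100, 1]))      # 100 + 100 = 200
print(total_processing_count(10000, [100, 10, 1]))  # 100 + 10 + 10 = 120
```

The processing time of the target array is then this count multiplied by the time of a single processing step, which is what the search over the strategy space minimizes.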
It should be appreciated that, in this example, the processing time of the target array = 120 × the time of a single processing step.
It should be noted that, in the embodiments of the present specification, after the target point is found and the target number and the number of single processing elements are determined from the values of its first and second dimensions, the first number, the target number, and the number of single processing elements may be stored in association. When a similar array of the same length as the target array is processed later, it can be handled directly using the stored target number and number of single processing elements, which greatly increases the processing speed of such arrays.
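This storing-and-reuse step can be sketched as a simple memo table keyed by array length; all names here (`get_strategy`, `_strategy_cache`, `fake_search`) are hypothetical illustrations:

```python
# first_number -> (target_number, number of single processing elements)
_strategy_cache = {}

def get_strategy(first_number, search_fn):
    # Run the (expensive) policy-space search only once per array length;
    # later arrays of the same length reuse the stored strategy directly.
    if first_number not in _strategy_cache:
        _strategy_cache[first_number] = search_fn(first_number)
    return _strategy_cache[first_number]

calls = []
def fake_search(n):
    calls.append(n)
    return (100, 10)  # stand-in for the searched target point

print(get_strategy(10000, fake_search))  # (100, 10) -- search runs
print(get_strategy(10000, fake_search))  # (100, 10) -- served from the cache
print(len(calls))                        # 1
```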
Finally, it should be noted that the solution provided in the embodiments of the present disclosure may be implemented in the Halide language (a domain-specific language). The core idea of Halide is to decouple the algorithm description from its schedule, making it a language well suited to code optimization; implementing the present solution in Halide can therefore further increase the data processing speed.
In summary, according to the solution provided in the embodiments of the present disclosure, when an array is processed, the target point requiring the shortest time to process the target array is searched for automatically in the policy space, and this target point represents a strategy for processing the target array. In other words, the optimal strategy is searched for automatically in the strategy space, and the target array is then processed according to that strategy, which greatly increases the data processing speed. Moreover, because the optimal strategy is obtained automatically without manual involvement, automatic optimization of array processing is achieved, greatly reducing optimization cost and improving optimization efficiency.
Corresponding to the data processing method above, an embodiment of the present specification further provides a data processing apparatus. As shown in fig. 6, the apparatus may include:
the obtaining unit 602 is configured to obtain a target array to be processed, where the target array includes a first number of elements.
A building unit 604 for building a multi-dimensional policy space of a size determined based on the first number. The plurality of dimensions includes at least a first dimension corresponding to a number of parallel computing units used to process the target array.
The searching unit 606 is configured to search, in the policy space, a target point that requires the shortest time to process the target array.
The search unit 606 may specifically be configured to:
in the strategy space, an optimal solution solving algorithm is adopted to search a target point with the shortest time for processing the target array.
The optimal solution solving algorithm herein may include, but is not limited to: genetic algorithm, ant colony algorithm, simulated annealing algorithm, hill climbing algorithm or particle swarm algorithm.
The search unit 606 may specifically include:
a selecting module 6062, configured to select, in the policy space, N policy points whose product of values of the dimensions does not exceed the first number.
An executing module 6064, configured to execute multiple chromosome iterations with the N strategy points as N initial chromosomes, where each chromosome iteration includes selectively combining the N current generation chromosomes, based on their respective fitness, to generate N next generation chromosomes. The fitness of a chromosome is negatively correlated with its processing time, which is the time required to process the target array according to the strategy represented by the strategy point the chromosome maps to in the strategy space.
The execution module 6064 may be specifically configured to:
and (4) sequencing the N current generation chromosomes from high to low according to the fitness, and taking the M current generation chromosomes which are ranked at the top as M next generation chromosomes.
N-M chromosome combinations are performed on the remaining N-M current generation chromosomes to generate N-M next generation chromosomes. Each chromosome combination comprises that two current generation chromosomes are selected to be combined and corrected by adopting a random algorithm based on the respective fitness of N-M current generation chromosomes to obtain an initial next generation chromosome. The similarity between the initial next generation chromosome and the Q candidate chromosomes is calculated, and the candidate chromosome corresponding to the maximum similarity is taken as a final next generation chromosome. The Q candidate chromosomes correspond to Q strategy points whose product of the values of the dimensions in the strategy space does not exceed a first number.
The execution module 6064 is further to:
the processing time of each of the N next generation chromosomes is obtained.
The reciprocal of the processing time of each of the N next generation chromosomes is used as the updated fitness of each.
The N next generation chromosomes are used as the updated N current generation chromosomes.
The selecting module 6062 is further configured to take, as the target point, the strategy point mapped in the strategy space by the chromosome with the maximum fitness among the N next generation chromosomes obtained after the multiple chromosome iterations.
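A toy sketch of the chromosome iteration described above is given below. It simplifies the embodiment in two ways: parents are picked uniformly at random rather than by fitness-weighted selection, and an invalid offspring is corrected by resampling a valid point rather than by the similarity comparison against the Q candidate chromosomes. All names are illustrative:

```python
import random

def evolve(first_number, measure_time, n=20, m=4, iterations=30, seed=0):
    # Chromosomes are (threads, unroll) strategy points whose product of
    # dimension values must not exceed first_number; fitness is the
    # reciprocal of the measured processing time.
    rng = random.Random(seed)

    def valid(p):
        return p[0] >= 1 and p[1] >= 1 and p[0] * p[1] <= first_number

    def random_point():
        x = rng.randint(1, first_number)
        return (x, rng.randint(1, max(1, first_number // x)))

    def fitness(p):
        return 1.0 / measure_time(p)

    population = [random_point() for _ in range(n)]
    for _ in range(iterations):
        population.sort(key=fitness, reverse=True)
        next_gen = population[:m]                    # keep the top-ranked M
        while len(next_gen) < n:
            a, b = rng.choices(population[m:], k=2)  # simplified parent pick
            child = (a[0], b[1])                     # combine two parents
            if not valid(child):
                child = random_point()               # simplified correction
            next_gen.append(child)
        population = next_gen
    return max(population, key=fitness)
```

For instance, with a synthetic `measure_time` that is smallest near the point (100, 10), `evolve(10000, ...)` returns a valid strategy point whose dimension product does not exceed 10000.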
The segmentation unit 608 is configured to take the value of the first dimension of the target point as a target number, and segment the target array according to the target number.
The processing unit 610 is configured to process the segmented target array in parallel by calling the target number of parallel computing units.
The processing unit 610 may specifically be configured to:
performing parallel summation, maximum-value computation, or minimum-value computation on the segmented target array by calling the target number of parallel computing units.
Optionally, the plurality of dimensions further include a second dimension corresponding to the number of single processing elements. The above apparatus may further include:
the determining unit 612 is configured to determine the number of single processing elements according to the value of the second dimension of the target point. The one-time processing element number is used to indicate a processing unit of each of the target number of parallel computing units when the divided target array is processed.
The functions of the functional modules of the apparatus in the above embodiments of the present specification may be implemented through the steps of the above method embodiments; therefore, the specific working process of the apparatus provided in an embodiment of the present specification is not repeated herein.
The data processing device provided by one embodiment of the specification can greatly improve the data processing speed.
According to an embodiment of another aspect, there is also provided a computer-readable storage medium having stored thereon a computer program which, when executed in a computer, causes the computer to perform the method described in connection with fig. 4.
According to an embodiment of yet another aspect, there is also provided a computing device comprising a memory and a processor, the memory having stored therein executable code, the processor, when executing the executable code, implementing the method described in connection with fig. 4.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The steps of a method or algorithm described in connection with the disclosure herein may be embodied in hardware or in software instructions executed by a processor. The software instructions may consist of corresponding software modules that may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. Of course, the storage medium may also be integral to the processor. The processor and the storage medium may reside in an ASIC. Additionally, the ASIC may reside in a server. Of course, the processor and the storage medium may also reside as discrete components in a server.
Those skilled in the art will recognize that, in one or more of the examples described above, the functions described in this invention may be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media including any medium that facilitates transfer of a computer program from one place to another. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The above-mentioned embodiments, objects, technical solutions and advantages of the present specification are further described in detail, it should be understood that the above-mentioned embodiments are only specific embodiments of the present specification, and are not intended to limit the scope of the present specification, and any modifications, equivalent substitutions, improvements and the like made on the basis of the technical solutions of the present specification should be included in the scope of the present specification.

Claims (14)

1. A method of data processing, comprising:
acquiring a target array to be processed, wherein the target array comprises a first number of elements;
constructing a multi-dimensional policy space whose size is determined based on the first number; the plurality of dimensions includes at least a first dimension corresponding to a number of parallel computing units used to process the target array; the value of each dimension in the plurality of dimensions does not exceed the first number;
in the strategy space, searching a target point with the shortest time for processing the target array by adopting an optimal solution solving algorithm;
taking the value of the first dimension of the target point as a target number, and segmenting the target array according to the target number;
performing parallel processing on the segmented target array by calling the target number of parallel computing units;
in the policy space, searching a target point with the shortest time required for processing the target array by adopting an optimal solution solving algorithm, wherein the method comprises the following steps:
selecting N strategy points in the strategy space, wherein the product of the values of all the dimensions does not exceed the first number;
performing chromosome iteration a plurality of times with the N strategy points as N initial chromosomes, wherein each chromosome iteration comprises selectively combining the N current generation chromosomes, based on their respective fitness, to generate N next generation chromosomes; wherein the fitness is negatively correlated with a processing time, the processing time being the time required to process the target array according to the strategy represented by the strategy point mapped in the strategy space by the corresponding chromosome; and the N next generation chromosomes correspond to N strategy points in the strategy space for each of which the product of the values of the dimensions does not exceed the first number;
and taking the strategy point mapped in the strategy space by the chromosome corresponding to the maximum fitness in the N next generation chromosomes obtained after the multiple chromosome iterations as the target point.
2. The method of claim 1, the selectively combining the N current generation chromosomes to generate N next generation chromosomes based on their respective fitness, comprising:
sorting the N current generation chromosomes from high to low by fitness, and taking the top-ranked M current generation chromosomes as M next generation chromosomes;
performing N-M chromosome combinations on the remaining N-M current generation chromosomes to generate N-M next generation chromosomes; each chromosome combination comprises: selecting, based on the respective fitness of the N-M current generation chromosomes and using a random algorithm, two current generation chromosomes to be combined and corrected into an initial next generation chromosome; calculating the similarity between the initial next generation chromosome and Q candidate chromosomes; and taking the candidate chromosome with the maximum similarity as the final next generation chromosome; the Q candidate chromosomes correspond to Q strategy points in the strategy space for each of which the product of the values of the dimensions does not exceed the first number.
3. The method of claim 1, each chromosome iteration further comprising:
acquiring respective processing time of the N next generation chromosomes;
taking the reciprocal of the processing time of each of the N next generation chromosomes as the updated fitness of each of the N next generation chromosomes;
and taking the N next generation chromosomes as updated N current generation chromosomes.
4. The method of claim 1, the optimal solution solving algorithm further comprising: ant colony algorithm, simulated annealing algorithm, hill climbing algorithm or particle swarm algorithm.
5. The method of claim 1, the plurality of dimensions further comprising a second dimension corresponding to a single processing element number; the method further comprises the following steps:
determining the number of single processing elements according to the value of the second dimension of the target point; the number of single processing elements indicates the processing unit of each of the target number of parallel computing units when processing the segmented target array.
6. The method of claim 1, wherein the parallel processing of the sliced target array by invoking the target number of parallel computing units comprises:
and carrying out reduction operation on the segmented target array in parallel by calling the target number of parallel computing units.
7. A data processing apparatus comprising:
the device comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a target array to be processed, and the target array comprises a first number of elements;
a construction unit for constructing a multi-dimensional policy space whose size is determined based on the first number; the plurality of dimensions includes at least a first dimension corresponding to a number of parallel computing units used to process the target array;
the searching unit is used for searching a target point with the shortest time for processing the target array in the strategy space by adopting an optimal solution solving algorithm;
the segmentation unit is used for taking the value of the first dimension of the target point as a target number and segmenting the target array according to the target number;
the processing unit is used for calling the parallel computing units with the target number and carrying out parallel processing on the segmented target array;
the search unit includes:
a selecting module, configured to select, in the policy space, N policy points for which a product of values of the dimensions does not exceed the first number;
an execution module, configured to execute multiple chromosome iterations with the N strategy points as N initial chromosomes, where each chromosome iteration includes selectively combining the N current generation chromosomes, based on their respective fitness, to generate N next generation chromosomes; wherein the fitness is negatively correlated with a processing time, the processing time being the time required to process the target array according to the strategy represented by the strategy point mapped in the strategy space by the corresponding chromosome; and the N next generation chromosomes correspond to N strategy points in the strategy space for each of which the product of the values of the dimensions does not exceed the first number;
and the selection module is further configured to use, as the target point, a strategy point mapped in the strategy space by the chromosome corresponding to the maximum fitness among the N next generation chromosomes obtained after the multiple chromosome iterations.
8. The apparatus of claim 7, the execution module to:
sorting the N current generation chromosomes from high to low by fitness, and taking the top-ranked M current generation chromosomes as M next generation chromosomes;
performing N-M chromosome combinations on the remaining N-M current generation chromosomes to generate N-M next generation chromosomes; each chromosome combination comprises: selecting, based on the respective fitness of the N-M current generation chromosomes and using a random algorithm, two current generation chromosomes to be combined and corrected into an initial next generation chromosome; calculating the similarity between the initial next generation chromosome and Q candidate chromosomes; and taking the candidate chromosome with the maximum similarity as the final next generation chromosome; the Q candidate chromosomes correspond to Q strategy points in the strategy space for each of which the product of the values of the dimensions does not exceed the first number.
9. The apparatus of claim 7, the execution module further to:
acquiring respective processing time of the N next generation chromosomes;
taking the reciprocal of the processing time of each of the N next generation chromosomes as the updated fitness of each of the N next generation chromosomes;
and taking the N next generation chromosomes as updated N current generation chromosomes.
10. The apparatus of claim 7, the optimal solution solving algorithm further comprising: ant colony algorithm, simulated annealing algorithm, hill climbing algorithm or particle swarm algorithm.
11. The apparatus of claim 7, the plurality of dimensions further comprising a second dimension corresponding to a single processing element number; the device further comprises:
the determining unit is used for determining the number of single processing elements according to the value of the second dimension of the target point; the number of single processing elements indicates the processing unit of each of the target number of parallel computing units when processing the segmented target array.
12. The apparatus according to claim 7, wherein the processing unit is specifically configured to:
and carrying out reduction operation on the segmented target array in parallel by calling the target number of parallel computing units.
13. A computer-readable storage medium, on which a computer program is stored which, when executed in a computer, causes the computer to carry out the method of any one of claims 1-6.
14. A computing device comprising a memory having executable code stored therein and a processor that, when executing the executable code, implements the method of any of claims 1-6.
CN202011310504.5A 2020-11-20 2020-11-20 Data processing method and device Active CN112099959B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011310504.5A CN112099959B (en) 2020-11-20 2020-11-20 Data processing method and device

Publications (2)

Publication Number Publication Date
CN112099959A CN112099959A (en) 2020-12-18
CN112099959B CN112099959B (en) 2021-03-02

Family

ID=73785289



Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104849698A (en) * 2015-05-21 2015-08-19 中国人民解放军海军工程大学 Radar signal parallel processing method and system based on heterogeneous multinucleated system
CN110389819A (en) * 2019-06-24 2019-10-29 华中科技大学 A kind of dispatching method and system of computation-intensive batch processing task


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Jonathan Ragan-Kelley et al., "Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines," Association for Computing Machinery, June 2013, pp. 519-530. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant