CN102981886A - Method for generating optimized memset standard library function assembly code

Info

Publication number: CN102981886A
Application number: CN2012105639690A
Authority: CN
Inventors: 朱浩; 应欢; 王东辉; 洪缨
Original assignee: Institute of Acoustics CAS
Current assignee: Institute of Acoustics CAS
Priority date: 2012-12-21
Filing date: 2012-12-21
Publication date: 2013-03-20
Anticipated expiration: 2032-12-21
Also published as: CN102981886B

Abstract

The invention discloses a method for generating an optimized memset standard library function assembly code. The method comprises the steps as follows: determining the attributes and the size of an optimized filling execution segment to be generated according to the hardware characteristics of a target machine; constructing a centralized jump table according to the size of the optimized filling execution segment to be generated; carrying out branch judgment on a target filling address and a filling size in an input parameter according to the centralized jump table, and establishing a mapping relation from the input parameter to the centralized jump table; generating a filling mode set meeting the filling requirements according to the available data transmission instruction set of the target machine and the attributes of the optimized filling execution segment to be generated; and carrying out performance screening on the filling mode set according to the hardware characteristics of the target machine to obtain a filling mode with the optimal filling performance so as to generate the optimized filling execution segment. According to the method, the optimized data filling effect is achieved, the data filling performance of a memset standard library function is improved, and the portability is good.

Description

A method for generating optimized assembly code for the memset standard library function

Technical field

The present invention relates to techniques for generating assembly code for standard library functions, and in particular to a method for generating optimized assembly code for the memset standard library function.

Background technology

Digital signal processing tasks usually involve a large amount of data computation, for example the FIR (Finite Impulse Response) filter and the FFT (Fast Fourier Transform) algorithm commonly used in digital signal processing, and array initialization is generally done with the memset standard library function of the C standard. In the C11 standard, memset is defined as filling a memory region of a given size with a given byte value. Because memory is much slower than the microprocessor, and because memset is called intensively for data initialization on microprocessors aimed at data-intensive applications, optimizing it is well worthwhile. For microprocessors with different hardware characteristics, the implementation of a standard library function at the high-level-language level is usually identical. Precisely because of this uniformity, a standard library function at the high-level-language level is hard to optimize thoroughly for a specific target architecture. From an optimization standpoint, the best opportunities lie at the assembly level: the lower the level of a program, the easier its code is to schedule and the more effectively it can use the instruction set. To improve performance, therefore, many standard library functions on modern microprocessors are built into static libraries in assembly form.

The operation performed by memset is to fill a given memory block entirely with a given byte value, and its typical implementation fills one byte at a time. This algorithm is simple to implement, and its performance is acceptable when the memory block to be filled is small. However, when the data bandwidth of the microprocessor exceeds 8 bits and the memory block to be filled is large, byte-by-byte filling falls far short of the microprocessor's data bandwidth and performs poorly. On most platforms, filling must start from a memory alignment boundary to fully exploit the data bandwidth of the microprocessor.

The prior-art GCC (GNU Compiler Collection) compiler exploits exactly this when optimizing its memset implementation: depending on whether the target fill address is aligned, it divides the memory block to be filled into three parts: the block before the alignment boundary, the aligned block, and the block after the alignment boundary. The aligned block is filled several bytes at a time with multi-byte data transfer instructions, while the unaligned parts are still filled with byte data transfer instructions. GCC implements this memset optimization as the same C-level code for all targets, whereas optimization at the assembly level has to be implemented separately for each architecture according to its hardware characteristics, so portability is poor. For other independently developed microprocessors, most existing assembly-level optimizations of standard library functions start from the assembly code that the corresponding compiler generates from the high-level language and hand-optimize it according to the hardware characteristics of the target processor, to obtain better performance on that microprocessor. However, because this approach is based on compiler-generated assembly code, the result contains considerable redundancy and the optimization is not thorough.

Summary of the invention

The object of the invention is to provide, based on the essential behavior of the standard library function, a method for generating optimized assembly code for the memset standard library function.

To achieve this object, the invention provides a method for generating optimized assembly code for the memset standard library function, the method comprising:

determining, according to the hardware characteristics of the target machine, the attributes and the number of the optimized fill execution segments to be generated; constructing a centralized jump table according to the number of optimized fill execution segments to be generated, the centralized jump table consisting of multiple branch jump instructions, each of which directs the execution path to the corresponding optimized fill execution segment; according to the centralized jump table and using the logic instructions available on the target machine, performing branch judgments on the target fill address and the fill size in the input parameters according to the attributes of the optimized fill execution segments to be generated, thereby establishing a mapping from the input parameter set to the centralized jump table; generating, according to the data transfer instruction set available on the target machine and the attributes of the optimized fill execution segments to be generated (that is, the specific fill requirement), the set of all fill patterns that satisfy the fill requirement; and screening the fill pattern set for performance according to the hardware characteristics of the target machine to obtain the fill pattern with the best fill performance, from which the optimized fill execution segment is generated.

Based on the essential behavior of the standard library function, the embodiments of the invention provide a method for generating optimized assembly code for the memset standard library function. When the generated memset assembly code runs, it performs a fill optimized for the specific case described by its input parameters, that is, the fill requirement. Moreover, the method can be extended to other RISC (Reduced Instruction Set Computer) architectures merely by modifying the hardware characteristics definition file, so portability is good.

Description of drawings

Other features, characteristics and advantages of the invention will become more apparent from the following detailed description of embodiments, given by way of example with reference to the accompanying drawings.

Fig. 1 is a schematic flowchart of a method for generating optimized memset standard library function assembly code according to an embodiment of the invention;

Fig. 2 is a block diagram of the generation of optimized fill execution segments in an embodiment of the invention;

Fig. 3 is a schematic diagram of fill data expansion in an embodiment of the invention;

Fig. 4 is a schematic diagram of the decomposition of a data fill task in an embodiment of the invention;

Fig. 5 is a schematic flowchart of functional screening during basic fill pattern generation in an embodiment of the invention;

Fig. 6(a) shows one fill pattern for a data size of 16 in an embodiment of the invention;

Fig. 6(b) shows another fill pattern for a data size of 16 in an embodiment of the invention.

Embodiment

The technical solution of the application is described in further detail below with reference to the drawings and embodiments.

Array initialization is usually done with the memset standard library function of the C standard. Because the C-language implementation of memset is written with compatibility in mind, it is slower than assembly code written for a target machine of a specific architecture. As a basic principle of optimization, the lower the level of a program, the easier its code is to schedule and the more effectively it can use the instruction set. Assembly code is normally generated by a compiler, which takes the hardware characteristics of the target machine into account and applies operations such as compiler optimization and code block merging to the high-level language. To improve performance, therefore, many standard library functions on modern microprocessors are built into static libraries in assembly form. Optimizing a standard library function also differs from optimizing other programs under development: because the behavior of a standard library function is well defined, optimized code can be generated directly from a description of that behavior, which yields better performance than modifying redundant code.

In the C11 standard, the prototype of memset is void *memset(void *s, int c, size_t n), where the input parameters s, c and n are the target fill address, the fill value and the fill size respectively. In essence, memset fills a given memory region with a given byte value, and it is typically used to initialize newly allocated memory blocks. Its typical implementation fills one byte at a time; this performs acceptably when the memory block is small, but when the block to be filled is large, the many loop iterations are time-consuming and performance is very low. How to fill several bytes at a time is therefore the design focus of many memset optimization algorithms. The instruction sets of modern microprocessors provide byte, half-word and word addressing and support multi-byte data transfers. To fully exploit the data bandwidth of the processor when filling a memory block with multi-byte data transfer instructions, the alignment of the target fill address must be considered. For example, a 4-byte data transfer instruction requires the target fill address to be 4-byte aligned; otherwise the unaligned access can trigger an exception. Choosing the usable data transfer instructions according to whether the address to be filled is aligned is therefore the key to generating optimized memset assembly code. In addition, function performance suffers when the data to be filled misses in the cache memory (Cache). Therefore, the fill size handled at one time is bounded by the Cache block size, and the loop fills one Cache block per iteration, which improves memset performance by reducing the Cache miss rate. In summary, when the optimized memset runs, it performs branch judgments on its input parameters and, according to the result, jumps to the fill execution segment optimized for that case. The assembly code of the optimized memset is therefore generated automatically by a program in three parts: the expansion of the fill data, the branch judgment part, and the optimized fill execution segments.

Fig. 1 is a schematic flowchart of a method for generating optimized memset standard library function assembly code according to an embodiment of the invention. The generated assembly code comprises a branch judgment part and optimized fill execution segments. As shown in Fig. 1, the method of this embodiment comprises steps 101 to 105.

Steps 101 to 103 mainly concern the generation of the branch judgment part.

When the program generates the branch judgment part, what it actually generates is a mapping from the set of memset input parameters to the set of optimized fill execution segments, so that every concrete input parameter corresponds to exactly one optimized fill execution segment. Each concrete optimized fill execution segment has its own attributes: the alignment pattern (the target fill address s in the input parameters determines the alignment pattern of the target fill address, which in turn determines which of the target machine's data transfer instructions can be chosen), the fill size, and so on. From the hardware characteristics of the target machine, the number of alignment patterns of the available data transfer instructions (for example 8-byte alignment, 4-byte alignment, and so on) can be obtained, and the fill sizes are classified with the Cache block size of the target machine as the upper bound (one class for each fill size from 0 to the Cache block size). The number of concrete fill execution segments is therefore (number of alignment patterns) × (Cache block size + 1). Based on this analysis, the branch judgment part is generated in the following steps:

In step 101, the attributes and the number of the optimized fill execution segments to be generated are determined according to the hardware characteristics of the target machine.

Specifically, the alignment patterns of the target fill address and the fill sizes are obtained from the alignment patterns of the data transfer instructions available on the target machine and from the Cache block size of the target machine, which determines the number of optimized fill execution segments to be generated.

In step 102, a centralized jump table is constructed according to the number of optimized fill execution segments to be generated. The centralized jump table consists of multiple branch jump instructions, each of which directs the execution path to the corresponding optimized fill execution segment.

In step 103, according to the centralized jump table and using the logic instructions available on the target machine, branch judgments are performed on the target fill address and the fill size in the input parameters according to the attributes of the optimized fill execution segments to be generated, establishing a mapping from the input parameter set to the centralized jump table. This mapping is one-to-one or many-to-one.
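The mapping from input parameters to jump table entries in steps 101 to 103 can be illustrated with a minimal Python sketch. It rests on assumptions that the text does not spell out: the alignment pattern is taken to be the target address modulo a hypothetical `max_align`, and only fill sizes from 0 to the Cache block size are indexed directly (larger fills are assumed to be reduced to this range elsewhere). The function names are illustrative only.

```python
def segment_count(alignment_patterns: int, cache_block_size: int) -> int:
    """Number of fill execution segments: patterns x (block size + 1)."""
    return alignment_patterns * (cache_block_size + 1)


def jump_table_index(address: int, size: int,
                     max_align: int, cache_block_size: int) -> int:
    """Map (target address, fill size) to one jump table slot.

    Assumption: the alignment pattern is address % max_align, and the
    fill size has already been reduced to 0..cache_block_size.
    """
    if not 0 <= size <= cache_block_size:
        raise ValueError("size must lie in 0..cache_block_size")
    pattern = address % max_align              # alignment pattern of s
    return pattern * (cache_block_size + 1) + size


# With 4 alignment patterns and a 32-byte Cache block there are 132 slots.
slots = {jump_table_index(a, n, 4, 32) for a in range(4) for n in range(33)}
```

Many addresses share one alignment pattern, so the mapping onto the table is many-to-one, as the text states.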

Note that before the branch judgment part is generated, the fill byte given in the input parameters must be widened to the maximum width of the target machine's data registers. In the C11 standard, the fill value in the memset input parameters is declared as an int, and inside the function it is converted to unsigned char; that is, memset fills the memory region by repeating an 8-bit value. Filling one byte at a time, however, falls far short of the processor's data bandwidth. On modern microprocessors, which generally support multi-byte data transfer instructions, filling several bytes at a time requires the 8-bit fill value to be expanded, according to the hardware characteristics of the target machine, to the maximum width of its data registers. The expansion can be implemented in several ways, for example with shift, OR, addition or multiplication operations. Fig. 3 shows the fill data expansion on a hardware platform whose data registers are at most 32 bits wide.
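As an illustration of this expansion, the following Python sketch replicates the 8-bit fill value across a register, once with shift and OR and once with a multiplication. The function names and the default 32-bit width are assumptions chosen for the example, not taken from the patent.

```python
def expand_fill_byte(c: int, register_bits: int = 32) -> int:
    """Replicate the low 8 bits of c across a register using shift and OR."""
    if register_bits < 8 or register_bits % 8 != 0:
        raise ValueError("register_bits must be a positive multiple of 8")
    value = c & 0xFF          # memset converts c to unsigned char
    width = 8
    while width < register_bits:
        value |= value << width   # 8 -> 16 -> 32 -> ... bits
        width *= 2
    return value & ((1 << register_bits) - 1)


def expand_fill_byte_mul(c: int, register_bits: int = 32) -> int:
    """Same result using one multiplication by 0x0101...01."""
    ones = int("01" * (register_bits // 8), 16)
    return (c & 0xFF) * ones
```

For a 32-bit register, the fill byte 0xAB becomes 0xABABABAB with either variant; which one is cheaper depends on the instruction costs of the target machine.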

Steps 104 and 105 mainly concern the generation of the optimized fill execution segments.

The optimized fill execution segments are the main part of the optimized memset assembly code that actually performs the data fill. When the optimized memset runs, this part completes the fill in a way optimized for the target fill address and the fill size. According to the attributes of the optimized fill execution segments determined in the branch judgment part, that is, the specific fill requirement (the alignment pattern of the target fill address and the fill size), and drawing on ideas from data mining, the set of all fill patterns that satisfy the specific fill requirement is generated. Based on the hardware characteristics of the target machine, the fill pattern set is screened for performance by evaluating execution cost, which yields the fill pattern with the best fill performance, and the optimized fill execution segment is generated from it.

In step 104, the set of fill patterns that satisfy the fill requirement is generated according to the data transfer instruction set available on the target machine and the attributes of the optimized fill execution segments to be generated, that is, the concrete fill requirement.

Specifically, as shown in Fig. 2 and Fig. 4, a fill task of size n is decomposed into a head fill task, a loop fill task and a tail fill task, forming a loop-based fill.

In practice, when the memory region to be filled is large, fully unrolling the fill over the whole fill size would not only make the fill pattern set very large, but would also make the size of each fill pattern grow linearly with the fill size, so that the assembly code of each concrete optimized fill execution segment would grow rapidly. Therefore, when generating the optimized fill execution segments, a fill task of size n is decomposed into a head fill task, a loop fill task and a tail fill task, forming a loop-based fill. For the head, loop and tail fill tasks respectively, the basic fill pattern generation algorithm is applied to the data transfer instruction set available on the target machine, giving a head fill pattern set, a loop fill pattern set and a tail fill pattern set. The elements of these three sets are then combined in order to obtain the combined fill pattern set, each element of which forms the body of the assembly code of an optimized fill execution segment.
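The decomposition can be sketched as follows. Fig. 4 is not reproduced here, so the exact split is an assumption: the head is taken to be the bytes needed to reach an `align`-byte boundary, the loop fills one `block` (for example one Cache block) per iteration, and the tail is whatever remains. The name `decompose_fill` and its parameters are hypothetical.

```python
def decompose_fill(address: int, size: int, align: int, block: int):
    """Split a fill of `size` bytes into (head_bytes, loop_count, tail_bytes).

    Assumption: the head reaches the next `align`-byte boundary, each
    loop iteration fills `block` bytes, and the tail is the remainder.
    """
    head = min((-address) % align, size)   # bytes up to the boundary
    remaining = size - head
    loop_count = remaining // block        # whole blocks filled in the loop
    tail = remaining - loop_count * block  # leftover bytes after the loop
    return head, loop_count, tail


# 100 bytes starting at 0x1003 with 4-byte alignment and 32-byte blocks:
head, loops, tail = decompose_fill(0x1003, 100, 4, 32)
```

In this example the head is 1 byte, the loop runs 3 times and the tail is 3 bytes, so only the short head and tail patterns and a single loop body need to be generated, whatever the total size.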

The basic fill pattern generation algorithm is described below:

The essential behavior of the memset standard library function is to fill a given memory region. From the fill requirement, together with the data transfer instruction set available on the target machine and the address alignment requirements of those instructions, the set of all fill patterns that satisfy the fill requirement can be obtained.

In linear space, the following definitions are made. Let Γ be the instruction set supported by the target machine, and let i be a particular data transfer instruction in it, with an attribute bytes giving the number of data bytes the instruction can transfer. The set Φ of data transfer instructions available on the target machine is then defined as Φ = { i ∈ Γ | i.behavior_attri = MOVE_DATA_TO_MEM }, and the set of all fill patterns that satisfy the fill requirement is defined as Ψ = { (i₁, i₂, ..., iₙ) | i₁, i₂, ..., iₙ ∈ Φ ∧ sum(i₁.bytes, i₂.bytes, ..., iₙ.bytes) = SIZE }, where SIZE is the fill size in the fill requirement and sum is the usual summation function.

Because assembly code depends on the target hardware platform, the generation of fill patterns is closely tied to the hardware characteristics of the target machine. The essential behavior of memset is to fill a given memory region with a given byte value. Therefore, when generating fill patterns, the data transfer instruction set available on the target machine must be screened functionally according to the address alignment requirements of the data transfer instructions. If, during pattern generation, the current address cannot satisfy a condition needed to generate the current fill pattern, such as the address alignment requirement of a data transfer instruction, generation of the current pattern fails. The functional screening is implemented as shown in Fig. 5. First, the data transfer instruction set available on the target machine is traversed, and the current data transfer instruction is denoted i. Next, it is determined whether the current address satisfies the address alignment requirement of the current data transfer instruction; if it does, the current data transfer instruction is appended in order to the current fill pattern; otherwise, the current data transfer instruction is screened out and the next data transfer instruction is examined. After the current fill pattern has been completely generated, the performance evaluation function is called on it; if the current fill pattern is screened out, generation of a new fill pattern begins; otherwise, the current fill pattern is added to the fill pattern set.

When fill patterns are generated for a particular fill requirement, the essential behavior of memset guarantees that, after functional screening, the set Ψ of all fill patterns satisfying the fill requirement is well defined. For each element A_k in Ψ (an n-tuple (i₁, i₂, ..., iₙ)), the execution performance of A_k depends on the number of elements it contains. For example, with a fill size of 16 and all other fill requirements equal, two typical generated fill patterns are shown in Fig. 6(a) and Fig. 6(b):

Fig. 6(a) is the typical fill pattern that transfers one byte at a time, and Fig. 6(b) is the fill pattern that transfers 4 bytes at a time. Comparing the two shows that the fewer elements a fill pattern contains, the smaller the assembly code and the fewer the interactions between processor and memory during the fill. Moreover, on typical microprocessors, single-byte and multi-byte data transfer instructions have the same execution cost. A more compact fill pattern therefore uses the instruction set more effectively and makes fuller use of the processor's data bandwidth. Accordingly, with the number of elements in a fill pattern as the criterion, the performance evaluation function is applied to the set Ψ, and after filtering the most compact fill patterns remain.
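The element-count criterion can be illustrated with a short Python sketch. Fill patterns are represented here simply as tuples of instruction mnemonics; the mnemonics `sb`, `sh` and `sw` (byte, half-word and word stores) are placeholders, not the instruction names of any particular target.

```python
def simplest_patterns(patterns):
    """Keep only the fill patterns with the fewest instructions."""
    fewest = min(len(p) for p in patterns)
    return [p for p in patterns if len(p) == fewest]


# Three ways to fill 16 bytes: 16 byte stores, 8 half-word stores,
# or 4 word stores (the last corresponds to the case of Fig. 6(b)).
candidates = [("sb",) * 16, ("sh",) * 8, ("sw",) * 4]
best = simplest_patterns(candidates)
```

Here only the four-instruction pattern survives, matching the observation that the pattern transferring 4 bytes at a time is preferable to the byte-by-byte pattern.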

For a particular fill requirement, the steps for generating its basic fill patterns are as follows:

1. Entry point for generating the current fill pattern.

2. Traverse the data transfer instruction set Φ available on the target machine, and denote the current transfer instruction by i.

3. Call the screening function on instruction i to apply functional screening (as shown in Fig. 5). If the current instruction is screened out, go to step 2; otherwise, go to step 4.

4. Append the current instruction i in order to the current fill pattern, and determine whether the current fill pattern is complete. If it is not complete, go to step 2; otherwise, go to step 5.

5. Call the performance evaluation function on the current fill pattern. If the current fill pattern is filtered out, begin generating a new fill pattern, that is, go to step 1; otherwise, go to step 6.

6. Add the current fill pattern to the fill pattern set for the current fill requirement.
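The following Python sketch illustrates the idea behind these steps by enumerating, recursively, every instruction sequence that fills exactly `size` bytes while respecting alignment. It is a simplified illustration rather than the patented control flow: the performance evaluation of step 5 is omitted, each store is assumed to require alignment equal to its width, and the `Store` type and the `sb`/`sh`/`sw` instruction list are hypothetical.

```python
from typing import NamedTuple


class Store(NamedTuple):
    name: str
    nbytes: int  # bytes written; also assumed to be the required alignment


# Hypothetical data transfer instruction set (the set Phi in the text).
STORES = (Store("sb", 1), Store("sh", 2), Store("sw", 4))


def generate_patterns(address: int, size: int, stores=STORES):
    """Return all store sequences that fill exactly `size` bytes from `address`."""
    if size == 0:
        return [()]                      # the current pattern is complete
    patterns = []
    for ins in stores:                   # traverse the instruction set
        if ins.nbytes > size:
            continue                     # would overshoot the fill size
        if address % ins.nbytes != 0:
            continue                     # functional screening: misaligned
        tails = generate_patterns(address + ins.nbytes, size - ins.nbytes, stores)
        patterns.extend((ins,) + rest for rest in tails)
    return patterns
```

The number of patterns grows quickly with `size`, which is one reason the text decomposes large fills into head, loop and tail tasks before applying the algorithm.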

For a particular fill requirement, the fill pattern set Ψ obtained by the above steps is the complete set of patterns that satisfy the fill requirement, and it is very large. From the essential behavior of the library function, Ψ contains at least one element, namely the byte-by-byte fill (i₁, i₂, ..., iₙ) in which i₁, i₂, ..., iₙ are all the byte data transfer instruction of the instruction set, although this element is not the optimal solution.

In step 105, the fill pattern set is screened for performance according to the hardware characteristics of the target machine to obtain the fill pattern with the best fill performance, from which the optimized fill execution segment is generated.

On typical microprocessors, one measure of the execution cost of assembly code is the number of cycles its instructions take. This measure is closely tied to the hardware characteristics of the target machine and can be defined in the hardware characteristics definition file for each architecture. Because the essential behavior of memset is to fill a given memory region, the execution cost of the data transfer instructions in the assembly code largely determines the execution cost of the whole fill execution segment. Using the instruction execution cost information in the hardware characteristics definition file of the target machine, the total execution cost of each fill pattern in the combined fill pattern set is calculated. With instruction execution cost as the criterion, the combined fill pattern set is screened to obtain the fill pattern with the lowest execution cost, from which the concrete fill execution segment optimized for that case is generated.
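A minimal Python sketch of this cost-based screening is given below. The cycle counts stand in for the contents of a hardware characteristics definition file, whose actual format the patent does not give; the dictionaries, the mnemonics and the function names are all assumptions made for illustration.

```python
# Hypothetical per-instruction cycle counts, as might be read from a
# hardware characteristics definition file.
UNIFORM_CYCLES = {"sb": 1, "sh": 1, "sw": 1}
SLOW_WORD_CYCLES = {"sb": 1, "sh": 1, "sw": 3}


def pattern_cost(pattern, cycles):
    """Total execution cost of a fill pattern, in cycles."""
    return sum(cycles[name] for name in pattern)


def cheapest_pattern(patterns, cycles):
    """Select the fill pattern with the lowest total execution cost."""
    return min(patterns, key=lambda p: pattern_cost(p, cycles))


# Three candidate patterns that each fill 4 bytes.
candidates = [("sb", "sb", "sb", "sb"), ("sh", "sh"), ("sw",)]
```

With uniform costs the single word store is selected; on a hypothetical machine where a word store costs 3 cycles, the two half-word stores are selected instead, which shows how changing only the cost table retargets the selection.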

Based on the essential behavior of the standard library function, the embodiments of the invention provide a method in which a program generates optimized assembly code for the memset standard library function. When the generated memset assembly code runs, it performs a fill optimized for the specific case described by its input parameters. Moreover, the method can be extended to other RISC (Reduced Instruction Set Computer) architectures merely by modifying the hardware characteristics definition file, so portability is good.

Obviously, the invention described here admits many variations without departing from its true spirit and scope. Therefore, all changes that would be apparent to those skilled in the art are intended to fall within the scope of the claims. The scope of protection sought for the invention is limited only by the claims.

Claims

1. A method for generating optimized assembly code for the memset standard library function, characterized by comprising:

determining, according to the hardware characteristics of a target machine, the attributes and the number of the optimized fill execution segments to be generated;

constructing a centralized jump table according to the number of the optimized fill execution segments to be generated, the centralized jump table consisting of multiple branch jump instructions, each of which directs the execution path to the corresponding optimized fill execution segment;

according to the centralized jump table and using the logic instructions available on the target machine, performing branch judgments on the target fill address and the fill size in the input parameters according to the attributes of the optimized fill execution segments to be generated, and establishing a mapping from the input parameter set to the centralized jump table;

generating, according to the data transfer instruction set available on the target machine and the attributes of the optimized fill execution segments to be generated, a set of fill patterns that satisfy the fill requirement;

screening the fill pattern set for performance according to the hardware characteristics of the target machine to obtain the fill pattern with the best fill performance, and generating the optimized fill execution segment from it.

2. The method according to claim 1, characterized in that the step of generating, according to the attributes of the optimized fill execution segments to be generated, the set of fill patterns that satisfy the fill requirement comprises:

decomposing, according to the attributes of the optimized fill execution segments to be generated, a fill task with a data fill size of n into a head fill task, a loop fill task and a tail fill task;

applying a basic fill pattern generation algorithm to the data transfer instruction set available on the target machine for the head fill task, the loop fill task and the tail fill task respectively, to obtain a head fill pattern set, a loop fill pattern set and a tail fill pattern set that satisfy the head fill task, the loop fill task and the tail fill task respectively.

3. The method according to claim 2, characterized in that the basic fill pattern generation algorithm is implemented by the following steps:

traversing the data transfer instruction set available on the target machine;

determining whether the current address satisfies the address alignment requirement of the current data transfer instruction; if it does, appending the current data transfer instruction in order to the current fill pattern; otherwise, screening out the current data transfer instruction and moving on to the next data transfer instruction;

after the current fill pattern has been completely generated, evaluating the performance of the current fill pattern; if the current fill pattern is filtered out, beginning generation of a new fill pattern; otherwise, adding the current fill pattern to the fill pattern set.

4. The method according to claim 3, characterized in that the performance evaluation of the current fill pattern screens the current fill pattern using the number of elements in the current fill pattern as the criterion.

5. The method according to claim 1, characterized in that the step of screening the fill pattern set for performance according to the hardware characteristics of the target machine to obtain the fill pattern with the best fill performance comprises:

calculating, according to the instruction execution cost information in the hardware characteristics definition file of the target machine, the total execution cost of each fill pattern in the fill pattern set, and, with instruction execution cost as the criterion, screening the fill pattern set to obtain the fill pattern with the lowest execution cost.

6. The method according to claim 1, characterized by further comprising, before the step of constructing the centralized jump table:

expanding the fill byte given in the input parameters to the maximum width of the data registers of the target machine by one or more of the shift, logical OR, addition and multiplication operations supported by the target machine.

7. The method according to claim 1, characterized in that the attributes of the optimized fill execution segments to be generated comprise the alignment pattern of the target fill address and the fill size.