CN103744644B - Quad-core processor system built using a four-core structure and data exchange method - Google Patents
- Publication number
- CN103744644B, CN201410014522.7A
- Authority
- CN
- China
- Prior art keywords
- data
- micro
- processor
- core
- kernel
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Multi Processors (AREA)
Abstract
The present invention provides a quad-core processor system built using a four-core structure, and a data exchange method. The system processes data in a single-program-block, multiple-data mode and includes four reduced-instruction-set microprocessor cores. Each microprocessor core includes: an instruction memory for storing instructions; an in-core data memory for storing data; and a central processing unit that performs the operations specified by the input instructions and data and updates the register file inside the central processing unit and the external data memory. The invention exploits the parallelism of an algorithm to improve its execution efficiency. In addition, two data exchange mechanisms, shared registers and a multilayer bus built between the microprocessor cores and the external data memories, establish data paths among the cores of the quad-core processor, which improves performance when the processor handles data in parallel and improves data exchange efficiency.
Description
Technical field
The present invention relates to a quad-core processor system built using a four-core structure and to a data exchange method.
Background technology
Quad-core processors are also called on-chip multiprocessors or chip multiprocessors. The design idea was first proposed by Stanford University in the United States in 1996: integrate multiple cores in a single chip to improve processor performance. Each processing core of a quad-core processor has a relatively simple structure. By exploiting the number of cores, the processor simultaneously executes several times as many threads or tasks as a single-core processor, which greatly improves parallel performance. Meanwhile, the use of shared on-chip resources effectively raises communication speed and reduces power consumption. These features give quad-core processors a great advantage.
Four-core technology represents an innovation in the development of computer technology. After more than ten years of development, the applications of quad-core processors cover many fields, including multimedia computing, embedded devices, personal computers, commercial servers, and high-performance computers, and quad-core processors have become the mainstream of processor development. Compared with single-core processors, quad-core processors have the following significant advantages:
1. Simple control logic: Compared with superscalar microprocessor architectures and very long instruction word structures, the control logic of a quad-core processor structure is much less complex, so the corresponding hardware implementation is much simpler.
2. High clock frequency: Because the control logic of a quad-core processor structure is relatively simple and contains few global signals, wire delay affects it less. Under the same process conditions, a hardware implementation of a quad-core processor can therefore reach a higher operating frequency than superscalar microprocessors and very long instruction word microprocessors.
3. Low power consumption: Dynamic voltage/frequency scaling, optimized load distribution, and similar measures can effectively reduce the power consumption of a quad-core processor.
4. Short design and verification cycle: Microprocessor manufacturers generally use an existing, mature single-core processor as the processor core, which shortens the design and verification cycle and saves R&D cost.
The quad-core processor structure not only has many advantages, such as large performance potential, high integration, high parallelism, simple structure, and convenient design verification, but can also inherit results from traditional uniprocessor research, such as simultaneous multithreading, wide-issue instruction, and voltage-reduction low-power techniques. However, the quad-core processor is after all a new structure, and its structural design and application development raise new problems not encountered before, which pose challenges to its future.
In the current evolution of four-core technology, the following problems deserve particular attention.
1. Choice of core type
The cores of current quad-core processors are mainly of two kinds, homogeneous and heterogeneous.
A homogeneous structure uses a symmetric design; the principle is simple and the hardware is easier to implement. Current mainstream dual-core and quad-core processors almost all use a homogeneous structure. However, improving processor performance by adding central processor cores has a limit. Once that limit is reached, performance no longer improves as the number of cores increases. This is the well-known Amdahl's law: even if cores of the same type are added continually to strengthen parallel processing, the performance of the whole system is still limited by the part of the software that must execute serially. The problems of homogeneous design are: as the number of cores keeps growing, how to keep the data of each core consistent; how to meet the cores' memory-access and input/output access demands; how to select a processor with balanced performance, small area, and low power consumption; and how to balance load and coordinate tasks among several processors.
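The limit described by Amdahl's law can be made concrete with a short calculation. The following is a minimal sketch for illustration only; the function name `amdahl_speedup` and the 10% serial fraction used in the demonstration are assumptions chosen for the example and do not come from the patent.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Ideal speedup on `cores` identical cores under Amdahl's law.

    serial_fraction is the share of run time that must execute serially
    (an assumed input; the source gives no figure).
    """
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    # Serial part is unchanged; parallel part is divided among the cores.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Speedup saturates as cores are added while 10% of the work stays serial.
    for n in (1, 2, 4, 8, 1024):
        print(n, round(amdahl_speedup(0.10, n), 2))
```

With a 10% serial fraction, four cores give a speedup of about 3.08, and no number of cores can reach a speedup of 10.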
Heterogeneous means using different types of cores within one processor, such as central processor cores and programmable cores. Compared with a homogeneous structure, the advantage of a heterogeneous structure is that it optimizes the internal structure of the processor by combining cores with different characteristics, which optimizes processor performance and can effectively reduce power consumption. For example, floating-point arithmetic and signal processing, at which a central processor core is not good, are executed by other programmable cores integrated on the same chip. But the heterogeneous structure also has difficulties. First, which different cores to combine, and how to divide and implement tasks among the cores. Second, whether the structure scales well, which is also limited by the number of cores. Third, the design and implementation of the processor instruction system is also a problem. Because the instruction system used by different cores is important to the implementation of the system, whether these different cores use the same instruction system or different ones, and whether they can run an operating system, also need to be considered.
2. On-chip storage structure design
The speed gap between the processor and main memory has always been a problem that processor structure design must take into account; this is the well-known "memory wall" problem. The architecture of the storage system bears directly on overall system performance and strongly affects the size, power consumption, layout, performance, and operating efficiency of the whole chip. In the past, a uniprocessor could largely solve this problem with a cache structure, which ensured that processor performance was realized. In the quad-core era, however, the problem caused by the speed gap between the cores and main memory has become serious. As the number of cores inside the processor increases, the demand for main-memory access grows, and the cache levels and access bandwidth of the uniprocessor era cannot keep up with the access demand of a quad-core processor. A storage structure must therefore be designed specifically for quad-core processors, and the efficiency of the storage system must be resolved.
In current storage system design, most processors adopt a cache design, and some processors use an on-chip memory structure. The advantages of a cache structure are that the hardware is easy to design and implement and that application development and programming are easy; the disadvantages are that cached data must be kept consistent and that the structure is hard to scale. For the cache coherence problem, the main solutions are bus snooping protocols and directory-based protocols. In a snooping protocol, each cache constantly monitors the bus through a cache snooper to receive coherence commands; unfortunately, it is suitable only when the number of cores is small. A directory protocol uses a directory to record the state of its own memory blocks in other caches and uses point-to-point communication to maintain coherence; its disadvantages are that the implementation cost is high and that concurrent access to the directory creates a performance bottleneck. Besides these hardware coherence algorithms, there are also software coherence algorithms based on multiprocessors, but whether they can serve as the cache coherence mechanism of a four-core structure needs further study. At present, most quad-core processors use a bus snooping protocol. On-chip memory brings memory from outside the chip onto the chip and is addressed uniformly with off-chip memory, so it avoids cache misses and coherence problems; but because it uses a memory structure, its access latency is larger than that of a cache. Some researchers currently build on-chip memory from high-speed dynamic random access memory to narrow the performance gap with caches. Besides the choice of storage structure, storage structure design also raises these questions: how large the memory should be; at which level data sharing and communication should be implemented; at which level the cache coherence problem should be solved; and how the storage structure should support multithreaded applications.
3. On-chip communication
Although the multiple cores on a quad-core chip each execute their own code, different cores may need to share and synchronize data, so the performance of the on-chip communication structure directly affects the performance of the processor. There are currently three main modes of on-chip communication: shared bus, crossbar interconnect, and network-on-chip.
In a shared-bus structure, the on-chip cores, input/output ports, and memory communicate through a shared level-2 or level-3 cache, or through a bus that connects the cores. The strength of the bus structure is that it is relatively simple and easy to design and implement; most current dual-core and quad-core processors use it. The bus structure is the communication backbone of existing chip architectures, but as circuit scale expands it becomes the bottleneck of chip design. Although a bus can effectively connect multiple communicating parties, bus address resources cannot expand without limit as the number of computing units grows. Although a bus can be shared by multiple users, one bus cannot support more than one pair of users communicating at the same time; that is, the serial access mechanism creates a communication bottleneck. In addition, on-chip communication is a main source of power consumption, and the power consumed by the large clock network and the bus accounts for the great majority of total chip power. The bus network is therefore suitable when the number of cores is small. Typical shared-bus processors include Xin Dela, Intel's Core, and IBM's Power4/Power5.
A crossbar interconnect structure consists of crossbar switches and interface logic. Compared with a bus structure, the crossbar has more data channels and larger access bandwidth, but it occupies more chip area, and its performance also declines as the number of cores increases, so it too is suitable only when the number of cores is small. For example, the AMD Athlon X2 dual-core processor uses a crossbar switch to control communication between the cores and the outside.
A network-on-chip comprises two subsystems, computation and communication. The computation subsystem performs "computation" tasks in the broad sense; its elements can be central processing units in the existing sense, intellectual property cores with various special functions, memory arrays, reconfigurable hardware, and so on. The communication subsystem connects the microprocessor cores and provides high-speed communication between computing resources. The network formed by the communication nodes and the interconnect lines between them is called the on-chip communication network. It borrows the communication approach of distributed computing systems and replaces the traditional on-chip bus with routing and packet-switching technology to carry out communication tasks. The network-on-chip has much in common with the interconnects of parallel computers: it supports packet communication, is scalable, and provides transparent communication services. It also differs: network-on-chip technology supports simultaneous access and has high reliability and high reusability. Compared with bus and crossbar structures, a network-on-chip can connect more intellectual property core components and offers high reliability, strong scalability, and lower power consumption, so it is considered the interconnect technology for larger-scale on-chip quad-core processors. Current networks-on-chip mainly use interconnect structures such as the two-dimensional mesh and the torus. The design problem of a network-on-chip is to find the best balance between network overhead and the degree of coupling among the cores, while also considering the scalability of the network. A network processor uses a two-dimensional on-chip network structure; with an integrated high-speed network and an optimized routing algorithm, the maximum communication delay between on-chip cores does not exceed 6 cycles, and the structure is highly scalable.
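To illustrate how routing replaces a shared bus in a two-dimensional mesh, the sketch below computes the path a packet would take between two mesh nodes. It assumes dimension-order (X first, then Y) routing, which is a common textbook choice and not something the source specifies; the names `xy_route` and `hop_count` are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int]  # (x, y) coordinates of a router in the mesh


def xy_route(src: Node, dst: Node) -> List[Node]:
    """Nodes visited from src to dst, moving along X first and then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def hop_count(src: Node, dst: Node) -> int:
    """Number of links traversed, equal to the Manhattan distance."""
    return len(xy_route(src, dst)) - 1
```

In a 2 x 2 mesh of four cores, the longest route under this assumption is two hops, for example from (0, 0) to (1, 1).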
Each of these three structures has its own strengths and weaknesses, and they can also be combined, for example by using a network-on-chip globally and a bus or crossbar locally, to balance performance and complexity.
4. Low power consumption
One bottleneck of the traditional uniprocessor is that power consumption rises as frequency rises, until the chip can no longer run normally. Early quad-core processor designs reduced processor power mainly by lowering the core frequency, but this limited the computing performance of the cores and did not fundamentally achieve high performance with low power. Excessive power consumption not only wastes energy; heat accumulation and excessive power density also affect system stability. Close to one billion transistors can now be integrated on one chip, and controlling the power consumption of so many on-chip resources while maintaining high performance has become an important problem.
Before quad-core processors appeared, low-power techniques fell into two groups: reducing dynamic power and reducing static power. Dynamic power is the energy consumed when the elements inside the processor operate normally, for example by capacitive charging and discharging, frequency switching, and state transitions of gates. Reducing dynamic power has always been a research focus, and the techniques are relatively mature. The main techniques today are multi-threshold technology, dynamic voltage scaling, and clock gating. Static power is the power consumed by leakage current; its characteristic is that an element consumes energy even when idle. It includes subthreshold leakage current and gate leakage current. The main techniques for reducing static power are channel length adjustment, register latch technology, and power gating.
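The benefit of dynamic voltage scaling can be estimated with the standard first-order model of CMOS switching power, P = a * C * V^2 * f. This model is general background and is not stated in the source; the function names and the 80% scaling factors below are assumptions for illustration.

```python
def dynamic_power(activity: float, capacitance_farads: float,
                  voltage_volts: float, frequency_hertz: float) -> float:
    """First-order switching power in watts: activity * C * V^2 * f."""
    return activity * capacitance_farads * voltage_volts ** 2 * frequency_hertz


def scaling_ratio(voltage_scale: float, frequency_scale: float) -> float:
    """Dynamic power after scaling, relative to the unscaled design."""
    # Power is quadratic in voltage and linear in frequency.
    return voltage_scale ** 2 * frequency_scale


if __name__ == "__main__":
    print(round(scaling_ratio(0.8, 0.8), 3))
```

Under this model, running at 80% of the voltage and 80% of the frequency cuts dynamic power to roughly 51% of its original value, which is why voltage and frequency are usually lowered together.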
The two groups of techniques above mainly address low-power design and technology development at the circuit level. They had already appeared in single-core processors before quad-core processors. With the arrival of quad-core processors, which have new structural and implementation features, researchers found methods for reducing power at new levels, such as heterogeneous structure design and dynamic thread assignment and migration. Heterogeneous structure design uses a heterogeneous structure to optimize the configuration of on-chip resources and raise the execution efficiency of the processor, so that the processor both delivers high performance and consumes less power. Dynamic thread assignment and migration uses the processing capacity of multiple cores to move excess load from one core to a lightly loaded core. Researchers also reduce the operating power of quad-core processors through the design and optimization of the operating system. For example, when there are few tasks, the operating system can shut down a core or lower the processor frequency, and reduce the disk rotation speed, so that the whole system consumes less power. Low-power design therefore spans the circuit level, the structure level, the algorithm level, and the operating system level, and must be considered from many aspects.
Summary of the invention
An object of the present invention is to provide a quad-core processor system built using a four-core structure and a data exchange method that can substantially exploit the parallelism of an algorithm and improve its execution efficiency.
To solve the above problems, the present invention provides a quad-core processor system built using a four-core structure, wherein:
The system processes data in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multidimensional data in parallel. The system includes four reduced-instruction-set microprocessor cores, wherein each microprocessor core includes:
an instruction memory for storing instructions;
an in-core data memory for storing data;
a central processing unit for performing the operations specified by the input instructions and data and for updating the register file inside the central processing unit and the external data memory.
Further, in the above system, the central processing unit includes:
a fetch module for fetching an instruction from the instruction memory in the current cycle according to the current pointer value and for computing the pointer value of the next cycle;
a decode module for decoding the instruction from the fetch module and generating all control signals required by the arithmetic logic unit, the comparator, and the register file module;
an arithmetic logic unit for performing operations, which receives data from the register file and the in-core data memory and sends write-enable signals and the data to be written to the register file and the in-core data memory;
a comparator for receiving the output of the register file and determining from it whether a jump occurs; if a jump occurs, the arithmetic logic unit computes the jump target address and sends it to the fetch module;
a register file for receiving data from the in-core data memory, the arithmetic logic unit, and the comparator, and for sending data to the arithmetic logic unit and the comparator;
a pipeline control module for controlling the pipeline, that is, for providing the corresponding stall signals to the fetch module, the decode module, the arithmetic logic unit, and the comparator according to the input signals from the execution modules, to keep the pipeline running smoothly.
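The interplay of the fetch module, the comparator, and the pipeline control module can be sketched as a single function that picks the pointer value for the next cycle. This is an illustrative model only: the 4-byte instruction width, the rule that a stall takes precedence over a jump, and the name `next_pointer` are assumptions that the source does not state.

```python
INSTRUCTION_BYTES = 4  # assumed instruction width; the source does not state it


def next_pointer(current: int, jump_taken: bool, jump_target: int, stall: bool) -> int:
    """Pointer value the fetch module would use in the next cycle."""
    if stall:
        # Pipeline control asserted a stall: fetch the same instruction again.
        return current
    if jump_taken:
        # The comparator detected a jump; the ALU supplied the target address.
        return jump_target
    # Sequential execution: advance to the following instruction.
    return current + INSTRUCTION_BYTES
```

For example, `next_pointer(0x100, False, 0, False)` advances to `0x104`, while a taken jump replaces the pointer with the address computed by the arithmetic logic unit.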
Further, in the above system, each register file includes a local register file and a shared register file, wherein:
the local register file is used for closed, in-core computation on data; during computation it has no interaction with data outside the core, and the local microprocessor core has full read and write permission on its local register file;
the shared register file is interconnected with the shared registers of the other microprocessor cores outside the core to realize data interaction among the microprocessor cores; the local microprocessor core has read permission on its shared register file, and write permission is assigned to the local microprocessor core or to other microprocessor cores as the application requires.
Further, in the above system, each local register file is divided into two groups, each with one read port and one write port, wherein the two groups receive different read address signals and provide the corresponding read values, and the two groups receive the same write address and data input signals to ensure that their contents are identical.
Further, in the above system, the microprocessor cores exchange data in the following two ways:
one way is that each microprocessor core accesses the external data memory through the multilayer bus structure;
the other way is that inter-core data exchange is realized through the shared register file of each microprocessor core.
Further, in the above system, the multilayer bus is a crossbar switch arranged between the microprocessor cores and the external data memories. The four microprocessor cores select external data memories over different buses. If the selected external data memories are all different, the four microprocessor cores transfer simultaneously; if the selected external data memory is the same, a preset ordering rule decides which microprocessor core transfers first.
Further, in the above system, the instruction set used by each microprocessor core includes arithmetic instructions, logic instructions, branch instructions, and memory-access instructions.
Further, in the above system, each microprocessor core also includes a configuration register for configuring how the shared register file of that core is connected, which improves the flexibility of the structure; a configuration instruction is also added to the instruction set used by each microprocessor core to support this configuration.
According to another aspect of the present invention, a data exchange method is provided that uses the above quad-core processor system. The method includes:
initializing the configuration register of each microprocessor core according to the parallel code of the specific application, that is, setting the configuration information of each microprocessor core's configuration register by means of configuration instructions;
exchanging data between the external data memories and the microprocessor cores: in the first exchange, the external data memory writes data into the register file of the microprocessor core, and afterwards data is exchanged repeatedly between the external data memories and the microprocessor cores;
realizing inter-core data exchange through the shared register file of each microprocessor core.
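The three steps of the method can be sketched as a small software model that computes a parallel sum. This is a simplified illustration, not the patented implementation: the configuration encoding (each core's configuration value names the destination core for its shared-register writes), the choice of a sum as the program segment, and the name `exchange_and_reduce` are all assumptions.

```python
NUM_CORES = 4
WORD_MASK = 0xFFFFFFFF  # registers are 32 bits wide


def exchange_and_reduce(external_memories):
    """Run the three steps of the method for a parallel sum; return core 0's result."""
    if len(external_memories) != NUM_CORES:
        raise ValueError("expected one external data memory per core")

    # Step 1: initialise each core's configuration register. Hypothetical
    # encoding: the value names the core whose shared registers this core writes.
    config = [0 for _ in range(NUM_CORES)]

    # Step 2: each external data memory writes its data into one core's registers.
    register_files = [[word & WORD_MASK for word in memory]
                      for memory in external_memories]

    # All cores run the same program segment on their own data: a local sum.
    partial_sums = [sum(registers) & WORD_MASK for registers in register_files]

    # Step 3: each core places its partial sum in one shared register of the
    # configured destination core (slot index = id of the writing core).
    shared = [[0] * NUM_CORES for _ in range(NUM_CORES)]
    for core_id, value in enumerate(partial_sums):
        shared[config[core_id]][core_id] = value

    # Core 0 reads its own shared registers and finishes the reduction.
    return sum(shared[0]) & WORD_MASK
```

Only four words cross between cores in this model, which matches the intent that shared registers carry small amounts of data while bulk data moves over the multilayer bus.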
Compared with the prior art, the present invention processes data in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multidimensional data in parallel. The system includes four reduced-instruction-set microprocessor cores, each of which includes: an instruction memory for storing instructions; an in-core data memory for storing data; and a central processing unit for performing the operations specified by the input instructions and data and for updating the register file inside the central processing unit and the external data memory. The invention thereby substantially exploits the parallelism of an algorithm and improves its execution efficiency.
In addition, the present invention establishes data paths among the cores of the quad-core processor through two data exchange mechanisms, shared registers and a multilayer bus built between the microprocessor cores and the external data memories, which improves performance when the quad-core processor handles data in parallel and improves data exchange efficiency.
Brief description of the drawings
Fig. 1 is a detailed structure diagram of the multilayer bus crossbar switch of one embodiment of the invention;
Fig. 2 is a schematic diagram of the register file of one embodiment of the invention;
Fig. 3 is a structure diagram of the shared register file of one embodiment of the invention;
Fig. 4 is a schematic diagram of data exchange between the microprocessor cores of one embodiment of the invention;
Fig. 5 is a flow chart of the data exchange method of one embodiment of the invention.
Specific embodiment
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the present invention is described in further detail below with reference to the drawings and specific embodiments.
Embodiment one
As shown in Fig. 1, the present invention provides a quad-core processor system built using a four-core structure. The system processes data in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multidimensional data in parallel. The system includes four reduced-instruction-set microprocessor cores, wherein each microprocessor core includes:
an instruction memory for storing instructions;
an in-core data memory for storing data;
a central processing unit for performing the operations specified by the input instructions and data and for updating the register file inside the central processing unit and the external data memory.
Preferably, the central processing unit includes:
a fetch module for fetching an instruction from the instruction memory in the current cycle according to the current pointer value and for computing the pointer value of the next cycle;
a decode module for decoding the instruction from the fetch module and generating all control signals required by the arithmetic logic unit, the comparator, and the register file module;
an arithmetic logic unit for performing operations, which receives data from the register file and the in-core data memory and sends write-enable signals and the data to be written to the register file and the in-core data memory;
a comparator for receiving the output of the register file and determining from it whether a jump occurs; if a jump occurs, the arithmetic logic unit computes the jump target address and sends it to the fetch module;
a register file for receiving data from the in-core data memory, the arithmetic logic unit, and the comparator, and for sending data to the arithmetic logic unit and the comparator;
a pipeline control module for controlling the pipeline, that is, for providing the corresponding stall signals to the fetch module, the decode module, the arithmetic logic unit, and the comparator according to the input signals from the execution modules, to keep the pipeline running smoothly.
Preferably, each register file includes a local register file and a shared register file, wherein:
the local register file is used for closed, in-core computation on data; during computation it has no interaction with data outside the core, and the local microprocessor core has full read and write permission on its local register file;
the shared register file is interconnected with the shared registers of the other microprocessor cores outside the core to realize data interaction among the microprocessor cores; the local microprocessor core has read permission on its shared register file, and write permission is assigned to the local microprocessor core or to other microprocessor cores as the application requires. Specifically, the modification that this embodiment makes to the register file is shown in Fig. 2, and the internal structure of the shared register file is shown in Fig. 3. Here the register file of the original Ou Pu Fa Er processor is divided into two parts: a local register file and a shared register file.
Preferably, each local register file is divided into two groups, each with one read port and one write port, wherein the two groups receive different read address signals and provide the corresponding read values, and the two groups receive the same write address and data input signals to ensure that their contents are identical. Specifically, the local register file includes 16 registers, numbered register 0 to register 15, each 32 bits wide. The local register file is used for closed in-core computation; during computation it has no interaction with data outside the core, and the local core has full read and write permission on it. The shared register file includes 4 registers, each 32 bits wide, numbered register 16 to register 19 (also called shared register 0 to shared register 3). The shared register file is connected to the shared registers of the other cores outside the core in a specific interconnection pattern to realize inter-core data interaction. The local core has read permission on the shared register file, and write permission is assigned to the local core or to other cores as the application requires. The shared register file has two read ports and four write ports and can accept write signals from up to four different cores.
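The register file organisation described above can be modelled in a few lines. The sketch below is a behavioural illustration under stated assumptions: the class and method names are hypothetical, write permission for each shared register is represented as a single core id, and port timing is not modelled.

```python
class RegisterFile:
    """Behavioural model: 16 mirrored local registers plus 4 shared registers."""

    LOCAL_COUNT = 16      # registers 0..15
    SHARED_COUNT = 4      # registers 16..19
    WORD_MASK = 0xFFFFFFFF

    def __init__(self, owner, shared_writers):
        if len(shared_writers) != self.SHARED_COUNT:
            raise ValueError("need one writer id per shared register")
        self.owner = owner
        # Two groups with identical contents, so two reads can be served at once.
        self.bank_a = [0] * self.LOCAL_COUNT
        self.bank_b = [0] * self.LOCAL_COUNT
        self.shared = [0] * self.SHARED_COUNT
        self.shared_writers = list(shared_writers)  # assumed permission encoding

    def write(self, reg, value, writer):
        value &= self.WORD_MASK
        if 0 <= reg < self.LOCAL_COUNT:
            if writer != self.owner:
                raise PermissionError("local registers belong to the owning core")
            # The same write address and data go to both groups.
            self.bank_a[reg] = value
            self.bank_b[reg] = value
        elif reg < self.LOCAL_COUNT + self.SHARED_COUNT:
            index = reg - self.LOCAL_COUNT
            if self.shared_writers[index] != writer:
                raise PermissionError("core lacks write permission")
            self.shared[index] = value
        else:
            raise IndexError("register number out of range")

    def read_pair(self, reg_a, reg_b):
        """Two reads in one step: the first from group A, the second from group B."""
        return self._read(self.bank_a, reg_a), self._read(self.bank_b, reg_b)

    def _read(self, bank, reg):
        if 0 <= reg < self.LOCAL_COUNT:
            return bank[reg]
        return self.shared[reg - self.LOCAL_COUNT]
```

Duplicating the local registers trades storage for read bandwidth: each group needs only one read port, yet an instruction can still obtain two operands at once.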
Preferably, each microprocessor core also includes a configuration register for configuring how the shared register file of that core is connected, which improves the flexibility of the structure; a configuration instruction is also added to the instruction set used by each microprocessor core to support this configuration.
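One way to picture such a configuration register is as a packed word that names, for each of the four shared registers, the core allowed to write it. The layout below (2 bits per shared register, 8 bits in total) is purely hypothetical, since the source does not describe the bit format; `encode_config` and `decode_config` are illustrative names.

```python
SHARED_COUNT = 4   # shared registers per core
BITS_PER_FIELD = 2  # enough to name one of four cores (assumed layout)


def encode_config(writers):
    """Pack four writer core ids (0..3) into one configuration word."""
    if len(writers) != SHARED_COUNT:
        raise ValueError("need one writer id per shared register")
    word = 0
    for index, core in enumerate(writers):
        if not 0 <= core < 4:
            raise ValueError("core id must be in 0..3")
        word |= core << (BITS_PER_FIELD * index)
    return word


def decode_config(word):
    """Unpack a configuration word into the four writer core ids."""
    return [(word >> (BITS_PER_FIELD * index)) & 0b11
            for index in range(SHARED_COUNT)]
```

Under this assumed layout, a configuration instruction would only need to carry an 8-bit immediate to redefine all four connections of a core.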
Preferably, the data exchange paths between the 4 microprocessor cores are as shown in Figure 4. Each microprocessor core exchanges data in the following two ways:
One way is for each microprocessor core to access the external data memories through the multi-layer bus structure;
The other way is to exchange inter-core data through the shared register file of each microprocessor core. Specifically, the shared register files establish direct data paths between the four cores, and the connection mode of each path can be defined flexibly through the configuration registers, so that small amounts of data can be exchanged between cores.
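As an illustration of the shared-register path, the following sketch models one shared register file per core with a configurable writer for each shared register. The encoding of the configuration register is not given in the text, so the `writer` list, the `configure` method and all other names here are assumptions made for the example.

```python
NUM_CORES = 4
SHARED_REGS = 4  # registers 16..19, i.e. shared registers 0..3


class SharedRegisterFile:
    def __init__(self, owner_core):
        self.owner_core = owner_core
        self.regs = [0] * SHARED_REGS
        # Hypothetical configuration register: writer[i] is the core that is
        # allowed to write shared register i. Default: the owning core itself.
        self.writer = [owner_core] * SHARED_REGS

    def configure(self, reg, writer_core):
        # Models the effect of a configuration instruction.
        self.writer[reg] = writer_core

    def write(self, core, reg, value):
        if self.writer[reg] != core:
            return False  # this core holds no write permission for the register
        self.regs[reg] = value & 0xFFFFFFFF
        return True

    def read(self, reg):
        # The owning core always has read permission.
        return self.regs[reg]


files = [SharedRegisterFile(c) for c in range(NUM_CORES)]
files[1].configure(0, writer_core=0)  # core 0 may write core 1's shared register 0
sent = files[1].write(core=0, reg=0, value=42)
blocked = files[1].write(core=2, reg=0, value=99)
received = files[1].read(0)
```

Here core 0 passes the value 42 to core 1 without touching the external data memory, while the write from core 2 is rejected because it was not configured as a writer.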
Preferably, the multi-layer bus is a crossbar switch arranged between the microprocessor cores and the external data memories. The 4 microprocessor cores select external data memories through different buses: if the selected external data memories are all different, the 4 microprocessor cores transfer simultaneously; if the selected external data memories are the same, a preset ordering rule decides which microprocessor core transfers first. Specifically, the multi-layer bus places a crossbar switch between masters and slaves, and multiple masters select slaves over different buses. If the selected slaves are all different, the masters can transfer simultaneously; if the selected slaves are the same, the ordering rule specified in the design decides which master transfers first.
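The crossbar behaviour described above can be sketched as a single grant decision per cycle. The function below is a simplified model, not the patented circuit: the function name, the dictionary format of the requests and the fixed default order (0, 1, 2, 3) are assumptions standing in for the "preset ordering rule".

```python
def crossbar_grant(requests, order=(0, 1, 2, 3)):
    """requests maps master id -> selected slave id.

    Returns (granted, stalled): masters whose transfer proceeds this cycle,
    and masters that must wait because their slave is already taken.
    """
    granted, stalled, busy = {}, [], set()
    for master in order:  # earlier position in `order` wins a conflict
        if master not in requests:
            continue
        slave = requests[master]
        if slave in busy:
            stalled.append(master)
        else:
            busy.add(slave)
            granted[master] = slave
    return granted, stalled


# All four cores pick different memories: everyone transfers in parallel.
no_conflict = crossbar_grant({0: 2, 1: 0, 2: 3, 3: 1})

# Cores 0, 1 and 3 all pick memory 1: only the first in the order proceeds.
conflict = crossbar_grant({0: 1, 1: 1, 2: 3, 3: 1})
```

In the conflicting case core 2 is unaffected, since its target memory is not contended.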
In detail, as shown in Figure 1, the multi-layer bus may include an input module 11, a decoder module 12, an arbiter module 13 and an oriented module 14, which together allow 4 masters (microprocessor cores) to access 4 slaves (external data memories) at the same time, wherein:
The input module 11 latches the read/write control and data signals from the microprocessor core. It takes the upper two bits of the address signal as the selection signal for the external data memory; it takes the lower 12 bits of the address signal, shifts them right by 2 bits, and outputs the result as a new address signal that matches the address input port of the external data memory, together with the write data and the write enable signal;
The decoder module 12 receives the external data memory selection signal from the input module, determines which external data memory the read or write operation of the microprocessor core selects, and sets the corresponding select output to 1. In addition, it receives the read data of the 4 external data memories and, according to the decoded selection signal, selects the correct read data and sends it to the microprocessor core;
The arbiter module 13 arbitrates bus ownership. When several masters (microprocessor cores) request the shared bus for data communication at the same time, the arbitration algorithm allocates the bus resource and decides which master may use it. Common arbitration algorithms include round-robin (polling), fixed priority, time-division multiplexing, lottery, and random-contention arbitration. In this design, to improve arbitration efficiency, round-robin is chosen as the arbitration order because the algorithm is relatively simple and its cost is relatively low. The module receives the read/write data, control signals and external data memory selection signals of the three channels 0, 1 and 2, arbitrates among them by the round-robin rule, outputs the one selected group of read/write control signals to the corresponding external data memory, and returns a hold signal to each signal source that lost the arbitration, informing it that its bus request failed this time.
The oriented module 14 selects, according to the pass-through selection signal, one group from the pass-through control and data signals and the arbitrated control and data signals, and outputs it to the external data memory. If the pass-through selection signal is 1, the pass-through signal group is output; otherwise the arbitrated signal group is output.
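The address slicing of the input module, the round-robin rule of the arbiter module and the final selection in the oriented module can be sketched together as follows. This is a behavioural illustration under stated assumptions: the text does not give the total address width, so `ADDR_WIDTH = 14` is assumed, and all function and class names are hypothetical.

```python
ADDR_WIDTH = 14  # assumption: the source does not state the address width


def split_address(addr):
    """Input module: derive the memory select and the memory-side address."""
    select = (addr >> (ADDR_WIDTH - 2)) & 0b11  # upper two bits pick 1 of 4 memories
    mem_addr = (addr & 0xFFF) >> 2              # lower 12 bits, shifted right by 2
    return select, mem_addr


class RoundRobinArbiter:
    """Arbiter module: rotate priority among the arbitrated channels 0, 1, 2."""

    def __init__(self, channels=3):
        self.channels = channels
        self.last = channels - 1  # channel 0 is examined first after reset

    def arbitrate(self, requesting):
        # Start from the channel after the previous winner.
        for offset in range(1, self.channels + 1):
            ch = (self.last + offset) % self.channels
            if ch in requesting:
                self.last = ch
                held = sorted(set(requesting) - {ch})  # losers receive a hold signal
                return ch, held
        return None, []


def oriented_select(passthrough_sel, passthrough_group, arbitrated_group):
    """Oriented module: 1 selects the pass-through group, else the arbitrated one."""
    return passthrough_group if passthrough_sel == 1 else arbitrated_group


decoded = split_address(0b10_0000_0001_0100)
arb = RoundRobinArbiter()
rounds = [arb.arbitrate({0, 2}) for _ in range(3)]
chosen = oriented_select(0, "pass-through", "arbitrated")
```

With channels 0 and 2 requesting in every round, the winner alternates between them, which is the fairness property that motivates round-robin over a fixed priority.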
Preferably, the instruction set used by each microprocessor core includes arithmetic instructions, logic instructions, branch instructions and memory access instructions.
For other details of embodiment one, refer to the corresponding part of embodiment one; they are not repeated here.
This embodiment makes full use of the parallelism of the algorithm and improves its execution efficiency. A four-core processor is built with a four-core structure. Each microprocessor core takes a reduced instruction set architecture microprocessor as its prototype, with corresponding improvements: introducing shared registers, adding a configuration register and configuration instructions, adding a left-shift function to the ALU, and modifying the position of the branch instruction. Two data exchange mechanisms, the shared registers and the multi-layer bus built between the microprocessor cores and the external data memories, establish the data paths between the cores of the four-core processor, which improves performance when the four-core processor processes data in parallel and improves data exchange efficiency.
Embodiment two
As shown in Figure 5, the present invention also provides a data exchange method that uses the four-core processor system described in embodiment one. The method includes:
Step S1: initialize the configuration register of each microprocessor core according to the parallel code of the specific application, that is, set the configuration information in the configuration register of each microprocessor core by means of configuration instructions. Specifically, the configuration registers are initialized according to the parallel code of the specific application, and configuration instructions write the configuration information into the configuration register inside each of the four cores;
Step S2: exchange data between the external data memories and the microprocessor cores. In the first exchange, the external data memory writes data into the register file of the microprocessor core; afterwards, data may be exchanged repeatedly between the external data memories and the microprocessor cores. Specifically, the first operation is the data memory writing data into the registers of the core, and during computation there may be repeated data exchanges between the memories and the cores;
Step S3: exchange inter-core data through the shared register file of each microprocessor core. Specifically, for in-core computation and inter-core data exchange, the shared registers between the microprocessor cores likewise provide a path for data exchange. During computation, the algorithm should be analysed so that this data path is used as much as possible, which improves computing efficiency. Depending on the application, steps S2 and S3 may alternate.
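Steps S1 to S3 can be walked through with a toy reduction in which every core runs the same program on its own data and the partial results are collected through shared registers. This is a sketch only: the memory contents are made-up values, and the configuration encoding (`writer_of`) is a hypothetical stand-in for the configuration register, whose format the text does not specify.

```python
NUM_CORES = 4

# External data memories, one per core in this sketch (values are made up).
external_mem = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]

# S1: initialise the configuration registers.
# Hypothetical encoding: writer_of[r] is the core allowed to write
# shared register r of core 0.
writer_of = {r: r for r in range(NUM_CORES)}
shared_core0 = [0] * NUM_CORES

# S2: each core loads its operands from external memory into its registers.
local_regs = [list(external_mem[c]) for c in range(NUM_CORES)]

# Every core executes the same program section on its own data.
partial = [sum(regs) for regs in local_regs]

# S3: each core writes its partial result into the shared register of core 0
# for which it holds write permission.
for core in range(NUM_CORES):
    reg = core
    if writer_of[reg] == core:
        shared_core0[reg] = partial[core] & 0xFFFFFFFF

# Core 0 combines the four values it received over the shared registers.
total = sum(shared_core0)
```

Only the initial operands travel over the bus; the four partial sums move between cores through the shared registers, which is the kind of small inter-core exchange that path is meant for.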
In summary, the present invention makes full use of the parallelism of the algorithm and improves its execution efficiency. A four-core processor is built with a four-core structure. Each microprocessor core takes a reduced instruction set architecture microprocessor as its prototype, with corresponding improvements: introducing shared registers, adding a configuration register and configuration instructions, adding a left-shift function to the ALU, and modifying the position of the branch instruction. Two data exchange mechanisms, the shared registers and the multi-layer bus built between the microprocessor cores and the external data memories, establish the data paths between the cores of the four-core processor, which improves performance when the four-core processor processes data in parallel and improves data exchange efficiency.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and for identical or similar parts the embodiments may be referred to one another. Since the system disclosed in an embodiment corresponds to the method disclosed in that embodiment, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.
Claims (8)
1. A four-core processor system built with a four-core structure, characterized in that data is processed in a single program block, multiple data mode, that is, at any one time all microprocessor cores strictly execute the same program section and process multidimensional data in parallel; the system includes 4 microprocessor cores of reduced instruction set architecture, wherein
each microprocessor core includes:
an instruction memory for storing instructions;
an in-core data memory for storing data;
a central processing unit for performing the corresponding operations according to the input instructions and data, and updating the register file inside the central processing unit and the external data memory;
and the central processing unit includes:
a fetch module for fetching the instruction of the current cycle from the instruction memory according to the current pointer value, and calculating the pointer value for the next cycle;
a decode module for decoding the instruction from the fetch module and generating all control signals needed by the ALU, the comparator and the register file module;
an ALU for computation, which receives data from the register file and the in-core data memory, and sends write enable signals and data to be written to the register file and the in-core data memory;
a comparator for receiving the output of the register file and determining from the received output whether a jump occurs; if a jump occurs, the ALU calculates the jump address and sends this address to the fetch module;
a register file for receiving data from the in-core data memory, the ALU and the comparator, and for sending data to the ALU and the comparator;
a pipeline control module for controlling the pipeline, that is, providing corresponding stall signals to the fetch module, the decode module, the ALU and the comparator according to the input signals from the microprocessor core, to ensure smooth operation of the pipeline.
2. The four-core processor system built with a four-core structure as claimed in claim 1, characterized in that each register file includes a local register file and a shared register file, wherein
the local register file is used for closed computation on data inside the core and has no interaction with data outside the core during computation, and the local microprocessor core has full read and write permission to its local register file;
the shared register file is interconnected with the shared registers of the other microprocessor cores to exchange data between the microprocessor cores; the local microprocessor core has read permission to its shared register file, and write permission is assigned to the local microprocessor core or to other microprocessor cores as the application requires.
3. The four-core processor system built with a four-core structure as claimed in claim 2, characterized in that each local register file is divided into two groups, each with one read port and one write port, wherein the two groups of register files receive different read address signals and return the corresponding read values; the two groups of register files receive the same write address and data input signals, which keeps the contents of the two groups consistent.
4. The four-core processor system built with a four-core structure as claimed in claim 3, characterized in that each microprocessor core exchanges data in the following two ways:
one way is for each microprocessor core to access the external data memories through the multi-layer bus structure;
the other way is to exchange inter-core data through the shared register file of each microprocessor core.
5. The four-core processor system built with a four-core structure as claimed in claim 4, characterized in that the multi-layer bus is a crossbar switch arranged between the microprocessor cores and the external data memories; the 4 microprocessor cores select external data memories through different buses; if the selected external data memories are all different, the 4 microprocessor cores transfer simultaneously; if the selected external data memories are the same, a preset ordering rule decides which microprocessor core transfers first.
6. The four-core processor system built with a four-core structure as claimed in claim 5, characterized in that the instruction set used by each microprocessor core includes arithmetic instructions, logic instructions, branch instructions and memory access instructions.
7. The four-core processor system built with a four-core structure as claimed in claim 6, characterized in that each microprocessor core also includes a configuration register for configuring the connection mode of the shared register file of that microprocessor core, to make the four-core structure more flexible, and configuration instructions are added to the instruction set used by each microprocessor core to support the implementation of the configuration.
8. A data exchange method, characterized in that it uses the four-core processor system as claimed in claim 7, the method including:
initializing the configuration register of each microprocessor core according to the parallel code of the specific application, that is, setting the configuration information in the configuration register of each microprocessor core by means of configuration instructions;
exchanging data between the external data memories and the microprocessor cores, wherein in the first exchange the external data memory writes data into the register file of the microprocessor core, and afterwards data may be exchanged repeatedly between the external data memories and the microprocessor cores;
exchanging inter-core data through the shared register file of each microprocessor core.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410014522.7A CN103744644B (en) | 2014-01-13 | 2014-01-13 | The four core processor systems built using four nuclear structures and method for interchanging data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103744644A | 2014-04-23 |
CN103744644B | 2017-03-01 |
Family
ID=50501664
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410014522.7A Active CN103744644B (en) | 2014-01-13 | 2014-01-13 | The four core processor systems built using four nuclear structures and method for interchanging data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103744644B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10409606B2 (en) * | 2015-06-26 | 2019-09-10 | Microsoft Technology Licensing, Llc | Verifying branch targets |
US11755484B2 (en) | 2015-06-26 | 2023-09-12 | Microsoft Technology Licensing, Llc | Instruction block allocation |
US10346168B2 (en) | 2015-06-26 | 2019-07-09 | Microsoft Technology Licensing, Llc | Decoupled processor instruction window and operand buffer |
US10565670B2 (en) * | 2016-09-30 | 2020-02-18 | Intel Corporation | Graphics processor register renaming mechanism |
CN108694441B (en) * | 2017-04-07 | 2022-08-09 | 上海寒武纪信息科技有限公司 | Network processor and network operation method |
CN108536642A (en) * | 2018-06-13 | 2018-09-14 | 北京比特大陆科技有限公司 | Big data operation acceleration system and chip |
CN112740192B (en) * | 2018-10-30 | 2024-04-30 | 北京比特大陆科技有限公司 | Big data operation acceleration system and data transmission method |
WO2021134521A1 (en) * | 2019-12-31 | 2021-07-08 | 北京希姆计算科技有限公司 | Storage management apparatus and chip |
CN113759246B (en) * | 2020-05-22 | 2024-01-30 | 北京机械设备研究所 | Dual-core processor-based motor drive test method and motor driver |
CN112834819B (en) * | 2021-01-04 | 2024-04-02 | 杭州万高科技股份有限公司 | Digital signal processing device and method for electric energy metering chip |
CN112834820B (en) * | 2021-04-09 | 2024-01-23 | 杭州万高科技股份有限公司 | Electric energy meter and metering device thereof |
CN114398299B (en) * | 2021-12-24 | 2024-05-10 | 北京四方继保工程技术有限公司 | Data processing method of four-core cooperative measurement and control processor and processor |
CN117132450B (en) * | 2023-10-24 | 2024-02-20 | 芯动微电子科技(武汉)有限公司 | Computing device capable of realizing data sharing and graphic processor |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101876892A (en) * | 2010-05-20 | 2010-11-03 | 复旦大学 | Communication and multimedia application-oriented single instruction multidata processor circuit structure |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7810093B2 (en) * | 2003-11-14 | 2010-10-05 | Lawrence Livermore National Security, Llc | Parallel-aware, dedicated job co-scheduling within/across symmetric multiprocessing nodes |
Non-Patent Citations (1)
Title |
---|
一种基于可配置共享寄存器堆的多核处理器核间数据交换结构设计;方颖等;《微电子学与计算机》;20110430;第28卷(第4期);第65-72页 * |
Also Published As
Publication number | Publication date |
---|---|
CN103744644A (en) | 2014-04-23 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||