CN103744644B - Quad-core processor system built using a quad-core structure, and data exchange method - Google Patents

Publication number
CN103744644B
Authority
CN
China
Prior art keywords
data
micro
processor
core
kernel
Prior art date
Legal status
Active
Application number
CN201410014522.7A
Other languages
Chinese (zh)
Other versions
CN103744644A (en)
Inventor
谢憬
王琴
郭筝
王超
毛志刚
Current Assignee
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Application filed by Shanghai Jiaotong University
Priority to CN201410014522.7A
Publication of CN103744644A
Application granted
Publication of CN103744644B

Landscapes

  • Multi Processors (AREA)

Abstract

The present invention provides a quad-core processor system built using a quad-core structure, and a data exchange method. The system processes data in a single-program-block, multiple-data mode and includes four microprocessor cores with a reduced instruction set architecture. Each microprocessor core includes: an instruction memory, for storing instructions; an in-core data memory, for storing data; and a central processing unit, for executing the corresponding operations according to the input instructions and data and for updating the register file inside the central processing unit and the external data memory. The invention exploits the parallelism of the algorithm to improve its execution efficiency. In addition, it establishes data paths between the cores of the quad-core processor through two data exchange mechanisms, namely shared registers and a multilayer bus built between the microprocessor cores and the external data memory, which improves the performance of the quad-core processor during parallel data processing and improves data exchange efficiency.

Description

Quad-core processor system built using a quad-core structure, and data exchange method
Technical field
The present invention relates to a quad-core processor system built using a quad-core structure, and to a data exchange method.
Background art
Quad-core processors are also called on-chip multiprocessors or chip multiprocessors. The design idea was first proposed by Stanford University in the United States in 1996: integrate multiple cores inside a single chip to improve processor performance. Each processing core of a quad-core processor has a relatively simple structure, and by exploiting the advantage of multiple cores the processor can execute several times as many threads or tasks simultaneously as a single-core processor, which greatly improves parallel performance. At the same time, using on-chip shared resources effectively raises the communication rate and reduces power consumption. These features give quad-core processors a considerable advantage.
Quad-core technology represents an innovation in the development of computer technology. After more than ten years of development, the application range of quad-core processors covers many fields, including multimedia computing, embedded devices, personal computers, commercial servers, and high-performance computers, and they have become the mainstream of processor development. Compared with single-core processors, quad-core processors have the following significant advantages:
1. Simple control logic: Compared with superscalar microprocessor architectures and very long instruction word (VLIW) architectures, the control logic of a quad-core processor structure is considerably less complex, and the corresponding hardware implementation is much simpler.
2. High clock frequency: Because the control logic of a quad-core processor structure is relatively simple and contains few global signals, wire delay has less effect on it. Under the same process conditions, a hardware implementation of a quad-core processor can therefore reach a higher operating frequency than superscalar and VLIW microprocessors.
3. Low power consumption: Dynamically adjusting voltage and frequency, optimizing load distribution, and similar measures can effectively reduce the power consumption of a quad-core processor.
4. Short design and verification cycle: Microprocessor manufacturers generally use an existing, mature single-core processor as the processor core, which shortens the design and verification cycle and saves research and development cost.
The quad-core processor structure not only has many advantages, such as large performance potential, high integration, high parallelism, simple structure, and convenient design verification, but can also inherit some results of traditional uniprocessor research, such as simultaneous multithreading, wide-issue instructions, and voltage-scaling low-power techniques. Nevertheless, the quad-core processor is still a new structure, and problems not encountered before have appeared in quad-core structure design and application development. These problems pose challenges for the future of quad-core processors.
The following issues in the current evolution of quad-core technology deserve particular attention.
1. Selection of core type
The core structures of current quad-core processors are mainly of two kinds: homogeneous and heterogeneous.
A homogeneous structure uses a symmetric design; its principle is simple and the hardware is easier to implement. Current mainstream dual-core and quad-core processors essentially all use a homogeneous structure. However, raising processor performance by adding central processor cores has a limit. Once that limit is reached, performance no longer improves as the number of cores increases. This is the well-known Amdahl's law: even if central processor cores of the same type are added continuously to strengthen parallel processing capability, the processing performance of the whole system is still constrained by the part of the software that must execute serially. The problems of homogeneous design are: as the number of cores grows, how to keep the data of each core consistent; how to satisfy the cores' memory access and input/output access requirements; how to select a processor that is balanced in all aspects of performance, small in area, and low in power; and how to balance load and coordinate tasks across several processors.
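The limit that Amdahl's law places on adding cores can be illustrated numerically. The sketch below uses the standard form of the law, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction of the work and n is the number of cores. The function name and the sample fraction 0.9 are illustrative assumptions and do not come from the patent.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part takes the same time regardless of how many cores exist.
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative assumption: 90% of the program can run in parallel.
    for n in (1, 2, 4, 16, 1024):
        print(n, round(amdahl_speedup(0.9, n), 2))
```

With a 10% serial part, the speedup stays below 10 however many cores are added, which is the ceiling the paragraph above describes.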
Heterogeneous means that different types of cores are used within one processor, such as central processor cores and programmable cores. Compared with a homogeneous structure, the advantage of a heterogeneous structure is that it optimizes the internal structure of the processor by combining cores with different characteristics, which optimizes processor performance and can effectively reduce power consumption. For example, floating-point operations and signal processing work that central processor cores are not good at can be executed by other programmable cores integrated on the same chip. A heterogeneous structure also has difficulties, however. First, which different cores to combine, and how to divide and implement tasks among the cores. Second, whether the structure scales well, which is also limited by the number of cores. Third, the design and implementation of the processor instruction system is also a problem. Because the instruction systems used by the different cores matter to the implementation of the system, whether these different cores use the same instruction system or different ones, and whether an operating system can run on them, also need to be considered.
2. On-chip memory structure design
The speed gap between the processor and main memory has always been a problem that processor structure design must take into account; this is the well-known "memory wall" problem. The architecture of the memory system directly affects overall system performance and has a large influence on the size, power consumption, layout, performance, and operating efficiency of the whole chip. In the past, a uniprocessor could largely solve this problem with a cache structure, which ensured that processor performance could be realized. In the quad-core era, however, the problem caused by the speed gap between the cores and main memory has become serious. Because the number of cores inside the processor has increased, the demand for access to main memory has grown, and the cache hierarchy and access bandwidth of the uniprocessor era cannot keep up with the access requirements of quad-core processors. A corresponding memory structure must therefore be designed for quad-core processors, and the efficiency of the memory system must be addressed.
For memory system design, most current processors use a cache design, and some processors use an on-chip memory structure. The advantages of a cache structure are that the hardware is easy to design and implement and that application development and programming are easy; the disadvantages are that cached data must be kept coherent and that the structure is hard to extend. For cache coherence, the main solutions are bus snooping protocols and directory-based protocols. In a snooping protocol, each cache uses a snooper to monitor the bus at all times and receive coherence commands; unfortunately, it suits only a small number of cores. A directory protocol uses a directory to record the state of its memory blocks in other caches and uses point-to-point communication to maintain coherence; its disadvantages are a high implementation cost and a performance bottleneck when the directory is accessed concurrently. Besides these hardware coherence algorithms, there are also multiprocessor-based software coherence algorithms, but whether they can serve as the cache coherence mechanism of a quad-core structure requires further study. At present, most quad-core processors use a bus snooping protocol. On-chip memory moves memory that was off chip onto the chip and is addressed uniformly with off-chip memory, so it avoids cache misses and coherence problems; but because it uses a memory structure, its access latency is larger than that of a cache. Some researchers currently build on-chip memory from high-speed dynamic random access memory to narrow the performance gap with caches. Besides the choice of storage structure, the questions in memory structure design also include: how large the memory should be; at which level data sharing and communication are best implemented; at which level the cache coherence problem is best solved; and how the storage structure supports multithreaded applications.
3. On-chip communication
Although the cores on a quad-core chip each execute their own code, different cores may need to share and synchronize data, so the performance of the on-chip communication architecture directly affects processor performance. There are currently three main modes of on-chip communication: shared bus, crossbar switch interconnect, and network-on-chip.
In a shared-bus structure, the on-chip cores, input/output ports, and memory communicate through a shared level-2 or level-3 cache, or through a bus connecting the cores. The strength of a bus structure is that it is relatively simple and easy to design and implement; most current dual-core and quad-core processors use it. The bus is the communication backbone of existing chip architectures, but as circuit scale grows it becomes the bottleneck of chip design. Although a bus can connect multiple communicating parties effectively, bus address resources cannot expand without limit as computing units are added. Although a bus can be shared by multiple users, one bus cannot support more than one pair of users communicating at the same time; that is, the serial access mechanism creates a communication bottleneck. In addition, on-chip communication is a main source of power consumption, and the power consumed by a large clock network and by the bus accounts for the great majority of total chip power. A bus network therefore suits a small number of cores. Typical shared-bus processors include Xin Dela, Intel's Core Duo, and IBM's POWER4/POWER5.
A crossbar switch interconnect consists of crossbar switches and interface logic. Compared with a bus structure, its advantages are more data channels and greater access bandwidth; its weakness is that the crossbar occupies a larger chip area, and performance also declines as the number of cores grows, so it too suits only a small number of cores. For example, AMD's Athlon X2 dual-core processor uses a crossbar switch to control communication between the cores and the outside.
A network-on-chip comprises two subsystems, computation and communication. The computation subsystem performs "computation" tasks in the broad sense; its elements may be central processing units or on-chip systems in the existing sense, IP cores with various special functions, memory arrays, reconfigurable hardware, and so on. The communication subsystem connects the microprocessor cores and provides high-speed communication between computing resources. The network formed by the communication nodes and the interconnect between them is called the on-chip communication network. It borrows the communication model of distributed computing systems and replaces the traditional on-chip bus with routing and packet switching to carry out communication. A network-on-chip has much in common with the interconnect of a parallel computer: it supports packet communication, it is scalable, and it provides a transparent communication service. It also differs: network-on-chip technology supports simultaneous access and offers high reliability and high reusability. Compared with bus and crossbar structures, a network-on-chip can connect more intellectual property (IP) core components and offers high reliability, strong scalability, and lower power consumption, so it is regarded as the interconnect technology for larger-scale on-chip quad-core processors. Current networks-on-chip mainly use interconnect structures such as the two-dimensional mesh and the torus. The design problem for a network-on-chip is to find the best balance between network overhead and the degree of coupling between the cores while also considering the scalability of the network. One network processor, for example, uses an on-chip two-dimensional mesh structure; with an integrated high-speed network and an optimized routing algorithm, its maximum on-chip inter-core communication delay does not exceed 6 cycles, and the structure is highly scalable.
Although each of these three structures has its own strengths and weaknesses, they can also be combined, for example by using a network-on-chip globally and a bus or crossbar structure locally, to balance performance and complexity.
4. Low power consumption
One bottleneck of the traditional uniprocessor is that power consumption rises with frequency until the chip can no longer run normally. In early quad-core processor designs, processor power was reduced mainly by lowering the core frequency, but this limited the computing performance of the cores and did not fundamentally achieve high performance with low power. Excessive power consumption not only wastes energy; heat accumulation and excessive power density also affect system stability. A single chip can now integrate close to one billion transistors, and how to control the power of so many on-chip resources while maintaining high performance has become an important problem.
Before quad-core processors appeared, low-power techniques fell mainly into two areas: reducing dynamic power and reducing static power. Dynamic power is the energy consumed while the components inside the processor operate normally, for example by capacitive charging and discharging, frequency switching, and state transitions of logic gates. Reducing dynamic power has long been a focus of research, and the techniques are relatively mature. The main techniques today include multi-threshold process technology, dynamic voltage scaling, and clock gating. Static power is the power caused by leakage current; its characteristic is that a component consumes energy even when idle. It specifically includes subthreshold leakage current and gate leakage current. The main techniques for reducing static power include channel length adjustment, register latch techniques, and power gating.
Both kinds of technique above carry out low-power design and development mainly at the circuit level. They had already appeared in single-core processors before quad-core processors emerged. With the arrival of quad-core processors, whose structure and implementation have new characteristics, researchers have found ways to reduce power at new levels, such as heterogeneous structure design and dynamic thread dispatch and migration. Heterogeneous structure design uses a heterogeneous structure to optimize the configuration of on-chip resources and raise the execution efficiency of the processor, so that the processor achieves high performance while also reducing power. Dynamic thread dispatch and migration uses the processing capability of multiple cores to move excess load from one core to a lightly loaded core. Researchers also reduce the operating power of quad-core processors through operating system design and optimization. For example, when there are few tasks, the operating system can shut down a core or lower the processor frequency, and reduce the disk rotation speed, so that the whole system consumes less. Low-power design therefore covers many levels, including the circuit level, structure level, algorithm level, and operating system level, and is a problem that must be considered from many aspects.
Summary of the invention
An object of the present invention is to provide a quad-core processor system built using a quad-core structure, and a data exchange method, that can make substantial use of the parallelism of an algorithm and improve its execution efficiency.
To solve the above problems, the present invention provides a quad-core processor system built using a quad-core structure, in which:
Data is processed in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multidimensional data in parallel. The system includes four microprocessor cores with a reduced instruction set architecture, wherein
Each microprocessor core includes:
An instruction memory, for storing instructions;
An in-core data memory, for storing data;
A central processing unit, for executing the corresponding operations according to the input instructions and data, and for updating the register file inside the central processing unit and the external data memory.
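The single-program-block, multiple-data mode described above can be pictured with a small software model in which four cores run the same sequence of steps, each on its own in-core data. This is a minimal sketch under stated assumptions: the `Core` class, the three step functions, the register count of 16, and the sample inputs are all hypothetical and are not taken from the patent's instruction set.

```python
from dataclasses import dataclass, field
from typing import Callable, List

NUM_CORES = 4  # the system described above has four microprocessor cores


@dataclass
class Core:
    core_id: int
    data_memory: List[int]  # in-core data memory, private to this core
    registers: List[int] = field(default_factory=lambda: [0] * 16)


Step = Callable[[Core], None]  # hypothetical stand-in for one instruction


def load_operands(core: Core) -> None:
    core.registers[1] = core.data_memory[0]
    core.registers[2] = core.data_memory[1]


def sum_of_squares(core: Core) -> None:
    core.registers[3] = core.registers[1] ** 2 + core.registers[2] ** 2


def store_result(core: Core) -> None:
    core.data_memory.append(core.registers[3])


PROGRAM: List[Step] = [load_operands, sum_of_squares, store_result]


def run_lockstep(cores: List[Core], program: List[Step]) -> None:
    for step in program:    # one program segment shared by all cores
        for core in cores:  # every core performs the same step on its own data
            step(core)


def run_example() -> List[int]:
    inputs = [[1, 2], [3, 4], [5, 6], [7, 8]]  # illustrative per-core data
    cores = [Core(i, list(inputs[i])) for i in range(NUM_CORES)]
    run_lockstep(cores, PROGRAM)
    return [core.data_memory[-1] for core in cores]
```

Calling `run_example()` applies the same three steps to four different operand pairs. The inner loop over cores only imitates in software what the hardware would do simultaneously.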
Further, in the above system, the central processing unit includes:
An instruction fetch module, for fetching the instruction for the current cycle from the instruction memory according to the current pointer value, and for calculating the pointer value for the next cycle;
A decoding module, for decoding the instruction from the instruction fetch module and generating all the control signals required by the arithmetic logic unit, the comparator, and the register file module;
An arithmetic logic unit, for computation; it receives data from the register file and the in-core data memory, and sends write enable signals and the data to be written to the register file and the in-core data memory;
A comparator, for receiving the output of the register file and judging from that output whether a jump occurs; if a jump occurs, the jump address is calculated by the arithmetic logic unit and sent to the instruction fetch module;
A register file, for receiving data from the in-core data memory, the arithmetic logic unit, and the comparator, and for sending data to the arithmetic logic unit and the comparator;
A pipeline control module, for controlling the pipeline, that is, providing the corresponding stall signals to the instruction fetch module, the decoding module, the arithmetic logic unit, and the comparator according to input signals from the execution module, to ensure that the pipeline runs smoothly.
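How the instruction fetch module, the comparator, and the pipeline control module interact when choosing the next pointer value can be sketched as a single function. This is an illustration only: the 4-byte instruction step, the rule that a stall takes priority over a jump, and the function name are assumptions that the patent text does not state.

```python
WORD_BYTES = 4  # assumption: fixed 4-byte instructions; the patent gives no width


def next_pointer(pc: int, stall: bool, jump_taken: bool, jump_address: int) -> int:
    """Return the instruction pointer for the next cycle.

    stall        -- stall signal from the pipeline control module
    jump_taken   -- decision produced by the comparator
    jump_address -- address computed by the arithmetic logic unit
    """
    if stall:
        return pc            # assumption: a stalled fetch stage holds its pointer
    if jump_taken:
        return jump_address  # comparator decided to jump: redirect the fetch
    return pc + WORD_BYTES   # default: fall through to the next instruction
```

For example, `next_pointer(0x100, False, True, 0x200)` redirects the fetch to `0x200`, while the same call with `stall=True` keeps the pointer at `0x100`.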
Further, in the above system, each register file includes a local register file and a shared register file, wherein
The local register file is used for closed computation on in-core data; during computation no interaction occurs with data outside the core, and the local microprocessor core has full read and write permission to its local register file;
The shared register file is interconnected with the shared registers of the other microprocessor cores outside the core to realize data interaction between the microprocessor cores; the local microprocessor core has read permission to its shared register file, and write permission is assigned to the local microprocessor core or to another microprocessor core according to the needs of the application.
Further, in the above system, each local register file is divided into two groups, each with one read port and one write port, wherein the two groups of register files receive different read address signals and provide the corresponding read values, and the two groups receive the same write address and data input signals, to ensure that the contents of the two groups are consistent.
Further, in the above system, each microprocessor core exchanges data in the following two ways:
One way is that each microprocessor core accesses the external data memory through the multilayer bus structure;
The other way is that data is exchanged between cores through the shared register file of each microprocessor core.
Further, in the above system, the multilayer bus is a crossbar switch arranged between the microprocessor cores and the external data memories. The four microprocessor cores select external data memories over different buses. If the selected external data memories are all different, the four microprocessor cores transfer simultaneously; if the same external data memory is selected, a preset ordering rule determines which microprocessor core transfers first.
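The crossbar behaviour described above, where transfers are simultaneous when the cores target different memories and ordered when they collide, can be sketched as a one-cycle arbiter. The fixed-priority list stands in for the "preset ordering rule", whose actual form the patent does not specify; the function name and the memory numbering are also assumptions.

```python
from typing import Dict, List, Optional


def arbitrate(requests: Dict[int, Optional[int]], priority: List[int]) -> Dict[int, int]:
    """Decide one bus cycle of a crossbar between cores and external memories.

    requests -- core id -> index of the external data memory it selects,
                or None if the core is idle this cycle
    priority -- core ids, highest priority first (hypothetical ordering rule)
    Returns a mapping: memory index -> core granted access this cycle.
    """
    grants: Dict[int, int] = {}
    for core in priority:
        memory = requests.get(core)
        if memory is None:
            continue
        if memory not in grants:  # memory still free: this core transfers now
            grants[memory] = core
        # otherwise a higher-priority core already holds it; this core waits
    return grants
```

When all four cores select different memories, every request is granted in the same cycle. When cores 0 and 1 both select memory 2 with priority `[0, 1, 2, 3]`, only core 0 is granted and core 1 has to retry in a later cycle.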
Further, in the above system, the instruction set used by each microprocessor core includes arithmetic operation instructions, logical operation instructions, branch instructions, and memory access instructions.
Further, in the above system, each microprocessor core also includes a configuration register, used to configure the connection mode of the shared register file of the microprocessor core it belongs to, which improves the flexibility of the structure; configuration instructions are added to the instruction set used by each microprocessor core to support this configuration.
According to another aspect of the present invention, a data exchange method is provided that uses the above quad-core processor system. The method includes:
Initializing the configuration register of each microprocessor core according to the parallel code of the specific application, that is, setting the configuration information in the configuration register of each microprocessor core by means of configuration instructions;
Exchanging data between the external data memory and the microprocessor cores: in the first data exchange the external data memory writes data into the register files of the microprocessor cores, after which data is exchanged repeatedly between the external data memory and the microprocessor cores;
Exchanging data between cores through the shared register file of each microprocessor core.
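The three steps of the method can be traced in a small software model. This is a rough sketch under stated assumptions: the `route` list stands in for the contents of the configuration registers, each core holds one value, and the final addition is an arbitrary illustrative computation. None of these details come from the patent.

```python
from typing import List

NUM_CORES = 4


def exchange_round(external_memory: List[int], route: List[int]) -> List[int]:
    """Model one configure / load / inter-core exchange sequence.

    route[i] is the core whose shared register core i is allowed to write,
    a hypothetical encoding of the configuration register contents.
    """
    if sorted(route) != list(range(NUM_CORES)):
        raise ValueError("route must name each destination core exactly once")

    # Step 1: initialise the configuration registers from the parallel code.
    config = list(route)

    # Step 2: first exchange, external data memory -> register files.
    local = [external_memory[i] for i in range(NUM_CORES)]

    # Step 3: inter-core exchange through the shared register files.
    shared = [0] * NUM_CORES
    for source in range(NUM_CORES):
        shared[config[source]] = local[source]

    # Illustrative computation: each core combines its own value with the
    # value it received from another core.
    return [local[i] + shared[i] for i in range(NUM_CORES)]
```

With `route = [1, 2, 3, 0]` the cores pass their values around a ring, so each core ends up with its own value plus that of its predecessor.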
Compared with the prior art, the present invention processes data in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multidimensional data in parallel. The system includes four microprocessor cores with a reduced instruction set architecture, and each microprocessor core includes: an instruction memory, for storing instructions; an in-core data memory, for storing data; and a central processing unit, for executing the corresponding operations according to the input instructions and data and for updating the register file inside the central processing unit and the external data memory. This makes substantial use of the parallelism of the algorithm and improves its execution efficiency.
In addition, the present invention establishes data paths between the cores of the quad-core processor through two data exchange mechanisms, namely shared registers and a multilayer bus built between the microprocessor cores and the external data memory, which improves the performance of the quad-core processor during parallel data processing and improves data exchange efficiency.
Brief description of the drawings
Fig. 1 is a detailed structural diagram of the multilayer bus crossbar switch of an embodiment of the present invention;
Fig. 2 is a schematic diagram of the register file of an embodiment of the present invention;
Fig. 3 is a structural diagram of the shared register file of an embodiment of the present invention;
Fig. 4 is a schematic diagram of data exchange between the microprocessor cores of an embodiment of the present invention;
Fig. 5 is a flow chart of the data exchange method of an embodiment of the present invention.
Specific embodiments
To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
Embodiment one
As shown in Fig. 1, the present invention provides a quad-core processor system built using a quad-core structure. It processes data in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multidimensional data in parallel. The system includes four microprocessor cores with a reduced instruction set architecture, wherein
Each microprocessor core includes:
An instruction memory, for storing instructions;
An in-core data memory, for storing data;
A central processing unit, for executing the corresponding operations according to the input instructions and data, and for updating the register file inside the central processing unit and the external data memory.
Preferably, the central processing unit includes:
An instruction fetch module, for fetching the instruction for the current cycle from the instruction memory according to the current pointer value, and for calculating the pointer value for the next cycle;
A decoding module, for decoding the instruction from the instruction fetch module and generating all the control signals required by the arithmetic logic unit, the comparator, and the register file module;
An arithmetic logic unit, for computation; it receives data from the register file and the in-core data memory, and sends write enable signals and the data to be written to the register file and the in-core data memory;
A comparator, for receiving the output of the register file and judging from that output whether a jump occurs; if a jump occurs, the jump address is calculated by the arithmetic logic unit and sent to the instruction fetch module;
A register file, for receiving data from the in-core data memory, the arithmetic logic unit, and the comparator, and for sending data to the arithmetic logic unit and the comparator;
A pipeline control module, for controlling the pipeline, that is, providing the corresponding stall signals to the instruction fetch module, the decoding module, the arithmetic logic unit, and the comparator according to input signals from the execution module, to ensure that the pipeline runs smoothly.
Preferably, each register file includes a local register file and a shared register file, wherein
The local register file is used for closed computation on in-core data; during computation no interaction occurs with data outside the core, and the local microprocessor core has full read and write permission to its local register file;
The shared register file is interconnected with the shared registers of the other microprocessor cores outside the core to realize data interaction between the microprocessor cores; the local microprocessor core has read permission to its shared register file, and write permission is assigned to the local microprocessor core or to another microprocessor core according to the needs of the application. Specifically, the modification this embodiment makes to the register file is shown in Fig. 2, and the internal structure of the shared register file is shown in Fig. 3. Here the register file of the original Ou Pufa Er processor is divided into two parts: a local register file and a shared register file.
Preferably, each local register file is divided into two groups, each with one read port and one write port, wherein the two groups of register files receive different read address signals and provide the corresponding read values, and the two groups receive the same write address and data input signals, to ensure that the contents of the two groups are consistent. Specifically, the local register file includes 16 registers, numbered register 0 to register 15, each 32 bits wide. The shared register file includes 4 registers, each 32 bits wide, numbered register 16 to register 19 (also referred to as shared register 0 to shared register 3). A specific interconnection exists between the shared register file and the shared registers of the other cores outside the core, and it is used to realize inter-core data interaction. The local core has read permission to its shared register file, and write permission is assigned to the local core or to other cores according to the needs of the application. The shared register file has two read ports and four write ports and can accept write signals from at most four different cores.
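The register organisation in this embodiment (16 local registers held in two mirrored groups, plus 4 shared registers with assignable write permission) can be modelled in a few lines. The class names, the `writers` list that encodes write permission, and the use of `PermissionError` are illustrative assumptions; the patent does not describe the permission encoding.

```python
from typing import List, Tuple

MASK32 = 0xFFFFFFFF  # each register is 32 bits wide


class LocalRegisterFile:
    """Registers 0..15, kept as two mirrored groups to obtain two read ports."""

    SIZE = 16

    def __init__(self) -> None:
        self.group_a = [0] * self.SIZE
        self.group_b = [0] * self.SIZE

    def write(self, address: int, value: int) -> None:
        # The same write address and data reach both groups, so they stay equal.
        self.group_a[address] = value & MASK32
        self.group_b[address] = value & MASK32

    def read2(self, address_a: int, address_b: int) -> Tuple[int, int]:
        # Each group receives its own read address and returns its own value.
        return self.group_a[address_a], self.group_b[address_b]


class SharedRegisterFile:
    """Shared registers 0..3 (registers 16..19) of one core."""

    SIZE = 4

    def __init__(self, writers: List[int]) -> None:
        # writers[i] = id of the core allowed to write shared register i
        # (hypothetical encoding of the application-specific write permission).
        if len(writers) != self.SIZE:
            raise ValueError("one writer per shared register is required")
        self.writers = list(writers)
        self.regs = [0] * self.SIZE

    def write(self, core_id: int, index: int, value: int) -> None:
        if self.writers[index] != core_id:
            raise PermissionError(f"core {core_id} may not write shared register {index}")
        self.regs[index] = value & MASK32

    def read(self, index: int) -> int:
        return self.regs[index]  # the owning core can always read
```

A core would read two operands in one cycle with `read2`, while a neighbouring core deposits a value through `SharedRegisterFile.write` only if the configuration names it as the writer of that register.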
Preferably, each microprocessor core also includes a configuration register, used to configure the connection mode of the shared register file of the core it belongs to, which improves the flexibility of the structure. A configuration instruction is also added to the instruction set used by each microprocessor core, so that the configuration can actually be carried out.
Preferably, the data exchange paths among the 4 microprocessor cores are shown in Fig. 4. Each microprocessor core exchanges data in the following two ways:
In the first way, each microprocessor core accesses the external data memories through the multi-layer bus structure;
In the second way, inter-core data exchange is realized through the shared register files of the microprocessor cores. Specifically, the shared register files establish direct data paths between the four cores, and the connection of each path can be flexibly defined through the configuration registers, so that small amounts of data can be exchanged between cores.
Preferably, the multi-layer bus is a crossbar switch placed between the microprocessor cores and the external data memories. The 4 microprocessor cores select external data memories over different buses: if the selected external data memories are all different, the 4 cores transfer simultaneously; if the same external data memory is selected, a preset ordering rule determines which core transfers first. In general terms, a multi-layer bus places a crossbar switch between master devices and slave devices, and multiple masters select slaves over different buses. If the selected slaves are all different, the masters can transfer simultaneously; if the same slave is selected, the ordering rule specified in the design determines which master transfers first.
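The conflict rule of the crossbar can be sketched as a per-slave grant function. The text above only requires "a preset ordering rule"; the round-robin pointer used here (`last_granted`) and the function name `crossbar_grant` are assumptions made for illustration, not details taken from the embodiment.

```python
NUM_MASTERS = 4  # microprocessor cores
NUM_SLAVES = 4   # external data memories


def crossbar_grant(requests, last_granted):
    """Decide which master each slave serves in one cycle.

    requests:     dict master_id -> slave_id for every master that requests a transfer
    last_granted: dict slave_id -> master_id that this slave served most recently
    Returns a dict slave_id -> granted master_id.
    """
    grants = {}
    for slave in range(NUM_SLAVES):
        contenders = [m for m, s in requests.items() if s == slave]
        if not contenders:
            continue
        # Start searching just after the master served last time (round-robin).
        start = last_granted.get(slave, NUM_MASTERS - 1) + 1
        for offset in range(NUM_MASTERS):
            candidate = (start + offset) % NUM_MASTERS
            if candidate in contenders:
                grants[slave] = candidate
                break
    return grants
```

When every master targets a different slave, each slave has a single contender and all four transfers are granted in the same cycle. When several masters target the same slave, exactly one is granted and the others must retry.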
In detail, as shown in Fig. 1, the multi-layer bus may comprise an input module 11, a decoder module 12, an arbiter module 13 and a pass-through module 14, which together allow the 4 masters (microprocessor cores) to access the 4 slaves (external data memories) at the same time, wherein:
The input module 11 latches the read/write control and data signals from the microprocessor core. It takes the upper two bits of the address signal as the selection signal for the external data memory. It also takes the lower 12 bits of the address signal, shifts them right by 2 bits, and outputs the result as a new address signal that matches the address input port of the external data memory, together with the write data and the write-enable signal;
The decoder module 12 receives the external data memory selection signal from the input module, determines which external data memory the core's read or write operation targets, and sets the corresponding selection output to 1. In addition, it receives the read data from the 4 external data memories and, according to the decoded selection signal, selects the correct read data and sends it to the microprocessor core;
The arbiter module 13 arbitrates bus access. When multiple masters (microprocessor cores) simultaneously request the shared bus for data communication, the arbitration algorithm allocates the bus resource and decides which master may use it. Common arbitration algorithms include round-robin (polling), fixed priority, time-division multiplexing, lottery, and random contention. In this design, to improve arbitration efficiency, round-robin is chosen as the arbitration order because the algorithm is relatively simple and its cost relatively low. The module receives read/write data, control signals and external data memory selection signals on three channels (0, 1 and 2), arbitrates among them according to the round-robin rule, and finally selects one group of read/write control signals to output to the corresponding external data memory. It returns a hold signal to every signal source that lost the arbitration, informing it that this bus request has failed.
The pass-through module 14 uses the pass-through selection signal to choose between the pass-through control and data signals and the arbitrated control and data signals, and outputs one group to the external data memory. If the pass-through selection signal is 1, the output is the pass-through signal group; otherwise it is the arbitrated signal group.
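The address handling in the input module, the selection output of the decoder and the final two-way choice in module 14 can be sketched as follows. The 32-bit address width is an assumption (the text gives only "upper two bits" and "lower 12 bits"), and the function names are hypothetical.

```python
ADDR_WIDTH = 32  # assumption: the address width is not stated in the text


def decode_address(addr, addr_width=ADDR_WIDTH):
    """Split a core address into (memory select, word address).

    The upper two bits select one of the 4 external data memories.
    The lower 12 bits, shifted right by 2, form the memory's word address.
    """
    select = (addr >> (addr_width - 2)) & 0b11
    word_addr = (addr & 0xFFF) >> 2
    return select, word_addr


def one_hot(select):
    """Decoder output: the line of the selected memory is 1, the others are 0."""
    return [1 if i == select else 0 for i in range(4)]


def output_mux(through_select, through_group, arbitrated_group):
    """Module 14: forward the pass-through group when the select signal is 1."""
    return through_group if through_select == 1 else arbitrated_group
```

For example, `decode_address(0xC0000ABC)` selects memory 3 and yields word address `0x2AF`, since `0xABC >> 2 == 0x2AF`. Shifting right by 2 converts a byte address into an index of 32-bit words.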
Preferably, the instruction set used by each microprocessor core includes arithmetic instructions, logic instructions, branch instructions and memory-access instructions.
For other details of Embodiment One, refer to the corresponding parts of Embodiment One; they are not repeated here.
This embodiment makes extensive use of the parallelism of the algorithm and improves its execution efficiency. A quad-core processor is built with a four-core structure. Each microprocessor core takes a reduced-instruction-set (RISC) microprocessor as its prototype, with the following improvements: shared registers are introduced, a configuration register and a configuration instruction are added, a left-shift operation is added to the arithmetic logic unit, and the position of the branch instruction is changed. Two data exchange methods, namely the shared registers and the multi-layer bus built between the microprocessor cores and the external data memories, establish the data paths among the cores of the quad-core processor. This improves the performance of the quad-core processor in parallel data processing and raises data exchange efficiency.
Embodiment two
As shown in Fig. 5, the present invention also provides a data exchange method that uses the quad-core processor system described in Embodiment One. The method includes:
Step S1: the configuration register of each microprocessor core is initialized according to the parallel code of the specific application; that is, configuration information is set in the configuration register of each microprocessor core by means of the configuration instruction. Specifically, configuration instructions write the configuration information into the configuration register inside each of the four cores;
Step S2: data is exchanged between the external data memories and the microprocessor cores. In the first exchange, an external data memory writes data into the register file of a microprocessor core; afterwards there may be repeated data exchanges between the external data memories and the microprocessor cores. Specifically, the first operation is the data memory writing data into the core's registers, and the computation may involve several further exchanges between memory and core;
Step S3: inter-core data exchange is realized through the shared register files of the microprocessor cores. Specifically, for in-core computation and inter-core data exchange, the shared registers between the microprocessor cores also provide a data path. During computation, the algorithm should be analyzed so that this data path is used as much as possible, which improves computational efficiency. Depending on the application, steps S2 and S3 may be carried out repeatedly and alternately.
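Steps S1 to S3 can be walked through with a toy simulation. Everything concrete here is an assumption made for illustration: the encoding of the configuration register as a list of permitted writers, the ring wiring in which each core receives from its left neighbor, and the names `Core`, `s1_configure`, `s2_load`, `s3_send` and `ring_sum` do not come from the embodiment.

```python
WORD_MASK = 0xFFFFFFFF
NUM_CORES = 4


class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.local = [0] * 16        # local registers 0..15
        self.shared = [0] * 4        # shared registers 0..3 (registers 16..19)
        self.config = [core_id] * 4  # hypothetical: config[i] = core allowed to write shared[i]


def s1_configure(cores, wiring):
    """Step S1: write configuration information into every core."""
    for core in cores:
        core.config = list(wiring[core.core_id])


def s2_load(cores, memories):
    """Step S2: each external data memory writes one word into its core's register 0."""
    for core in cores:
        core.local[0] = memories[core.core_id][0] & WORD_MASK


def s3_send(cores, src_id, dst_id, index, value):
    """Step S3: core src_id writes a value into a shared register of core dst_id."""
    dst = cores[dst_id]
    if dst.config[index] != src_id:
        raise PermissionError(f"core {src_id} may not write shared[{index}] of core {dst_id}")
    dst.shared[index] = value & WORD_MASK


def ring_sum(memories):
    """Each core adds its own word to the word received from its left neighbor."""
    cores = [Core(i) for i in range(NUM_CORES)]
    wiring = {i: [(i - 1) % NUM_CORES] * 4 for i in range(NUM_CORES)}
    s1_configure(cores, wiring)
    s2_load(cores, memories)
    for core in cores:
        s3_send(cores, core.core_id, (core.core_id + 1) % NUM_CORES, 0, core.local[0])
    return [(c.local[0] + c.shared[0]) & WORD_MASK for c in cores]
```

With `memories = [[10], [20], [30], [40]]`, `ring_sum` returns `[50, 30, 50, 70]`: core 0 adds its own 10 to the 40 received from core 3, and so on around the ring. The neighbor values move through shared registers only, without a second pass over the bus.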
In summary, the present invention makes extensive use of the parallelism of the algorithm and improves its execution efficiency. A quad-core processor is built with a four-core structure. Each microprocessor core takes a reduced-instruction-set (RISC) microprocessor as its prototype, with the following improvements: shared registers are introduced, a configuration register and a configuration instruction are added, a left-shift operation is added to the arithmetic logic unit, and the position of the branch instruction is changed. Two data exchange methods, namely the shared registers and the multi-layer bus built between the microprocessor cores and the external data memories, establish the data paths among the cores of the quad-core processor. This improves the performance of the quad-core processor in parallel data processing and raises data exchange efficiency.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may be referred to one another. Because the system disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for the relevant points, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of their function. Whether these functions are executed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. Thus, if these changes and modifications fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include them.

Claims (8)

1. A quad-core processor system built with a four-core structure, characterized in that data is processed in a single-program-block, multiple-data mode, that is, at any given moment all microprocessor cores strictly execute the same program segment and process multi-dimensional data in parallel; the system comprises 4 microprocessor cores of reduced-instruction-set architecture, wherein
each microprocessor core comprises:
an instruction memory, for storing instructions;
an in-core data memory, for storing data;
a central processing unit, for performing the corresponding operations according to the input instructions and data, and for updating the register file inside the central processing unit and the external data memory;
and the central processing unit comprises:
an instruction fetch module, for fetching the current-cycle instruction from the instruction memory according to the current pointer value, and for calculating the pointer value of the next cycle;
a decode module, for decoding the instruction from the instruction fetch module and generating all control signals required by the arithmetic logic unit, the comparator and the register file module;
an arithmetic logic unit, for computation, which receives data from the register file and the in-core data memory, and sends write-enable signals and the data to be written to the register file and the in-core data memory;
a comparator, for receiving the output of the register file and determining from it whether a jump is taken; if a jump is taken, the jump address is calculated by the arithmetic logic unit and sent to the instruction fetch module;
a register file, for receiving data from the in-core data memory, the arithmetic logic unit and the comparator, and for sending data to the arithmetic logic unit and the comparator;
a pipeline control module, for controlling the pipeline, that is, providing the corresponding stall signals to the instruction fetch module, the decode module, the arithmetic logic unit and the comparator according to the input signals from the microprocessor core, so as to ensure that the pipeline runs smoothly.
2. The quad-core processor system built with a four-core structure according to claim 1, characterized in that each register file comprises a local register file and a shared register file, wherein
the local register file is used for closed computation on data inside the core and has no interaction with data outside the core during computation, the local microprocessor core having full read and write access to its local register file;
the shared register file is interconnected with the shared registers of the other microprocessor cores outside the core to implement data interaction between the microprocessor cores, the local microprocessor core having read permission on its shared register file, and write permission being assigned to the local microprocessor core or to another microprocessor core according to the needs of the application.
3. The quad-core processor system built with a four-core structure according to claim 2, characterized in that each local register file is divided into two banks, each bank having one read port and one write port, wherein the two banks receive different read address signals and each provides the corresponding read value, and the two banks receive the same write address and data input signals so that the contents of the two banks remain identical.
4. The quad-core processor system built with a four-core structure according to claim 3, characterized in that each microprocessor core exchanges data in the following two ways:
in the first way, each microprocessor core accesses the external data memories through the multi-layer bus structure;
in the second way, inter-core data exchange is realized through the shared register files of the microprocessor cores.
5. The quad-core processor system built with a four-core structure according to claim 4, characterized in that the multi-layer bus is a crossbar switch placed between the microprocessor cores and the external data memories; the 4 microprocessor cores select external data memories over different buses; if the selected external data memories are all different, the 4 microprocessor cores transfer simultaneously; if the same external data memory is selected, a preset ordering rule determines which microprocessor core transfers first.
6. The quad-core processor system built with a four-core structure according to claim 5, characterized in that the instruction set used by each microprocessor core includes arithmetic instructions, logic instructions, branch instructions and memory-access instructions.
7. The quad-core processor system built with a four-core structure according to claim 6, characterized in that each microprocessor core also includes a configuration register, used to configure the connection mode of the shared register file of the microprocessor core it belongs to, so as to improve the flexibility of the four-core structure, and a configuration instruction is added to the instruction set used by each microprocessor core so that the configuration can be carried out.
8. A data exchange method, characterized in that it uses the quad-core processor system according to claim 7, the method comprising:
initializing the configuration register of each microprocessor core according to the parallel code of the specific application, that is, setting configuration information in the configuration register of each microprocessor core by means of the configuration instruction;
exchanging data between the external data memories and the microprocessor cores, wherein in the first exchange an external data memory writes data into the register file of a microprocessor core, and afterwards there may be repeated data exchanges between the external data memories and the microprocessor cores;
realizing inter-core data exchange through the shared register files of the microprocessor cores.
CN201410014522.7A 2014-01-13 2014-01-13 The four core processor systems built using four nuclear structures and method for interchanging data Active CN103744644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410014522.7A CN103744644B (en) 2014-01-13 2014-01-13 The four core processor systems built using four nuclear structures and method for interchanging data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410014522.7A CN103744644B (en) 2014-01-13 2014-01-13 The four core processor systems built using four nuclear structures and method for interchanging data

Publications (2)

Publication Number Publication Date
CN103744644A CN103744644A (en) 2014-04-23
CN103744644B true CN103744644B (en) 2017-03-01

Family

ID=50501664

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410014522.7A Active CN103744644B (en) 2014-01-13 2014-01-13 The four core processor systems built using four nuclear structures and method for interchanging data

Country Status (1)

Country Link
CN (1) CN103744644B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10409606B2 (en) * 2015-06-26 2019-09-10 Microsoft Technology Licensing, Llc Verifying branch targets
US11755484B2 (en) 2015-06-26 2023-09-12 Microsoft Technology Licensing, Llc Instruction block allocation
US10346168B2 (en) 2015-06-26 2019-07-09 Microsoft Technology Licensing, Llc Decoupled processor instruction window and operand buffer
US10565670B2 (en) * 2016-09-30 2020-02-18 Intel Corporation Graphics processor register renaming mechanism
CN108694441B (en) * 2017-04-07 2022-08-09 上海寒武纪信息科技有限公司 Network processor and network operation method
CN108536642A (en) * 2018-06-13 2018-09-14 北京比特大陆科技有限公司 Big data operation acceleration system and chip
CN112740192B (en) * 2018-10-30 2024-04-30 北京比特大陆科技有限公司 Big data operation acceleration system and data transmission method
WO2021134521A1 (en) * 2019-12-31 2021-07-08 北京希姆计算科技有限公司 Storage management apparatus and chip
CN113759246B (en) * 2020-05-22 2024-01-30 北京机械设备研究所 Dual-core processor-based motor drive test method and motor driver
CN112834819B (en) * 2021-01-04 2024-04-02 杭州万高科技股份有限公司 Digital signal processing device and method for electric energy metering chip
CN112834820B (en) * 2021-04-09 2024-01-23 杭州万高科技股份有限公司 Electric energy meter and metering device thereof
CN114398299B (en) * 2021-12-24 2024-05-10 北京四方继保工程技术有限公司 Data processing method of four-core cooperative measurement and control processor and processor
CN117132450B (en) * 2023-10-24 2024-02-20 芯动微电子科技(武汉)有限公司 Computing device capable of realizing data sharing and graphic processor

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101876892A (en) * 2010-05-20 2010-11-03 复旦大学 Communication and multimedia application-oriented single instruction multidata processor circuit structure

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7810093B2 (en) * 2003-11-14 2010-10-05 Lawrence Livermore National Security, Llc Parallel-aware, dedicated job co-scheduling within/across symmetric multiprocessing nodes

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
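Design of an inter-core data exchange structure for multi-core processors based on a configurable shared register file (一种基于可配置共享寄存器堆的多核处理器核间数据交换结构设计); Fang Ying et al.; Microelectronics & Computer (《微电子学与计算机》); 2011-04-30; Vol. 28, No. 4; pp. 65-72 *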

Also Published As

Publication number Publication date
CN103744644A (en) 2014-04-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant