CN101123567A - Method and system for processing network information - Google Patents

Method and system for processing network information

Info

Publication number
CN101123567A
CN101123567A · CNA2007100971918A · CN200710097191A
Authority
CN
China
Prior art keywords
data
user
hardware
receiving
tcp
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2007100971918A
Other languages
Chinese (zh)
Inventor
Eliezer Aloni
Uri El Zur
Rafi Shalom
Caitlin Bestler
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Broadcom Corp
Zyray Wireless Inc
Original Assignee
Zyray Wireless Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zyray Wireless Inc filed Critical Zyray Wireless Inc
Publication of CN101123567A publication Critical patent/CN101123567A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

Certain aspects of a method and system for user space TCP offload are disclosed. Aspects of a method may include offloading transmission control protocol (TCP) processing of received data to an on-chip processor. The received data may be posted directly to hardware, bypassing kernel processing of the received data, utilizing a user space library. If the received data is not cached in memory, an application buffer comprising the received data may be registered by the user space library. The application buffer may be pinned and posted to the hardware.

Description

Method and system for processing network information
Technical field
The present invention relates to the processing of TCP data and related TCP information. More particularly, it relates to a method and system for a user space TCP offload engine.
Background art
Several different approaches currently exist for reducing the processing power consumed by the TCP/IP protocol stack. In a TCP offload engine (TOE), all or most of the TCP processing is performed by the offload engine, which delivers the data stream to higher layers. This approach has several drawbacks. A TOE is tightly coupled with the operatingating system, so the solution must be tailored to the operating system, and the operating system must be modified to support it. A TOE may require a side-by-side protocol stack solution, which in turn requires some manual configuration by the application, for example explicitly specifying the socket address family, in order to accelerate a connection. A TOE may also require manual configuration by an IT administrator, for example explicitly specifying IP subnet addresses, in order to accelerate connections and to select which TCP flows will be offloaded. Because it must implement full TCP packet processing, the offload engine is very complex.
Large segment offload (LSO), also called transmit segmentation offload (TSO), can be used to reduce host processing power consumption by simplifying transmit packet processing. In this approach the host sends transfer units larger than the maximum transmission unit (MTU) to the NIC, and the NIC divides them into MTU-sized segments. Because part of the host processing scales linearly with the number of transfer units, this reduces the required host processing power. Although effective at simplifying transmit-side packet processing, LSO does not help with receive-side packet processing. In addition, for each large transmit unit the host sends, the host receives a plurality of ACKs from the remote end, one for each MTU-sized segment. These multiple ACKs consume scarce and expensive bandwidth, reducing throughput and efficiency.
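The transmit-side mechanics described above amount to slicing one oversized transfer unit into MSS-sized pieces. The following sketch (Python, with illustrative names that are not from the patent) shows the segmentation an LSO/TSO-capable NIC would perform on the host's behalf, and why one large unit produces many wire packets and ACKs:

```python
def segment_transmit_unit(payload: bytes, mss: int) -> list[bytes]:
    """Split one large transmit unit into MSS-sized segments,
    as an LSO/TSO-capable NIC would on behalf of the host."""
    if mss <= 0:
        raise ValueError("MSS must be positive")
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

# A 64 KB transmit unit handed down by the host, segmented for a
# 1460-byte MSS (typical for a 1500-byte Ethernet MTU):
unit = bytes(64 * 1024)
segments = segment_transmit_unit(unit, 1460)
print(len(segments))       # → 45 wire packets (hence up to 45 ACKs back)
print(len(segments[-1]))   # → 1296, the final short segment
```

The host performed one send; the wire carries 45 packets, which is the per-unit saving LSO provides on transmit and the ACK multiplication it does not address.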
In large receive offload (LRO), a stateless receive offload mechanism, TCP flows may be distributed across multiple hardware queues according to a hash function that guarantees a given TCP flow is always placed into the same hardware queue. For each hardware queue, the mechanism scans the queue in conjunction with interrupts and aggregates consecutive packets in the queue that belong to the same TCP flow into one large receive unit.
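The queue-steering and coalescing behavior can be sketched as follows. The hash and the data layout are toy stand-ins (real NICs typically use, e.g., a Toeplitz hash over the 4-tuple); note how only *consecutive* packets of the same flow coalesce, which is the root of the limitations discussed next:

```python
def queue_for_flow(flow: tuple, num_queues: int) -> int:
    """Toy deterministic hash steering a flow to one hardware queue."""
    return sum(flow) % num_queues

def aggregate_queue(packets):
    """Coalesce consecutive same-flow packets in one queue into large
    receive units, as an LRO scan over a hardware queue would."""
    units = []
    cur_flow, cur_payload = None, b""
    for flow, payload in packets:
        if flow == cur_flow:
            cur_payload += payload          # same flow, back-to-back: grow unit
        else:
            if cur_flow is not None:
                units.append((cur_flow, cur_payload))
            cur_flow, cur_payload = flow, payload
    if cur_flow is not None:
        units.append((cur_flow, cur_payload))
    return units

# Two flows sharing one queue: the interleaved packet breaks aggregation.
flow_a = (0x0A000001, 5000, 0x0A000002, 80)   # (src IP, src port, dst IP, dst port)
flow_b = (0x0A000003, 6000, 0x0A000002, 80)
merged = aggregate_queue([(flow_a, b"s1"), (flow_a, b"s2"),
                          (flow_b, b"x"), (flow_a, b"s3")])
```

`merged` contains three units, not two: flow A's first two packets coalesce, but its third, arriving after flow B's packet, starts a new unit, illustrating why too many flows per queue defeats LRO.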
Although this mechanism requires no additional NIC hardware beyond the multiple hardware queues, it has various performance limitations. For example, if the number of flows exceeds the number of hardware queues, multiple flows fall into the same queue, and no LRO aggregation occurs for that queue. If the number of flows is greater than about twice the number of hardware queues, no LRO aggregation can be performed on any flow. The aggregation is also limited by the number of packets available to the host in one interrupt period. If the interrupt period is short and the number of flows is not small, the number of packets available to the host CPU for aggregation per flow may be very small, resulting in limited or no LRO aggregation. Even with a large number of hardware queues, limited or no LRO aggregation may occur. The LRO aggregation may be performed by the host CPU, which adds processing burden. The driver may deliver to the TCP stack a linked list of buffers comprising a header buffer followed by a series of data buffers, which requires more processing than the case where all the data is delivered contiguously in a single buffer.
When the host processor must perform read/write operations, a data buffer must be allocated in user space. A read operation may be used to copy data from a file into this allocated buffer, and a write operation may be used to send the contents of the buffer out to the network. The OS kernel must copy all the data from user space into kernel space. Copy operations are CPU- and memory-bandwidth-intensive, and can limit system performance.
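As a rough model of the cost being described, the sketch below counts the copies and context switches on the conventional write path. The class and function names are illustrative assumptions, not the patent's terminology; each `bytes(...)` call stands in for a full traversal of the data by the CPU or DMA engine:

```python
class CopyCounter:
    """Tallies the work a conventional write() path incurs (toy model)."""
    def __init__(self):
        self.copies = 0
        self.context_switches = 0

def conventional_write(user_buf: bytes, stats: CopyCounter) -> bytes:
    stats.context_switches += 1      # write() traps from user mode to kernel mode
    kernel_buf = bytes(user_buf)     # copy 1: user space -> kernel space
    stats.copies += 1
    stats.context_switches += 1      # write() returns to user mode
    nic_buf = bytes(kernel_buf)      # copy 2: kernel buffer -> NIC memory (DMA)
    stats.copies += 1
    return nic_buf

stats = CopyCounter()
wire = conventional_write(b"hello network", stats)
```

Every byte crosses the memory bus twice (plus two mode switches) before reaching the NIC; eliminating the user-to-kernel copy is exactly what posting application buffers directly to hardware avoids.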
Copying data between user space and kernel space in the TCP/IP protocol stack can consume significant host processing power, and many solutions have been proposed to reduce it. For example, remote direct memory access (RDMA) can avoid the memory copy on both the transmit and receive paths, but it requires a new application programming interface (API), a new wire protocol, and modifications to existing applications on both sides of the wire. A local DMA engine may be used to offload the memory copy on both the transmit and receive paths. Although a local DMA engine can relieve the CPU of the copy operation, it cannot reduce the required memory bandwidth. As platforms shift toward multi-CPU architectures with multiple cores per CPU, all sharing the same memory, memory bandwidth can become a serious bottleneck in high-speed network applications.
Further limitations and disadvantages of conventional and traditional approaches will become apparent to one of skill in the art through comparison of such systems with some aspects of the present invention as set forth hereinafter with reference to the accompanying drawings.
Summary of the invention
A method and system for a user space TCP offload engine is described, substantially as shown in and/or described in connection with at least one of the figures, and set forth more completely in the claims.
According to an aspect of the present invention, a method for processing network information is provided, the method comprising:
offloading transmission control protocol (TCP) processing of received data to an on-chip processor; and
utilizing a user space library to post the received data directly to hardware, bypassing kernel processing of the received data.
In the method of the present invention, the method further comprises determining whether the received data is cached in memory.
In the method of the present invention, the method further comprises, if the received data is not cached in the memory, registering at least one application buffer comprising the received data.
In the method of the present invention, the method further comprises, if the received data is not cached in the memory, pinning the at least one application buffer.
In the method of the present invention, the method further comprises, if the received data is not cached in the memory, posting the at least one pinned application buffer comprising the received data directly to the hardware.
In the method of the present invention, the method further comprises:
adding an application buffer virtual address to a buffer ID for the at least one application buffer; and
storing the at least one application buffer in a cache memory.
In the method of the present invention, the method further comprises, if the received data is cached in the memory, posting the received data directly to the hardware, bypassing kernel processing of the received data.
In the method of the present invention, the method further comprises pre-posting to the hardware at least one application buffer for the received data.
In the method of the present invention, the method further comprises indicating reception of the data by updating a completion queue entry in the hardware.
In the method of the present invention, the method further comprises generating an event notification after the completion queue entry in the hardware has been updated.
According to an aspect of the present invention, a system for processing network information is provided, the system comprising:
circuitry for offloading transmission control protocol (TCP) processing of received data to an on-chip processor; and
circuitry for utilizing a user space library to post the received data directly to hardware, bypassing kernel processing of the received data.
In the system of the present invention, the system further comprises circuitry for determining whether the received data is cached in memory.
In the system of the present invention, the system further comprises circuitry for registering at least one application buffer comprising the received data, when the received data is not cached in the memory.
In the system of the present invention, the system further comprises circuitry for pinning the at least one application buffer, when the received data is not cached in the memory.
In the system of the present invention, the system further comprises circuitry for posting the at least one pinned application buffer comprising the received data directly to the hardware, when the received data is not cached in the memory.
In the system of the present invention, the system further comprises:
circuitry for adding an application buffer virtual address to a buffer ID for the at least one application buffer; and
circuitry for storing the at least one application buffer in a cache memory.
In the system of the present invention, the system further comprises circuitry for posting the received data directly to the hardware, bypassing kernel processing of the received data, when the received data is cached in the memory.
In the system of the present invention, the system further comprises circuitry for pre-posting to the hardware at least one application buffer for the received data.
In the system of the present invention, the system further comprises circuitry for indicating reception of the data by updating a completion queue entry in the hardware.
In the system of the present invention, the system further comprises circuitry for generating an event notification after the completion queue entry in the hardware has been updated.
These and other advantages, aspects and novel features of the present invention, as well as details of an illustrated embodiment thereof, will be more fully understood from the following description and drawings.
Description of drawings
Figure 1A is a block diagram of an exemplary system for TCP offload, in accordance with an embodiment of the invention;
Figure 1B is a block diagram of another exemplary system for TCP offload, in accordance with an embodiment of the invention;
Figure 1C is a block diagram of an alternative embodiment of an exemplary system for TCP offload, in accordance with an embodiment of the invention;
Figure 1D is a flow chart illustrating copying of data on a write operation in a host CPU system, in accordance with an embodiment of the invention;
Figure 1E is a flow chart illustrating copying of data on a read operation in a host CPU system, in accordance with an embodiment of the invention;
Figure 2A is a flow chart of exemplary steps for connection setup for a user space TCP offload engine (TOE), in accordance with an embodiment of the invention;
Figure 2B is a block diagram of an exemplary system for queue allocation through a user space library, in accordance with an embodiment of the invention;
Figure 3A is a flow chart of an exemplary send or transmit process for a user space TOE, in accordance with an embodiment of the invention;
Figure 3B is a flow chart of an exemplary send or transmit process for a user space TOE when the application buffer is not in the cache, in accordance with an embodiment of the invention;
Figure 3C is a flow chart of an exemplary send or transmit process for a user space TOE when the application buffer is in the cache, in accordance with an embodiment of the invention;
Figure 4A is a flow chart of an exemplary receive process for a user space TOE, in accordance with an embodiment of the invention;
Figure 4B is a flow chart of an exemplary receive process for a user space TOE when the application buffer is not in the cache, in accordance with an embodiment of the invention;
Figure 4C is a flow chart of an exemplary receive process for a user space TOE when the application buffer is in the cache, in accordance with an embodiment of the invention;
Figure 5 is a flow chart of exemplary steps for work request completion with transparent TCP offload through a user space library, in accordance with an embodiment of the invention.
Detailed description of the embodiments
Certain embodiments of the invention relate to a method and system for a user space TCP offload engine. The method and system may comprise offloading transmission control protocol (TCP) processing of received data to an on-chip processor. The received data may be posted directly to hardware, bypassing kernel processing of the received data, utilizing a user space library. If the received data is not cached in memory, an application buffer comprising the received data may be registered by the user space library. The application buffer may be pinned and posted to the hardware.
Figure 1A is a block diagram of an exemplary system for TCP offload, in accordance with an embodiment of the invention. Accordingly, the system of Figure 1A may be adapted to handle TCP offload of transmission control protocol (TCP) datagrams or packets. Referring to Figure 1A, the system may comprise, for example, a CPU 102, a memory controller 104, a host memory 106, a host interface 108, a network subsystem 110 and an Ethernet 112. The network subsystem 110 may comprise, for example, a TCP-enabled Ethernet controller (TEEC) or a TCP offload engine (TOE) 114. The network subsystem 110 may comprise, for example, a network interface card (NIC). The host interface 108 may be, for example, a peripheral component interconnect (PCI), PCI-X, PCI-Express, ISA, SCSI or other type of bus. The memory controller 104 may be coupled to the CPU 102, to the host memory 106 and to the host interface 108. The host interface 108 may be coupled to the network subsystem 110 via the TEEC/TOE 114.
Figure 1B is a block diagram of another exemplary system for TCP offload, in accordance with an embodiment of the invention. Referring to Figure 1B, the system may comprise, for example, a CPU 102, a host memory 106, a dedicated memory 116 and a chipset 118. The chipset 118 may comprise, for example, the network subsystem 110 and the memory controller 104. The chipset 118 may be coupled to the CPU 102, to the host memory 106, to the dedicated memory 116 and to the Ethernet 112. The network subsystem 110 of the chipset 118 may be coupled to the Ethernet 112. The network subsystem 110 may comprise, for example, the TEEC/TOE 114, which may be coupled to the Ethernet 112. The network subsystem 110 may communicate with the Ethernet 112 via a wired and/or a wireless connection, for example. The wireless connection may be a wireless local area network (WLAN) connection as supported by the IEEE 802.11 standards, for example. The network subsystem 110 may also comprise, for example, an on-chip memory 113. The dedicated memory 116 may provide buffers for context and/or data.
The network subsystem 110 may comprise a processor such as a coalescer 111. The coalescer 111 may comprise suitable logic, circuitry and/or code that may be enabled to handle the accumulation or coalescing of TCP data. In this regard, the coalescer 111 may utilize a flow lookup table (FLT) to maintain information regarding current network flows for which TCP segments are being collected and aggregated. The FLT may be stored in, for example, the network subsystem 110. The FLT may comprise at least one of the following: a source IP address, a destination IP address, a source TCP address and a destination TCP address, for example. In another embodiment of the invention, at least two different tables may be utilized: for example, a table comprising a 4-tuple lookup, used to classify incoming packets according to their flow. The 4-tuple lookup table may comprise at least one of the following: a source IP address, a destination IP address, a source TCP address and a destination TCP address, for example. A flow context table may comprise state variables used for aggregation, such as the TCP sequence number.
The FLT may also comprise at least one of the following: a host buffer or memory address, including a scatter-gather list (SGL) for non-contiguous memory, an accumulated ACK count, a copy of the TCP header and options, a copy of the IP header and options, a copy of the Ethernet header, and/or accumulated TCP flags, for example. When a termination event occurs, the coalescer 111 may generate a single aggregated TCP segment from the accumulated or collected TCP segments. The aggregated TCP segment may be communicated to the host memory 106, for example.
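A minimal model of an FLT-driven coalescer might look like the following. The table layout and the termination condition (here, an out-of-order segment) are simplified assumptions for illustration; the patent's FLT carries considerably more per-flow state (headers, ACK counts, SGLs):

```python
class Coalescer:
    """Toy per-flow aggregator keyed by the 4-tuple, as an FLT would be."""
    def __init__(self):
        self.flt = {}   # 4-tuple -> {"next_seq": int, "payload": bytes}

    def receive(self, flow, seq, payload):
        """Returns an aggregated segment if a termination event fired, else None."""
        entry = self.flt.get(flow)
        if entry and seq == entry["next_seq"]:
            entry["payload"] += payload          # in-order: grow the aggregate
            entry["next_seq"] = seq + len(payload)
            return None
        flushed = self.flush(flow)               # new flow or out-of-order: flush
        self.flt[flow] = {"next_seq": seq + len(payload), "payload": payload}
        return flushed

    def flush(self, flow):
        """Termination event: emit one aggregated segment for the flow."""
        entry = self.flt.pop(flow, None)
        return entry["payload"] if entry else None

coalescer = Coalescer()
flow = (0x0A000001, 5000, 0x0A000002, 80)
coalescer.receive(flow, 1000, b"seg1")
coalescer.receive(flow, 1004, b"seg2")   # in order: accumulated, not yet delivered
```

On a later termination event (timer, PSH flag, buffer limit, etc.), `flush(flow)` would hand the host one `b"seg1seg2"` unit instead of two packets.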
Although illustrated using a CPU and an Ethernet, the present invention is not limited to these embodiments; any type of memory and any type of data link layer or physical media may be utilized, respectively. Accordingly, although shown coupled to the Ethernet 112, the TEEC or TOE 114 of Figure 1A may be adapted for any type of data link layer or physical media. In addition, the components illustrated in Figures 1A-B may be decomposed or integrated to varying degrees. For example, the TEEC/TOE 114 may be a separate integrated chip embedded on a motherboard or on a NIC, provided outside the chipset 118. Similarly, the coalescer 111 may be a separate integrated chip embedded on a motherboard or on a NIC, provided outside the chipset 118. In addition, the dedicated memory 116 may be integrated with the chipset 118 or with the network subsystem 110 of Figure 1B.
Figure 1C is a block diagram of an alternative embodiment of an exemplary system for TCP offload, in accordance with an embodiment of the invention. Referring to Figure 1C, there is shown a hardware block 135. The hardware block 135 may comprise a host processor 124, a cache memory/buffer 126, a software algorithm block 134 and a NIC block 128. The NIC block 128 may comprise a NIC processor 130, a processor such as a coalescer 131, and a NIC memory/application buffer block 132. The NIC block 128 may communicate with an external network via a wired and/or a wireless connection, for example. The wireless connection may be a wireless local area network (WLAN) connection as supported by the IEEE 802.11 standards, for example.
The coalescer 131 may be a dedicated processor or a hardware state machine residing in the packet receive path. The host TCP stack may comprise software that manages the TCP protocol processing and may be part of an operating system, such as Microsoft Windows or Linux, for example. The coalescer 131 may comprise suitable logic, circuitry and/or code that may be enabled to handle the accumulation or aggregation of TCP data. In this regard, the coalescer 131 may utilize a flow lookup table (FLT) to maintain information regarding current network flows for which TCP segments are being collected and aggregated. The FLT may be stored in, for example, the NIC memory/application buffer block 132. When a termination event occurs, the coalescer 131 may generate a single aggregated TCP segment from the accumulated or collected TCP segments. The aggregated TCP segment may be communicated to the cache memory/buffer 126, for example.
In accordance with certain embodiments of the invention, providing a single aggregated TCP segment to the host for TCP processing significantly reduces the processing burden on the host 124. Furthermore, since no TCP state information is transferred, dedicated hardware such as the NIC 128 may be used to assist with the processing of received TCP segments by coalescing or aggregating multiple received TCP segments, so as to reduce the per-packet processing burden.
In conventional TCP processing systems, certain information about a TCP connection must be known before the first data segment for that connection arrives. In accordance with an embodiment of the invention, the TCP connection need not be known before the first data segment arrives, because the TCP state or context information is still managed independently by the host TCP stack, and state may be transferred between the hardware stack and the software stack at any given time.
In an embodiment of the invention, an offload mechanism may be provided that is stateless from the host stack's perspective, yet stateful from the offload device's perspective, achieving performance gains comparable to a TOE. TCP offload may reduce the host processing power required for TCP by allowing the host system to process receive and transmit data units that are larger than an MTU. In one exemplary embodiment of the invention, 64 KB protocol data units (PDUs) may be processed rather than 1.5 KB PDUs, significantly reducing the packet rate and thereby the host processing power required for packet processing.
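The packet-rate saving follows directly from the two PDU sizes mentioned above; at a fixed line rate, the per-unit event rate the host must service scales inversely with PDU size:

```python
mtu_pdu = 1.5 * 1024      # bytes per conventional PDU (~1.5 KB)
large_pdu = 64 * 1024     # bytes per offloaded PDU (64 KB)

# Ratio of units the host must handle, conventional vs. offloaded:
reduction = large_pdu / mtu_pdu
print(round(reduction, 1))   # → 42.7, i.e. ~43x fewer units per byte of traffic
```

This back-of-the-envelope factor is why moving from MTU-sized to 64 KB units so sharply cuts per-packet host overhead.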
During the TCP offload process, no handshake is required between the host operating system and the NIC containing the TOE. The TOE may autonomously identify and offload new flows. The offload on the transmit side may be similar to LSO, in which the host sends large transmit units and the TOE divides them into smaller transmit packets according to the maximum segment size (MSS).
The TCP offload process on the receive side may aggregate a plurality of received packets belonging to the same flow and communicate them to the host as though the received packets were one single large received packet, while the ACK packets appear as a single aggregated ACK packet. The processing in the host is similar to the processing of a single large received packet. In the case of TCP flow aggregation, rules may be defined to determine whether to aggregate packets. The aggregation rules may be formulated to allow as much aggregation as possible without increasing the round-trip time, so that the decision to aggregate depends on the importance of delivering the received data to the host without delay. Aggregation may be implemented using a transmit-receive coupling, wherein the coupling between the transmitter and the receiver uses the transmit information for the offload decision, and a flow may be regarded as a bidirectional flow. The context information received during the offload process in the TOE may be maintained on a per-flow basis. In this regard, for each received packet, the incoming packet header may be used to detect the flow to which the packet belongs, and the flow context may be updated for that packet.
When the transmitter and the receiver are coupled, the transmitted network packets may be searched along with the received network packets to determine the particular network flow to which a packet belongs. The transmitted network packets may be used to update the flow context, which may in turn be used in the receive offload process.
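One common way to make transmitted and received packets resolve to the same per-flow context is to canonicalize the flow key so that both directions of a connection map to a single entry. This is an illustrative sketch, not the patent's exact scheme:

```python
def canonical_flow_key(src_ip, src_port, dst_ip, dst_port):
    """Map both directions of a connection to one lookup key, so a packet
    sent on a flow and a packet received on it update the same context."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (a, b) if a <= b else (b, a)   # order endpoints deterministically

# A transmitted packet and the received reply yield the same key:
key_out = canonical_flow_key("10.0.0.1", 5000, "10.0.0.2", 80)
key_in  = canonical_flow_key("10.0.0.2", 80, "10.0.0.1", 5000)
```

With this normalization, a single FLT entry can hold the bidirectional state that the transmit-receive coupling relies on.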
Figure 1D is a flow chart illustrating copying of data on a write operation in a host CPU system, in accordance with an embodiment of the invention. Referring to Figure 1D, there are shown various context switches, a user buffer 164, a kernel buffer 162, a NIC memory 168 and the copy operations performed for a write operation. In step 152, a write system call causes a context switch from user mode to kernel mode. A copy operation may be performed to transfer the data into the kernel buffer 162. In step 154, the write system call may return, creating another context switch. Another copy operation occurs as a DMA engine independently and asynchronously transfers the data from the kernel buffer 162 to the NIC memory 168. This duplication of the data is not necessary and may be eliminated to decrease overhead and increase performance. To decrease overhead, some of the copy operations between the kernel buffer 162 and the user buffer 164 may be eliminated.
Figure 1E is a flow chart illustrating copying of data on a read operation in a host CPU system, in accordance with an embodiment of the invention. Referring to Figure 1E, there are shown various context switches, a user buffer 164, a kernel buffer 162, a NIC memory 168 and the copy operations performed for a read operation. In step 172, a read system call causes a context switch from user mode to kernel mode. A copy operation may be performed to transfer the data into the kernel buffer 162. In step 174, the read system call may return, creating another context switch. Another copy operation occurs as a DMA engine independently and asynchronously transfers the data between the kernel buffer 162 and the NIC memory 168. This duplication of the data is not necessary and may be eliminated to decrease overhead and increase performance. To decrease overhead, some of the copy operations between the kernel buffer 162 and the user buffer 164 may be eliminated.
Figure 2A is a flow chart of exemplary steps for connection setup for a user space TCP offload engine (TOE), in accordance with an embodiment of the invention. Referring to Figure 2A, there are shown hardware 202, a kernel 204, a user space library 206 and an application 208. The hardware 202 may comprise suitable logic and/or circuitry that may be enabled to process data received from the various drivers and other devices coupled to the hardware 202. The kernel 204 may comprise suitable logic and/or code that may be enabled to manage the resources of the CPU 102 so that other applications 208 can run on the host system. The kernel 204 may be enabled to implement scheduling, buffering, caching, spooling and error-handling functions, for example. The kernel 204 may also be enabled to implement communication between the various hardware and software components. The user space library 206 may comprise a collection of subroutines used for developing software. The user space library 206 may allow code and data to be shared and changed in a modular fashion.
A connection may be registered by determining the 4-tuple, a receive queue (RQ), a send queue (SQ) and a completion queue (CQ) in the user space library 206. Optionally, the kernel 204 may apply filtering rules to the registered connection packets to determine whether the connection request may be allowed. The kernel 204 may pin and map the RQ, SQ, CQ and user context buffers. The kernel 204 may also store the DMA addresses of the RQ, SQ, CQ and user context buffers in a newly allocated FLT for the flow.
In another embodiment of the invention, the user space library 206 may specify the accepted connection using a handle provided by the kernel 204. The kernel 204 may approve transferring ownership of the connection to the user space library 206. If the kernel 204 approves transferring ownership of the connection to the user space library 206, the kernel 204 may provide the TCP state information required for the connection to the user space library 206.
In another embodiment of the invention, the user space library 206 may establish a connection on its own. The user space library 206 may request that a local IP address and TCP port be allocated for it, specifying the particular RQ and CQ to be used. If approved, the kernel 204 may register and pin the RQ and CQ, if not already pinned, and then create an offload listen entry.
Fig. 2B is a block diagram of an exemplary system for allocating queues through a user space library, in accordance with an embodiment of the invention. Referring to Fig. 2B, there is shown a user space library 220, a completion queue (CQ) 222, a receive queue (RQ) 224, a send queue (SQ) 226, a general receive queue (GRQ) 228, and a kernel 230.

Before performing user space TOE processing on a flow, the user space library 220 may allocate at least one receive queue (RQ) 224, send queue (SQ) 226, general receive queue (GRQ) 228, and completion queue (CQ) 222. The kernel 230 may pin and map the RQ 224, SQ 226, CQ 222, and GRQ 228. The user space library 220 may allocate the RQ 224, SQ 226, CQ 222, and GRQ 228 independently of any specific flow. The kernel 230 may provide a handle that may be used in subsequent requests from the user space library 220.
The send queue (SQ) 226 may allow the user space library 220 to post send work requests directly to a buffer ring that may be read directly by the device. An optional doorbell may provide a memory-mapped location to which the user space library 220 may perform write operations, thereby providing event notifications to the device. The doorbell may signal to the device that a specific send queue 226 is not empty. A send queue 226 with a doorbell may be used for interfacing with RDMA devices. With an RDMA interface, the buffers used in a work request may be referenced by a handle representing a registered buffer, so that the device knows the poster of the work request holds the necessary permission to use that buffer. This handle may be referred to as a steering tag (STag) in RDMA over IP interfaces. The nature of the work requests posted to the send queue 226 differs for a TOE implementation with a user space library 220.
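The doorbell mechanism described above can be illustrated with a small sketch. The class names and the synchronous drain inside `doorbell_write` are illustrative assumptions (a real device drains the ring asynchronously via DMA):

```python
class Device:
    """Stand-in offload device: drains its send queue when the doorbell rings."""
    def __init__(self):
        self.sq = []            # buffer ring the device reads directly
        self.transmitted = []

    def doorbell_write(self, value):
        # A doorbell write is the event notification "this SQ is not empty";
        # a real device would schedule DMA, here we drain synchronously.
        while self.sq:
            self.transmitted.append(self.sq.pop(0))

class UserSpaceLibrary:
    """Posts work requests straight to the device ring, then rings the doorbell."""
    def __init__(self, device):
        self.device = device

    def post_send(self, stag, length):
        # The work request names a registered buffer by its handle (STag),
        # proving the poster may use that buffer.
        self.device.sq.append({"stag": stag, "len": length})
        self.device.doorbell_write(1)   # memory-mapped write in real hardware
```

No kernel transition appears anywhere on this path, which is the point of the user space posting model.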
The receive queue (RQ) 224 may allow the user space library 220 to post receive work requests directly to a buffer pool that may be read directly by the device. By posting a work request, the associated buffers may be referenced by a handle representing a registered buffer. A receive queue 224 supporting user space TOE may differ from a conventional RQ in that buffers may not be pre-associated with a specific inbound message before they are posted. This is because the underlying wire protocol is TCP, and there is no correspondence between work requests and wire protocol messages.

The completion queue (CQ) 222 may allow the user space library 220 to directly receive work completions written by the device. A completion queue 222 supporting user space TOE may differ from a conventional CQ in that a TOE may generate work completions more in order than a CQ 222 supporting RDMA. The completion queue 222 may have an associated mechanism whereby the user process may be notified when work completions, or work completions of a particular type, are generated. For user space TOE, a proxy mechanism may be selected so that the device can still notify the user space library. Because the device cannot directly interrupt user-mode processing, a proxy in the kernel may be required. For example, this notification relay proxy may be implemented by relaying callbacks, signaling a semaphore or thread, or generating an event associated with a file descriptor.
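The arm-and-notify pattern this paragraph describes can be sketched as follows; the `armed` flag and `on_event` callback are illustrative stand-ins for the kernel-proxied notification relay:

```python
class CompletionQueue:
    """Completion queue written by the device and read by the user space library."""
    def __init__(self):
        self.entries = []
        self.armed = False      # set when the library asks for the next event
        self.on_event = None    # relay installed by the kernel proxy

    def device_post(self, completion):
        self.entries.append(completion)
        if self.armed and self.on_event:
            # The device cannot interrupt user mode directly; the kernel
            # proxy relays the event (callback, semaphore, or fd readiness).
            self.armed = False
            self.on_event()

    def poll(self):
        """Drain pending completions without any kernel involvement."""
        drained, self.entries = self.entries, []
        return drained

events = []
cq = CompletionQueue()
cq.on_event = lambda: events.append("wakeup")

cq.device_post({"op": "recv", "nbytes": 512})   # not armed: no notification
first = cq.poll()

cq.armed = True     # request an event immediately after the next completion
cq.device_post({"op": "recv", "nbytes": 256})
```

Polling stays entirely in user space; the relay fires only when the library has explicitly armed it.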
An offload listen table (OLT) may be used to forward TCP segments that do not match a specific flow but do match a local TCP port. The OLT may be integrated with components that support offloaded connections, such as TOE, iSCSI, and RDMA.
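A minimal sketch of the dispatch order implied here, with illustrative names; a real OLT would hold per-port listen state rather than a bare set of ports:

```python
class Dispatcher:
    """Segment dispatch: an exact 4-tuple match goes to an offloaded flow via
    the flow lookup table (FLT); otherwise a local-port match in the offload
    listen table (OLT) is treated as a connection request; anything else is
    left to the host stack."""
    def __init__(self):
        self.flt = {}      # (lip, lport, rip, rport) -> flow state
        self.olt = set()   # local TCP ports with offload listeners

    def classify(self, lip, lport, rip, rport):
        if (lip, lport, rip, rport) in self.flt:
            return "offloaded-flow"
        if lport in self.olt:
            return "offload-listen"
        return "host-stack"
```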
The user space TCP library 220 may be enabled to provide a socket API and/or a socket-like operational interface. Each operation may correspond to a receive message (recvmsg()) or send message (sendmsg()) call, depending on whether the API provided is a literal socket API, a callback API, and/or synchronous operation based on work queues. For example, if the user posts three successive recvmsg() operations, there will be three completions, assuming no errors, and the payload delivered with each completion matches the semantics of recvmsg(). Accordingly, the number of TCP header operations required does not necessarily depend on the number of work request completions the user receives.

There are two different data structures for receive buffers, for example, the RQ 224 and the GRQ 228. If the user space library 220 determines that the CNIC allows it and that its access overhead is reasonable, the RQ 224 may be used for direct reception. On-the-fly pinning may be used where registration proves suitable; otherwise, the user space library 220 may rely on frequently reused buffers of suitable size and on very large buffers.
The GRQ 228 may receive data when no RQ 224 entry is available for an out-of-order packet. An out-of-order packet may be placed in an RQ 224 buffer, but if a PSH operation requires delivering data before the buffer is full, subsequent out-of-order packets must be flushed. Whenever a GRQ 228 buffer is completed, at least one RQ 224 buffer may also be completed, whether or not any data was placed in it. A CQ 222 polling routine in user space may copy any data placed in GRQ 228 buffers into RQ 224 buffers. The RQ 224 buffers are suitable targets for a memory copy operation (memcpy()), even if no registration was performed on them, because cq_poll is a user space routine. The cost of the memcpy() operation is negligible, because the data will have been cached in anticipation of use, and the copy is performed immediately before the application processes the data.
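The user space cq_poll copy described above can be sketched as follows; the class and its buffer bookkeeping are illustrative assumptions:

```python
class GrqToRqCopier:
    """cq_poll in user space: payload that landed in GRQ buffers is copied
    into the application's RQ buffers with a plain memcpy() equivalent."""
    def __init__(self):
        self.grq_buffers = []   # bytearrays filled by the device
        self.rq_buffers = []    # application-posted destination buffers

    def cq_poll(self, completions):
        delivered = []
        for c in completions:
            grq_buf = self.grq_buffers.pop(0)
            rq_buf = self.rq_buffers.pop(0)
            n = c["len"]
            rq_buf[:n] = grq_buf[:n]   # cheap: the data is still cache-warm
            delivered.append(n)
        return delivered

copier = GrqToRqCopier()
copier.grq_buffers.append(bytearray(b"hello world"))
dst = bytearray(16)                 # unregistered application buffer
copier.rq_buffers.append(dst)
lengths = copier.cq_poll([{"len": 11}])
```

Because the copy runs in user space, the destination needs no registration, matching the observation above about memcpy() targets.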
Compared with a copy performed by the host processor on behalf of the user, a copy-to-user operation is more complex than memcpy(), and raises the probability of a cache miss before the application uses the data. When a recvmsg() call is made, the user space library 220 may be enabled to harvest work completions for data left in GRQ 228 buffers, and then determine whether it needs to wait for further completions. If waiting is permitted, the user space library 220 may decide to post a work request to the RQ 224 carrying the registered buffer ID of the destination buffer. Optionally, the user space library 220 may wait for new GRQ 228 completions. As more completions arrive, the user space library 220 may process them until it has enough received data to complete the socket-level operation. Optionally, the user space library 220 may operate on a callback API, in which case it may call up to the next higher layer with an SGL referencing GRQ 228 buffers and/or an indication of buffer ownership and/or urgency of processing. The user space library 220 may process multiple headers, determine that the recvmsg() operation is not yet complete, and continue processing against the same RQ 224 work request.

In accordance with an embodiment of the invention, an RDMA-style interface based on memory-mapped interfaces may be utilized, the elements of which may be referred to as queue pairs (QPs) and completion queues (CQs) 222. A TOE with a user space library differs from a TOE using kernel code in that it is based on a number of data structures that enable the user space library 220 to interact directly with the offload device.
Fig. 3A is a flow chart of an exemplary send or transmit process for user space TOE, in accordance with an embodiment of the invention. Referring to Fig. 3A, there is shown hardware 302, a kernel 304, a user space library 306, and an application 308.

The hardware 302 may comprise suitable logic and/or circuitry that may be enabled to process data received from various drivers and from other devices coupled to the hardware 302. The kernel 304 may comprise suitable logic and/or code that may be enabled to manage CPU 102 system and/or device resources and to run other applications 308 on the host system. The kernel 304 may be enabled to perform functions such as scheduling, buffering, caching, spooling, and error handling. The kernel 304 may also be enabled to implement communication between various hardware and software components. The user space library 306 is a collection of subprograms used to develop software. The user space library 306 allows code and data to be shared and exchanged in a modular fashion.
The application 308 may be enabled to post a send message to the user space library 306. If the send message is short, or below a particular length, the user space library 306 may avoid pinning the application send buffer 132. For a very short send message, it may be posted directly to the hardware using the SQ, without any buffer ID lookup. If the send message is not short, or is greater than the particular length, the user space library 306 may be enabled to copy the application send buffer 132 into a pre-pinned application send buffer 132. The user space library 306 may add the buffer ID of the pre-pinned application send buffer to the send queue (SQ). The hardware 302 may look up the buffer ID to obtain the buffer address. The hardware 302 may be enabled to perform direct memory access (DMA) on the application send buffer. The hardware 302 may perform segmentation and TCP processing on the data in the application send buffer 132, and transmit the resulting packets on the wire.
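The short-versus-long send decision can be sketched as follows; the 128-byte inline threshold and the single pre-pinned bounce buffer are illustrative assumptions, not values from the patent:

```python
INLINE_THRESHOLD = 128   # illustrative cutoff, not from the patent

class SendPath:
    """Sketch of the send decision: very short messages travel inline on the
    SQ, longer ones are copied into a pre-pinned bounce buffer posted by ID."""
    def __init__(self):
        self.sq = []
        self.prepinned = {7: bytearray(4096)}   # buffer ID -> pinned buffer

    def sendmsg(self, data):
        if len(data) <= INLINE_THRESHOLD:
            # No buffer ID lookup needed; payload travels in the work request.
            self.sq.append({"inline": bytes(data)})
        else:
            buf_id, buf = next(iter(self.prepinned.items()))
            buf[:len(data)] = data              # copy into pre-pinned buffer
            self.sq.append({"buffer_id": buf_id, "len": len(data)})
```

Either way, no kernel transition is needed on the fast path; the kernel is involved only when a new buffer must be pinned.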
Fig. 3B is a flow chart of an exemplary send or transmit process for user space TOE when the application buffer is not in the cache, in accordance with an embodiment of the invention. Referring to Fig. 3B, there is shown hardware 302, a kernel 304, a user space library 306, and an application 308.
For example, when the application send buffer 132 is not in the cache, the application 308 may be enabled to post a send message to the user space library 306. The user space library 306 may register the application send buffer 132 and pass the application send buffer 132 to the kernel 304. The kernel 304 may be enabled to pin and map the application send buffer 132. The kernel 304 may return the buffer ID of the application send buffer 132 to the user space library 306. The user space library 306 may associate the buffer virtual address with the received buffer ID and store the buffer ID in the cache. The user space library 306 may post the buffer ID of the pinned and mapped application send buffer 132 to the hardware 302 on the SQ. The hardware 302 may be enabled to look up the buffer ID to obtain the buffer address. The hardware 302 may be enabled to perform direct memory access (DMA) on the data in the application send buffer 132. The hardware 302 may perform segmentation and TCP processing on the data in the application send buffer 132, and transmit the resulting packets on the wire.
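The registration-caching behavior described here can be sketched as follows; the counter of kernel calls is only there to make the cache hit visible:

```python
class RegistrationCache:
    """Caches pin/map registrations keyed by buffer virtual address, so a
    buffer resubmitted by the application is not re-registered."""
    def __init__(self):
        self.cache = {}        # virtual address -> buffer ID
        self.kernel_calls = 0
        self._next_id = 1

    def _kernel_pin_and_map(self, vaddr):
        # Stand-in for the kernel pin/map path; each call is expensive.
        self.kernel_calls += 1
        buf_id, self._next_id = self._next_id, self._next_id + 1
        return buf_id

    def get_buffer_id(self, vaddr):
        if vaddr not in self.cache:
            self.cache[vaddr] = self._kernel_pin_and_map(vaddr)
        return self.cache[vaddr]
```

The same pattern applies to receive buffers later in the text: one kernel round trip on the first submission of a buffer, none thereafter.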
In another embodiment of the invention, the user space library 306 may be enabled to post work requests directly to the send queue (SQ) without assistance from the kernel 304. For example, the method by which the user space library 306 notifies the device that a send queue is no longer empty is the doorbell. A doorbell is an address on the bus; when the user writes to it, the event is notified to the device. Doorbells may also be assigned per flow, so that a single user process can be given write access to an address page. When posting directly to a send queue, the device may verify that a given packet is legal for that send queue. With the source and destination headers reversed, the resulting packet would be assigned to the same FLT entry.

Work requests posted directly to the send queue from the user space library 306 may require buffer IDs that the user has registered. These registered buffer IDs may represent actual buffers registered by the user space library 306, or just-in-time pre-registered buffers into which the user space library 306 copies outbound data. The user space library 306 may be enabled to determine which particular technique to use.

If the destination application receive buffer is not in the registration cache, the user space library 306 may be enabled to register the destination application receive buffer. For example, the user space library 306 may register the receive buffer utilizing a memory registration request. The user space library 306 may cache the registration information of the receive buffer, so that when the application 308 submits the same buffer in a subsequent request, the user space library 306 does not have to repeat the registration process.

In accordance with an embodiment of the invention, a TCP offload engine (TOE) 114 may offload transmission control protocol (TCP) processing of received data to an on-chip processor, for example a NIC processor. Utilizing the user space library 306, the received data may be posted directly to the hardware 302 or the host processor 124, bypassing processing of the received data by the kernel 304. The hardware 302 may be enabled to determine whether the received data is cached in the memory 126. If the received data is not cached in the memory 126, the user space library 306 may be enabled to register at least one application send buffer 132 comprising the received data. The kernel 304 may pin the application send buffer 132 comprising the received data. The user space library 306 may be enabled to post the pinned application send buffer 132 comprising the received data to the hardware 302 or the host processor 124. The user space library 306 may be enabled to associate the virtual address of the application send buffer 132 with its buffer ID, and to store the application send buffer 132 in the cache memory 126. The user space library 306 may pre-post at least one application send buffer comprising the received data to the hardware 302 or the host processor 124.
Fig. 3C is a flow chart of an exemplary send or transmit process for user space TOE when the application buffer is in the cache, in accordance with an embodiment of the invention. Referring to Fig. 3C, there is shown hardware 302, a kernel 304, a user space library 306, and an application 308.

When the application buffer is in the cache, the application 308 may be enabled to post a send message to the user space library 306. The user space library 306 may be enabled to post the buffer ID of the pre-pinned application send buffer 132 to the hardware 302 on the SQ, without using the kernel 304 and without requiring the send message to be short. The hardware 302 may be enabled to look up the buffer ID to obtain the buffer address. The hardware 302 may be enabled to perform direct memory access (DMA) on the data in the application send buffer 132. The hardware 302 may perform segmentation and TCP processing on the data in the application send buffer 132, and transmit the resulting packets on the wire. When the received data is cached in the memory 126, the received data may be posted directly to the hardware 302 or the host processor 124, thereby bypassing processing of the received data by the kernel 304.
Fig. 4A is a flow chart of an exemplary receive process for user space TOE, in accordance with an embodiment of the invention. Referring to Fig. 4A, there is shown hardware 402, a kernel 404, a user space library 406, and an application 408.

The hardware 402 may comprise suitable logic and/or circuitry that may be enabled to process data received from various drivers and from other devices coupled to the hardware 402. The kernel 404 may comprise suitable logic and/or code that may be enabled to manage CPU 102 system and/or device resources and to run other applications 408 on the host system. The kernel 404 may be enabled to perform functions such as scheduling, buffering, caching, spooling, and error handling. The kernel 404 may also be enabled to implement communication between various hardware and software components. The user space library 406 comprises a collection of subprograms used to develop software. The user space library 406 allows code and data to be shared and exchanged in a modular fashion.

The hardware 402 may be enabled to place headers and payloads into receive buffers obtained from the RQ or GRQ. The hardware 402 may translate a buffer ID and offset into the required DMA address, preventing buffer overflow without granting remote write access to destination memory.
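The bounds-checked translation from buffer ID and offset to a DMA address can be sketched as follows; the table layout and the `None` return for an overflowing access are illustrative assumptions:

```python
class BufferTable:
    """Translates (buffer ID, offset) into a DMA address, rejecting any
    access that would run past the registered buffer length."""
    def __init__(self):
        self.table = {}        # buffer ID -> (dma_base, length)

    def register(self, buf_id, dma_base, length):
        self.table[buf_id] = (dma_base, length)

    def translate(self, buf_id, offset, nbytes):
        dma_base, length = self.table[buf_id]
        if offset < 0 or offset + nbytes > length:
            return None        # would overflow the registered buffer
        return dma_base + offset
```

The check confines every DMA write to the registered region, which is why buffer overflow is prevented without exposing destination memory to remote writes.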
The user space library 406 may be enabled to post pre-pinned buffer IDs of application receive buffers 132 to the hardware 402 on the general receive queue (GRQ). The hardware 402 may be enabled to process incoming TCP segments, look up the buffer ID to obtain the buffer address, and place the incoming TCP segments in the pre-posted library buffers. The hardware 402 may be enabled to DMA the received payload into the GRQ buffers 228. The hardware 402 may be enabled to generate a completion queue entry (CQE) in the completion queue (CQ) to indicate reception of a TCP packet. The user space library 406 may be enabled to poll the completion queue (CQ) 222. When the user space library 406 has completed polling the CQ 222, the user space library 406 may request the hardware 402 to generate an event notification immediately after a subsequent completion. The user space library 406 may be enabled to copy pre-pinned application receive buffers into the application receive buffer 132.

Registered buffers may be posted in receive work requests to the receive queue (RQ) and/or the general receive queue (GRQ). The hardware 402 may be enabled to indicate reception of incoming TCP segments into buffers allocated from the RQ or GRQ by posting work completions, or completion queue entries (CQEs), to the completion queue (CQ). When posting to the CQ, the hardware 402 may be enabled to generate a notification event.

The application 408 may be enabled to post a receive message to the user space library 406. The user space library 406 may be enabled to post the pre-pinned buffer IDs of application receive buffers 132 to the hardware 402 on the general receive queue (GRQ). The hardware 402 may be enabled to process an incoming TCP packet, look up the buffer ID to obtain the buffer address, and place the data of the inbound packet in the pre-posted library buffers. The hardware 402 may be enabled to DMA the received payload into the GRQ buffers. The hardware 402 may be enabled to generate a completion queue entry (CQE) in the completion queue (CQ) 137 to indicate reception of the TCP packet. The user space library 406 may be enabled to poll the completion queue (CQ) 137. When the user space library 406 has completed polling the CQ 137, the user space library 406 may request the hardware 402 to generate an event notification immediately after a subsequent completion. The user space library 406 may be enabled to copy the pre-pinned application receive buffers into the application receive buffer 132.
Fig. 4B is a flow chart of an exemplary receive process for user space TOE when the application buffer is not in the cache, in accordance with an embodiment of the invention. Referring to Fig. 4B, there is shown hardware 402, a kernel 404, a user space library 406, and an application 408.

When the application receive buffer 132 is not in the cache, the application 408 may be enabled to post a receive message to the user space library 406. The user space library 406 may register the application receive buffer 132 and pass the application receive buffer 132 to the kernel 404. The kernel 404 may be enabled to pin and map the application receive buffer 132. The kernel 404 may return the buffer ID of the pinned and mapped application receive buffer 132 to the user space library 406. The user space library 406 may associate the buffer virtual address with the received buffer ID and store the buffer ID in the cache. The user space library 406 may post the newly pinned buffer ID to the hardware 402 on the RQ.

The hardware 402 may be enabled to process the incoming TCP packet, look up the buffer ID to obtain the buffer address, and place the data of the incoming TCP packet directly into the corresponding pinned and mapped application receive buffer 132. The hardware 402 may be enabled to DMA the payload of the incoming TCP packet into the RQ buffer. The hardware 402 or the host processor 124 may be enabled to generate a completion queue entry (CQE) in the completion queue (CQ) 137 to indicate reception of the TCP packet. The user space library 406 may be enabled to poll the completion queue (CQ) 137. When the user space library 406 has completed polling the CQ 137, the user space library 406 may request the hardware 402 to generate an event notification immediately after a subsequent completion.
Fig. 4C is a flow chart of an exemplary receive process for user space TOE when the application buffer is in the cache, in accordance with an embodiment of the invention. Referring to Fig. 4C, there is shown hardware 402, a kernel 404, a user space library 406, and an application 408.

When the application buffer is in the cache, the application 408 may be enabled to post a receive message to the user space library 406. The user space library 406 may be enabled to post the pre-pinned buffer ID of the application receive buffer 132 to the hardware 402 on the receive queue (RQ). The hardware 402 may be enabled to process the incoming TCP packet, look up the buffer ID to obtain the buffer address, and place the data of the incoming TCP packet directly into the application receive buffer 132 corresponding to the pre-pinned buffer ID. The hardware 402 may be enabled to DMA the payload of the incoming TCP packet into the RQ buffer. The hardware 402 may be enabled to generate a completion queue entry (CQE) in the completion queue 137 to indicate reception of the TCP packet. The user space library 406 may be enabled to poll the completion queue (CQ) 137. When the user space library 406 has completed polling the CQ 137, the user space library 406 may request the hardware 402 to generate an event notification immediately after a subsequent completion.
In accordance with an embodiment of the invention, a method and system for user space TCP offload may comprise a TCP offload engine (TOE) 114 that offloads transmission control protocol (TCP) processing of received data to an on-chip processor, for example a NIC processor 130. Utilizing the user space library 306, the received data may be posted directly to the hardware 402 or the host processor 124, thereby bypassing processing of the received data by the kernel 304. The hardware 402 may be enabled to determine whether the received data is cached in the memory 126. If the received data is cached in the memory 126, the received data may be posted directly to the hardware 402 or the host processor 124, thereby bypassing processing of the received data by the kernel 304.

If the received data is not cached in the memory, the user space library 306 may register at least one application receive buffer 132 comprising the received data. The kernel 304 may be enabled to pin the application receive buffer 132 comprising the received data. The user space library 306 may be enabled to post the pinned application receive buffer 132 comprising the received data to the hardware 402 or the host processor 124. The user space library 306 may be enabled to associate the application buffer virtual address with the buffer ID of the application receive buffer 132, and to store the application receive buffer 132 in the cache memory 126. The user space library 306 may be enabled to pre-post at least one application receive buffer comprising the received data to the hardware 402 or the host processor 124. The hardware 402 or the host processor 124 may be enabled to indicate reception of the data by updating a completion queue entry in the completion queue 137. After updating the completion queue entry in the completion queue 137, the hardware 402 or the host processor 124 may be enabled to generate an event notification.
Fig. 5 is a flow chart of exemplary steps for completing work requests for transparent TCP offload with a user space library, in accordance with an embodiment of the invention. Referring to Fig. 5, the exemplary steps start at step 502. In step 504, a valid TCP segment identified via the flow lookup table (FLT) may be assigned to user space TTO processing. In step 506, it may be determined whether the FLT entry has a current buffer. The coalescer 111 may utilize the flow lookup table (FLT) to maintain information about current network flows in order to collect and aggregate TCP segments. If the FLT entry has a current buffer, control proceeds to step 508.

In step 508, it may be determined whether the new TCP packet appends to the current buffer. If the new TCP packet does not append to the current buffer, control proceeds to step 510. In step 510, it may be determined whether a buffer can be allocated from the GRQ. If a buffer cannot be allocated from the GRQ, an allocation error may be indicated and control proceeds to step 516. In step 516, the packet is dropped. If a buffer can be allocated from the GRQ, control proceeds to step 512. In step 512, the new packet is placed in the allocated buffer. In step 514, a work completion entry is generated for the allocated buffer. Control then proceeds to end step 538.

In step 508, if the new TCP packet appends to the current buffer, control proceeds to step 518. In step 518, it may be determined whether the new packet fits in the current buffer. If the new packet fits in the current buffer, control proceeds to step 528. In step 528, the new packet may be appended to the current flow. If the new packet does not fit in the current buffer, control proceeds to step 520. In step 520, a work completion entry is generated for the current buffer. In step 522, the current buffer of the FLT entry is set to NULL. In step 524, it may be determined whether the TCP segment is in order and a buffer can be allocated from the RQ 224. If the TCP segment is not in order, or a buffer cannot be allocated from the RQ 224, control proceeds to step 536. In step 536, it may be determined whether a current buffer can be allocated from the GRQ. If a current buffer cannot be allocated from the GRQ, an allocation error may be indicated and control proceeds to step 516. In step 516, the packet is dropped. If a current buffer can be allocated from the GRQ, control proceeds to step 528. If the TCP segment is in order and a buffer can be allocated from the RQ 224, control proceeds to step 526. In step 526, a current buffer is allocated from the RQ 224.

In step 530, it may be determined whether the current buffer is deliverable. If the current buffer is deliverable, control proceeds to step 532. In step 532, a work completion entry may be generated for the current buffer. In step 534, the current buffer of the FLT entry is set to NULL. Control then proceeds to end step 538. If the current buffer is not deliverable, control proceeds to end step 538.
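The decision flow of steps 502-538 can be condensed into a single function; the segment flags (`appends`, `fits`, `in_order`, `deliverable`) stand in for the tests the hardware performs, and the returned event list is an illustrative convention:

```python
def handle_segment(flt, seg, rq, grq):
    """Sketch of the Fig. 5 decision flow for one validated TCP segment.
    `flt` holds 'current' (a buffer or None); `rq`/`grq` are free-buffer lists.
    Returns the events emitted: 'place', 'append', 'completion', 'drop'."""
    events = []

    def allocate(queue):
        return queue.pop(0) if queue else None

    if flt["current"] is None or not seg["appends"]:          # steps 506/508
        if allocate(grq) is None:                             # step 510
            return ["drop"]                                   # step 516
        return ["place", "completion"]                        # steps 512/514

    if seg["fits"]:                                           # step 518
        events.append("append")                               # step 528
    else:
        events.append("completion")                           # step 520
        flt["current"] = None                                 # step 522
        if seg["in_order"] and rq:                            # step 524
            flt["current"] = allocate(rq)                     # step 526
        else:
            buf = allocate(grq)                               # step 536
            if buf is None:
                return events + ["drop"]                      # step 516
            flt["current"] = buf
        events.append("append")                               # step 528

    if seg["deliverable"]:                                    # step 530
        events.append("completion")                           # step 532
        flt["current"] = None                                 # step 534
    return events                                             # step 538
```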
In accordance with an embodiment of the invention, a method and system for user space TCP offload may comprise a host processor 124 that offloads transmission control protocol (TCP) processing of received data to an on-chip processor, for example a NIC processor. Utilizing the user space library 220, the received data may be posted directly to hardware, bypassing kernel processing of the received data. If the received data is not cached in the memory 126, at least one application buffer comprising the received data may be registered. The application buffer may be pinned and posted to the hardware. The virtual address of the application buffer may be associated with the buffer ID of the application buffer, and the application buffer may be stored in the cache memory 126.

If the received data is cached in the memory 126, the received data may be posted directly to hardware, bypassing kernel processing of the received data. At least one application buffer may be pre-posted to the hardware. Reception of data may be indicated by updating a completion queue entry in the hardware. After the completion queue entry in the hardware is updated, an event notification may be generated.
Another embodiment of the invention provides a machine-readable storage having stored thereon a computer program comprising at least one code section executable by a machine, thereby causing the machine to perform the above-described steps for a user space TCP offload engine.
Therefore, available hardware of the present invention, software or software and hardware combining realize.The present invention can realize under the centralized environment of at least one computer system, also can realize under each element is distributed in the distributed environment of different interconnective computer systems.The equipment that the computer system of any kind of or other are suitable for carrying out the method for the invention all is fit to use the present invention.The example of software and hardware combining can be the general-purpose computing system that has certain computer program, but when being written into and moving this computer program, the may command computer system is carried out method of the present invention.
The present invention also can be built in the computer program, wherein comprises all properties that can realize the method for the invention, and can carry out these methods when it is loaded into computer system.Computer program in this context is meant any expression formula of the instruction set of writing with any language, code or symbol, can make the system that has the information processing function directly carry out specific function or carry out specific function after finishing following one or two: a) be converted to other Languages, code or symbol; B) regenerate with other form.
While the present invention has been described with reference to certain embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed herein, but that the present invention will include all embodiments falling within the scope of the appended claims.

Claims (10)

1. A method for processing network information, the method comprising:
offloading transmission control protocol (TCP) processing of received data to an on-chip offload processor; and
posting the received data directly to hardware, bypassing kernel processing of the received data, utilizing a user space library.
2. The method according to claim 1, further comprising determining whether the received data is cached in memory.
3. The method according to claim 2, further comprising, if the received data is not cached in the memory, registering at least one application buffer comprising the received data.
4. The method according to claim 3, further comprising, if the received data is not cached in the memory, pinning the at least one application buffer.
5. The method according to claim 4, wherein, if the received data is not cached in the memory, the at least one pinned application buffer comprising the received data is posted directly to the hardware.
6. The method according to claim 3, further comprising:
adding an application buffer virtual address to a buffer ID for the at least one application buffer; and
storing the at least one application buffer in a cache.
7. A system for processing network information, the system comprising:
circuitry that offloads transmission control protocol (TCP) processing of received data to an on-chip offload processor; and
circuitry that posts the received data directly to hardware, bypassing kernel processing of the received data, utilizing a user space library.
8. The system according to claim 7, further comprising circuitry that determines whether the received data is cached in memory.
9. The system according to claim 8, further comprising circuitry operable to register at least one application buffer comprising the received data, when the received data is not cached in the memory.
10. The system according to claim 9, further comprising circuitry operable to pin the at least one application buffer, when the received data is not cached in the memory.
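The buffer handling recited in claims 2 through 6 — check whether the received data's buffer is already cached; if not, register it, pin it, post it to the hardware, and remember it under a buffer ID mapped to its virtual address — can be sketched in C as follows. This is an illustrative model only; the names (`buffer_cache`, `register_and_post`, the `pinned`/`posted` flags) are assumptions, and the pin and post steps are stand-ins for what a real library would do with page pinning and a hardware doorbell.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BUF_CACHE_SLOTS 8

/* One registered application buffer as tracked by the library. */
struct app_buffer {
    void    *virt_addr;  /* application virtual address */
    size_t   length;
    uint32_t buffer_id;  /* ID the hardware uses for this buffer */
    bool     pinned;     /* pages locked so the DMA target stays resident */
    bool     posted;     /* handed to the hardware receive path */
};

/* Cache of registrations, keyed by virtual address. */
struct buffer_cache {
    struct app_buffer slots[BUF_CACHE_SLOTS];
    size_t   count;
    uint32_t next_id;
};

/* Look up a previously registered buffer by its virtual address. */
static struct app_buffer *cache_lookup(struct buffer_cache *c, void *va)
{
    for (size_t i = 0; i < c->count; i++)
        if (c->slots[i].virt_addr == va)
            return &c->slots[i];
    return NULL;
}

/* Register, pin, and post a buffer, then cache it so the next receive
 * into the same address skips all three steps. */
static struct app_buffer *register_and_post(struct buffer_cache *c,
                                            void *va, size_t len)
{
    struct app_buffer *b = cache_lookup(c, va);
    if (b != NULL)
        return b;                 /* already cached: fast path */
    if (c->count == BUF_CACHE_SLOTS)
        return NULL;              /* cache full; a real library would
                                     evict and unpin an old entry here */
    b = &c->slots[c->count++];
    b->virt_addr = va;            /* virtual address kept with ... */
    b->buffer_id = c->next_id++;  /* ... the buffer ID it maps to  */
    b->length    = len;
    b->pinned    = true;          /* stand-in for page pinning */
    b->posted    = true;          /* stand-in for a doorbell write */
    return b;
}
```

The point of the cache is the fast path: registration and pinning are expensive one-time operations, so repeated receives into the same application buffer should hit `cache_lookup` and avoid them.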
CNA2007100971918A 2006-05-01 2007-04-12 Method and system for processing network information Pending CN101123567A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US79648606P 2006-05-01 2006-05-01
US60/796,486 2006-05-01
US11/489,407 2006-07-18

Publications (1)

Publication Number Publication Date
CN101123567A true CN101123567A (en) 2008-02-13

Family

ID=39085749

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2007100971918A Pending CN101123567A (en) 2006-05-01 2007-04-12 Method and system for processing network information

Country Status (1)

Country Link
CN (1) CN101123567A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102077184A (en) * 2008-07-01 2011-05-25 汤姆逊许可公司 Method and device to perform direct memory access
CN101827072B (en) * 2008-06-09 2014-03-12 飞塔公司 Method for segmentation offloading and network device
CN104683430A (en) * 2013-07-08 2015-06-03 英特尔公司 Techniques To Initialize From A Remotely Accessible Storage Device
WO2021208101A1 (en) * 2020-04-17 2021-10-21 华为技术有限公司 Stateful service processing method and apparatus
CN115914424A (en) * 2022-11-10 2023-04-04 深圳市汇川技术股份有限公司 Network data real-time transmission method, device, equipment and readable storage medium


Similar Documents

Publication Publication Date Title
EP1868093B1 (en) Method and system for a user space TCP offload engine (TOE)
US7757232B2 (en) Method and apparatus for implementing work request lists
US6510164B1 (en) User-level dedicated interface for IP applications in a data packet switching and load balancing system
US7404190B2 (en) Method and apparatus for providing notification via multiple completion queue handlers
US7596628B2 (en) Method and system for transparent TCP offload (TTO) with a user space library
US6272136B1 (en) Pseudo-interface between control and switching modules of a data packet switching and load balancing system
US6424621B1 (en) Software interface between switching module and operating system of a data packet switching and load balancing system
US7103888B1 (en) Split model driver using a push-push messaging protocol over a channel based network
US7089289B1 (en) Mechanisms for efficient message passing with copy avoidance in a distributed system using advanced network devices
US6799200B1 (en) Mechanisms for efficient message passing with copy avoidance in a distributed system
US20030158906A1 (en) Selective offloading of protocol processing
US11025564B2 (en) RDMA transport with hardware integration and out of order placement
US20170255501A1 (en) In-node Aggregation and Disaggregation of MPI Alltoall and Alltoallv Collectives
US11068412B2 (en) RDMA transport with hardware integration
WO2000030322A2 (en) Computer data packet switching and load balancing system using a general-purpose multiprocessor architecture
CA2432387A1 (en) Method and apparatus for controlling flow of data between data processing systems via a memory
CA2432390A1 (en) Method and apparatus for controlling flow of data between data processing systems via a memory
JP2010183450A (en) Network interface device
WO2002061590A1 (en) Method and apparatus for transferring interrupts from a peripheral device to a host computer system
CN101123567A (en) Method and system for processing network information
US8566833B1 (en) Combined network and application processing in a multiprocessing environment
CN114726929A (en) Connection management in a network adapter
US7089282B2 (en) Distributed protocol processing in a data processing system
CN101827088A (en) Realization method of basic communication protocol based on CPU (Central Processing Unit) bus interconnection
Kim et al. Building a high-performance communication layer over virtual interface architecture on Linux clusters

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
AD01 Patent right deemed abandoned

Effective date of abandoning: 20080213

C20 Patent right or utility model deemed to be abandoned or is abandoned