CN115174500B - PISA-based transmitting node and switch for intra-network combined transmission - Google Patents

PISA-based transmitting node and switch for intra-network combined transmission

Info

Publication number
CN115174500B
Authority
CN
China
Prior art keywords
packet
sliding window
sequence number
switch
key
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210561537.XA
Other languages
Chinese (zh)
Other versions
CN115174500A (en)
Inventor
吴文斐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202210561537.XA
Publication of CN115174500A
Application granted
Publication of CN115174500B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/90 Buffering arrangements
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L49/00 Packet switching elements
    • H04L49/10 Packet switching elements characterised by the switching fabric construction

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention provides a PISA-based transmitting node for intra-network aggregation transmission, which performs the following operations: dividing the packet sequence number space into segments according to the maximum size of the sliding window, and numbering the segments to determine whether a packet lies in an odd segment or an even segment; and transmitting the latest packet according to the size of a sliding window centered on the latest packet sequence number. The invention also provides a PISA-based switch for intra-network aggregation transmission, which performs the following operations: after receiving a packet from the sending node, parsing it to obtain the worker thread ID and the sequence number of the packet; and determining from the sequence number whether the packet falls within the current sliding window, and marking or processing it accordingly, wherein the sliding window is centered on the latest packet sequence number.

Description

PISA-based transmitting node and switch for intra-network combined transmission
Technical Field
The present invention relates to the field of computer communications, and more particularly, to a PISA-based transmitting node and switch for intra-network aggregation transmission.
Background
Because the underlying network is unreliable, an intra-network aggregation transmission protocol must address reliability, for example through the co-design of the end hosts and the switch and through the design of the state kept in the switch. Academia has proposed two solutions to the reliability problem, SwitchML and ATP. SwitchML supports only a single rack and a single tenant; to achieve reliability, it keeps shallow copies of previous aggregation results in the switch and records version numbers, so that completed aggregation results are not lost when aggregator resources are released. ATP uses a packet loss detection technique similar to that of traditional TCP: when three consecutive ACK packets with mismatched sequence numbers are received, or a timeout occurs, the packet is retransmitted. Retransmitted packets are not aggregated in the network, so no additional state needs to be recorded in the switch, and the corresponding aggregation falls back to the parameter server.
Disclosure of Invention
In view of the problems in the background art, the present invention proposes a transmitting node for intra-network aggregation transmission, the node having a computer program which, when run, performs the following operations: dividing the packet sequence number space into segments according to the maximum size of the sliding window, and numbering the segments to determine whether a packet lies in an odd segment or an even segment; and transmitting the latest packet according to the size of a sliding window centered on the latest packet sequence number.
The invention also proposes a switch for intra-network aggregation transmission, the switch having a computer processing program which, when run, performs the following operations: after receiving a packet from the sending node, parsing it to obtain the worker thread ID and the sequence number of the packet; and determining from the sequence number whether the packet falls within the current sliding window, and marking or processing it accordingly, wherein the sliding window is centered on the latest packet sequence number.
The invention adopts a transmit, aggregate, and pull approach to achieve reliable transmission. It works within the switch's limits on memory resources, number of read/write accesses, and number of stages.
The invention records the state of each sender in the switch efficiently, greatly reducing the use of scarce switch memory while ensuring reliable transmission.
The invention partitions the key space so that the same key does not occupy different aggregators, which avoids wasting aggregator resources and allows each packet to carry more key-value pairs, thereby improving throughput and saturating the network bandwidth.
The invention handles variable-length keys in the packet payload, so it can cope with flexible and changing workloads, which makes the system more general.
Drawings
For easier understanding of the present invention, the present invention will be described in more detail by referring to specific embodiments shown in the drawings. These drawings depict only typical embodiments of the invention and are not therefore to be considered to limit the scope of the invention.
Fig. 1 is a flow chart of the operation of one embodiment of a transmitting node of the present invention.
Fig. 2 is a flow chart of the operation of another embodiment of the switch of the present invention.
Fig. 3 is a schematic diagram of one embodiment of a general packet data format in the present invention.
Detailed Description
Embodiments of the present invention will be described below with reference to the accompanying drawings so that those skilled in the art can better understand the present invention and implement it, but the examples listed are not limiting to the present invention, and the following examples and technical features of the examples can be combined with each other without conflict, wherein like parts are denoted by like reference numerals.
In a device (e.g., a switch) that uses an intra-network aggregation transmission protocol, not all key-value pairs in a packet are aggregated, so the switch must record the processing result of each key-value pair to ensure reliability. The space complexity of recording this state is O(num_sender × avg_num_repetition), where num_sender is the number of senders and avg_num_repetition is the number of key-value pairs that need to be aggregated. To reduce the space complexity, the invention uses two methods: (1) the transmitting node sends packets using a sliding window; (2) the switch records aggregation state only for as many packets as the sliding window size.
First embodiment
A transmitting node sends packets in a sliding-window fashion. As shown in Fig. 1, it performs the following operations.
S1, dividing the sequence number space of the packets into segments according to the maximum size max_cwnd of the sliding window, and numbering the segments to determine whether a packet lies in an odd segment (sen_odd) or an even segment (sen_even).
In one embodiment, each worker thread uses a 32-bit counter that increases with wraparound as the packet sequence number. Each packet has its own sequence number, sequence. The parity of sequence / max_cwnd indicates whether the packet lies in an odd or an even segment; every segment is either even or odd. sequence % max_cwnd is the packet's offset within its segment.
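The parity and offset computation described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `segment_info` and the window size of 4 are assumptions chosen for the example, and integer division is assumed for sequence / max_cwnd.

```python
MAX_CWND = 4  # hypothetical maximum window size, chosen only for illustration


def segment_info(sequence: int, max_cwnd: int = MAX_CWND) -> tuple[str, int]:
    """Return (parity, offset) of a packet sequence number.

    parity is "even" or "odd" depending on the segment index
    sequence // max_cwnd; offset is the position inside that segment.
    """
    sequence &= 0xFFFFFFFF          # sequence numbers are 32-bit and wrap around
    segment = sequence // max_cwnd  # index of the segment containing the packet
    parity = "odd" if segment % 2 else "even"
    offset = sequence % max_cwnd    # offset of the packet within its segment
    return parity, offset


if __name__ == "__main__":
    for seq in (0, 3, 4, 5, 9):
        print(seq, segment_info(seq))
```

With a window size of 4, sequence numbers 0 to 3 form segment 0 (even), 4 to 7 form segment 1 (odd), and so on.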
The sliding window is [most_front_seq - max_cwnd, most_front_seq + max_cwnd]. Since the data before most_front_seq has already been received successfully, only data after most_front_seq can be sent, and the amount sent is bounded by max_cwnd; that is, at most max_cwnd packets can be sent.
The whole sequence number space is segmented according to max_cwnd. Within the sliding window [most_front_seq - max_cwnd, most_front_seq + max_cwnd], max_cwnd sequence numbers lie in an odd segment and max_cwnd sequence numbers lie in an even segment.
In one embodiment, the most_front_seq, sen_even, and sen_odd of all worker threads are statically allocated as large arrays, and each worker thread uses its thread_id to select its own values and range: the worker thread whose ID is thread_id is assigned the range from max_cwnd × thread_id (inclusive) to max_cwnd × (thread_id + 1) (exclusive).
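One way to picture the static allocation is a set of flat arrays in which every worker thread owns a contiguous slice. The sketch below is an assumption-laden illustration: the names `thread_range` and `mark_seen`, the thread count, and the choice to keep a single most_front_seq entry per thread are not taken from the source.

```python
NUM_THREADS = 3  # hypothetical number of worker threads
MAX_CWND = 4     # hypothetical maximum window size

# Flat, statically sized state shared by all worker threads.
most_front_seq = [0] * NUM_THREADS        # assumed: one value per thread
sen_even = [0] * (NUM_THREADS * MAX_CWND)  # one bit per offset, even segments
sen_odd = [0] * (NUM_THREADS * MAX_CWND)   # one bit per offset, odd segments


def thread_range(thread_id: int, max_cwnd: int = MAX_CWND) -> range:
    """Indices owned by a thread: [max_cwnd*thread_id, max_cwnd*(thread_id+1))."""
    start = max_cwnd * thread_id
    return range(start, start + max_cwnd)


def mark_seen(thread_id: int, odd: bool, offset: int) -> None:
    """Set the bit for one packet inside the slice owned by thread_id."""
    index = thread_range(thread_id)[offset]
    (sen_odd if odd else sen_even)[index] = 1


if __name__ == "__main__":
    mark_seen(thread_id=2, odd=True, offset=1)
    print(list(thread_range(2)), sen_odd)
```

Because each slice is disjoint, threads never touch each other's bits, and no per-thread allocation is needed at run time.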
S2, transmitting the latest packet according to the size of the sliding window, wherein the sequence number of the latest packet is most_front_seq.
Because the sequence number space has been segmented, the sliding window of a worker thread of the sending node occupies at most two adjacent segments, one odd and one even.
In the switch, only packets with sequence numbers in [most_front_seq - max_cwnd, most_front_seq + max_cwnd] can still be observed in the future (described in detail below).
Second embodiment
For the switch, the core idea for reducing switch memory usage is that only packets whose sequence numbers lie within a certain range will be seen in the switch. Specifically, as shown in Fig. 2, the switch performs the following operations.
S1, after receiving a packet (sent by a worker thread of some sending node), parsing it to obtain the worker thread ID (thread_id) and the sequence number (sequence) of the packet.
Each worker thread of each sending node has a globally unique ID, and the packets sent by a worker thread carry this ID. For example, if each transmitting node has at most N worker threads, the j-th thread of the i-th sender has the ID i × N + j.
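The ID scheme i × N + j can be illustrated with a small helper. The function name and the bounds check are assumptions added for the example; zero-based indices are assumed for both i and j.

```python
def global_thread_id(sender_index: int, thread_index: int,
                     threads_per_sender: int) -> int:
    """Globally unique worker thread ID: i * N + j.

    sender_index is i, thread_index is j, threads_per_sender is N.
    The formula is only unique while 0 <= j < N, so that is checked here.
    """
    if not 0 <= thread_index < threads_per_sender:
        raise ValueError("thread_index must be in [0, threads_per_sender)")
    return sender_index * threads_per_sender + thread_index


if __name__ == "__main__":
    # Two senders with up to 8 threads each never share an ID.
    ids = {global_thread_id(i, j, 8) for i in range(2) for j in range(8)}
    print(sorted(ids))
```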
S2, determining from the sequence number of the packet whether the packet falls within the current sliding window, and marking or processing it accordingly.
Preferably, the sliding window is [most_front_seq - max_cwnd, most_front_seq + max_cwnd].
In one embodiment, because current PISA switches do not natively support the % and / operations, the invention chooses max_cwnd to be a binary value with 0s on the left and 1s on the right (similar to a mask, e.g., 11₂, 1111₂). The parity is then determined by sequence & (max_cwnd + 1), and the offset is sequence & max_cwnd.
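The mask trick can be checked against ordinary division in a few lines. This is a sketch under the stated assumption that max_cwnd has the form 2^k - 1; the function name is hypothetical. Note that with such a mask the segment length implied by the bit operations is max_cwnd + 1.

```python
def parity_and_offset(sequence: int, max_cwnd: int) -> tuple[bool, int]:
    """Compute (is_odd_segment, offset) using only bitwise AND.

    max_cwnd must be an all-ones mask such as 0b11 or 0b1111, so that
    max_cwnd + 1 is a power of two and isolates the segment parity bit.
    """
    if max_cwnd <= 0 or max_cwnd & (max_cwnd + 1):
        raise ValueError("max_cwnd must be of the form 2**k - 1")
    is_odd = (sequence & (max_cwnd + 1)) != 0  # lowest bit of the segment index
    offset = sequence & max_cwnd               # low bits give the offset
    return is_odd, offset


if __name__ == "__main__":
    mask = 0b11  # segments of length 4
    for seq in range(10):
        # Same result as the division-based definition.
        assert parity_and_offset(seq, mask) == ((seq // 4) % 2 == 1, seq % 4)
        print(seq, parity_and_offset(seq, mask))
```

The AND with max_cwnd + 1 picks out exactly the bit that toggles each time a segment boundary is crossed, which is why no division is needed.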
In particular, if the packet does not fall within the sliding window, the packet was delayed in the network but has already been observed. This information is returned to the transmitting node, which can then send the next packet, and most_front_seq is updated with the new sequence number. The parity and the in-segment offset of the current packet are calculated. According to the parity, the corresponding sen array (sen_even or sen_odd) is used to check whether the current packet has been observed before, and the packet's bit in that array is set. Importantly, the bit with the same offset in the other segment is reset. If the packet was observed before, its earlier aggregation result is copied directly; otherwise, the switch aggregates the packet and stores the aggregation result.
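The per-packet logic can be modeled in software roughly as below. This is one possible reading of the text, not the switch program itself: the class and function names, the status strings, the use of division instead of masks, the choice to advance most_front_seq only for in-window packets, and the omission of 32-bit wraparound are all assumptions made to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class ThreadState:
    """Switch-side state for one worker thread (hypothetical layout)."""
    max_cwnd: int
    most_front_seq: int = 0
    seen: list = field(init=False)     # seen[0] = even segment, seen[1] = odd
    results: list = field(init=False)  # stored first-aggregation results

    def __post_init__(self) -> None:
        self.seen = [[False] * self.max_cwnd for _ in range(2)]
        self.results = [[None] * self.max_cwnd for _ in range(2)]


def handle_packet(state: ThreadState, sequence: int,
                  aggregate: Callable[[int], Any]) -> tuple[str, Optional[Any]]:
    lo = state.most_front_seq - state.max_cwnd
    hi = state.most_front_seq + state.max_cwnd
    if not lo <= sequence <= hi:
        # Outside the window: report back so the sender can move on.
        return "out_of_window", None
    state.most_front_seq = max(state.most_front_seq, sequence)
    parity = (sequence // state.max_cwnd) % 2
    offset = sequence % state.max_cwnd
    was_seen = state.seen[parity][offset]
    state.seen[parity][offset] = True       # mark the current packet
    state.seen[1 - parity][offset] = False  # reset same offset, other segment
    if was_seen:
        return "duplicate", state.results[parity][offset]  # copy old result
    result = aggregate(sequence)            # first time: aggregate and store
    state.results[parity][offset] = result
    return "aggregated", result


if __name__ == "__main__":
    st = ThreadState(max_cwnd=4)
    for seq in (0, 0, 4, 8, 0):
        print(seq, handle_packet(st, seq, lambda s: f"agg{s}"))
```

Resetting the bit in the other segment is what lets two bit arrays of window size cover an unbounded sequence space: by the time a slot is reused, the sequence number that previously mapped to it has left the sender's window.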
In one embodiment, the three per-thread variables most_front_seq, sen_even, and sen_odd of the sending nodes' worker threads may be stored in the switch as large arrays, with each worker thread's thread_id determining its own values and range: the worker thread whose ID is thread_id is assigned the range from max_cwnd × thread_id (inclusive) to max_cwnd × (thread_id + 1) (exclusive).
With this method, key-value pairs in a packet that have already been aggregated by the switch are prevented from being aggregated again by the switch or the receiving end, while key-value pairs that have not been aggregated by the switch continue to be forwarded to the receiving end for a second check. To this end, the switch records state for each transmitting node, comprising whether a packet has appeared and the result of its first aggregation.
Third embodiment
In this embodiment, the key space is partitioned in the switch to avoid wasting aggregator resources, reduce consumption of switch capacity, and improve throughput.
Fig. 3 shows an embodiment of the general packet data format, in which the header contains control information and the payload field contains the key-value pairs to be aggregated.
Specifically, the task number field (task_id field) stores the number of the aggregation task. The sending node number field (snd_id field) stores the number of the node that sends the packet. The thread number field (thread_id field) stores the number of the specific thread that sent the packet.
The packet type field (type field) stores the type of the packet, which is one of SYN, FIN, DATA, ACK, QUERY, and RESET. SYN indicates start, FIN indicates end, DATA indicates data, ACK indicates acknowledgement, QUERY indicates a query, and RESET indicates a reset.
The packet sequence number field (sequence field) stores the sequence number of the packet sent by the thread. The payload field contains the key-value pairs to be aggregated, and there may be several of them.
The validity field (bitmap field) indicates which key-value pairs in the payload field are valid. In one embodiment, the bitmap field has as many bits as there are key-value pairs in the payload field, and each bit indicates whether the corresponding key-value pair is valid (e.g., 1 indicates valid, i.e., processed). The payload field stores key-value pairs; keys are compared during aggregation, and the bitmap field of the general packet data format is used to determine whether a value has been aggregated by the switch.
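The packet layout can be modeled as a simple record for experimentation. The sketch below is illustrative only: field widths are not given in the text and are therefore left as plain integers, and the bit order of the bitmap (bit i corresponds to payload position i, least significant bit first) is an assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto


class PacketType(Enum):
    SYN = auto()    # start
    FIN = auto()    # end
    DATA = auto()   # data
    ACK = auto()    # acknowledgement
    QUERY = auto()  # query
    RESET = auto()  # reset


@dataclass
class Packet:
    task_id: int      # aggregation task number
    snd_id: int       # sending node number
    thread_id: int    # worker thread that sent the packet
    type: PacketType
    sequence: int     # per-thread packet sequence number
    bitmap: int       # one validity bit per payload position (assumed LSB first)
    payload: list[tuple[int, int]]  # key-value pairs to aggregate

    def is_valid(self, position: int) -> bool:
        """True if the key-value pair at this payload position is valid."""
        return bool((self.bitmap >> position) & 1)

    def valid_pairs(self) -> list[tuple[int, int]]:
        return [kv for i, kv in enumerate(self.payload) if self.is_valid(i)]


if __name__ == "__main__":
    pkt = Packet(1, 0, 3, PacketType.DATA, 42, 0b101, [(7, 1), (8, 2), (9, 3)])
    print(pkt.valid_pairs())
```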
Typically, a packet contains only one key-value pair, and such small packets cannot saturate the network bandwidth. PISA switches therefore face the challenge of handling multiple key-value pairs in a single pass through the Match-Action stages of a programmable switch. The pipeline of a PISA switch is organized into stages, and in one pass a packet flows through the stages in one direction only. Although a packet can be recirculated through the stages of the switch, this is not recommended because it consumes switch capacity.
Because of limits on programmability, an aggregator array can be accessed only once in each stage, but multiple arrays can be built in one stage (the number of arrays is denoted R; R = 4 in Tofino), so each stage can hold R aggregator arrays. Memory is isolated between stages (the number of stages is denoted W; W = 12 in Tofino). Therefore, at most W × R aggregator arrays can be built in the pipeline of a programmable switch.
While a packet is processed, the key-value pairs in the payload can only be processed in a fixed order, and each aggregator array can be accessed only once. The positions of key-value pairs in the payload therefore need to be associated with different arrays. The same key should always be placed at the same position in the payload; otherwise, aggregators in different aggregator arrays would handle the same key, and this key would occupy multiple aggregators, wasting switch memory.
The invention divides the key space into mutually non-overlapping subspaces, each subspace being associated with a key-value position in the payload and thus with an aggregator array in the switch. Each transmitting node classifies the key-value pairs into corresponding subspaces and reorganizes their locations in the packet payload.
In one embodiment, the transmitting node constructs packets as follows. The payload buffer buff and the bitmap are two-dimensional arrays whose first dimension is the index of the packet and whose second dimension is the position of the key-value pair in the payload. Key-value pairs are processed sequentially: the subspace index of each pair is extracted first, and the pair is then filled into the payload position corresponding to that subspace. The packets are finally constructed from the filled payload and bitmap.
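The packet construction procedure can be sketched as follows. The partition function is not specified in the text, so `subspace_of` (key modulo the number of payload positions) is a stand-in assumption, as are the function names and the `next_free` bookkeeping.

```python
def subspace_of(key: int, num_slots: int) -> int:
    """Hypothetical key-space partition: subspace index = key mod num_slots."""
    return key % num_slots


def build_packets(pairs, num_slots: int):
    """Fill key-value pairs into payload positions chosen by key subspace.

    Returns (buff, bitmap): buff[p][s] is the pair placed in packet p at
    payload position s (or None), and bitmap[p][s] is 1 if that slot is used.
    """
    buff, bitmap = [], []
    next_free = [0] * num_slots  # next packet with a free slot per subspace
    for key, value in pairs:     # pairs are processed sequentially
        s = subspace_of(key, num_slots)
        p = next_free[s]
        if p == len(buff):       # need a new packet
            buff.append([None] * num_slots)
            bitmap.append([0] * num_slots)
        buff[p][s] = (key, value)
        bitmap[p][s] = 1
        next_free[s] += 1
    return buff, bitmap


if __name__ == "__main__":
    buff, bitmap = build_packets([(0, 10), (1, 11), (4, 14), (2, 12)], 4)
    for payload, bits in zip(buff, bitmap):
        print(payload, bits)
```

Because a key always lands in the payload position of its subspace, it always reaches the same aggregator array in the switch, which is the property the partition is meant to guarantee.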
Fourth embodiment
In this embodiment, aggregators in multiple stages are connected in series in the switch to support variable-length keys.
In a PISA switch, each aggregator is implemented as a register in the switch. A register holds at most 64 bits, of which 32 bits are used as the key and 32 bits as the value. However, longer keys may occur in real workloads, so the invention concatenates the aggregator arrays of adjacent stages to support longer keys.
When multiple aggregator arrays process a key-value pair with a longer key, the key-value pair is mapped to the same index in all arrays, and the aggregator in each array compares its stored key with the corresponding portion of the key in the key-value tuple. The comparison results of the earlier stages are carried in the packet's metadata and accumulated (the comparison result of the current array is ANDed with the packet's metadata) up to the last array. If all keys in the aggregators match the corresponding portions of the key in the key-value tuple, the value of the tuple is aggregated in the last array; otherwise the tuple is not processed. Further, key-value tuples with particularly long keys may be appended to the end of the packet and fall back to the receiver for aggregation.
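A software model of the chained comparison might look like the sketch below. Several details are assumptions not stated in the text: the class name, the CRC32 index function, zero-padding of short keys, summation as the aggregation function, and the rule that an empty slot is claimed by the first key that reaches it.

```python
import zlib


class ChainedAggregators:
    """Model of adjacent-stage aggregator arrays holding 32-bit key parts."""

    def __init__(self, num_arrays: int, size: int) -> None:
        self.num_arrays, self.size = num_arrays, size
        self.key_parts = [[None] * size for _ in range(num_arrays)]
        self.values = [0] * size  # values live in the last array

    def _parts(self, key: bytes) -> list[bytes]:
        width = 4 * self.num_arrays
        padded = key.ljust(width, b"\0")  # assumed: short keys are zero-padded
        return [padded[i:i + 4] for i in range(0, width, 4)]

    def aggregate(self, key: bytes, value: int) -> bool:
        """Return True if the value was aggregated in the switch model."""
        if len(key) > 4 * self.num_arrays:
            return False                   # too long: left to the receiver
        index = zlib.crc32(key) % self.size  # same index in every array
        matched = True                     # metadata carried along the pipeline
        for array, part in zip(self.key_parts, self._parts(key)):
            if array[index] is None:
                array[index] = part        # assumed: claim an empty slot
            matched = matched and (array[index] == part)  # AND of comparisons
        if matched:
            self.values[index] += value    # aggregate only in the last array
        return matched


if __name__ == "__main__":
    agg = ChainedAggregators(num_arrays=2, size=1)
    print(agg.aggregate(b"abcdefgh", 5), agg.aggregate(b"abcdefgh", 7))
    print(agg.aggregate(b"abcdXXXX", 1), agg.values)
```

In the example, the third key shares its first four bytes with the stored key, so the first array reports a match, but the second array does not and the ANDed result prevents aggregation.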
The foregoing describes only preferred embodiments of the invention. The phrases "in one embodiment," "in another embodiment," "in yet another embodiment," or "in other embodiments" in this specification may each refer to one or more of the same or different embodiments in accordance with the present disclosure. Common variations and substitutions made by those skilled in the art within the scope of the present invention are intended to be included in the scope of protection of the present invention.

Claims (1)

1. A PISA-based switch for intra-network aggregation transport, the switch having a computer processing program which, when run, performs the following operations:
after receiving a packet from the sending node, parsing it to obtain the worker thread ID of the packet and the sequence number of the packet;
determining from the sequence number of the packet whether the packet falls within the current sliding window, and marking or processing it accordingly, wherein the packet comprises control information and a payload, the payload comprises key-value pairs, the control information comprises a validity field, and the validity field indicates whether the key-value pairs in the payload field have been aggregated;
the sliding window is [most_front_seq - max_cwnd, most_front_seq + max_cwnd], wherein most_front_seq represents the latest packet sequence number, and max_cwnd represents the maximum size of the sliding window;
if the packet does not fall within the sliding window, the switch informs the sending node to send the next packet, and the latest packet sequence number is updated with the sequence number of the new packet; after receiving the packet, dividing the key space for storing the key-value pairs of the packet into mutually non-overlapping subspaces, each subspace being associated with a key-value position in the payload of the packet;
wherein the sliding window centers on the latest packet sequence number, and the packet of the transmitting node is transmitted through the sliding window.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210561537.XA CN115174500B (en) 2022-05-23 2022-05-23 PISA-based transmitting node and switch for intra-network combined transmission

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210561537.XA CN115174500B (en) 2022-05-23 2022-05-23 PISA-based transmitting node and switch for intra-network combined transmission

Publications (2)

Publication Number Publication Date
CN115174500A (en) 2022-10-11
CN115174500B (en) 2023-09-12

Family

ID=83483778

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210561537.XA Active CN115174500B (en) 2022-05-23 2022-05-23 PISA-based transmitting node and switch for intra-network combined transmission

Country Status (1)

Country Link
CN (1) CN115174500B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101316157A (en) * 2008-01-17 2008-12-03 上海交通大学 Self-adapting packet length method based on floating point window increment factor
CN103782534A (en) * 2011-09-06 2014-05-07 阿尔卡特朗讯公司 A method for avoiding network congestion and an apparatus thereof
CN109155763A (en) * 2016-05-11 2019-01-04 微软技术许可有限责任公司 Digital Signal Processing in data flow

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10579661B2 (en) * 2013-05-20 2020-03-03 Southern Methodist University System and method for machine learning and classifying data
CN108270682B (en) * 2016-12-30 2022-06-24 华为技术有限公司 Message transmission method, terminal, network equipment and communication system
US11194812B2 (en) * 2018-12-27 2021-12-07 Microsoft Technology Licensing, Llc Efficient aggregation of sliding time window features

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
"ATP: In-network Aggregation for Multi-tenant Learning"; 吴文斐; the 18th USENIX Symposium on Networked Systems Design and Implementation; full text *

Also Published As

Publication number Publication date
CN115174500A (en) 2022-10-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant