Techniques for High Performance Matching
Loading...
Date
2021-06-03
Authors
Journal Title
Journal ISSN
Volume Title
Publisher
Abstract
With the growth of big data application demands, improving high-performance computing (HPC) becomes an essential industry task. High-performance matching is a critical performance path for HPC communications because it significantly impacts computing performance
and profoundly affects networking performance. This dissertation focuses on improving the
high-performance matching in HPC networks to keep up with the increasingly heavy demands
of evolving applications.
This dissertation is tackling the matching problem from both the computational and network
aspects. On the one hand, the Message Passing Interface (MPI) is a de facto standard for the communication of parallel processes in an HPC network [1]. MPI has delivered an excellent performance for running large-scale scientific applications in petascale systems. Along with the petascale system, the exascale system is evolving to run even larger applications where the computing job size increases dramatically. This trend enlarges the message queues and degrades the MPI message matching performance. With the increasing requirement of big data applications, MPI message matching is a critical performance path for HPC communications. On the other hand, with the blooming of network techniques and the fast-growing size of network applications, users are seeking more enhanced, secure, and various network services. In an HPC network, the HPC cluster comprises multiple interconnected nodes in a switched network. With the integration of software-defined networking (SDN) technology into the HPC network, both the computational and network resources can be allocated efficiently according to the applications’ requirements. Thus, SDN switches are deployed in HPC networks to support high-performance, differentiated network services and guarantee the diverse users’ needs, such as firewall, load balancing, and quality of service [2]. In an SDN switch, packet classification classifies incoming packets to flows according to the rules generated in the control plane, which is a switch’s core function. Therefore, packet classification becomes a critical performance path for the HPC network.
First, this dissertation presents GenMatcher, a generic and software-only arbitrary matching framework for fast and efficient searches on packet classification. The goal is to represent arbitrary rules with efficient prefix-based tries. In order to generate efficient trie groupings and expansions to support all arbitrary rules, we propose a clustering-based grouping algorithm to group
rules based upon their bit-level similarities. Our algorithm generates near-optimal trie groupings
with low configuration times and provides significantly higher match throughput than prior techniques. Experiments with synthetic traffic show that our method can achieve a 58.9X speedup
1 compared to the baseline on a single-core processor under a given memory constraint [3].
Second, to further improve the GenMatcher performance, this dissertation proposes GenS-
Matcher, an efficient Single Instruction Multiple Data (SIMD) and cache-friendly arbitrary match-
ing framework. GenSMatcher adopts a trie node with a fixed high-fanout and a varying span for
each node depending on the data distribution. The layout of the trie node leverage cache and
modern processor features such as SIMD instructions. To support arbitrary matching, we interpret arbitrary rules into three fields: value, mask, and priority, and then propose the GenSMatcher
extraction algorithm to process the wildcard bits to support randomly positioning wildcards in
arbitrary rules. At last, we add an array of wildcard entries to the leaf entries, which stores the
wildcard rules and guarantees matching results. Experiments show that GenSMatcher outperforms GenMatcher under a large scale of the ruleset and key set regarding search time, insert
time, and memory cost. Specifically, with 5M rules, our method achieves a 2.7X speedup on
search time, and the insertion time takes ∼ 7.3 seconds, gaining a 1.38X speedup; meanwhile, the memory cost reduction is up to 6.17X.
Third, to guarantee MPI ordering feature and high-performance matching for big applications on MPI tag matching, this dissertation introduces a new hybrid data structure and match-
ing mechanism to address the performance challenges, reducing the matching operation time in
the posted receive queue (PRQ) and unexpected message queue (UMQ). The hybrid data structures are composed of tries and hash maps. We evaluate our mechanism on microbenchmarks and existing MPI applications with different numbers of processes. Experiments with synthetic
message flow show that our method can achieve a 20X search time speedup compared to the
single-core processor’s baseline. For the PICSARlite application, we integrated our Hybrid and
Intel mechanism into the MPICH library and evaluated their performance on the Ada cluster of
Texas A&M University, which has 793 general compute nodes. The experiment outcome shows that our proposed Hybrid mechanism can achieve up to 1.55X speedup compared to the MPICH library method.
Description
Keywords
Arbitrary matching, high-performance