Publications

Preprints

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Joon Ha Kim, Geon-Woo Kim, Anoop Rachakonda, Daehyeok Kim
arXiv preprint arXiv:2605.07985 (arXiv), 2026
Abstract Code arXiv
Selecting an optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures. Profile-based simulators are the standard tools for this, but they hardcode operations to specific configurations and re-profile from scratch, making exploration expensive. Dooly exploits the structural observation that each input dimension is either fixed by the model configuration or determined by requests. The system performs a single inference pass, labels input dimensions via taint propagation, and selectively profiles only the operations missing from its latency database. Stateful operations like attention are isolated using the serving engine’s initialization code. Latency regression models built from the database serve as a drop-in backend for existing simulators. Across two GPU platforms, three attention backends, and diverse model architectures, Dooly achieves simulation accuracy within 5% MAPE for TTFT and 8% for TPOT while reducing profiling GPU-hours by 56.4% across 12 models compared to existing approaches.

2026

PacketExpress: Fully Exploiting Large MTUs for Internet Traffic in Private Networks
Junghan Yoon, Youngmin Choi, Juyoung Park, Daehyeok Kim, Changhoon Kim, KyoungSoo Park
Proceedings of the ACM SIGCOMM 2026 Conference (SIGCOMM), 2026
Abstract PDF
Network bandwidth continues to scale rapidly, yet Internet data transmission performance remains constrained by the legacy 1500 B MTU. This small MTU translates high bandwidth into high packet rates that strain CPU processing at middleboxes and end hosts. While increasing the MTU could substantially improve performance, coordinating upgrades across arbitrary Internet paths is impractical. This paper presents PacketExpress, a packet processing architecture that enables networks to leverage large MTUs for Internet traffic without requiring modifications to neighboring networks. MTU-translating gateways at network borders dynamically aggregate incoming small packets into larger packets for efficient processing, then segment them back when forwarding externally. We present PXIO, a packet processing stack that leverages NIC offload capabilities to achieve high throughput. We introduce F-PMTUD, which determines the path MTU within a single round-trip time without relying on ICMP. For UDP, we present PX-caravan, a tunneling mechanism that encapsulates multiple packets to benefit from large MTUs while preserving packet boundaries. Our prototype achieves 1.47 Tbps throughput with 8 CPU cores while converting over 90% of 1500 B packets into 9000 B packets, improving middlebox performance by up to 4.8× and end-host performance by up to 3.3×.
Mimesys: Generating Realistic Executable Testing Environments from Resource Usage Traces
Donghyun Kim, Zichao Hu, Joydeep Biswas, Aditya Akella, Daehyeok Kim
Proceedings of 20th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2026
Abstract PDF Code
Testing applications under realistic resource contention is challenging because production workloads are often inaccessible due to privacy and proprietary concerns. Existing approaches either use simplistic resource stressors that fail to capture temporal dynamics and multi-resource interactions, rely on limited benchmark suites, or require exhaustive per-application profiling. This paper explores an alternative direction: synthesizing executable workloads from resource usage traces to reproduce realistic colocation scenarios. We present Mimesys, a system that transforms time-series resource usage traces into executable workloads that emulate resource contention patterns. Mimesys represents emulated workloads as compositions of resource stressors and employs a diffusion-based generative model to learn the inverse mapping from traces to stressor compositions. We introduce two key ideas: state-aware conditioning that conditions generation on both target traces and prior system state to capture temporal dependencies, and execution-driven alignment that adapts the model to real application patterns using direct execution feedback without requiring ground-truth labels. Our evaluation shows that Mimesys achieves up to 5.5× higher trace similarity and reproduces application performance under contention 2.6× more accurately than baselines.
Enabling SLO-Aware 5G Multi-Access Edge Computing with SMEC
Xiao Zhang, Daehyeok Kim
Proceedings of 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2026
Abstract PDF Code Slides Website
5G multi-access edge computing (MEC) promises to enable latency-critical applications by bringing computational power closer to mobile devices. Our measurements on commercial MEC deployments reveal frequent SLO violations due to high tail latencies. We identify resource contention at the RAN and the edge server as the root cause, compounded by SLO-unaware schedulers. Existing SLO-aware MEC schedulers require RAN–edge coordination, making them impractical for deployment and prone to poor performance due to coordination delays, limited heterogeneous application support, and disregard for edge resource contention. This paper introduces SMEC, a practical, SLO-aware resource management framework that facilitates deadline-aware scheduling through fully decoupled operations at the RAN and edge servers. Our key insight is that standard 5G protocols and application behaviors naturally provide information exploitable for SLO-aware management without extensive infrastructure or application changes. Evaluation on our 5G testbed shows that SMEC achieves 90–96% SLO satisfaction versus under 6% for existing approaches, while reducing tail latency by up to 122×.
Pendulum: Network-Compute Joint Scheduling for Efficient and Accurate MEC Live Video Analytics
Juheon Yi, Seokgyeong Shin, Minkyung Jeong, Goodsol Lee, Daehyeok Kim, Youngki Lee
Proceedings of IEEE Information Communications Conference (INFOCOM), 2026
Abstract PDF
We present Pendulum, a live video analytics system with novel network-compute joint scheduling for mobile edge computing (MEC) architectures. In practical scenarios, the resource bottleneck frequently alternates between the network (video streaming) and compute (DNN inference) stages due to independent fluctuations in wireless channel conditions and scene content. Prior single-stage scheduling systems suffer from throughput or accuracy fluctuations and from resource wastage caused by over-provisioning. To overcome these limitations, we leverage the interplay between video bitrate and DNN complexity to design an end-to-end system. Pendulum is composed of (i) a resource-efficient network-compute demand curve profiler and (ii) a joint resource scheduler. Evaluation with various videos and state-of-the-art DNNs shows that Pendulum achieves up to 0.64 mIoU gain and 1.29× higher throughput than state-of-the-art baselines. Pendulum also achieves near-optimal multi-user resource scheduling performance with minimal search overhead, yielding a 25% cost reduction compared to network-compute decoupled scheduling.
Reforge: Low-Latency Distributed GNN Serving with Selective Embedding Recomputation
Geon-Woo Kim, Donghyun Kim, Jeongyoon Moon, Henry Liu, Tarannum Khan, Anand Iyer, Daehyeok Kim, Aditya Akella
Proceedings of 40th IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2026
Abstract PDF
Graph Neural Networks (GNNs) are widely used for generating expressive node representations in graph datasets. Serving GNNs on large graphs is challenging due to high communication, computation, and memory costs from constructing and executing computation graphs across large neighborhoods. Training-time approximation techniques can reduce these overheads, but they still result in high latency or accuracy loss during serving. We propose Reforge, a system designed to enable low-latency GNN serving for large graphs with minimal accuracy loss through two core ideas. First, Reforge leverages selective recomputation of precomputed embeddings to reuse computation subgraphs, selectively updating only a minority to preserve accuracy. Second, Reforge introduces computation graph parallelism, parallelizing the creation and execution of computation graphs across machines to reduce communication overhead. Evaluations on large graph datasets and GNN models demonstrate that Reforge significantly outperforms current state-of-the-art methods.

2025

Man-Made Heuristics Are Dead. Long Live Code Generators!
Rohit Dwivedula, Divyanshu Saxena, Aditya Akella, Swarat Chaudhuri, Daehyeok Kim
Proceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets), 2025
Abstract PDF Code
Policy design for systems controllers has conventionally been a manual process, with domain experts tailoring heuristics for the specific instance under which the policy will be deployed. In this paper, we re-imagine policy design via a novel automated search technique fueled by recent advances in generative models, specifically Large Language Model (LLM)-driven code generation. We outline the design and implementation of PolicySmith, a framework that applies LLMs to synthesize instance-optimal heuristics. We apply PolicySmith to two long-standing systems policies, web caching and congestion control, highlighting the opportunities uncovered by this LLM-driven heuristic search. For caching, PolicySmith discovers heuristics that outperform established baselines on standard open-source traces. For congestion control, we show that PolicySmith can generate safe policies that can directly run inside the Linux kernel.
Towards Incremental MTU Upgrade for the Internet
Junghan Yoon, Youngmin Choi, Juyoung Park, Daehyeok Kim, Changhoon Kim, KyoungSoo Park
Proceedings of the 24th ACM Workshop on Hot Topics in Networks (HotNets), 2025
Abstract PDF
This paper proposes a systematic approach to incrementally enabling large MTUs in the Internet. We demonstrate that increasing the MTU size significantly enhances the performance of both middleboxes and end hosts. To bridge MTU mismatches at network borders, we introduce PacketExpress gateway (PXGW), an MTU-translating gateway that dynamically adjusts packet sizes for cross-traffic. PXGW merges and splits TCP payloads on the fly and tunnels UDP packets, ensuring seamless adaptation. We propose F-PMTUD, a new path MTU discovery algorithm that determines the path MTU within a single round-trip without relying on ICMP. Our preliminary evaluation shows that the PXGW prototype achieves 1.46 Tbps of packet forwarding throughput using only 8 CPU cores. After dynamic conversion, 95% of transmitted TCP packets are 9000 B jumbo frames, indicating that most flows were effectively converted into large segments, thereby demonstrating the system’s efficiency and scalability. We also find that large-MTU packets made available via PXGW enhance end-host performance by up to 2.5×.
Large Language Models as Realistic Microservice Trace Generators
Donghyun Kim, Sriram Ravula, Taemin Ha, Alexandros G Dimakis, Daehyeok Kim, Aditya Akella
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2025
Abstract PDF Code arXiv
Workload traces are essential to understand complex computer systems’ behavior and manage processing and memory resources. Since real-world traces are hard to obtain, synthetic trace generation is a promising alternative. This paper proposes a first-of-a-kind approach that relies on training a large language model (LLM) to generate synthetic workload traces, specifically microservice call graphs. To capture complex and arbitrary hierarchical structures and implicit constraints in such traces, we show how to fine-tune LLMs to generate recursively, making call graph generation a sequence of easier steps. To further enforce learning constraints in traces and generate uncommon situations, we argue for applying additional instruction tuning steps to align our model with the desired trace features. Our evaluation results show that we can generate diverse realistic traces under various conditions and outperform existing methods in accuracy and validity. We demonstrate that our synthetically generated traces can effectively replace real data to optimize important microservice management tasks. Additionally, our model adapts to downstream trace-related tasks, such as predicting key trace features and infilling missing data.
ConfigBot: Adaptive Resource Allocation for Robot Applications in Dynamic Environments
Rohit Dwivedula, Sadanand Modak, Aditya Akella, Joydeep Biswas, Daehyeok Kim, Christopher Rossbach
Proceedings of 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2025
Abstract PDF Code
The growing use of service robots in dynamic environments requires flexible management of on-board compute resources to optimize the performance of diverse tasks such as navigation, localization, and perception. Current robot deployments often rely on static OS configurations and system over-provisioning. However, they are suboptimal because they ignore variations in resource usage, leading to system-wide issues like robot instability or inefficient resource utilization. This paper presents ConfigBot, a novel system designed to adaptively reconfigure robot applications to meet a predefined performance specification by leveraging runtime profiling and automated configuration tuning. Through experiments on multiple real robots, each running a different stack with diverse performance requirements, which could be context-dependent, we illustrate ConfigBot’s efficacy in maintaining system stability and optimizing resource allocation. Our findings highlight the promise of automatic system configuration tuning for robot deployments, including adaptation to dynamic changes.
Towards End-to-End Latency Guarantee in MEC Live Video Analytics with App-RAN Mutual Awareness
Juheon Yi, Goodsol Lee, Seokgyeong Shin, Minkyung Jeong, Daehyeok Kim, Youngki Lee
Proceedings of 23rd ACM International Conference on Mobile Systems, Applications, and Services (MobiSys), 2025
Abstract PDF
Mobile live video analytics apps require end-to-end latency guarantees for responsiveness and immersiveness. Achieving consistently low latency is challenging due to complex fluctuations in wireless channel conditions and scene complexity; for example, the latency SLO satisfaction rate drops to as low as 26% in commercial 5G MEC platforms. Prior works mostly focus on either app-only (bitrate, DNN adaptation, or GPU allocation) or RAN-only (radio resource allocation) scheduling, with each side unaware of the other, resulting in mismatched scheduling decisions and frequent SLO violations. Coordinating the two schedulers is also challenging, as they are run separately by network and cloud operators with disjoint control. We present ARMA, an end-to-end live video analytics system with app-RAN mutual awareness for high end-to-end latency SLO satisfaction in MEC. We design a mutually-aware decoupled scheduling mechanism on top of the RAN Intelligent Controller (RIC) in the Open RAN architecture that fosters cooperative interaction between the two operators’ schedulers while preserving the operators’ proprietary operations. We prototype an Open RAN-enabled 5G MEC testbed and evaluate ARMA, showing that it achieves a 97% SLO satisfaction rate.
Enabling Portable and High-Performance SmartNIC Programs with Alkali
Jiaxin Lin, Zhiyuan Guo, Mihir Shah, Tao Ji, Yiying Zhang, Daehyeok Kim, Aditya Akella
Proceedings of 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2025
Abstract PDF Code
Emerging SmartNICs, from different vendors or generations, exhibit substantial differences in hardware parallelism and memory interconnects. These variations make porting programs across NICs complex and time-consuming, requiring programmers to refactor code for performance per NIC’s characteristics. We argue that an ideal SmartNIC compilation framework should allow developers to write target-independent programs, with the compiler managing cross-NIC porting and performance optimization. We present Alkali, which achieves this by proposing a new intermediate representation for building flexible compiler infrastructure for multiple NIC targets and developing a new iterative parallelism optimization algorithm that automatically ports and parallelizes the input programs based on the target NIC’s hardware characteristics. Experiments across NIC applications show that Alkali enables developers to write portable, high-performance NIC programs. Our compiler optimization passes can automatically port these programs and make them run efficiently across all targets, achieving performance within 9.8% of hand-tuned expert implementations.

2024

On the Criticality of Integrity Protection in 5G Fronthaul Networks
Jiarong Xing, Sophia Yoo, Xenofon Foukas, Daehyeok Kim, Michael K. Reiter
Proceedings of 33rd USENIX Security Symposium (USENIX Security), 2024
Abstract PDF
The modern 5G fronthaul, connecting base stations to radio units in cellular networks, is designed for microsecond-level performance using Ethernet-based protocols. Due to perceived performance overheads and misconceptions about the risk and impact of attacks, integrity protection is not mandatory in the 5G fronthaul standards. We show how vulnerabilities from the lack of protection can be exploited, making attacks easier and more powerful. We present a novel class of powerful attacks and traditional attacks, both fully launched from software over open packet-based interfaces, to cause degradation or denial-of-service to users over large regions. Our attacks do not require a physical radio presence or signal-based mechanisms, do not crash radios, and are highly severe, impacting multiple cells. We demonstrate the impact in an end-to-end manner on a commercial-grade, multi-cell 5G testbed, showing adversaries can degrade user performance by more than 80%, block select users from the cell permanently, or generate signaling storms of more than 2500 messages per minute with just two compromised cells and four users. We present analysis of countermeasures that meet strict fronthaul performance requirements.

2023

On a Foundation Model for Operating Systems
Divyanshu Saxena, Nihal Sharma, Donghyun Kim, Rohit Dwivedula, Jiayi Chen, Chenxi Yang, Sriram Ravula, Zichao Hu, Aditya Akella, Sebastian Angel, Joydeep Biswas, Swarat Chaudhuri, Isil Dillig, Alex Dimakis, P. Brighten Godfrey, Daehyeok Kim, Chris Rossbach, Gang Wang
Workshop on ML for Systems at NeurIPS (MLSys@NeurIPS), 2023
Abstract PDF
This paper lays down the research agenda for a domain-specific foundation model for operating systems (OSes). Our case revolves around the observations that several OS components, such as CPU, memory, and network subsystems, are interrelated, and OS traces offer the ideal dataset for a foundation model to grasp the intricacies of diverse OS components and their behaviors. We discuss several possibilities, from using foundation models as policy agents to employing them as generators and predictors to assist OS control algorithms. We hope this paper spurs further research into OS foundation models and the next generation of operating systems for the evolving computing landscape.
LogNIC: A High-Level Performance Model for SmartNICs
Zerui Guo, Jiaxin Lin, Yuebin Bai, Daehyeok Kim, Michael Swift, Aditya Akella, Ming Liu
Proceedings of 56th IEEE/ACM International Symposium on Microarchitecture (MICRO), 2023
Abstract PDF
SmartNICs provide in-network computing capabilities in today’s data centers, benefiting a range of applications. Efficient SmartNIC-assisted solutions require programmers to understand SmartNIC architecture, refactor application logic, and relate executions to traffic. A high-level SmartNIC performance model can decouple hardware from its offloaded software, simplifying development. However, prior models cannot dissect the complexity of SmartNIC-offloaded programs, capture nondeterministic computation and I/O overlap, or account for diverse traffic profiles. LogNIC systematically analyzes the performance of a SmartNIC-offloaded program using a packet-centric approach, examining packet traversal over heterogeneous domains, interconnects, and memory subsystems. It abstracts device details, represents programs as execution graphs, retains configurable parameters, and generates latency/throughput estimates for a given traffic profile. Extensions handle multi-tenancy, traffic interleaving, and accelerator peculiarity. Our evaluations show LogNIC estimates performance bounds, explores software optimization strategies, and provides guidelines for hardware designs.
Enabling Resilience in Virtualized RANs with Atlas
Jiarong Xing, Junzhi Gong, Xenofon Foukas, Anuj Kalia, Daehyeok Kim, Manikanta Kotaru
Proceedings of 29th ACM International Conference on Mobile Computing and Networking (MobiCom), 2023
Abstract PDF
Virtualized radio access networks (vRANs) run RAN processing on commodity servers, replacing proprietary hardware. The DU component has real-time deadlines and a black-box nature, making resilience features like upgrades and failover challenging. These properties prevent use of techniques like VM migration or state replication. Atlas is the first system providing resilience for the DU by repurposing existing wireless resilience mechanisms—handovers and cell reselection—to provide software resilience. For upgrades, we serve cells from both old and new DUs via the same radio, using handovers to migrate user devices. For failures, we identify deficiencies in RAN protocols that disrupt reselection and eliminate them with a middlebox between the DU and higher layers. Evaluation on a 5G vRAN testbed shows Atlas has minimal connectivity disruption during resilience events, with low overhead.
Resilient Baseband Processing in Virtualized RANs with Slingshot
Nikita Lazarev, Tao Ji, Anuj Kalia, Daehyeok Kim, Ilias Marinos, Francis Y. Yan, Christina Delimitrou, Zhiru Zhang, Aditya Akella
Proceedings of ACM SIGCOMM conference (SIGCOMM), 2023
Abstract PDF
Virtualized radio access networks (vRANs) are replacing specialized RAN hardware with commodity servers. Today’s vRANs lack resilience, as there is no support for failover or upgrades without long interruption. Enabling these features is challenging due to vRANs’ real-time latency requirements and black-box nature. Slingshot is a new system providing transparent resilience for the vRAN physical layer (PHY) using real-time workload migration, fast protocol middleboxes, and real-time failure detection. A key insight is to treat disruptions like regular wireless impairments and use the cellular network’s resilience. Experiments with a 5G vRAN testbed show Slingshot handles PHY failover with no disruption to video conferencing and under 110 ms of disruption to TCP, and enables zero-downtime upgrades.
ExoPlane: An Operating System for On-Rack Switch Resource Augmentation
Daehyeok Kim, Vyas Sekar, Srinivasan Seshan
Proceedings of 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023
Abstract PDF Slides
The promise of in-network computing remains unrealized in practice, as serving concurrent stateful applications on a programmable switch is challenging due to limited on-chip resources. We argue for an on-rack switch resource augmentation architecture that augments a switch with other programmable network hardware, such as smart NICs, on the same rack. We design and implement ExoPlane, an operating system supporting concurrent applications via pragmatic resource augmentation. ExoPlane includes a practical runtime operating model and state abstraction to manage application state correctly across devices with minimal performance and resource overheads. Our evaluation shows ExoPlane provides low latency, scalable throughput, and fast failover, achieving these with small overhead and minimal application changes.
Sketchovsky: Enabling Ensembles of Sketches on Programmable Switches
Hun Namkung, Zaoxing Liu, Daehyeok Kim, Vyas Sekar, Peter Steenkiste
Proceedings of 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2023
Abstract PDF Code
Network operators need to run diverse measurement tasks on switches for traffic engineering or anomaly detection. Prior work considers running only a single sketch and lacks efficient support for ensembles of sketches. This work presents the design and implementation of Sketchovsky, a cross-sketch optimization and composition framework. We identify five cross-sketch optimization building blocks that reduce the use of critical hardware resources. Efficient heuristics select and apply these blocks for arbitrary ensembles. Sketchovsky automatically generates composed code for the hardware compiler. Evaluation shows that Sketchovsky makes ensembles of up to 18 sketches feasible, reducing critical hardware resource usage by up to 45%.

2022

Automatic Generation of Network Function Accelerators Using Component-Based Synthesis
Francisco Pereira, Gonçalo Matos, Hugo Sadok, Daehyeok Kim, Ruben Martins, Justine Sherry, Fernando Ramos, Luis Pedrosa
Proceedings of ACM Symposium on SDN Research (SOSR), 2022
Abstract PDF
Designing networked systems that exploit heterogeneous dataplanes—such as splitting processing between a PISA switch and x86 CPUs—can improve performance and efficiency. Programming multiple hardware targets is challenging because of platform-specific languages. Existing write-once, run-anywhere compilers cannot fully tune NFs for performance across different targets. We explore compiler ideas to exhaustively search for different hardware mappings, tunable for objectives like minimizing memory or maximizing throughput. Our prototype SyNAPSE uses component-based synthesis, supporting x86 and Tofino platforms. Compared to a baseline compiler, SyNAPSE finds thousands of deployment choices, including options reducing controller traffic by an order of magnitude, or halving memory use.
SketchLib: Enabling Efficient Sketch-based Monitoring on Programmable Switches
Hun Namkung, Zaoxing Liu, Daehyeok Kim, Vyas Sekar, Peter Steenkiste
Proceedings of 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2022
Abstract PDF Code
Sketching algorithms provide accurate network measurement with low resource usage. Programmable switches are attractive targets for such sketches, but current hardware implementations are inefficient or infeasible. We make three contributions: (1) a systematic analysis of the bottlenecks of hardware sketches, (2) practical, correct-by-construction optimization techniques, and (3) SketchLib, an easy-to-use library that helps developers efficiently implement sketches in hardware. Evaluation on state-of-the-art sketches shows that SketchLib reduces the hardware resource footprint by up to 96% without impacting fidelity.
SwiSh: Distributed Shared State Abstractions for Programmable Switches
Lior Zeno, Dan R. K. Ports, Jacob Nelson, Daehyeok Kim, Shir Landau Feibish, Idit Keidar, Arik Rinberg, Alon Rashelbach, Igor De-Paula, Mark Silberstein
Proceedings of 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2022
Abstract PDF
SwiSh is a distributed shared state management layer for data-plane P4 programs. It enables running scalable stateful distributed network functions entirely in the data-plane. Several shared variable schemes differ in consistency, performance, and in-switch implementation complexity. We introduce the Strong Delayed-Writes (SDW) protocol, which offers consistent snapshots of shared data-plane objects with strong linearizability, enabling distributed concurrent sketches with precise error bounds. We implement strong, eventual, and SDW consistency protocols in Tofino switches, and compare them in microbenchmarks and network functions (NAT, DDoS detector, and rate limiter). Distributed state management in the data plane is practical and outperforms centralized solutions by up to four orders of magnitude in update throughput and replication latency.

2021

Towards Elastic and Resilient In-Network Computing
Daehyeok Kim
PhD Thesis, Carnegie Mellon University, Computer Science Department (CMU-CS-21-143) (PhD Thesis), 2021
Abstract PDF
Advances in programmable networking hardware (switches and smart NICs) create a new in-network computing paradigm. This paradigm allows functionality typically served by servers or proprietary devices, from middleboxes to distributed system components, to be performed in the network. The demand for higher performance and programmable hardware has driven in-network computing. However, we observe a gap between current capabilities and evolving application demands—specifically, in-network computing lacks resource elasticity and fault resiliency, limiting its potential. Elasticity addresses the lack of support for multi-application, scalable deployment models. Fault resiliency is critical for managing failures, but challenging to enable on networking devices due to low-level abstractions, hardware constraints, heterogeneity, and workload. We argue that high-level abstractions and runtimes that leverage compute and memory resources across device types can enable elasticity and resiliency without hardware changes. Device resource augmentation is the key enabler, as we show via three systems—TEA, ExoPlane, and RedPlane—which support elastic memory, elastic compute/memory, and fault resiliency, respectively. Each system provides an abstraction, programming APIs, and a runtime. Prototypes and evaluation show that developers can enable elasticity and resiliency without worrying about underlying complexity.
A Roadmap for Enabling a Future-Proof In-Network Computing Data Plane Ecosystem
Daehyeok Kim, Nikita Lazarev, Tommy Tracy, Farzana Siddique, Hun Namkung, James C. Hoe, Vyas Sekar, Kevin Skadron, Zhiru Zhang, Srinivasan Seshan
arXiv preprint arXiv:2111.04563 (Tech Report), 2021
Abstract PDF
As the in-network computing vision matures, we see two evolutionary trends: richer applications that require more than programmable ASICs, and the emergence of diverse data plane technologies. Point solutions exist, but ecosystem-level disconnects persist (e.g., application refactoring, missing guidelines, lack of holistic runtimes). We use a simple application–data plane combination to highlight these disconnects. We sketch a high-level roadmap and guidelines for building a "future-proof" data plane ecosystem.
RedPlane: Enabling Fault-Tolerant Stateful In-Switch Applications
Daehyeok Kim, Jacob Nelson, Dan R. K. Ports, Vyas Sekar, Srinivasan Seshan
Proceedings of ACM SIGCOMM conference (SIGCOMM), 2021
Abstract PDF Code Slides
Running datacenter functions (e.g., NATs, load balancers, monitoring) on programmable switches has shown performance benefits, but fault tolerance, a key capability, is missing. Since networks are no longer stateless, endpoint-only recovery does not suffice. We design and implement RedPlane, a fault-tolerant state store for in-switch applications, providing consistent state access even after switch failure or rerouting. We address the challenges of designing a practical, provably correct replication protocol and implementing it in the switch data plane. Evaluation shows RedPlane adds negligible overhead and enables rapid recovery.
Telemetry Retrieval Inaccuracy in Programmable Switches: Analysis and Recommendations
Hun Namkung, Daehyeok Kim, Zaoxing Liu, Vyas Sekar, Peter Steenkiste
Proceedings of ACM Symposium on SDN Research (SOSR), 2021
Abstract PDF
Sketching algorithms are attractive for telemetry on programmable hardware switches due to their accuracy guarantees and compact data structures. In practice, their implementations can suffer large (up to 94×) accuracy drops relative to theory. Delays in pulling and resetting data plane state cause this degradation. We design solutions that reduce these delays and almost eliminate the inaccuracy in existing workflows.

2020

Unleashing In-network Computing on Scientific Workloads
Daehyeok Kim, Ankush Jain, Zaoxing Liu, George Amvrosiadis, Damian Hazen, Bradley Settlemyer, Vyas Sekar
arXiv preprint arXiv:2009.02457 (Tech Report), 2020
Abstract PDF
In-network computing can benefit datacenter applications; we explore how it may help scientific workloads in HPC. Analyzing HPC applications, we find opportunities and challenges for in-network acceleration. The main obstacle is the dynamic and demanding nature of scientific workloads, making open-loop, feedback-lacking in-network techniques ill-suited. We present NSinC, an architecture providing closed-loop runtime feedback for in-network acceleration in scientific workloads. We outline challenges and a preliminary design for such acceleration.
TEA: Enabling State-Intensive Network Functions on Programmable Switches
Daehyeok Kim, Zaoxing Liu, Yibo Zhu, Changhoon Kim, Jeongkeun Lee, Vyas Sekar, Srinivasan Seshan
Proceedings of ACM SIGCOMM conference (SIGCOMM), 2020
Abstract PDF Slides
Programmable switches are attractive for network functions (NFs), but memory limits stymie support for state-intensive NFs (cloud-scale NATs or load balancers with millions of entries). We introduce TEA (Table Extension Architecture), leveraging DRAM on servers to provide a virtual table abstraction for NFs on switches. This allows switch ASICs to access external DRAM directly from the data plane, without CPUs. We address design and implementation challenges and show with a Tofino-based implementation that TEA achieves low and predictable latency (1.8-2.2 μs) for table lookups, and throughput scaling with more servers (138 M lookups/second with 8 servers).
Adapting TCP for Reconfigurable Datacenter Networks
Matthew Mukerjee, Christopher Canel, Weiyang Wang, Daehyeok Kim, Srinivasan Seshan, Alex C. Snoeren
Proceedings of 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2020
Abstract PDF Slides
Reconfigurable datacenter networks (RDCNs) augment traditional packet switches with high-bandwidth reconfigurable circuits, which are assigned to rack pairs according to a schedule. TCP flows must ramp up their sending rates quickly when circuits become available. Past TCP variants perform well for ms-scale reconfiguration, but modern RDCNs reconfigure in 20 μs and have short configuration epochs, so existing TCP variants cannot ramp up quickly enough. We address this by (1) in-network output queue resizing to prebuffer packets and (2) endpoint adjustment of the congestion window (cwnd) using explicit circuit feedback. Using our RDCN emulator Etalon, we show that combining these increases circuit utilization by 1.91x with only a 1.20x latency increase.

2019

FreeFlow: Software-based Virtual RDMA Networking for Containerized Clouds
Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, Srinivasan Seshan
Proceedings of 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2019
Abstract PDF Slides
Large-scale cloud applications are using containerization for efficiency and isolation. Data-intensive apps increasingly adopt RDMA for high networking performance. There is an inevitable collision between these. We present FreeFlow, a software-based RDMA virtualization framework for container clouds, using commodity RDMA NICs. FreeFlow meets cloud requirements (isolation, portability, controllability) and is transparent to applications. It provides performance close to bare-metal RDMA with low CPU overhead. With TensorFlow and Spark, FreeFlow gives nearly the same application performance as bare-metal.

2018

Generic External Memory for Switch Data Planes
Daehyeok Kim, Yibo Zhu, Changhoon Kim, Jeongkeun Lee, Srinivasan Seshan
Proceedings of the 17th ACM Workshop on Hot Topics in Networks (HotNets), 2018
Abstract PDF Slides
Switches are attractive for networking applications (e.g., load balancing, virtual switching) because of their location and high processing rate. Advances in programmable switch ASICs open opportunities for offloading application logic, but limited switch memory remains a challenge. We envision enabling switches to access remote memory from data planes, improving performance for many applications. We design three remote memory primitives using RDMA operations and demonstrate feasibility using a prototype.
HyperLoop: Group-Based NIC-Offloading to Accelerate Replicated Transactions in Multi-Tenant Storage Systems
Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitu Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, Srinivasan Seshan
Proceedings of ACM SIGCOMM conference (SIGCOMM), 2018
Abstract PDF Slides
Data center storage systems perform replicated transactions for availability and integrity, but still face high tail latency. The root problem is CPU use in the critical path. HyperLoop removes CPU from the critical path by offloading replicated transactions to RDMA NICs, using non-volatile memory for storage. We develop NIC offloading primitives that provide memory operations and guarantee ACID properties, all without CPU. Popular storage applications are easily optimized using the primitives; evaluation shows HyperLoop can reduce 99th percentile latency by about 800x, with near 0% CPU consumption on replicas.

2017

REboost: Improving Throughput in Wireless Networks using Redundancy Elimination
Kilho Lee, Daehyeok Kim, Insik Shin
IEEE Communications Letters (Comm. Letter), 2017
Abstract PDF
Traffic redundancy elimination (RE) boosts throughput in bandwidth-limited networks. However, RE is less effective in wireless networks because TCP cannot exploit RE unless it is aware of how the underlying RE system operates. REboost makes the TCP layer aware of the underlying RE system so that it can improve throughput. Prototype evaluations show REboost significantly outperforms prior RE systems.

2016

What Mobile Ads Know About Mobile Users
Sooel Son, Daehyeok Kim, Vitaly Shmatikov
Proceedings of 23rd Network and Distributed System Security Symposium (NDSS), 2016
Abstract PDF Slides
We analyze popular Android advertising libraries to see how they protect users from malicious ads. Most separate ad privileges from the host app using dedicated browser instances and the same-origin policy. However, malicious ads can infer sensitive information by accessing external storage, which is used for media-rich ads. Even though ads can’t read other apps’ files, they can learn whether a file exists, which can leak sensitive data (e.g., prescription queries). We conclude with recommendations for better ad protection.
FlexDroid: Enforcing In-App Privilege Separation in Android
Jaebaek Seo, Daehyeok Kim, Donghyun Cho, Taesoo Kim, Insik Shin
Proceedings of 23rd Network and Distributed System Security Symposium (NDSS), 2016
Abstract PDF Slides
Mobile apps use third-party libraries for features (ads, analytics, etc.), but Android gives all permissions granted to the app to these libraries. This can lead to privacy violations, especially with native code, Java reflection, and dynamic code loading. FLEXDROID is a new security model and isolation mechanism, providing fine-grained, dynamic access control per library. Developers can control which permissions to grant and how to react to violations (e.g., mock data, kill). FLEXDROID defines a principal for each library and develops an inter-process stack inspection mechanism effective for JNI and dynamic code. Evaluations show it is easy to adopt, effective, and incurs negligible overhead.

2015

SounDroid: Supporting Real-Time Sound Application on Commodity Mobile Devices
Hyosu Kim, SangJeong Lee, Wookhyun Han, Daehyeok Kim, Insik Shin
Proceedings of 36th IEEE Real-Time Systems Symposium (RTSS), 2015
Abstract PDF
Mobile sound applications need real-time audio request management for features like high-rate acoustic sensing. Existing OSes use static configurations, lacking real-time capability for such timing needs. SounDroid is a framework for real-time audio request management, based on requirement analysis and audio playback understanding. It incorporates real-time audio scheduling (EDF-V and AFDS) and dispatch optimization on Android. Experiments show SounDroid improves scheduling performance (up to 40%) and enables deterministic latency.
Optimized Layered Integrated Video Encoding
Sangki Yun, Daehyeok Kim, Xiaofan Lu, Lili Qiu
Proceedings of 34th IEEE International Conference on Computer Communications (INFOCOM), 2015
Abstract PDF
Wireless video traffic is exploding, overloading wireless networks. Multicast can reduce total traffic, but receivers are heterogeneous—especially in channel quality and antenna count. We develop LIVE (Layered Integrated Video Encoding) to guarantee good performance for weaker receivers and better quality for stronger ones. LIVE uses (i) layered coding for heterogeneity, (ii) optimization for per-layer transmission, and (iii) integrated modulation combining soft/hard encoding for reliability. It’s the first approach handling MIMO heterogeneity in wireless video multicast. Extensive Matlab and USRP experiments show LIVE’s effectiveness.

2014

ATRA: Address Translation Redirection Attack against Hardware-based External Monitors
Daehee Jang, Hojoon Lee, Minsu Kim, Daehyeok Kim, Daegyeong Kim, Brent B. Kang
Proceedings of 21st ACM Conference on Computer and Communications Security (CCS), 2014
Abstract PDF
Hardware-based external kernel integrity monitors are widely considered trustworthy, but they are challenged by the Address Translation Redirection Attack (ATRA). ATRA evades monitoring by redirecting accesses to critical kernel objects outside the monitored region. Despite its seriousness, address translation integrity is often assumed and exploitation is considered hypothetical. We detail ATRA and its two types, memory-bound and register-bound, along with its practical realization and challenges. Benchmarks show ATRA introduces a negligible performance penalty, indicating that attackers could exploit it in practice. We recommend that future external monitors address this threat.

2013

Fine-grained Spectrum Adaptation in WiFi Networks
Sangki Yun, Daehyeok Kim, Lili Qiu
Proceedings of 20th ACM International Conference on Mobile Computing and Networking (MobiCom), 2013
Abstract PDF
WiFi traffic is growing rapidly, requiring better spectrum efficiency. We propose adapting spectrum on a per-frame basis. Our approach (i) designs fine-grained spectrum access so that sender and receiver can dynamically change spectrum per frame, (ii) uses fast and accurate spectrum detection via the IEEE 802.11 preamble, and (iii) provides efficient per-frame spectrum allocation that considers diversity and interference. It extends to joint spectrum, schedule, and AP assignment per frame. A SORA prototype and simulations show that the approach is practically feasible and achieves near-optimal results. This is the first such per-frame spectrum adaptation for WiFi.

2012

Optimal Combination of Opportunistic Routing and Network Coding for Minimizing Transmission Time
Daehyeok Kim
MS Thesis, Pohang University of Science and Technology (MS Thesis), 2012
Multi-rate Combination of Opportunistic Routing and Network Coding
Daehyeok Kim, Young-Joo Suh
Proceedings of 9th IEEE Wireless Communications and Networking Conference (WCNC), 2012
Abstract PDF Slides
Wireless communication techniques like opportunistic routing and network coding exploit the wireless medium’s broadcast nature. Prior attempts combine these two, but none considered bit-rate selection for data transmission in multi-rate wireless networks. We study the potential benefits of combining opportunistic routing, network coding, and bit-rate selection via optimization. We develop a model and algorithm to find the optimal forwarding scheme for multi-rate opportunistic routing and network coding. Simulations using MIT Roofnet traces show that properly considering bit-rate selection brings substantial expected transmission time benefits.

2011

Multicast Extension to Proxy Mobile IPv6 for Mobile Multicast Services
Daehyeok Kim, Wan-Seon Lim, Young-Joo Suh
Journal of Computing Science and Engineering (JCSE), 2011
Abstract PDF
Proxy Mobile IPv6 (PMIPv6) has been proposed for mobility management in all-IP mobile networks. While unicast handover has been studied extensively, PMIPv6 support for multicast services has received less attention. The two main approaches, MAG-based and LMA-based, cause multicast join overhead and non-optimal routing, respectively, possibly incurring high packet loss. We propose a PMIPv6-based multicast protocol to ensure optimal delivery, minimizing delay and packet loss during handover. Simulations show improved delay, service disruption, and loss compared to other solutions.
