  • Review previous class: Shadow Page Tables

  • Discuss Memory Virtualization with hardware support: Nested Paging

  • Problem: processing network in a virtualized setting

    • When a packet arrives, the VM that is currently running gets interrupted
    • The hypervisor determines which VM the packet should go to
    • The target VM may be involved in copying the packet over and, if required, processing it immediately
    • When this is done, the interrupted VM can run again
    • This drastically reduces performance when a lot of network I/O is done
    • Key bottleneck: the hypervisor is involved in deciding where each packet should go
  • Let's look at hardware support for IO

  • Virtual Machine Device Queues (VMDQ)

    • Hardware determines which VM each packet is meant for
    • Places each packet on a per-VM queue
    • The hypervisor is still involved in copying data from each queue into the corresponding VM
    • VMDQ removes sorting overhead from hypervisor
  • Single Root IO Virtualization (SR-IOV)

    • Introduces "Network Functions" Drivers
    • The hardware resources of the network adaptor (such as memory, registers, and queues) are split up among "network functions"
      • Each network function has dedicated access to a set of hardware resources
        • Each network function can then be bound to a specific VM
      • When a packet arrives for a specific VM, the network function of that VM is responsible for processing that packet
        • It can do so without interference from other VMs
    • What happens when there are more virtual machines than virtual network adaptors?
      • Several VMs have to share the same Virtual Function
    • Difference with VMDQ:
      • Hypervisor no longer involved in copying data over to the VM
      • Hypervisor completely removed from the loop, only involved in setup phase
      • VMDQ has different queue for each VM, SR-IOV has different Virtual Function for each VM
  • Intel IO Acceleration Technology (IOAT)

    • Suite of hardware features
    • QuickData - Hardware DMA to VMs
    • Direct Cache Access - Access caches directly (unclear whether this is different from DDIO)
    • Extended Message Signaled Interrupts (MSI-X): distribute interrupts among CPUs
    • Tune interrupt arrival times based on content of packets (Fast Packet Introspection)
  • Intel DDIO

    • IO directly in and out of the processor last-level cache!
    • Very exciting for performance
    • Last-level cache on Xeon processor is 20 MB, so a lot of space
  • The packet processing game:

    • 10 Gbps delivers 14.8M packets per second (assuming 64-byte packets plus 20 bytes of preamble and inter-frame gap)
    • That gives us 67 ns to process a single packet
    • 67 ns is about 200 cycles on a 3 GHz processor -- not a lot (see the arithmetic sketch after this list)
    • For comparison:
      • A cache miss is 32 ns
      • L2 cache access 4 ns
      • L3 cache access 8 ns
      • Atomic lock operation : 8 ns
      • Syscall: 40 ns
      • TLB miss: several cache misses
    • Need to batch and process packets
    • What to avoid:
      • A context switch: > 1000 ns
      • A page fault: > 1000 ns
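
The budget above is just arithmetic; here is a minimal C sketch that reproduces the numbers, assuming minimum-size 64-byte frames, 20 bytes of per-frame overhead, and a 3 GHz core:

```c
#include <stdio.h>

int main(void)
{
    const double link_bps   = 10e9;          /* 10 Gbps link */
    const double frame_bits = (64 + 20) * 8; /* 64 B frame + 20 B preamble/IFG */
    const double cpu_hz     = 3e9;           /* 3 GHz core */

    double pps            = link_bps / frame_bits; /* ~14.88 M packets/s */
    double ns_per_pkt     = 1e9 / pps;              /* ~67 ns per packet */
    double cycles_per_pkt = cpu_hz / pps;           /* ~200 cycles per packet */

    printf("%.2f Mpps, %.1f ns/packet, %.0f cycles/packet\n",
           pps / 1e6, ns_per_pkt, cycles_per_pkt);
    return 0;
}
```
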
  • Data Plane Development Kit (DPDK)

  • Open-source project with contributions from Intel and others

    • C code (GCC 4.5.X or later)
    • Linux kernel 2.6.34 or later
    • With kernel boot option isolcpus
  • Set of user-space libraries for fast packet processing

    • Can handle 11x the traffic of the Linux kernel!
  • Lightweight, low level, performance driven framework

  • All traffic bypasses the kernel

  • Started with Intel x86, now supports other architectures like IBM Power 8

  • Poll Mode Driver (PMD) per Hardware NIC

    • Support for RX/TX (Receive and Transmit)
    • Mapping PCI memory
    • Mapping user memory onto NIC
    • Managing hardware support
    • Implemented as a user space pthread
    • PMD drivers are non-preemptive and not thread-safe
    • Threads communicate using librte_ring: a lockless queue (see the sketch below)
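
As an illustration of how two PMD threads might hand packets to each other over a librte_ring, here is a minimal sketch; the ring name `pkt_handoff`, the producer/consumer split, and the single-producer/single-consumer flags are illustrative choices, not the only way to configure a ring:

```c
#include <rte_ring.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

/* One lcore enqueues received mbufs, another dequeues and processes
 * them. The ring size must be a power of two. */
static struct rte_ring *handoff;

static void setup(void)
{
    /* single-producer / single-consumer ring: no atomic contention at all */
    handoff = rte_ring_create("pkt_handoff", 1024, rte_socket_id(),
                              RING_F_SP_ENQ | RING_F_SC_DEQ);
}

static void producer(struct rte_mbuf *m)
{
    if (rte_ring_enqueue(handoff, m) != 0)
        rte_pktmbuf_free(m);          /* ring full: drop the packet */
}

static void consumer(void)
{
    void *obj;
    if (rte_ring_dequeue(handoff, &obj) == 0) {
        struct rte_mbuf *m = obj;
        /* ... process packet ... */
        rte_pktmbuf_free(m);
    }
}
```
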
  • User space libraries

    • Initialize and use PMD
    • Provide threads (based on pthread)
    • Memory management
      • Used with huge pages (2 MB or 1 GB)
    • Hashing, scheduling, pipelining, etc.
  • All resources initialized at the start

  • Implementation Tricks:

    • Don’t use linked lists
    • Use arrays instead (contiguous memory is friendlier to caches and prefetchers)
  • What does a basic forwarding example do?

    • DPDK init
    • Get all available NICs
    • Initialize packet buffers (create a mempool)
    • Initialize ports (a rough sketch of these steps follows below)
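
A rough sketch of those initialization steps, loosely following DPDK's basic forwarding (skeleton) example; exact API names and defaults vary between DPDK releases, and the pool and queue sizes here are illustrative:

```c
#include <rte_eal.h>
#include <rte_ethdev.h>
#include <rte_mbuf.h>
#include <rte_lcore.h>

int init(int argc, char **argv)
{
    struct rte_mempool *pool;
    uint16_t port;

    if (rte_eal_init(argc, argv) < 0)              /* DPDK init (EAL) */
        return -1;

    pool = rte_pktmbuf_pool_create("MBUF_POOL",    /* packet buffers */
                                   8191, 250, 0,
                                   RTE_MBUF_DEFAULT_BUF_SIZE,
                                   rte_socket_id());
    if (pool == NULL)
        return -1;

    RTE_ETH_FOREACH_DEV(port) {                    /* all available NICs */
        struct rte_eth_conf conf = {0};
        rte_eth_dev_configure(port, 1, 1, &conf);  /* 1 RX + 1 TX queue */
        rte_eth_rx_queue_setup(port, 0, 128,
                               rte_eth_dev_socket_id(port), NULL, pool);
        rte_eth_tx_queue_setup(port, 0, 512,
                               rte_eth_dev_socket_id(port), NULL);
        rte_eth_dev_start(port);                   /* port is now live */
    }
    return 0;
}
```
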
  • PMD loop for forwarding:

    • Get burst of packets from first port in port-pair
    • Send packets to second port in port-pair
    • Free unsent packets
    • Do this in a loop (see the forwarding-loop sketch below)
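
A sketch of that forwarding loop using the burst RX/TX API; `BURST_SIZE` and the way the port pair is passed in are illustrative:

```c
#include <rte_ethdev.h>
#include <rte_mbuf.h>

#define BURST_SIZE 32

/* Poll one port, transmit the burst on its paired port, and free
 * whatever the NIC did not accept. */
static void forward_loop(uint16_t rx_port, uint16_t tx_port)
{
    struct rte_mbuf *bufs[BURST_SIZE];

    for (;;) {
        /* get a burst of packets from the first port in the pair */
        uint16_t nb_rx = rte_eth_rx_burst(rx_port, 0, bufs, BURST_SIZE);
        if (nb_rx == 0)
            continue;

        /* send them out on the second port in the pair */
        uint16_t nb_tx = rte_eth_tx_burst(tx_port, 0, bufs, nb_rx);

        /* free any unsent packets */
        for (uint16_t i = nb_tx; i < nb_rx; i++)
            rte_pktmbuf_free(bufs[i]);
    }
}
```
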
  • RTE Ring

    • Fixed-size, "lockless" ring queue
    • Non-preemptive.
    • Supports single or multiple producers/consumers, and bulk operations.
    • Uses:
      • Single array of pointers.
      • Head/tail pointers for both producer and consumer (total 4 pointers).
    • To enqueue (dequeue works the same way on the consumer pointers; see the C sketch after this list):
      • Until the CAS succeeds:
        • Save the current head_ptr in a local variable.
        • head_next = head_ptr + num_objects
        • CAS head_ptr from the saved value to head_next.
      • Insert the objects.
      • Publish the objects:
        • Wait until tail_ptr == the saved head_ptr (earlier producers must publish first), then set tail_ptr = head_next.
    • Analysis:
      • Lightweight.
      • In theory, both loops are costly.
      • In practice, as all threads are CPU-bound, the amortized cost of the first loop is low, and waiting in the second loop is very unlikely.
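
A simplified C11-atomics sketch of the multi-producer enqueue described above; the real rte_ring additionally checks free space against the consumer tail, handles index wrap-around explicitly, and uses architecture-specific memory barriers:

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 1024            /* power of two */
#define RING_MASK (RING_SIZE - 1)

struct ring {
    void *slots[RING_SIZE];
    _Atomic uint32_t prod_head;   /* where producers reserve slots */
    _Atomic uint32_t prod_tail;   /* up to where data is visible to consumers */
    _Atomic uint32_t cons_head;   /* used by the dequeue side (not shown) */
    _Atomic uint32_t cons_tail;
};

static void enqueue_burst(struct ring *r, void **objs, uint32_t n)
{
    uint32_t head, next;

    /* Loop 1: reserve n slots by moving prod_head with a CAS. */
    do {
        head = atomic_load(&r->prod_head);
        next = head + n;
        /* (the check for a full ring against cons_tail is omitted here) */
    } while (!atomic_compare_exchange_weak(&r->prod_head, &head, next));

    /* Copy the objects into the reserved slots. */
    for (uint32_t i = 0; i < n; i++)
        r->slots[(head + i) & RING_MASK] = objs[i];

    /* Loop 2: wait for earlier producers to publish, then publish ours
     * by moving prod_tail from the saved head to next. */
    while (atomic_load(&r->prod_tail) != head)
        ;                          /* spin: producers finish in order */
    atomic_store(&r->prod_tail, next);
}
```
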
  • RTE Mempool

    • Spread objects across different channels of the DRAM controller (different channels can be concurrently accessed)
    • Maintains a per-core cache and sends requests to the shared mempool ring in bulk (see the usage sketch below)
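
An illustrative use of a mempool with a per-core cache and bulk get/put; the pool name, object count, and cache size of 256 are arbitrary choices for the sketch:

```c
#include <rte_mempool.h>
#include <rte_lcore.h>

#define POOL_OBJS 8192
#define OBJ_SIZE  2048

static struct rte_mempool *pool;

static int pool_setup(void)
{
    /* The cache_size argument gives each lcore a private stash of
     * objects, so most get/put calls never touch the shared ring;
     * refills and flushes happen in bulk. */
    pool = rte_mempool_create("obj_pool", POOL_OBJS, OBJ_SIZE,
                              256,            /* per-core cache size */
                              0,              /* no private data */
                              NULL, NULL,     /* no pool constructor */
                              NULL, NULL,     /* no per-object constructor */
                              rte_socket_id(), 0);
    return pool != NULL ? 0 : -1;
}

static void use_bulk(void)
{
    void *objs[32];

    /* Bulk get/put amortize the cost of touching the shared pool. */
    if (rte_mempool_get_bulk(pool, objs, 32) == 0) {
        /* ... use the 32 objects ... */
        rte_mempool_put_bulk(pool, objs, 32);
    }
}
```
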
  • Performance Evaluation:

    • DPDK can forward 22 M pps (L3 forwarding, NIC port to NIC port)
    • DPDK can forward 11 M pps (PHY-OVS-PHY)
    • DPDK can forward 2 M pps (NIC to VM to OVS to VM to NIC)
    • Test setup: 4 x 40 Gb ports
    • E5-2695 v4 2.1 GHz processor
    • 16 x 1 GB huge pages, 2048 x 2 MB huge pages
  • Open vSwitch (OvS)

    • Software switch used to form a network of virtual machines
    • Ethernet switching done in the hypervisor
    • Critical part of software-defined networking (implements OpenFlow)
    • User-space controller, fast path in kernel
    • DPDK is used to accelerate OvS
    • The first packet of a new flow goes to user space
      • User space decides on a rule to handle this flow
      • The rule is cached in the kernel
      • Further packets of the flow stay in the kernel fast path (a conceptual sketch follows below)
    • Database used to persistently store rules (ovsdb-server)
      • Communicates with user space component via RPC
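
A conceptual sketch (not the actual OvS code or data structures) of this split between the cached kernel fast path and the user-space slow path; all names, the flow key fields, and the single-entry-per-bucket cache are illustrative stand-ins:

```c
#include <stdint.h>
#include <string.h>

struct flow_key  { uint32_t src_ip, dst_ip; uint16_t src_port, dst_port; };
struct flow_rule { struct flow_key key; int out_port; int valid; };

#define CACHE_SIZE 1024
static struct flow_rule cache[CACHE_SIZE];   /* stand-in for the kernel flow table */

static unsigned hash_key(const struct flow_key *k)
{
    return (k->src_ip ^ k->dst_ip ^ k->src_port ^ k->dst_port) % CACHE_SIZE;
}

/* Trivial stand-ins for the user-space controller and the output path. */
static int  userspace_decide(const struct flow_key *k) { (void)k; return 0; }
static void send_on_port(const void *pkt, int port)    { (void)pkt; (void)port; }

void handle_packet(const void *pkt, const struct flow_key *key)
{
    struct flow_rule *r = &cache[hash_key(key)];

    if (!r->valid || memcmp(&r->key, key, sizeof(*key)) != 0) {
        /* cache miss: first packet of the flow goes to user space */
        r->key      = *key;
        r->out_port = userspace_decide(key);  /* user space picks a rule */
        r->valid    = 1;                      /* rule is now cached */
    }
    /* cache hit (or freshly installed rule): stay in the fast path */
    send_on_port(pkt, r->out_port);
}
```
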
  • Reading: