Conflict Between Short-Lived Process Capture (#862) and Multi-Container Environments (#863) #872

@aman1105-sa

Hey @cfc4n, I'm hitting a fundamental conflict when trying to capture HTTPS traffic from short-lived processes in a multi-container Kubernetes environment. The recommendations from issue #862 (use --pid=0) and #863 (use --pid=SPECIFIC_PID with container paths) are mutually exclusive.
Background
Following the guidance from:

Issue #862: Use --pid=0 to capture short-lived processes that spawn and exit quickly
Issue #863: Use --pid=SPECIFIC_PID with /proc/PID/root/... paths for multi-container environments

However, these approaches conflict in Kubernetes environments where:

Processes are short-lived (<1 second lifespan, e.g., curl commands)
Multiple containers run on the same node with different filesystem namespaces
Process detection and eCapture startup take ~800-1000ms

Current Implementation
Based on advice from #863, I'm using per-PID eCapture instances:

// Detection code
func (o *AutoOrchestrator) startCaptureForLibrary(lib *LibraryInfo) error {
    // Build the eCapture command for one specific PID
    cmd := exec.Command("/ecapture", "tls",
        fmt.Sprintf("--libssl=/proc/%d/root/usr/lib/x86_64-linux-gnu/libssl.so.1.1", lib.PID),
        fmt.Sprintf("--pid=%d", lib.PID), // Specific PID, not --pid=0
        "-m", "text",
        "--hex=false",
        fmt.Sprintf("--ecaptureq=ws://127.0.0.1:%d/", wsPort))

    // Surface launch failures instead of silently ignoring them
    if err := cmd.Start(); err != nil {
        return fmt.Errorf("start ecapture for PID %d: %w", lib.PID, err)
    }
    // ... WebSocket connection logic
    return nil
}

Detection loop: Scans /proc every 30 seconds to detect new processes with SSL libraries
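For reference, the per-process check inside that detection loop can be sketched roughly as below. This is a simplified illustration, not the actual scanner: the function name findLibsslInMaps is hypothetical, and it assumes the library file name starts with "libssl.so" and that the mapped path contains no spaces.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// findLibsslInMaps scans the text of a /proc/<pid>/maps file and returns the
// first mapped path whose file name starts with "libssl.so", or "" if none.
// (Hypothetical helper; assumes paths without spaces.)
func findLibsslInMaps(maps string) string {
	sc := bufio.NewScanner(strings.NewReader(maps))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Lines with a pathname have at least 6 fields:
		// address perms offset dev inode pathname
		if len(fields) < 6 {
			continue
		}
		path := fields[5]
		base := path[strings.LastIndex(path, "/")+1:]
		if strings.HasPrefix(base, "libssl.so") {
			return path
		}
	}
	return ""
}

func main() {
	// Inspect the current process as a harmless demo; a Go binary usually
	// does not map libssl, so the printed path is typically empty.
	data, err := os.ReadFile("/proc/self/maps")
	if err != nil {
		fmt.Println("cannot read maps:", err)
		return
	}
	fmt.Printf("libssl path: %q\n", findLibsslInMaps(string(data)))
}
```

The path returned here is relative to the target's own mount namespace, which is why the capture command prefixes it with /proc/PID/root.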

What's Happening - The Race Condition

Timeline of Events:

T+0ms:    Curl process spawns (PID 275721)
T+50ms:   SSL library loaded
T+200ms:  HTTPS request made
T+500ms:  Curl exits ✅ (request complete)
T+30000ms: Scanner detects PID 275721 in /proc/275721/maps
T+30200ms: eCapture command launched
T+30900ms: eBPF hooks attached
T+31000ms: WebSocket connection established
T+31001ms: ❌ Process is already dead - nothing to capture

Actual Logs:

{"level":"info","time":"2025-11-25T11:42:53Z","message":"🔧 Starting PER-CONTAINER eCapture for PID=275721"}
{"level":"info","time":"2025-11-25T11:42:53Z","message":"✅ eCapture started for Container PID=275721"}
{"level":"info","time":"2025-11-25T11:42:54Z","message":"✅ WebSocket connected for openssl:...:275721"}
{"level":"debug","time":"2025-11-25T11:42:54Z","message":"📋 Process log: {\"target PID\":275721}"}
{"level":"error","time":"2025-11-25T11:42:55Z","message":"❌ WebSocket read error: EOF"}

Result: eCapture starts and targets PID 275721, but the process exited about 30 seconds earlier. The WebSocket receives EOF almost immediately because there is no process left to monitor.
The Fundamental Conflict
| Requirement | --pid=0 | --pid=SPECIFIC_PID |
| --- | --- | --- |
| Capture short-lived processes | ✅ Works | ❌ Fails (process dies before attach) |
| Multi-container support | ❌ Fails (namespace isolation) | ✅ Works |
| Capture ongoing processes | ✅ Works | ✅ Works |

Test Environment

Kubernetes: 3-node cluster (EKS)
Kernel: 6.8.0-1031-azure (eBPF supported)
eCapture: v1.4.3
Test workload: Debian container running:
while true; do
curl -H "Authorization: Bearer token" https://httpbin.org/get
sleep 10
done
Process lifespan: ~500-800ms per curl execution
Scanner interval: 30 seconds (to avoid overloading /proc)

Attempted Solutions

  1. ✅ Per-PID unique ports (fixed port collision)
    Changed from:
    sessionKey := fmt.Sprintf("%s:%s", lib.LibraryType, lib.LibraryPath)
    To:
    sessionKey := fmt.Sprintf("%s:%s:%d", lib.LibraryType, lib.LibraryPath, lib.PID)
    Result: Port collisions eliminated, but short-lived processes still missed.
  2. ❌ Faster scanning (tried 5-second intervals)
    Result: High CPU usage, still couldn't catch processes that live <1 second.
  3. ❌ Pre-launching eCapture with --pid=0
    Problem: Can't use container-specific paths like /proc/275721/root/usr/lib/libssl.so.1.1 with --pid=0 because different containers need different library paths.
Questions

Is it possible to capture short-lived processes (<1s) in multi-container environments?
Can eCapture use --pid=0 with namespace-aware library paths? For example:
/ecapture tls --libssl=/proc/*/root/usr/lib/libssl.so.1.1 --pid=0
Does eBPF support "pre-hooking"? Can we attach hooks to a library path before any process loads it, so hooks are already in place when processes spawn?
Alternative approach? Should I:

Accept that short-lived processes can't be captured in multi-container setups?
Use --pid=0 per container namespace (how?)?
Use a different capture strategy entirely?
