Technical Architecture¶

About This Document

This document introduces NetPulse's technical architecture, design philosophy, and implementation details. If you only want to use NetPulse for network device management, you can skip it and go straight to the Quick Start, API Reference, and Driver Selection chapters; doing so will not affect normal use.

If you are interested in distributed system design, high-performance network programming, plugin architecture, and similar topics, or want to understand how NetPulse is implemented, read on. We are happy to share our design philosophy, technology choices, and architectural thinking, and we welcome your opinions and suggestions.

NetPulse uses a distributed architecture to provide highly available, high-performance network device management. The sections below cover the system's layers, design philosophy, core components, and key technical implementations.

Architecture Overview¶

NetPulse uses a layered architecture. Between the client and the network devices sit the API layer, the task queue layer, and the Worker layer:

[Figure: NetPulse Technical Architecture]

Architecture Layers¶

The NetPulse technical architecture is divided into five layers, from the client down to the network devices:

flowchart TB
    subgraph Layer1[Client Layer]
        Client1[API Client 1]
        Client2[API Client 2]
        Client3[CLI / SDK / AI Agent]
    end

    subgraph Layer2[API Service Layer]
        Controller1[Controller 1<br/>FastAPI]
        Controller2[Controller 2<br/>FastAPI]
        Controller3[Controller N<br/>FastAPI]
    end

    subgraph Layer3[Task Queue Layer]
        Redis[(Redis<br/>Task Queue & Status Storage)]
    end

    subgraph Layer4[Worker Execution Layer]
        subgraph FIFO[FIFO Worker Pool]
            FW1[FIFO Worker 1]
            FW2[FIFO Worker 2]
            FWN[FIFO Worker N]
        end

        subgraph Node[Node Worker Node]
            NW1[Node Worker 1]
            NW2[Node Worker 2]

            subgraph Pinned1[Pinned Workers]
                PW1[Pinned Worker<br/>Device A]
                PW2[Pinned Worker<br/>Device B]
            end
        end
    end

    subgraph Layer5[Network Device Layer]
        Device1[Network Device 1]
        Device2[Network Device 2]
        DeviceN[Network Device N]
    end

    Layer1 -->|HTTP/HTTPS| Layer2
    Layer2 -->|Task Scheduling| Layer3
    Layer3 -->|Task Distribution| Layer4
    Layer4 -->|SSH/API| Layer5

    style Layer1 fill:#F3E5F5,stroke:#8E24AA,stroke-width:2px
    style Layer2 fill:#9ce8e4,stroke:#05998b,stroke-width:2px
    style Layer3 fill:#FFD3D0,stroke:#D82C20,stroke-width:2px
    style Layer4 fill:#E6F4EA,stroke:#4B8B3B,stroke-width:2px
    style Layer5 fill:#DCDCDC,stroke:#696969,stroke-width:2px

Design Philosophy¶

Task Types and Worker Division¶

NetPulse's design philosophy is based on an analysis of the characteristics of network operations tasks, which usually fall into two types:

Task Type	Characteristics	Execution Requirements	Typical Scenarios
Query Tasks	Read-only operations that do not change device state	Order need not be guaranteed; fast response is the priority	Pulling device status, checking configuration information
Modification Tasks	Write operations that change device state	Execution order must be guaranteed on each device	Pushing configuration, applying changes

Based on the different characteristics of these two task types, NetPulse defines three types of Workers:

flowchart TB
    subgraph Tasks[Task Types]
        Query[Query Tasks<br/>Fast response, no order needed]
        Config[Modification Tasks<br/>Guarantee order, tolerate queuing]
    end

    subgraph Workers[Worker Types]
        FIFO[FIFO Worker<br/>Parallel processing, no device binding]
        Pinned[Pinned Worker<br/>Serial processing, device binding]
        Node[Node Worker<br/>Daemon process, manages Pinned]
    end

    subgraph Devices[Network Devices]
        D1[Device 1]
        D2[Device 2]
        DN[Device N]
    end

    Query --> FIFO
    Config --> Pinned
    Query -.->|Can also use| Pinned

    FIFO --> D1
    FIFO --> D2
    FIFO --> DN

    Node -->|Dynamically create| Pinned
    Pinned -->|One-to-one binding| D1
    Pinned -->|One-to-one binding| D2

    style Query fill:#FDF6E3,stroke:#C29F48,stroke-width:2px
    style Config fill:#E6F2FA,stroke:#3D7E9A,stroke-width:2px
    style FIFO fill:#FDF6E3,stroke:#C29F48,stroke-width:2px
    style Pinned fill:#E6F2FA,stroke:#3D7E9A,stroke-width:2px
    style Node fill:#E6F4EA,stroke:#4B8B3B,stroke-width:2px
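
The routing rule in the diagram above can be sketched in a few lines of Python. This is a minimal illustration, not NetPulse's actual code: the `TaskKind` and `Task` types, the `select_queue` function, and the queue names `fifo` and `pinned:<host>` are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from enum import Enum


class TaskKind(Enum):
    QUERY = "query"    # read-only, order does not matter
    MODIFY = "modify"  # changes device state, must run in order per device


@dataclass(frozen=True)
class Task:
    kind: TaskKind
    host: str      # target device address
    command: str


def select_queue(task: Task, prefer_pinned: bool = False) -> str:
    """Return the (hypothetical) queue name a task should be placed on."""
    if task.kind is TaskKind.MODIFY or prefer_pinned:
        # One queue per device, consumed serially by that device's Pinned Worker.
        return f"pinned:{task.host}"
    # Shared queue consumed by FIFO Workers; no per-device ordering.
    return "fifo"
```

The `prefer_pinned` flag mirrors the dotted "Can also use" edge in the diagram: a query task may be sent to a Pinned Worker, for example to reuse an already open connection.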

Worker Characteristics Description¶

FIFO Worker
- Characteristics: No device binding; only one can run per node (guaranteed using a file lock)
- Advantages: Parallel processing, reduced queuing time
- Limitations: Task order is not strictly guaranteed; single-instance limit per node
- Applicable: Query tasks that need a fast response
- Scaling: Horizontal scaling is achieved by deploying multiple nodes

Pinned Worker
- Characteristics: One-to-one binding with a device, serial execution
- Advantages: Guarantees task order, supports connection persistence
- Applicable: Modification tasks and other scenarios that require an order guarantee

Node Worker
- Function: Daemon process that dynamically creates and manages Pinned Workers
- Reason: It is impossible to predict which devices users will operate on, so Pinned Workers must be created on demand

Connection Persistence

Because each Pinned Worker is bound one-to-one to a device, it can keep its SSH session open between tasks, which helps improve performance (see Long Connection Technology).

Core Components¶

RESTful API
- Built on FastAPI
- Handles incoming requests, validates and queues tasks
Message Queue
- Redis-based task queue (based on RQ)
- Used for state synchronization in multi-master multi-slave architecture
- Temporarily stores task status and task execution results
Worker Nodes
- Three types of Workers handle different types of tasks
- FIFO Worker: Takes tasks from the shared queue in first-in, first-out order
- Node Worker: Runs as a daemon process that manages Pinned Workers and node status
- Pinned Worker: Maintains a connection to a single device and executes that device's tasks serially
Plugin System
- The extensible plugin system covers device drivers, schedulers, template engines, and Webhooks
- Clear interface definitions make secondary development and integration convenient
- Plugins are lazily loaded on demand, so they do not affect system startup performance
- Plugins can be selected dynamically at runtime without restarting the service

Data Flow¶

1. FIFO Worker Task Flow¶

FIFO Workers handle query tasks in "connect on use" mode, opening a connection for each task and closing it afterward:

sequenceDiagram
    participant Client as Client
    participant Controller as Controller
    participant Redis as Redis Queue
    participant FIFOWorker as FIFO Worker
    participant Device as Network Device

    Client->>Controller: API Request
    Controller->>Controller: Authentication & Parameter Validation
    Controller->>Redis: Task Enqueue (fifo queue)
    Redis-->>Controller: Return Job ID
    Controller-->>Client: Immediately Return Job ID

    Note over Redis,FIFOWorker: Asynchronous Execution
    Redis->>FIFOWorker: Task Distribution
    FIFOWorker->>Device: Establish Connection
    Device-->>FIFOWorker: Connection Success
    FIFOWorker->>Device: Execute Command
    Device-->>FIFOWorker: Return Result
    FIFOWorker->>Device: Disconnect
    FIFOWorker->>Redis: Store Result

    Client->>Controller: Query Task Status
    Controller->>Redis: Get Result
    Redis-->>Controller: Return Result
    Controller-->>Client: Return Task Result
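
The submit-then-poll pattern in this flow can be sketched without any external services. The example below is an assumption-laden stand-in: it replaces Redis and RQ with a thread pool and a dictionary, and the names `InMemoryJobQueue`, `enqueue`, `status`, and `run_query` are hypothetical. It only illustrates that the caller receives a job ID immediately and fetches the result later.

```python
import uuid
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Dict


class InMemoryJobQueue:
    """Stand-in for the Redis-backed queue: enqueue returns at once, results are polled."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)
        self._jobs: Dict[str, Future] = {}

    def enqueue(self, func: Callable[..., Any], *args: Any) -> str:
        job_id = uuid.uuid4().hex
        self._jobs[job_id] = self._pool.submit(func, *args)
        return job_id  # the caller does not wait for the task itself

    def status(self, job_id: str) -> Dict[str, Any]:
        future = self._jobs[job_id]
        if not future.done():
            return {"id": job_id, "status": "pending", "result": None}
        error = future.exception()
        if error is not None:
            return {"id": job_id, "status": "failed", "result": str(error)}
        return {"id": job_id, "status": "finished", "result": future.result()}

    def shutdown(self) -> None:
        # Blocks until every submitted job has completed.
        self._pool.shutdown(wait=True)


def run_query(host: str, command: str) -> str:
    # Connect on use: a real worker would open a session here, run the
    # command, and close the session before returning.
    return f"{host}: output of '{command}'"
```

A client would call `enqueue(run_query, "10.0.0.1", "show version")`, keep the returned ID, and call `status(job_id)` until the status is `finished` or `failed`.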

2. Pinned Worker Task Flow¶

Pinned Workers handle modification tasks in "connection persistence" mode, keeping the connection open between tasks:

sequenceDiagram
    participant Client as Client
    participant Controller as Controller
    participant Manager as Manager
    participant Scheduler as Scheduler
    participant NodeWorker as Node Worker
    participant PinnedWorker as Pinned Worker
    participant Device as Network Device
    participant Redis as Redis Queue

    Client->>Controller: API Request
    Controller->>Manager: Process Task
    Manager->>Scheduler: Select Running Node
    Scheduler-->>Manager: Return Node Information

    alt Pinned Worker Doesn't Exist
        Manager->>NodeWorker: Create Pinned Worker
        NodeWorker->>PinnedWorker: Start New Process
        PinnedWorker->>Device: Establish SSH Connection
        Device-->>PinnedWorker: Connection Success
        Note over PinnedWorker: Connection Persistence
    else Pinned Worker Already Exists
        Note over Manager,PinnedWorker: Reuse Existing Worker
    end

    Manager->>Redis: Task Enqueue (Device Queue)
    Redis-->>Controller: Return Job ID
    Controller-->>Client: Immediately Return Job ID

    Note over Redis,PinnedWorker: Asynchronous Execution
    Redis->>PinnedWorker: Task Distribution
    PinnedWorker->>Device: Execute Command (Reuse Connection)
    Device-->>PinnedWorker: Return Result
    PinnedWorker->>Redis: Store Result
    Note over PinnedWorker: Keep Connection, Wait for Next Task

    Client->>Controller: Query Task Status
    Controller->>Redis: Get Result
    Redis-->>Controller: Return Result
    Controller-->>Client: Return Task Result
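
The execution half of this flow (one worker, one device, one long-lived connection, tasks run strictly in order) can be sketched as follows. This is a simplified model under stated assumptions: it uses a thread where the flow above starts a new process, and `FakeSession` is a hypothetical stand-in for a real SSH session that merely counts how often it is opened.

```python
import queue
import threading
from typing import List


class FakeSession:
    """Hypothetical stand-in for an SSH session; counts how often it is opened."""

    def __init__(self, host: str) -> None:
        self.host = host
        self.connect_count = 0

    def connect(self) -> None:
        self.connect_count += 1

    def send(self, command: str) -> str:
        return f"{self.host}> {command}"

    def close(self) -> None:
        pass


class PinnedWorker(threading.Thread):
    """Consumes one device's queue serially over a single long-lived session."""

    def __init__(self, host: str) -> None:
        super().__init__(daemon=True)
        self.tasks: queue.Queue = queue.Queue()
        self.session = FakeSession(host)
        self.results: List[str] = []

    def run(self) -> None:
        self.session.connect()             # established once, then reused
        while True:
            command = self.tasks.get()     # blocks until the next task arrives
            if command is None:            # sentinel value: shut the worker down
                break
            self.results.append(self.session.send(command))
        self.session.close()
```

Because a single thread drains the queue, commands reach the device in the order they were enqueued, and the session is opened only once no matter how many tasks follow.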

Technical Features¶

NetPulse improves system performance, availability, and scalability through the following three core designs:

Feature Dimension	Core Feature	Actual Effect
Performance Optimization	Persistent SSH Connection	Reduces connection setup time and improves response speed when a device is operated on frequently
Availability	Distributed Multi-Node Architecture	Supports multi-node deployment; when a single node fails, service continues through the other nodes
Scalability	Plugin Architecture	Functionality is extended through the plugin mechanism without modifying core code

Performance Optimization: Persistent Connection Technology¶

Background: Traditional methods establish a new connection for each operation, and establishing a connection usually takes 2-5 seconds.

Implementation: Each Pinned Worker maintains a persistent SSH connection to its device and reuses that connection to execute tasks.

Effect:
- When the same device is operated on frequently, avoids the overhead of repeatedly establishing connections
- The actual performance improvement depends on the network environment, device response speed, and other factors
- Helps improve the connection success rate (fewer failures that frequent reconnection may cause)
- Reduces resource consumption (less connection-setup overhead)
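
As a rough back-of-the-envelope illustration of the saving, the helper below compares the two modes. The connection time of 3 seconds is taken from the middle of the 2-5 second range above; the 0.5 second command time and the task count of 20 are assumed values picked only for this example, and `total_seconds` is a hypothetical helper, not part of NetPulse.

```python
def total_seconds(tasks: int, connect_s: float, command_s: float, persistent: bool) -> float:
    """Estimate total time for a batch of tasks against one device."""
    # A persistent connection is set up once; otherwise once per task.
    connects = 1 if persistent else tasks
    return connects * connect_s + tasks * command_s


per_task = total_seconds(20, connect_s=3.0, command_s=0.5, persistent=False)
reused = total_seconds(20, connect_s=3.0, command_s=0.5, persistent=True)
print(per_task, reused)  # 70.0 13.0
```

Under these assumptions the batch drops from 70 seconds to 13 seconds. Real figures depend on the network environment and device response speed, as noted above.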

Availability: Distributed Architecture Design¶

Core Capabilities:
- Multi-Node Deployment: Supports multi-Controller, multi-Worker node deployment
- Fault Handling: When a Worker fails, cleanup and reassignment are performed
- State Synchronization: Multi-node state synchronization based on Redis

Deployment Methods: Supports Docker Compose and Kubernetes deployment.

Scalability: Plugin Architecture¶

Design Philosophy: Functionality is extended through a plugin mechanism; drivers are one type of plugin.

Supported Extensions:
- Device Drivers: New device drivers can be added (currently supports Netmiko, NAPALM, PyEAPI, Paramiko, etc.)
- Template Engines: New template formats can be added (currently supports Jinja2, TextFSM, TTP, etc.)
- Scheduling Algorithms: New scheduling strategies can be added (currently supports greedy, minimum load, etc.)
- Notification Mechanisms: New Webhook implementations can be added

Extension Method: Inherit from the corresponding base class and create the class in the plugin directory; the system discovers and loads it automatically.
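
One common way to get "inherit a base class and it is picked up automatically" in Python is a registry filled by `__init_subclass__`. The sketch below uses a scheduler as the example plugin. It is an illustration only: `BaseScheduler`, `LeastLoadScheduler`, `get_scheduler`, and the `least_load` name are hypothetical, NetPulse's real base classes and discovery mechanism may differ, and the step of importing modules from the plugin directory is not shown.

```python
from abc import ABC, abstractmethod
from typing import Dict, Type


class BaseScheduler(ABC):
    """Hypothetical base class; subclasses register themselves by name."""

    registry: Dict[str, Type["BaseScheduler"]] = {}
    name: str = ""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        if cls.name:  # skip intermediate classes that do not set a name
            BaseScheduler.registry[cls.name] = cls

    @abstractmethod
    def select_node(self, load_by_node: Dict[str, int]) -> str:
        """Pick the node that should host a new Pinned Worker."""


class LeastLoadScheduler(BaseScheduler):
    name = "least_load"

    def select_node(self, load_by_node: Dict[str, int]) -> str:
        # Choose the node currently running the fewest workers.
        return min(load_by_node, key=load_by_node.get)


def get_scheduler(name: str) -> BaseScheduler:
    """Instantiate a scheduler chosen at runtime by its registered name."""
    return BaseScheduler.registry[name]()
```

With this pattern, adding a strategy means defining one more subclass; selecting it at runtime is a dictionary lookup, so nothing in the core needs to change.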

Learn More

For detailed plugin system introduction and development guide, please refer to Plugin System.

Design Decision Explanations¶

Why Three Types of Workers?¶

Question: Why can't we use only one type of Worker to handle all tasks?

Answer: Network operations tasks have two different sets of requirements:
- Query Tasks: Need fast response, can be processed in parallel, don't need strict order
- Modification Tasks: Need to guarantee order, can tolerate queuing, but must execute serially

If only one type of Worker is used, either query performance is sacrificed (serial execution), or configuration security is sacrificed (parallel execution may cause configuration conflicts).

Why Only One FIFO Worker Per Node?¶

Reason: The FIFO Worker uses RQ's Worker class, which forks child processes. To avoid resource contention and inconsistent state, a file lock ensures that only one FIFO Worker instance runs per node. If more concurrent processing capacity is needed, scale horizontally by deploying multiple nodes.
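
A single-instance guard of this kind can be built on an advisory file lock. The sketch below uses `fcntl.flock` from the Python standard library (Unix only) in non-blocking mode. The function names and the lock file path are hypothetical, and this is not necessarily how NetPulse implements its lock.

```python
import fcntl
import os
from typing import Optional


def try_acquire_node_lock(path: str) -> Optional[int]:
    """Return a file descriptor holding the lock, or None if it is already held."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        os.close(fd)
        return None
    return fd


def release_node_lock(fd: int) -> None:
    """Release the lock and close the descriptor."""
    fcntl.flock(fd, fcntl.LOCK_UN)
    os.close(fd)
```

A worker would call `try_acquire_node_lock` at startup and exit if it returns `None`. The kernel drops the lock when the holding process exits, so a crashed worker does not leave the node blocked.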

Why Do Pinned Workers Need Dynamic Creation?¶

Reason: It is impossible to predict which devices users will operate on. Pre-creating every possible Pinned Worker would waste a lot of resources, since most of them might never be used. Creating Workers dynamically allocates resources on demand and improves resource utilization.
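
On-demand creation boils down to a "get or create" lookup keyed by device. The sketch below shows the idea with a hypothetical `PinnedWorkerPool` class; the `factory` callable stands in for whatever actually starts a Pinned Worker, and the in-process dictionary stands in for state that a multi-node deployment would keep in shared storage.

```python
import threading
from typing import Callable, Dict


class PinnedWorkerPool:
    """Creates at most one worker per device, the first time that device is used."""

    def __init__(self, factory: Callable[[str], object]) -> None:
        self._factory = factory
        self._workers: Dict[str, object] = {}
        self._lock = threading.Lock()

    def get_or_create(self, host: str) -> object:
        with self._lock:  # prevents two callers from creating workers for one device
            worker = self._workers.get(host)
            if worker is None:
                worker = self._factory(host)
                self._workers[host] = worker
            return worker

    def __len__(self) -> int:
        return len(self._workers)
```

Only devices that are actually used ever get a worker, so the resource cost tracks real usage rather than the size of the device inventory.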

Why Choose Redis + RQ Instead of Other Message Queues?¶

Reason:
- Simple and Efficient: RQ is based on Redis, so no additional message queue middleware is needed
- State Management: Redis can store both task status and results, which simplifies the architecture
- Mature and Stable: RQ is a mature task queue solution in the Python ecosystem
- Easy to Debug: Task status can be viewed directly in Redis, which is convenient for troubleshooting