The CARP system is architected as a microservices-based, event-driven platform deployed on Kubernetes for scalability. This design emphasizes safety above all, followed by protection of equipment and inventory, then fault tolerance, and finally performance. By using Apache Kafka as the data backbone, CARP decouples components and ensures real-time communication through events, allowing the system to react instantly and reliably to changes such as new tasks, robot status updates, and human presence. In the event of any failure or anomaly, the system defaults to safe behavior; for example, a communication loss triggers a robot halt, reflecting the core principle that safety actions always take precedence over productivity.
CARP's architecture is organized into functional modules, each responsible for a subset of the requirements. These modules communicate primarily via Kafka event streams and use well-defined APIs where necessary. Kubernetes provides an orchestration layer to deploy these services with auto-scaling, rolling updates, and high availability. The sections below break down the design by module/function, discussing how each meets its requirements, the design decisions made, and the alternatives considered. In all modules, design choices were guided by the priority order: personnel safety, asset safety, fault tolerance, and performance.
Task Ingestion and Orchestration Modules
This module handles task intake and mission orchestration, covering HLR-01, HLR-02 and FR-001 to FR-004. It ingests work orders such as pick or putaway tasks from external systems (WMS/ERP) and orchestrates multi-robot missions.
Through a northbound API (FR-070/FR-071), the system receives task requests via REST or message queue. Tasks are deduplicated and prioritized by SLA (FR-001) and then broken into multi-leg missions if needed (FR-003). For example, a mission might require a drone to retrieve an item from a mezzanine and hand it off to an AMR on the floor. The orchestrator will split this into a drone sub-task and an AMR sub-task, with a defined handoff point (FR-002, FR-003). The orchestrator also enforces prerequisites and completion checks for each stage (FR-004).
We opted for a hybrid approach centered on a Task Orchestration Service that uses event-driven communication. The orchestrator is a microservice on Kubernetes that subscribes to incoming task events from the WMS interface and orchestrates mission steps by publishing command events. It maintains mission state, which can be stored in a lightweight state store or in memory with periodic checkpoints to Kafka. This design ensures that complex multi-robot workflows are managed explicitly for correctness and safety. For instance, the orchestrator will not dispatch a drone and an AMR into the same physical space without timing and handoff coordination, preventing collisions or idle waits (HLR-02, FR-003). We prioritized this controlled sequencing to ensure human safety and asset protection. The orchestrator can hold or adjust missions immediately if any safety event occurs (HLR-04).
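As a rough sketch of this decomposition (the class names, topic name, and fields below are illustrative assumptions, not part of the specification), a two-leg mezzanine pick might be split and published as follows:

```python
from dataclasses import dataclass, field
from typing import List, Optional
import json
import uuid

@dataclass
class SubTask:
    """One leg of a mission, executed by a single robot class."""
    robot_class: str                      # "drone" or "amr"
    action: str                           # e.g. "retrieve", "transport"
    location: str                         # symbolic location id
    handoff_point: Optional[str] = None   # where the item changes hands, if any

@dataclass
class Mission:
    """A multi-leg mission; every event carries the mission_id for correlation."""
    mission_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    legs: List[SubTask] = field(default_factory=list)

def decompose_pick_task(item_location: str, handoff: str) -> Mission:
    """Split a mezzanine pick into a drone leg and an AMR leg joined at a handoff point (FR-002, FR-003)."""
    return Mission(legs=[
        SubTask("drone", "retrieve", item_location, handoff_point=handoff),
        SubTask("amr", "transport", handoff),
    ])

# The orchestrator would publish each leg as a command event; the topic name
# "carp.missions.commands" is an assumption used only for illustration.
mission = decompose_pick_task("mezzanine-A3", "handoff-zone-1")
for leg in mission.legs:
    event = {"mission_id": mission.mission_id, **leg.__dict__}
    print(json.dumps(event))  # stand-in for producer.send("carp.missions.commands", event)
```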
By using Kafka, task intake is decoupled from downstream processing; the system can buffer bursts of task events without losing data. Multiple instances of the Task Orchestration Service can run in a consumer group to process different tasks in parallel, scaling out as needed, and the Kubernetes Horizontal Pod Autoscaler (HPA) can spin up more instances under high load. Each mission is given a unique ID, and all events related to that mission carry this ID, enabling any service to reconstruct the sequence if needed. This design meets NFR-001: task assignment decisions, made in collaboration with the scheduling module, happen within a few hundred milliseconds, since Kafka adds minimal latency and the orchestrator's logic is optimized for quick decisions. Overall, this module ensures that tasks are efficiently taken in and broken down, while remaining ready to pause or cancel missions if safety dictates.
Robot Allocation and Scheduling Modules
This module decides which robot(s) should carry out a given task, fulfilling HLR-03 and FR-010 to FR-013. It ensures that each task is assigned to an optimal robot or team of robots considering capabilities, location, and battery, and can reassign on the fly if conditions change.
When a new task or mission step is created by the orchestrator module, the Scheduling Service evaluates the available drones and AMRs. It considers capability constraints (FR-010), current location and proximity to the task, battery state of charge (SoC) (FR-011), and even recent duty cycles to avoid overuse. The scheduler uses a utility scoring function (per FR-011) that also penalizes unnecessary travel (FR-012) to maximize efficiency. For joint missions (HLR-02), it allocates a pair or team, choosing a drone and an AMR that together minimize total fulfillment time. The scheduler must also handle dynamic reassignments (FR-013): if a robot fails or a higher-priority task preempts, it can safely reallocate a task mid-mission.
The scheduling problem here is essentially multi-robot task allocation (MRTA). We implemented a Centralized Fleet Scheduler microservice. The orchestrator publishes a “task pending assignment” event, and the scheduler service consumes it. The scheduler maintains an up-to-date view of each robot's status, which it gets from the Digital Twin and Robot Interface modules via events. Using this data, it computes the best assignment, for example picking the nearest available drone that has enough battery and appropriate payload capacity for a pick task. If the task is multi-leg, it pairs a drone and an AMR considering their combined route (the drone fetches the item, then meets the AMR at a handoff point). The decision is then emitted as an “assignment” event, which the orchestrator and relevant execution modules receive.
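A minimal sketch of the utility scoring idea (FR-010 to FR-012) is shown below; the weights, thresholds, and field names are illustrative assumptions rather than tuned values from the design:

```python
from dataclasses import dataclass

@dataclass
class RobotState:
    robot_id: str
    kind: str            # "drone" or "amr"
    soc: float           # battery state of charge, 0.0-1.0
    distance_m: float    # travel distance from current position to the task
    payload_kg: float    # maximum payload capacity

MIN_SOC = 0.30  # battery safety rule: never assign a nearly depleted robot

def utility(robot: RobotState, required_kind: str, required_payload_kg: float) -> float:
    """Higher is better; infeasible candidates score negative infinity (FR-010, FR-011)."""
    if robot.kind != required_kind or robot.payload_kg < required_payload_kg:
        return float("-inf")
    if robot.soc < MIN_SOC:
        return float("-inf")
    # Reward charge headroom, penalize unnecessary travel (FR-012).
    return 1.0 * robot.soc - 0.01 * robot.distance_m

def pick_best(candidates, required_kind, required_payload_kg):
    """Return the feasible robot with the highest utility, or None."""
    best, best_score = None, float("-inf")
    for robot in candidates:
        score = utility(robot, required_kind, required_payload_kg)
        if score > best_score:
            best, best_score = robot, score
    return best

fleet = [RobotState("drone-12", "drone", 0.80, 40.0, 2.0),
         RobotState("drone-07", "drone", 0.15, 10.0, 2.0)]   # closest, but too low on battery
print(pick_best(fleet, "drone", 1.5).robot_id)  # -> drone-12
```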
To meet fault tolerance and speed requirements, this service is stateless; it derives all needed information from incoming events and the shared world state. We run it with at least two instances for redundancy; if one fails mid-computation, the task event is simply reprocessed by another instance via Kafka consumer group rebalancing. The scheduling algorithm is optimized for <500 ms decisions (NFR-001): we use heuristics and only do heavy computations like path estimation when needed. Because this module directly affects efficiency and asset usage, we ensure it never sacrifices safety. For example, a battery safety rule prevents assigning a nearly depleted drone just because it is closest, and the scheduler avoids assignments that send too many robots into one area at once, preventing congestion and reducing the risk of collisions. If a robot reports an incident or failure mid-mission (FR-013), the scheduler can immediately produce a re-allocation event, for instance dispatching a second drone to take over a delivery while the orchestrator coordinates a safe handover. This module thus satisfies HLR-03 and contributes to HLR-06 by optimizing global flow, while obeying the safety and fault tolerance priorities.
Path Planning and Deconfliction Modules
This module handles navigation planning for both aerial and ground robots, ensuring they don't collide and efficiently share resources (HLR-01, HLR-05). It covers FR-020 to FR-024 and FR-050 to FR-052. Essentially, once tasks are assigned to robots, this component charts the time-space paths for each robot and manages intersections to avoid conflict.
The Path Planning & Traffic Control Service computes feasible routes, 3D airspace routes for drones and 2D floor paths for AMRs, that respect all constraints. Path planning and traffic management is a complex, continuous problem, so we implemented a Central Traffic Control Service using a global planning approach, with the Digital Twin acting similarly to a blackboard. When the Scheduling module assigns a task to a robot, it triggers a route planning event. The Path Planning service then computes an optimal route for that robot from its current position to the goal, considering current and predicted positions of other robots and reserved resources. We use pathfinding algorithms over the static map extended with a time dimension, ensuring no two robots are scheduled into the same space at the same time.
After computing a route, the planner publishes the route or waypoints to the robot via Kafka. It simultaneously marks the path's space-time segments as reserved in the Digital Twin's model, so any subsequent planning for other robots will avoid those segments (FR-022). For shared resources like elevators, the planner will coordinate with the Infrastructure Module to request an elevator at a certain time. The infrastructure service responds via event when the elevator is ready, at which point the planner finalizes that segment of the route and signals the robot to proceed (FR-050, FR-051). This handshake ensures tight coupling for critical handoffs without blocking the whole system. The use of events allows other planning to continue in parallel.
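The space-time reservation bookkeeping can be pictured as a small reservation table; the sketch below is illustrative only, with hypothetical cell IDs and one-second windows rather than the actual map resolution:

```python
from collections import defaultdict

class ReservationTable:
    """Space-time reservations: each discretized cell maps to the time windows
    during which some robot holds it (FR-022). Cell ids and one-second windows
    are illustrative assumptions, not the actual map resolution."""

    def __init__(self):
        self._reserved = defaultdict(list)   # cell_id -> [(t_start, t_end, robot_id)]

    def is_free(self, cell_id, t_start, t_end):
        """A cell is free if no existing reservation overlaps the requested window."""
        return all(t_end <= s or t_start >= e for (s, e, _) in self._reserved[cell_id])

    def reserve(self, cell_id, t_start, t_end, robot_id):
        if not self.is_free(cell_id, t_start, t_end):
            return False
        self._reserved[cell_id].append((t_start, t_end, robot_id))
        return True

    def release(self, robot_id):
        """Free all of a robot's reservations, e.g. after a replan or mission abort."""
        for cell_id in self._reserved:
            self._reserved[cell_id] = [w for w in self._reserved[cell_id] if w[2] != robot_id]

# Planner usage: reserve each segment of a computed route before publishing it.
table = ReservationTable()
route = [("aisle3-cell12", 0.0, 2.0), ("aisle3-cell13", 2.0, 4.0)]
assert all(table.reserve(cell, start, end, "amr-07") for (cell, start, end) in route)
assert not table.reserve("aisle3-cell12", 1.0, 3.0, "amr-09")  # conflicting request is rejected
```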
To handle dynamic changes, the planning service subscribes to events like “obstacle detected in aisle 3” or “human entered zone A”. On such an event, it marks that area as temporarily closed (FR-021) and recomputes any affected active routes (NFR-002). New routes or hold commands are then sent out to robots within ~200-300 ms, fulfilling the NFR-002 replan requirement. We also leverage on-robot local avoidance capabilities (FR-024): each robot has basic collision sensors native to its platform and can stop or swerve around sudden obstacles. Our architecture supports this by not micromanaging every millisecond of motion. If a robot performs a minor avoidance maneuver, it reports it as an event, and the planner can adjust the global plan if needed. This layered safety ensures that even if the global planner's last update was a fraction of a second ago, the robot itself can react in between (NFR-002).
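A simplified sketch of the event-triggered replan path is shown below; the topic names and message schema are assumptions, and the kafka-python client is used only as a generic stand-in for whatever Kafka client the service actually uses:

```python
# The topic names and message schema here are assumptions; kafka-python's
# KafkaConsumer/KafkaProducer are used as generic stand-ins for the real client.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("carp.world.events", bootstrap_servers="kafka:9092",
                         value_deserializer=lambda v: json.loads(v.decode("utf-8")))
producer = KafkaProducer(bootstrap_servers="kafka:9092",
                         value_serializer=lambda v: json.dumps(v).encode("utf-8"))

def robots_routed_through(zone_id):
    """Placeholder: query the Digital Twin for robots whose active routes cross the zone."""
    return ["amr-07", "drone-12"]

for msg in consumer:
    event = msg.value
    if event.get("type") in ("obstacle_detected", "human_entered_zone"):
        zone_id = event["zone_id"]
        # Mark the zone temporarily closed (FR-021), hold affected robots, then replan (NFR-002).
        for robot_id in robots_routed_through(zone_id):
            producer.send("carp.robots.commands",
                          {"robot_id": robot_id, "command": "hold", "reason": zone_id})
        # ...recompute the affected routes around the closed zone and publish new waypoints...
```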
In terms of performance and scaling, we deploy the Path Planning service as a set of identical instances on Kubernetes. The Kafka event queue allows multiple planners to work concurrently on different route requests; we partition planning tasks by zone or by robot fleet to avoid two instances solving conflicting plans simultaneously. Because this function is computationally heavier, we allocate more resources to it and allow Kubernetes to scale it out under load. The 300 robots per site (NFR-004) are handled by efficient planning algorithms and by offloading less critical computations to background processes. The design prioritizes safety: if ever in conflict, the planner will choose a slower or longer route, or even hold a robot idle, rather than risk a close call. Efficiency is important (HLR-06), but never at the cost of violating separation or rushing a resource handoff. This careful, centralized coordination meets HLR-05 and HLR-01 by effectively acting as the “air traffic control” of the warehouse.
Human Safety and Zone Management Modules
This module monitors human presence and manual interventions, enforcing human-in-the-loop safety (HLR-04) and covering FR-030 to FR-034. It ensures robots slow down, stop, or reroute when humans are in proximity, and allows authorized humans to pause or override robot operations with proper logging. The Safety Management Service has two primary roles: automated zone enforcement, and manual override and intervention.
The warehouse is instrumented with sensors that notify CARP whenever a human or any other unexpected entity enters a protected area (FR-031). These areas are defined as either caution zones, where robots may continue but at reduced speed, or stop zones, where no robot motion is allowed (FR-030). The service consumes these sensor events and correlates them with map zones. Within milliseconds of detecting a human in a danger zone, CARP must respond (FR-032). The Safety service publishes commands to all robots in that zone, or approaching it, to either slow down, pause, or take an alternate route. Drones may be commanded to hold position or ascend to a safe hover altitude; AMRs may decelerate to a crawl or stop completely. These commands are high priority and bypass normal scheduling queues if needed.
Manual override and intervention occurs through the UI or physical controls; supervisors can issue overrides (FR-033). They might invoke an e-stop on a specific robot, freeze all movements in a zone for an emergency, or resume operations when clear. Each such action is authenticated (FR-090, NFR-021) and logged with timestamp and reason (FR-034, NFR-091). The Safety module ensures that any manual stop command has immediate effect system-wide.
Safety enforcement is essentially a real-time control loop overlay on the whole system. CARP's safety module is implemented as a combination of real-time event processing and redundant fail-safes. The Safety Management Service runs as one or more instances on Kubernetes, each capable of handling a subset of sensors or zones for scalability. It uses a rules engine or state machine for zone policies (FR-030). For example, if a human is detected in a caution zone, the rule might be “allow AMRs to continue but at 50% speed and no drones below 5m altitude in that zone.” These rules can be configured per site or zone. The service subscribes to all human-detection events (FR-031) and zone status changes, and it references the Digital Twin to map those coordinates to zone IDs. Then it issues speed change or stop commands to the affected robots through the Robot Interface. The latency from detection to command is kept very low (HLR-04). Kafka and our network assumptions support this, but we also deploy this service close to the edge to minimize round-trip time.
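The zone policy rules can be expressed as a simple lookup from zone type to robot commands; the sketch below uses example speed factors and altitudes, not the site's actual configuration:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ZonePolicy:
    amr_speed_factor: float      # 1.0 = full speed, 0.0 = stop
    drone_min_altitude_m: float  # drones must hold above this altitude

POLICIES = {
    "caution": ZonePolicy(amr_speed_factor=0.5, drone_min_altitude_m=5.0),
    "stop":    ZonePolicy(amr_speed_factor=0.0, drone_min_altitude_m=float("inf")),
}

def commands_for_human_detection(zone_type, robots_in_zone):
    """Translate a human-detection event (FR-031) into per-robot commands (FR-030, FR-032)."""
    policy = POLICIES[zone_type]
    commands = []
    for robot in robots_in_zone:
        if robot["kind"] == "amr":
            command = "stop" if policy.amr_speed_factor == 0.0 else "slow"
            commands.append({"robot_id": robot["id"], "command": command,
                             "speed_factor": policy.amr_speed_factor})
        else:  # drone: hold position or ascend to a safe hover altitude
            commands.append({"robot_id": robot["id"], "command": "hold_or_ascend",
                             "min_altitude_m": policy.drone_min_altitude_m})
    return commands

print(commands_for_human_detection("caution",
      [{"id": "amr-07", "kind": "amr"}, {"id": "drone-12", "kind": "drone"}]))
```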
For manual interventions (FR-033), the system provides a UI button and also integrates physical E-stop buttons via the Infrastructure module if required. For example, when a supervisor clicks “Freeze Zone A,” the Safety service broadcasts a ZoneA_Freeze event. Robot controllers subscribed to zone commands will execute an immediate stop, and the orchestrator will mark missions in that zone as on hold. All such interventions create log events which the safety service tags with user ID, reason code, and affected robots (FR-034). These logs go to an immutable audit log store (NFR-091) for later review.
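A minimal sketch of a manual zone freeze, assuming hypothetical topic names and audit fields, might look like this:

```python
import json, time, uuid

def freeze_zone(zone_id, user_id, reason, publish):
    """Broadcast a zone freeze (FR-033) and emit an audit record (FR-034, NFR-091).
    `publish(topic, payload)` stands in for the Kafka producer; topic names are assumptions."""
    event = {
        "event_id": str(uuid.uuid4()),
        "type": "zone_freeze",
        "zone_id": zone_id,
        "issued_by": user_id,          # authenticated user (FR-090)
        "reason": reason,
        "timestamp": time.time(),
    }
    publish("carp.safety.commands", event)   # high-priority safety topic
    publish("carp.audit.log", event)         # sink backed by append-only audit storage
    return event

# Example: a supervisor freezing Zone A during an incident.
freeze_zone("zone-A", "supervisor-42", "spill cleanup",
            publish=lambda topic, payload: print(topic, json.dumps(payload)))
```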
One critical design decision here is prioritization: safety commands always preempt other traffic. For instance, if a robot is in the middle of a task, a stop command will override any mission instruction. The architecture ensures this by using dedicated high-priority topics or channels for safety commands, with robot firmware giving them priority. This module also works closely with Path Planning: when a zone is frozen, the planner is alerted via the Digital Twin or a direct message not to route any robot through that zone and to recompute paths. When the zone is clear and operations resume, the safety service releases the hold, with proper logging of who authorized it.
In line with our safety-first philosophy, redundancy is built in. If the Safety service or network is unresponsive, robots trigger their own fail-safes (NFR-011). Additionally, critical sensors like badge-based proximity might be processed by both the cloud service and a local failsafe device. The overall approach satisfies HLR-04 by dynamically adjusting robot behavior when humans are present, and FR-032 by reacting within milliseconds. It also aligns with the industry best practice that any fault or uncertainty leads to a safe state. By structuring it as an event-driven, central policy service with local robot enforcement, we ensure maximum flexibility to define zones and responses without compromising reaction time.
Robot Interface and Execution Control Modules
This module comprises the southbound interfaces to the robots. CARP communicates with disparate drone and AMR platforms to send commands and receive telemetry. It addresses FR-071, parts of safe handovers during reassignments (FR-013), and extensibility to new robot vendors (NFR-031). It also plays a role in ensuring reliable execution and reporting for tasks.
Each robot may have its own vendor-specific API. The Robot Interface module provides an adapter service for each robot type or vendor, translating CARP's generic commands into robot-specific instructions and vice versa for telemetry.
The architecture pattern here is akin to a Bridge/Adapter pattern. We implement a Robot Adapter Layer consisting of multiple containerized services on Kubernetes. Each adapter service subscribes to relevant command topics, filtered by robot ID or fleet, and translates those commands to the robot. For instance, a movement command from the path planner is picked up by the AMR's adapter, which then instructs the AMR's controller. The adapter also listens to robot telemetry: it might subscribe to the robot's native topics and then publish key data into Kafka for the rest of CARP (FR-062, FR-041).
We emphasize reliability and safety in this layer. All commands to robots are acknowledged. If a robot doesn't confirm an action, the adapter immediately informs the orchestrator and safety service. Adapters also implement time-outs and retries. If an AMR doesn't reach a waypoint in expected time, the adapter can query its status or trigger a fail-safe. Because this is the final step between CARP and physical movement, adapters are designed to be robust.
To support extensibility (NFR-031), we provide a clear Adapter SDK or template. New robot types implement standardized functions: Register (a handshake with CARP declaring the robot's capabilities), ExecuteCommand, and ReportStatus. With this structure, adding a new vendor's robot might be done in a few weeks by following the template, rather than redesigning the system.
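A possible shape for such an adapter template, with hypothetical method names and a stand-in vendor client, is sketched below:

```python
from abc import ABC, abstractmethod

class RobotAdapter(ABC):
    """Template for the Adapter SDK (NFR-031): each vendor implements these three
    functions; method and class names here are illustrative, not a published API."""

    @abstractmethod
    def register(self) -> dict:
        """Handshake with CARP declaring robot id, kind, payload, and capabilities."""

    @abstractmethod
    def execute_command(self, command: dict) -> bool:
        """Translate a generic CARP command into the vendor API and return an acknowledgement."""

    @abstractmethod
    def report_status(self) -> dict:
        """Normalize vendor telemetry into CARP's schema for publication to Kafka (FR-062)."""

class ExampleVendorAMRAdapter(RobotAdapter):
    """Illustrative adapter; `vendor_client` is a stand-in, not a real vendor SDK."""

    def __init__(self, robot_id: str, vendor_client):
        self.robot_id = robot_id
        self.vendor = vendor_client

    def register(self) -> dict:
        return {"robot_id": self.robot_id, "kind": "amr", "payload_kg": 50.0}

    def execute_command(self, command: dict) -> bool:
        if command["command"] == "move_to":
            return self.vendor.goto(command["x"], command["y"])       # vendor-specific call
        if command["command"] in ("stop", "slow"):
            return self.vendor.set_speed(command.get("speed_factor", 0.0))
        return False  # unknown commands are rejected and reported upstream

    def report_status(self) -> dict:
        return {"robot_id": self.robot_id,
                "pose": self.vendor.pose(),
                "soc": self.vendor.battery()}
```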
Security is also critical here as all communications to robots are authenticated and encrypted (NFR-020, NFR-021). We use mTLS certificate auth for the adapters to talk to robots, preventing any rogue commands. Each robot has a unique identity and credentials (NFR-022), rotated regularly (FR-092). This ensures that even at the device control layer, there's no unauthorized access.
In summary, the Robot Interface module acts as the execution arm of CARP, carrying out the plans and safety actions on the actual machines. By decoupling this via an event-driven adapter layer, we keep the core system abstracted from hardware specifics, improve fault tolerance, and facilitate scaling: you can add dozens of robots, and as long as the adapters and Kafka can handle the message volume, no re-architecture is needed. It directly supports HLR-08 by normalizing different robots under one framework.
Infrastructure Integration Module
This module integrates facility infrastructure systems such as conveyors, elevators, and automatic doors into the CARP workflow (HLR-05). It addresses FR-050 to FR-052 and ensures robots can seamlessly use shared facility resources during their missions. The integration points are very site-specific and often involve synchronous operations, for example an elevator must complete its move before the robot continues, combined with event notifications. The Infrastructure Integration Service runs as a set of microservices, or a unified service with plugins for each system.
For instance, an ElevatorAdapter and a ConveyorAdapter are each specialized but follow a common pattern. They subscribe to events like “Request_Elevator(robot_id, floor1, floor2)” and perform the necessary actions. For an elevator request, the service might interact with the elevator's API: it sends a command to bring an elevator car to floor1 and reserve it. Once the elevator signals it is in position (the integration could receive an event or poll an API), the service publishes an “ElevatorReady(robot_id)” event. The robot's path planner or orchestrator then knows it can move the robot into the elevator. The Infrastructure service may temporarily take over controlling that robot's movement during the elevator ride for safety, ensuring the robot is centered, sending the command for the elevator to go to floor2, and then signaling when the trip is complete (FR-050). Similarly for doors: when a robot approaches a door, the planner might have placed a “DoorOpen” action in the route. The service receives that and triggers the door's actuator. If the door doesn't open in time, it alerts the Safety module to stop the robot.
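The elevator handshake can be sketched as follows; the elevator API calls, topic names, and timeout are illustrative assumptions:

```python
import time

def handle_elevator_request(event, elevator_api, publish):
    """Sketch of the Request_Elevator -> ElevatorReady handshake (FR-050, FR-051).
    `elevator_api` and `publish` stand in for the site's elevator controller and the
    Kafka producer; method names, topics, and the timeout are illustrative assumptions."""
    robot_id, floor1, floor2 = event["robot_id"], event["floor1"], event["floor2"]

    elevator_api.call_car(floor=floor1)               # bring a car to floor1 and reserve it
    deadline = time.time() + 60.0                     # fail-safe timeout
    while not elevator_api.car_at(floor=floor1):
        if time.time() > deadline:
            publish("carp.safety.alerts",
                    {"type": "elevator_timeout", "robot_id": robot_id})
            return                                     # never send a robot into an unconfirmed resource
        time.sleep(0.5)

    publish("carp.infra.events", {"type": "ElevatorReady", "robot_id": robot_id})
    # ...after the robot confirms it is inside and centered, run the trip...
    elevator_api.send_car(floor=floor2)
    publish("carp.infra.events", {"type": "ElevatorTripComplete", "robot_id": robot_id})
```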
This module heavily uses asynchronous events to notify completion or problems, rather than blocking calls. This prevents one long operation from halting others and improves fault tolerance. If an elevator is out of service, the service can inform the scheduler to avoid multi-floor tasks or redirect to a backup elevator. Because these operations involve machinery that could affect safety, the Infrastructure module also upholds the safety priority and will abort an operation if needed. For instance, if a person presses an emergency stop in an elevator, the service will catch that signal and ensure no robot moves into that elevator. All interactions are fail-safe: if CARP cannot confirm that a resource is secured, it will not send a robot into it.
From a deployment perspective, these integration services run on-premises, close to the devices. Kubernetes allows deploying them per site as needed; some sites might not have certain devices, so those adapters simply wouldn't be deployed.
In summary, this module ensures CARP can seamlessly incorporate facility equipment into robot missions, preserving throughput (HLR-05) by treating these devices as part of the coordinated plan. It abstracts complexity of various protocols into a uniform, event-driven interface so that other modules can simply request actions and not worry about the details. This fulfills the requirement of integrating with existing infrastructure without requiring changes to it, and allows CARP to be deployed in brownfield environments with minimal disruption.
Digital Twin and Localization Data Management Modules
This module maintains the live, unified map and state of the warehouse (HLR-01), effectively the Digital Twin of the environment, and handles localization data. It corresponds to FR-060 to FR-063 and supports all modules that need environment or location information. The Digital Twin Service acts as the single source of truth for static map data: the layout of aisles, shelves, walls, permanent no-fly zones, and other physical structures, including 3D coordinates for drone flight space, the 2D map for AMRs, and vertical connectors (elevators, ramps). The digital twin is essentially a data repository with real-time pub-sub capabilities. We established the Digital Twin Service as a core component, deployed redundantly for HA, that maintains the current graph/grid of the warehouse and all dynamic elements. To maintain performance, smaller updates, like one robot's position, are handled via fast in-memory updates and not every minor move is broadcast. Instead, the Twin broadcasts aggregate state at a lower frequency, while critical changes like a new obstacle or zone closure are immediate events. This balances load with the need for reactivity.
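As an illustration of this update policy (class and topic names and the one-second aggregate period are assumptions, not specified values):

```python
import time

class DigitalTwinBroadcaster:
    """Critical changes are published immediately; routine robot poses are held
    in memory and broadcast as an aggregate snapshot at a lower frequency."""

    CRITICAL_TYPES = {"obstacle_added", "zone_closed", "zone_opened"}

    def __init__(self, publish, aggregate_period_s=1.0):
        self._publish = publish               # stand-in for a Kafka producer
        self._period = aggregate_period_s
        self._poses = {}                      # robot_id -> latest pose (in-memory only)
        self._last_flush = time.monotonic()

    def on_update(self, update):
        if update["type"] in self.CRITICAL_TYPES:
            self._publish("carp.world.events", update)          # immediate broadcast
        elif update["type"] == "robot_pose":
            self._poses[update["robot_id"]] = update["pose"]    # fast in-memory write, not broadcast
        self._maybe_flush()

    def _maybe_flush(self):
        now = time.monotonic()
        if now - self._last_flush >= self._period and self._poses:
            self._publish("carp.world.snapshot",
                          {"type": "aggregate_positions", "poses": dict(self._poses)})
            self._last_flush = now
```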
For localization (FR-062), we assume each robot does sensor fusion on-board or via its adapter. The result is a pose and covariance that the Twin receives. If a robot's pose becomes uncertain or lost, the Twin could flag it, possibly triggering that robot to perform a re-localization routine. The Twin might also integrate global references to correct positions. This ensures NFR-014 by combining data sources.
Map configuration changes (FR-063) are handled by loading new static data. We maintain previous versions so we can roll back if a new map has issues. All zone definitions and safety policies (NFR-042) are stored in a version-controlled manner, meaning any change is logged and traceable.
The Digital Twin is effectively the memory of the system, enabling all other modules to operate from a consistent, updated picture. We chose to centralize it to maximize consistency and facilitate global optimization. The risk of a central data store is a failure or bottleneck, so we mitigate that by running it as a replicated service and using in-memory caching in other services to reduce read load.
This module underpins HLR-01 by providing the live map with sub-second updates. It also aids graceful degradation (HLR-07): for example, if connectivity is lost, the Twin might instruct robots to revert to local control zones defined by the static map until reconnection, as part of a strategy coordinated with the Safety and Robot adapter modules. In summary, the Digital Twin and data management design ensure that every decision the system makes is based on the most current, accurate information available, which is crucial for both safety and efficiency.
Monitoring, Analytics, and Maintenance Module
This module deals with system monitoring, performance analytics, and predictive maintenance, covering FR-040 to FR-042 and FR-080 to FR-082. It gathers data from all components to produce metrics, identifies issues, and supports maintenance workflows.
A telemetry pipeline service subscribes to raw telemetry (such as battery levels and sensor readings) and events (such as mission events and alerts) and filters and forwards them to appropriate sinks. For instance, the service pushes robot health metrics into a time-series DB, pushes event logs into an Elasticsearch cluster, and calls an alerting service for certain triggers.
Small jobs or services compute derived metrics. For example, a Throughput Calculator service reads task completion events and continuously computes orders per hour and other productivity KPIs. A Safety Analyzer process listens for proximity alerts or sudden stops to count near-misses (FR-081).
A predictive maintenance component uses telemetry trends to predict maintenance needs (FR-042), for example using simple rules to forecast failures. When it identifies an issue, it can automatically generate a maintenance task or ticket in whatever system the facility uses, and notify operators.
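A minimal rule-based forecast in this spirit might look like the following; the thresholds and telemetry field names are illustrative assumptions, not calibrated values:

```python
def maintenance_flags(robot_telemetry):
    """Simple rule-based checks on telemetry trends (FR-042)."""
    flags = []
    if robot_telemetry.get("battery_cycle_count", 0) > 800:
        flags.append("battery nearing end of life")
    if robot_telemetry.get("motor_temp_c", 0) > 70:
        flags.append("motor running hot")
    if robot_telemetry.get("localization_error_m", 0) > 0.3:
        flags.append("possible sensor calibration drift")
    return flags

# Non-empty flags would trigger a maintenance ticket in the facility's system of
# record and a notification to operators.
print(maintenance_flags({"battery_cycle_count": 850, "motor_temp_c": 55}))
```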
For energy management (FR-040), the module gets battery levels from robots and knows charging station statuses. It uses a rule-based scheduler that works with the main Scheduler: if many robots are low, it staggers charging to avoid having too many out of service at once (FR-040). It can mark robots as temporarily unavailable for tasks while charging, and update their status when they're ready.
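A sketch of the staggering rule, assuming example thresholds and a hypothetical concurrency limit:

```python
LOW_SOC = 0.25                 # send robots to charge below this state of charge
MAX_CONCURRENT_CHARGING = 3    # cap on robots charging at once, to protect throughput

def select_robots_to_charge(fleet, currently_charging):
    """Pick the lowest-battery robots to charge without exceeding the cap (FR-040)."""
    low = sorted((r for r in fleet if r["soc"] < LOW_SOC), key=lambda r: r["soc"])
    slots = max(0, MAX_CONCURRENT_CHARGING - currently_charging)
    # Chosen robots are marked unavailable to the main Scheduler until charged.
    return [r["robot_id"] for r in low[:slots]]

print(select_robots_to_charge(
    [{"robot_id": "amr-01", "soc": 0.12},
     {"robot_id": "amr-02", "soc": 0.40},
     {"robot_id": "drone-03", "soc": 0.18}],
    currently_charging=2))   # -> ['amr-01'] with one free charging slot
```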
We integrate a notification service that can be configured with rules (FR-072). This uses event triggers from our metrics.
We maintain logs of all missions and events (FR-073). The Digital Twin service can load a timeline of events to replay what happened, or to test new routing logic in a sandbox mode. We provide a UI or API to run these simulations. This was considered in design by ensuring the architecture can run in a “simulated mode” where, for example, robot commands don't go to real robots but to a simulator service. The modular design makes it possible to plug in simulated robots easily for this purpose.
All these monitoring components are also containerized on Kubernetes. They are mostly separated from the critical path, so we can scale them or even restart them without affecting the core operations. Data storage is persisted to meet retention requirements. Immutable logging (NFR-091) is achieved by writing append-only logs to a secure storage where they cannot be tampered with.
By implementing this module, we satisfy the need for continuous improvement and transparency. The warehouse managers can see how the system is performing, and the system itself can proactively handle maintenance and energy, contributing to overall reliability. Fault tolerance is improved because early detection of issues can prevent bigger failures. This also closes the loop for safety and efficiency. CARP's design allows such insights to feed back into reconfigurable policies.
User Interface and Security Governance Modules
The Control Room UI, configuration tools, and the security architecture ensure that only authorized access occurs and that data is protected. This area touches HLR-08 and FR-070 to FR-073 for interfaces, and FR-090 to FR-092 and NFR-020 to NFR-022 for security.
CARP provides a Web-based dashboard (FR-070) where different user roles can monitor and control the system.
Control Room Supervisors have a live map view of the warehouse with real-time positions of all robots updated via the Digital Twin (HLR-01). They see alerts and can drill down into incidents or performance metrics (FR-080). They can configure zones or SLA parameters through this UI.
Floor Associates have a simplified interface to interact with robots. They can call a robot for assistance, acknowledge an alert, or request a temporary hold in their area if they need to work (HLR-04).
Maintenance/IT users can see device status, connectivity, and logs, and can perform actions like taking a robot out of service or updating software. Actual firmware updates are staged by CARP but executed with vendor tools, per the defined out-of-scope boundary.
Robotics Engineers may access a sandbox mode or developer APIs to test new robot integrations or algorithms (HLR-08).
The UI is backed by an API Gateway (FR-071) that also allows external systems or scripts to fetch data or issue commands with proper auth. For instance, the WMS might query CARP for status of tasks, or a corporate dashboard might pull metrics.
We use a microservices-based backend for the UI that combines data from other modules, exposed via REST or by subscribing to Kafka. For example, the live map view is powered by a WebSocket feed that the Digital Twin service pushes positions into.
All users and services are authenticated (FR-090, NFR-021). User accounts will likely integrate with the customer's identity provider, and we support SSO/MFA for the UI. Within CARP, services use mutual TLS and token-based auth for internal calls (NFR-020). Each robot and service has its own credentials, rotated regularly (FR-092). Role-based access control ensures, for example, that a Floor Associate cannot change system configurations or view sensitive logs.
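The role-based access control check can be illustrated with a simple role-to-permission mapping; the roles and permissions below are examples, not the product's actual policy:

```python
PERMISSIONS = {
    "floor_associate":   {"call_robot", "acknowledge_alert", "request_zone_hold"},
    "supervisor":        {"call_robot", "acknowledge_alert", "request_zone_hold",
                          "freeze_zone", "configure_zones", "configure_sla", "view_incidents"},
    "maintenance":       {"view_device_status", "take_robot_out_of_service", "view_logs"},
    "robotics_engineer": {"sandbox_mode", "developer_api"},
}

def authorize(role, action):
    """Return True only if the role explicitly grants the action (deny by default)."""
    return action in PERMISSIONS.get(role, set())

assert authorize("supervisor", "freeze_zone")
assert not authorize("floor_associate", "configure_sla")
```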
Kafka channels are encrypted (TLS 1.3) and access-controlled (NFR-020). The Kubernetes cluster network is segmented so that external interfaces (APIs) are separated from internal communication. We assume the warehouse network is private, but we still treat every connection as untrusted to maintain a zero-trust stance.
Any PII (perhaps if video feeds or associate info is processed) is either not stored or is anonymized (NFR-022). For instance, the system might detect humans via camera but not record their images. We store only operational telemetry needed for running the system.
Every command or configuration change is logged with who performed it and when (NFR-091). These logs are tamper-proof, stored in append-only storage with cryptographic integrity checks. This ensures accountability, especially for manual overrides or policy changes.