Advanced Tier Exercises

This document consolidates all exercises from the Advanced tier lessons for building production-ready, fault-tolerant robotic workflows.

Exercise 01: Watchdog System Implementation

Objective: Implement a comprehensive watchdog system for a multi-node workflow.

Scenario: You have a navigation workflow with multiple nodes (sensor, planner, controller). Build a watchdog that monitors all nodes and detects failures.

Requirements:

Heartbeat Mechanism: Each node publishes heartbeats
Watchdog Node: Monitors all heartbeats
Timeout Detection: Detects when nodes stop responding
Health Status: Publishes overall system health
Alerts: Logs warnings and errors

Tasks:

Modify existing nodes to publish heartbeats
Create a watchdog node that subscribes to all heartbeats
Implement timeout detection logic
Publish system health status
Test by intentionally killing nodes

Acceptance Criteria:

All nodes publish heartbeats
Watchdog detects node failures within 2 seconds
System health status is accurate
Alerts are logged appropriately
Tested with multiple failure scenarios

Starter Code:

import rclpy
from rclpy.node import Node
from std_msgs.msg import String, Bool
import time

class WatchdogNode(Node):
    def __init__(self):
        super().__init__('watchdog')

        # Dictionary to track last heartbeat time for each node
        self.last_heartbeat = {}
        self.timeout_threshold = 2.0  # seconds

        # TODO: Create subscribers for each node's heartbeat
        # TODO: Create publisher for system health
        # TODO: Create timer to check heartbeats

    def heartbeat_callback(self, msg, node_name):
        """Update last heartbeat time for a node"""
        # Monotonic time is unaffected by wall-clock adjustments (e.g. NTP)
        self.last_heartbeat[node_name] = time.monotonic()

    def check_health(self):
        """Check if all nodes are healthy"""
        current_time = time.monotonic()
        all_healthy = True

        # Note: only nodes that have sent at least one heartbeat appear here
        for node_name, last_time in self.last_heartbeat.items():
            if current_time - last_time > self.timeout_threshold:
                self.get_logger().error(f'Node {node_name} timeout!')
                all_healthy = False

        # TODO: Publish health status

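def main():
    rclpy.init()
    node = WatchdogNode()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass  # Ctrl-C is the normal way to stop the node
    finally:
        node.destroy_node()
        if rclpy.ok():
            rclpy.shutdown()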

if __name__ == '__main__':
    main()
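
The timeout logic can be developed and unit-tested without ROS. The sketch below is one possible shape for it, not the reference solution: `HeartbeatTracker`, its method names, and the injectable `clock` parameter are all hypothetical. It differs from the starter code in one respect worth noting: the expected node names are registered up front, so a node that never sends a first heartbeat is also reported. Keep in mind that the detection delay is roughly the timeout plus the period of the checking timer, so meeting a 2-second target probably needs a timeout below 2.0 seconds.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional


class HeartbeatTracker:
    """Tracks heartbeat arrival times for a fixed set of expected nodes."""

    def __init__(self, expected_nodes: Iterable[str], timeout_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._timeout_s = timeout_s
        # None means "no heartbeat received yet"; the start time lets us
        # flag nodes that never come up at all.
        self._last_seen: Dict[str, Optional[float]] = {
            name: None for name in expected_nodes
        }
        self._started_at = clock()

    def beat(self, node_name: str) -> None:
        """Record a heartbeat (call this from the subscriber callback)."""
        self._last_seen[node_name] = self._clock()

    def timed_out(self) -> List[str]:
        """Return the sorted names of nodes whose last heartbeat is too old."""
        now = self._clock()
        stale = []
        for name, last in self._last_seen.items():
            reference = self._started_at if last is None else last
            if now - reference > self._timeout_s:
                stale.append(name)
        return sorted(stale)

    def all_healthy(self) -> bool:
        """True when no expected node has timed out."""
        return not self.timed_out()
```

Inside the watchdog node, `beat()` would be called from `heartbeat_callback` and `timed_out()` from the timer callback. Passing a fake clock makes the timeout behavior testable without sleeping.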

Exercise 02: Supervisor Node with Recovery

Objective: Build a supervisor node that can automatically recover from failures.

Scenario: Your navigation system occasionally fails. Build a supervisor that detects failures and implements recovery strategies.

Recovery Strategies:

Restart: Restart the failed node
Fallback: Switch to a simpler behavior
Safe Mode: Stop and wait for manual intervention
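
One way to order these three strategies is an escalation ladder: try restarts first, then a fallback, then safe mode. The following is a minimal sketch under that assumption. `Recovery`, `EscalationPolicy`, and the attempt limits are illustrative names and values, not part of the exercise, and the code only decides what to try next; actually restarting a node or switching behaviors is left to the supervisor.

```python
from enum import Enum
from typing import Dict


class Recovery(Enum):
    RESTART = "restart"
    FALLBACK = "fallback"
    SAFE_MODE = "safe_mode"


class EscalationPolicy:
    """Chooses the next recovery strategy for a failing node."""

    def __init__(self, max_restarts: int = 2, max_fallbacks: int = 1):
        self._max_restarts = max_restarts
        self._max_fallbacks = max_fallbacks
        self._attempts: Dict[str, int] = {}  # failures seen per node

    def next_action(self, node_name: str) -> Recovery:
        """Record one more failure of node_name and return what to try next."""
        count = self._attempts.get(node_name, 0) + 1
        self._attempts[node_name] = count
        if count <= self._max_restarts:
            return Recovery.RESTART
        if count <= self._max_restarts + self._max_fallbacks:
            return Recovery.FALLBACK
        return Recovery.SAFE_MODE

    def mark_recovered(self, node_name: str) -> None:
        """Reset the failure count once the node is healthy again."""
        self._attempts.pop(node_name, None)
```

Counting failures per node keeps one flaky component from pushing an otherwise healthy system into safe mode too early; whether that is the right trade-off depends on the robot.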

Requirements:

Monitor system health
Detect different types of failures
Implement appropriate recovery for each failure type
Log all recovery attempts
Escalate to safe mode if recovery fails

Tasks:

Create supervisor node
Implement failure detection
Implement recovery strategies
Add state persistence for recovery
Test with various failure scenarios
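
For the state persistence task, a simple approach is to write the supervisor's state to a JSON file atomically, so a crash during the write cannot leave a half-written file behind. This is a sketch under assumptions: the function names, the JSON format, and the idea of storing state in a single file are illustrative choices, not requirements of the exercise.

```python
import json
import os
import tempfile
from typing import Any, Dict


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write state as JSON so that readers never see a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then rename it.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise


def load_state(path: str, default: Dict[str, Any]) -> Dict[str, Any]:
    """Return the saved state, or a copy of default if missing or corrupt."""
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return dict(default)
```

A restarted supervisor could call `load_state` on startup to resume, for example, the current goal and recovery mode.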

Acceptance Criteria:

Supervisor detects failures correctly
Recovery strategies are implemented
System recovers automatically when possible
Safe mode activates when recovery fails
All actions are logged

Exercise 03: Sensor Dropout Handling

Objective: Handle sensor dropouts gracefully without system failure.

Scenario: Your robot's LIDAR occasionally drops out for 1-2 seconds. Implement handling so the robot continues operating safely.

Requirements:

Detect sensor dropouts
Use last known good data temporarily
Reduce speed or stop if dropout is too long
Resume normal operation when sensor recovers
Log all dropout events

Tasks:

Implement sensor health monitoring
Create data buffering for last known good data
Implement degraded operation mode
Add automatic recovery when sensor returns
Test with simulated dropouts

Acceptance Criteria:

Sensor dropouts are detected within 500 ms
Robot operates safely during short dropouts
Robot stops safely during long dropouts
Normal operation resumes automatically
All events logged

Starter Code:

import rclpy
from rclpy.node import Node
from sensor_msgs.msg import LaserScan
import time

class SensorMonitor(Node):
    def __init__(self):
        super().__init__('sensor_monitor')

        # Monotonic time is unaffected by wall-clock adjustments (e.g. NTP)
        self.last_sensor_time = time.monotonic()
        self.last_good_data = None
        self.dropout_threshold = 0.5  # seconds
        self.in_dropout = False

        # TODO: Create subscriber for sensor data
        # TODO: Create timer to check for dropouts
        # TODO: Create publisher for processed data

    def sensor_callback(self, msg):
        """Process incoming sensor data"""
        self.last_sensor_time = time.monotonic()
        self.last_good_data = msg

        if self.in_dropout:
            self.get_logger().info('Sensor recovered!')
            self.in_dropout = False

        # TODO: Publish data

    def check_dropout(self):
        """Check if sensor has dropped out"""
        current_time = time.monotonic()
        time_since_data = current_time - self.last_sensor_time

        if time_since_data > self.dropout_threshold and not self.in_dropout:
            self.get_logger().warn('Sensor dropout detected!')
            self.in_dropout = True
            # TODO: Implement degraded operation

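def main():
    rclpy.init()
    node = SensorMonitor()
    try:
        rclpy.spin(node)
    except KeyboardInterrupt:
        pass  # Ctrl-C is the normal way to stop the node
    finally:
        node.destroy_node()
        if rclpy.ok():
            rclpy.shutdown()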

if __name__ == '__main__':
    main()
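
The degraded operation mode can be reduced to one question: how much of the commanded speed is still safe given the age of the last scan? The sketch below assumes a linear ramp from full speed at the dropout threshold down to a full stop at a second, longer limit. The function name `speed_scale`, the 2.0-second stop limit, and the linear shape are assumptions made for illustration; a real robot may need a more conservative profile.

```python
def speed_scale(time_since_data_s: float,
                dropout_threshold_s: float = 0.5,
                stop_after_s: float = 2.0) -> float:
    """Return a multiplier in [0, 1] to apply to the commanded speed."""
    if stop_after_s <= dropout_threshold_s:
        raise ValueError("stop_after_s must be greater than dropout_threshold_s")
    if time_since_data_s <= dropout_threshold_s:
        return 1.0  # data is fresh: normal operation
    if time_since_data_s >= stop_after_s:
        return 0.0  # data is too stale to trust: stop
    # In between, ramp linearly from full speed down to zero.
    remaining = stop_after_s - time_since_data_s
    return remaining / (stop_after_s - dropout_threshold_s)
```

In `check_dropout`, the value of `time_since_data` could be passed to this function and the result multiplied into the outgoing velocity command, so the robot slows during short dropouts and stops during long ones.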

Exercise 04: Continuous Operation System

Objective: Deploy a workflow that can run continuously for 24+ hours.

Requirements:

Resource Management: No memory leaks; CPU usage stays stable over time
Log Rotation: Logs are rotated so they cannot fill the disk
Performance Monitoring: Track key metrics
Graceful Restart: Can restart without data loss
Health Reporting: Regular health reports

Tasks:

Profile existing workflow for resource usage
Fix any memory leaks or resource issues
Implement log rotation
Add performance monitoring
Run 24-hour stress test
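
For application logs written through Python's standard `logging` module, rotation can be handled by `RotatingFileHandler`. The sketch below is one option under that assumption; the helper name `make_rotating_logger`, the size limits, and the format string are illustrative. Log files produced by ROS 2's own logging system are managed separately and are not covered here.

```python
import logging
from logging.handlers import RotatingFileHandler


def make_rotating_logger(name: str, path: str,
                         max_bytes: int = 5_000_000,
                         backup_count: int = 5) -> logging.Logger:
    """Create a logger whose file is capped at roughly max_bytes.

    When the file would exceed max_bytes it is renamed to path.1,
    older backups shift up, and at most backup_count backups are kept.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(path, maxBytes=max_bytes,
                                  backupCount=backup_count)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False  # avoid duplicate output via the root logger
    return logger
```

With these settings the total disk usage for one logger is bounded by about `max_bytes * (backup_count + 1)`.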

Acceptance Criteria:

System runs for 24+ hours without issues
Memory usage is stable
CPU usage is reasonable
Logs are managed properly
Performance metrics are tracked
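
"Memory usage is stable" is easier to verify when it is turned into a number. One possible measure, sketched below, is the least-squares slope of periodic memory samples, expressed in MB per hour; a slope near zero over the 24-hour run suggests no steady leak. The function name, the units, and the choice of a linear fit are assumptions, and how the samples are collected is left open.

```python
from typing import Sequence


def growth_per_hour(samples_mb: Sequence[float], interval_s: float) -> float:
    """Least-squares slope of evenly spaced memory samples, in MB per hour."""
    n = len(samples_mb)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2            # mean of sample indices 0..n-1
    mean_y = sum(samples_mb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_mb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope_per_sample = num / den    # MB gained per sampling interval
    return slope_per_sample * 3600.0 / interval_s
```

A fit over many samples is less sensitive to short spikes than comparing only the first and last reading.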

Capstone Project: Production-Ready Autonomous Robot

Objective: Build a complete, production-ready autonomous robot workflow with full fault tolerance.

System Requirements:

Core Functionality:
- Autonomous navigation
- Obstacle avoidance
- Task execution (delivery, patrol, etc.)
Fault Tolerance:
- Watchdog monitoring
- Supervisor with recovery
- Sensor dropout handling
- Error state management
Production Features:
- Continuous operation capability
- Resource management
- Comprehensive logging
- Performance monitoring
- Health reporting
Testing:
- Unit tests for critical components
- Integration tests for workflows
- Stress tests for reliability
- Failure injection tests
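
Failure injection tests need a repeatable way to make a component go silent. A minimal sketch, assuming messages carry a timestamp in seconds: a wrapper that sits between a message source and its consumer and drops everything inside configured time windows. `DropoutInjector` and its interface are hypothetical and independent of ROS, so the same idea would have to be adapted to real subscriber callbacks.

```python
from typing import Callable, Generic, List, Tuple, TypeVar

T = TypeVar("T")


class DropoutInjector(Generic[T]):
    """Forwards messages to a callback except during dropout windows."""

    def __init__(self, forward: Callable[[T], None],
                 windows: List[Tuple[float, float]]):
        self._forward = forward
        self._windows = windows  # (start_s, end_s) pairs, end exclusive
        self.dropped = 0         # number of messages suppressed so far

    def on_message(self, stamp_s: float, msg: T) -> None:
        """Forward msg unless its timestamp falls inside a dropout window."""
        if any(start <= stamp_s < end for start, end in self._windows):
            self.dropped += 1    # simulate the sensor going silent
            return
        self._forward(msg)
```

Because the windows are fixed in advance, a test can assert exactly when the system under test should enter and leave its degraded mode.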

Architecture:

┌─────────────────────────────────────────────────┐
│              Supervisor Node                     │
│  (Monitors health, implements recovery)          │
└─────────────────────────────────────────────────┘
                      ↓
┌─────────────────────────────────────────────────┐
│              Watchdog Node                       │
│  (Monitors heartbeats, detects failures)         │
└─────────────────────────────────────────────────┘
                      ↓
┌──────────────┬──────────────┬──────────────────┐
│ State Machine│ Sensor Nodes │ Navigation Stack │
│    Node      │  (monitored) │   (monitored)    │
└──────────────┴──────────────┴──────────────────┘

Tasks:

Design complete system architecture
Implement all core functionality
Add all fault tolerance features
Implement production features
Write comprehensive tests
Deploy and run 24-hour test
Document everything

Deliverables:

Complete source code
Launch files and configuration
Test suite
Deployment guide
Operations manual
Performance report from 24-hour test
Video demonstration

Acceptance Criteria:

Self-Assessment Checklist

After completing all advanced exercises, verify you can:

Fault Tolerance

Implement watchdog systems
Build supervisor nodes with recovery
Handle sensor dropouts gracefully
Implement multiple recovery strategies

Production Systems

Design for continuous operation
Manage resources effectively
Implement comprehensive logging
Monitor performance metrics

Testing & Validation

Write tests for fault tolerance
Perform stress testing
Inject failures for testing
Validate recovery mechanisms

Deployment

Deploy production-ready systems
Document operations procedures
Create monitoring dashboards
Plan maintenance strategies

Ready for Real-World Deployment?

If you checked all boxes above, you have the skills to build and deploy production-ready robotic workflows!

Next Steps

Apply to Real Projects: Use these patterns in your own robotic systems
Continue Learning: Explore Chapter 5 for advanced AI integration
Contribute: Share your implementations with the community
Iterate: Continuously improve based on real-world experience

Congratulations! You've completed the Advanced tier and can now build production-ready, fault-tolerant robotic workflows.
