
Resilience and Transparency Features

This document describes the resilience patterns and transparency features implemented in the MacumbaTravel backend to provide a reliable, user-friendly experience.

Circuit Breaker Pattern

The application implements circuit breakers to handle external service failures gracefully and prevent cascading failures.

Implementation

Circuit breakers are configured for each critical external service:

# Initialize circuit breakers in TravelService.__init__ (recovery_timeout is in seconds)
self.flights_circuit_breaker = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    name="flights_service"
)

self.maps_circuit_breaker = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    name="google_maps_service"
)

self.image_circuit_breaker = CircuitBreaker(
    failure_threshold=5,
    recovery_timeout=60,
    name="image_service"
)

self.accommodation_circuit_breaker = CircuitBreaker(
    failure_threshold=3,
    recovery_timeout=30,
    name="accommodation_service"
)

Circuit Breaker States

  1. CLOSED (Normal): All requests pass through to the service
  2. OPEN (Failing): Requests fail fast and fallback data is used instead
  3. HALF_OPEN (Testing): Once the recovery timeout has elapsed, a single trial request tests whether the service has recovered
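The state machine above can be sketched as a small Python class. This is a hypothetical, synchronous illustration, not the project's CircuitBreaker: the call(func, fallback) method and the injectable clock parameter are assumptions made for the example. The breaker opens after failure_threshold consecutive failures, serves the fallback while open, and lets one trial request through once recovery_timeout seconds have passed.

```python
import time
from enum import Enum
from typing import Any, Callable, Optional


class State(Enum):
    CLOSED = "CLOSED"
    OPEN = "OPEN"
    HALF_OPEN = "HALF_OPEN"


class CircuitBreaker:
    """Hypothetical three-state breaker; not the project's actual class."""

    def __init__(self, failure_threshold: int, recovery_timeout: float, name: str,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.name = name
        self._clock = clock  # injectable so tests can control time
        self.state = State.CLOSED
        self.failure_count = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], Any], fallback: Callable[[], Any]) -> Any:
        if self.state is State.OPEN:
            if self._clock() - self.opened_at < self.recovery_timeout:
                return fallback()  # fail fast: the real service is not called
            self.state = State.HALF_OPEN  # timeout elapsed: allow one trial request
        try:
            result = func()
        except Exception:
            self._record_failure()
            return fallback()
        self._record_success()
        return result

    def _record_failure(self) -> None:
        self.failure_count += 1
        # A failed trial request re-opens immediately; otherwise open at the threshold
        if self.state is State.HALF_OPEN or self.failure_count >= self.failure_threshold:
            self.state = State.OPEN
            self.opened_at = self._clock()

    def _record_success(self) -> None:
        # Any success closes the circuit and resets the consecutive-failure count
        self.state = State.CLOSED
        self.failure_count = 0
        self.opened_at = None
```

Injecting the clock keeps the timing logic deterministic in tests; a production version would also need locking (or an asyncio-aware design) so that only one request is admitted while HALF_OPEN.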

Service-Specific Configuration

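| Service         | Failure Threshold | Recovery Timeout | Criticality | Fallback Strategy         |
|-----------------|-------------------|------------------|-------------|---------------------------|
| Flights Service | 3 failures        | 30 seconds       | Critical    | AI-estimated pricing      |
| Google Maps     | 3 failures        | 30 seconds       | Important   | AI transportation options |
| Image Service   | 5 failures        | 60 seconds       | Low         | Default/generic images    |
| Accommodations  | 3 failures        | 30 seconds       | Important   | Budget estimates          |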

Benefits

  • Prevent Cascading Failures: Stop calling failing services immediately
  • Faster Response Times: Fail fast instead of waiting for timeouts
  • Graceful Degradation: Provide fallback data when services are unavailable
  • Automatic Recovery: Test service health and recover when possible

Progressive Enhancement

Enrichment Status Tracking

The system tracks the status of each enrichment component for transparency and progressive loading:

{
  "enrichment_components": {
    "transportation": {"status": "success", "error": null},
    "flights": {"status": "success", "error": null},
    "accommodations": {"status": "fallback", "error": "Circuit breaker open"},
    "seasonal_info": {"status": "success", "error": null},
    "payment_info": {"status": "success", "error": null}
  }
}
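One way to populate enrichment_components is to wrap each enrichment step so that it records its own outcome. The helper below is a hypothetical sketch: run_component and its parameters are assumptions, and any exception is treated as a reason to use fallback data, with the error message stored alongside the status.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


async def run_component(
    name: str,
    fetch: Callable[[], Awaitable[Any]],
    fallback: Callable[[], Any],
    components: Dict[str, Dict[str, Any]],
) -> Any:
    """Run one enrichment step and record its status (hypothetical helper)."""
    try:
        result = await fetch()
        components[name] = {"status": "success", "error": None}
        return result
    except Exception as exc:
        # Degrade to fallback data and keep the reason for transparency
        components[name] = {"status": "fallback", "error": str(exc)}
        return fallback()
```

Running several components with asyncio.gather lets them proceed concurrently while each one reports into the shared dictionary.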

Status Endpoint

The /recommendations/{id}/status endpoint allows clients to poll enrichment progress:

Response States:

  • pending: Basic recommendation available; enrichment still in progress
  • partial: Enrichment finished, but some components failed (and may be using fallback data) while others succeeded
  • completed: All components successfully enriched

Example Response:

{
  "recommendation_id": "rec_123456",
  "status": "completed",
  "has_basic_data": true,
  "has_enriched_data": true,
  "enrichment_components": {
    "transportation": {"status": "success", "error": null},
    "flights": {"status": "success", "error": null},
    "accommodations": {"status": "success", "error": null}
  },
  "message": "Recommendation fully enriched and ready"
}
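The overall status can be derived from the per-component entries. This sketch assumes the set of expected components matches the enrichment_components example earlier in this document, and that a recommendation stays pending until every expected component has reported; both are assumptions, not confirmed behavior of the endpoint.

```python
from typing import Any, Dict, Iterable

# Assumption: the components listed in the enrichment_components example
EXPECTED_COMPONENTS = ("transportation", "flights", "accommodations", "seasonal_info", "payment_info")


def overall_status(components: Dict[str, Dict[str, Any]],
                   expected: Iterable[str] = EXPECTED_COMPONENTS) -> str:
    expected = list(expected)
    if any(name not in components for name in expected):
        return "pending"  # at least one component has not reported yet
    if all(components[name]["status"] == "success" for name in expected):
        return "completed"
    return "partial"  # everything reported, but some components failed or fell back
```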

Data Transparency

Provenance Tracking

Each enriched component records the source its data came from:

{
  "provenance": {
    "destination_data": "ai_generated",
    "transportation": "google_maps_api",
    "flights_data": "amadeus_api",
    "accommodations_data": "booking_api",
    "seasonal_info": "seasonal_service",
    "payment_info": "payment_service"
  }
}

Data Freshness

Timestamps show when each component's data was last fetched:

{
  "data_freshness": {
    "transportation_fetched_at": "2025-01-15T10:30:00Z",
    "flights_fetched_at": "2025-01-15T10:30:15Z",
    "accommodations_fetched_at": "2025-01-15T10:30:30Z",
    "seasonal_data_fetched_at": "2025-01-15T10:30:45Z"
  }
}
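Provenance and freshness can be recorded in the same step, at the moment a component's data is fetched. In this hypothetical helper the caller passes both key names explicitly, because the examples above do not follow one naming pattern (for instance "flights_data" versus "flights_fetched_at"). Timestamps are written in UTC using the ISO 8601 "Z" form shown in the example.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional


def record_fetch(enriched: Dict[str, Any], provenance_key: str, freshness_key: str,
                 source: str, now: Optional[datetime] = None) -> None:
    """Record where a component's data came from and when (hypothetical helper)."""
    # Expects a timezone-aware datetime; defaults to the current UTC time
    fetched = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    enriched.setdefault("provenance", {})[provenance_key] = source
    enriched.setdefault("data_freshness", {})[freshness_key] = fetched.strftime("%Y-%m-%dT%H:%M:%SZ")
```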

Cost Breakdown Transparency

Detailed cost breakdowns help users understand pricing:

{
  "cost_breakdown": {
    "base_cost": 1200,
    "transport_cost": 450,
    "accommodation_estimate": 420,
    "total_per_person": 1800,
    "total_for_travelers": 3600
  }
}
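The totals in a breakdown like this are derived values. The sketch below assumes, purely for illustration, that the per-person total is the sum of the three listed components; the project's actual formula is not shown here and may include other items. What it does show is that the group total scales the per-person figure by the number of travelers (1800 × 2 = 3600 in the example above). CostBreakdown is a hypothetical name.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class CostBreakdown:
    """Hypothetical model of the cost_breakdown payload."""
    base_cost: int
    transport_cost: int
    accommodation_estimate: int
    travelers: int

    def to_dict(self) -> Dict[str, int]:
        # Assumption: the per-person total is the sum of the three components
        total_per_person = self.base_cost + self.transport_cost + self.accommodation_estimate
        return {
            "base_cost": self.base_cost,
            "transport_cost": self.transport_cost,
            "accommodation_estimate": self.accommodation_estimate,
            "total_per_person": total_per_person,
            # The group total multiplies the per-person figure by the traveler count
            "total_for_travelers": total_per_person * self.travelers,
        }
```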

Timeout and Partial Returns

Enrichment Timeout

The enrichment process has a 30-second timeout to prevent hanging requests:

import asyncio

async def enrich_recommendation(self, recommendation, departure_city, budget,
                                max_travel_time, travelers, duration):
    try:
        # Bound the whole enrichment process to 30 seconds
        enriched = await asyncio.wait_for(
            self._enrich_recommendation_internal(
                recommendation, departure_city, budget, max_travel_time, travelers, duration
            ),
            timeout=30.0
        )
        return enriched
    except asyncio.TimeoutError:
        # wait_for has already cancelled the internal task; fall back to a minimal recommendation
        return self._create_minimal_recommendation(recommendation, ...)
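The timeout-and-fallback behavior can be tried in isolation. In this self-contained sketch, slow_enrichment and enrich_with_timeout are hypothetical stand-ins for the real pipeline, and the timeout is a parameter rather than the fixed 30 seconds so the demo runs quickly.

```python
import asyncio
from typing import Any, Dict


async def slow_enrichment(recommendation: Dict[str, Any], delay: float) -> Dict[str, Any]:
    """Stand-in for the real enrichment pipeline (hypothetical)."""
    await asyncio.sleep(delay)
    return {**recommendation, "enriched": True}


async def enrich_with_timeout(recommendation: Dict[str, Any], delay: float,
                              timeout: float) -> Dict[str, Any]:
    try:
        return await asyncio.wait_for(slow_enrichment(recommendation, delay), timeout=timeout)
    except asyncio.TimeoutError:
        # The slow task has been cancelled; return the minimal, unenriched version
        return {**recommendation, "enriched": False}
```

For example, asyncio.run(enrich_with_timeout({"id": "rec_123456"}, delay=1.0, timeout=0.05)) returns the recommendation with "enriched" set to False.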

Fallback Strategies

Each service has defined fallback behavior:

Transportation Fallback

  • Use AI-provided transportation options
  • Fall back to estimated drive times
  • Default to 2-hour drive time if no data available
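The transportation fallback order listed above can be expressed as a short chain. Only the 2-hour default comes from this document; the function name and the "mode", "duration_hours", and "source" fields are hypothetical.

```python
from typing import Any, Dict, List, Optional

DEFAULT_DRIVE_HOURS = 2.0  # documented default when no data is available


def transportation_fallback(ai_options: Optional[List[Dict[str, Any]]],
                            estimated_drive_hours: Optional[float]) -> List[Dict[str, Any]]:
    # 1. Prefer the AI-provided options when there are any
    if ai_options:
        return ai_options
    # 2. Otherwise use an estimated drive time, 3. or the 2-hour default
    hours = estimated_drive_hours if estimated_drive_hours is not None else DEFAULT_DRIVE_HOURS
    return [{"mode": "car", "duration_hours": hours, "source": "estimate"}]
```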

Flight Fallback

  • Use distance-based pricing estimates
  • Apply standard aviation pricing models
  • Include check-in time estimates
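A distance-based flight estimate needs the distance between the two airports, which the haversine formula gives from latitude and longitude. Every pricing and timing constant below (base fare, per-kilometre rate, cruise speed, check-in time) is a hypothetical placeholder, not the project's model.

```python
import math
from typing import Dict

EARTH_RADIUS_KM = 6371.0
# Hypothetical constants, for illustration only
BASE_FARE = 50.0
FARE_PER_KM = 0.10
CRUISE_KMH = 800.0
CHECK_IN_HOURS = 2.0


def great_circle_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Haversine distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def estimate_flight(lat1: float, lon1: float, lat2: float, lon2: float) -> Dict[str, float]:
    km = great_circle_km(lat1, lon1, lat2, lon2)
    flight_hours = km / CRUISE_KMH
    return {
        "distance_km": round(km, 1),
        "estimated_price": round(BASE_FARE + FARE_PER_KM * km, 2),
        "flight_hours": round(flight_hours, 2),
        # Door-to-gate time includes the check-in allowance
        "total_hours_with_check_in": round(flight_hours + CHECK_IN_HOURS, 2),
    }
```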

Accommodation Fallback

  • Provide budget-based estimates
  • Use regional pricing averages
  • Include different accommodation tiers

Image Fallback

  • Use cached destination images
  • Fall back to generic travel images
  • Avoid duplicate images across destinations
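Avoiding duplicates amounts to tracking which image URLs have already been used within one response. This hypothetical sketch tries a destination's cached images first, then generic ones, skipping anything already used. It returns None when both pools are exhausted, which is one possible policy among several.

```python
from typing import Dict, List, Optional, Set


def pick_image(destination: str, cached: Dict[str, List[str]],
               generic: List[str], used: Set[str]) -> Optional[str]:
    # Cached destination images take priority over generic travel images
    for url in cached.get(destination, []) + generic:
        if url not in used:
            used.add(url)  # remember it so later destinations get a different image
            return url
    return None
```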

Caching Headers

ETag Implementation

Static endpoints include ETags for efficient caching:

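import hashlib

# Create ETag from content hash (MD5 serves as a content fingerprint here, not a security control)
content_hash = hashlib.md5(str(recommendations).encode()).hexdigest()
etag = f'"{content_hash}"'
cache_headers = {"ETag": etag, "Cache-Control": "public, max-age=3600"}

# Check if client has cached version
if request.headers.get("If-None-Match") == etag:
    # A 304 should still carry the ETag and Cache-Control headers
    return Response(status_code=304, headers=cache_headers)  # Not Modified

# Set caching headers on the full response
for name, value in cache_headers.items():
    response.headers[name] = value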
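The exact string comparison works for the common case, but clients may send a list of ETags, a weak validator (W/"..."), or "*". The framework-independent helpers below are a sketch of a more tolerant check; make_etag and if_none_match_matches are hypothetical names. Serializing with json.dumps(sort_keys=True) keeps the hash stable when dictionary key order varies, which str() does not guarantee. Splitting the header on commas is a simplification that is safe for hex-digest ETags.

```python
import hashlib
import json
from typing import Any, Optional


def make_etag(payload: Any) -> str:
    """Strong ETag from a canonical JSON serialization of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"), default=str)
    return '"' + hashlib.md5(body.encode("utf-8")).hexdigest() + '"'


def _opaque(tag: str) -> str:
    # Weak comparison ignores the W/ prefix
    tag = tag.strip()
    return tag[2:] if tag.startswith("W/") else tag


def if_none_match_matches(header: Optional[str], etag: str) -> bool:
    """True when If-None-Match names this ETag (or '*'), so a 304 can be sent."""
    if not header:
        return False
    if header.strip() == "*":
        return True
    return any(_opaque(t) == _opaque(etag) for t in header.split(","))
```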

Cache-Control Headers

Different endpoints have appropriate caching policies:

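| Endpoint               | Cache Duration | Cache-Control Value  |
|------------------------|----------------|----------------------|
| Random Recommendations | 1 hour         | public, max-age=3600 |
| Rate Limit Status      | 1 minute       | public, max-age=60   |
| User Profile           | 5 minutes      | private, max-age=300 |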

Security Enhancements

Token Hashing

Email verification and password reset tokens are securely hashed before storage:

# pwd_context is a module-level password-hashing context (for example passlib's CryptContext)
@staticmethod
def hash_token(token: str) -> str:
    """Hash a token for secure storage"""
    return pwd_context.hash(token)

def verify_verification_token(self, plain_token: str) -> bool:
    """Verify a verification token against the stored hash"""
    if not self.verification_token:
        return False
    return pwd_context.verify(plain_token, self.verification_token)

Benefits

  • Secure Storage: Tokens cannot be reversed if database is compromised
  • Constant-Time Verification: Prevents timing attacks
  • Proper Lifecycle: Tokens are invalidated after use
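The full token lifecycle (issue, store only the hash, verify, invalidate) can be sketched with the standard library alone. This sketch swaps in SHA-256 with hmac.compare_digest instead of the project's pwd_context: a fast hash is generally considered acceptable for long random tokens, unlike for user-chosen passwords. The User class and its method names are hypothetical.

```python
import hashlib
import hmac
import secrets
from dataclasses import dataclass
from typing import Optional


def hash_token(token: str) -> str:
    # SHA-256 stands in for pwd_context.hash in this sketch
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


@dataclass
class User:
    """Hypothetical stand-in for the user model."""
    verification_token: Optional[str] = None

    def issue_verification_token(self) -> str:
        plain = secrets.token_urlsafe(32)  # sent to the user, never stored
        self.verification_token = hash_token(plain)  # only the hash is persisted
        return plain

    def consume_verification_token(self, plain_token: str) -> bool:
        if not self.verification_token:
            return False
        # Constant-time comparison of the two hex digests
        ok = hmac.compare_digest(hash_token(plain_token), self.verification_token)
        if ok:
            self.verification_token = None  # single use: invalidate after success
        return ok
```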

Monitoring and Observability

Circuit Breaker Metrics

Circuit breakers expose metrics for monitoring:

  • Circuit breaker state (CLOSED/OPEN/HALF_OPEN)
  • Failure count per service
  • Recovery attempts
  • Success/failure rates

Component Status Tracking

Enrichment components provide detailed status information:

  • Success/failure rates per component
  • Error categorization (timeout, service down, invalid data)
  • Performance metrics (response times, cache hit rates)
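Per-component success rates, error categories, and response times can be collected with a small in-process recorder before being exported to a monitoring system. EnrichmentMetrics and ComponentStats are hypothetical names; the error categories would be labels such as "timeout", "service_down", or "invalid_data".

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class ComponentStats:
    successes: int = 0
    failures: int = 0
    errors_by_category: Dict[str, int] = field(default_factory=dict)
    latencies_ms: List[float] = field(default_factory=list)

    def success_rate(self) -> float:
        total = self.successes + self.failures
        return self.successes / total if total else 0.0


class EnrichmentMetrics:
    """Hypothetical collector; a real deployment would export to a metrics backend."""

    def __init__(self) -> None:
        self.components: Dict[str, ComponentStats] = defaultdict(ComponentStats)

    def record(self, component: str, latency_ms: float,
               error_category: Optional[str] = None) -> None:
        stats = self.components[component]
        stats.latencies_ms.append(latency_ms)
        if error_category is None:
            stats.successes += 1
        else:
            stats.failures += 1
            stats.errors_by_category[error_category] = stats.errors_by_category.get(error_category, 0) + 1

    def snapshot(self) -> Dict[str, Dict[str, Any]]:
        report = {}
        for name, stats in self.components.items():
            latencies = stats.latencies_ms
            report[name] = {
                "success_rate": stats.success_rate(),
                "avg_latency_ms": sum(latencies) / len(latencies) if latencies else 0.0,
                "errors_by_category": dict(stats.errors_by_category),
            }
        return report
```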

Best Practices

When to Use Circuit Breakers

  1. External Service Calls: Any call to services outside your control
  2. Expensive Operations: Operations that consume significant resources
  3. Non-Critical Features: Features that can gracefully degrade

Fallback Strategy Guidelines

  1. Data Quality: Fallback data should be reasonably accurate
  2. User Communication: Clearly indicate when fallback data is used
  3. Gradual Degradation: Prefer partial functionality over complete failure
  4. Recovery Testing: Regularly test service recovery scenarios

Progressive Enhancement Tips

  1. Status Endpoints: Provide real-time status for long-running operations
  2. Granular Tracking: Track individual components, not just overall status
  3. Client-Friendly: Design status responses for easy frontend consumption
  4. Error Context: Include enough error information for debugging

Impact on User Experience

These resilience patterns improve the user experience in several ways:

  1. Faster Perceived Performance: Progressive loading reduces wait times
  2. Higher Reliability: Circuit breakers and fallbacks keep one failing service from breaking the whole response
  3. Greater Transparency: Users understand data sources and freshness
  4. Better Trust: Cost breakdowns and provenance build user confidence
  5. Improved Availability: Graceful degradation keeps core functionality available

The combination of these patterns creates a robust, transparent, and user-friendly travel planning system that maintains functionality even when external services are experiencing issues.