Unkey - Permission Interruption – Incident details


Permission Interruption

Resolved
Major outage
Started 2 days ago
Lasted about 2 hours

Affected

App

Major outage from 4:30 PM to 4:42 PM, Operational from 4:42 PM to 6:32 PM

API

Partial outage from 4:30 PM to 4:42 PM, Operational from 4:42 PM to 6:32 PM

Global Endpoint

Partial outage from 4:30 PM to 4:42 PM, Operational from 4:42 PM to 6:32 PM

AWS us-east-2

Partial outage from 4:30 PM to 4:42 PM, Operational from 4:42 PM to 6:32 PM

AWS eu-central-1

Partial outage from 4:30 PM to 4:42 PM, Operational from 4:42 PM to 6:32 PM

AWS us-east-1

Partial outage from 4:30 PM to 4:42 PM, Operational from 4:42 PM to 6:32 PM

Updates
  • Postmortem

    Incident Summary: A recent deployment introduced a logic error in the permission deletion query, leading to a 20-minute outage affecting Unkey’s API and dashboard.

    The following endpoints were affected during this time:

    /v1/apis.listKeys
    /v1/ratelimits.limit
    /v1/analytics.getVerifications
    /v1/permissions.getRole
    /v1/keys.whoami
    /v2/ratelimit.limit


    Timeline:

    • 11:31 EDT: New deployment initiated.

    • 12:30 EDT: Engineering and customers reported problems with ratelimiting and were unable to log in to the dashboard.

    • 12:40 EDT: The dashboard was rolled back.

    • 12:42 EDT: Root cause was identified.

    • 12:58 EDT: API patch was rolled out.

    • 14:00 EDT: Services restored.

    • 14:32 EDT: Permissions were restored to the DB.

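    Root Cause: The WHERE clause in the keysPermissions deletion logic combined its two conditions with JavaScript's logical && operator instead of the Drizzle ORM's and() helper. JavaScript evaluates && before the query is built, and the object returned by eq() is always truthy, so the expression evaluates to the inArray() condition alone. The keyId filter never reached the SQL query.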

    Original (Faulty) Code:

    .where(
      eq(schema.keysPermissions.keyId, input.keyId) &&
        inArray(schema.keysPermissions.permissionId, permissionIdsToRemove),
    );


    As a result, the listed permissions were deleted across all root keys instead of only from the key being updated.
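
    The effect of && on two condition objects can be reproduced without a database. The sketch below is illustrative only: Condition, eq, inArray, and and are simplified stand-ins written for this example (they are not Drizzle's implementations), and the column names and IDs are made up.

```typescript
// Stand-in condition type; a real ORM's condition objects are richer than this.
type Condition = { sql: string };

// Hypothetical helpers that only mimic the shape of eq(), inArray() and and().
const eq = (column: string, value: string): Condition => ({
  sql: `${column} = '${value}'`,
});
const inArray = (column: string, values: string[]): Condition => ({
  sql: `${column} IN (${values.map((v) => `'${v}'`).join(", ")})`,
});
const and = (...conditions: Condition[]): Condition => ({
  sql: conditions.map((c) => `(${c.sql})`).join(" AND "),
});

const keyFilter = eq("key_id", "key_123");
const permissionFilter = inArray("permission_id", ["perm_a", "perm_b"]);

// `&&` is evaluated by JavaScript before any query builder sees it. Because
// keyFilter is an object (always truthy), the expression evaluates to the
// right-hand operand and the key filter is silently dropped.
const faulty = keyFilter && permissionFilter;

// and() receives both conditions as arguments and can combine them.
const corrected = and(keyFilter, permissionFilter);

console.log(faulty.sql);
console.log(corrected.sql);
```

    In this sketch, faulty is the very same object as permissionFilter, which is why the type checker raises no complaint: both sides of && have the same type.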

    Resolution: The faulty WHERE clause was corrected to use the and() helper, ensuring proper logical conjunction for the SQL query.

    Corrected Code:

    .where(
      and(
        eq(schema.keysPermissions.keyId, input.keyId),
        inArray(
          schema.keysPermissions.permissionId,
          permissionIdsToRemove
        )
      )
    );


    Impact:

    • Dashboard: Users experienced errors when attempting to log in to the dashboard. 

    • Customers: Some customers experienced "permission denied" errors for their API requests, leading to service disruption.

    • Data Integrity: Permissions attached to root keys were deleted during the incident, but none were permanently lost. All affected permissions were successfully restored.

    Lessons Learned:

    1. ORM Usage: Emphasize strict adherence to ORM-specific methods for query construction, particularly for logical operators.

    2. Testing: Implement more comprehensive integration tests targeting permission management flows, including various combinations of permission additions and removals.

    3. Deployment Process: Enhance pre-deployment sanity checks to catch fundamental logical errors in critical paths.

    4. V2 SDK: We also noticed that our SDK did not handle this error by following its fallback procedure, so users could not gracefully deny or allow requests when ratelimiting failed.
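
    As one possible shape for the testing lesson above, a regression check can assert that removing permissions from one key leaves every other key untouched. This is a minimal sketch under stated assumptions: an in-memory array stands in for the keysPermissions table, and removePermissions, KeyPermission, and the sample IDs are hypothetical names invented for the example. A real integration test would run against the database.

```typescript
type KeyPermission = { keyId: string; permissionId: string };

// Hypothetical stand-in for the deletion: removes only rows that match BOTH
// the given key and one of the listed permissions.
function removePermissions(
  rows: KeyPermission[],
  keyId: string,
  permissionIdsToRemove: string[],
): KeyPermission[] {
  return rows.filter(
    (row) =>
      !(row.keyId === keyId && permissionIdsToRemove.includes(row.permissionId)),
  );
}

const before: KeyPermission[] = [
  { keyId: "key_a", permissionId: "perm_1" },
  { keyId: "key_a", permissionId: "perm_2" },
  { keyId: "key_b", permissionId: "perm_1" },
];

const after = removePermissions(before, "key_a", ["perm_1"]);

// Regression check: rows that belong to other keys must be unchanged, even
// when they share a permission ID with the key being updated.
const otherKeysBefore = before.filter((row) => row.keyId !== "key_a");
const otherKeysAfter = after.filter((row) => row.keyId !== "key_a");
if (JSON.stringify(otherKeysBefore) !== JSON.stringify(otherKeysAfter)) {
  throw new Error("deletion leaked to other keys");
}
console.log(after);
```

    The fixture deliberately gives key_b the same permission ID that is removed from key_a, since that overlap is exactly the case an unscoped delete would get wrong.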

    Action Items:

    1. Review ORM Best Practices: Conduct a team-wide review of Drizzle ORM best practices, focusing on query construction.

    2. Require two code owner approvals on PRs: Currently we only request a single code owner to approve our PRs. Going forward, we will require two.

    3. Update our runbook: Document and drill enhanced rollback procedures for critical services.

    4. Update our SDKs: Work with our customers to figure out what the best solution is for this type of error, and allow them to handle it in their own application.
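
    To make the SDK item more concrete, one option an application could use today is a thin wrapper that picks a fail-open or fail-closed decision in advance. The sketch below is an assumption-laden illustration, not the Unkey SDK's API: limitWithFallback, LimitResult, and failingLimit are hypothetical names, and the failing call merely simulates an error response.

```typescript
type LimitResult = { success: boolean };

// Hypothetical wrapper: `limit` is any function that calls a ratelimit API.
// If the call throws, the caller-supplied fallback decision is returned.
async function limitWithFallback(
  limit: () => Promise<LimitResult>,
  fallback: LimitResult,
): Promise<LimitResult> {
  try {
    return await limit();
  } catch (_err) {
    // The application decides ahead of time whether to allow or deny here.
    return fallback;
  }
}

// Simulated failing call, standing in for an error such as "permission denied".
const failingLimit = async (): Promise<LimitResult> => {
  throw new Error("permission denied");
};

// Fail open: let the request through when the ratelimit check itself fails.
limitWithFallback(failingLimit, { success: true }).then((result) => {
  console.log(result.success);
});
```

    Whether to fail open or closed depends on the endpoint: allowing traffic favors availability, while denying it favors protection of the resource behind the limit.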

    Next Steps: We will prioritize the above action items to prevent similar incidents. We apologize for the disruption and appreciate our customers' patience and understanding.


  • Resolved
    This incident has been resolved.
  • Monitoring

    API Patch was rolled out.

  • Identified

    Root cause was identified

  • Investigating
    We are currently investigating this incident.