Incident Summary: A recent deployment introduced a logical error, leading to a 20-minute outage affecting Unkey’s API and Unkey’s Dashboard.
The following endpoints were affected during this time:
/v1/apis.listKeys
/v1/ratelimits.limit
/v1/analytics.getVerifications
/v1/permissions.getRole
/v1/keys.whoami
/v2/ratelimit.limit
Timeline:
11:31 EDT: New deployment initiated.
12:30 EDT: Engineering and customers reported issues with ratelimiting and with logging in to the dashboard.
12:40 EDT: Dashboard rolled back.
12:42 EDT: Root cause identified.
12:58 EDT: API patch rolled out.
14:00 EDT: Services restored.
14:32 EDT: Permissions restored to the DB.
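Root Cause: The WHERE clause in the keysPermissions deletion logic combined its two conditions with JavaScript's logical && operator instead of Drizzle ORM's and() helper. Because eq() returns an object, which is always truthy, the && expression evaluates to its right-hand operand. Only the inArray() condition reached the query, and the keyId filter was silently dropped.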
Original (Faulty) Code:
    .where(
      eq(schema.keysPermissions.keyId, input.keyId) &&
        inArray(schema.keysPermissions.permissionId, permissionIdsToRemove),
    );
As a result, the matching permissions were deleted across all root keys instead of only from the key being updated.
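The behavior of && on objects can be reproduced without a database. The sketch below uses simplified, hypothetical stand-ins for eq(), inArray(), and and() that only build SQL-like strings; they are not Drizzle's real implementations, and the column names and values are made up for illustration.

```typescript
// Hypothetical stand-in for a query-builder condition object (not Drizzle's real type).
type Condition = { sql: string };

// Simplified helpers that only render SQL-like text, for illustration.
const eq = (column: string, value: string): Condition => ({
  sql: `${column} = '${value}'`,
});
const inArray = (column: string, values: string[]): Condition => ({
  sql: `${column} IN (${values.map((v) => `'${v}'`).join(", ")})`,
});
const and = (...conditions: Condition[]): Condition => ({
  sql: conditions.map((c) => `(${c.sql})`).join(" AND "),
});

// Objects are always truthy, so `a && b` evaluates to `b` and `a` is discarded.
const faulty = eq("key_id", "key_1") && inArray("permission_id", ["p1", "p2"]);

// and() receives both conditions and combines them explicitly.
const fixed = and(eq("key_id", "key_1"), inArray("permission_id", ["p1", "p2"]));

console.log(faulty.sql); // permission_id IN ('p1', 'p2')
console.log(fixed.sql); // (key_id = 'key_1') AND (permission_id IN ('p1', 'p2'))
```

The faulty expression type-checks because both operands have the same type, which is one reason this class of mistake can pass review and compilation unnoticed.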
Resolution: The faulty WHERE clause was corrected to use the and() helper, ensuring proper logical conjunction for the SQL query.
Corrected Code:
    .where(
      and(
        eq(schema.keysPermissions.keyId, input.keyId),
        inArray(
          schema.keysPermissions.permissionId,
          permissionIdsToRemove
        )
      )
    );
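The intended behavior of the corrected clause can be stated as a small check: removing permissions from one key must leave every other key's rows untouched. The sketch below models the table as an in-memory array. The Row type, the removePermissions function, and the sample IDs are hypothetical and stand in for the real query and schema.

```typescript
// Hypothetical in-memory model of a keysPermissions row.
type Row = { keyId: string; permissionId: string };

// Stand-in for the corrected delete: a row is removed only when BOTH
// the key matches AND the permission is in the removal list.
function removePermissions(
  rows: Row[],
  keyId: string,
  permissionIds: string[],
): Row[] {
  return rows.filter(
    (row) => !(row.keyId === keyId && permissionIds.includes(row.permissionId)),
  );
}

const rowsBefore: Row[] = [
  { keyId: "key_a", permissionId: "api.read" },
  { keyId: "key_a", permissionId: "api.write" },
  { keyId: "key_b", permissionId: "api.read" },
];

// Remove "api.read" from key_a only.
const rowsAfter = removePermissions(rowsBefore, "key_a", ["api.read"]);

// key_b must still hold "api.read" after the removal.
const otherKeyUntouched = rowsAfter.some(
  (row) => row.keyId === "key_b" && row.permissionId === "api.read",
);
console.log(rowsAfter.length, otherKeyUntouched); // 2 true
```

A real integration test would run the same assertion against a test database with at least two keys sharing a permission, since that is the case the faulty clause got wrong.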
Impact:
Dashboard: Users experienced errors when attempting to log in to the dashboard.
Customers: Some customers experienced "permission denied" errors for their API requests, leading to service disruption.
Data Integrity: Although root key permissions were deleted during the incident, none were permanently lost. All affected permissions were successfully restored.
Lessons Learned:
ORM Usage: Emphasize strict adherence to ORM-specific methods for query construction, particularly for logical operators.
Testing: Implement more comprehensive integration tests targeting permission management flows, including various combinations of permission additions and removals.
Deployment Process: Enhance pre-deployment sanity checks to catch fundamental logical errors in critical paths.
V2 SDK: We also noticed that our SDK did not handle this error by following fallback procedures, so users could not gracefully deny or allow requests when ratelimiting failed.
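One possible shape for the fallback behavior described in the V2 SDK lesson is a wrapper that lets the caller decide what happens when the ratelimit call fails. This is a minimal sketch under stated assumptions: limitWithFallback, LimitResult, and failingLimit are hypothetical names and are not part of the Unkey SDK.

```typescript
// Hypothetical result shape; a real SDK response would carry more fields.
type LimitResult = { success: boolean };

// Hypothetical wrapper, not an Unkey SDK API: runs the limit call and,
// if it throws, returns whatever the caller-supplied fallback decides.
async function limitWithFallback(
  limit: () => Promise<LimitResult>,
  fallback: (err: unknown) => LimitResult,
): Promise<LimitResult> {
  try {
    return await limit();
  } catch (err) {
    // The caller chooses to fail open (allow) or fail closed (deny).
    return fallback(err);
  }
}

// Simulated outage: the ratelimit call always rejects.
const failingLimit = async (): Promise<LimitResult> => {
  throw new Error("ratelimit service unavailable");
};

// This caller chooses to fail open and let the request through.
limitWithFallback(failingLimit, () => ({ success: true })).then((result) => {
  console.log(result.success); // true
});
```

Whether to fail open or fail closed depends on the application: failing open preserves availability during an outage, while failing closed preserves the protection the ratelimit was meant to provide.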
Action Items:
Review ORM Best Practices: Conduct a team-wide review of Drizzle ORM best practices, focusing on query construction.
Require two code owner approvals on PRs: Currently we only require a single code owner to approve each PR. Going forward, we will require two.
Update our runbook: Document and drill enhanced rollback procedures for critical services.
Update our SDKs: Work with our customers to determine the best way to surface this type of error so that they can handle it in their own applications.
Next Steps: We will prioritize the above action items to prevent similar incidents. We apologize for the disruption and appreciate our customers' patience and understanding.