
Quotas for the bedrock-runtime endpoint

The bedrock-runtime.region.amazonaws.com endpoint is the primary inference endpoint for Amazon Bedrock. Inference traffic to this endpoint is governed by per-model token-based quotas. You can view these quotas in the Service Quotas console by selecting Amazon Bedrock as the service, or in the Amazon Bedrock service quotas table in the AWS General Reference.
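The same quotas can also be read programmatically. The following is a minimal sketch, assuming the boto3 Service Quotas client and assuming that "bedrock" is the service code for Amazon Bedrock. The helper names are hypothetical, and the filter simply matches the quota names used in this section.

```python
def token_quotas(quotas):
    """Keep only quotas whose name mentions a tokens-per-minute or tokens-per-day limit."""
    return [q for q in quotas if "tokens per" in q["QuotaName"].lower()]


def fetch_bedrock_quotas(region_name):
    """Fetch all Amazon Bedrock quotas for one Region (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so token_quotas() can be used without boto3 installed

    client = boto3.client("service-quotas", region_name=region_name)
    paginator = client.get_paginator("list_service_quotas")
    quotas = []
    # "bedrock" as the ServiceCode is an assumption; confirm it in the Service Quotas console.
    for page in paginator.paginate(ServiceCode="bedrock"):
        quotas.extend(page["Quotas"])
    return quotas
```

Calling `token_quotas(fetch_bedrock_quotas("us-east-1"))` would return the token-based entries, each with fields such as `QuotaName`, `Value`, and `Adjustable`.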

Quota types

Inference on the bedrock-runtime endpoint is governed by the following per-model quotas:

bedrock-runtime per-model quotas

| Quota | Scope | Description |
| --- | --- | --- |
| Cross-Region InvokeModel tokens per minute for ${model} | Per model, per Region | The maximum number of tokens per minute (input + output, combined) that your account can use for the model when invoked through a cross-Region inference profile. |
| On-demand InvokeModel tokens per minute for ${model} | Per model, per Region | The maximum number of tokens per minute (input + output, combined) that your account can use for the model when invoked on-demand in a single Region. |
| Model invocation max tokens per day for ${model} | Per model, per Region | The maximum number of tokens per day (input + output, combined) that your account can use for the model. By default, this value is the per-minute quota multiplied by 24 × 60. New AWS accounts might receive reduced quotas. |
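
As a worked example of the default daily quota, the following sketch multiplies a per-minute quota by the number of minutes in a day (24 × 60 = 1,440). The 200,000 tokens-per-minute figure is a made-up value for illustration, not a published quota.

```python
MINUTES_PER_DAY = 24 * 60  # 1,440 minutes in a day


def default_daily_token_quota(tokens_per_minute: int) -> int:
    """Default tokens-per-day quota: the per-minute quota multiplied by 24 x 60."""
    return tokens_per_minute * MINUTES_PER_DAY


# Hypothetical per-minute quota, chosen only for illustration.
example_tpm = 200_000
print(default_daily_token_quota(example_tpm))  # 288000000
```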

The bedrock-runtime endpoint TPM quotas count input and output tokens together against a single per-model quota. The bedrock-mantle endpoint applies separate input-tokens-per-minute and output-tokens-per-minute quotas; for details, see Quotas for the bedrock-mantle endpoint.

Note

Amazon Bedrock no longer enforces requests-per-minute (RPM) quotas on the bedrock-runtime endpoint. Throttling is governed by the token-based quotas described in this section.

Output tokens are converted into quota usage through a model-specific burndown rate. For details on how token-based quotas are calculated and how the max_tokens request parameter affects deductions, see How tokens are counted in Amazon Bedrock.
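
The sketch below shows one way a burndown rate could turn a single request into quota usage. The formula (input tokens plus output tokens multiplied by the rate) and the rate of 5 are assumptions for illustration only. The actual rate for each model, and the role of max_tokens in the deduction, are defined in How tokens are counted in Amazon Bedrock.

```python
def estimated_quota_usage(input_tokens: int, output_tokens: int,
                          burndown_rate: float = 1.0) -> float:
    """Estimate quota consumed by one request (assumed formula, not an official one)."""
    if input_tokens < 0 or output_tokens < 0:
        raise ValueError("token counts must be non-negative")
    if burndown_rate <= 0:
        raise ValueError("burndown_rate must be positive")
    # Output tokens are weighted by the model-specific burndown rate.
    return input_tokens + output_tokens * burndown_rate


# Hypothetical request: 1,000 input tokens, 200 output tokens, assumed rate of 5.
print(estimated_quota_usage(1_000, 200, burndown_rate=5.0))  # 2000.0
```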

Related runtime quotas

The following Amazon Bedrock capabilities are served through the bedrock-runtime endpoint and have their own separate quotas:

These quotas apply only to the bedrock-runtime endpoint and are not exposed on the bedrock-mantle endpoint.

Requesting a quota increase

The steps for requesting a quota increase for your account depend on the value in the Adjustable column in the quotas table in Amazon Bedrock service quotas.

Important

Before requesting a quota increase, verify that the model is not in a Legacy or Deprecated lifecycle status. Quota increases are not granted for models that are scheduled for retirement. Check the model's lifecycle status on the Model lifecycle page and consider migrating to the successor model instead.

  • If a quota is marked as Yes in the Adjustable column, you can adjust it by following the steps at Requesting a Quota Increase in the Service Quotas User Guide.

  • For any model, you can request an increase for the following quotas together:

    • Cross-Region InvokeModel tokens per minute for ${model}

    • On-demand InvokeModel tokens per minute for ${model}

    • Model invocation max tokens per day for ${model}

    To request an increase for any combination of these quotas, request an increase for the Cross-Region InvokeModel tokens per minute for ${model} quota by following the steps at Requesting a Quota Increase in the Service Quotas User Guide. After you do so, the support team will reach out and offer you the option of also increasing the other two quotas.

    Note

    Due to overwhelming demand, priority will be given to customers who generate traffic that consumes their existing quota allocation. Your request might be denied if you don't meet this condition.
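
For adjustable quotas, an increase request can also be submitted through the Service Quotas API rather than the console. The following is a minimal sketch, assuming the boto3 Service Quotas client and assuming that "bedrock" is the service code. The helper names and the quota code "L-XXXXXXXX" are hypothetical placeholders; look up the real quota code in the Service Quotas console.

```python
def build_increase_request(quota_code: str, desired_value: float) -> dict:
    """Build the parameters for a Service Quotas increase request."""
    if desired_value <= 0:
        raise ValueError("desired_value must be positive")
    return {
        "ServiceCode": "bedrock",  # assumed service code for Amazon Bedrock
        "QuotaCode": quota_code,
        "DesiredValue": float(desired_value),
    }


def submit_increase_request(params: dict, region_name: str) -> str:
    """Submit the request and return its status (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_increase_request() works without boto3

    client = boto3.client("service-quotas", region_name=region_name)
    response = client.request_service_quota_increase(**params)
    return response["RequestedQuota"]["Status"]


# Placeholder quota code; replace with the code of the cross-Region TPM quota for your model.
params = build_increase_request("L-XXXXXXXX", 400_000)
print(params)
```

Only the parameter dictionary is built here. Passing it to `submit_increase_request(params, "us-east-1")` would file the request against your account.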

For bedrock-mantle quota increases, see Requesting a quota increase in Quotas for the bedrock-mantle endpoint.