Quotas for the bedrock-runtime endpoint

The bedrock-runtime.region.amazonaws.com endpoint is the primary inference endpoint for Amazon Bedrock. Inference traffic to this endpoint is governed by per-model token-based quotas. You can view these quotas in the Service Quotas console by selecting Amazon Bedrock as the service, or in the Amazon Bedrock service quotas table in the AWS General Reference.

Quota types

Inference on the bedrock-runtime endpoint is governed by the following per-model quotas:

bedrock-runtime per-model quotas
Quota	Scope	Description
Cross-Region InvokeModel tokens per minute for `${model}`	Per model, per Region	The maximum number of tokens per minute (input + output, combined) that your account can use for the model when invoked through a cross-Region inference profile.
On-demand InvokeModel tokens per minute for `${model}`	Per model, per Region	The maximum number of tokens per minute (input + output, combined) that your account can use for the model when invoked on-demand in a single Region.
Model invocation max tokens per day for `${model}`	Per model, per Region	The maximum number of tokens per day (input + output, combined) that your account can use for the model. By default, this value is the per-minute quota multiplied by 24 × 60. New AWS accounts might receive reduced quotas.
InvokeModel requests per minute for `${model}`	Per model, per Region	The maximum number of inference requests per minute that your account can submit for the model. RPM is enforced for some models on the `bedrock-runtime` endpoint and not others; see the Service Quotas console for the exact quotas that apply to your model.

The bedrock-runtime endpoint TPM quotas count input and output tokens together against a single per-model quota. The bedrock-mantle endpoint applies separate input-tokens-per-minute and output-tokens-per-minute quotas; for details, see Quotas for the bedrock-mantle endpoint.

Note

RPM quotas on the bedrock-runtime endpoint are model-specific. Some models – for example, Anthropic Claude Opus 4.7 and Claude Opus 4.8 – do not have an RPM quota and are governed solely by the token-based quotas described in this section. For models that do have an RPM quota, view the exact value in the Service Quotas console.

Output tokens are converted into quota usage through a model-specific burndown rate. For details on how token-based quotas are calculated and how the max_tokens request parameter affects deductions, see How tokens are counted in Amazon Bedrock.

Related runtime quotas

The following Amazon Bedrock capabilities are served through the bedrock-runtime endpoint and have their own separate quotas:

Custom inference profiles – Application-defined inference profiles that wrap an underlying model. See Set up a model invocation resource using inference profiles.
Batch inference – Asynchronous, large-volume inference jobs. See Process multiple prompts with batch inference.
Provisioned Throughput – Reserved model-unit capacity. See Increase model invocation capacity with Provisioned Throughput in Amazon Bedrock.

These quotas apply only to the bedrock-runtime endpoint and are not exposed on the bedrock-mantle endpoint.

Requesting a quota increase

The steps for requesting a quota increase for your account depend on the value in the Adjustable column in the quotas table in Amazon Bedrock service quotas.

Important

Before requesting a quota increase, verify that the model is not in a Legacy or Deprecated lifecycle status. Quota increases are not granted for models that are scheduled for retirement. Check the model's lifecycle status on the Model lifecycle page and consider migrating to the successor model instead.

If a quota is marked as Yes, you can adjust it by following the steps at Requesting a Quota Increase in the Service Quotas User Guide.
For any model, you can request an increase for the following quotas together:
- Cross-Region InvokeModel tokens per minute for ${model}
- On-demand InvokeModel tokens per minute for ${model}
- Model invocation max tokens per day for ${model}
To request an increase for any combination of these quotas, request an increase for the Cross-Region InvokeModel tokens per minute for ${model} quota by following the steps at Requesting a Quota Increase in the Service Quotas User Guide. After you do so, the support team will reach out and offer you the option of also increasing the other two quotas.

Note
Due to overwhelming demand, priority will be given to customers who generate traffic that consumes their existing quota allocation. Your request might be denied if you don't meet this condition.

For bedrock-mantle quota increases, see Requesting a quota increase.

Javascript is disabled or is unavailable in your browser.

To use the Amazon Web Services Documentation, Javascript must be enabled. Please refer to your browser's Help pages for instructions.

Document Conventions

CountTokens API

bedrock-mantle quotas