Connecting to Data Catalog from a standalone Spark application - AWS Glue
Services or capabilities described in AWS documentation might vary by Region. To see the differences applicable to the AWS European Sovereign Cloud Region, see the AWS European Sovereign Cloud User Guide.

Connecting to Data Catalog from a standalone Spark application

You can connect to the Data Catalog from a stand application using an Apache Iceberg connector.

  1. Create an IAM role for Spark application.

  2. Connect to AWS Glue Iceberg Rest endpoint using Iceberg connector.

    # configure your application. Refer to https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html for best practices on configuring environment variables. export AWS_ACCESS_KEY_ID=$(aws configure get appUser.aws_access_key_id) export AWS_SECRET_ACCESS_KEY=$(aws configure get appUser.aws_secret_access_key) export AWS_SESSION_TOKEN=$(aws configure get appUser.aws_secret_token) export AWS_REGION=us-east-1 export REGION=us-east-1 export AWS_ACCOUNT_ID = {specify your aws account id here} ~/spark-3.5.3-bin-hadoop3/bin/spark-shell \ --packages org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.6.0 \ --conf "spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions" \ --conf "spark.sql.defaultCatalog=spark_catalog" \ --conf "spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog" \ --conf "spark.sql.catalog.spark_catalog.type=rest" \ --conf "spark.sql.catalog.spark_catalog.uri=https://glue.us-east-1.amazonaws.com/iceberg" \ --conf "spark.sql.catalog.spark_catalog.warehouse = {AWS_ACCOUNT_ID}" \ --conf "spark.sql.catalog.spark_catalog.rest.sigv4-enabled=true" \ --conf "spark.sql.catalog.spark_catalog.rest.signing-name=glue" \ --conf "spark.sql.catalog.spark_catalog.rest.signing-region=us-east-1" \ --conf "spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO" \ --conf "spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.SimpleAWSCredentialProvider"
  3. Query data in the Data Catalog.

    spark.sql("create database myicebergdb").show()
    spark.sql("""CREATE TABLE myicebergdb.mytbl (name string) USING iceberg location 's3://bucket_name/mytbl'""")
    spark.sql("insert into myicebergdb.mytbl values('demo') ").show()