Choosing a migration strategy
When you transition a table to Iceberg format, the choice between in-place migration
and full data migration is crucial. To determine the most suitable approach for your
needs, consider the following questions and recommendations:
| Question | Recommendation |
| --- | --- |
| What is the data file format (for example, CSV or Apache Parquet)? | - Consider in-place migration if your table file format is Parquet, ORC, or Avro.<br>- For other formats, such as CSV or JSON, use full data migration. |
| Do you want to update or consolidate the table schema? | - If you want to evolve the table schema by using Iceberg native capabilities, consider in-place migration. For example, you can rename columns after the migration. (The schema can be changed in the Iceberg metadata layer.)<br>- If you want to remove entire columns because they are no longer needed, we recommend that you use full data migration. |
| Would the table benefit from changing the partition strategy? | - If Iceberg's partitioning approach meets your requirements (for example, new data is stored by using the new partition layout while existing partitions remain as is), consider in-place migration.<br>- If you want to use hidden partitions in your table, consider full data migration. For more information about hidden partitions, see the Best practices section. |
| Would the table benefit from adding or changing the sort order strategy? | - Adding or changing the sort order of your data requires rewriting the dataset. In this case, consider using full data migration.<br>- For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and running compaction (with sorting enabled) for the most frequently accessed partitions. |
| Does the table have many small files? | - Merging small files into larger files requires rewriting the dataset. In this case, consider using full data migration.<br>- For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and running compaction (with sorting enabled) for the most frequently accessed partitions. |
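
The recommendations in the preceding table can be sketched as a small decision helper. The following Python example is a minimal sketch, not an AWS or Iceberg API: `TableProfile`, `recommend_migration`, and all field names are hypothetical. The rule that any answer pointing to full data migration takes precedence is also an assumption, because the table doesn't state how to combine answers.

```python
from dataclasses import dataclass
from typing import List, Tuple

# File formats that the table lists as candidates for in-place migration.
IN_PLACE_FORMATS = {"parquet", "orc", "avro"}


@dataclass
class TableProfile:
    """Hypothetical answers to the questions in the table."""
    file_format: str                      # for example "parquet" or "csv"
    drop_columns: bool = False            # remove entire columns that are no longer needed
    needs_hidden_partitions: bool = False
    needs_new_sort_order: bool = False
    has_many_small_files: bool = False
    rewrite_too_expensive: bool = False   # rewriting all partitions is prohibitively expensive


def recommend_migration(profile: TableProfile) -> Tuple[str, List[str]]:
    """Return ("full" | "in-place", reasons or follow-up actions)."""
    reasons: List[str] = []

    if profile.file_format.lower() not in IN_PLACE_FORMATS:
        reasons.append("file format is not Parquet, ORC, or Avro")
    if profile.drop_columns:
        reasons.append("entire columns are to be removed")
    if profile.needs_hidden_partitions:
        reasons.append("hidden partitions are wanted")

    # Sorting and merging small files both require rewriting the dataset.
    wants_rewrite = profile.needs_new_sort_order or profile.has_many_small_files
    if wants_rewrite and not profile.rewrite_too_expensive:
        reasons.append("sorting or merging small files requires rewriting the dataset")

    # Assumption: a single reason for full data migration is enough to choose it.
    if reasons:
        return "full", reasons

    follow_ups: List[str] = []
    if wants_rewrite:
        # Large table: migrate in place, then compact only the hot partitions.
        follow_ups.append(
            "run compaction (with sorting enabled) on the most frequently accessed partitions"
        )
    return "in-place", follow_ups
```

For example, a large Parquet table with many small files, where `rewrite_too_expensive=True`, yields `"in-place"` with compaction as the follow-up action, whereas a CSV table always yields `"full"`.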