Choosing a migration strategy
When you move a table to Apache Iceberg format, the main decision is whether to use
in-place migration or full data migration. To determine which approach suits your
table, consider the following questions and recommendations:
| Question | Recommendation |
|---|---|
| What is the data file format (for example, CSV or Apache Parquet)? | Consider in-place migration if your table file format is Parquet, ORC, or Avro. For other formats, such as CSV or JSON, use full data migration. |
| Do you want to update or consolidate the table schema? | If you want to evolve the table schema by using Iceberg native capabilities, consider in-place migration. For example, you can rename columns after the migration, because the schema can be changed in the Iceberg metadata layer. If you want to remove entire columns because they are no longer needed, we recommend that you use full data migration. |
| Would the table benefit from changing the partition strategy? | If Iceberg's partitioning approach meets your requirements (for example, new data is stored by using the new partition layout while existing partitions remain as is), consider in-place migration. If you want to use hidden partitions in your table, consider full data migration. For more information about hidden partitions, see the Best practices section. |
| Would the table benefit from adding or changing the sort order strategy? | Adding or changing the sort order of your data requires rewriting the dataset. In this case, consider using full data migration. For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and running compaction (with sorting enabled) for the most frequently accessed partitions. |
| Does the table have many small files? | Merging small files into larger files requires rewriting the dataset. In this case, consider using full data migration. For large tables where it's prohibitively expensive to rewrite all the table partitions, consider using in-place migration and running compaction (with sorting enabled) for the most frequently accessed partitions. |
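
The questions in the table can be read as a small decision procedure. The following Python sketch is one possible encoding of it, not an official tool: the `TableProfile` class, its field names, and the `recommend_strategy` function are hypothetical names introduced for illustration, and the precedence between conflicting answers (for example, a CSV table that is also too large to rewrite) is an assumption.

```python
from dataclasses import dataclass

IN_PLACE = "in-place migration"
FULL = "full data migration"

# File formats that the table lists as candidates for in-place migration.
IN_PLACE_FORMATS = {"parquet", "orc", "avro"}


@dataclass
class TableProfile:
    """Hypothetical summary of a source table; field names are illustrative."""
    file_format: str
    drop_columns: bool = False              # remove entire columns
    needs_hidden_partitions: bool = False   # adopt hidden partitioning
    needs_sort_or_compaction: bool = False  # new sort order or many small files
    too_large_to_rewrite: bool = False      # full rewrite is prohibitively expensive


def recommend_strategy(profile: TableProfile) -> tuple[str, list[str]]:
    """Return a recommended strategy and the reasons behind it."""
    full_reasons = []
    notes = []

    if profile.file_format.lower() not in IN_PLACE_FORMATS:
        full_reasons.append("file format is not Parquet, ORC, or Avro")
    if profile.drop_columns:
        full_reasons.append("entire columns are being removed")
    if profile.needs_hidden_partitions:
        full_reasons.append("hidden partitions are required")
    if profile.needs_sort_or_compaction:
        if profile.too_large_to_rewrite:
            # Assumption: cost wins, so keep the files and compact selectively.
            notes.append("run sorted compaction on frequently accessed partitions")
        else:
            full_reasons.append("sorting or merging files rewrites the dataset")

    if full_reasons:
        return FULL, full_reasons
    return IN_PLACE, notes


if __name__ == "__main__":
    profile = TableProfile("parquet", needs_sort_or_compaction=True,
                           too_large_to_rewrite=True)
    print(recommend_strategy(profile))
```

In this sketch, any single reason for a full rewrite overrides the in-place option. A real assessment would also weigh cost, downtime, and how often each partition is read.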
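
For the case where a large table is migrated in place and only its most frequently accessed partitions are compacted with sorting, one option in Spark is the Iceberg `rewrite_data_files` procedure with the `sort` strategy. The sketch below only builds the SQL statement. The catalog name `glue_catalog`, the table `db.events`, the sort column, and the filter are placeholder assumptions, and the helper `build_sorted_compaction_call` is a hypothetical name, not part of any library.

```python
from typing import Optional


def build_sorted_compaction_call(catalog: str, table: str, sort_order: str,
                                 where: Optional[str] = None) -> str:
    """Build a Spark SQL CALL statement for sorted compaction.

    The optional `where` filter limits the rewrite to selected partitions.
    """
    for value in (table, sort_order, where or ""):
        if "'" in value:
            # Keep the sketch simple: no escaping of embedded single quotes.
            raise ValueError("single quotes are not supported in this sketch")

    args = [
        f"table => '{table}'",
        "strategy => 'sort'",
        f"sort_order => '{sort_order}'",
    ]
    if where:
        args.append(f"where => '{where}'")
    joined = ", ".join(args)
    return f"CALL {catalog}.system.rewrite_data_files({joined})"


if __name__ == "__main__":
    stmt = build_sorted_compaction_call(
        "glue_catalog",                      # placeholder catalog name
        "db.events",                         # placeholder table
        "event_time ASC",                    # placeholder sort order
        where='event_date >= "2024-01-01"',  # placeholder partition filter
    )
    print(stmt)
    # To run it, pass the statement to a SparkSession that is configured
    # with an Iceberg catalog, for example: spark.sql(stmt)
```

Restricting the rewrite with a filter keeps the cost proportional to the partitions that benefit most, which is the trade-off the last two rows of the table describe.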