Populating partitioned tables in Hive

Apache Hive supports partitioned tables for performance reasons and to ease update operations. Since Hive does not support the ANSI SQL update and delete operations, the only way to update a table's content is to rewrite it entirely. With partitions, only the affected partition needs to be rewritten, which speeds up the whole process.

A partitioned table is created as follows:

create table sells (
    product_id int,
    customer_id int
) partitioned by (purchase_date string);

This table is split on disk into separate directories. Each directory is a separate partition, and its name reflects the partition value, as in purchase_date=2014-04-12.
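Once the table holds some data, the partitions Hive knows about can be listed directly; the dates in the comments below are illustrative, not real output:

```sql
-- List the partition directories Hive tracks for this table.
show partitions sells;
-- purchase_date=2014-04-11
-- purchase_date=2014-04-12
```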

When populating this table, the overwrite clause can be used as usual to rewrite the content of the table, but an additional partition clause must be added. While an unpartitioned table can be updated with:

insert overwrite table sells
    select product_id, customer_id, purchase_date from sells_staging;

a partitioned version would require:

insert overwrite table sells partition (purchase_date="2014-04-12") 
    select product_id, customer_id from sells_staging;

Very nice. But what if we want to take the partition values from the select subquery, so that each row is placed in the proper partition automatically? A reasonable guess would be: leave the partition value out and Hive will figure it out on its own. Well, this is right and wrong at the same time. It is right because Hive takes the last columns of the select subquery to fill the partition values. Our example query would then become:

insert overwrite table sells partition (purchase_date)
    select product_id, customer_id, purchase_date from sells_staging;

Please note that the partition name and the source column name match only by chance here: the position is the determining factor.

However, by default Hive complains about this liberal use of the partitioning feature. Its designers were (rightly) afraid that users could unintentionally create never-ending jobs by mistakenly choosing a partitioning key that never repeats. An example? A transaction ID, which is by definition unique per transaction. Partitioning by transaction ID is clearly the wrong approach: Hive would create as many partitions as there are transactions in the staging table, and each partition would be a separate directory holding just one file with one record: that transaction. So, by default, Hive prevents you from running this kind of query.

But if you know what you are doing, Hive lets you do it. To enable dynamic partitioning, declare a session parameter:

set hive.exec.dynamic.partition = true;

With dynamic partitioning turned on, Hive reads the partition values from the subquery, but in the default strict mode at least one partition key, the leading one, must still be static. Your overwrite statement becomes:

insert overwrite table sells partition (purchase_date="2014-01-03", part2, part3)
    select ...

As you can see, the first partition is still statically declared, while the other partitions come from the subquery. In our case, with just one partition key, this clearly does not help. We need to lift this limitation and let Hive take all the partition values from the subquery. Fortunately, this can be done by declaring one more session parameter:

set hive.exec.dynamic.partition.mode = nonstrict;

In nonstrict mode, all the partition values are read from the last columns of the subquery. Our query can finally be written as:

insert overwrite table sells partition (purchase_date) 
    select product_id, customer_id, purchase_date from sells_staging;

Always remember to put the columns containing the partition values at the end. Selecting purchase_date, product_id, customer_id would not lead to the same result at all!
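To make the pitfall concrete, here is a sketch of what the wrong column order would do (do not run this against real data):

```sql
-- Wrong: customer_id is the last column, so Hive would use it as the
-- purchase_date partition value, creating one partition per customer.
insert overwrite table sells partition (purchase_date)
    select product_id, purchase_date, customer_id from sells_staging;
```

Hive matches dynamic partition columns by position, not by name, so this statement runs without complaint and silently scatters the data across meaningless partitions.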

If you want a safety limit that prevents an endless query from starting, use the hive.exec.max.dynamic.partitions option. Set it to the number of partitions you expect, or slightly more. If you mistakenly choose the wrong column, Hive will not produce more than that number of partitions and will exit with an error as soon as the limit is exceeded.
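For example, if the staging table covers roughly a year of purchase dates, a limit like the following (the value is only an example, pick one matching your data) would let the load through while stopping a runaway job early:

```sql
-- Cap the total number of dynamic partitions created in this session.
set hive.exec.max.dynamic.partitions = 500;
```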