If you have an Impala table stored as a comma-separated file and want faster analytical queries, one of the easiest wins is converting the table to Parquet, the optimized columnar storage format for Impala and Hive. Doing this is really simple:

impala-shell> create table facts_parquet like facts stored as parquet;
impala-shell> insert into facts_parquet select * from facts;

The "stored as parquet" clause is all you have to add.

Parquet is compressed, so it fits much better in RAM. And Parquet is columnar, allowing a query to load only the columns it actually uses, without touching the others. That's where the big improvement comes from. Now substitute facts_parquet for facts in your queries and enjoy your data accelerating like electrons in a collider!
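For instance, a typical aggregation like the one below (the amount column is a hypothetical example, not part of the table above) only has to read the product_id and amount column chunks from the Parquet files, skipping everything else on disk:

impala-shell> select product_id, sum(amount) from facts_parquet group by product_id;

On a wide fact table with dozens of columns, that alone can cut the I/O by an order of magnitude.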

Sometimes you need to find out how many RAM slots on a server are already filled (and with which module model and size) without being able to open the chassis and physically inspect it. On Linux systems you can determine the exact modules installed using:

$ sudo dmidecode -t memory
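The output is fairly verbose, so it can help to filter it down to the fields you care about; for example (field labels can vary slightly between BIOS vendors, so treat the pattern as a starting point):

$ sudo dmidecode -t memory | grep -E 'Locator:|Size:|Type:|Speed:'

Empty slots show up with a size of "No Module Installed", so the populated ones are easy to count.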

Many guides about software packaging for the various distributions can be found on the net. This is not an attempt to write the next one, let alone the best one. It's just a piece of scratch paper where I want to record my procedure for creating Tagsistant packages, and if anyone finds it useful... nice to know.

Apache Hive supports partitioned tables for performance reasons and to ease update operations. Since Hive does not support the ANSI SQL update and delete operations, the only way to update a table's content is to rewrite it entirely. With partitions, only the affected partition has to be rewritten, speeding up the whole process.

A partitioned table is created as follows:

create table sells (
product_id int,
customer_id int
) partitioned by (purchase_date string);
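Data is then loaded one partition at a time, and rewriting a single day means overwriting just that partition. A sketch, assuming a hypothetical staging table raw_sells that carries a purchase_date column:

insert overwrite table sells partition (purchase_date = '2014-01-15')
select product_id, customer_id from raw_sells where purchase_date = '2014-01-15';

Only the files under the purchase_date=2014-01-15 directory get replaced; every other partition is left untouched.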

The Cog In The Machine On Which All Depends