Hadoop

  • If you have an Impala table saved as a comma separated file and would like to get a speed improvement while performing analytical queries, one of the easiest aids to reach is converting the table into Parquet, the optimized columnar store format for Impala and Hive. Doing this is really simple:

    impala-shell> create table facts_parquet like facts stored as parquet;
    impala-shell> insert into facts_parquet select * from facts;

    The "stored as parquet" clause is all you have to add.

    Parquet is compressed, fitting much better in RAM. And Parquet is columnar, allowing you to load only the columns your query uses, without accessing the others. That's where the big improvement comes from. Now substitute the facts_parquet table to facts in your queries and enjoy your data accelerating like electrons in a collider!

  • Provided that you have your application ID, getting its logs is as easy as:

    $ yarn logs -applicationId <application_id_00398384848_9398>

     

  • When you move Hue database to another server or from sqlite to mysql and just want to start over, with a fresh empty DB, do as follows:

    1. configure Hue to use the right database credentials, without starting it
    2. on the host that Hue is configured to run on, do: 
      /opt/cloudera/parcels/CDH/lib/hue/build/env/bin/hue syncdb --noinput
      This will create all the tables and let Hue run.
    3. start Hue and login

     

  • What's the first action you take when a log says: no space left on device? Mine is:

    $ df -h /mount/point

    Well, quite surprisingly at first, a filesystem can be full even when there's still plenty of space on it. Is something that rarely happened to me, so rarely that I usually don't check the other side of the coin: free inodes!


  The Cog In The Machine On Which All Depends