Friday, December 18, 2015

Install and run Pig 0.15.0 on Hadoop 2.6.2 for Linux or Ubuntu

Below are instructions to download, install, and run Pig 0.15.0.

Download the pig-0.15.0.tar.gz archive and extract it; this produces a pig-0.15.0 directory.

Move the extracted directory to its final location (on my machine, /usr/local/pig-0.15.0):
      sudo mv pig-0.15.0 /usr/local/
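
The unpack-and-move step can be sketched as a single helper. This is only an illustration: install_pig is a hypothetical name, and it assumes the archive contains a single top-level directory (pig-0.15.0).

```shell
# Hypothetical helper: unpack a Pig tarball and move the resulting
# directory under a target prefix. Runs in a subshell so its variables
# do not leak into the calling shell.
install_pig() (
  tarball="$1"   # e.g. pig-0.15.0.tar.gz
  prefix="$2"    # e.g. /usr/local (must already exist)
  # Extract into a scratch directory first, then move the result.
  workdir="$(mktemp -d)" &&
  tar -xzf "$tarball" -C "$workdir" &&
  mv "$workdir"/* "$prefix"/ &&
  rmdir "$workdir"
)
```

Writing to /usr/local usually needs root, so on a real machine the move would be run with sudo as in the command above; the helper as written only works for a prefix the current user can write to.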

Now set up the environment variables. Below are the variables on my machine (Ubuntu 14+ LTS), added to ~/.bashrc. Note that $HADOOP_INSTALL must already point to your Hadoop directory:
     #PIG VARS
     export PIG_HOME='/usr/local/pig-0.15.0'
     export PATH=$PIG_HOME/bin:$PATH
     export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop
     #PIG VARS END
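
Before moving on, it can help to confirm that the variables point at real paths. A minimal sketch, assuming the .bashrc above has been sourced; check_pig_env is a hypothetical helper, not something shipped with Pig:

```shell
# Hypothetical helper: report whether the Pig variables point at real paths.
# Returns 0 when both checks pass, 1 otherwise.
check_pig_env() {
  status=0
  if [ ! -x "${PIG_HOME:-}/bin/pig" ]; then
    echo "PIG_HOME does not contain an executable bin/pig: ${PIG_HOME:-<unset>}"
    status=1
  fi
  if [ ! -d "${PIG_CLASSPATH:-}" ]; then
    echo "PIG_CLASSPATH is not a directory: ${PIG_CLASSPATH:-<unset>}"
    status=1
  fi
  return "$status"
}
```

Typical use would be to open a new terminal (or run "source ~/.bashrc") and then call check_pig_env; no output means both paths look right.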

Now start HDFS and YARN with the following commands:
     start-dfs.sh
     start-yarn.sh
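
To confirm the daemons actually came up, the output of the JDK's jps tool can be checked. The sketch below assumes a single-node setup where NameNode, DataNode, ResourceManager, and NodeManager all run on the same machine; missing_daemons is a hypothetical helper name.

```shell
# Hypothetical helper: read `jps` output on stdin and print every expected
# Hadoop daemon that is not listed. Prints nothing when all are running.
missing_daemons() {
  # jps prints "<pid> <class name>" per line; keep only the class names.
  running="$(awk '{print $2}')"
  for daemon in NameNode DataNode ResourceManager NodeManager; do
    # -x matches whole lines, so SecondaryNameNode does not count as NameNode.
    printf '%s\n' "$running" | grep -qx "$daemon" || echo "$daemon"
  done
}
```

Usage would be "jps | missing_daemons"; any name it prints is a daemon that did not start.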

Now run the "pig" command in a terminal. You should see the message "Connecting to hadoop file system at: hdfs://localhost:9000", followed by the grunt> prompt.

Now submit the following code in the grunt shell:

   a = load '/user/hduser/wordcount/input/file0';
   b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;
   c = group b by word;
   d = foreach c generate COUNT(b), group;
   store d into '/user/hduser/wordcount/output/pig_wordcount';
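
As a rough cross-check on small inputs, the same count can be approximated with standard Unix tools. This is a sketch, not Pig's implementation: it assumes TOKENIZE's default delimiters are space, double quote, comma, parentheses, and star, and wordcount_local is a hypothetical helper name.

```shell
# Hypothetical helper: approximate the Pig word count for text on stdin.
# Prints "count<TAB>word" per line, matching the column order of relation d.
wordcount_local() {
  # Turn every assumed delimiter into a newline, squeezing repeats.
  tr -s ' ",()*' '[\n*]' |
    grep -v '^$' |            # drop a possible empty first line
    sort |
    uniq -c |                 # prefix each distinct word with its count
    awk '{print $1 "\t" $2}'  # reshape to count<TAB>word
}
```

For example, "wordcount_local < file0" on the local copy of the input should give the same counts as the Pig job, though the row order of the two outputs may differ.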

Notes:
    /user/hduser/wordcount/input/file0 is an HDFS path. Use the put command (hdfs dfs -put) to copy files from the local file system to HDFS.
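
A minimal sketch of that upload, assuming the hdfs command is on your PATH; upload_to_hdfs is a hypothetical helper, and the example paths are the ones used in this post:

```shell
# Hypothetical helper: create an HDFS directory if needed and copy a
# local file into it. Requires the `hdfs` command on PATH.
upload_to_hdfs() {
  local_file="$1"   # e.g. ./file0
  hdfs_dir="$2"     # e.g. /user/hduser/wordcount/input
  # -p creates parent directories and does not fail if the path exists.
  hdfs dfs -mkdir -p "$hdfs_dir" &&
  hdfs dfs -put "$local_file" "$hdfs_dir"/
}
```

For the word count above, the call would be "upload_to_hdfs file0 /user/hduser/wordcount/input".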

You can also watch the job run in the YARN web UI:
   http://localhost:8088/cluster

Verify the output:
hdfs dfs -cat /user/hduser/wordcount/output/pig_wordcount/part-r-00000

Running Pig via a script:

Example input:
File names: input-new and input-new1 (both have the same content; columns are tab-separated)
mike    3       40000   boston  google
mike1   12      40000   california      facebook
mike2   13      60000   seattle microsoft
mike3   32      100000  portland        microsoft
mike4   32      1000000 portland        cisco
santa   25      20000   banglore        flipkart
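
Because the script below loads the data with PigStorage('\t'), the columns must be separated by real tab characters, which are easy to lose when copying from a web page. A minimal sketch that writes the rows above with explicit tabs; write_sample_input is a hypothetical helper name:

```shell
# Hypothetical helper: write the sample rows to a file, tab-separated.
# printf reuses the five-field format for each group of five arguments.
write_sample_input() {
  out="$1"   # local path of the file to create, e.g. input-new
  printf '%s\t%s\t%s\t%s\t%s\n' \
    mike  3  40000   boston     google \
    mike1 12 40000   california facebook \
    mike2 13 60000   seattle    microsoft \
    mike3 32 100000  portland   microsoft \
    mike4 32 1000000 portland   cisco \
    santa 25 20000   banglore   flipkart > "$out"
}
```

After "write_sample_input input-new", the file can be copied to /user/hduser/wordcount/input/ with hdfs dfs -put.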

example1.pig (this is the script name; you can store it anywhere. I use /usr/local/pig-0.15.0/scripts/myscripts):

input1 = LOAD '/user/hduser/wordcount/input/input-new' USING PigStorage('\t') AS (name, age:int, salary:int, city, company);
input2 = LOAD '/user/hduser/wordcount/input/input-new1' USING PigStorage('\t') AS (name, age:int, salary:int, city, company);
filterinput = FILTER input1 BY age > 12;
STORE filterinput INTO '/user/hduser/wordcount/output/result1';
groupinput = GROUP filterinput BY (city);
STORE groupinput into '/user/hduser/wordcount/output/result3';
calculatesum = FOREACH groupinput GENERATE group,filterinput.name, SUM(filterinput.salary);
STORE calculatesum into '/user/hduser/wordcount/output/result2';
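
To know what to expect in result2, the filter and per-city sum can be approximated locally with awk. This is a sketch for cross-checking only, not how Pig computes it: salary_by_city is a hypothetical helper, and it leaves out the bag of names that the Pig script also emits.

```shell
# Hypothetical helper: for rows with age > 12, total the salary per city.
# Reads tab-separated rows (name, age, salary, city, company) on stdin
# and prints "city<TAB>total", sorted by city.
salary_by_city() {
  awk -F'\t' '$2 > 12 { total[$4] += $3 }
              END { for (city in total) print city "\t" total[city] }' | sort
}
```

With the sample input, only mike2, mike3, mike4, and santa pass the age filter, so the totals are 20000 for banglore, 1100000 for portland, and 60000 for seattle.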

Now you can run it with the following command:

pig example1.pig

Check the URL http://localhost:8088/cluster (YARN); it will show your job running.

Verify the output (this lists the part files written by the job):

hdfs dfs -ls /user/hduser/wordcount/output/result2

For more Hadoop commands, click here

For Hadoop installation, click here
