Java And Big Data : Install and run pig 0.15.0 on hadoop 2.6.2 for Linux or Ubuntu

Below are instructions to download, install and run Pig 0.15.0.

Download the pig-0.15.0 tar.gz and extract it.

Move the extracted directory to a new location; on my machine that is /usr/local/pig-0.15.0:
sudo mv pig-0.15.0 /usr/local/

Now set up the environment variables. Below are the variables on my machine (Ubuntu 14+ LTS), added to the .bashrc file. $HADOOP_INSTALL must already point to your hadoop installation:
#PIG VARS
export PIG_HOME='/usr/local/pig-0.15.0'
export PATH=$PIG_HOME/bin:$PATH
export PIG_CLASSPATH=$HADOOP_INSTALL/etc/hadoop
#PIG VARS END
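
Before going further it can help to confirm that the variables above were picked up by the current shell. The function below is only a small sketch of such a check; the name check_pig_env is made up for this example and is not part of Pig.

```shell
# Hypothetical helper (not shipped with Pig): checks that PIG_HOME is set
# and that it contains an executable bin/pig launcher.
check_pig_env() {
  if [ -z "${PIG_HOME:-}" ]; then
    echo "PIG_HOME is not set"
    return 1
  fi
  if [ ! -x "$PIG_HOME/bin/pig" ]; then
    echo "no executable pig launcher under $PIG_HOME/bin"
    return 1
  fi
  echo "pig launcher found at $PIG_HOME/bin/pig"
}
```

After running "source ~/.bashrc", call check_pig_env in the same terminal. If it reports the launcher, "pig -version" should also print the installed version.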

Now start hadoop and yarn with the following commands:

start-dfs.sh

start-yarn.sh

Now try the "pig" command in a terminal; you will see the message "Connecting to hadoop file system at: hdfs://localhost:9000".

Now submit the following code in the grunt shell:

a = load '/user/hduser/wordcount/input/file0';

b = foreach a generate flatten(TOKENIZE((chararray)$0)) as word;

c = group b by word;

d = foreach c generate COUNT(b), group;

store d into '/user/hduser/wordcount/output/pig_wordcount';

Notes:
/user/hduser/wordcount/input/file0 is an hdfs path; you can use the put command (hdfs dfs -put) to copy files from the local file system to hdfs.

You can also watch the job running in the Yarn web UI:

http://localhost:8088/cluster

Verify the output:

hdfs dfs -cat /user/hduser/wordcount/output/pig_wordcount/part-r-00000
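
To sanity-check the result, the same count can be approximated on a local copy of the input with standard shell tools. This is only a sketch and not how Pig computes it. The function name local_wordcount is made up, and it assumes that TOKENIZE splits on space, double quote, comma, parentheses and star, and that $0 is the first tab-separated field of each line.

```shell
# Hypothetical local approximation of the Pig word count above.
# Assumption: TOKENIZE delimiters are space, double quote, comma, ( ) and *.
# Output format mirrors relation d: count, a tab, then the word.
local_wordcount() {
  cut -f1 "$1" |            # $0 in the Pig script: first tab-separated field
    tr -s ' ",()*' '\n' |   # one token per line
    grep -v '^$' |          # drop empty tokens
    sort |
    uniq -c |
    awk '{ print $1 "\t" $2 }'
}
```

Run it as "local_wordcount file0" on the local file. Pig does not guarantee the order of the output rows, so sort the hdfs output before comparing the two.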

Running pig via script:

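Example input:
File names: input-new and input-new1 (both have the same content). The fields must be separated by tabs, because the script below loads them with PigStorage('\t').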
mike 3 40000 boston google
mike1 12 40000 california facebook
mike2 13 60000 seattle microsoft
mike3 32 100000 portland microsoft
mike4 32 1000000 portland cisco
santa 25 20000 banglore flipkart
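
Copying the rows above from a web page usually turns the tabs into spaces, so one way to get correctly delimited files is to generate them. This is a sketch that writes both files into the current directory. The upload line is left as a comment because it needs a running hadoop, and the target directory is taken from the LOAD paths in the script below.

```shell
# Write the six sample rows with real tab characters between the five fields.
# printf reuses the format string for every group of five arguments.
printf '%s\t%s\t%s\t%s\t%s\n' \
  mike  3  40000   boston     google \
  mike1 12 40000   california facebook \
  mike2 13 60000   seattle    microsoft \
  mike3 32 100000  portland   microsoft \
  mike4 32 1000000 portland   cisco \
  santa 25 20000   banglore   flipkart \
  > input-new

# The second input file is an identical copy.
cp input-new input-new1

# With hadoop running, copy both files to hdfs:
# hdfs dfs -put input-new input-new1 /user/hduser/wordcount/input/
```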

example1.pig (this is the script name; you can store it anywhere, I use /usr/local/pig-0.15.0/scripts/myscripts):

input1 = LOAD '/user/hduser/wordcount/input/input-new' USING PigStorage('\t') AS (name, age:int, salary:int, city, company);

input2 = LOAD '/user/hduser/wordcount/input/input-new1' USING PigStorage('\t') AS (name, age:int, salary:int, city, company);

filterinput = FILTER input1 BY age > 12;

STORE filterinput INTO '/user/hduser/wordcount/output/result1';

groupinput = GROUP filterinput BY (city);

STORE groupinput INTO '/user/hduser/wordcount/output/result3';

calculatesum = FOREACH groupinput GENERATE group, filterinput.name, SUM(filterinput.salary);

STORE calculatesum INTO '/user/hduser/wordcount/output/result2';

Now you can run it with the following command:

pig example1.pig

Check the URL http://localhost:8088/cluster (yarn); it will show your job running.

Verify the output:

hdfs dfs -ls /user/hduser/wordcount/output/result2
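
To know what to expect in result2, the filter, group and sum steps can be reproduced locally with awk on the tab-separated input file. This is only a sketch with a made-up function name, expected_city_sums. It prints just the city and the salary sum, while each row that Pig writes to result2 also contains the bag of names.

```shell
# Hypothetical local check for example1.pig: keep rows with age > 12,
# group them by city (field 4) and sum the salary (field 3).
expected_city_sums() {
  awk -F'\t' '
    $2 > 12 { sum[$4] += $3 }
    END     { for (city in sum) print city "\t" sum[city] }
  ' "$1" | sort
}
```

For the sample input, "expected_city_sums input-new" gives banglore 20000, portland 1100000 and seattle 60000, which should match the sums in the part files under result2.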

For more hadoop commands click here

For hadoop installation click here

Friday, December 18, 2015