Hive Joins

Joins in Hive

Hive can perform Joins between large datasets. Joins are used to combine rows from two or more tables.
Download sample Dataset1 Dataset2 for Joining operation.
  
Creation and data loading in Sample table 1 :-
hive (hivedb)> create table samptab1(id int,name string,salary int)
             > row format delimited
             > fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 3.165 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data1.log' into table samptab1;
Loading data to table hivedb.samptab1
OK
Time taken: 2.457 seconds
hive (hivedb)> select * from samptab1;
OK
200    Snehal    50000
201    Nikhil    30000
202    Samir    45000
203    Rahul    40000
204    Ajay    25000
205    Prashant    21000
206    Aditya    50500
207    Rushi    30500
208    Vivek    45500
209    Rohan    40500
210    Vicky    25500
211    Addy    50500
212    Rose    30500
213    Vikrant    45500
214    Rohini    10500
215    Simran    35500
216    Samli    55500
217    Jeetu    35500
218    Sanket    43500
219    Sudhir    45500
220    Gagan    24500
221    Deep    59500
222    Rudra    36500
223    Shruti    45800
224    Suruchi    42500
225    Vaibhav    28500
226    Kaustubh    57500
227    Jitesh    33500
228    Mayur    44500
229    Mohit    41500
Time taken: 0.745 seconds, Fetched: 30 row(s)

Creation and data loading in Sample table 2 :-
hive (hivedb)> create table samptab2(eid int,nickname string,bonus int)
             > row format delimited
             > fields terminated by '\t'
             > stored as textfile;
OK
Time taken: 1.698 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data2.log' into table samptab2;
Loading data to table hivedb.samptab2
OK
Time taken: 1.031 seconds
hive (hivedb)> desc samptab2;
OK
eid                     int                                         
nickname                string                                      
bonus                   int                                         
Time taken: 0.117 seconds, Fetched: 3 row(s)
hive (hivedb)> select * from samptab2;
OK
200    Sonu    5000
201    Nik    3000
202    Sam    4000
203    Chotu    4000
204    Aju    5000
205    Prashu    2000
206    Adi    5060
207    Rushi    3500
208    Vik    4550
209    Rohan    4500
210    Vicky    2550
301    Adi    5050
302    Rush    3050
303    Golu    4550
304    Chiku    1050
305    Monu    3550
306    Chutki    5550
307    Tunu    3550
308    Laddu    4350
309    Tinku    4550
310    Piku    2450
Time taken: 0.345 seconds, Fetched: 21 row(s)

INNER JOIN:-
             INNER JOIN returns all rows from multiple tables where each match in the input tables results in a row in the output.
        Syntax: - CREATE TABLE TableName AS 
                       SELECT (ColNames) FROM Table1 JOIN Table2 ON (CommonCol1=CommonCol2)
hive (hivedb)> create table innerjoin as select id,name,salary,eid,nickname,bonus from samptab1 join samptab2 on (id=eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202052540_909d164f-64d6-4e94-9b4a-08ca01aaccc0
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-02 05:26:30    Starting to launch local task to process map join;    maximum memory = 518979584
2017-02-02 05:26:41    Dump the side-table for tag: 1 with group count: 21 into file: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-25-40_606_1523263717775134206-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2017-02-02 05:26:41    Uploaded 1 File to: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-25-40_606_1523263717775134206-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (866 bytes)
2017-02-02 05:26:41    End of local task; Time Taken: 11.519 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486003801406_0004, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0004/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1486003801406_0004
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2017-02-02 05:27:29,460 Stage-4 map = 0%,  reduce = 0%
2017-02-02 05:28:07,189 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 7.47 sec
MapReduce Total cumulative CPU time: 7 seconds 470 msec
Ended Job = job_1486003801406_0004
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/innerjoin
MapReduce Jobs Launched: 
Stage-Stage-4: Map: 1   Cumulative CPU: 7.47 sec   HDFS Read: 7208 HDFS Write: 409 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 470 msec
OK
Time taken: 152.04 seconds
hive (hivedb)> select * from innerjoin;
OK
200    Snehal    50000    200    Sonu    5000
201    Nikhil    30000    201    Nik    3000
202    Samir    45000    202    Sam    4000
203    Rahul    40000    203    Chotu    4000
204    Ajay    25000    204    Aju    5000
205    Prashant    21000    205    Prashu    2000
206    Aditya    50500    206    Adi    5060
207    Rushi    30500    207    Rushi    3500
208    Vivek    45500    208    Vik    4550
209    Rohan    40500    209    Rohan    4500
210    Vicky    25500    210    Vicky    2550
Time taken: 0.609 seconds, Fetched: 11 row(s)

LEFT OUTER JOIN:-
         LEFT JOIN returns all the values from the left table and the matched values from the right table
                Syntax: - CREATE TABLE TableName AS 
                               SELECT (ColNames) FROM Table1 LEFT OUTER JOIN Table2 ON(CommonCol1=CommonCol2)
hive (hivedb)> create table leftjoin as select a.id,a.name,a.salary,b.eid,b.nickname,b.bonus from samptab1 a left outer join samptab2 b on (a.id=b.eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202054237_018fd776-482e-42bf-8344-db37f3cd6381
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-02 05:43:06    Starting to launch local task to process map join;    maximum memory = 518979584
2017-02-02 05:43:12    Dump the side-table for tag: 1 with group count: 21 into file: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-42-37_532_3515579799160890341-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile11--.hashtable
2017-02-02 05:43:12    Uploaded 1 File to: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-42-37_532_3515579799160890341-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile11--.hashtable (866 bytes)
2017-02-02 05:43:12    End of local task; Time Taken: 5.599 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486003801406_0005, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0005/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1486003801406_0005
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2017-02-02 05:43:47,290 Stage-4 map = 0%,  reduce = 0%
2017-02-02 05:44:07,944 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 3.79 sec
MapReduce Total cumulative CPU time: 3 seconds 790 msec
Ended Job = job_1486003801406_0005
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/leftjoin
MapReduce Jobs Launched: 
Stage-Stage-4: Map: 1   Cumulative CPU: 3.79 sec   HDFS Read: 6850 HDFS Write: 895 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 790 msec
OK
Time taken: 93.854 seconds
hive (hivedb)> select * from leftjoin;
OK
200    Snehal    50000    200    Sonu    5000
201    Nikhil    30000    201    Nik    3000
202    Samir    45000    202    Sam    4000
203    Rahul    40000    203    Chotu    4000
204    Ajay    25000    204    Aju    5000
205    Prashant    21000    205    Prashu    2000
206    Aditya    50500    206    Adi    5060
207    Rushi    30500    207    Rushi    3500
208    Vivek    45500    208    Vik    4550
209    Rohan    40500    209    Rohan    4500
210    Vicky    25500    210    Vicky    2550
211    Addy    50500    NULL    NULL    NULL
212    Rose    30500    NULL    NULL    NULL
213    Vikrant    45500    NULL    NULL    NULL
214    Rohini    10500    NULL    NULL    NULL
215    Simran    35500    NULL    NULL    NULL
216    Samli    55500    NULL    NULL    NULL
217    Jeetu    35500    NULL    NULL    NULL
218    Sanket    43500    NULL    NULL    NULL
219    Sudhir    45500    NULL    NULL    NULL
220    Gagan    24500    NULL    NULL    NULL
221    Deep    59500    NULL    NULL    NULL
222    Rudra    36500    NULL    NULL    NULL
223    Shruti    45800    NULL    NULL    NULL
224    Suruchi    42500    NULL    NULL    NULL
225    Vaibhav    28500    NULL    NULL    NULL
226    Kaustubh    57500    NULL    NULL    NULL
227    Jitesh    33500    NULL    NULL    NULL
228    Mayur    44500    NULL    NULL    NULL
229    Mohit    41500    NULL    NULL    NULL
Time taken: 0.261 seconds, Fetched: 30 row(s)

RIGHT OUTER JOIN:-
          RIGHT OUTER JOIN returns all the values from the right table and the matched values from the left table
              Syntax: - CREATE TABLE TableName AS 
                              SELECT (ColNames) FROM Table1 RIGHT OUTER JOIN Table2 ON(CommonCol1=CommonCol2)
hive (hivedb)> create table rightjoin as select a.id,a.name,a.salary,b.eid,b.nickname,b.bonus from samptab1 a right outer join samptab2 b on (a.id=b.eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202055210_14f246dd-2f8c-46db-b679-4fcbcb706632
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-02 05:52:38    Starting to launch local task to process map join;    maximum memory = 518979584
2017-02-02 05:52:42    Dump the side-table for tag: 0 with group count: 30 into file: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-52-10_610_1856544679281279376-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile20--.hashtable
2017-02-02 05:52:42    Uploaded 1 File to: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-52-10_610_1856544679281279376-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile20--.hashtable (1150 bytes)
2017-02-02 05:52:42    End of local task; Time Taken: 4.03 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486003801406_0006, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0006/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1486003801406_0006
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2017-02-02 05:53:18,278 Stage-4 map = 0%,  reduce = 0%
2017-02-02 05:53:40,828 Stage-4 map = 100%,  reduce = 0%, Cumulative CPU 3.88 sec
MapReduce Total cumulative CPU time: 3 seconds 880 msec
Ended Job = job_1486003801406_0006
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/rightjoin
MapReduce Jobs Launched: 
Stage-Stage-4: Map: 1   Cumulative CPU: 3.88 sec   HDFS Read: 6680 HDFS Write: 643 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 880 msec
OK
Time taken: 93.363 seconds
hive (hivedb)> select * from rightjoin;
OK
200    Snehal    50000    200    Sonu    5000
201    Nikhil    30000    201    Nik    3000
202    Samir    45000    202    Sam    4000
203    Rahul    40000    203    Chotu    4000
204    Ajay    25000    204    Aju    5000
205    Prashant    21000    205    Prashu    2000
206    Aditya    50500    206    Adi    5060
207    Rushi    30500    207    Rushi    3500
208    Vivek    45500    208    Vik    4550
209    Rohan    40500    209    Rohan    4500
210    Vicky    25500    210    Vicky    2550
NULL    NULL    NULL    301    Adi    5050
NULL    NULL    NULL    302    Rush    3050
NULL    NULL    NULL    303    Golu    4550
NULL    NULL    NULL    304    Chiku    1050
NULL    NULL    NULL    305    Monu    3550
NULL    NULL    NULL    306    Chutki    5550
NULL    NULL    NULL    307    Tunu    3550
NULL    NULL    NULL    308    Laddu    4350
NULL    NULL    NULL    309    Tinku    4550
NULL    NULL    NULL    310    Piku    2450
Time taken: 0.311 seconds, Fetched: 21 row(s)

FULL OUTER JOIN:-
       FULL OUTER JOIN returns all rows from the left table and from the right table. In other words, it combines the result of both LEFT and RIGHT joins.
              Syntax: - CREATE TABLE TableName AS 
                              SELECT (ColNames) FROM Table1 FULL OUTER JOIN Table2 ON(CommonCol1=CommonCol2)
hive (hivedb)> create table outerjoin as select a.id,a.name,a.salary,b.eid,b.nickname,b.bonus from samptab1 a full outer join samptab2 b on (a.id=b.eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202061350_e5de9815-9bb6-4143-92e7-55822f207daa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1486003801406_0007, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0007/
Kill Command = /usr/local/hadoop/bin/hadoop job  -kill job_1486003801406_0007
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-02-02 06:14:46,982 Stage-1 map = 0%,  reduce = 0%
2017-02-02 06:15:47,277 Stage-1 map = 0%,  reduce = 0%
2017-02-02 06:16:48,323 Stage-1 map = 0%,  reduce = 0%
2017-02-02 06:17:49,415 Stage-1 map = 0%,  reduce = 0%
2017-02-02 06:18:11,071 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 40.22 sec
2017-02-02 06:18:15,680 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 79.25 sec
2017-02-02 06:18:40,896 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 83.98 sec
2017-02-02 06:18:43,866 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 85.31 sec
MapReduce Total cumulative CPU time: 1 minutes 25 seconds 310 msec
Ended Job = job_1486003801406_0007
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/outerjoin
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 85.31 sec   HDFS Read: 15689 HDFS Write: 1131 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 25 seconds 310 msec
OK
Time taken: 296.676 seconds

Comments

  1. Nice post ! Thanks for sharing valuable information with us. Keep sharing.Big data hadoop online training

    ReplyDelete
  2. thakyou it vry nice blog for beginners
    https://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/

    ReplyDelete
  3. It is really a great work and the way in which you are sharing the knowledge is excellent.

    big data company in chennai

    ReplyDelete

Post a Comment