Joins in Hive
Hive can perform Joins between large datasets. Joins are used to combine rows from two or more tables.
Creation and data loading in Sample table 1 :-
hive (hivedb)> create table samptab1(id int,name string,salary int)
> row format delimited
> fields terminated by '\t'
> stored as textfile;
OK
Time taken: 3.165 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data1.log' into table samptab1;
Loading data to table hivedb.samptab1
OK
Time taken: 2.457 seconds
hive (hivedb)> select * from samptab1;
OK
200 Snehal 50000
201 Nikhil 30000
202 Samir 45000
203 Rahul 40000
204 Ajay 25000
205 Prashant 21000
206 Aditya 50500
207 Rushi 30500
208 Vivek 45500
209 Rohan 40500
210 Vicky 25500
211 Addy 50500
212 Rose 30500
213 Vikrant 45500
214 Rohini 10500
215 Simran 35500
216 Samli 55500
217 Jeetu 35500
218 Sanket 43500
219 Sudhir 45500
220 Gagan 24500
221 Deep 59500
222 Rudra 36500
223 Shruti 45800
224 Suruchi 42500
225 Vaibhav 28500
226 Kaustubh 57500
227 Jitesh 33500
228 Mayur 44500
229 Mohit 41500
Time taken: 0.745 seconds, Fetched: 30 row(s)
Creation and data loading in Sample table 2 :-
hive (hivedb)> create table samptab2(eid int,nickname string,bonus int)
> row format delimited
> fields terminated by '\t'
> stored as textfile;
OK
Time taken: 1.698 seconds
hive (hivedb)> load data local inpath '/home/hduser/HIVE/data2.log' into table samptab2;
Loading data to table hivedb.samptab2
OK
Time taken: 1.031 seconds
hive (hivedb)> desc samptab2;
OK
eid int
nickname string
bonus int
Time taken: 0.117 seconds, Fetched: 3 row(s)
hive (hivedb)> select * from samptab2;
OK
200 Sonu 5000
201 Nik 3000
202 Sam 4000
203 Chotu 4000
204 Aju 5000
205 Prashu 2000
206 Adi 5060
207 Rushi 3500
208 Vik 4550
209 Rohan 4500
210 Vicky 2550
301 Adi 5050
302 Rush 3050
303 Golu 4550
304 Chiku 1050
305 Monu 3550
306 Chutki 5550
307 Tunu 3550
308 Laddu 4350
309 Tinku 4550
310 Piku 2450
Time taken: 0.345 seconds, Fetched: 21 row(s)
INNER JOIN:-
INNER JOIN returns all rows from multiple tables where each match in the input tables results in a row in the output.
Syntax: - CREATE TABLE TableName AS
SELECT (ColNames) FROM Table1 JOIN Table2 ON (CommonCol1=CommonCol2)
hive (hivedb)> create table innerjoin as select id,name,salary,eid,nickname,bonus from samptab1 join samptab2 on (id=eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202052540_909d164f-64d6-4e94-9b4a-08ca01aaccc0
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-02 05:26:30 Starting to launch local task to process map join; maximum memory = 518979584
2017-02-02 05:26:41 Dump the side-table for tag: 1 with group count: 21 into file: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-25-40_606_1523263717775134206-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable
2017-02-02 05:26:41 Uploaded 1 File to: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-25-40_606_1523263717775134206-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile01--.hashtable (866 bytes)
2017-02-02 05:26:41 End of local task; Time Taken: 11.519 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486003801406_0004, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0004/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1486003801406_0004
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2017-02-02 05:27:29,460 Stage-4 map = 0%, reduce = 0%
2017-02-02 05:28:07,189 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 7.47 sec
MapReduce Total cumulative CPU time: 7 seconds 470 msec
Ended Job = job_1486003801406_0004
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/innerjoin
MapReduce Jobs Launched:
Stage-Stage-4: Map: 1 Cumulative CPU: 7.47 sec HDFS Read: 7208 HDFS Write: 409 SUCCESS
Total MapReduce CPU Time Spent: 7 seconds 470 msec
OK
Time taken: 152.04 seconds
hive (hivedb)> select * from innerjoin;
OK
200 Snehal 50000 200 Sonu 5000
201 Nikhil 30000 201 Nik 3000
202 Samir 45000 202 Sam 4000
203 Rahul 40000 203 Chotu 4000
204 Ajay 25000 204 Aju 5000
205 Prashant 21000 205 Prashu 2000
206 Aditya 50500 206 Adi 5060
207 Rushi 30500 207 Rushi 3500
208 Vivek 45500 208 Vik 4550
209 Rohan 40500 209 Rohan 4500
210 Vicky 25500 210 Vicky 2550
Time taken: 0.609 seconds, Fetched: 11 row(s)
LEFT OUTER JOIN:-
LEFT JOIN returns all the values from the left table and the matched values from the right table
Syntax: - CREATE TABLE TableName AS
SELECT (ColNames) FROM Table1 LEFT OUTER JOIN Table2 ON(CommonCol1=CommonCol2)
hive (hivedb)> create table leftjoin as select a.id,a.name,a.salary,b.eid,b.nickname,b.bonus from samptab1 a left outer join samptab2 b on (a.id=b.eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202054237_018fd776-482e-42bf-8344-db37f3cd6381
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-02 05:43:06 Starting to launch local task to process map join; maximum memory = 518979584
2017-02-02 05:43:12 Dump the side-table for tag: 1 with group count: 21 into file: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-42-37_532_3515579799160890341-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile11--.hashtable
2017-02-02 05:43:12 Uploaded 1 File to: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-42-37_532_3515579799160890341-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile11--.hashtable (866 bytes)
2017-02-02 05:43:12 End of local task; Time Taken: 5.599 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486003801406_0005, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0005/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1486003801406_0005
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2017-02-02 05:43:47,290 Stage-4 map = 0%, reduce = 0%
2017-02-02 05:44:07,944 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 3.79 sec
MapReduce Total cumulative CPU time: 3 seconds 790 msec
Ended Job = job_1486003801406_0005
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/leftjoin
MapReduce Jobs Launched:
Stage-Stage-4: Map: 1 Cumulative CPU: 3.79 sec HDFS Read: 6850 HDFS Write: 895 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 790 msec
OK
Time taken: 93.854 seconds
hive (hivedb)> select * from leftjoin;
OK
200 Snehal 50000 200 Sonu 5000
201 Nikhil 30000 201 Nik 3000
202 Samir 45000 202 Sam 4000
203 Rahul 40000 203 Chotu 4000
204 Ajay 25000 204 Aju 5000
205 Prashant 21000 205 Prashu 2000
206 Aditya 50500 206 Adi 5060
207 Rushi 30500 207 Rushi 3500
208 Vivek 45500 208 Vik 4550
209 Rohan 40500 209 Rohan 4500
210 Vicky 25500 210 Vicky 2550
211 Addy 50500 NULL NULL NULL
212 Rose 30500 NULL NULL NULL
213 Vikrant 45500 NULL NULL NULL
214 Rohini 10500 NULL NULL NULL
215 Simran 35500 NULL NULL NULL
216 Samli 55500 NULL NULL NULL
217 Jeetu 35500 NULL NULL NULL
218 Sanket 43500 NULL NULL NULL
219 Sudhir 45500 NULL NULL NULL
220 Gagan 24500 NULL NULL NULL
221 Deep 59500 NULL NULL NULL
222 Rudra 36500 NULL NULL NULL
223 Shruti 45800 NULL NULL NULL
224 Suruchi 42500 NULL NULL NULL
225 Vaibhav 28500 NULL NULL NULL
226 Kaustubh 57500 NULL NULL NULL
227 Jitesh 33500 NULL NULL NULL
228 Mayur 44500 NULL NULL NULL
229 Mohit 41500 NULL NULL NULL
Time taken: 0.261 seconds, Fetched: 30 row(s)
RIGHT OUTER JOIN:-
RIGHT OUTER JOIN returns all the values from the right table and the matched values from the left table
Syntax: - CREATE TABLE TableName AS
SELECT (ColNames) FROM Table1 RIGHT OUTER JOIN Table2 ON(CommonCol1=CommonCol2)
hive (hivedb)> create table rightjoin as select a.id,a.name,a.salary,b.eid,b.nickname,b.bonus from samptab1 a right outer join samptab2 b on (a.id=b.eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202055210_14f246dd-2f8c-46db-b679-4fcbcb706632
Total jobs = 1
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hive/lib/log4j-slf4j-impl-2.4.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
2017-02-02 05:52:38 Starting to launch local task to process map join; maximum memory = 518979584
2017-02-02 05:52:42 Dump the side-table for tag: 0 with group count: 30 into file: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-52-10_610_1856544679281279376-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile20--.hashtable
2017-02-02 05:52:42 Uploaded 1 File to: file:/tmp/mydir/f6c8f88e-ec31-4167-965d-bb2c0af9d4cf/hive_2017-02-02_05-52-10_610_1856544679281279376-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile20--.hashtable (1150 bytes)
2017-02-02 05:52:42 End of local task; Time Taken: 4.03 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1486003801406_0006, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0006/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1486003801406_0006
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2017-02-02 05:53:18,278 Stage-4 map = 0%, reduce = 0%
2017-02-02 05:53:40,828 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 3.88 sec
MapReduce Total cumulative CPU time: 3 seconds 880 msec
Ended Job = job_1486003801406_0006
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/rightjoin
MapReduce Jobs Launched:
Stage-Stage-4: Map: 1 Cumulative CPU: 3.88 sec HDFS Read: 6680 HDFS Write: 643 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 880 msec
OK
Time taken: 93.363 seconds
hive (hivedb)> select * from rightjoin;
OK
200 Snehal 50000 200 Sonu 5000
201 Nikhil 30000 201 Nik 3000
202 Samir 45000 202 Sam 4000
203 Rahul 40000 203 Chotu 4000
204 Ajay 25000 204 Aju 5000
205 Prashant 21000 205 Prashu 2000
206 Aditya 50500 206 Adi 5060
207 Rushi 30500 207 Rushi 3500
208 Vivek 45500 208 Vik 4550
209 Rohan 40500 209 Rohan 4500
210 Vicky 25500 210 Vicky 2550
NULL NULL NULL 301 Adi 5050
NULL NULL NULL 302 Rush 3050
NULL NULL NULL 303 Golu 4550
NULL NULL NULL 304 Chiku 1050
NULL NULL NULL 305 Monu 3550
NULL NULL NULL 306 Chutki 5550
NULL NULL NULL 307 Tunu 3550
NULL NULL NULL 308 Laddu 4350
NULL NULL NULL 309 Tinku 4550
NULL NULL NULL 310 Piku 2450
Time taken: 0.311 seconds, Fetched: 21 row(s)
FULL OUTER JOIN:-
FULL OUTER JOIN returns all rows from the left table and from the right table. In other words, it combines the result of both LEFT and RIGHT joins.
Syntax: - CREATE TABLE TableName AS
SELECT (ColNames) FROM Table1 FULL OUTER JOIN Table2 ON(CommonCol1=CommonCol2)
hive (hivedb)> create table outerjoin as select a.id,a.name,a.salary,b.eid,b.nickname,b.bonus from samptab1 a full outer join samptab2 b on (a.id=b.eid);
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hduser_20170202061350_e5de9815-9bb6-4143-92e7-55822f207daa
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1486003801406_0007, Tracking URL = http://localhost:8088/proxy/application_1486003801406_0007/
Kill Command = /usr/local/hadoop/bin/hadoop job -kill job_1486003801406_0007
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2017-02-02 06:14:46,982 Stage-1 map = 0%, reduce = 0%
2017-02-02 06:15:47,277 Stage-1 map = 0%, reduce = 0%
2017-02-02 06:16:48,323 Stage-1 map = 0%, reduce = 0%
2017-02-02 06:17:49,415 Stage-1 map = 0%, reduce = 0%
2017-02-02 06:18:11,071 Stage-1 map = 50%, reduce = 0%, Cumulative CPU 40.22 sec
2017-02-02 06:18:15,680 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 79.25 sec
2017-02-02 06:18:40,896 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 83.98 sec
2017-02-02 06:18:43,866 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 85.31 sec
MapReduce Total cumulative CPU time: 1 minutes 25 seconds 310 msec
Ended Job = job_1486003801406_0007
Moving data to directory hdfs://localhost:9000/user/hive/warehouse/hivedb.db/outerjoin
MapReduce Jobs Launched:
Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 85.31 sec HDFS Read: 15689 HDFS Write: 1131 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 25 seconds 310 msec
OK
Time taken: 296.676 seconds
great snesha ...great content ...!
ReplyDeleteNice post ! Thanks for sharing valuable information with us. Keep sharing.Big data hadoop online training
ReplyDeletethakyou it vry nice blog for beginners
ReplyDeletehttps://www.emexotechnologies.com/courses/big-data-analytics-training/big-data-hadoop-training/
It is really a great work and the way in which you are sharing the knowledge is excellent.
ReplyDeletebig data company in chennai