XML File Processing using Pig
XML (Extensible Markup Language) is a file format designed to store and transport data. Like HTML, it is a markup language. XML files fall under the semi-structured data category.
Sample XML file parts.xml :-
<?xml-stylesheet type="text/css" href="xmlpartsstyle.css"?>
<PARTS>
<PART>
<ITEM>Motherboard</ITEM>
<MANUFACTURER>ASUS</MANUFACTURER>
<MODEL>P3B-F</MODEL>
<COST> 123.00</COST>
</PART>
<PART>
<ITEM>Video Card</ITEM>
<MANUFACTURER>ATI</MANUFACTURER>
<MODEL>All-in-Wonder Pro</MODEL>
<COST> 160.00</COST>
</PART>
<PART>
<ITEM>Sound Card</ITEM>
<MANUFACTURER>Creative Labs</MANUFACTURER>
<MODEL>Sound Blaster Live</MODEL>
<COST> 80.00</COST>
</PART>
<PART>
<ITEM>15 inch Monitor</ITEM>
<MANUFACTURER>LG Electronics</MANUFACTURER>
<MODEL> 995E</MODEL>
<COST> 290.00</COST>
</PART>
</PARTS>
XMLLoader():- The XMLLoader() function from piggybank loads an XML file and parses it into records, one for each element with the specified tag.
We can extract XML data using two methods:-
1. Using Regular Expression
2. Using XPath()
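Both methods start from the records that XMLLoader() produces. As a minimal sketch (it reuses the piggybank.jar and parts.xml paths from the scripts in this post, which may differ on your machine, and the alias raw_parts is only an illustrative name), the following loads the file and prints the raw records, so you can see that each PART element arrives as a single chararray with its tags still in place:-

```pig
-- register piggybank.jar, which provides XMLLoader (the path is installation-specific)
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar;
-- each <PART>...</PART> element becomes one record holding the raw XML text
raw_parts = LOAD '/home/hduser/PIG/parts.xml' USING org.apache.pig.piggybank.storage.XMLLoader('PART') AS (inputdata:chararray);
-- print the schema, then the unparsed records
DESCRIBE raw_parts;
DUMP raw_parts;
```

Checking the raw records first makes it easier to write a regular expression or XPath expression that matches what is really in the field.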
Example 1
Download Sample_XML_Dataset_Example1
tags:-
<PART>
<ITEM>Name</ITEM>
<MANUFACTURER>CompanyName</MANUFACTURER>
<MODEL>ModelNumber</MODEL>
<COST> Cost</COST>
</PART>
Pig script (partsxmlproc_regex.pig) to load and extract the XML data using Regular Expression:-
--registering piggybank.jar
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar
--loading the xml data using XMLLoader()
input_xml= load '/home/hduser/PIG/parts.xml' using org.apache.pig.piggybank.storage.XMLLoader('PART') as (inputdata:chararray);
--Applying Regular Expression
parse_data= FOREACH input_xml GENERATE FLATTEN (REGEX_EXTRACT_ALL(inputdata,'<PART>\\s*<ITEM>(.*)</ITEM>\\s*<MANUFACTURER>(.*)</MANUFACTURER>\\s*<MODEL>(.*)</MODEL>\\s$
--storing the output file into xml_out
store parse_data into 'xml_out';
--output on console
dump parse_data;
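The REGEX_EXTRACT_ALL line in the listing above is cut off at the right edge (the trailing $ marks the cut). The sketch below is one plausible full version, not necessarily the original: everything from <COST> onward is an assumption based on the tag layout shown for Example 1, which has four fields per PART.

```pig
-- register piggybank.jar, which provides XMLLoader (the path is installation-specific)
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar;
-- load each <PART> element as one chararray record
input_xml = LOAD '/home/hduser/PIG/parts.xml' USING org.apache.pig.piggybank.storage.XMLLoader('PART') AS (inputdata:chararray);
-- assumed completion: the pattern from <COST> to </PART> is reconstructed, not copied from the source
-- REGEX_EXTRACT_ALL must match the whole record, so the pattern runs from <PART> to </PART>
parse_data = FOREACH input_xml GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(inputdata,
        '<PART>\\s*<ITEM>(.*)</ITEM>\\s*<MANUFACTURER>(.*)</MANUFACTURER>\\s*<MODEL>(.*)</MODEL>\\s*<COST>(.*)</COST>\\s*</PART>'));
-- output on console
DUMP parse_data;
```

Each (.*) group captures the text of one tag, and \\s* absorbs the line breaks and indentation between tags.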
Execution of Pig script partsxmlproc_regex.pig :-
[hduser@localhost PIG]$ pig -x local partsxmlproc_regex.pig
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hbase-1.2.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
Input(s):
Successfully read 4 records from: "/home/hduser/PIG/parts.xml"
Output(s):
Successfully stored 4 records in: "file:/tmp/temp-1123944577/tmp-68140501"
Counters:
Total records written : 4
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
2017-02-06 09:17:53,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-02-06 09:17:53,257 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-02-06 09:17:53,258 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-02-06 09:17:53,317 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-02-06 09:17:53,317 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(Motherboard,ASUS,P3B-F, 123.00)
(Video Card,ATI,All-in-Wonder Pro, 160.00)
(Sound Card,Creative Labs,Sound Blaster Live, 80.00)
(15 inch Monitor,LG Electronics, 995E, 290.00)
2017-02-06 09:17:53,535 [main] INFO org.apache.pig.Main - Pig script completed in 20 seconds and 773 milliseconds (20773 ms)
[hduser@localhost PIG]$ ls
booksdata.xml bookxmlproc.pig data2.log MaxPageHits pig_1486405214028.log sample_log TotalHitsCount xmlprocessing.pig
books.xml CSVDataForHive data3.log parts.xml pig_1486406651245.log sample.xml xml_out
bookxmldataproc.pig data1.log logproc.pig partsxmlproc.pig SalesJan2009.csv smallwikipedia.csv xml_out_xpath
[hduser@localhost PIG]$ cd xml_out
[hduser@localhost xml_out]$ ls
part-m-00000 _SUCCESS
[hduser@localhost xml_out]$ cat part-m-00000
Motherboard ASUS P3B-F 123.00
Video Card ATI All-in-Wonder Pro 160.00
Sound Card Creative Labs Sound Blaster Live 80.00
15 inch Monitor LG Electronics 995E 290.00
Method 2:-
Using XPath()
Pig script (partsxmlproc_xpath.pig) to load and extract the XML data using XPath() :-
--registering piggybank.jar
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
--loading the xml data using XMLLoader()
input_xml= load '/home/hduser/PIG/parts.xml' using org.apache.pig.piggybank.storage.XMLLoader('PART') as (inputdata:chararray);
--Applying transformation using XPath
parse_data= FOREACH input_xml GENERATE XPath(inputdata,'PART/ITEM'),XPath(inputdata,'PART/MANUFACTURER'),XPath(inputdata,'PART/MODEL'),XPath(inputdata,'PART/COST');
--storing the output file into xml_out_xpath
store parse_data into 'xml_out_xpath';
--output on console
dump parse_data;
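The XPath() calls above return unnamed chararray fields. As a hedged sketch of a possible next step (the alias typed_parts and the field names item, manufacturer, model and cost are illustrative choices, not part of the original script), the fields can be named and COST converted to a number:-

```pig
-- register piggybank.jar and define the XPath UDF (the path is installation-specific)
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar;
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
-- load each <PART> element as one chararray record
input_xml = LOAD '/home/hduser/PIG/parts.xml' USING org.apache.pig.piggybank.storage.XMLLoader('PART') AS (inputdata:chararray);
-- name every extracted field; TRIM drops the leading space seen in the COST values before the cast
typed_parts = FOREACH input_xml GENERATE
    XPath(inputdata, 'PART/ITEM') AS item,
    XPath(inputdata, 'PART/MANUFACTURER') AS manufacturer,
    XPath(inputdata, 'PART/MODEL') AS model,
    (double) TRIM(XPath(inputdata, 'PART/COST')) AS cost;
-- print the resulting schema, then the records
DESCRIBE typed_parts;
DUMP typed_parts;
```

With named, typed fields, later statements such as FILTER or ORDER can refer to cost directly instead of positional references like $3.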
Execution of Pig script partsxmlproc_xpath.pig :-
[hduser@localhost PIG]$ pig -x local partsxmlproc_xpath.pig
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hbase-1.2.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
Input(s):
Successfully read 4 records from: "/home/hduser/PIG/parts.xml"
Output(s):
Successfully stored 4 records in: "file:/tmp/temp-1123944577/tmp-68140501"
Counters:
Total records written : 4
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
2017-02-06 09:17:53,254 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-02-06 09:17:53,257 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-02-06 09:17:53,258 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-02-06 09:17:53,317 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-02-06 09:17:53,317 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(Motherboard,ASUS,P3B-F, 123.00)
(Video Card,ATI,All-in-Wonder Pro, 160.00)
(Sound Card,Creative Labs,Sound Blaster Live, 80.00)
(15 inch Monitor,LG Electronics, 995E, 290.00)
2017-02-06 09:28:14,128 [main] INFO org.apache.pig.Main - Pig script completed in 21 seconds and 138 milliseconds (21138 ms)
[hduser@localhost PIG]$ ls
booksdata.xml bookxmlproc.pig data2.log MaxPageHits pig_1486405214028.log sample_log TotalHitsCount xmlprocessing.pig
books.xml CSVDataForHive data3.log parts.xml pig_1486406651245.log sample.xml xml_out
bookxmldataproc.pig data1.log logproc.pig partsxmlproc.pig SalesJan2009.csv smallwikipedia.csv xml_out_xpath
[hduser@localhost PIG]$ cd xml_out_xpath/
[hduser@localhost xml_out_xpath]$ ls
part-m-00000 _SUCCESS
[hduser@localhost xml_out_xpath]$ cat part-m-00000
Motherboard ASUS P3B-F 123.00
Video Card ATI All-in-Wonder Pro 160.00
Sound Card Creative Labs Sound Blaster Live 80.00
15 inch Monitor LG Electronics 995E 290.00
Example 2
Download Sample_XML_Dataset_Example2
tags:-
<book id="id_num">
<author>Author_Name</author>
<title> Book_Title</title>
<genre>Category</genre>
<price> price</price>
<publish_date> date</publish_date>
<description> details of book </description>
</book>
Sample dataset books.xml (first two of the 12 records):-
<book id="bk101">
<author>GambardellaMatthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<book id="bk102">
<author>RallsKim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
Pig script (bookxmlproc.pig) to load and extract the XML data using Regular Expression:-
--registering the piggybank.jar
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar
--load the xml file using XMLLoader()
input_xml= load '/home/hduser/PIG/books.xml' using org.apache.pig.piggybank.storage.XMLLoader('book') as (inputdata:chararray);
--transforming the loaded data using regular expression
parse_data= FOREACH input_xml GENERATE FLATTEN (REGEX_EXTRACT_ALL(inputdata,'<book\\s+id="(.*)">\\s*<author>(.*)</author>\\s*<title>(.*)</title>\\s*<genre>(.*)</genre>$
--storing the extracted data
store parse_data into 'book_extracted_data';
--output on console
dump parse_data;
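As in Example 1, the REGEX_EXTRACT_ALL line above is cut off after the genre group. The sketch below is an assumed completion that follows the book tag layout listed earlier (price, publish_date, description). It is not necessarily the original pattern: the description group uses [\\s\\S]* instead of .* so that it still matches if the loader keeps the line breaks inside a multi-line description.

```pig
-- register piggybank.jar, which provides XMLLoader (the path is installation-specific)
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar;
-- load each <book> element as one chararray record
input_xml = LOAD '/home/hduser/PIG/books.xml' USING org.apache.pig.piggybank.storage.XMLLoader('book') AS (inputdata:chararray);
-- assumed completion: everything after the genre group is reconstructed from the tag layout
-- the first group captures the id attribute, the remaining groups capture the tag contents
parse_data = FOREACH input_xml GENERATE FLATTEN(
    REGEX_EXTRACT_ALL(inputdata,
        '<book\\s+id="(.*)">\\s*<author>(.*)</author>\\s*<title>(.*)</title>\\s*<genre>(.*)</genre>\\s*<price>(.*)</price>\\s*<publish_date>(.*)</publish_date>\\s*<description>([\\s\\S]*)</description>\\s*</book>'));
-- output on console
DUMP parse_data;
```

Because REGEX_EXTRACT_ALL has to match the whole record, a book element with a missing or reordered tag would produce a null tuple rather than a partial result.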
Execution of Pig script bookxmlproc.pig :-
[hduser@localhost PIG]$ pig -x local bookxmlproc.pig
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hbase-1.2.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/06 10:35:58 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/02/06 10:35:58 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2017-02-06 10:35:58,848 [main] INFO org.apache.pig.Main - Apache Pig version 0.16.0 (r1746530) compiled Jun 01 2016, 23:10:49
2017-02-06 10:35:58,848 [main] INFO org.apache.pig.Main - Logging error messages to: /home/hduser/PIG/pig_1486406158837.log
Input(s):
Successfully read 12 records from: "/home/hduser/PIG/books.xml"
Output(s):
Successfully stored 12 records in: "file:/tmp/temp406085478/tmp-105729446"
Counters:
Total records written : 12
2017-02-06 10:36:11,129 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-02-06 10:36:11,138 [main] INFO org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-02-06 10:36:11,139 [main] WARN org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-02-06 10:36:11,200 [main] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-02-06 10:36:11,200 [main] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(bk101,GambardellaMatthew,XML Developer's Guide,Computer,44.95,2000-10-01,An in-depth look at creating applications with XML.)
(bk102,RallsKim,Midnight Rain,Fantasy,5.95,2000-12-16,A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.)
(bk103,CoretsEva,Maeve Ascendant,Fantasy,5.95,2000-11-17,After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.)
(bk104,CoretsEva,Oberon's Legacy,Fantasy,5.95,2001-03-10,In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.)
(bk105,CoretsEva,The Sundered Grail,Fantasy,5.95,2001-09-10,The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.)
(bk106,RandallCynthia,Lover Birds,Romance,4.95,2000-09-02,When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.)
(bk107,ThurmanPaula,Splish Splash,Romance,4.95,2000-11-02,A deep sea diver finds true love twenty thousand leagues beneath the sea.)
(bk108,KnorrStefan,Creepy Crawlies,Horror,4.95,2000-12-06,An anthology of horror stories about roaches, centipedes, scorpions and other insects.)
(bk109,KressPeter,Paradox Lost,Science Fiction,6.95,2000-11-02,After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.)
(bk110,O'BrienTim,Microsoft .NET: The Programming Bible,Computer,36.95,2000-12-09,Microsoft's .NET initiative is explored in detail in this deep programmer's reference.)
(bk111,O'BrienTim,MSXML3: A Comprehensive Guide,Computer,36.95,2000-12-01,The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.)
(bk112,GalosMike,Visual Studio 7: A Comprehensive Guide,Computer,49.95,2001-04-16,Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.)
2017-02-06 10:36:11,407 [main] INFO org.apache.pig.Main - Pig script completed in 15 seconds and 121 milliseconds (15121 ms)
[hduser@localhost PIG]$ ls
book_extracted_data bookxmldataproc.pig data1.log logproc.pig partsxmlproc.pig SalesJan2009.csv smallwikipedia.csv xml_out_xpath
booksdata.xml bookxmlproc.pig data2.log MaxPageHits pig_1486405214028.log sample_log TotalHitsCount xmlprocessing.pig
books.xml CSVDataForHive data3.log parts.xml pig_1486406651245.log sample.xml xml_out
[hduser@localhost PIG]$ cd book_extracted_data/
[hduser@localhost book_extracted_data]$ ls
part-m-00000 _SUCCESS
[hduser@localhost book_extracted_data]$ cat part-m-00000
bk101 GambardellaMatthew XML Developer's Guide Computer 44.95 2000-10-01 An in-depth look at creating applications with XML.
bk102 RallsKim Midnight Rain Fantasy 5.95 2000-12-16 A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.
bk103 CoretsEva Maeve Ascendant Fantasy 5.95 2000-11-17 After the collapse of a nanotechnology society in England, the young survivors lay the foundation for a new society.
bk104 CoretsEva Oberon's Legacy Fantasy 5.95 2001-03-10 In post-apocalypse England, the mysterious agent known only as Oberon helps to create a new life for the inhabitants of London. Sequel to Maeve Ascendant.
bk105 CoretsEva The Sundered Grail Fantasy 5.95 2001-09-10 The two daughters of Maeve, half-sisters, battle one another for control of England. Sequel to Oberon's Legacy.
bk106 RandallCynthia Lover Birds Romance 4.95 2000-09-02 When Carla meets Paul at an ornithology conference, tempers fly as feathers get ruffled.
bk107 ThurmanPaula Splish Splash Romance 4.95 2000-11-02 A deep sea diver finds true love twenty thousand leagues beneath the sea.
bk108 KnorrStefan Creepy Crawlies Horror 4.95 2000-12-06 An anthology of horror stories about roaches, centipedes, scorpions and other insects.
bk109 KressPeter Paradox Lost Science Fiction 6.95 2000-11-02 After an inadvertant trip through a Heisenberg Uncertainty Device, James Salway discovers the problems of being quantum.
bk110 O'BrienTim Microsoft .NET: The Programming Bible Computer 36.95 2000-12-09 Microsoft's .NET initiative is explored in detail in this deep programmer's reference.
bk111 O'BrienTim MSXML3: A Comprehensive Guide Computer 36.95 2000-12-01 The Microsoft MSXML3 parser is covered in detail, with attention to XML DOM interfaces, XSLT processing, SAX and more.
bk112 GalosMike Visual Studio 7: A Comprehensive Guide Computer 49.95 2001-04-16 Microsoft Visual Studio 7 is explored in depth, looking at how Visual Basic, Visual C++, C#, and ASP+ are integrated into a comprehensive development environment.