XML File Processing

XML File Processing using Pig

       XML (Extensible Markup Language) is a file format designed to store and transport data. Like HTML, XML is a markup language. XML files fall under the semi-structured data category.

 Sample XML file parts.xml :-
<?xml-stylesheet type="text/css" href="xmlpartsstyle.css"?>
<PARTS>
   <PART>
      <ITEM>Motherboard</ITEM>
      <MANUFACTURER>ASUS</MANUFACTURER>
      <MODEL>P3B-F</MODEL>
      <COST> 123.00</COST>
   </PART>
   <PART>
      <ITEM>Video Card</ITEM>
      <MANUFACTURER>ATI</MANUFACTURER>
      <MODEL>All-in-Wonder Pro</MODEL>
      <COST> 160.00</COST>
   </PART>
   <PART>
      <ITEM>Sound Card</ITEM>
      <MANUFACTURER>Creative Labs</MANUFACTURER>
      <MODEL>Sound Blaster Live</MODEL>
      <COST> 80.00</COST>
   </PART>
   <PART>
      <ITEM>15 inch Monitor</ITEM>
      <MANUFACTURER>LG Electronics</MANUFACTURER>
      <MODEL> 995E</MODEL>
      <COST> 290.00</COST>
   </PART>
</PARTS>
XMLLoader() 
        The XMLLoader() function from piggybank loads an XML file and parses it into records: each element enclosed by the tag passed to it (for example, PART) becomes one record.

The fields can then be extracted from each record using either of two methods:-
              1. Using Regular Expression
              2. Using XPath()
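
To get a feel for what one record looks like before any extraction happens, here is a small Python sketch. It is not part of the Pig job and it is not how XMLLoader is implemented internally; it only assumes that every element with the given tag ends up as one string, which is how the scripts on this page treat the loaded data. The helper name split_records and the shortened two-part sample are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two PART elements copied from parts.xml, shortened for illustration.
SAMPLE = """<PARTS>
   <PART>
      <ITEM>Motherboard</ITEM>
      <MANUFACTURER>ASUS</MANUFACTURER>
      <MODEL>P3B-F</MODEL>
      <COST> 123.00</COST>
   </PART>
   <PART>
      <ITEM>Video Card</ITEM>
      <MANUFACTURER>ATI</MANUFACTURER>
      <MODEL>All-in-Wonder Pro</MODEL>
      <COST> 160.00</COST>
   </PART>
</PARTS>"""


def split_records(xml_text, tag):
    """Return every <tag>...</tag> element as its own string (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    # strip() drops the whitespace that follows each closing tag in the file.
    return [ET.tostring(elem, encoding="unicode").strip() for elem in root.iter(tag)]


records = split_records(SAMPLE, "PART")
print(len(records))
print(records[0])
```

Each string in records starts with the opening PART tag and ends with the closing one, which is the shape both extraction methods work on.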

Example 1
                                   Download Sample_XML_Dataset_Example1
        tags:-
                     <PART>
                                  <ITEM>Name</ITEM>
                                  <MANUFACTURER>CompanyName</MANUFACTURER>
                                  <MODEL>ModelNumber</MODEL>
                                  <COST> Cost</COST>
                     </PART>

Method 1:-
                   Using Regular Expression
Pig script (partsxmlproc_regex.pig) to load and extract the XML data using Regular Expression:-
--registering piggybank.jar
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar

--loading the xml data using XMLLoader()
input_xml= load '/home/hduser/PIG/parts.xml' using org.apache.pig.piggybank.storage.XMLLoader('PART') as (inputdata:chararray);

--Applying Regular Expression
parse_data= FOREACH input_xml GENERATE FLATTEN (REGEX_EXTRACT_ALL(inputdata,'<PART>\\s*<ITEM>(.*)</ITEM>\\s*<MANUFACTURER>(.*)</MANUFACTURER>\\s*<MODEL>(.*)</MODEL>\\s$

--storing the output file into xml_out
store parse_data into 'xml_out';

--output on console
dump parse_data;
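
The regular expression in the script above is cut off after the MODEL tag, so it helps to see the idea on a single record. The Python sketch below is only an illustration and is not part of the Pig workflow. The tail of the pattern (the COST group and the closing PART tag) is an assumption inferred from the tag template of Example 1, and re.fullmatch is used on the assumption that the pattern has to cover the whole record. The names RECORD, PART_PATTERN and extract_part are made up here.

```python
import re

# One PART record copied from parts.xml (assumed shape of a loaded record).
RECORD = """<PART>
      <ITEM>Motherboard</ITEM>
      <MANUFACTURER>ASUS</MANUFACTURER>
      <MODEL>P3B-F</MODEL>
      <COST> 123.00</COST>
   </PART>"""

# ASSUMPTION: everything after </MODEL> is inferred from the tag template,
# because the pattern shown in the Pig script is truncated at that point.
PART_PATTERN = re.compile(
    r"<PART>\s*<ITEM>(.*)</ITEM>\s*"
    r"<MANUFACTURER>(.*)</MANUFACTURER>\s*"
    r"<MODEL>(.*)</MODEL>\s*"
    r"<COST>(.*)</COST>\s*</PART>"
)


def extract_part(record):
    """Return the four captured fields as a tuple, or None if the record does not match."""
    match = PART_PATTERN.fullmatch(record)
    return match.groups() if match else None


fields = extract_part(RECORD)
print(fields)  # ('Motherboard', 'ASUS', 'P3B-F', ' 123.00')
```

Each pair of parentheses in the pattern is one capture group, and the groups come back in order as one tuple. The leading space in the cost is kept, just as it appears in the console output of the Pig run.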

Execution of Pig script partsxmlproc_regex.pig :-
[hduser@localhost PIG]$ pig -x local partsxmlproc_regex.pig 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hbase-1.2.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType

Input(s):
Successfully read 4 records from: "/home/hduser/PIG/parts.xml"
Output(s):
Successfully stored 4 records in: "file:/tmp/temp-1123944577/tmp-68140501"

Counters:
Total records written : 4
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
2017-02-06 09:17:53,254 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-02-06 09:17:53,257 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-02-06 09:17:53,258 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-02-06 09:17:53,317 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-02-06 09:17:53,317 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(Motherboard,ASUS,P3B-F, 123.00)
(Video Card,ATI,All-in-Wonder Pro, 160.00)
(Sound Card,Creative Labs,Sound Blaster Live, 80.00)
(15 inch Monitor,LG Electronics, 995E, 290.00)
2017-02-06 09:17:53,535 [main] INFO  org.apache.pig.Main - Pig script completed in 20 seconds and 773 milliseconds (20773 ms)
[hduser@localhost PIG]$ ls
booksdata.xml        bookxmlproc.pig  data2.log    MaxPageHits       pig_1486405214028.log  sample_log          TotalHitsCount  xmlprocessing.pig
books.xml            CSVDataForHive   data3.log    parts.xml         pig_1486406651245.log  sample.xml          xml_out
bookxmldataproc.pig  data1.log        logproc.pig  partsxmlproc.pig  SalesJan2009.csv       smallwikipedia.csv  xml_out_xpath
[hduser@localhost PIG]$ cd xml_out
[hduser@localhost xml_out]$ ls
part-m-00000  _SUCCESS
[hduser@localhost xml_out]$ cat part-m-00000 
Motherboard    ASUS    P3B-F     123.00
Video Card    ATI    All-in-Wonder Pro     160.00
Sound Card    Creative Labs    Sound Blaster Live     80.00
15 inch Monitor    LG Electronics     995E     290.00

Method 2:-
                   Using XPath()
Pig script (partsxmlproc_xpath.pig) to load and extract the XML data using XPath() :-
--registering piggybank.jar
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar
DEFINE XPath org.apache.pig.piggybank.evaluation.xml.XPath();
--loading the xml data using XMLLoader()
input_xml= load '/home/hduser/PIG/parts.xml' using org.apache.pig.piggybank.storage.XMLLoader('PART') as (inputdata:chararray);

--Applying transformation using XPath
parse_data= FOREACH input_xml GENERATE XPath(inputdata,'PART/ITEM'),XPath(inputdata,'PART/MANUFACTURER'),XPath(inputdata,'PART/MODEL'),XPath(inputdata,'PART/COST');

--storing the output file into xml_out_xpath
store parse_data into 'xml_out_xpath';

--output on console
dump parse_data;
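
The same path lookups can be tried on one record outside Pig. The Python sketch below is an approximation only: it uses the standard library ElementTree module, which supports a subset of XPath, and it does not call the piggybank XPath UDF. The names RECORD, PATHS and extract_by_path, and the throwaway doc wrapper element, are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# One PART record copied from parts.xml (assumed shape of a loaded record).
RECORD = """<PART>
      <ITEM>Video Card</ITEM>
      <MANUFACTURER>ATI</MANUFACTURER>
      <MODEL>All-in-Wonder Pro</MODEL>
      <COST> 160.00</COST>
   </PART>"""

# The same four paths that the Pig script passes to XPath().
PATHS = ["PART/ITEM", "PART/MANUFACTURER", "PART/MODEL", "PART/COST"]


def extract_by_path(record, paths):
    """Return the text of the first element matching each path (hypothetical helper)."""
    # ElementTree paths are relative to the element they are called on, so the
    # record is wrapped in a throwaway <doc> element to make 'PART/ITEM' resolve.
    doc = ET.fromstring("<doc>" + record + "</doc>")
    return [doc.findtext(path) for path in paths]


item, manufacturer, model, cost = extract_by_path(RECORD, PATHS)
# The cost text has a leading space in the sample; strip it before converting.
print(item, manufacturer, model, float(cost.strip()))
```

Compared with the regular expression, each field is looked up by its own path, so the order of the tags inside the record does not matter.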

Execution of Pig script partsxmlproc_xpath.pig :-
[hduser@localhost PIG]$ pig -x local partsxmlproc_xpath.pig 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hbase-1.2.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/02/06 09:16:21 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType

Input(s):
Successfully read 4 records from: "/home/hduser/PIG/parts.xml"
Output(s):
Successfully stored 4 records in: "file:/tmp/temp-1123944577/tmp-68140501"

Counters:
Total records written : 4
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
2017-02-06 09:17:53,254 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-02-06 09:17:53,257 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-02-06 09:17:53,258 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-02-06 09:17:53,317 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-02-06 09:17:53,317 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1

(Motherboard,ASUS,P3B-F, 123.00)
(Video Card,ATI,All-in-Wonder Pro, 160.00)
(Sound Card,Creative Labs,Sound Blaster Live, 80.00)
(15 inch Monitor,LG Electronics, 995E, 290.00)
2017-02-06 09:28:14,128 [main] INFO  org.apache.pig.Main - Pig script completed in 21 seconds and 138 milliseconds (21138 ms)
[hduser@localhost PIG]$ ls
booksdata.xml        bookxmlproc.pig  data2.log    MaxPageHits       pig_1486405214028.log  sample_log          TotalHitsCount  xmlprocessing.pig
books.xml            CSVDataForHive   data3.log    parts.xml         pig_1486406651245.log  sample.xml          xml_out
bookxmldataproc.pig  data1.log        logproc.pig  partsxmlproc.pig  SalesJan2009.csv       smallwikipedia.csv  xml_out_xpath
[hduser@localhost PIG]$ cd xml_out_xpath/
[hduser@localhost xml_out_xpath]$ ls
part-m-00000  _SUCCESS
[hduser@localhost xml_out_xpath]$ cat part-m-00000 
Motherboard    ASUS    P3B-F     123.00
Video Card    ATI    All-in-Wonder Pro     160.00
Sound Card    Creative Labs    Sound Blaster Live     80.00
15 inch Monitor    LG Electronics     995E     290.00

Example 2
                          Download Sample_XML_Dataset_Example2          
         tags:-
                      <book id="id_num">
                          <author>Author_Name</author>
                          <title> Book_Title</title>
                          <genre>Category</genre>
                          <price> price</price>
                          <publish_date> date</publish_date>
                          <description> details of book </description>
                      </book>
Sample dataset books.xml
 <book id="bk101">
      <author>GambardellaMatthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>
   <book id="bk102">
      <author>RallsKim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies,
      an evil sorceress, and her own childhood to become queen
      of the world.</description>
   </book>

Pig script (bookxmlproc.pig) to load and extract the XML data using Regular Expression:-
--registering the piggybank.jar
REGISTER /usr/local/hadoop/pig/lib/piggybank.jar

--load the xml file using XMLLoader()
input_xml= load '/home/hduser/PIG/books.xml' using org.apache.pig.piggybank.storage.XMLLoader('book') as (inputdata:chararray);

--transforming the loaded data using regular expression
parse_data= FOREACH input_xml GENERATE FLATTEN (REGEX_EXTRACT_ALL(inputdata,'<book\\s+id="(.*)">\\s*<author>(.*)</author>\\s*<title>(.*)</title>\\s*<genre>(.*)</genre>$

--storing the extracted data
store parse_data into 'book_extracted_data';

--output on console
dump parse_data;
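
The pattern in this script is also cut off, this time after the genre tag. The Python sketch below shows the idea on the first book record and is not part of the Pig workflow. The groups for price, publish_date and description are assumptions inferred from the tag template of Example 2. The description runs over two lines in the sample, so the sketch uses re.DOTALL with non-greedy groups and then collapses the whitespace; how the original Pig pattern deals with the line break cannot be seen from the truncated line. BOOK_PATTERN and extract_book are made-up names.

```python
import re

# First book record copied from books.xml (assumed shape of a loaded record).
RECORD = """<book id="bk101">
      <author>GambardellaMatthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications
      with XML.</description>
   </book>"""

# ASSUMPTION: the groups after </genre> are inferred from the tag template,
# because the pattern shown in the Pig script is truncated at that point.
BOOK_PATTERN = re.compile(
    r'<book\s+id="(.*?)">\s*'
    r"<author>(.*?)</author>\s*"
    r"<title>(.*?)</title>\s*"
    r"<genre>(.*?)</genre>\s*"
    r"<price>(.*?)</price>\s*"
    r"<publish_date>(.*?)</publish_date>\s*"
    r"<description>(.*?)</description>\s*</book>",
    re.DOTALL,  # lets the description group span the line break
)


def extract_book(record):
    """Return (id, author, title, genre, price, publish_date, description) or None."""
    match = BOOK_PATTERN.fullmatch(record)
    if match is None:
        return None
    *head, description = match.groups()
    # Collapse the newline and indentation inside the description to single spaces.
    return (*head, " ".join(description.split()))


book = extract_book(RECORD)
print(book)
```

The first group captures the id attribute from inside the opening tag, which is why the pattern starts with the attribute rather than with a plain book tag. Unlike the Pig output shown further down, this sketch squeezes the run of spaces in the description down to one.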

Execution of Pig script bookxmlproc.pig :-
[hduser@localhost PIG]$ pig -x local bookxmlproc.pig 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.6.0/hbase-1.2.2/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
17/02/06 10:35:58 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
17/02/06 10:35:58 INFO pig.ExecTypeProvider: Picked LOCAL as the ExecType
2017-02-06 10:35:58,848 [main] INFO  org.apache.pig.Main - Apache Pig version 0.16.0 (r1746530) compiled Jun 01 2016, 23:10:49
2017-02-06 10:35:58,848 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hduser/PIG/pig_1486406158837.log

Input(s):
Successfully read 12 records from: "/home/hduser/PIG/books.xml"
Output(s):
Successfully stored 12 records in: "file:/tmp/temp406085478/tmp-105729446"

Counters:
Total records written : 12
2017-02-06 10:36:11,129 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2017-02-06 10:36:11,138 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2017-02-06 10:36:11,139 [main] WARN  org.apache.pig.data.SchemaTupleBackend - SchemaTupleBackend has already been initialized
2017-02-06 10:36:11,200 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2017-02-06 10:36:11,200 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(bk101,GambardellaMatthew,XML Developer's Guide,Computer,44.95,2000-10-01,An in-depth look at creating applications       with XML.)
(bk102,RallsKim,Midnight Rain,Fantasy,5.95,2000-12-16,A former architect battles corporate zombies,       an evil sorceress, and her own childhood to become queen       of the world.)
(bk103,CoretsEva,Maeve Ascendant,Fantasy,5.95,2000-11-17,After the collapse of a nanotechnology       society in England, the young survivors lay the       foundation for a new society.)
(bk104,CoretsEva,Oberon's Legacy,Fantasy,5.95,2001-03-10,In post-apocalypse England, the mysterious       agent known only as Oberon helps to create a new life       for the inhabitants of London. Sequel to Maeve       Ascendant.)
(bk105,CoretsEva,The Sundered Grail,Fantasy,5.95,2001-09-10,The two daughters of Maeve, half-sisters,       battle one another for control of England. Sequel to       Oberon's Legacy.)
(bk106,RandallCynthia,Lover Birds,Romance,4.95,2000-09-02,When Carla meets Paul at an ornithology       conference, tempers fly as feathers get ruffled.)
(bk107,ThurmanPaula,Splish Splash,Romance,4.95,2000-11-02,A deep sea diver finds true love twenty       thousand leagues beneath the sea.)
(bk108,KnorrStefan,Creepy Crawlies,Horror,4.95,2000-12-06,An anthology of horror stories about roaches,      centipedes, scorpions  and other insects.)
(bk109,KressPeter,Paradox Lost,Science Fiction,6.95,2000-11-02,After an inadvertant trip through a Heisenberg      Uncertainty Device, James Salway discovers the problems       of being quantum.)
(bk110,O'BrienTim,Microsoft .NET: The Programming Bible,Computer,36.95,2000-12-09,Microsoft's .NET initiative is explored in       detail in this deep programmer's reference.)
(bk111,O'BrienTim,MSXML3: A Comprehensive Guide,Computer,36.95,2000-12-01,The Microsoft MSXML3 parser is covered in       detail, with attention to XML DOM interfaces, XSLT processing,       SAX and more.)
(bk112,GalosMike,Visual Studio 7: A Comprehensive Guide,Computer,49.95,2001-04-16,Microsoft Visual Studio 7 is explored in depth,      looking at how Visual Basic, Visual C++, C#, and ASP+ are       integrated into a comprehensive development       environment.)
2017-02-06 10:36:11,407 [main] INFO  org.apache.pig.Main - Pig script completed in 15 seconds and 121 milliseconds (15121 ms)
[hduser@localhost PIG]$ ls
book_extracted_data  bookxmldataproc.pig  data1.log  logproc.pig  partsxmlproc.pig       SalesJan2009.csv  smallwikipedia.csv  xml_out_xpath
booksdata.xml        bookxmlproc.pig      data2.log  MaxPageHits  pig_1486405214028.log  sample_log        TotalHitsCount      xmlprocessing.pig
books.xml            CSVDataForHive       data3.log  parts.xml    pig_1486406651245.log  sample.xml        xml_out
[hduser@localhost PIG]$ cd book_extracted_data/
[hduser@localhost book_extracted_data]$ ls
part-m-00000  _SUCCESS
[hduser@localhost book_extracted_data]$ cat part-m-00000 
bk101    GambardellaMatthew    XML Developer's Guide    Computer    44.95    2000-10-01    An in-depth look at creating applications       with XML.
bk102    RallsKim    Midnight Rain    Fantasy    5.95    2000-12-16    A former architect battles corporate zombies,       an evil sorceress, and her own childhood to become queen       of the world.
bk103    CoretsEva    Maeve Ascendant    Fantasy    5.95    2000-11-17    After the collapse of a nanotechnology       society in England, the young survivors lay the       foundation for a new society.
bk104    CoretsEva    Oberon's Legacy    Fantasy    5.95    2001-03-10    In post-apocalypse England, the mysterious       agent known only as Oberon helps to create a new life       for the inhabitants of London. Sequel to Maeve       Ascendant.
bk105    CoretsEva    The Sundered Grail    Fantasy    5.95    2001-09-10    The two daughters of Maeve, half-sisters,       battle one another for control of England. Sequel to       Oberon's Legacy.
bk106    RandallCynthia    Lover Birds    Romance    4.95    2000-09-02    When Carla meets Paul at an ornithology       conference, tempers fly as feathers get ruffled.
bk107    ThurmanPaula    Splish Splash    Romance    4.95    2000-11-02    A deep sea diver finds true love twenty       thousand leagues beneath the sea.
bk108    KnorrStefan    Creepy Crawlies    Horror    4.95    2000-12-06    An anthology of horror stories about roaches,      centipedes, scorpions  and other insects.
bk109    KressPeter    Paradox Lost    Science Fiction    6.95    2000-11-02    After an inadvertant trip through a Heisenberg      Uncertainty Device, James Salway discovers the problems       of being quantum.
bk110    O'BrienTim    Microsoft .NET: The Programming Bible    Computer    36.95    2000-12-09    Microsoft's .NET initiative is explored in       detail in this deep programmer's reference.
bk111    O'BrienTim    MSXML3: A Comprehensive Guide    Computer    36.95    2000-12-01    The Microsoft MSXML3 parser is covered in       detail, with attention to XML DOM interfaces, XSLT processing,       SAX and more.
bk112    GalosMike    Visual Studio 7: A Comprehensive Guide    Computer    49.95    2001-04-16    Microsoft Visual Studio 7 is explored in depth,      looking at how Visual Basic, Visual C++, C#, and ASP+ are       integrated into a comprehensive development       environment.
