Extracting data from HTML pages with Perl

Extracting data from HTML pages is a procedure used in more and more applications. Usually, code uses regular expressions to pull data out of an HTML page. This is not a good solution when a complex data structure needs to be extracted: the regular expressions grow large and become difficult to understand when they need to be changed later, or when another developer wants to modify them.
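To illustrate the problem, here is a hedged sketch of the regular-expression approach. The HTML fragment and class names below are made up for illustration; the point is that the whole page structure ends up encoded in one pattern, so any markup change means rewriting the regex.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical HTML fragment; the class names are invented for this example.
my $html = '<div class="story"><h2 class="title"><a href="http://example.com/1">First story</a></h2></div>'
         . '<div class="story"><h2 class="title"><a href="http://example.com/2">Second story</a></h2></div>';

# The entire markup structure is baked into the pattern; adding one more
# field (or surviving one markup change) means editing this regex.
my @stories;
while ($html =~ m{<div class="story"><h2 class="title"><a href="([^"]+)">([^<]+)</a></h2></div>}g) {
    push @stories, { link => $1, title => $2 };
}

print "$_->{title} => $_->{link}\n" for @stories;
```

With a template-based approach, the same structure is expressed as annotated HTML instead of an opaque pattern.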

I found another solution: I have created a Perl module that extracts data using a template. The template is HTML code with special tags and attributes placed where the data to extract appears.

WebScraper is a Perl module for extracting listed data from an HTML page. "Listed data" here means repeated data items enclosed in HTML tags of similar structure.

The WebScraper module can be downloaded here.

In this article I will describe how to use this module. I have created a small script that extracts the list of news items from the Google News page at http://news.google.com/news?pz=1&ned=us&topic=b&hl=en&q

Google also provides an RSS feed for news, but the data in the feed is slightly different: items appear in the feed later than they do on the page. So if you want the latest news as soon as possible, it is better to take the news from the page itself.

Here is the code of the script that uses WebScraper to extract the news from the page.

#!/usr/bin/perl -w

use strict;
use FindBin;
use lib $FindBin::Bin;

use CGI::Carp qw(fatalsToBrowser);
use WebScraper;
use CGI;

my $q = CGI->new;
print $q->header(-charset => 'utf-8');

my $newsurl = 'http://news.google.com/news?pz=1&ned=us&topic=b&hl=en&q';

my $g = WebScraper->new(
    debuglevel => 0,
    trim       => 1,
);

$g->loadTemplate('newslist.htm');

$g->getPage($newsurl);

my @a = $g->grabListedData();

foreach my $k (@a) {
    foreach my $p (keys %$k) {
        print "$p => $$k{$p}<br>\n";
    }
    print '<hr>';
}

The main fragment is as follows.

my $newsurl = 'http://news.google.com/news?pz=1&ned=us&topic=b&hl=en&q';

# initialize the WebScraper module
my $g = WebScraper->new(
    debuglevel => 0,
    trim       => 1,  # remove leading and trailing whitespace from extracted data
);

$g->loadTemplate('newslist.htm');  # load the template from a file

$g->getPage($newsurl);             # load the page by URL

my @a = $g->grabListedData();      # start the data extraction

# the following code prints the extracted data
foreach my $k (@a) {
    foreach my $p (keys %$k) {
        print "$p => $$k{$p}<br>\n";
    }
    print '<hr>';
}

The array returned by the grabListedData function is an array of hash references. Each item is a hash whose keys are the names specified in the template.
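To make the shape of that result concrete, here is a sketch with a hand-built stand-in for what grabListedData() might return for this template. The field values are invented; the keys (title, sourcelink, time) correspond to datatag names that appear in the template below.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-built stand-in for a grabListedData() result; the values are
# invented, and the keys mirror the template's datatag names.
my @news = (
    { title => 'Markets rally',  sourcelink => 'http://example.com/a', time => '1 hour ago' },
    { title => 'Oil prices dip', sourcelink => 'http://example.com/b', time => '2 hours ago' },
);

# Each element is a hash reference, so fields are accessed with ->{key}.
for my $item (@news) {
    printf "%s (%s) - %s\n", $item->{title}, $item->{time}, $item->{sourcelink};
}
```

Because the result is a plain array of hash references, it can be sorted, filtered, or serialized with ordinary Perl list operations.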

The template used here (the file 'newslist.htm') is:

<div class="story">
 <datatag name="null" -pass="all">
 <h2 class="title">
 <datatag name="null" -pass="all">
 <input class="unstar-url">
 <datatag name="null" -pass="all">
 <datatag name="sourcelink" -extractfrom="a" -attrforextract="href">
 <datatag name="title" -pass="b">
 </a>
 </h2>
 <div class="sub-title">
 <span class="source">
 <datatag name="sourcecompany">
 </span>
 <datatag name="null" -pass="all">
 <span class="date"><datatag name="time"></span>
 </div>
 <div class="body">
 <div class="snippet">
 <datatag name="shorttext" -pass="all">
 </div>
 <datatag name="additional-article" -pass="all" -astext="yes">
 
 <div class="sources">
 <datatag name="sources" -pass="all" -astext="yes">
 <div class="moreLinks">
 <datatag name="moreLinks" -extractfrom="a" -attrforextract="href"></a>
 </div>
 <norequired>
 <div class="stock-tckrs">
 <datatag name="stock-tckrs" -pass="all" -astext="yes">
 </div>
 </norequired>

<div class="r">

This is HTML code with special <datatag> tags. These tags are placed where the data should appear. The values of their name attributes become the keys of the resulting hashes.

The tag <datatag name="null" -pass="all"> is used when the HTML page contains a long fragment that must be ignored. A tag named null is not put into the resulting hash.

There is also another special tag, <norequired>. It marks a fragment that may or may not be present in the HTML page.

For details, see the Perl WebScraper manual.

A working sample is available here.

Last Updated on Friday, 26 February 2010 08:32
 
