|Extracting data from HTML pages with Perl|
Extracting data from HTML pages is a procedure used in more and more applications. Usually, programs use regular expressions to get data out of an HTML page. This is not a good solution when a complex data structure has to be extracted. Such regular expressions tend to be long and difficult to understand, which makes them hard to change later, especially for another developer.
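To illustrate the problem, here is a minimal sketch of the regular expression approach. The markup, the class names and the keys url, title and source are invented for this example and do not come from any real page. Even for three fields the pattern is already hard to read.

```perl
use strict;
use warnings;

# Hypothetical markup, invented for this illustration only.
my $html = <<'HTML';
<div class="story"><a href="http://example.com/a">First headline</a> <span class="source">Example Times</span></div>
<div class="story"><a href="http://example.com/b">Second headline</a> <span class="source">Sample Post</span></div>
HTML

my @stories;

# One pattern describes the whole item; every change in the page layout
# means editing this expression by hand.
while ($html =~ m{<div\s+class="story">\s*<a\s+href="([^"]*)">(.*?)</a>\s*<span\s+class="source">(.*?)</span>\s*</div>}gs) {
    push @stories, { url => $1, title => $2, source => $3 };
}

printf "%s (%s): %s\n", $_->{title}, $_->{source}, $_->{url} for @stories;
```

Adding a fourth field, or making one of them optional, means editing the middle of this expression, which is where mistakes usually appear.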
I found another solution. I have created a Perl module that extracts data using a template. The template is HTML code with some special tags and attributes in the places where the data to extract is located.
WebScraper is a Perl module for extracting listed data from an HTML page. Listed data here means data enclosed in HTML tags of similar structure.
The WebScraper module is available for download.
In this article I will describe how to use this module. I have created a small script that extracts a list of news items from the Google News web page. The page is http://news.google.com/news?pz=1&ned=us&topic=b&hl=en&q
Google also has an RSS feed for news, but the data in that feed is a little different. Items appear in the feed later than on the page. So if you want the latest news as soon as possible, it is better to get it from the page.
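Before anything can be extracted, the page has to be downloaded. As a minimal sketch, assuming the core HTTP::Tiny module is used for the download (the original script may do this differently, and the helper name fetch_page is hypothetical), the step could look like this:

```perl
use strict;
use warnings;
use HTTP::Tiny;

# Hypothetical helper: download a page and return its body as raw bytes.
# Dies with the HTTP status if the request fails.
sub fetch_page {
    my ($url) = @_;
    my $response = HTTP::Tiny->new(timeout => 10)->get($url);
    die "Failed to fetch $url: $response->{status} $response->{reason}\n"
        unless $response->{success};
    return $response->{content};
}

# Usage (performs a network request, so it is left commented out):
# my $html = fetch_page('http://news.google.com/news?pz=1&ned=us&topic=b&hl=en&q');
```

The returned content is not character-decoded, so a real script may also need to decode it according to the charset of the page.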
Here is the code of the script that uses WebScraper to extract news from the page.
The main fragment is as follows.
The grabListedData function returns an array of hashes. Each item is a hash whose keys are the names specified in the template.
The template used here (file 'newslist.htm') is:
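To show how such a result can be processed, here is a short sketch that walks through an array of hashes. The data is hard-coded sample data shaped like the described return value, and the keys title and url are hypothetical. The real keys are whatever names the template uses.

```perl
use strict;
use warnings;

# Sample data only: it has the shape of the result described above,
# but it is not produced by the module. The keys are hypothetical.
my @news = (
    { title => 'First headline',  url => 'http://example.com/a' },
    { title => 'Second headline', url => 'http://example.com/b' },
);

# Each element is a hash reference; print every key/value pair of every item.
for my $item (@news) {
    for my $key (sort keys %$item) {
        print "$key: $item->{$key}\n";
    }
    print "\n";
}
```

If a function returns a reference to the array instead of a list, the loop would iterate over @$ref in the same way.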
The template is HTML code with special <datatag> tags placed where the data should be. The names of these tags become the keys of the resulting hashes.
The tag <datatag name="null" -pass="all"> is used when the HTML page has a long fragment that must be ignored. A tag named null is not added to the resulting hash.
There is also another special tag, <norequired>. It marks a fragment that may or may not be present in the HTML page.
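The idea behind such a template can be sketched in a few lines of plain Perl. This is only an illustration of the concept, not the WebScraper implementation. The functions compile_template and extract_all, the sample template and the sample page are all hypothetical, and the sketch assumes that a <datatag> is written as a single tag without a closing tag. Every <datatag> becomes a named capture, and a tag named null matches text without storing it.

```perl
use strict;
use warnings;

# Hypothetical helper: translate a template into a compiled regular expression.
sub compile_template {
    my ($template) = @_;

    # split with a capture group returns: literal, name, literal, name, ...
    my @parts = split /<datatag\s+name="([A-Za-z_]\w*)"[^>]*>/, $template;

    my $pattern = '';
    while (@parts) {
        # Literal HTML must match as written; runs of whitespace are flexible.
        my $literal = quotemeta(shift @parts);
        $literal =~ s/(?:\\\s)+/\\s*/g;
        $pattern .= $literal;

        last unless @parts;
        my $name = shift @parts;

        # "null" skips text without capturing; other names become hash keys.
        $pattern .= $name eq 'null' ? '.*?' : "(?<$name>.*?)";
    }
    return qr/$pattern/s;
}

# Hypothetical helper: return one hash reference per match of the template.
sub extract_all {
    my ($template, $html) = @_;
    my $regex = compile_template($template);
    my @rows;
    while ($html =~ /$regex/g) {
        my %row = %+;    # named captures of the current match
        push @rows, \%row;
    }
    return @rows;
}

# Invented template and page, for illustration only.
my $template = '<li><b><datatag name="title"></b>'
             . '<datatag name="null" -pass="all">'
             . '<i><datatag name="source"></i></li>';

my $html = <<'HTML';
<li><b>First headline</b> <span>ad</span> <img src="x.png"> <i>Example Times</i></li>
<li><b>Second headline</b><i>Sample Post</i></li>
HTML

my @news = extract_all($template, $html);
print "$_->{title} / $_->{source}\n" for @news;
```

The advantage over a handwritten expression is that the template stays readable as HTML, so it can be updated by copying a fragment of the changed page and marking the data positions again.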
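In regular expression terms, a fragment that may or may not be present corresponds to an optional group. The sketch below shows only this analogy on invented markup. It does not use the module and does not show how the optional fragment is written in a template. The keys title and source are hypothetical.

```perl
use strict;
use warnings;

# Invented markup: the second item has no <i>...</i> fragment.
my $html = <<'HTML';
<li><b>First headline</b><i>Example Times</i></li>
<li><b>Second headline</b></li>
HTML

# The (?: ... )? group makes the whole <i>...</i> fragment optional.
my $regex = qr{<li><b>(?<title>.*?)</b>(?:<i>(?<source>.*?)</i>)?</li>}s;

my @items;
while ($html =~ /$regex/g) {
    my %item = %+;    # holds only the captures that took part in the match
    push @items, \%item;
}

for my $item (@items) {
    my $source = exists $item->{source} ? $item->{source} : '(no source)';
    print "$item->{title}: $source\n";
}
```

Code that reads the extracted hashes should therefore check whether a key exists before using a value that comes from an optional fragment.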
For details, see the Perl WebScraper manual.
A working sample is also available.
|Last Updated on Friday, 26 February 2010 08:32|