How to extract useful content from an HTML page

One of the most important tasks in information retrieval is to extract useful content from Web pages.

The problem is that Web pages (usually HTML) contain the information (text) we want mixed with tons of additional text - navigation structures, advertising, etc. From the point of view of information retrieval tools, each HTML page has main (useful) content plus supporting material that is helpful when browsing the Web, but not when extracting the data.

The task is to filter the useful content out of the HTML without knowing the structure of the page.

It is not difficult to get the useful content when the structure of the HTML page is known - for example, by using WebScraper. But it is not possible to create templates or regexp expressions for every page on the Web.

There is a simple solution for this task.

As is well known, an HTML page can be represented as a tree.

Here is an example as HTML code:

<div id="container">

 <h1>Main Heading</h1>
 <p>Most programmers <em>rely</em> on caffeine</p>
 <p>Most programmers like Anime</p>

 <div class="lister">
 
 <h2>The 'P' Interpreted Languages</h2>
 <ul>
 <li>Perl</li>
 <li>PHP</li>
 <li>Python</li>
 </ul>
 
 </div>

</div>
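
Here is a small sketch (assuming Python with the third-party BeautifulSoup library - one of many ways to build the tree; the article itself does not prescribe a parser) that parses the fragment above and prints its node tree:

# A minimal sketch: parse the HTML fragment above and print its tag tree.
# Assumes the third-party beautifulsoup4 package (pip install beautifulsoup4).
from bs4 import BeautifulSoup

html = """
<div id="container">
 <h1>Main Heading</h1>
 <p>Most programmers <em>rely</em> on caffeine</p>
 <p>Most programmers like Anime</p>
 <div class="lister">
 <h2>The 'P' Interpreted Languages</h2>
 <ul>
 <li>Perl</li>
 <li>PHP</li>
 <li>Python</li>
 </ul>
 </div>
</div>
"""

def print_tree(node, depth=0):
    # Print the tag name indented by its depth, then recurse into children.
    print("  " * depth + node.name)
    for child in node.find_all(recursive=False):
        print_tree(child, depth + 1)

soup = BeautifulSoup(html, "html.parser")
print_tree(soup.div)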

My idea for extracting the useful content is to represent the HTML page as a tree of nodes and then check the list of nodes at one level to see how the text is distributed between them. Then take the node that holds the major part of the text and process its subnodes with the same procedure.

Let's look at an example.

<html>
<head>
<title> My New Web Page </title>
</head>

<body>
<table>
<tr><td colspan="2"><h1> Welcome to My Web Page! </h1></td></tr>
<tr>
<td width="200">
<a>Menu item 1</a> <br>
<a>Menu item 2</a> <br>
<a>Menu item 3</a> <br>
</td>
<td id="maincontent">
<p>
This page illustrates how you can write proper HTML 
using only a text editor, such as Windows Notepad. You can also 
download a free text editor, such as Crimson Editor, which is 
better than Notepad.
</p>

<p>
There is a small graphic after the period at the end of this sentence. 
<img src="/images/mouse.gif" alt="Mousie" width="32" height="32" border="0"> The graphic is in a file. The file is inside a folder named "images."
</p>

<p>
Link: <a href="http://www.yahoo.com/">Yahoo!</a> <br>
Another link: <a href="/tableexample.htm">Another Web page</a> <br>
Note the way the BR tag works in the two lines above.
</p>

<p>&gt; <a href="/index.htm">HTML examples index</a></p>
</td>
</tr>
</table>
</body>
</html>

For any HTML page it makes sense to look only at the body of the document - everything between the <body>...</body> tags.

So we have the top-level node in the tree - the body tag.

My algorithm for extracting the useful data is:

  1. Get the root node (the body tag) and pass it to the recursive procedure that begins at step 2.
  2. Calculate the length of the clear text in the node.
  3. Get the list of subnodes of the node.
  4. If there is only one subnode, go to step 2 and process that subnode.
  5. For each subnode, calculate the length of the clear text in the subnode.
  6. Calculate the coefficient (K) of the distribution of the text between the subnodes.
  7. If K is low (<5%), the text is uniformly distributed between the subnodes, and the current node is exactly the main content of the page.
  8. If K is not low (>=5%), choose the subnode with the longest clear text and apply the same procedure to it (go to step 2). This is the recursive call of the content-extracting function.

There are different ways to calculate the coefficient of text distribution, and different values can be used as the signal to stop. I used 5%.
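
Here is a minimal sketch of the procedure in Python (the tool mentioned at the end of this article is written in Perl; this is an illustration, not its source). It assumes BeautifulSoup for the tree and reads the "linear variance" LV, defined precisely after the walkthrough below, as the mean absolute deviation of the subnode text lengths:

# A minimal sketch of the extraction procedure, not the original Perl tool.
# Assumes the third-party beautifulsoup4 package; "linear variance" (LV)
# is read here as the mean absolute deviation of the subnode text lengths.
from bs4 import BeautifulSoup

THRESHOLD = 5.0  # the 5% value used as the signal to stop

def text_length(node):
    # Length of the clear text in a node (text after removing HTML tags).
    return len(node.get_text(strip=True))

def distribution_coefficient(lengths, ftl):
    # K = 100 * LV / FTL, with LV taken as the mean absolute deviation.
    if ftl == 0:
        return 0.0
    mean = sum(lengths) / len(lengths)
    lv = sum(abs(x - mean) for x in lengths) / len(lengths)
    return 100.0 * lv / ftl

def extract_main_content(node):
    # Steps 2-8: descend toward the subnode holding most of the text
    # until the text is distributed uniformly (K below the threshold).
    subnodes = node.find_all(recursive=False)        # step 3
    if not subnodes:
        return node
    if len(subnodes) == 1:                           # step 4
        return extract_main_content(subnodes[0])
    lengths = [text_length(sn) for sn in subnodes]   # step 5
    k = distribution_coefficient(lengths, text_length(node))  # steps 2, 6
    if k < THRESHOLD:                                # step 7: uniform text
        return node
    # Step 8: recurse into the subnode with the longest clear text.
    return extract_main_content(subnodes[lengths.index(max(lengths))])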

Let's look at how this algorithm works with the example above.

First it processes the body node.

The body has one subnode - table. So we just pass this subnode to the procedure.

The table has two subnodes - two tr tags. One of them (the second) has much more text than the first, so the coefficient of text distribution will be high there. The next node to process is the second tr.

Let's go deeper. The tr has two subnodes - two td tags. The second has much more text, so it will be used in the next step.

The td tag (the current node) has a few subnodes - several p tags. In this case the coefficient of text distribution will be low, because a comparable share of the text is in each p tag. So at this step the procedure stops and returns the HTML code in this td as the main (useful) content of the page.
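
Under the same assumptions, running the sketch above on a compact version of the example page reproduces this walkthrough:

# Usage example: apply the sketch above to a compact version of the page.
page = """
<html><body><table>
<tr><td colspan="2"><h1> Welcome to My Web Page! </h1></td></tr>
<tr>
<td width="200">
<a>Menu item 1</a> <br><a>Menu item 2</a> <br><a>Menu item 3</a> <br>
</td>
<td id="maincontent">
<p>This page illustrates how you can write proper HTML using only a text
editor, such as Windows Notepad. You can also download a free text editor.</p>
<p>There is a small graphic after the period at the end of this sentence.
The graphic is in a file. The file is inside a folder named images.</p>
<p>Link: Yahoo! Another link: Another Web page. Note the way the links in
the two lines above point to other pages on the Web and on this site.</p>
</td>
</tr>
</table></body></html>
"""
soup = BeautifulSoup(page, "html.parser")
main = extract_main_content(soup.body)   # step 1: start at the body tag
print(main.name, main.get("id"))         # prints: td maincontent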

Calculating the coefficient (K):

Let L(i) be the length of the text in subnode number i, i = 1..n, where n is the number of subnodes of the node.

FTL is the length of all the text in the node (just the text after removing the HTML tags).

LV is the linear variance of the list L(i), i = 1..n.

And finally, K = 100 * LV / FTL.
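
For example, take a node with three subnodes whose clear-text lengths are L = (200, 210, 190), so FTL = 600. The mean length is 200, and (reading LV as the mean absolute deviation, as in the sketch above) LV = (0 + 10 + 10) / 3 ≈ 6.7, so K = 100 * 6.7 / 600 ≈ 1.1% - below the 5% signal, and the node is accepted as the main content. For a node with two subnodes and L = (30, 570), FTL = 600, we get LV = 270 and K = 45% - the procedure recurses into the larger subnode.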

I have created a simple tool to demonstrate how this works. It is written in Perl: Getting useful content of the Web page.

Try posting some URLs there. It is not ideal, but it worked fine in all the tests that I did.
