Roman Gelembjuk's science notes



How to extract useful content from HTML page PDF Print E-mail

One of most important tasks in information retrieval science is to extract useful content from Web pages.

The problem is Web pages (usually HTML) contain some information (text) with tons of information additional texts - navigation structures, advertising etc. From point of view of information retrieval tools each HTML page has main (useful) content and helpful information that is good when viewing Web, but not when extracting the data.

The task is to filter useful content from HTML without knowing of structure of the page.

It is not difficult to get useful content when structure of HTML page is known. For example, with using WebScraper. But it is not possible to create templates or REGEXP exprassions for every page in the Web.

Last Updated on Thursday, 18 March 2010 15:05
 
<< Start < Prev 1 2 Next > End >>

Page 2 of 2