Extracting useful content from web resources: a technology review

This is a review of techniques for extracting the important content from web resources.

It is no secret that Internet giants like Google, Microsoft, and Yahoo already have powerful technologies for extracting the important content from web pages. But big corporations are not interested in publishing their technologies.

Searching for information on this question turns up two types of resources: scientific publications, and blog posts and articles by individuals who built small tools for their own needs.

Let's see what information is currently available on the subject "Extracting useful (important) information from web pages".


Some blog posts on this subject:

In general, the idea of these and some other articles is to use the HTML DOM structure to calculate the ratio of text to HTML tags in document nodes, to see where the biggest blocks of text are.
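
The text-to-tag ratio idea can be illustrated with a minimal Python sketch. It is not taken from any of the articles mentioned here; it is only one possible reading of the approach. It builds a simple tree with the standard library's html.parser, counts text characters and tags in every subtree, and picks the node with the highest ratio among the nodes that hold a large share of the page text. The names Node, TreeBuilder, aggregate, iter_nodes and main_content_node, as well as the min_share threshold of 0.5, are assumptions made for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not kept open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is not visible page content.
SKIP_TAGS = {"script", "style"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text_len = 0   # own text first, whole subtree after aggregate()
        self.tag_count = 1  # this tag, plus descendants after aggregate()


class TreeBuilder(HTMLParser):
    """Builds a Node tree; tolerant of unclosed tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name, if any.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIP_TAGS:
            self.current.text_len += len(data.strip())


def aggregate(node):
    """Post-order pass: add each child's totals to its parent."""
    for child in node.children:
        aggregate(child)
        node.text_len += child.text_len
        node.tag_count += child.tag_count


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def main_content_node(html, min_share=0.5):
    """Return the node with the best text-to-tag ratio, or None.

    Only nodes holding at least min_share of all page text are considered;
    without this assumed threshold a single short link would win.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    aggregate(builder.root)
    total = builder.root.text_len
    if total == 0:
        return None
    candidates = [n for n in iter_nodes(builder.root)
                  if n is not builder.root and n.text_len >= min_share * total]
    return max(candidates, key=lambda n: n.text_len / n.tag_count, default=None)
```

On a page with a navigation block, a content block with a few long paragraphs, and a footer, this heuristic tends to select the content block, because it holds most of the text while containing few tags. Real pages are messier (comments, sidebars with long text, content split across several containers), so this should be read as a starting point only.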

Scientific publications on this subject:

Scientific publications contain interesting ideas and solutions. But usually the published information is just a general description of how the technology should work, with no algorithms or formulas. When I tried to use solutions from publications to build a sample application, more and more questions came up in the process, and the articles say nothing about them.

Very deep research into text mining techniques is needed to find algorithms for the many small tasks that stand in the way of applying an idea.

When I did my review of current technologies, I did not find a way to create an effective application that can extract the important content from an HTML page. So this area still has a lot of work to be done in the future.


Last Updated on Wednesday, 21 April 2010 14:22