How to extract useful content from an HTML page

One of the most important tasks in information retrieval is extracting the useful content from Web pages.

The problem is that Web pages (usually HTML) contain the main information (text) mixed with tons of additional text: navigation structures, advertising, and so on. From the point of view of information retrieval tools, each HTML page has its main (useful) content plus helper material that is good when browsing the Web, but not when extracting the data.

The task is to filter the useful content out of HTML without knowing the structure of the page.

It is not difficult to get the useful content when the structure of the HTML page is known, for example by using WebScraper. But it is not possible to create templates or REGEXP expressions for every page on the Web.
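To illustrate the template/REGEXP approach for a page whose structure is known, here is a minimal Python sketch. The sample page, the `id="content"` container, and the regular expression are all hypothetical assumptions for illustration; this is not WebScraper itself.

```python
import re

# A sample page whose structure is assumed known: the useful text
# sits in a div with id="content" (hypothetical layout).
html = """
<html><body>
<div id="nav"><a href="/">Home</a> <a href="/about">About</a></div>
<div id="content"><p>This is the useful article text.</p></div>
<div id="footer">Copyright 2010</div>
</body></html>
"""

# A template/REGEXP written for this specific page layout.
match = re.search(r'<div id="content">(.*?)</div>', html, re.DOTALL)
main_html = match.group(1) if match else ""

# Strip the remaining tags to get plain text.
text = re.sub(r"<[^>]+>", " ", main_html).strip()
print(text)
```

A regexp like this works only for pages built from this exact template, which is the limitation described above: such a pattern cannot be written for every page on the Web.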

Last Updated on Thursday, 18 March 2010 15:05
Getting keywords from a text

I have written a simple Perl script that extracts keywords from a text.

The tool is available to play with at the link

This tool extracts all words from the text, counts how many times each word is repeated, and then keeps the most used words as keywords. The words are also filtered: some common words are dropped from the list.
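The original script is in Perl and is not shown here; the same idea can be sketched in Python. The small stop-word list and the sample text are assumptions for illustration.

```python
import re
from collections import Counter

# A small assumed list of common words to drop (the real script's
# filter list is not published here).
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to",
              "in", "is", "it", "that", "this", "from"}

def extract_keywords(text, top_n=5):
    # Split the text into lowercase words.
    words = re.findall(r"[a-z']+", text.lower())
    # Count repetitions, dropping the common (stop) words.
    counts = Counter(w for w in words if w not in STOP_WORDS)
    # The most repeated remaining words become the keywords.
    return [word for word, _ in counts.most_common(top_n)]

sample = ("Information retrieval extracts useful content from pages. "
          "Retrieval of content is the core of information retrieval.")
print(extract_keywords(sample, 3))
```

This mirrors the steps described above: tokenize, count repetitions, filter common words, and keep the most frequent words as keywords.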

Last Updated on Friday, 19 February 2010 14:27
My science interests

I did my first science research 10 years ago. Those were attempts to create Artificial Intelligence, and I was not successful. :)

Later I did some research and got results in Calculus. That was my official scientific research.

Now I am still interested in Artificial Intelligence. I do not try to create AI anymore; I want to develop some useful technologies related to unstructured text processing.

I am interested in Data Mining, Text Mining, and Computational Linguistics.


Last Updated on Wednesday, 23 December 2009 15:30