Welcome to Roman Gelembjuk's blog
Calculating the importance of an event

Hundreds of thousands of new events are described on the Web every day. News sites, blogs and other resources publish information about them. Usually some way is needed to filter the important events from the unimportant ones.

Here I will describe a tool that assigns a rating (from 1 to 10) to each message in the Google News service.

Last Updated on Tuesday, 23 March 2010 15:23
 
Splitting text into sentences with PHP

Getting sentences from text is a function needed in many applications.

Here I will describe how to split text into sentences with PHP.
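As a rough illustration of the idea, here is a minimal sketch of a rule-based sentence splitter. It is written in Python rather than PHP only to keep it short, and it is not the code from the post: the function name `split_sentences`, the `ABBREVIATIONS` list and the splitting rule itself are assumptions made for this example.

```python
import re

# Hypothetical abbreviation list; a real tool would need a much longer one.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "e.g.", "i.e.", "etc."}


def split_sentences(text):
    """Split text after '.', '!' or '?' when followed by whitespace
    and an uppercase letter, a digit or a quote character."""
    text = text.strip()
    if not text:
        return []
    # Candidate boundaries: terminal punctuation, whitespace, sentence start.
    parts = re.split(r'(?<=[.!?])\s+(?=[A-Z0-9"\'])', text)
    sentences = []
    buffer = ""
    for part in parts:
        buffer = f"{buffer} {part}" if buffer else part
        # If the piece ends with a known abbreviation, the boundary was
        # a false one, so keep collecting into the same sentence.
        if buffer.split()[-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


if __name__ == "__main__":
    for sentence in split_sentences("Dr. Smith arrived. He was late! Why?"):
        print(sentence)
```

The abbreviation check is the part that matters most in practice: without it, "Dr. Smith" would be cut into two sentences.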

Last Updated on Friday, 19 March 2010 11:43
 
Generating Unique Text with Markov chains

A few years ago I wrote a Joomla component called AutoContent. That component generates new unique content and adds it to a Joomla site. In general, this is not a good tool, because it does black-hat SEO. If a search engine (e.g. Google) finds autogenerated text that makes no sense, the site can be banned.

I decided not to support that component anymore. But there are some interesting classes in it.

One of them is a PHP class that generates new unique text with Markov chains.
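To show the general technique, here is a minimal sketch of a word-level Markov chain text generator. It is written in Python for brevity and is not the PHP class from AutoContent: the names `build_chain` and `generate`, the default order of 2 and the `seed` parameter are all assumptions made for this example.

```python
import random
from collections import defaultdict


def build_chain(text, order=2):
    """Map each run of `order` consecutive words to the words that follow it."""
    words = text.split()
    chain = defaultdict(list)
    for i in range(len(words) - order):
        state = tuple(words[i:i + order])
        chain[state].append(words[i + order])
    return chain


def generate(chain, length=30, seed=None):
    """Walk the chain from a random state, emitting at most `length` words
    (never fewer than the starting state itself)."""
    if not chain:
        return ""
    rng = random.Random(seed)
    state = rng.choice(list(chain.keys()))
    output = list(state)
    while len(output) < length:
        followers = chain.get(state)
        if not followers:
            # Dead end: this state only occurred at the end of the source.
            break
        output.append(rng.choice(followers))
        state = tuple(output[-len(state):])
    return " ".join(output)


if __name__ == "__main__":
    source = "the cat sat on the mat and the cat ran to the mat"
    print(generate(build_chain(source), length=12, seed=1))
```

Every three-word window in the output also appears in the source text, which is why the result looks locally plausible while the text as a whole usually makes no sense.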

Last Updated on Friday, 19 March 2010 11:29
 
Creating an extended Twitter bot with PHP

If you search the Web for information about creating a Twitter bot, you will find a lot of articles and blog posts about it. But most of them are just examples of code that posts a tweet (message) to Twitter.

This is not enough to create a fully functional Twitter bot.

A good Twitter bot should: post tweets, follow new users by a given subject, and somehow get other users to follow the target account. Actually, the last task is the most important: growing the number of followers of an account.

Here is an example of such a bot: Extended Twitter bot with PHP.

Last Updated on Sunday, 07 March 2010 14:28
 
How to extract useful content from an HTML page

One of the most important tasks in information retrieval is to extract useful content from Web pages.

The problem is that Web pages (usually HTML) contain the main information (text) along with tons of additional text: navigation structures, advertising, etc. From the point of view of information retrieval tools, each HTML page has main (useful) content and auxiliary information that is helpful when browsing the Web but not when extracting data.

The task is to filter useful content from HTML without knowing the structure of the page.

It is not difficult to get useful content when the structure of an HTML page is known, for example by using WebScraper. But it is not possible to create templates or regular expressions for every page on the Web.

Last Updated on Thursday, 18 March 2010 15:05
 