Categorization of financial news by industry sector

Categorization (classification) of documents is a very interesting text mining task.

Recently I tried to build a tool that categorizes business news by industry. Financial institutions use several different sets of industry categories.

I used the set from Yahoo Finance.

 

My aim was a tool that receives some text and finds the industry (or industries) that the text relates to.

I decided to use text mining methods for this.

The idea was to take a set of texts (documents) that are already categorized (each document is assigned to a category, that is, an industry) and then use that set to train a categorizer.

Of course, I chose Perl as the programming language for the tool. There is a fine Perl library, AI::Categorizer, that can do most of the work.

To get texts for training, I wrote a script that scraped 20-30 texts from Yahoo Finance News for each industry. The news there is already categorized, so training texts are easy to get.

I also collected a second set of categorized texts for testing the trained categorizer. These were also news messages, but from a different time period.

So STEP 1 was:

Write a Perl script to scrape the list of industries from Yahoo Finance and 20-30 messages from the News section for each industry. To build this scraper I used my WebScraper Perl module.

STEP 2 was to choose a categorization method and create a knowledge base for it.

AI::Categorizer supports several methods. I decided to try each of them and see which would be the most successful.

STEP 3 was to decide how to represent the texts for training. There are a few options:

1. The text "as is".

2. Only a list of keywords (the most frequent words in the text, excluding common words).

3. A "keyword cloud". This is a list of keywords in which each keyword is repeated as many times as it occurs in the original text.
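To make the three representations concrete, here is a small sketch. It is written in Python for illustration only and is not the Perl code used for the tool; the stop-word list, the function names and the sample sentence are all assumptions.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real tool would need a fuller one.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for", "as"}

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

def keyword_counts(text, top_n):
    """Return the top_n most frequent non-stop-words with their counts."""
    words = [w for w in tokenize(text) if w not in STOP_WORDS]
    return Counter(words).most_common(top_n)

def as_keywords(text, top_n=10):
    """Option 2: each keyword appears exactly once."""
    return " ".join(word for word, _ in keyword_counts(text, top_n))

def as_keyword_cloud(text, top_n=10):
    """Option 3: each keyword is repeated as often as it occurs in the text."""
    return " ".join(
        " ".join([word] * count) for word, count in keyword_counts(text, top_n)
    )

sample = "Oil prices rose as oil producers cut output. Producers expect oil demand to grow."

print(sample)                       # option 1: the text "as is"
print(as_keywords(sample, 2))       # option 2: keywords only
print(as_keyword_cloud(sample, 2))  # option 3: keyword cloud
```

The keyword cloud keeps the word frequencies that a frequency-based categorizer relies on, while the plain keyword list throws them away.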

Experiment 1.

AI::Categorizer::Learner::NaiveBayes was used as the categorizer.

The training texts were used "as is". Each text was mapped to its category (industry) and added to the training set.

After the categorizer was "trained", I ran the tester on the test set. Only 18% of the test texts were assigned to the correct industry. Not a good result.
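For readers unfamiliar with the method, here is a minimal sketch of what a Naive Bayes categorizer and the "percent of correct assignments" measure amount to. It is plain Python with toy data, not AI::Categorizer and not the scripts used in these experiments; the class, the function names and the sample texts are made up for illustration.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.doc_counts = Counter()              # category -> number of training texts
        self.vocabulary = set()

    def add(self, text, category):
        """Add one categorized training text."""
        words = tokenize(text)
        self.word_counts[category].update(words)
        self.doc_counts[category] += 1
        self.vocabulary.update(words)

    def categorize(self, text):
        """Return the category with the highest log-probability for the text."""
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_category, best_score = None, -math.inf
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # prior of the category
            for word in tokenize(text):
                if word in self.vocabulary:  # skip words never seen in training
                    score += math.log((counts[word] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_category, best_score = category, score
        return best_category

def accuracy(classifier, labelled_texts):
    """Share of texts whose predicted category equals the known one."""
    correct = sum(
        1 for text, category in labelled_texts
        if classifier.categorize(text) == category
    )
    return correct / len(labelled_texts)

# Toy training and test sets (invented, not scraped news).
train = [
    ("Crude oil output rises as drilling expands", "Oil & Gas"),
    ("Refinery shutdown lifts gasoline prices", "Oil & Gas"),
    ("Bank raises lending rates for mortgages", "Banking"),
    ("Regional bank reports loan growth", "Banking"),
]
test = [
    ("Oil drilling slows as prices fall", "Oil & Gas"),
    ("Mortgage lending grows at local bank", "Banking"),
]

nb = NaiveBayes()
for text, category in train:
    nb.add(text, category)

print(f"{accuracy(nb, test):.0%} of test texts assigned correctly")
```

With two clearly separated toy categories the task is trivial. With dozens of industries and only 20-30 short news texts each, many words are shared between categories, which is one way the accuracy can drop as low as it did here.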

Experiment 2.

The same categorizer method was used. The training texts were replaced with strings containing their keywords.

The test texts were used "as is". Result: 20% correct assignments. A little better.

Experiment 3.

The same categorizer method was used. The training texts were replaced with strings containing their keywords.

The test texts were also replaced with keyword lists. Result: 27% correct assignments. Better again, but still not enough.

More experiments

Then I started to replace the texts with "keyword clouds". But the results were the same as for plain keyword lists.

The next experiments used different combinations of text representations. All results were lower than for experiment 3.

Then I tried different categorizer methods. But all of them produced a very high percentage of "no category found" cases, where a test text was not assigned to any category.
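One plausible way a categorizer ends up with "no category found" is a score threshold: a text is assigned to a category only if the best score is high enough. The sketch below illustrates that idea in Python. It is an assumption about the mechanism, not a description of how any particular AI::Categorizer learner works, and the threshold value, function names and scores are invented.

```python
from collections import Counter

def assign(scores, threshold=0.5):
    """Return the best-scoring category, or None if no score reaches the threshold."""
    if not scores:
        return None
    category, score = max(scores.items(), key=lambda item: item[1])
    return category if score >= threshold else None

def tally(results, threshold=0.5):
    """Count correct, wrong and unassigned outcomes for (scores, true_category) pairs."""
    counts = Counter()
    for scores, truth in results:
        predicted = assign(scores, threshold)
        if predicted is None:
            counts["unassigned"] += 1
        elif predicted == truth:
            counts["correct"] += 1
        else:
            counts["wrong"] += 1
    return counts

# Invented per-category scores for three test texts.
results = [
    ({"Banking": 0.9, "Oil & Gas": 0.1}, "Banking"),
    ({"Banking": 0.3, "Oil & Gas": 0.4}, "Oil & Gas"),
    ({"Banking": 0.7, "Oil & Gas": 0.2}, "Oil & Gas"),
]

print(tally(results))
```

Reporting the unassigned cases separately from the wrong ones shows whether a method is guessing badly or simply refusing to guess.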

Results

My attempt to create a news categorizer was not a success.

I think there are two possible reasons:

1. Too few texts were used for training. The categorizer probably needs many more cases to build stable rules for the categories (industries).

2. News messages are time-specific and sensitive to events. Since the training texts and the test texts came from different time periods, this may have affected the final result.

 

 

Last Updated on Thursday, 14 October 2010 19:51