If it's not obvious by now, this is the fifth part of my "Data Mining News and the Stock Market" series. If you don't want to go back and read my previous posts, here's where I'm at: I have a sqlite database filled with news articles about companies on the NYSE.
I don't have all the data that I'm going to need, but I have enough to start getting some sentiment. While doing some searching on how to do opinion mining in python, I found this repo:
https://github.com/kjahan/opinion-mining.
In the description, the guy described how he used a sentiment analysis API.
I don't mean to sound lazy, but that's just about the best news I've heard in a long time.
Let's be honest. For a one-man project, this was going to be a bit of a nightmare. I could have spent my time annotating articles, POS tagging sentences, and building a dictionary of positive/negative tokens. But it would have either taken me a really long time or cost me a lot of money. In the end, it might not have even worked. I was really excited to find a simple API.
Also, classifications tend to work better in aggregate. Keeping that in mind, I can architect my code to use several different sentiment analysis APIs. My opinion mining will really just aggregate other people's opinion mining software! I only need to keep each API's pricing model and data restrictions in mind, and that can be handled in software. To make things even easier, I found a list of APIs here:
http://blog.mashape.com/list-of-20-sentiment-analysis-apis/
The API that the opinion-mining repo used is on Mashape, and will allow me 45,000 API calls per month for free (and $0.01 per call after that). I stopped my scraping app after about 12,000 articles, about 40% of the way through Reuters. I'm really only going to need to get any article's sentiment once, so this should work well for me as a test.
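The aggregation idea is simple enough to sketch. Here's a minimal version, assuming each API's result has already been normalized to a common [-1, 1] scale (the provider names in the example are just placeholders, not the APIs I'll actually use):

```python
# Combine normalized scores from several sentiment APIs into one number.
# An API that failed or was skipped contributes None and is ignored.

def aggregate_sentiment(scores):
    """Average the normalized scores from each API, skipping failures."""
    valid = [s for s in scores.values() if s is not None]
    if not valid:
        return None  # no API produced a usable score
    return sum(valid) / len(valid)

# Three hypothetical providers scoring the same article:
scores = {"provider_a": 0.6, "provider_b": 0.4, "provider_c": None}
print(aggregate_sentiment(scores))  # 0.5
```

A plain average is the simplest choice; later on, APIs that prove more reliable could get a higher weight.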
Adding Sentiment to my System
I've added some models to my database to keep track of all the APIs I'm going to be using. Each API will have a unique url and key, but they're all going to be called the same way. I've also added an API response object. This will contain the response that I get back, along with a score. These APIs seem to have different kinds of results, so the score will have to be calculated differently depending on which one I'm using. Since these APIs are going to be used to get opinions, I'm calling them OpinionAPIs.
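In sqlite terms, the two new models might look something like this (the table and column names here are my guesses; the post doesn't show the actual schema):

```python
# Sketch of the OpinionAPI and response tables in raw sqlite.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE opinion_api (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    url  TEXT NOT NULL,   -- each API has a unique endpoint
    key  TEXT NOT NULL    -- and a unique key, but all are called the same way
);
CREATE TABLE api_response (
    id         INTEGER PRIMARY KEY,
    api_id     INTEGER REFERENCES opinion_api(id),
    article_id INTEGER,   -- points at the existing articles table
    raw        TEXT,      -- full response body, kept for re-parsing later
    score      REAL       -- computed per-API, normalized to one scale
);
""")
```

Keeping the raw response around means the per-API score calculation can change later without re-spending API calls.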
Now my machine learning is done by someone else! I can get the sentiment of an article with a script like this:
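Something along these lines, using only the standard library (the endpoint, header, and response shape here are assumptions modeled on the Mashape-style sentiment APIs, not the repo's actual code):

```python
# Sketch: POST an article's text to a sentiment API, then collapse the
# probability-style response into a single score.
import json
import urllib.request

def fetch_sentiment(text, url, api_key):
    """Send the article text to the API and return the parsed JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps({"text": text}).encode("utf-8"),
        headers={"X-Mashape-Key": api_key,      # header name is an assumption
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

def score_response(response):
    """Collapse {positive, neutral, negative} probabilities into [-1, 1]."""
    return response["positive"] - response["negative"]

# Canned response instead of a live call, since each API's format differs:
print(score_response({"positive": 0.75, "neutral": 0.0, "negative": 0.25}))  # 0.5
```

Each OpinionAPI would get its own `score_response`-style function, since the APIs return different result formats.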
After running this, I did a quick sanity check. Good articles seemed to have a high score, and bad articles had a low score! Done!
Now what?
So now I've got some work to do. I've got a little bit of data, and I have a good framework for adding classifiers.
- Build more web scrapers
- Get more data
- Add more APIs
- Get prices into the database (simple, but I still haven't done it)
After that I'm going to do some exploratory data analysis to see where I'm at. Hopefully I'll be posting some pretty graphs and charts soon.
Some Concerns
Some of the assumptions I made throughout this project haven't really panned out. For instance, the number of articles I have per company is way off. I was expecting around 1000 articles per company, but I only have a handful of articles for several companies. The counts range anywhere from 0 to a little over 1000. I'll need to take a closer look at my Reuters scraper.
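A quick GROUP BY over the articles table is enough to eyeball that spread (assuming an `articles` table with a `company` column; my real schema may name these differently):

```python
# Count articles per company to spot under-scraped companies.
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the real database file
conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, company TEXT)")
conn.executemany("INSERT INTO articles (company) VALUES (?)",
                 [("Sprint",), ("Sprint",), ("Apple",)])

rows = conn.execute("""
    SELECT company, COUNT(*) AS n
    FROM articles
    GROUP BY company
    ORDER BY n DESC
""").fetchall()
print(rows)  # [('Sprint', 2), ('Apple', 1)]
```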
Also, the date ranges seem to be all over the place. Based on some of my initial testing, I was expecting to only have articles from around this year, but I have articles going all the way back to 2009. The pricing isn't really an issue, because downloading price data is super cheap. I'm just worried about the inconsistency of my data. News and prices might have had a much different relationship 5 years ago. I can't be sure.
I may be getting a lot of articles that have nothing to do with anything. While browsing my database, I stumbled on an article about the Olympics. It had come up while searching for "Sprint".
"...with players forced to sprint and stretch behind the goal-lines in order to preserve the surface."
This is a huge concern for me. The whole article had a really negative score, and I have no way of knowing how many articles like this are in my database. I may have to search Reuters by stock ticker instead of by company name. I know I said earlier that I was going to look for the company name in the title, but that could just as easily have the same problem.