Wednesday, August 13, 2014

Data Mining News and the Stock Market Part 5 - Opinion Mining

If it's not obvious by now, this is the fifth part of my "Data Mining News and the Stock Market" post, but if you don't want to go back and read my previous posts, here's where I'm at: I have a sqlite database filled with news articles about companies on the NYSE.

I don't have all the data that I'm going to need, but I have enough to start getting some sentiment. While doing some searching on how to do opinion mining in python, I found this repo: https://github.com/kjahan/opinion-mining.
In the description, the guy explained how he used a sentiment analysis API.
I don't mean to sound lazy, but that's just about the best news I've heard in a long time.

Let's be honest. For a one-man project, this was going to be a bit of a nightmare. I could have spent my time annotating articles, POS tagging sentences, and building a dictionary of positive/negative tokens. But it would have either taken me a really long time or cost me a lot of money. In the end, it might not have even worked. I was really excited to find a simple API.

Also, classifications tend to work better in aggregate. Keeping that in mind, I can architect my code to use several different sentiment analysis APIs. My opinion mining will really just aggregate other people's opinion mining software! I only need to keep each API's pricing model and data restrictions in mind, and that can be handled in software. To make things even easier, I found a list of APIs here: http://blog.mashape.com/list-of-20-sentiment-analysis-apis/
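The aggregation itself can be dead simple. Something like this, where each API gets wrapped in a callable that returns a normalized score (just a sketch, not real code yet):

    def aggregate_sentiment(text, apis):
        """Average the scores from several sentiment APIs.

        `apis` is a list of callables, each wrapping one service and
        returning a score normalized to [-1, 1].
        """
        scores = [api(text) for api in apis]
        return sum(scores) / len(scores)

Rate limits and pricing can then live inside each wrapper, so the aggregation code never has to care which service it's talking to.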


The API that the opinion-mining repo used is on Mashape, and will allow me 45,000 API calls per month for free ($0.01 per call after that). I stopped my scraping app after about 12,000 articles, when it was about 40% of the way through Reuters. I'm really only going to need to get each article's sentiment once, so this should work well for me as a test.

Adding Sentiment to my System

I've added some models to my database to keep track of all the APIs I'm going to be using. Each API will have a unique url and key, but they're all going to be called the same way. I've also added an API response object. This will contain the response that I get back, along with a score. These APIs seem to return different kinds of results, so the score will have to be calculated differently depending on which one I'm using. Since these APIs are going to be used to get opinions, I'm calling them OpinionAPIs.
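Roughly, the new models look like this (a sketch with peewee; the field names are my reconstruction of what I just described, not necessarily the exact code):

    from peewee import (SqliteDatabase, Model, CharField, TextField,
                        FloatField, ForeignKeyField)

    db = SqliteDatabase("newtsocks.db")

    class OpinionAPI(Model):
        # Each service has its own endpoint and key, but gets called
        # the same way.
        name = CharField()
        url = CharField()
        key = CharField()

        class Meta:
            database = db

    class APIResponse(Model):
        # The raw response from the API, plus a score that's calculated
        # differently depending on which API produced it.
        api = ForeignKeyField(OpinionAPI)
        response = TextField()
        score = FloatField()

        class Meta:
            database = db

    db.create_tables([OpinionAPI, APIResponse])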

Now my machine learning is done by someone else! I can get the sentiment of an article with a script along these lines (a sketch; the endpoint URL and response format below are assumptions, not the actual API's details):
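    import requests

    # Placeholder endpoint and key -- the real Mashape API's URL and
    # response format may differ.
    API_URL = "https://text-sentiment.p.mashape.com/analyze"
    API_KEY = "my-mashape-key"

    def get_sentiment(text):
        """Send article text to the sentiment API; return (raw result, score)."""
        resp = requests.post(API_URL,
                             headers={"X-Mashape-Key": API_KEY},
                             data={"text": text})
        result = resp.json()  # assumed shape: {"pos": 0.8, "neg": 0.2}
        return result, result["pos"] - result["neg"]

    # Article is my peewee model from part 4; mashape_api is the
    # OpinionAPI row for this service.
    for article in Article.select():
        raw, score = get_sentiment(article.text)
        APIResponse.create(api=mashape_api, response=str(raw), score=score)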

After running this, I did a quick sanity check. Good articles seemed to have a high score, and bad articles had a low score! Done!

Now what?

So now I've got some work to do. I've got a little bit of data, and I have a good framework for adding classifiers.
  • Build more web scrapers
  • Get more data
  • Add more APIs
  • Get prices into database (Simple, but still haven't done it)
After that I'm going to do some exploratory data analysis to see where I'm at. Hopefully I'll be posting some pretty graphs and charts soon.

Some Concerns

Some of the assumptions I made throughout this project haven't really panned out. For instance, the number of articles I have per company is way off. I was expecting around 1000 articles per company, but I only have a handful of articles for several companies. The count ranges anywhere from 0 to a little over 1000. I'll need to take a closer look at my Reuters scraper.

Also, the date ranges seem to be all over the place. Based on some of my initial testing, I was expecting to only have articles from around this year, but I have articles dating all the way back to 2009. Pricing isn't really an issue, because downloading price data is super cheap. I'm just worried about the inconsistency of my data. News and prices might have had a much different relationship 5 years ago. I can't be sure.

I may be getting a lot of articles that have nothing to do with anything. While browsing my database, I stumbled on an article about the Olympics. It had come up while searching for "Sprint".
"...with players forced to sprint and stretch behind the goal-lines in order to preserve the surface."
This is a huge concern for me. The whole article had a really negative score, and I have no way of knowing how many articles like this are in my database. I may have to search Reuters by stock ticker instead of by company name. I know I said earlier that I was going to look for the company name in the title, but that could just as easily have the same problem.

Monday, August 11, 2014

Data Mining News and the Stock Market Part 4 - Collecting News Articles

This is the fourth part of my "Data Mining News and the Stock Market" post.

First thing I did was set up a repo on GitHub. I needed a good name. "Newtsocks" is an anagram of "news" and "stock" combined, so I went with that.

Next, I set up an isolated python virtual environment using virtualenv. It's pretty handy, and keeps me from cluttering up my computer. At first, I made the mistake of including virtualenv directories in my repo. For future reference, don't do that.

I also needed a database to store all this info I'm going to be grabbing. After a bit too much research, I decided to stick with a relational database. I'm just more comfortable with them, and decided I could build this out a little faster if I stuck with a traditional RDBMS. Also, it's not important (right now) that my application be able to scale out.

Next, I wanted a good lightweight ORM. I know I'm doing basic web scraping here, and this probably seems like overkill, but I feel like the upfront work will save me some time later on. I decided to go with peewee. Seemed like it would be pretty simple to get running. And it was! I created my database, and added my list of companies.
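Getting up and running really was just a few lines. Something like this, assuming a bare-bones Company model (the actual fields in my repo may differ):

    from peewee import SqliteDatabase, Model, CharField

    db = SqliteDatabase("newtsocks.db")

    class Company(Model):
        # One row per NYSE company I'm tracking.
        name = CharField()
        symbol = CharField(unique=True)

        class Meta:
            database = db

    db.connect()
    db.create_tables([Company])
    Company.create(name="Bank of America", symbol="BAC")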

Then I needed to do the actual scraping. I wrote up a simple script. You can check out the process I went through by looking at the history on the repo, but I ended up with something along these lines (a stripped-down sketch; the search URL and CSS selectors below are guesses, not the exact ones in the repo):
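    import requests
    from bs4 import BeautifulSoup

    # Hypothetical search URL and parameters -- Reuters' real ones differ.
    SEARCH_URL = "http://www.reuters.com/search/news"

    def scrape_company(company):
        """Walk the paginated search results and save each article found."""
        for page in range(1, 101):
            resp = requests.get(SEARCH_URL,
                                params={"blob": company.name, "pn": page})
            soup = BeautifulSoup(resp.text, "html.parser")
            results = soup.select("div.searchResult")  # selector is a guess
            if not results:
                break
            for result in results:
                link = result.find("a")
                if link is None:
                    continue
                # Company/Article are my peewee models.
                Article.create(company=company,
                               title=link.get_text(strip=True),
                               url=link["href"])

    for company in Company.select():
        scrape_company(company)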

Now I've got some data! It's just a prototype for the scraper. It's nothing novel, but it's working. I started running it about 30 minutes ago, and I've got a little less than 5000 news articles. I can start work on the scraper for the Washington Post, but I'm more interested in getting to work on the sentiment analysis aspect of it.

I would like to add that I spent a bit too much time trying to make my code pretty. I added my repo on landscape.io because I wanted to see how my code ranked. Landscape seems to have some trouble loading peewee: my code came up with a lot of errors on their site, even though it runs fine. I may be doing something wrong, but based on Landscape's issue tracker, this may be fixed soon. I'll keep an eye on it.

For now, I'm pushing forward. Time for some opinion mining!

Wednesday, August 6, 2014

Data Mining News and the Stock Market Part 3 - Collecting Some Prices

This is the third part of my "Data Mining News and the Stock Market" post.
This one turned out to be super easy.
I found this website: http://www.eoddata.com/.
I had an account within a few minutes, and purchased all of the NYSE's end-of-day data from the past 5 years for $12.50. So far, that's the only money I've spent on this project, and it's well worth it. Since I can't get news data older than April, I really only need the last year. For $12, who cares? I'll take it!

I go to their download page, and the whole download is tiny. That's right. Less than 10Mb.
The zip files contain a text file for every trading day of 2014, each one a plain list of per-ticker prices. I can definitely work with that.
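Loading these into my database should be trivial. A quick sketch, assuming each line follows EODData's usual symbol, date, open, high, low, close, volume layout (I'm guessing at the exact column order):

    import csv

    def load_prices(path):
        """Parse one daily EODData text file into a list of price records."""
        prices = []
        with open(path) as f:
            for row in csv.reader(f):
                # Assumed column order -- check against the actual files.
                symbol, date, open_, high, low, close, volume = row
                prices.append({"symbol": symbol, "date": date,
                               "close": float(close), "volume": int(volume)})
        return prices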

Next up, I'm going to start work on my web scrapers. They're going to need a place to put all the articles I download. I haven't messed with any non-relational databases yet. This might be a good project to try one out.


Tuesday, August 5, 2014

Data Mining News and the Stock Market Part 2 - Picking Some Companies

This is the second part of my "Data Mining News and the Stock Market" post. I apologize for my stream-of-consciousness approach. I was writing this as I was doing it.

A quick Google search of NYSE top companies led me to this Wall Street Journal article: NYSE Most Active Stocks. Seems like the perfect place to start. Now, I said before that the number of companies that I pick is largely going to be determined by how easily I can get data, and how much data I can get. So I need to do a little bit of work now before I pick my companies. There are two things I need for each company: news articles and historical price data.


Collecting the News

While I was looking for a place to download the news, I stumbled across this academic paper from Columbia University: Snowball: Extracting Relations from Large Plain-Text Collections. I didn't read the paper, but I did see this:
Our experiments use large collections of real newspapers from the North American News Text Corpus, available from LDC. This corpus includes articles from the Los Angeles Times, The Wall Street Journal, and The New York Times for 1994 to 1997.
The LDC is the Linguistic Data Consortium. They're a membership organization that basically gives data to research labs. As a student at a university, I had access to all of this data. As a guy sitting on his laptop, this is going to cost me some money, and I doubt I could get it anyway. I decided to look around their site to see if they had an updated version of the corpus. They don't. The latest news corpus I could see was from 2008, and it was actually just a parsing of the same corpus from 1997.

This may just be conjecture, but I feel, given the rise of the Internet, the relationship between news and the market has probably changed drastically over the past 10 years. So information from the 90's probably won't help me. Generally, it's best not to ignore data, but in some circumstances it can be justified. Reading through the LDC's site gave me a great list of news sources to look at:
  • Washington Post
  • New York Times
  • Wall Street Journal
  • Reuters News Service
The New York Times and Wall Street Journal are both paid subscription services. I don't think they'd appreciate me ripping off their data and putting it online, so they're out. But the Washington Post and Reuters both have a pretty simple search page. I should be able to write a web scraper for them, but before I do, I need to see how much data I can get from them.

Kimono

There's this pretty cool tool I heard about a while ago called Kimono. It lets you easily build out APIs for websites. Basically, it does all your web scraping for you. I'm not going to use it for my final project, but it should give me a pretty good starting point quickly.

First, I'm going to go through my news sources and search for one of the company names: "Bank of America". Then I'll see if I can build out an API for that company.

Washington Post

Had some trouble getting this to work. The individual pages are decent, but Kimono doesn't seem to be able to handle the pagination. So I can't get too much news. This will need a scraper.

Reuters News Service


Kimono worked great with Reuters. The tool works exactly as advertised. In about 10 seconds, I had built out my API. I even made a nice little mobile app for it here!


Actually, the mobile app stopped working after I let it crawl for a while. Too much data I guess. But still, pretty sweet.

I let the crawler go through 101 search pages and got back 914 articles, going as far back as Apr 23, 2014. To be precise, it got the title, URL, and date of publication for those 914 articles.

That's not too great. Also, the pagination gets a little weird on Reuters when you get around page 100. It keeps breaking Kimono.

It may be difficult to get data that goes too far back. I'll only be able to get a few months worth of news on any given company based on my Reuters search. I was hoping to get at least the last few years. Given that there are almost 1000 articles about Bank of America in the past few months, maybe the last few years would be a bit much.

So, I'm going to stick with my 100 companies, and try to collect news as far back as I can. My guess is about 1000 articles per company per news source. With 2 news sources, 100 companies, and 1000 articles each, that's 200,000 news articles. Shouldn't be too bad. Since I can't get news that goes too far back, getting price data shouldn't be too difficult.





Saturday, August 2, 2014

Data Mining News and the Stock Market

I've always wondered if there was a causal relationship between news and the stock market. If Apple's stock price drops 50%, you'll probably hear about it on the news, but if an article pops up online saying that Apple is using slave labor in China, what will happen to the stock price? It's a modern day chicken-and-egg problem. Which comes first? Granted, I may be oversimplifying things. The stock market is a very dynamic system, but I thought I'd do some good old-fashioned data mining and see how far I can get.

I want to make some things clear up front. I don't know that much about the stock market. I'm not a data mining, machine learning, or natural language processing expert. I studied these things when I was in school, but I've slept since then. I have a pretty good foundation, but I've been doing .NET web development for the past few years. I'm nowhere near up to date on the latest tools, techniques, and practices.

I'm also well aware that this isn't necessarily a novel idea. As I've said before, I have a bad habit of testing things out before I google them. Besides, this is a great opportunity to test my limits, to see if I've still got it, to finish a project.

My general hypothesis is that internet news articles about a company have an effect on that company's stock price. My slightly more testable hypothesis is that a company's stock price will fall soon after the publication of a negative internet news article related to that company. I'll also be testing the inverse: that the price rises when good news articles come out. You get the idea. Here's my basic plan:

Step 1: Pick some companies

I'm going to pick some big-name companies on the NYSE. I'm limiting myself to just the one market for simplification. I'm not sure whether I'm going to stick to one particular market area. Should I only pick big tech companies, or should I pick big companies from several different market areas? The number of companies and areas that I pick will be determined by how easily I can collect data on them.

 

Step 2: Collect Prices

I'm going to need to collect as much pricing data as I can about these companies from the past 5-10 years. I'm not sure about the granularity of the data yet. I may just collect closing prices for a given day, or I may collect a lot more. I'm going to have to see what's out there.

 

Step 3: Collect News Articles

Collect news articles from a few different news sources about these companies. I'm not sure where I'm going to get the news. I imagine that this will have a large impact on my results. Hopefully there are some websites with APIs for this sort of thing, but a web scraper shouldn't be too difficult to script out. In the end, I need a dataset with each article's title, text, source, and date/time of publication. Also, to simplify things, I'm only going to look at news articles whose titles contain the name or stock symbol of the company. I should be able to work from there.
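In code terms, each record would look something like this (just a sketch of the fields, nothing final):

    from collections import namedtuple

    # The fields I'll need for every scraped article.
    Article = namedtuple("Article", ["title", "text", "source", "published_at"])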

 

Step 4: Sentiment Analysis

This should be the fun part. I'm going to need to do some analysis on these news articles. Basically, I want to be able to measure public opinion of each of these companies over time. I'm going to need to see what's out there in terms of tools. I have some exposure to data/opinion mining, but it's been a few years, so I'm betting the tools that are out there now are a lot better than the ones I was working with. Hopefully, there are some tools out there that won't require any sort of annotation. It's really going to slow me down if I have to start reading these articles.
My main goal is to give each article a numerical score on how positive or negative it makes a company look. A news article titled "RadioShack Under $1: The Clock Is Ticking" would have a negative score. Let's say -10. What I'll do is keep a running total of this score. This way, I can measure public opinion of a company over time.
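The bookkeeping for that running total is trivial. For example, with made-up scores:

    # Made-up per-article scores for one company, in publication order.
    scores = [-10, 3, 5, -2, 8]

    # The running total is my measure of public opinion over time.
    opinion = []
    total = 0
    for score in scores:
        total += score
        opinion.append(total)

    print(opinion)  # [-10, -7, -2, -4, 4]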

 

Step 5: Compare Price to Public Opinion?

This is where I start to get a little hazy. I'm not quite sure how I'm going to be able to tell if public opinion is actually having an effect. Sure, I might be able to look at some graphs and say so, but there's probably some fancy math that I can't remember that can prove there's a correlation. I'm going to need to do more research before I can actually say what I'm going to do here, but hopefully I get some fancy graphs out of it.
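If I had to guess, the fancy math is something like a correlation coefficient between daily price changes and daily opinion changes. A sketch with made-up numbers:

    import numpy as np

    # Made-up daily series: day-over-day price changes and opinion changes.
    price_changes = np.array([0.5, -1.2, 0.3, 0.8, -0.4])
    opinion_changes = np.array([2.0, -5.0, 1.0, 3.0, -1.0])

    # Pearson correlation: +1 means they move together, -1 means opposite.
    r = np.corrcoef(price_changes, opinion_changes)[0, 1]
    print(r)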

That's about it. My hope is that blogging will be enough motivation for me to finish the project. I'll try to keep this as up to date as possible, and put out as much code and data as I can.