Saturday, August 2, 2014

Data Mining News and the Stock Market

I've always wondered if there was a causal relationship between news and the stock market. If Apple's stock price drops 50%, you'll probably hear about it on the news, but if an article pops up online saying that Apple is using slave labor in China, what will happen to the stock price? It's a modern-day chicken-and-egg problem. Which comes first? Granted, I may be oversimplifying things. The stock market is a very dynamic system, but I thought I'd do some good old-fashioned data mining and see how far I can get.

I want to make some things clear up front. I don't know that much about the stock market. I'm not a data mining, machine learning, or natural language processing expert. I studied these things when I was in school, but I've slept since then. I have a pretty good foundation, but I've been doing .NET web development for the past few years. I'm nowhere near up to date on the latest tools, techniques, and practices.

I'm also well aware that this isn't necessarily a novel idea. As I've said before, I have a bad habit of testing things out before I google them. Besides, this is a great opportunity to test my limits, to see if I've still got it, to finish a project.

My general hypothesis is that internet news articles about a company have an effect on that company's stock price. My slightly more testable hypothesis is that a company's stock price will fall soon after the publication of a negative internet news article related to that company. I'll also be testing the inverse: that the price rises when good news articles come out. You get the idea. Here's my basic plan:

Step 1: Pick Some Companies

I'm going to pick some big-name companies on the NYSE. I'm limiting myself to just the one market to keep things simple. I'm not sure whether I'm going to stick to one particular market area. Should I only pick big tech companies, or should I pick big companies from several different market areas? The number of companies and areas that I pick will be determined by how easily I can collect data on them.

 

Step 2: Collect Prices

I'm going to need to collect as much pricing data as I can about these companies from the past 5-10 years. I'm not sure about the granularity of the data yet. I may just collect closing prices for a given day, or I may collect a lot more. I'm going to have to see what's out there.

 

Step 3: Collect News Articles

Collect news articles from a few different news sources about these companies. I'm not sure where I'm going to get the news, and I imagine that the choice of sources will have a large impact on my results. Hopefully there are some websites with APIs for this sort of thing, but a web scraper shouldn't be too difficult to script out. In the end, I need a dataset with each article's title, text, source, and date/time of publication. Also, to simplify things, I'm only going to look at news articles whose titles contain the name or stock symbol of the company. I should be able to work from there.
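
A minimal sketch of that dataset record and title filter, in Python. Everything here is an assumption for illustration: the `Article` fields simply mirror the four attributes listed above, and `mentions_company`, `filter_articles`, and the ticker symbol "XYZ" are hypothetical names, not part of any real news API. The name match is case-insensitive, while the symbol has to appear as a standalone upper-case word so that short tickers don't match inside ordinary words.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Article:
    """One row of the news dataset: title, text, source, publication time."""
    title: str
    text: str
    source: str
    published: datetime


def mentions_company(title: str, name: str, symbol: str) -> bool:
    """True if the title contains the company name (any case) or the
    ticker symbol as a standalone word (case-sensitive)."""
    name_hit = name.lower() in title.lower()
    symbol_hit = re.search(rf"\b{re.escape(symbol)}\b", title) is not None
    return name_hit or symbol_hit


def filter_articles(articles, name, symbol):
    """Keep only the articles whose titles mention the company."""
    return [a for a in articles if mentions_company(a.title, name, symbol)]


# Made-up sample data; "XYZ" is a placeholder ticker, not a real one.
sample = [
    Article("RadioShack Under $1: The Clock Is Ticking", "...", "example-source",
            datetime(2014, 8, 1, 9, 30)),
    Article("Shares of XYZ slide after earnings", "...", "example-source",
            datetime(2014, 8, 1, 10, 0)),
    Article("Five gadgets worth buying this fall", "...", "example-source",
            datetime(2014, 8, 1, 11, 0)),
]
kept = filter_articles(sample, name="RadioShack", symbol="XYZ")
```

A filter this simple will miss articles that refer to a company indirectly, which is the trade-off of only looking at titles.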

 

Step 4: Sentiment Analysis

This should be the fun part. I'm going to need to do some analysis on these news articles. Basically, I want to be able to measure public opinion of each of these companies over time. I'm going to need to see what's out there in terms of tools. I have some exposure to data/opinion mining, but it's been a few years, so I'm betting the tools that are out there now are a lot better than the ones I was working with. Hopefully, there are some tools out there that won't require any sort of annotation. It's really going to slow me down if I have to start reading these articles.
My main goal is to give each article a numerical score on how positive or negative it makes a company look. A news article titled "RadioShack Under $1: The Clock Is Ticking" would have a negative score. Let's say -10. What I'll do is keep a running total of this score. This way, I can measure public opinion of a company over time.
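
The running-total idea can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the word lists are tiny and made up, `score_title` is a hypothetical stand-in for whatever real sentiment tool ends up being used, and the scores are plain integers rather than anything calibrated.

```python
from datetime import date
from itertools import accumulate

# Tiny made-up word lists, purely for illustration.
POSITIVE = {"beats", "surges", "record", "growth", "upgrade"}
NEGATIVE = {"falls", "lawsuit", "recall", "bankruptcy", "downgrade"}


def score_title(title: str) -> int:
    """+1 for each positive word, -1 for each negative word in the title."""
    words = [w.strip(".,:;!?\"'").lower() for w in title.split()]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)


def running_opinion(scored_articles):
    """Take (date, score) pairs in any order and return
    (date, running_total) pairs in chronological order."""
    ordered = sorted(scored_articles, key=lambda pair: pair[0])
    totals = accumulate(score for _, score in ordered)
    return [(day, total) for (day, _), total in zip(ordered, totals)]


# Made-up scores; the -10 echoes the example score above.
history = running_opinion([
    (date(2014, 8, 2), -10),
    (date(2014, 8, 1), 3),
    (date(2014, 8, 3), 4),
])
```

Here `history` pairs each date with the cumulative score up to and including that day, which is the "public opinion over time" series the later comparison step would use.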

 

Step 5: Compare Price to Public Opinion?

This is where I start to get a little hazy. I'm not quite sure how I'm going to be able to tell if public opinion is actually having an effect. Sure, I might be able to look at some graphs and say so, but there's probably some fancy math that I can't remember that can show whether there's a correlation. I'm going to need to do more research before I can actually say what I'm going to do here, but hopefully I get some fancy graphs out of it.
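
One candidate for the "fancy math" is the Pearson correlation coefficient between a daily sentiment series and the daily price returns some number of days later. The sketch below is only a first check, not a proof of cause and effect, and it rests on assumptions: the function names are hypothetical, both series are assumed to be already aligned day by day, and the sample numbers are made up.

```python
from math import sqrt


def daily_returns(closes):
    """Fractional change between consecutive closing prices."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)


def lagged_correlation(sentiment, returns, lag=1):
    """Correlate sentiment on day t with the return on day t + lag.
    Both lists must cover the same days in the same order."""
    if lag < 0 or len(sentiment) != len(returns):
        raise ValueError("lag must be >= 0 and series must be the same length")
    return pearson(sentiment[: len(sentiment) - lag], returns[lag:])


# Made-up numbers: sentiment flips sign daily and returns follow a day later.
sentiment = [1, -1, 1, -1, 1]
returns = [0.0, 0.02, -0.02, 0.02, -0.02]
r = lagged_correlation(sentiment, returns, lag=1)
```

A value of `r` near +1 or -1 would suggest the two series move together at that lag, while a value near 0 would suggest no linear relationship. Trying several lags, including negative news following a price drop, would help with the chicken-and-egg question.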

That's about it. My hope is that blogging will be enough motivation for me to finish the project. I'll try to keep this as up to date as possible, and put out as much code and data as I can.
