Tuesday, August 5, 2014

Data Mining News and the Stock Market Part 2 - Picking Some Companies

This is the second part of my "Data Mining News and the Stock Market" post. I apologize for the stream-of-consciousness approach: I wrote this as I was doing the work.

A quick Google search for top NYSE companies led me to this Wall Street Journal article: NYSE Most Active Stocks. Seems like the perfect place to start. Now, I said before that the number of companies I pick is largely going to be determined by how easily I can get data, and how much of it I can get. So I need to do a little bit of work before I pick my companies. There are two things I need for each company: news articles and historical price data.


Collecting the News

While I was looking for a place to download the news, I stumbled across this academic paper from Columbia University: Snowball: Extracting Relations from Large Plain-Text Collections. I didn't read the paper, but I did see this:
Our experiments use large collections of real newspapers from the North American News Text Corpus, available from LDC. This corpus includes articles from the Los Angeles Times, The Wall Street Journal, and The New York Times for 1994 to 1997.
The LDC is the Linguistic Data Consortium, a membership organization that basically provides data to research labs. As a student at a university, I had access to all of this data. As a guy sitting at his laptop, it's going to cost me some money, and I doubt I could get access anyway. I decided to look around their site to see if they had an updated version of the corpus. They don't. The latest news corpus I could find was from 2008, and it was actually just a parsed version of that same corpus ending in 1997.

This may just be conjecture, but given the rise of the Internet, I suspect the relationship between news and the market has changed drastically over the past ten years, so data from the '90s probably won't help me. Generally it's best not to throw away data, but in some circumstances it can be justified. Reading through the LDC's site did give me a great list of news sources to look at:
  • Washington Post
  • New York Times
  • Wall Street Journal
  • Reuters News Service
The New York Times and Wall Street Journal are both paid subscription services. I don't think they'd appreciate me ripping off their data and putting it online, so they're out. But the Washington Post and Reuters both have pretty simple search pages. I should be able to write a web scraper for each of them, but before I do, I need to see how much data I can actually get.

Kimono

There's this pretty cool tool I heard about a while ago called Kimono. It lets you easily build APIs for websites; basically, it does all your web scraping for you. I'm not going to use it for my final project, but it should quickly give me a pretty good starting point.
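To give a feel for what Kimono produces: once it has built an API, you consume a plain JSON endpoint over HTTP. Here's a minimal Python sketch. The endpoint format is how I remember Kimono's docs describing it, and the API ID, key, and collection name are all placeholders, so treat this as an illustration rather than working code.

    import requests

    # Placeholders -- Kimono assigns the API ID, and the key comes from your account.
    API_URL = "https://www.kimonolabs.com/api/YOUR_API_ID"
    API_KEY = "YOUR_API_KEY"

    resp = requests.get(API_URL, params={"apikey": API_KEY})
    resp.raise_for_status()
    data = resp.json()

    # Kimono groups the scraped rows into named collections;
    # "collection1" is the default name it assigns.
    articles = data["results"]["collection1"]
    print("Got %d rows" % len(articles))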

First, I'm going to go through my news sources and search for one of the company names: "Bank of America". Then I'll see if I can build out an API for that company.

Washington Post

I had some trouble getting this to work. The individual pages come through fine, but Kimono doesn't seem to be able to handle the pagination, so I can't get much news this way. This will need a hand-written scraper; a sketch follows.
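For the record, here's roughly the scraper I have in mind, using requests and BeautifulSoup. The search URL, query parameter names, and CSS selector are guesses; the real ones have to come from inspecting the Post's search page in a browser, so this is a sketch of the approach, not a working scraper.

    import requests
    from bs4 import BeautifulSoup

    # Hypothetical search URL -- the real one comes from the Post's search page.
    SEARCH_URL = "http://www.washingtonpost.com/search"

    def scrape_search(query, max_pages=10):
        articles = []
        for page in range(1, max_pages + 1):
            resp = requests.get(SEARCH_URL, params={"q": query, "page": page})
            if resp.status_code != 200:
                break  # ran off the end of the results
            soup = BeautifulSoup(resp.text)
            # Placeholder selector -- find the real one by inspecting a result item.
            results = soup.select("div.search-result")
            if not results:
                break  # an empty page means we're past the last page
            for result in results:
                link = result.find("a")
                articles.append({"title": link.get_text(strip=True),
                                 "url": link["href"]})
        return articles

    print(len(scrape_search("Bank of America")))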

Reuters News Service


Kimono worked great with Reuters. It works exactly as advertised: in about 10 seconds, I had built out my API. I even made a nice little mobile app for it here!

Actually, the mobile app stopped working after I let it crawl for a while. Too much data, I guess. But still, pretty sweet.

I let the crawler go through 101 search pages and got back 914 articles, reaching as far back as Apr 23, 2014. To be precise, it got the title, URL, and date of publication for each of those 914 news articles.
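Since each record is just a title, URL, and date string, the bookkeeping is simple. Here's a sketch of how I'd dedupe the rows and compute the date range; the field names, the sample rows, and the "Apr 23, 2014" date format are my assumptions about what the crawl returns.

    from datetime import datetime

    # Rows as the crawler returns them -- these titles are made up,
    # and the field names are my assumption.
    rows = [
        {"title": "Example article A", "url": "http://reuters.com/a1",
         "date": "Apr 23, 2014"},
        {"title": "Example article A", "url": "http://reuters.com/a1",
         "date": "Apr 23, 2014"},
        {"title": "Example article B", "url": "http://reuters.com/a2",
         "date": "Jul 30, 2014"},
    ]

    # Dedupe on URL, since the same article can show up on several search pages.
    unique = list({row["url"]: row for row in rows}.values())

    # Parse the display dates (assuming the "Apr 23, 2014" format) to get the range.
    dates = [datetime.strptime(row["date"], "%b %d, %Y") for row in unique]
    print("%d articles from %s to %s"
          % (len(dates), min(dates).date(), max(dates).date()))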

That's not too great. Also, Reuters' pagination gets a little weird once you get to around page 100, and it keeps breaking Kimono.

It may be difficult to get data that goes very far back. Based on my Reuters search, I'll only be able to get a few months' worth of news on any given company. I was hoping for at least the last few years, but given that there are almost 1,000 articles about Bank of America in just the past few months, maybe a few years would be a bit much anyway.

So, I'm going to stick with my 100 companies and try to collect news as far back as I can. My guess is about 1,000 articles per company per news source. With 2 news sources, 100 companies, and 1,000 articles each, that's 2 × 100 × 1,000 = 200,000 news articles. Shouldn't be too bad. And since the news doesn't go back very far, getting the matching price data shouldn't be too difficult either.
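To close the loop on prices, here's a minimal sketch using pandas' built-in Yahoo Finance reader (pandas.io.data, as it exists at the time of writing), which easily covers the few months of history I need. The ticker list here is just a stand-in for the real 100 companies.

    from datetime import datetime
    import pandas.io.data as web

    # Stand-in tickers -- the real list is the ~100 most active NYSE companies.
    tickers = ["BAC", "GE", "F"]

    start = datetime(2014, 4, 23)  # earliest date my Reuters crawl reached
    end = datetime(2014, 8, 5)

    for ticker in tickers:
        # Daily open/high/low/close/volume from Yahoo Finance, one CSV per company.
        prices = web.DataReader(ticker, "yahoo", start, end)
        prices.to_csv("%s.csv" % ticker)
        print("%s: %d trading days" % (ticker, len(prices)))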




