Monday, August 11, 2014

Data Mining News and the Stock Market Part 4 - Collecting News Articles

This is the fourth part of my "Data Mining News and the Stock Market" series.

The first thing I did was set up a repo on GitHub. I needed a good name. "Newtsocks" is an anagram of "news" plus "stock", so I went with that.

Next, I set up an isolated Python virtual environment using virtualenv. It's pretty handy, and keeps me from cluttering up my computer. At first, I made the mistake of including the virtualenv directories in my repo. For future reference, don't do that; list the environment directory in .gitignore instead.

I also needed a database to store all the info I'm going to be grabbing. After a bit too much research, I decided to stick with a relational database. I'm just more comfortable with them, and figured I could build this out a little faster with a traditional RDBMS. Also, it's not important (right now) that my application be able to scale out.

Next, I wanted a good lightweight ORM. I know I'm doing basic web scraping here, and this probably seems like overkill, but I feel like the upfront work will save me some time later on. I decided to go with peewee. Seemed like it would be pretty simple to get running. And it was! I created my database, and added my list of companies.

Then I needed to do the actual scraping. I wrote up a simple script. You can check out the process I went through by looking at the history on the repo, but I ended up with something like this:
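
The repo history has the actual script. What follows is not it, just a minimal sketch of the general shape a scraper like this tends to take. It assumes the news page is plain HTML and that every link with visible text is a candidate article; the class and function names (`LinkExtractor`, `extract_links`, `fetch`) are hypothetical, and only the standard library is used.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect (href, link text) pairs from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> tag currently open, if any
        self._text = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the link text.
            title = " ".join("".join(self._text).split())
            if title:
                self.links.append((self._href, title))
            self._href = None


def extract_links(html):
    """Return a list of (href, title) tuples found in an HTML string."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def fetch(url):
    """Download a page and decode it as text (assumes UTF-8)."""
    with urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

A run would then be something along the lines of `extract_links(fetch(some_news_url))`, with each resulting pair saved to the database.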

Now I've got some data! It's just a prototype for the scraper. It's nothing novel, but it's working. I started running it about 30 minutes ago, and I already have a little under 5,000 news articles. I could start work on the scraper for the Washington Post next, but I'm more interested in getting to work on the sentiment analysis side of things.

I should add that I spent a bit too much time trying to make my code pretty. I added my repo to landscape.io because I wanted to see how my code was ranked. Landscape seems to have some trouble loading peewee. My code came up with a lot of errors on their site, even though it runs fine. I may be doing something wrong, but based on Landscape's issue tracker, this may be fixed soon. I'll keep an eye on it.

For now, I'm pushing forward. Time for some opinion mining!
