Thursday, May 5, 2016

The Office Ping Pong Ball Cannon

It's been almost a year since my last blog post. It's not that I haven't been doing anything. I've actually been doing a lot. I got promoted. I've started and stopped dozens of side projects. I'm getting married in a few weeks. But that's not what I want to write about right now. I want to write about this cannon I'm building for my office.

So we have a stand-up meeting every morning. In our stand-up we have a ball. You're not allowed to talk unless you have the ball. It's a complicated rule, I know, but it keeps the meetings short and simple. Here's the snag. Not all of our team members are in Plano TX. Some of them are in Appleton WI.

We have big TVs and cameras in our conference rooms that allow us to communicate pretty easily, but they don't really help when I need to toss a ball to Wisconsin. When we first started the ball rule, we just kind of awkwardly threw the ball at the camera, and they mimed catching it and throwing it back to us. I tried buying them some of the same balls we were using, but there was still something missing.

That's when I came up with the cannon. I wanted a device in Plano that I could throw a ball into, and a ball would launch in Appleton. Then when the Appleton guys threw the ball back in, the ball would launch in Plano. It would be like playing toss across state lines. Here are some of my initial sketches:







I drew those sketches back in March. In my spare time, I've been looking at parts, and putting some prototypes together. I've come a long way since those original sketches. I'll start putting up some posts to show my progress.

Now I'm a coder. Electronics are not my forte. This project was more about learning. I wanted to do something that was just outside of my technical abilities. There was a lot of talking with co-workers and looking up circuits online. I'm no electrical engineer. I just play one on the internet.




Friday, April 24, 2015

Office Prank - Fake Question Answering

I'm about to take a 2 month vacation. For me, there's always a certain amount of guilt and worry associated with taking a vacation. Is my team going to make their deadlines? Did I manage to meet all of my commitments? Are people going to hate me while I'm gone? A few weeks ago, I came up with the idea of writing a Deep Question Answering System to replace me while I'm on vacation. Kind of like Watson.

I would train it off of all the unstructured text of my email, Skype, Jabber, and Salesforce conversations. Maybe even throw in our dev wiki. I wanted to try doing this with the new IBM Watson APIs. It was going to be amazing!

With a 2 month backpacking trip coming up, there was no time to build something like this. So, saner heads prevailed. I decided to write a simple web app that just gives canned deflecting answers, and to tell everyone that I wrote a super awesome system. So I wrote a page that gives these answers randomly (the gist of the logic is sketched after the list):
  • "Yeah. I don't know."
  • "Brian might know."
  • "Object reference not set to instance of an object"
  • "I'm busy."
  • "Have you tried Googling it?"
  • "What are you talking about?"
  • "I don't remember..."
  • "I'm in a meeting. I'll come grab you after.
I sent everyone this email:


Then I gave everyone a link to the actual site: http://aaronmyster.github.io/laughing-shame/

It got a pretty good laugh. The interesting part was that people believed me! Not interesting in a "haha you're stupid for believing me" kind of way. More of a "It's crazy that this is possible" kind of way. It's amazing that we live in a time where something like this is even feasible. 

Honestly, it's not a bad idea for a project...

Thursday, October 16, 2014

Tracking my Sneezes

I didn't really write any code for this one, but I still thought it was fun. I've been keeping track of the number of times that I sneeze, and trying to correlate it with pollen counts from Pollen.com and Weather.com. Here are my (normalized) results so far.



It seems to track pretty well with Pollen.com. I'll see how it goes...


Friday, September 19, 2014

Visualizing a Processing Queue

As many of you know, I have a thing for fancy charts. It's been a while since I've posted anything, so I thought I'd take some time to share a new fancy chart I've come up with to view the processing time of queues.

The Gist

  • Every point in the graph represents a job processed by some queue.
  • The y-axis shows the total time the job was in the queue, including processing time.
  • The x-axis shows the time that the job completed.
  • The red line trailing behind the point represents the time that job waited before being processed.
  • The green line trailing behind the point represents the processing time for that job.


Why


We have many queues where I work. Some application will throw a job into a queue, and another application will pick up that job and process it. Usually these queues are implemented as tables in a SQL database, and they almost always have these columns in common:

  • Insertion Time (What time did we add the job to the queue?)
  • Start Time (What time did the application start the job?)
  • End Time (What time did the application finish the job?)
  • Success (Did the application finish successfully?)
Whenever there's a problem, the question "Why is it going slow?" inevitably ends up on someone's shoulders. Just looking at the times in our logs doesn't always give you a good answer. Ultimately, there are two things that need to be considered when jobs start to slow down.
  1. How long are jobs taking to process? (EndTime-StartTime)
  2. How long are jobs waiting before processing starts? (StartTime-InsertionTime)
My new graph answers these questions perfectly!

An extremely small how-to

Let's say I've got 3 different jobs that were processed. Jobs 1, 2, and 3 were inserted at times 1, 2, and 3, respectively. For simplicity, I'm going to show times as integers. Here's the log:

Job  InsertionTime  StartTime  StopTime  Success  Duration
1    1              1          3         1        2
2    2              3          5         0        3
3    3              5          7         1        4

Here, Duration is the total time in the processing queue (StopTime - InsertionTime).

Let's plot the Duration and StopTime together:

This gives me this:

Now let's add on the line segments:

This gives me my final output:

And that's it!
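If you want to play with the idea yourself, here's a minimal sketch of the same plot in Python with matplotlib (my actual version is a quick R script), using the three jobs from the table above:

    import matplotlib.pyplot as plt

    # (insertion, start, stop) for the three example jobs
    jobs = [(1, 1, 3), (2, 3, 5), (3, 5, 7)]

    fig, ax = plt.subplots()
    for ins, start, stop in jobs:
        duration = stop - ins                                        # total time in the queue
        ax.plot([ins, start], [duration, duration], color="red")     # waiting before processing
        ax.plot([start, stop], [duration, duration], color="green")  # processing time
        ax.plot(stop, duration, "ko")                                # one point per completed job

    ax.set_xlabel("Completion time (StopTime)")
    ax.set_ylabel("Total time in queue (Duration)")
    plt.show()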

Conclusion

This has proven to be extremely valuable. With this graph, I can quickly determine whether an application is genuinely taking too long, or if we just got hit with a bunch of input, and we need to start up new instances of said application. Ultimately, I can now run a quick R script and have a great picture of what's happening to our system as a whole.

Wednesday, August 13, 2014

Data Mining News and the Stock Market Part 5 - Opinion Mining

If it's not obvious by now, this is the fifth part of my "Data Mining News and the Stock Market" series, but if you don't want to go back and read my previous posts, here's where I'm at: I have a sqlite database that's filled with news articles about companies on the NYSE.

I don't have all the data that I'm going to need, but I have enough to start getting some sentiment. While doing some searching on how to do opinion mining in Python, I found this repo: https://github.com/kjahan/opinion-mining.
In the description, the guy described how he used a sentiment analysis API.
I don't mean to sound lazy, but that's just about the best news I've heard in a long time.

Let's be honest. For a one man project, this was going to be a bit of a nightmare. I could have spent my time annotating articles, POS tagging sentences, and building a dictionary of positive/negative tokens. But, it would have either taken me a really long time, or cost me a lot of money. In the end, it might not have even worked. I was really excited to find a simple API.

Also, classifications tend to work better in aggregate. Keeping that in mind, I can architect my code to use several different sentiment analysis APIs. My opinion mining will really just aggregate other people's opinion mining software! I only need to keep each API's pricing model and data restrictions in mind, and that can be controlled by software. To make things even easier, I found a list of APIs here: http://blog.mashape.com/list-of-20-sentiment-analysis-apis/


The API that the opinion-mining repo used is on Mashape, and will allow me 45,000 API calls per month for free ($0.01 per call after that). I stopped my scraping app after about 12,000 articles, when it was about 40% of the way through Reuters. I'm really only going to need to get any article's sentiment once, so this should work well for me as a test.

Adding Sentiment to my System

I've added some models to my database to keep track of all the APIs I'm going to be using. Each API will have a unique url and key, but they're all going to be called the same way. I've also added an API response object. This will contain the response that I get back, and a score. These APIs seem to have different kinds of results, so the score will have to be calculated differently depending on which one I'm using. Since these APIs are going to be used to get opinions, I'm calling them OpinionAPIs.
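Roughly, the models look something like this (a simplified peewee sketch; the actual field names in the repo may differ):

    from peewee import (CharField, FloatField, ForeignKeyField, Model,
                        SqliteDatabase, TextField)

    db = SqliteDatabase("newtsocks.db")  # filename assumed

    class BaseModel(Model):
        class Meta:
            database = db

    class OpinionAPI(BaseModel):
        name = CharField()
        url = CharField()   # each API has its own endpoint
        key = CharField()   # and its own key

    class APIResponse(BaseModel):
        api = ForeignKeyField(OpinionAPI)
        response = TextField()  # the raw response I get back
        score = FloatField()    # the score, calculated differently per API

    db.connect()
    db.create_tables([OpinionAPI, APIResponse])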

Now my machine learning is done by someone else! I can get the sentiment of an article with a script like this:
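Roughly speaking, anyway: the sketch below is just the general shape, with the request payload, headers, and response fields standing in for whatever the configured API actually expects.

    import requests

    # Article, OpinionAPI, and APIResponse are the peewee models described above
    # (the module name here is just a placeholder).
    from models import Article, OpinionAPI, APIResponse

    def score_article(article, api):
        resp = requests.post(
            api.url,
            data={"text": article.body},
            headers={"X-Mashape-Key": api.key},
        )
        result = resp.json()
        # Each API returns a different shape, so reduce it to a single number here.
        score = float(result.get("score", 0))
        APIResponse.create(api=api, response=resp.text, score=score)
        return score

    for article in Article.select():
        for api in OpinionAPI.select():
            score_article(article, api)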

After running this, I did a quick sanity check. Good articles seemed to have a high score, and bad articles had a low score! Done!

Now what?

So now I've got some work to do. I've got a little bit of data, and I have a good framework for adding classifiers. Next up:
  • Build more web scrapers
  • Get more data
  • Add more APIs
  • Get prices into the database (simple, but I still haven't done it)
After that, I'm going to do some exploratory data analysis to see where I'm at. Hopefully I'll be posting some pretty graphs and charts soon.

Some Concerns

Some of the assumptions I made throughout this project haven't really panned out. For instance, the number of articles I have per company is way off. I was expecting around 1000 articles per company, but I only have a handful of articles for several companies. It ranges anywhere from 0 to a little over 1000. I'll need to take a closer look at my Reuters scraper.

Also, the ranges of dates seem to be all over the place. Based on some of my initial testing, I was expecting to only have articles from around this year, but I have articles going all the way back to 2009. The pricing isn't really an issue because downloading price data is super cheap. I'm just worried about the inconsistency of my data. News and prices might have had a much different relationship 5 years ago. I can't be sure.

I may be getting a lot of articles that have nothing to do with anything. While browsing my database, I stumbled on an article about the Olympics. It had come up while searching for "Sprint".
"...with players forced to sprint and stretch behind the goal-lines in order to preserve the surface."
This is a huge concern for me. The whole article had a really negative score, and I have no way of knowing how many articles like this are in my database. I may have to search Reuters by stock ticker instead of by company name. I know I said earlier that I was going to look for the company name in the title, but that could just as easily have the same problem.

Monday, August 11, 2014

Data Mining News and the Stock Market Part 4 - Collecting News Articles

This is the fourth part of my "Data Mining News and the Stock Market" post.

First thing I did was set up a repo on github. I needed a good name. "Newtsocks" is an anagram for "news" and "stock", so I went with that.

Next, I set up an isolated python virtual environment using virtualenv. It's pretty handy, and keeps me from cluttering up my computer. At first, I made the mistake of including virtualenv directories in my repo. For future reference, don't do that.

I also needed a database to store all this info I'm going to be grabbing. After a bit too much research, I decided to stick with a relational database. I'm just more comfortable with them, and decided I could build this out a little faster if I stuck to a traditional RDBMS. Also, it's not important (right now) that my application be able to scale out.

Next, I wanted a good lightweight ORM. I know I'm doing basic web scraping here, and this probably seems like overkill, but I feel like the upfront work will save me some time later on. I decided to go with peewee. Seemed like it would be pretty simple to get running. And it was! I created my database, and added my list of companies.
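Getting the basics going really was only a handful of lines. Something along these lines (a sketch with guessed field names, not the exact schema):

    from peewee import CharField, Model, SqliteDatabase

    db = SqliteDatabase("newtsocks.db")  # filename assumed

    class Company(Model):
        name = CharField()
        ticker = CharField()

        class Meta:
            database = db

    db.connect()
    db.create_tables([Company])

    # e.g. loading the list of companies
    Company.create(name="Sprint", ticker="S")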

Then I needed to do the actual scraping. I wrote up a simple script. You can check out the process I went through by looking at the history on the repo, but I ended up with something like this:
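In spirit, it looks something like this (a sketch only: the search URL and the selector are placeholders, and Company and Article are peewee models along the lines of the one above):

    import requests
    from bs4 import BeautifulSoup

    from models import Company, Article  # placeholder module name

    SEARCH_URL = "http://www.reuters.com/search/news?blob={query}"  # placeholder

    def scrape_company(company):
        html = requests.get(SEARCH_URL.format(query=company.name)).text
        soup = BeautifulSoup(html, "html.parser")
        for link in soup.select("div.search-result a"):  # selector is a guess
            # Fetching each article's body would be a second request; left out of this sketch.
            Article.get_or_create(
                company=company,
                title=link.get_text(strip=True),
                url=link.get("href"),
            )

    for company in Company.select():
        scrape_company(company)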

Now I've got some data! It's just a prototype for the scraper. It's nothing novel, but it's working. I started running it about 30 minutes ago, and I've got a little less than 5000 news articles. I can start work on the scraper for the Washington Post, but I'm more interested in getting to work on the sentiment analysis aspect of it.

I would like to add that I spent a bit too much time trying to make my code pretty. I added my repo to landscape.io because I wanted to see how my code was ranked. Landscape seems to have some trouble loading peewee: my code came up with a lot of errors on their site, even though it runs fine. I may be doing something wrong, but based on Landscape's issue tracker, this may be fixed soon. I'll keep an eye on it.

For now, I'm pushing forward. Time for some opinion mining!

Wednesday, August 6, 2014

Data Mining News and the Stock Market Part 3 - Collecting Some Prices

This is the third part of my "Data Mining News and the Stock Market" post.
This one turned out to be super easy.
I found this website: http://www.eoddata.com/.
I had an account in a few minutes, and purchased all of the NYSE's end of day data from the past 5 years for $12.50. So far, that's the only money I've spent on this project, and it's well worth it. Since I can't get news data older than April, I really only need the last year. For $12, who cares? I'll take it!

I go to their download page, and this is what I see:

That's right. Less than 10 MB.
The zip files contain a text file for every day of 2014. The text files look like this:
I can definitely work with that.
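Pulling a day's prices out of one of these files only takes a few lines, assuming the usual comma-separated symbol, date, open, high, low, close, volume layout (the real files may differ slightly):

    import csv
    from collections import namedtuple

    # Assumed layout: Symbol,Date,Open,High,Low,Close,Volume
    Quote = namedtuple("Quote", "symbol date open high low close volume")

    def read_eod_file(path):
        with open(path) as f:
            for row in csv.reader(f):
                yield Quote(row[0], row[1], *map(float, row[2:7]))

    # e.g. quotes = list(read_eod_file("NYSE_20140806.txt"))  # filename pattern assumed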

Next up, I'm going to start work on my web scrapers. They're going to need a place to put all the articles I download. I haven't messed with any non-relational databases yet. This might be a good project to try one out.