What's so great about Knime?

Last March, the Knime Analytics Platform was named a Gartner Magic Quadrant leader (for the fifth time, according to Forest Grove Technology). This year's other leaders are Alteryx, SAS, RapidMiner, and H2O.ai. The best thing I learned from the announcement? Knime is open source and free for individual users, so I can afford to look at it!

Knime (silent "k"; rhymes with "dime") provides a graphical user interface for chaining together blocks that represent steps in a data science workflow. (So it's like Pentaho or Informatica, but for machine learning. Or LabVIEW, if you have an engineering background.)

It has dozens of built-in nodes for data access and transformation, statistical inference, machine learning, and PMML, plus nodes for custom Python, Java, R, and Scala code and a zillion community plugins (since it's open source, anyone can make a plugin). Even better, Knime imposes structure and modularity on a data science workflow by requiring that code fit into specified building blocks.

This post implements the Bayesian NFL model from last month in Knime. It adds the upstream and downstream workflows to pull new data each week and write the model output to a spreadsheet: enough for a first look at this tool.

Bayesian updating and the NFL

It's football season again, hooray! Every year for my friends' football pool I try out a different algorithm. Invariably, my picks are around 60% accurate. Not terrible, but according to NFL Pickwatch (archive, current season), the best pickers get to 68 or 69%. So, an amazing performance—my upper bound—is just under 70%, and the lower bound for a competitive model—the FiveThirtyEight baseline—is 60%.

I've been modeling NFL outcomes for a couple of years, running linear (predicting point spread) and logistic (predicting win probability) regressions on various team and player data. My best year so far incorporated the Vegas spread into the model, and my biggest disaster so far was an aggressive lasso model on every player in every offensive line, with team defenses lumped as a group. Tracking injuries, suspensions, and other changes to the starting lineup was not sustainable for the amount of time I wanted to spend.

Enter Nate Silver's awesome NFL Elo rankings, the aspirational target for this year. What's impressive is that he gets something like 60% accuracy out of literally no information but home field advantage and past scores. I particularly love that it updates weekly to incorporate the new information. This immediately says "Bayesian," and in fact it is a lot like how people using their intuition make their picks anyway. A system like his, but with a more straightforward Bayesian model, is the goal of this post.
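
To make the weekly-update idea concrete, here is a minimal sketch of the textbook Elo update in Python. It is not FiveThirtyEight's actual model, and the `k` factor and `home_edge` bonus are illustrative assumptions I picked for the example, as are the function names.

```python
def expected_score(rating_a, rating_b):
    """Textbook Elo win probability for team A against team B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(home, away, home_won, k=20.0, home_edge=65.0):
    """Return new (home, away) ratings after one game.

    k and home_edge are assumed values for illustration only.
    """
    # The home team is treated as if it were rated home_edge points higher.
    p_home = expected_score(home + home_edge, away)
    outcome = 1.0 if home_won else 0.0
    # The winner gains exactly what the loser gives up.
    delta = k * (outcome - p_home)
    return home + delta, away - delta


# Two evenly rated teams; the home team wins.
home, away = update(1500.0, 1500.0, home_won=True)
```

The only inputs are the previous ratings, who was at home, and who won, which matches the "no information but home field advantage and past scores" spirit of the approach.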

Modeling property tax assessment in Cook County, IL

The year my Mom moved in down the street from us, my husband tried to get some local property tax appeal company to reduce her assessment. They refused, saying they thought there wasn't a case.

The next year, she got a postcard from that same company: they would appeal her case and split the savings with her 50/50. Who wants to give up 50% of their tax savings? Plus, I was miffed from the prior year. I decided to try to appeal it myself. Success!

I used Selenium via its Python bindings to pull the data from the web, and statsmodels, whose interface resembles R, to build the model.

MCMC and the Ising Model

Markov chain Monte Carlo (MCMC) methods are a category of numerical techniques used in Bayesian statistics. They numerically estimate the distribution of a variable (the posterior) given two other distributions, the prior and the likelihood function, and are useful when direct integration of the product of the prior and the likelihood is not tractable.
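
As a rough illustration of the idea, here is a minimal random-walk Metropolis sampler (one of the simplest MCMC algorithms) using only the Python standard library. The problem is a made-up one: estimating a win probability `theta` from 6 wins in 10 games with a flat prior. The data, step size, and function names are all assumptions for the sketch.

```python
import math
import random


def log_prior(theta):
    # Flat prior on (0, 1); zero probability everywhere else.
    return 0.0 if 0.0 < theta < 1.0 else -math.inf


def log_likelihood(theta, wins, games):
    # Binomial log-likelihood, dropping the constant term.
    return wins * math.log(theta) + (games - wins) * math.log(1.0 - theta)


def metropolis(wins, games, n_samples=20000, step=0.1, seed=0):
    rng = random.Random(seed)
    theta = 0.5
    log_post = log_prior(theta) + log_likelihood(theta, wins, games)
    samples = []
    for _ in range(n_samples):
        # Propose a small random step away from the current value.
        proposal = theta + rng.gauss(0.0, step)
        lp = log_prior(proposal)
        if lp > -math.inf:
            lp += log_likelihood(proposal, wins, games)
            # Accept with probability min(1, posterior ratio).
            if rng.random() < math.exp(min(0.0, lp - log_post)):
                theta, log_post = proposal, lp
        # A rejected proposal repeats the current value in the chain.
        samples.append(theta)
    return samples


samples = metropolis(wins=6, games=10)
kept = samples[2000:]  # discard early samples as burn-in
posterior_mean = sum(kept) / len(kept)
```

Note that the sampler only ever compares the unnormalized posterior at two points, so the intractable integral never has to be computed. This toy problem happens to have an analytic answer (a Beta(7, 5) posterior with mean 7/12), which makes it handy for checking the sampler.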

I am new to Bayesian statistics, but became interested in the approach partly from exposure to the PyMC3 library, and partly from FiveThirtyEight's promoting it in a commentary soon after the p-hacking scandals a few years back (Simmons et al. coined 'p-hacking' in 2011, and Head et al. quantified the scale of the issue in 2014).

Until the 1980s, it was not realistic to use Bayesian techniques except when analytic solutions were possible. (Here's Wikipedia's list of analytic options. They're still useful.) MCMC opens up more options.
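
For a sense of what an analytic solution looks like, here is a minimal sketch of one textbook conjugate pair: a Beta prior on a win probability with win/loss data. The weekly results are made up for illustration, and the function names are my own.

```python
def beta_binomial_update(alpha, beta, wins, losses):
    """Posterior Beta parameters after observing wins and losses."""
    # With a Beta prior, the posterior is just the prior plus the counts.
    return alpha + wins, beta + losses


def beta_mean(alpha, beta):
    """Mean of a Beta(alpha, beta) distribution."""
    return alpha / (alpha + beta)


# Start from a flat Beta(1, 1) prior and fold in one result per week.
alpha, beta = 1.0, 1.0
weekly_results = [(1, 0), (0, 1), (1, 0), (1, 0)]  # hypothetical (wins, losses)
for wins, losses in weekly_results:
    alpha, beta = beta_binomial_update(alpha, beta, wins, losses)

estimate = beta_mean(alpha, beta)
```

No sampling is needed here because the posterior has a closed form. MCMC earns its keep when the model is not one of these convenient pairs.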

The Python library PyMC3 provides a suite of modern Bayesian tools: both MCMC algorithms and variational inference. One of its core contributors, Thomas Wiecki, wrote a blog post entitled "MCMC sampling for dummies," which was the inspiration for this post. It was enthusiastically received, and cited by people I follow as the best available explanation of MCMC. To my dismay, I didn't understand it, probably because he comes from a stats background and I come from engineering. This post is for people like me.
