Gaps in O'Reilly's Python repertoire

Tanya Schlusser 11 December 2014

Slides prepared using IPython Notebook. (Awesome quick tutorial... and how to 'Markdown')

Following along? Clone this: https://github.com/tanyaschlusser/ipython_talk__OReilly_python_books

Motivation

empty beer

But

No money Fran

So we

crying chipmunk

What to do?

Option 1

beg for beer

Option 2

chipmunk swim

Option Oh Heck Yeah

write a book

And the reaction?

Not naming names... B---- R--

scoff

Well I say

fart in general direction

Well, it would be a haul ...

...wouldn't be able to get a drink for the next 18 - 36 months...

But what to write?

There are over 120 Python publications by O'Reilly alone

Approach it systematically:

  1. See what's out there now
  2. Think about what we can contribute
  3. Lots of writing

So, what's out there now?

Of course we're going to use Python to find out. And of course the universe is only the size of O'Reilly.

In [1]:
# See what's out there now. Pull the:
#  -- media type (book | video)
#  -- title
#  -- publication date
import requests
from bs4 import BeautifulSoup

books_uri = "http://shop.oreilly.com/category/browse-subjects/programming/python.do?sortby=publicationDate&page=%d"
In [2]:
# Loop over all of the pages
results = []
description_results = {}
for page in range(1,5):
    result = requests.get(books_uri % page)
    # Name the parser explicitly so results don't depend on which parsers are installed
    soup = BeautifulSoup(result.text, "html.parser")
    books = soup.find_all("td", "thumbtext")
    for b in books:
        yr = b.find("span", "directorydate").string.strip().split()
        while not yr[-1].isdigit():
            yr.pop()
        yr = int(yr[-1])
        title = b.find("div", "thumbheader").text.strip()
        url = b.find("div", "thumbheader").find("a")["href"]
        hasvideo = "Video" in b.text
        results.append(dict(year=yr, title=title, hasvideo=hasvideo))
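The year-parsing loop above assumes every date string contains a numeric token. If one did not, `yr.pop()` would empty the list and `yr[-1]` would raise an `IndexError`. A minimal sketch of a more defensive variant follows; `extract_year` is a hypothetical helper and the date strings are made up for illustration, not taken from the O'Reilly pages.

```python
def extract_year(date_text):
    """Return the last all-digit token in date_text as an int, or None."""
    # Same idea as the loop above: scan tokens from the right and
    # take the first one that consists only of digits.
    for token in reversed(date_text.split()):
        if token.isdigit():
            return int(token)
    return None  # no numeric token found; the caller decides what to do


# Hypothetical date strings, made up for illustration.
for text in ["December 2014", "November 2010 (est.)", "Coming soon"]:
    print(repr(text), "->", extract_year(text))
```

Returning `None` instead of raising lets the scraping loop skip or flag an odd entry without losing the rest of the page.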
In [3]:
# Want to
#  -- plot year over year number of books
#      ++ stacked plot with video + print
#  -- Get all the different words in the titles
#     ++ count them
#     ++ and order by frequency
#
# Use the Matplotlib magic command. Magic commands start with '%'.
# This sets up to plot inline. It doesn't import anything...
# Or use %pylab inline -- this apparently imports a lot of things into
# the global namespace
#
%matplotlib inline
In [4]:
# For year over year I need pandas.DataFrame.groupby
# For stacked plot I need matplotlib.pyplot
# Plain dictionary for the word counts
#
import matplotlib.pyplot as plt
import pandas as pd
In [5]:
# Year over year -- number of publications by 'video' and 'print'.
#
df = pd.DataFrame(results)
byyear = pd.crosstab(df["year"],df["hasvideo"])
byyear.rename(columns={True:'video', False:'print'}, inplace=True)
byyear.plot(kind="area", xlim=(2000,2014), title="Ever increasing publications")
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x108cc9210>

Um

That actually wasn't very satisfying...

In [6]:
# Out of curiosity, what happened in 2010?
df[df["year"]==2010]
Out[6]:
    hasvideo  title                                          year
77  False     Python 2.6 Text Processing Beginner's Guide    2010
78  False     Programming Python, 4th Edition                2010
79  False     Python Geospatial Development                  2010
80  False     wxPython 2.8 Application Development Cookbook  2010
81  False     Python 2.6 Graphics Cookbook                   2010
82  False     Head First Python                              2010
83  False     Real World Instrumentation with Python         2010
84  False     Python Text Processing with NLTK 2.0 Cookbook  2010
85  False     MySQL for Python                               2010
86  False     Python Multimedia                              2010
87  True      Practical Python Programming: Callbacks        2010
88  False     Python 3 Object Oriented Programming           2010
89  False     Spring Python 1.1                              2010
90  False     Blender 2.49 Scripting                         2010
91  False     Professional IronPython                        2010
92  False     Grok 1.0 Web Development                       2010
93  False     Beginning Python                               2010
94  False     Python Testing                                 2010
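The word count planned earlier (every distinct word in the titles, ordered by frequency) could be sketched with `collections.Counter`. This is an illustration under stated assumptions, not the notebook's own cell: the four titles are copied from the 2010 listing above, and the stop-word set and the tokenizing regex are choices made here.

```python
import re
from collections import Counter

# A few titles copied from the 2010 listing above.
titles = [
    "Python 2.6 Text Processing Beginner's Guide",
    "Python Text Processing with NLTK 2.0 Cookbook",
    "Python 2.6 Graphics Cookbook",
    "Head First Python",
]

# Assumed stop words; extend to taste.
stop_words = {"with", "for", "the", "and", "a", "of"}

word_counts = Counter()
for title in titles:
    # Lowercase, then keep runs of letters and apostrophes,
    # so version numbers such as "2.6" drop out.
    words = re.findall(r"[a-z']+", title.lower())
    word_counts.update(w for w in words if w not in stop_words)

# most_common() orders words by descending frequency.
for word, count in word_counts.most_common(5):
    print(word, count)
```

With the full scrape, `titles` would be `df["title"]` instead of the hand-copied list.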
In [14]:
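# percent_overlap and sorted_titles are assumed to be defined in earlier cells.
from scipy.cluster.hierarchy import dendrogram, linkage

plt.figure(figsize=(5,20))
# Single-linkage clustering; each row of percent_overlap is one book
data_link = linkage(percent_overlap, method='single', metric='euclidean')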

den = dendrogram(data_link,labels=sorted_titles, orientation="left")

plt.ylabel('Samples', fontsize=9)
plt.xlabel('Distance')
plt.suptitle('Books clustered by description similarity', fontweight='bold', fontsize=14);
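How `percent_overlap` is built is not shown alongside the dendrogram cell. One plausible construction, purely an assumption here, is a square matrix in which row i holds the share of book i's description words that also appear in each other book's description. Given a 2-D array, `linkage` treats each row as one observation vector, so a matrix of this shape would fit the call above. The helper name and the descriptions below are made up for illustration.

```python
def percent_overlap_matrix(descriptions):
    """Row i, column j: fraction of description i's distinct words found in description j."""
    word_sets = [set(text.lower().split()) for text in descriptions]
    matrix = []
    for words_i in word_sets:
        row = []
        for words_j in word_sets:
            shared = len(words_i & words_j)
            # Guard against an empty description to avoid dividing by zero.
            row.append(shared / len(words_i) if words_i else 0.0)
        matrix.append(row)
    return matrix


# Made-up descriptions, for illustration only.
descriptions = [
    "learn python web development",
    "python web scraping",
    "statistics with numpy",
]
overlap = percent_overlap_matrix(descriptions)
for row in overlap:
    print(["%.2f" % value for value in row])
```

The matrix is not symmetric, because each row is normalized by its own description's word count. A symmetric measure such as Jaccard similarity would be an equally reasonable choice.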
In [9]:
from IPython.display import HTML

container = """
    <a href='http://bl.ocks.org/mbostock/4063570'>Layout attribution: Michael Bostock</a>
    <div id='display_container'></div>"""

with open("data/d3-stacked-Tree.js") as infile:
    display_js = infile.read()

with open("data/human_hclust.json") as infile:
    the_json = infile.read()

HTML(container + display_js % the_json )
Out[9]:
Layout attribution: Michael Bostock

Quick interpretation

It makes sense that topics have little overlap. Otherwise why write a different book? Do we have anything to contribute?

  • Pure programming
    • Unless we write our own library, I think this space is full
    • Plus, any book should be better than Udacity / existing blogs, or why do it?
  • Games / Web
    • Lots of expertise here, right? (Not me, but what percent of ChiPy does this?)
  • Scientific / Hobby
    • The other half of ChiPy? The finance guys? (What percent?)
    • We could actually contact the PyMC guy Thomas Wiecki and offer to help him out

Thank you -- what next?

  • Contact me if you want to do something
  • Ideas welcome
  • This may or may not happen
    • depends on interest
    • and whether interest can be sustained

There was support for a 'personality piece'

  • About local Chicago successes (Thanks Don Sheu!)
  • Idea:
    • Python in production seems newer, and we have a handful of local companies who have gone this route
    • A recent Berkeley paper summarizes today's enterprise analytics pipeline:
      • Three types of people:
        • App users (e.g. Business Objects, Qlikview)
        • Scripters (e.g. SAS)
        • Hackers (whole workflow: SQL (100%) + Python(63%) + various)
      • Some issues:
        • Visualization at the data exploration stage
        • Managing the workflow
        • Isolated inaccessible 3rd party data
    • We've (as a community) figured out how to do this -- could this be our book?
      • How different successful local firms use Python in production

Postscript: IPython evangelism

Making this deck in IPython was life-changing -- Python talks belong in IPython.

How to:

  1. .github.io repo: (instructions)
  2. Make the notebook

     pip install ipython
     ipython notebook  # Make something.
    

    Remember to mark which cells are slides (Cell Toolbar > Slideshow in the notebook).

  3. Convert to html slideshow

     export PREFIX=http://cdn.jsdelivr.net/reveal.js/2.6.2
     ipython nbconvert <my_notebook>.ipynb \
         --to slides \
         --reveal-prefix ${PREFIX}
    
  4. Add the new slides to the .github.io repo. The slides are served statically

  5. Wait about 10 minutes and the slides are there

Also: