Gaps in O'Reilly's Python repertoire

Tanya Schlusser 11 December 2014

Slides prepared using IPython Notebook. (Awesome quick tutorial... and how to 'Markdown')

Following along? Clone this: https://github.com/tanyaschlusser/ipython_talk__OReilly_python_books

Motivation

empty beer

But

No money Fran

So we

crying chipmunk

What to do?

Option 1

beg for beer

Option 2

chipmunk swim

Option Oh Heck Yeah

write a book

And the reaction?

Not naming names... B---- R--

scoff

Well I say

fart in general direction

Well, it would be a haul ...

...wouldn't be able to get a drink for the next 18 - 36 months...

But what to write?

There are over 120 Python publications by O'Reilly alone

Approach it systematically:

See what's out there now
Think about what we can contribute
Lots of writing

So, what's out there now?

Of course we're going to use Python to find out. And of course the universe is only the size of O'Reilly.

In [1]:

# See what's out there now. Pull the:
#  -- media type (book | video)
#  -- title
#  -- publication date
import requests
from bs4 import BeautifulSoup

books_uri = "http://shop.oreilly.com/category/browse-subjects/programming/python.do?sortby=publicationDate&page=%d"

In [2]:

# Loop over all of the pages
results = []
for page in range(1,5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        # Split the date text into tokens, drop any trailing
        # non-numeric tokens, and keep the year.
        yr = b.find("span", "directorydate").string.strip().split()
        while not yr[-1].isdigit():
            yr.pop()
        yr = int(yr[-1])
        title = b.find("div", "thumbheader").text.strip()
        hasvideo = "Video" in b.text
        results.append(dict(year=yr, title=title, hasvideo=hasvideo))
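The year extraction in that loop can be pulled out into a small helper. This is only a sketch: `parse_year` and the sample strings are made up for illustration, and it assumes the date text contains an all-digit year token somewhere before any trailing junk.

```python
def parse_year(date_text):
    """Return the last all-digit token in date_text as an int.

    Same idea as the scraping loop: drop trailing tokens until one is
    numeric. Raises ValueError when no numeric token is present.
    """
    tokens = date_text.strip().split()
    while tokens and not tokens[-1].isdigit():
        tokens.pop()
    if not tokens:
        raise ValueError("no year found in %r" % date_text)
    return int(tokens[-1])


# Hypothetical date strings, not copied from the O'Reilly pages.
print(parse_year("December 2014"))
print(parse_year("  June 2010 (est.) "))
```

Checking `tokens` before indexing means a date with no digits at all fails with a readable error instead of an `IndexError` on an empty list.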

In [3]:

# Want to
#  -- plot year over year number of books
#      ++ stacked plot with video + print
#  -- Get all the different words in the titles
#     ++ count them
#     ++ and order by frequency
#
# Use the Matplotlib magic command. Magic commands start with '%'.
# This sets up to plot inline. It doesn't import anything...
# Or use %pylab inline -- this apparently imports a lot of things into
# the global namespace
#
%matplotlib inline

In [4]:

# For year over year I need pandas.DataFrame.groupby
# For stacked plot I need matplotlib.pyplot
# Plain dictionary for the word counts
#
import matplotlib.pyplot as plt
import pandas as pd

In [5]:

# Year over year -- number of publications by 'video' and 'print'.
#
df = pd.DataFrame(results)
byyear = pd.crosstab(df["year"],df["hasvideo"])
byyear.rename(columns={True:'video', False:'print'}, inplace=True)
byyear.plot(kind="area", xlim=(2000,2014), title="Ever increasing publications")

Out[5]:

<matplotlib.axes._subplots.AxesSubplot at 0x108cc9210>
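For anyone curious what the crosstab is counting, here is a rough standard-library equivalent. It is a sketch, not what pandas does internally: `count_by_year` and `sample_results` are hypothetical, and the records only imitate the shape of `results`.

```python
from collections import Counter

# Hypothetical records in the same shape as `results`.
sample_results = [
    {"year": 2010, "title": "Book A", "hasvideo": False},
    {"year": 2010, "title": "Video B", "hasvideo": True},
    {"year": 2010, "title": "Book C", "hasvideo": False},
    {"year": 2014, "title": "Book D", "hasvideo": False},
]


def count_by_year(records):
    """Map year -> {'print': n, 'video': m}, like the renamed crosstab."""
    counts = Counter(
        (r["year"], "video" if r["hasvideo"] else "print") for r in records
    )
    table = {}
    for (year, kind), n in counts.items():
        # Start every year with both columns at zero so none is missing.
        table.setdefault(year, {"print": 0, "video": 0})[kind] = n
    return table


for year, row in sorted(count_by_year(sample_results).items()):
    print("%d\tprint=%d\tvideo=%d" % (year, row["print"], row["video"]))
```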

Um

That actually wasn't very satisfying...

In [6]:

# Out of curiosity, what happened in 2010?
df[df["year"]==2010]

Out[6]:

	hasvideo	title	year
77	False	Python 2.6 Text Processing Beginner's Guide	2010
78	False	Programming Python, 4th Edition	2010
79	False	Python Geospatial Development	2010
80	False	wxPython 2.8 Application Development Cookbook	2010
81	False	Python 2.6 Graphics Cookbook	2010
82	False	Head First Python	2010
83	False	Real World Instrumentation with Python	2010
84	False	Python Text Processing with NLTK 2.0 Cookbook	2010
85	False	MySQL for Python	2010
86	False	Python Multimedia	2010
87	True	Practical Python Programming: Callbacks	2010
88	False	Python 3 Object Oriented Programming	2010
89	False	Spring Python 1.1	2010
90	False	Blender 2.49 Scripting	2010
91	False	Professional IronPython	2010
92	False	Grok 1.0 Web Development	2010
93	False	Beginning Python	2010
94	False	Python Testing	2010

In [7]:

# Break up the titles and count words in the titles.
#  -- Need a regex for the punctuation (commas, colons, and periods)
#  -- Need a stemmer for plurals, possessives, and verb conjugations
import re
from nltk.stem.porter import PorterStemmer

space_or_punct = re.compile(r"[\s,:\.]+")
stemmer = PorterStemmer()

title_words = {}
for r in results:
    title = space_or_punct.split( r["title"].lower() )
    stemmed_title = (stemmer.stem(t) for t in title)
    for t in stemmed_title:
        # skip empty strings left over from the split, and
        # don't retain version or release numbers
        if t and not t[0].isdigit():
            if t not in title_words:
                title_words[t] = 1
            else:
                title_words[t] += 1

print "Total distinct words in the titles:", len(title_words), "\n"
print "\t".join(title_words.keys())
print "\n"
print "\n".join(r["title"] for r in results if not r["hasvideo"])

Total distinct words in the titles: 158 

** Truncated for brevity **
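The same count, ordered by frequency, is shorter with `collections.Counter`. This is a sketch under two assumptions: stemming is left out so it runs on the standard library alone, and `count_title_words` is a hypothetical helper name. The sample titles come from the 2010 listing.

```python
import re
from collections import Counter

# Same separators as the title-counting cell: whitespace, commas, colons, periods.
SEPARATORS = re.compile(r"[\s,:.]+")


def count_title_words(titles):
    """Count lowercase title words, skipping empty and digit-leading tokens."""
    counts = Counter()
    for title in titles:
        for word in SEPARATORS.split(title.lower()):
            if word and not word[0].isdigit():
                counts[word] += 1
    return counts


sample_titles = [
    "Python 2.6 Graphics Cookbook",
    "Python 2.6 Text Processing Beginner's Guide",
    "Head First Python",
]
counts = count_title_words(sample_titles)
# most_common() returns (word, count) pairs, highest count first.
for word, n in counts.most_common(3):
    print("%s\t%d" % (word, n))
```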

In [11]:

# That was useless -- almost every word except for "Python" and "Learn"
# shows up only in one title. Lame.
#
# Loop over all of the pages and get the book descriptions.
# Maybe see if some things in the descriptions show common topics
# (hoping for 'introductory' or 'web development' or 'machine learning')
import nltk
# Before running the below you need to do nltk.download() and select 'stopwords' from the corpus
from nltk.corpus import stopwords

english_stops = stopwords.words("english")
description_results = {}
for page in range(1,5):
    result = requests.get(books_uri % page)
    soup = BeautifulSoup(result.text)
    books = soup.find_all("td", "thumbtext")
    for b in books:
        title = b.find("div", "thumbheader").text.strip()
        # Only look at the books
        if "Video" not in b.text:
            print ".",
            url = "http://shop.oreilly.com" + b.find("div", "thumbheader").find("a")["href"]
            result2 = requests.get(url)
            soup2 = BeautifulSoup(result2.text)
            description = soup2.find("div", "detail-description-content").text
            description_results[title] = set([stemmer.stem(word)
                                          for word in space_or_punct.split(description.lower())
                                          if word not in english_stops])
print

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

In [12]:

# Try clustering:
#   -- Similarity between two book descriptions is the percent overlap
#      of their word sets (shared words / all words in either description)
#   -- Arbitrarily (from qualitative looking) decide on a threshold of
#      17% overlap in descriptions for two books to be 'similar'
#      and look at what we get
percent_overlap = []
min_intersections = []
max_intersections = []
avg_intersections = []
similar_books = {}

sorted_titles = sorted(description_results.keys())
for i in range(len(sorted_titles)):
    this_description = description_results[sorted_titles[i]]
    
    def get_percent_overlap(title):
        # Shared words as a percent of all words in either description
        intersection_size = len(this_description.intersection(description_results[title]))
        union_size = len(this_description.union(description_results[title]))
        return (intersection_size * 100) / union_size
        
    percent_overlap.append([get_percent_overlap(t) for t in sorted_titles])
    similar_books[sorted_titles[i]] = [
            t for t in sorted_titles
            if get_percent_overlap(t) > 17 and t != sorted_titles[i]
    ]
    min_intersections.append(round(min(percent_overlap[-1])))
    max_intersections.append(round(max(percent_overlap[-1])))
    avg_intersections.append(round(sum(percent_overlap[-1]) / float(len(sorted_titles))))

print "\n".join("\n%s\n%s" % (k, "|".join(v)) for k, v in similar_books.iteritems())


** Truncated for brevity **
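The overlap measure is intersection over union of the two word sets, also known as Jaccard similarity. Here it is as a standalone sketch: `jaccard_percent`, `similar_titles`, and the sample word sets are hypothetical, and only the 17% threshold comes from the cell above.

```python
def jaccard_percent(a, b):
    """Percent overlap of two sets: 100 * |a & b| / |a | b|."""
    union = a | b
    if not union:
        return 0.0  # two empty descriptions: treat the overlap as zero
    return 100.0 * len(a & b) / len(union)


def similar_titles(descriptions, threshold=17):
    """Map each title to the other titles whose overlap exceeds threshold."""
    return {
        title: sorted(
            other for other in descriptions
            if other != title
            and jaccard_percent(words, descriptions[other]) > threshold
        )
        for title, words in descriptions.items()
    }


# Hypothetical word sets standing in for the stemmed descriptions.
sample = {
    "Book A": {"web", "django", "deploy", "test"},
    "Book B": {"web", "django", "template", "test"},
    "Book C": {"numpy", "plot", "statist"},
}
print(similar_titles(sample))
```

Multiplying by `100.0` keeps the division in floating point on Python 2 as well, so small overlaps are not truncated to whole percents.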

In [14]:

from scipy.cluster.hierarchy import dendrogram, linkage

plt.figure(figsize=(5,20))
data_link = linkage(percent_overlap, method='single', metric='euclidean')

den = dendrogram(data_link,labels=sorted_titles, orientation="left")

plt.ylabel('Samples', fontsize=9)
plt.xlabel('Distance')
plt.suptitle('Books clustered by description similarity', fontweight='bold', fontsize=14);

In [9]:

from IPython.display import HTML

container = """
    <a href='http://bl.ocks.org/mbostock/4063570'>Layout attribution: Michael Bostock</a>
    <div id='display_container'></div>"""

with open("data/d3-stacked-Tree.js") as infile:
    display_js = infile.read()

with open("data/human_hclust.json") as infile:
    the_json = infile.read()

HTML(container + display_js % the_json )

Out[9]:

Layout attribution: Michael Bostock
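The `display_js % the_json` step is plain %-formatting: the JavaScript file has a single `%s` slot where the JSON goes. Here is a minimal sketch of that pattern. The template and data are hypothetical stand-ins, not the contents of data/d3-stacked-Tree.js or data/human_hclust.json.

```python
import json

# Hypothetical template. A literal percent sign has to be written as %%
# so that %-formatting leaves it alone.
display_js_template = """
<script>
  var tree = %s;
  var width = "100%%";
  console.log(tree.name, width);
</script>"""

# Hypothetical tree data in the nested name/children shape D3 layouts use.
tree = {"name": "root", "children": [{"name": "Book A"}, {"name": "Book B"}]}


def render(template, data):
    """Serialize data to JSON and substitute it for the single %s slot."""
    return template % json.dumps(data)


html = render(display_js_template, tree)
print(html)
```

In the notebook, a string like `html` would be passed to `IPython.display.HTML` so that the browser runs the script.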

Quick interpretation

It makes sense that topics have little overlap. Otherwise why write a different book? Do we have anything to contribute?

Pure programming
- Unless we write our own library, I think this space is full
- Plus, any book should be better than Udacity / existing blogs, or why do it?
Games / Web
- Lots of expertise here, right? (Not me, but what percent of ChiPy does this?)
Scientific / Hobby
- The other half of ChiPy? The finance guys? (What percent?)
- We could actually contact the PyMC guy Thomas Wiecki and offer to help him out

Thank you -- what next?

Contact me if you want to do something
Ideas welcome
This may or may not happen
- depends on interest
- and whether interest can be sustained

There was support for a 'personality piece'

About local Chicago successes (Thanks Don Sheu!)
Idea:
- Python in production seems newer, and we have a handful of local companies who have gone this route
- A recent Berkeley paper summarizes today's enterprise analytics pipeline:
  - Three types of people:
    - App users (e.g. Business Objects, Qlikview)
    - Scripters (e.g. SAS)
    - Hackers (whole workflow: SQL (100%) + Python (63%) + various)
  - Some issues:
    - Visualization at the data exploration stage
    - Managing the workflow
    - Isolated inaccessible 3rd party data
- We've (as a community) figured out how to do this -- could this be our book?
  - How different successful local firms use Python in production

Postscript: IPython evangelism

Making this deck in IPython was life-changing -- Python talks belong in IPython.

How to:

.github.io repo: (instructions)

Make the notebook

 pip install ipython
 ipython notebook  # Make something.

Remember to identify the slides: slide

Convert to html slideshow

 export PREFIX=http://cdn.jsdelivr.net/reveal.js/2.6.2
 ipython nbconvert <my_notebook>.ipynb \
     --to slides \
     --reveal-prefix ${PREFIX}

Add the new slides to the .github.io repo. The slides are served statically
Wait about 10 minutes and the slides are there

Also:

Can include javascript -- even libraries (example with D3)
As customizable as your imagination, using reveal.js for the slides