SlideShare presentation: Use 1.5 images per slide to maximize views

What do you remember better? Pictures or words?

Likely, you said pictures. And there is clear evidence to support you.

So it makes sense to use pictures in your presentations to maximize results.

The more stunning images you can include, the better. — SlideShare Blog

Hmm. Does that mean your presentation should be chock-full of images? After all, the more the better?

In this post, we're going to analyze presentations from SlideShare and try to determine the number of images that maximizes views. You know that images work but nobody has ever told you how many images to shoot for. Today you are in luck.

SlideShare is a global hub of professional content and presentations and we're going to source the presentations for our analysis from there.

Load data

The data contains statistics on SlideShare presentations. Let's load the data and see what it looks like. I extracted the data, presentations.csv, from SlideShare using my script here.

In [1]:
import pandas as pd
presns = pd.read_csv("presentations.csv")

# Select relevant columns
presns = presns[['views', 'days_posted', 'comments', 'likes', \
                 'total_images', 'total_slides', 'tweets']]
print("The data contains {0} presentations and {1} stats per \
presentation.".format(presns.shape[0], presns.shape[1]))
print("\n")
print(presns.head())
The data contains 6743 presentations and 7 stats per presentation.


     views  days_posted  comments  likes  total_images  total_slides  tweets
0  1236764          683         6    201            92            58      40
1    22020          838         2     94            53            23      57
2     9599          717         0      7            52            24       0
3    44136         1833         5     42            54            21       1
4    10394         1278         0      5            42            14       0

Here's a quick explanation of the important columns in the dataset:

  • views: how many people viewed the presentation
  • days_posted: how many days the presentation was posted
  • comments: how many comments the presentation received
  • likes: how many 'likes' the presentation received
  • total_images: total number of images in the presentation
  • total_slides: total number of slides in the presentation
  • tweets: how many times the presentation was tweeted

Create a plot function

Let's define a function to plot a given statistic against views. The plot gives us an idea of how that statistic relates to views, and the function also prints the slope and p-value of a straight-line fit.

In [2]:
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress

def plot_func(x, xlabel, y, ylabel):
    '''This function plots presentation statistic against number of views'''

    # fit with np.polyfit
    fit = np.polyfit(x,y,1)
    fit_fn = np.poly1d(fit)
    plt.plot(x,y, 'bo', x, fit_fn(x), '--k')

    #plt.xlim(0, 15000)
    #plt.ylim(0, 1500)

    plt.xlabel(xlabel)
    plt.ylabel(ylabel)

    plot_stats = linregress(x,y)

    print("\n")
    print("p-value: {0}".format(plot_stats.pvalue))
    
    #fig = plt.figure()
    #fig.savefig("fig1.png")
    
    plt.show()
    
    print("Slope: {0}".format(plot_stats.slope))
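If you only want the numbers and not the plot, a stripped-down variant could look like the sketch below. The helper name fit_summary and the toy data are made up for illustration; only linregress and its slope and pvalue fields come from the function above.

```python
import numpy as np
from scipy.stats import linregress

def fit_summary(x, y):
    """Return (slope, p-value) of a least-squares line through (x, y)."""
    result = linregress(x, y)
    return result.slope, result.pvalue

# Toy data (made up): y rises by about 2 for every unit of x, plus small noise
x_demo = np.arange(10, dtype=float)
noise = np.array([0.1, -0.1, 0.05, -0.05, 0.0, 0.1, -0.1, 0.05, -0.05, 0.0])
y_demo = 2.0 * x_demo + 1.0 + noise

slope_demo, p_demo = fit_summary(x_demo, y_demo)
print("Slope: {0:.2f}, p-value: {1:.3g}".format(slope_demo, p_demo))
```

The slope tells you how much y changes per unit of x, and the p-value tests whether that slope is distinguishable from zero.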

Check data

Whenever possible, it's a good idea to do an internal check on your data to see if things make sense. In our case, we can plot comments against views and see what the relationship looks like.

We expect that presentations that have more views are more likely to have more comments (and more likes and tweets too). If 10 people viewed your presentation, you can get at most 10 comments. If 10,000 people viewed it, you can probably get a lot more comments.

Let's see if our data bears this out.

In [3]:
y = presns.comments
x = presns.views
plot_func(x,"Views", y, "Comments")

p-value: 0.0
Slope: 2.770516818355093e-05

Yes, the more views, the more comments, as we expected! The same thing happens if we plot likes and tweets against views.

Okay, we're done with the sanity check. Let's get to the reason we're here: how do images in your presentations relate to views?

Explore image-view relationship

First, let's plot views against total_images.

In [4]:
y = presns.views
x = presns.total_images
plot_func(x,"Total images", y, "Views")

p-value: 0.009773074513231372
Slope: 191.38615994115028

It seems that the more images in the presentation, the more views. That's fine and good, but just counting the number of images in a presentation does not tell the whole story. One presentation may have 100 slides and 100 pictures, and another may have 10 slides and 100 pictures. The number of pictures is the same, but the images per slide are different, and that seems like something we want to tease out.

We'll compute a new variable, images_per_slide, to help us do that.

Also, some presentations have been posted for hundreds of days whereas others have been posted for only 1 day. To get a sense of how popular a presentation is, we really need to know views_per_day. Otherwise, we'd be comparing apples to oranges.

In [5]:
# It's okay to turn off the pandas chained-assignment warning for this analysis
pd.options.mode.chained_assignment = None

# Compute images per slide
presns['images_per_slide'] = presns.total_images / presns.total_slides

# Compute views per day
presns['views_per_day'] = presns.views / presns.days_posted
print(presns.head())
     views  days_posted  comments  likes  total_images  total_slides  tweets  \
0  1236764          683         6    201            92            58      40   
1    22020          838         2     94            53            23      57   
2     9599          717         0      7            52            24       0   
3    44136         1833         5     42            54            21       1   
4    10394         1278         0      5            42            14       0   

   images_per_slide  views_per_day  
0          1.586207    1810.781845  
1          2.304348      26.276850  
2          2.166667      13.387727  
3          2.571429      24.078560  
4          3.000000       8.133020  
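To make the two ratios concrete, here is a small sketch using the two decks from the example above (100 images on 100 slides versus 100 images on 10 slides). The view counts, the third row, and the zero-handling are assumptions added for illustration. The original analysis divides directly, which would produce inf if a presentation had zero slides or zero days posted.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: the two decks from the example, plus one with zero denominators
demo = pd.DataFrame({
    "total_images": [100, 100, 5],
    "total_slides": [100, 10, 0],
    "views": [5000, 5000, 40],
    "days_posted": [100, 1, 0],
})

# Replace zero denominators with NaN so the ratios come out as NaN rather than inf
slides = demo["total_slides"].replace(0, np.nan)
days = demo["days_posted"].replace(0, np.nan)

demo["images_per_slide"] = demo["total_images"] / slides
demo["views_per_day"] = demo["views"] / days
print(demo[["images_per_slide", "views_per_day"]])
```

Both of the first two decks have the same image count and the same total views, yet their ratios differ by a factor of 10 and 100 respectively, which is exactly what the raw totals hide.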

Now, let's repeat our last plot and use views_per_day and images_per_slide instead.

In [6]:
y = presns.views_per_day
x = presns.images_per_slide
plot_func(x,"Images per slide", y, "Views per day")

p-value: 0.032789832837888354
Slope: -4.7436582036219335

We have a significant p-value (about 0.03) and a negative slope. While images seem good for presentations, as we saw earlier, this plot says you shouldn't go overboard with them. The negative slope suggests that fewer images per slide are better than lots of them.

Eye-balling the plot, it seems that most of the heavily viewed presentations cluster around 1 to 2 images_per_slide. So there you have your sweet spot. Aim for 1 to 2 images_per_slide to maximize views!
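To read the slope in plain terms: under the straight-line fit, each additional image per slide is associated with roughly 4.7 fewer views per day. The sketch below just does that arithmetic. The helper name predicted_change is hypothetical, and the calculation assumes the linear fit holds across the range, which is a strong assumption for data this scattered.

```python
# Slope copied from the regression output above:
# change in views per day for each additional image per slide
slope = -4.7436582036219335

def predicted_change(slope, delta_images_per_slide):
    """Change in views per day implied by a straight-line fit (hypothetical helper)."""
    return slope * delta_images_per_slide

# Example: moving from 1 to 3 images per slide
change = predicted_change(slope, 3 - 1)
print("Going from 1 to 3 images per slide: {0:.1f} views per day".format(change))
```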

We can do better than eye-balling though. Let's look at images_per_slide for the Top 1 percent of presentations.

In [7]:
presns2 = presns.sort_values(by='views_per_day', ascending=False)
nrows = len(presns2)
top1 = presns2.images_per_slide[0 : int(0.01*nrows)]
print("Top 1 percent of presentations had {0} to {1} \
images per slide. ".format(min(top1), max(top1)))
Top 1 percent of presentations had 0.13559322033898305 to 4.3 images per slide. 
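An equivalent way to pick the top slice, sketched here on made-up data, is pandas' nlargest, which sorts and truncates in one step. The DataFrame below is hypothetical. Only the column names match the real dataset.

```python
import pandas as pd

# Hypothetical data: 200 presentations with views_per_day of 1..200
demo = pd.DataFrame({
    "views_per_day": range(1, 201),
    "images_per_slide": [1.0 + (i % 4) * 0.5 for i in range(200)],
})

# Top 1 percent by views_per_day: 2 rows out of 200
n_top = int(0.01 * len(demo))
top = demo.nlargest(n_top, "views_per_day")

print("Top {0} rows had {1} to {2} images per slide.".format(
    n_top, top["images_per_slide"].min(), top["images_per_slide"].max()))
```

Note that int() rounds down, so with fewer than 100 rows the top 1 percent would be empty.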

Should we use the mean or the median to summarize the Top 1 percent? If the distribution is roughly normal, the mean is a good summary. If it is skewed, the median is the safer choice because a few extreme values can't drag it around.

Let's conduct 2 quick tests to determine if the Top 1 percent is normally distributed.

In [8]:
# Histogram of distribution
plt.hist(top1)
plt.xlabel('Images per slide')
plt.ylabel('Frequency')
plt.show()

# Normal test
import scipy.stats as stats
print(stats.normaltest(top1))
NormaltestResult(statistic=12.645730952719852, pvalue=0.0017947931932384294)

The histogram seems skewed. The small p-value from normaltest also suggests that it's unlikely that the data came from a normal distribution. So we should use median.
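Here is a tiny illustration of why skew matters for that choice. The numbers are made up and are not from the SlideShare data: four typical values and one outlier.

```python
import numpy as np

# Made-up sample: four typical values and one large outlier
sample = np.array([1.0, 1.5, 1.5, 2.0, 9.0])

sample_mean = np.mean(sample)      # pulled upward by the outlier
sample_median = np.median(sample)  # stays with the typical values

print("Mean: {0}, median: {1}".format(sample_mean, sample_median))
```

A single outlier doubles the mean while leaving the median where most of the values sit.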

In [9]:
top1_views = presns2.views_per_day[0 : int(0.01*nrows)]
print("The Top 1 percent had {0} median images per slide.".format(top1.median()))
print("And they had {0} median views per day.".format(top1_views.median()))
The Top 1 percent had 1.5593220338983051 median images per slide.
And they had 2316.0096540627515 median views per day.

This is more precise than the 1 to 2 images_per_slide from our eye-balling experiment.

So shoot for about 1.5 images per slide or 3 images per 2 slides in your presentations!

Let's see if we can visualize the cluster. We can use k-means with the elbow method to choose the number of clusters, as David W described in his Stack Exchange answer.

In [10]:
# Convert to numpy array for use with scipy
# (as_matrix() was removed in pandas 1.0; to_numpy() replaces it)
views_images = presns[["views_per_day", "images_per_slide"]].to_numpy()

from scipy import cluster

np.random.seed(1)

# Run k-means for each k from 1 to 9 and plot the resulting 'variance'
# (scipy calls it distortion: mean distance to the nearest centroid)
initial = [cluster.vq.kmeans(views_images,i) for i in range(1,10)]
plt.plot([var for (cent,var) in initial])
plt.xlabel('k')
plt.ylabel('Variance')
plt.show()

# It seems variance really levels off after k = 3
# Note: initial[3] holds the k = 4 fit, since the list starts at k = 1
cent, var = initial[3]

# Use vq() to get an assignment for each observation
assignment,cdist = cluster.vq.vq(views_images,cent)
plt.scatter(views_images[:,1], views_images[:,0], c=assignment)
plt.xlabel('Images per slide')
plt.ylabel('Views per day')
plt.axvline(1.5593220338983051, color='r', linestyle='--')
plt.show()

The yellow cluster contains the presentations with the highest views_per_day. And it does seem the cluster is centered around 1.5 images_per_slide.
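For readers who want to try the elbow method on their own, here is a self-contained sketch on synthetic data. Two things differ from the analysis above and are my assumptions, not part of the original: the data is three made-up blobs, and each column is rescaled with whiten first so that a large-valued feature (like views_per_day) doesn't dominate the distance calculation.

```python
import numpy as np
from scipy.cluster.vq import kmeans, vq, whiten

np.random.seed(1)

# Synthetic data (made up): three well-separated blobs of 50 points each
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
points = np.vstack([c + np.random.normal(scale=0.5, size=(50, 2))
                    for c in centers])

# Rescale each column to unit variance so neither feature dominates
scaled = whiten(points)

# kmeans returns (centroids, distortion); distortion is the mean distance
# from each point to its nearest centroid
ks = list(range(1, 7))
fits = [kmeans(scaled, k) for k in ks]
distortions = [distortion for (_, distortion) in fits]
for k, d in zip(ks, distortions):
    print("k = {0}: distortion = {1:.3f}".format(k, d))

# Pick the fit by its k value rather than by list position
centroids, _ = fits[ks.index(3)]

# Assign every point to its nearest centroid
labels, _ = vq(scaled, centroids)
```

Looking up the fit with ks.index(3) avoids the off-by-one trap of indexing a list that starts at k = 1.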

So whether you eye-ball it, take Top 1 percent, or cluster it, you come to the same conclusion: 1.5 images_per_slide could help you go viral on SlideShare.

Of course, other things need to come together to create a SlideShare sensation. Do make sure you're milking the 1.5 images_per_slide sweet spot.