SlideShare presentations: Use 1.5 images per slide to maximize views¶
What do you remember better? Pictures or words?
Likely, you said pictures. And there is clear evidence to support you.
So it makes sense to use pictures in your presentations to maximize results.
The more stunning images you can include, the better. — SlideShare Blog
Hmm. Does that mean your presentation should be chock-full of images? After all, the more the better?
In this post, we're going to analyze presentations from SlideShare and try to determine the number of images that maximizes views. You know that images work but nobody has ever told you how many images to shoot for. Today you are in luck.
SlideShare is a global hub of professional content and presentations and we're going to source the presentations for our analysis from there.
Load data¶
The data contains statistics on SlideShare presentations. Let's load the data and see what it looks like. I extracted the data, presentations.csv, from SlideShare using my script here.
import pandas as pd
presns = pd.read_csv("presentations.csv")
# Select relevant columns
presns = presns[['views', 'days_posted', 'comments', 'likes', \
'total_images', 'total_slides', 'tweets']]
print("The data contains {0} presentations and {1} stats per \
presentation.".format(presns.shape[0], presns.shape[1]))
print("\n")
print(presns.head())
Here's a quick explanation of the important columns in the dataset:
views: how many people viewed the presentation
days_posted: how many days the presentation has been posted
comments: how many comments the presentation received
likes: how many 'likes' the presentation received
total_images: total number of images in the presentation
total_slides: total number of slides in the presentation
tweets: how many times the presentation was tweeted
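Before going further, a quick summary is one way to sanity-check that these columns contain what we expect. Here's a minimal sketch using pandas' built-in dtypes and describe():
# Quick sanity check: column types and summary statistics
print(presns.dtypes)
print(presns[['views', 'total_images', 'total_slides']].describe())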
Create a plot function¶
Let's define a function to plot a given statistic against views. We can use the plot to get an idea of how that statistic relates to views.
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import linregress
def plot_func(x, xlabel, y, ylabel):
    '''Plot a presentation statistic against the number of views.'''
    # Fit a straight line with np.polyfit
    fit = np.polyfit(x, y, 1)
    fit_fn = np.poly1d(fit)
    plt.plot(x, y, 'bo', x, fit_fn(x), '--k')
    #plt.xlim(0, 15000)
    #plt.ylim(0, 1500)
    plt.xlabel(xlabel)
    plt.ylabel(ylabel)
    plot_stats = linregress(x, y)
    print("\n")
    print("p-value: {0}".format(plot_stats.pvalue))
    #fig = plt.figure()
    #fig.savefig("fig1.png")
    plt.show()
    print("Slope: {0}".format(plot_stats.slope))
Check data¶
Whenever possible, it's a good idea to do an internal check on your data to see if things make sense. In our case, we can plot comments against views and see what the relationship looks like.
We expect that presentations with more views are more likely to have more comments (and more likes and tweets too). If 10 people viewed your presentation, you can get at most 10 comments. If 10,000 people viewed it, you can probably get a lot more comments.
Let's see if our data bears this out.
y = presns.comments
x = presns.views
plot_func(x,"Views", y, "Comments")
Yes, the more views the more comments, just as we expected! The same thing happens if we plot likes and tweets against views.
Okay, we're done with the sanity check. Let's get to the reason we're here: how do the images in your presentation relate to views?
Explore image-view relationship¶
First, let's plot views against total_images.
y = presns.views
x = presns.total_images
plot_func(x,"Total images", y, "Views")
It seems that the more images in the presentation, the more views. That's all fine and good, but just counting the number of images in a presentation does not tell the whole story. One presentation may have 100 slides and 100 pictures while another may have 10 slides and 100 pictures. The number of pictures is the same, but the images per slide are different, and that seems like something we want to tease out.
We'll compute a new variable, images_per_slide, to help us do that.
Also, some presentations have been posted for hundreds of days whereas others have been posted for only 1 day. To get a sense of how popular a presentation is, we really need to know views_per_day. Otherwise, we'd be comparing apples to oranges.
# It's okay to turn off pandas' chained-assignment warning for this analysis
pd.options.mode.chained_assignment = None
# Compute images per slide
presns['images_per_slide'] = presns.total_images / presns.total_slides
# Compute views per day
presns['views_per_day'] = presns.views / presns.days_posted
print(presns.head())
Now, let's repeat our last plot and use views_per_day and images_per_slide instead.
y = presns.views_per_day
x = presns.images_per_slide
plot_func(x,"Images per slide", y, "Views per day")
We have a significant p-value and a negative slope. While images seem good for presentations, as we saw earlier, this plot says you shouldn't go overboard with them. The slope is negative, suggesting that a few images are better than lots of images.
Eye-balling the plot, it seems that most of the heavily viewed presentations cluster around 1 to 2 images_per_slide. So there you have your sweet spot: aim for 1 to 2 images_per_slide to maximize views!
We can do better than eye-balling, though. Let's look at images_per_slide for the Top 1 percent of presentations.
presns2 = presns.sort_values(by='views_per_day', ascending=False)
nrows = len(presns2)
# Images per slide for the Top 1 percent of presentations by views_per_day
top1 = presns2.images_per_slide.iloc[0 : int(0.01*nrows)]
print("Top 1 percent of presentations had {0} to {1} \
images per slide. ".format(min(top1), max(top1)))
Should we use the mean or the median to get the average of the Top 1 percent? If we have a normal distribution, we should use the mean. If the distribution isn't normal, the median is better.
Let's run two quick tests to determine whether the Top 1 percent is normally distributed.
# Histogram of distribution
plt.hist(top1)
plt.xlabel('Images per slide')
plt.ylabel('Frequency')
plt.show()
# Normal test
import scipy.stats as stats
print(stats.normaltest(top1))
The histogram seems skewed. The small p-value from normaltest also suggests that it's unlikely the data came from a normal distribution. So we should use the median.
top1_views = presns2.views_per_day.iloc[0 : int(0.01*nrows)]
print("The Top 1 percent had {0} median images per slide.".format(top1.median()))
print("And they had {0} median views per day.".format(top1_views.median()))
This is more precise than the 1 to 2 images_per_slide from our eye-balling experiment.
So shoot for about 1.5 images per slide, or 3 images per 2 slides, in your presentations!
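If you want to turn that rule of thumb into a concrete number for your own deck, a tiny helper does the arithmetic. This is just an illustrative sketch; the image_budget function is made up for this post:
def image_budget(n_slides, images_per_slide=1.5):
    '''Rough image count for a deck, using the ~1.5 images-per-slide sweet spot. (Hypothetical helper for illustration.)'''
    return round(n_slides * images_per_slide)

# A 20-slide deck would carry about 30 images
print(image_budget(20))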
Let's see if we can visualize the cluster. We can use the elbow method, as David W described in his Stack Exchange answer.
# Convert to a numpy array for use with scipy
views_images = presns[["views_per_day", "images_per_slide"]].to_numpy()
from scipy import cluster
np.random.seed(1)
# Plot the within-cluster variance for each value of k from 1 to 9
initial = [cluster.vq.kmeans(views_images, i) for i in range(1, 10)]
plt.plot(range(1, 10), [var for (cent, var) in initial])
plt.xlabel('k')
plt.ylabel('Variance')
plt.show()
# Variance levels off after about k = 3; use the k = 4 solution (initial[3])
cent, var = initial[3]
# Use vq() to get an assignment for each observation
assignment,cdist = cluster.vq.vq(views_images,cent)
plt.scatter(views_images[:,1], views_images[:,0], c=assignment)
plt.xlabel('Images per slide')
plt.ylabel('Views per day')
# Mark roughly 1.5 images per slide
plt.axvline(1.5593220338983051, color='r', linestyle='--')
plt.show()
The yellow cluster contains the presentations with the highest views_per_day. And that cluster does indeed seem to be centered around 1.5 images_per_slide.
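If you'd rather not rely on the plot colors, one way to double-check is to pick the cluster whose centroid has the highest views_per_day and look at its images per slide directly. Here's a minimal sketch, reusing the cent, assignment, and views_images arrays from above:
# Find the cluster whose centroid has the highest views_per_day
best = cent[:, 0].argmax()
# Images per slide for the presentations assigned to that cluster
best_images = views_images[assignment == best, 1]
print("Highest-viewed cluster: median {0:.2f} images per slide".format(np.median(best_images)))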
So whether you eye-ball it, take the Top 1 percent, or cluster it, you come to the same conclusion: 1.5 images_per_slide could help you go viral on SlideShare.
Of course, other things need to come together to create a SlideShare sensation. But do make sure you're milking the 1.5 images_per_slide sweet spot.