[O] Obtaining

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel('https://github.com/theleadio/datascience_demo/blob/master/social-ads-raw.xlsx?raw=true')

[S] Scrubbing

In [2]:
df.sample(5)
Out[2]:
segment google facebook instagram sales size area
95 96 163.3 31.6 52.9 16.9 large rural
75 76 16.9 43.7 89.4 8.7 small suburban
120 121 141.3 26.8 46.2 15.5 large rural
72 73 26.8 33.0 19.3 8.8 small rural
85 86 193.2 18.4 65.7 15.2 small suburban
In [3]:
df.describe()
Out[3]:
segment google facebook instagram sales
count 200.000000 200.000000 200.000000 200.000000 200.000000
mean 100.500000 147.042500 23.264000 30.554000 14.022500
std 57.879185 85.854236 14.846809 21.778621 5.217457
min 1.000000 0.700000 0.000000 0.300000 1.600000
25% 50.750000 74.375000 9.975000 12.750000 10.375000
50% 100.500000 149.750000 22.900000 25.750000 12.900000
75% 150.250000 218.825000 36.525000 45.100000 17.400000
max 200.000000 296.400000 49.600000 114.000000 27.000000
In [ ]:
df.shape
Out[ ]:
(200, 7)
In [ ]:
# find out the unique values in 'size' and 'area' columns
print("size:", df['size'].unique(), "; area:", df['area'].unique())
size: ['large' 'small'] ; area: ['rural' 'urban' 'suburban']
In [ ]:
df['is_large'] = df['size'].map({"small":0, "large":1})
area_dummies = pd.get_dummies(df['area'])

df1 = df.join(area_dummies)

df1.sample(3)
Out[ ]:
segment google facebook instagram sales size area is_large rural suburban urban
42 43 293.6 27.7 1.8 20.7 large suburban 1 0 1 0
36 37 266.9 43.8 5.0 25.4 large rural 1 1 0 0
54 55 262.7 28.8 15.9 20.2 large rural 1 1 0 0

[E] Exploring

In [ ]:
figs, subs = plt.subplots(1,3, figsize=(20, 7))

sns.scatterplot(x='google', y='sales', hue='size', data=df1, ax=subs[0])
sns.scatterplot(x='facebook', y='sales', hue='size', data=df1, ax=subs[1])
sns.scatterplot(x='instagram', y='sales', hue='size', data=df1, ax=subs[2]);

Findings:

  1. Can't derive much from these scatterplots when considering the market size.
  2. However, the more budget spent on Google and Facebook, the higher the sales volume.
In [ ]:
figs, subs = plt.subplots(1,3, figsize=(20, 7))

sns.regplot(x='google', y='sales', data=df1, ax=subs[0])
sns.regplot(x='facebook', y='sales', data=df1, ax=subs[1])
sns.regplot(x='instagram', y='sales', data=df1, ax=subs[2]);

Findings:

  1. Sales results from ads spent on Google seems to be more consistent than Facebook and Instagram.
  2. However, the amount of ads spent on Google is also significantly higher than Facebook and Instagram on this dataset.
  3. In other words, Google ads is much more expensive than Facebook and Instagram ads.

[M] Modeling - Individual factor

In [ ]:
# create input features and target output
google = df1[['google']]
facebook = df1[['facebook']]
instagram = df1[['instagram']]
y = df1['sales']

# create model
from sklearn.linear_model import LinearRegression

google_model = LinearRegression()
facebook_model = LinearRegression()
instagram_model = LinearRegression()

# fit data into model
google_model.fit(google, y)
facebook_model.fit(facebook, y)
instagram_model.fit(instagram, y)
Out[ ]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [ ]:
print("Google - R square:", google_model.score(google,y), "; Intercept:", google_model.intercept_, "; Coefficient:", google_model.coef_)
print("Facebook - R square:", facebook_model.score(facebook,y), "; Intercept:", facebook_model.intercept_, "; Coefficient:", facebook_model.coef_)
print("Instagram - R square:", instagram_model.score(instagram,y), "; Intercept:", instagram_model.intercept_, "; Coefficient:", instagram_model.coef_)
Google - R square: 0.611875050850071 ; Intercept: 7.032593549127695 ; Coefficient: [0.04753664]
Facebook - R square: 0.33203245544529525 ; Intercept: 9.311638095158283 ; Coefficient: [0.20249578]
Instagram - R square: 0.05212044544430516 ; Intercept: 12.35140706927816 ; Coefficient: [0.0546931]
In [ ]:
google_raw_unit = round(google_model.intercept_ * 1000)
facebook_raw_unit = round(facebook_model.intercept_ * 1000)
instagram_raw_unit = round(instagram_model.intercept_ * 1000)

google_ads_unit = round(google_model.coef_[0] * 1000)
facebook_ads_unit = round(facebook_model.coef_[0] * 1000)
instagram_ads_unit = round(instagram_model.coef_[0] * 1000)

print(f"If we don't spend a single cent on ads, we will be able to sell: \nGoogle: {google_raw_unit} units\nFacebook: {facebook_raw_unit} units\nInstagram: {instagram_raw_unit} units")
print("\n")
print(f"For every thousand dollars that we spent on the each platform, we will be able to sell an additional of: \nGoogle: {google_ads_unit} units\nFacebook: {facebook_ads_unit} units\nInstagram: {instagram_ads_unit} units")
If we don't spend a single cent on ads, we will be able to sell: 
Google: 7033 units
Facebook: 9312 units
Instagram: 12351 units


For every thousand dollars that we spent on the each platform, we will be able to sell an additional of: 
Google: 48 units
Facebook: 202 units
Instagram: 55 units
In [ ]:
# formula for predicting sales unit: y = ax + c * 1000, a = coefficient, c = intercept

ads_amount = input("The amount that we are going to spend? ")
platform = input("On which platform? ")
int_ads_amount = int(ads_amount)
platform = platform.lower()

if platform == 'google':
  total_units = ((0.04753664 * int_ads_amount / 1000) + 7.032593549127695)*1000
  print(f"If we spend ${'{:,}'.format(int_ads_amount)} on {platform}, we will be able to sell {'{:,}'.format(round(total_units))} units in total, an additional of {'{:,}'.format(round(total_units-google_raw_unit))} units")
elif platform == 'facebook':
  total_units = ((0.20249578 * int_ads_amount / 1000) + 9.311638095158283)*1000
  print(f"If we spend ${'{:,}'.format(int_ads_amount)} on {platform}, we will be able to sell {'{:,}'.format(round(total_units))} units in total, an additional of {'{:,}'.format(round(total_units-facebook_raw_unit))} units.")
else:
  total_units = ((0.0546931 * int_ads_amount / 1000) + 12.35140706927816)*1000
  print(f"If we spend ${'{:,}'.format(int_ads_amount)} on {platform}, we will be able to sell {'{:,}'.format(round(total_units))} units in total, an additional of {'{:,}'.format(round(total_units-instagram_raw_unit))} units.")
The amount that we are going to spend? 40000
On which platform? google
If we spend $40,000 on google, we will be able to sell 8,934 units in total, an additional of 1,901 units

Findings:

  1. When isolating the factors by platform individually, it seems like Facebook has the most effective conversion rate. However, the accuracy score of this model is only 0.33, much less accurate than of Google's 0.61.

[M] Modeling - Improvised version

In [ ]:
# use multiple features to improve accuracy
df1.columns
Out[ ]:
Index(['segment', 'google', 'facebook', 'instagram', 'sales', 'size', 'area',
       'is_large', 'rural', 'suburban', 'urban'],
      dtype='object')
In [ ]:
input1 = ['google', 'facebook', 'instagram', 'is_large', 'rural', 'suburban', 'urban']
input2 = ['google', 'facebook', 'instagram', 'rural', 'suburban', 'urban']
input3 = ['google', 'facebook', 'instagram', 'is_large']
input4 = ['google', 'facebook', 'is_large', 'rural', 'suburban', 'urban']
input5 = ['google', 'facebook', 'instagram']

input_list = [input1, input2, input3, input4, input5]

for x in input_list:
  X = df1[x]
  y = df1['sales']
  model = LinearRegression()
  model.fit(X,y)
  print("R square:", model.score(X,y), "; Intercept:", model.intercept_, "; Coefficient:", model.coef_)
R square: 0.8981611875394305 ; Intercept: 2.9280493329187873 ; Coefficient: [ 0.04574401  0.1878667  -0.0010877   0.07739661 -0.05385834 -0.16042133
  0.21427968]
R square: 0.8981090280322743 ; Intercept: 2.9653347769419405 ; Coefficient: [ 0.0458023   0.18761682 -0.00101815 -0.04521056 -0.16310085  0.20831141]
R square: 0.8972400790092474 ; Intercept: 2.9117013282568607 ; Coefficient: [ 0.04571982  0.18872814 -0.00109768  0.05742385]
R square: 0.8981432428843079 ; Intercept: 2.910122835599342 ; Coefficient: [ 0.04573529  0.18730017  0.07569404 -0.05498525 -0.15973597  0.21472122]
R square: 0.8972106381789521 ; Intercept: 2.938889369459412 ; Coefficient: [ 0.04576465  0.18853002 -0.00103749]

input1 that includes all the columns has the highest R-square value, so we will use features in input1 for the subsequent modeling.

For every $1,000 spent on these variables, we will be able to sell an additional of:

  1. Google: 46 units
  2. Facebook: 187 units
  3. Instagram: -1 units
  4. Large market: 77 units
  5. Rural area: -53 units
  6. Suburban area: -160 units
  7. Urban area: 214 units

It seems like by marketing on Facebook, in a large market in an urban area, will provide the highest return.

However, the accuracy score for Facebook as per our previous isolated modeling is found to be half the time less accurate than Google.

In [ ]:
final_input = input1

X = df1[final_input]
y = df1['sales']

model = LinearRegression()
model.fit(X,y)
Out[ ]:
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
In [ ]:
predict1 = [50, 40, 10]  # 'google', 'facebook', 'instagram'
predict2 = [60, 30, 10]
predict3 = [70, 20, 10]
predict4 = [80, 10, 10]
predict5 = [40, 50, 10]
predict6 = [30, 60, 10]
predict7 = [20, 70, 10]
predict8 = [10, 80, 10]

budget_allocation = [predict1, predict2, predict3, predict4, predict5, predict6, predict7, predict8]

for x in budget_allocation:
  # select large market and urban area only
  market_area = [1, 0, 0, 1]  # 'is_large', 'rural', 'suburban', 'urban'
  x.extend(market_area)
  print(f"For {x}, the total predicted sales will be {'{:,}'.format(round(model.predict([x])[0] *1000))} units.")
For [50, 40, 10, 1, 0, 0, 1], the total predicted sales will be 13,011 units.
For [60, 30, 10, 1, 0, 0, 1], the total predicted sales will be 11,589 units.
For [70, 20, 10, 1, 0, 0, 1], the total predicted sales will be 10,168 units.
For [80, 10, 10, 1, 0, 0, 1], the total predicted sales will be 8,747 units.
For [40, 50, 10, 1, 0, 0, 1], the total predicted sales will be 14,432 units.
For [30, 60, 10, 1, 0, 0, 1], the total predicted sales will be 15,853 units.
For [20, 70, 10, 1, 0, 0, 1], the total predicted sales will be 17,274 units.
For [10, 80, 10, 1, 0, 0, 1], the total predicted sales will be 18,696 units.

[N] iNterpretation

The more budget spent on Facebook, the higher our return. However, due to its performance inconsistency, we can't put all our budgets into one single platform.

Hence the budget should be allocated to the other more consistent platform like Google as well.

My preferred budget allocation will be:

  • Google: 40,000
  • Facebook: 50,000
  • Instagram: 10,000

and they will be spent on a large market, in an urban area.

The predicted sales will be around 14,432 units sold out of this $100k budget.