[O] Obtaining¶

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel('https://github.com/theleadio/datascience_demo/blob/master/social-ads-raw.xlsx?raw=true')

[S] Scrubbing¶

df.sample(5)

df.describe()

df.shape

(200, 7)

# find out the unique values in 'size' and 'area' columns
print("size:", df['size'].unique(), "; area:", df['area'].unique())

size: ['large' 'small'] ; area: ['rural' 'urban' 'suburban']

df['is_large'] = df['size'].map({"small":0, "large":1})
area_dummies = pd.get_dummies(df['area'])

df1 = df.join(area_dummies)

df1.sample(3)

[E] Exploring¶

figs, subs = plt.subplots(1,3, figsize=(20, 7))

sns.scatterplot(x='google', y='sales', hue='size', data=df1, ax=subs[0])
sns.scatterplot(x='facebook', y='sales', hue='size', data=df1, ax=subs[1])
sns.scatterplot(x='instagram', y='sales', hue='size', data=df1, ax=subs[2]);

Findings:

Can't derive much from these scatterplots when considering the market size.
However, the more budget spent on Google and Facebook, the higher the sales volume.

figs, subs = plt.subplots(1,3, figsize=(20, 7))

sns.regplot(x='google', y='sales', data=df1, ax=subs[0])
sns.regplot(x='facebook', y='sales', data=df1, ax=subs[1])
sns.regplot(x='instagram', y='sales', data=df1, ax=subs[2]);

Findings:

Sales results from ads spent on Google seems to be more consistent than Facebook and Instagram.
However, the amount of ads spent on Google is also significantly higher than Facebook and Instagram on this dataset.
In other words, Google ads is much more expensive than Facebook and Instagram ads.

[M] Modeling - Individual factor¶

# create input features and target output
google = df1[['google']]
facebook = df1[['facebook']]
instagram = df1[['instagram']]
y = df1['sales']

# create model
from sklearn.linear_model import LinearRegression

google_model = LinearRegression()
facebook_model = LinearRegression()
instagram_model = LinearRegression()

# fit data into model
google_model.fit(google, y)
facebook_model.fit(facebook, y)
instagram_model.fit(instagram, y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

print("Google - R square:", google_model.score(google,y), "; Intercept:", google_model.intercept_, "; Coefficient:", google_model.coef_)
print("Facebook - R square:", facebook_model.score(facebook,y), "; Intercept:", facebook_model.intercept_, "; Coefficient:", facebook_model.coef_)
print("Instagram - R square:", instagram_model.score(instagram,y), "; Intercept:", instagram_model.intercept_, "; Coefficient:", instagram_model.coef_)

Google - R square: 0.611875050850071 ; Intercept: 7.032593549127695 ; Coefficient: [0.04753664]
Facebook - R square: 0.33203245544529525 ; Intercept: 9.311638095158283 ; Coefficient: [0.20249578]
Instagram - R square: 0.05212044544430516 ; Intercept: 12.35140706927816 ; Coefficient: [0.0546931]

google_raw_unit = round(google_model.intercept_ * 1000)
facebook_raw_unit = round(facebook_model.intercept_ * 1000)
instagram_raw_unit = round(instagram_model.intercept_ * 1000)

google_ads_unit = round(google_model.coef_[0] * 1000)
facebook_ads_unit = round(facebook_model.coef_[0] * 1000)
instagram_ads_unit = round(instagram_model.coef_[0] * 1000)

print(f"If we don't spend a single cent on ads, we will be able to sell: \nGoogle: {google_raw_unit} units\nFacebook: {facebook_raw_unit} units\nInstagram: {instagram_raw_unit} units")
print("\n")
print(f"For every thousand dollars that we spent on the each platform, we will be able to sell an additional of: \nGoogle: {google_ads_unit} units\nFacebook: {facebook_ads_unit} units\nInstagram: {instagram_ads_unit} units")

If we don't spend a single cent on ads, we will be able to sell: 
Google: 7033 units
Facebook: 9312 units
Instagram: 12351 units


For every thousand dollars that we spent on the each platform, we will be able to sell an additional of: 
Google: 48 units
Facebook: 202 units
Instagram: 55 units

# formula for predicting sales unit: y = ax + c * 1000, a = coefficient, c = intercept

ads_amount = input("The amount that we are going to spend? ")
platform = input("On which platform? ")
int_ads_amount = int(ads_amount)
platform = platform.lower()

if platform == 'google':
  total_units = ((0.04753664 * int_ads_amount / 1000) + 7.032593549127695)*1000
  print(f"If we spend ${'{:,}'.format(int_ads_amount)} on {platform}, we will be able to sell {'{:,}'.format(round(total_units))} units in total, an additional of {'{:,}'.format(round(total_units-google_raw_unit))} units")
elif platform == 'facebook':
  total_units = ((0.20249578 * int_ads_amount / 1000) + 9.311638095158283)*1000
  print(f"If we spend ${'{:,}'.format(int_ads_amount)} on {platform}, we will be able to sell {'{:,}'.format(round(total_units))} units in total, an additional of {'{:,}'.format(round(total_units-facebook_raw_unit))} units.")
else:
  total_units = ((0.0546931 * int_ads_amount / 1000) + 12.35140706927816)*1000
  print(f"If we spend ${'{:,}'.format(int_ads_amount)} on {platform}, we will be able to sell {'{:,}'.format(round(total_units))} units in total, an additional of {'{:,}'.format(round(total_units-instagram_raw_unit))} units.")

The amount that we are going to spend? 40000
On which platform? google
If we spend $40,000 on google, we will be able to sell 8,934 units in total, an additional of 1,901 units

Findings:

When isolating the factors by platform individually, it seems like Facebook has the most effective conversion rate. However, the accuracy score of this model is only 0.33, much less accurate than of Google's 0.61.

[M] Modeling - Improvised version¶

# use multiple features to improve accuracy
df1.columns

Index(['segment', 'google', 'facebook', 'instagram', 'sales', 'size', 'area',
       'is_large', 'rural', 'suburban', 'urban'],
      dtype='object')

input1 = ['google', 'facebook', 'instagram', 'is_large', 'rural', 'suburban', 'urban']
input2 = ['google', 'facebook', 'instagram', 'rural', 'suburban', 'urban']
input3 = ['google', 'facebook', 'instagram', 'is_large']
input4 = ['google', 'facebook', 'is_large', 'rural', 'suburban', 'urban']
input5 = ['google', 'facebook', 'instagram']

input_list = [input1, input2, input3, input4, input5]

for x in input_list:
  X = df1[x]
  y = df1['sales']
  model = LinearRegression()
  model.fit(X,y)
  print("R square:", model.score(X,y), "; Intercept:", model.intercept_, "; Coefficient:", model.coef_)

R square: 0.8981611875394305 ; Intercept: 2.9280493329187873 ; Coefficient: [ 0.04574401  0.1878667  -0.0010877   0.07739661 -0.05385834 -0.16042133
  0.21427968]
R square: 0.8981090280322743 ; Intercept: 2.9653347769419405 ; Coefficient: [ 0.0458023   0.18761682 -0.00101815 -0.04521056 -0.16310085  0.20831141]
R square: 0.8972400790092474 ; Intercept: 2.9117013282568607 ; Coefficient: [ 0.04571982  0.18872814 -0.00109768  0.05742385]
R square: 0.8981432428843079 ; Intercept: 2.910122835599342 ; Coefficient: [ 0.04573529  0.18730017  0.07569404 -0.05498525 -0.15973597  0.21472122]
R square: 0.8972106381789521 ; Intercept: 2.938889369459412 ; Coefficient: [ 0.04576465  0.18853002 -0.00103749]

input1 that includes all the columns has the highest R-square value, so we will use features in input1 for the subsequent modeling.

For every $1,000 spent on these variables, we will be able to sell an additional of:

Google: 46 units
Facebook: 187 units
Instagram: -1 units
Large market: 77 units
Rural area: -53 units
Suburban area: -160 units
Urban area: 214 units

It seems like by marketing on Facebook, in a large market in an urban area, will provide the highest return.

However, the accuracy score for Facebook as per our previous isolated modeling is found to be half the time less accurate than Google.

final_input = input1

X = df1[final_input]
y = df1['sales']

model = LinearRegression()
model.fit(X,y)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

predict1 = [50, 40, 10]  # 'google', 'facebook', 'instagram'
predict2 = [60, 30, 10]
predict3 = [70, 20, 10]
predict4 = [80, 10, 10]
predict5 = [40, 50, 10]
predict6 = [30, 60, 10]
predict7 = [20, 70, 10]
predict8 = [10, 80, 10]

budget_allocation = [predict1, predict2, predict3, predict4, predict5, predict6, predict7, predict8]

for x in budget_allocation:
  # select large market and urban area only
  market_area = [1, 0, 0, 1]  # 'is_large', 'rural', 'suburban', 'urban'
  x.extend(market_area)
  print(f"For {x}, the total predicted sales will be {'{:,}'.format(round(model.predict([x])[0] *1000))} units.")

For [50, 40, 10, 1, 0, 0, 1], the total predicted sales will be 13,011 units.
For [60, 30, 10, 1, 0, 0, 1], the total predicted sales will be 11,589 units.
For [70, 20, 10, 1, 0, 0, 1], the total predicted sales will be 10,168 units.
For [80, 10, 10, 1, 0, 0, 1], the total predicted sales will be 8,747 units.
For [40, 50, 10, 1, 0, 0, 1], the total predicted sales will be 14,432 units.
For [30, 60, 10, 1, 0, 0, 1], the total predicted sales will be 15,853 units.
For [20, 70, 10, 1, 0, 0, 1], the total predicted sales will be 17,274 units.
For [10, 80, 10, 1, 0, 0, 1], the total predicted sales will be 18,696 units.

[N] iNterpretation¶

The more budget spent on Facebook, the higher our return. However, due to its performance inconsistency, we can't put all our budgets into one single platform.

Hence the budget should be allocated to the other more consistent platform like Google as well.

My preferred budget allocation will be:

Google: 40,000
Facebook: 50,000
Instagram: 10,000

and they will be spent on a large market, in an urban area.

The predicted sales will be around 14,432 units sold out of this $100k budget.

	segment	google	facebook	instagram	sales
count	200.000000	200.000000	200.000000	200.000000	200.000000
mean	100.500000	147.042500	23.264000	30.554000	14.022500
std	57.879185	85.854236	14.846809	21.778621	5.217457
min	1.000000	0.700000	0.000000	0.300000	1.600000
25%	50.750000	74.375000	9.975000	12.750000	10.375000
50%	100.500000	149.750000	22.900000	25.750000	12.900000
75%	150.250000	218.825000	36.525000	45.100000	17.400000
max	200.000000	296.400000	49.600000	114.000000	27.000000

	segment	google	facebook	instagram	sales	size	area
95	96	163.3	31.6	52.9	16.9	large	rural
75	76	16.9	43.7	89.4	8.7	small	suburban
120	121	141.3	26.8	46.2	15.5	large	rural
72	73	26.8	33.0	19.3	8.8	small	rural
85	86	193.2	18.4	65.7	15.2	small	suburban

	segment	google	facebook	instagram	sales	size	area	is_large	rural	suburban
42	43	293.6	27.7	1.8	20.7	large	suburban	1	0	1
36	37	266.9	43.8	5.0	25.4	large	rural	1	1	0
54	55	262.7	28.8	15.9	20.2	large	rural	1	1	0