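Predicting cost-effectiveness with multiple regression analysis.

This post's topic is multiple regression analysis.

The data is a CSV file of advertising spend (TV commercials and magazines) and app install counts.

Let's take a look.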

import pandas as pd

ad_result_df = pd.read_csv('ad_result.csv')

ad_result_df

	month	tvcm	magazine	install
0	2013-01	6358	5955	53948
1	2013-02	8176	6069	57300
2	2013-03	6853	5862	52057
3	2013-04	5271	5247	44044
4	2013-05	6473	6365	54063
5	2013-06	7682	6555	58097
6	2013-07	5666	5546	47407
7	2013-08	6659	6066	53333
8	2013-09	6066	5646	49918
9	2013-10	10090	6545	59963
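If you don't have ad_result.csv at hand, the ten rows printed above are enough to rebuild the frame by hand. This is only a minimal sketch: the values are typed in from the output above, and I'm assuming those ten rows are the whole file.

```python
import pandas as pd

# Values typed in from the table above (assumed to be the complete dataset).
ad_result_df = pd.DataFrame({
    "month": ["2013-01", "2013-02", "2013-03", "2013-04", "2013-05",
              "2013-06", "2013-07", "2013-08", "2013-09", "2013-10"],
    "tvcm": [6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090],
    "magazine": [5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545],
    "install": [53948, 57300, 52057, 44044, 54063,
                58097, 47407, 53333, 49918, 59963],
})

print(ad_result_df.shape)  # (10, 4)
```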

The raw numbers alone don't tell us much, so let's draw a scatter plot for each ad channel.

import seaborn as sns

custom_style = {'axes.labelcolor': 'white',
                'xtick.color': 'black',
                'ytick.color': 'black'}
sns.set_style("darkgrid", rc=custom_style)
sns.jointplot(x='tvcm', y='install', data=ad_result_df)

[Figure: joint plot of tvcm vs. install]

custom_style = {'axes.labelcolor': 'white',
                'xtick.color': 'black',
                'ytick.color': 'black'}
sns.set_style("darkgrid", rc=custom_style)
sns.jointplot(x='magazine', y='install', data=ad_result_df)

[Figure: joint plot of magazine vs. install]

The custom_style settings are only there to make the plots readable in Jupyter, so don't worry about them.

Looking at the two scatter plots, both channels seem to be correlated with installs.
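To put a number on "seems to be correlated", one option is the Pearson correlation coefficient. The sketch below is not part of the original analysis; it just retypes the three columns from the table above as NumPy arrays and calls np.corrcoef.

```python
import numpy as np

# Columns typed in from the table above.
tvcm = np.array([6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090], dtype=float)
magazine = np.array([5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545], dtype=float)
install = np.array([53948, 57300, 52057, 44044, 54063,
                    58097, 47407, 53333, 49918, 59963], dtype=float)

# np.corrcoef returns a 2x2 matrix; the off-diagonal entry is Pearson's r.
r_tvcm = np.corrcoef(tvcm, install)[0, 1]
r_magazine = np.corrcoef(magazine, install)[0, 1]

print(f"tvcm vs install:     r = {r_tvcm:.3f}")
print(f"magazine vs install: r = {r_magazine:.3f}")
```

With only ten data points these values are rough, but both should come out clearly positive.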

Let's fit a regression line to each and plot it.

custom_style = {'axes.labelcolor': 'white',
                'xtick.color': 'black',
                'ytick.color': 'black'}
sns.set_style("darkgrid", rc=custom_style)
sns.regplot(x='tvcm', y='install', data=ad_result_df)

[Figure: regression plot of tvcm vs. install]

custom_style = {'axes.labelcolor': 'white',
                'xtick.color': 'black',
                'ytick.color': 'black'}
sns.set_style("darkgrid", rc=custom_style)
sns.regplot(x='magazine', y='install', data=ad_result_df)

[Figure: regression plot of magazine vs. install]

The slope is steeper for magazines.
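Since the two plots use different x-axis ranges, comparing slopes by eye is a bit risky. As a rough check, the slopes of the two single-variable fits can be computed with np.polyfit. This is my own sketch, with the data retyped from the table above, and not something the original notebook ran.

```python
import numpy as np

# Columns typed in from the table above.
tvcm = np.array([6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090], dtype=float)
magazine = np.array([5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545], dtype=float)
install = np.array([53948, 57300, 52057, 44044, 54063,
                    58097, 47407, 53333, 49918, 59963], dtype=float)

# A degree-1 polyfit is a simple linear regression; it returns [slope, intercept].
slope_tvcm, _ = np.polyfit(tvcm, install, 1)
slope_magazine, _ = np.polyfit(magazine, install, 1)

print(f"install per unit of tvcm:     {slope_tvcm:.2f}")
print(f"install per unit of magazine: {slope_magazine:.2f}")
```

These single-variable slopes are not the same thing as the multiple regression coefficients, because the two ad budgets move together to some extent.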

Now let's run the multiple regression.

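from sklearn import linear_model

model = linear_model.LinearRegression()
# Explanatory variables: every column except the target and the month label
x_multi = ad_result_df.drop(['install','month'], axis=1)
# Target variable: install counts
y_target = ad_result_df.install
model.fit(x_multi, y_target)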

model.coef_

array([ 1.3609213 ,  7.24980915])

model.intercept_

188.17427483039501

The coefficient for TV commercials is 1.36 and the one for magazine ads is 7.25, so the magazine coefficient is indeed the larger of the two.
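For anyone curious where these numbers come from: LinearRegression fits ordinary least squares, so the same coefficients can be recovered with plain NumPy. A minimal sketch, assuming the ten rows shown earlier are the full dataset; the variable names here are mine.

```python
import numpy as np

# Columns typed in from the table above.
tvcm = np.array([6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090], dtype=float)
magazine = np.array([5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545], dtype=float)
install = np.array([53948, 57300, 52057, 44044, 54063,
                    58097, 47407, 53333, 49918, 59963], dtype=float)

# Design matrix: a column of ones for the intercept, then the two predictors.
X = np.column_stack([np.ones(len(tvcm)), tvcm, magazine])

# Least-squares solution of X @ beta ~= install.
beta, *_ = np.linalg.lstsq(X, install, rcond=None)
intercept, b_tvcm, b_magazine = beta

print(f"intercept = {intercept:.2f}, tvcm = {b_tvcm:.4f}, magazine = {b_magazine:.4f}")
```

The result should land close to model.intercept_ and model.coef_ printed above.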

The .score method apparently returns R², so let's try it.

model.score(x_multi, y_target)

0.93790143010444693

That's a fairly decent value, I'd say.
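As a sanity check on what .score reports, R² can be computed by hand as 1 - SS_res / SS_tot. The sketch below plugs in the intercept and coefficients printed above and the data from the table; it is an illustration, not how scikit-learn is implemented internally.

```python
import numpy as np

# Columns typed in from the table above.
tvcm = np.array([6358, 8176, 6853, 5271, 6473, 7682, 5666, 6659, 6066, 10090], dtype=float)
magazine = np.array([5955, 6069, 5862, 5247, 6365, 6555, 5546, 6066, 5646, 6545], dtype=float)
install = np.array([53948, 57300, 52057, 44044, 54063,
                    58097, 47407, 53333, 49918, 59963], dtype=float)

# Fitted values using model.intercept_ and model.coef_ as printed in the post.
predicted = 188.17427483039501 + 1.3609213 * tvcm + 7.24980915 * magazine

ss_res = np.sum((install - predicted) ** 2)       # residual sum of squares
ss_tot = np.sum((install - install.mean()) ** 2)  # total sum of squares around the mean
r2 = 1 - ss_res / ss_tot

print(f"R^2 = {r2:.4f}")
```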

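Incidentally, statsmodels also gives you p-values and more. Note that sm.OLS does not add an intercept on its own, so the fit below has no constant term: that is why its coefficients differ slightly from scikit-learn's, and why R-squared shows 1.000 (the uncentered R² reported for no-intercept models) rather than 0.938.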

import statsmodels.api as sm

models = sm.OLS(y_target, x_multi)
results = models.fit()
results.summary()

OLS Regression Results
Dep. Variable:	install	R-squared:	1.000
Model:	OLS	Adj. R-squared:	0.999
Method:	Least Squares	F-statistic:	8403.
Date:	Sun, 18 Feb 2018	Prob (F-statistic):	5.12e-14
Time:	15:54:45	Log-Likelihood:	-84.758
No. Observations:	10	AIC:	173.5
Df Residuals:	8	BIC:	174.1
Df Model:	2
Covariance Type:	nonrobust

	coef	std err	t	P>|t|	[0.025	0.975]
tvcm	1.3540	0.405	3.347	0.010	0.421	2.287
magazine	7.2892	0.476	15.320	0.000	6.192	8.386

Omnibus:	1.009	Durbin-Watson:	0.876
Prob(Omnibus):	0.604	Jarque-Bera (JB):	0.804
Skew:	0.539	Prob(JB):	0.669
Kurtosis:	2.123	Cond. No.	14.0

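Finally, let's make a prediction.

The book says to predict for 42 million spent on TV commercials and 75 million on magazine ads (4200 and 7500 in the units of the data), so let's plug those numbers in.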

import numpy as np

x_pre = [4200, 7500]
# predict expects a 2D array of shape (n_samples, n_features)
x = np.reshape(x_pre, (1, -1))
model.predict(x)

array([ 60277.61237361])

So the model predicts about 60,278 installs.
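The prediction is nothing more than the intercept plus each coefficient times its input. A quick check in plain Python, using the intercept and coefficients as printed earlier (so the last few digits can differ slightly from model.predict):

```python
# Values copied from model.intercept_ and model.coef_ as printed in the post.
intercept = 188.17427483039501
coef = [1.3609213, 7.24980915]  # [tvcm, magazine]
x_pre = [4200, 7500]            # [tvcm, magazine]

# Linear model: intercept + sum of coefficient * input.
predicted = intercept + sum(c * x for c, x in zip(coef, x_pre))

print(round(predicted))  # 60278
```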