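Which customer segments are we losing?

This time we are doing exploratory data analysis.

The number of users has dropped compared with last month for no obvious reason, so we will look for the cause.

First, let's check the data we have been given.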

import pandas as pd
import seaborn as sns
dau_df = pd.read_csv("section4-dau.csv")
user_info_df = pd.read_csv("section4-user_info.csv")

dau_df.head()

	log_date	app_name	user_id
0	2013-08-01	game-01	33754
1	2013-08-01	game-01	28598
2	2013-08-01	game-01	30306
3	2013-08-01	game-01	117
4	2013-08-01	game-01	6605

user_info_df.head()

	install_date	app_name	user_id	gender	generation	device_type
0	2013-04-15	game-01	1	M	40	iOS
1	2013-04-15	game-01	2	M	10	Android
2	2013-04-15	game-01	3	F	40	iOS
3	2013-04-15	game-01	4	M	10	Android
4	2013-04-15	game-01	5	M	40	iOS

We join the two tables using user_id as the key.

all_df = pd.merge(dau_df, user_info_df, on='user_id', how='left')

all_df.head()

	log_date	app_name_x	user_id	install_date	app_name_y	gender	generation	device_type
0	2013-08-01	game-01	33754	2013-08-01	game-01	M	20	iOS
1	2013-08-01	game-01	28598	2013-07-16	game-01	M	50	iOS
2	2013-08-01	game-01	30306	2013-07-20	game-01	F	30	iOS
3	2013-08-01	game-01	117	2013-04-17	game-01	F	20	iOS
4	2013-08-01	game-01	6605	2013-05-02	game-01	M	20	iOS
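The join above produces app_name_x and app_name_y because app_name exists in both tables but is not part of the key. As a minimal sketch on toy data (the frames and values below are made up for illustration, not taken from the CSV files), joining on both shared columns keeps a single app_name column, and the `validate` and `indicator` options of `pd.merge` help confirm that the join behaves as expected:

```python
import pandas as pd

# Toy stand-ins for dau_df and user_info_df (illustrative values only).
dau = pd.DataFrame({
    "log_date": ["2013-08-01", "2013-08-01", "2013-08-02"],
    "app_name": ["game-01", "game-01", "game-01"],
    "user_id": [1, 2, 99],  # user 99 has no profile row
})
users = pd.DataFrame({
    "app_name": ["game-01", "game-01"],
    "user_id": [1, 2],
    "gender": ["M", "F"],
})

# Joining on both shared columns avoids the _x / _y suffixes.
# validate="m:1" raises if the profile table has duplicate keys;
# indicator=True adds a _merge column showing which rows found a match.
merged = pd.merge(
    dau, users,
    on=["user_id", "app_name"],
    how="left",
    validate="m:1",
    indicator=True,
)

# Log rows without a matching profile would show up here.
unmatched = merged[merged["_merge"] == "left_only"]
print(merged.columns.tolist())
print(len(unmatched))
```

Whether the real data contains unmatched users is not something this post checks, so treat this only as a sanity check you could add.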

We want to compare against the previous month, so we add a month column.

# Assumption: the duplicated app_name columns are dropped at this point,
# since they no longer appear in the output below.
all_df = all_df.drop(columns=['app_name_x', 'app_name_y'])
# log_date is a 'YYYY-MM-DD' string, so characters 5-6 are the month.
all_df['log_month'] = all_df['log_date'].str[5:7]

all_df.head()

	log_date	user_id	install_date	gender	generation	device_type	log_month
0	2013-08-01	33754	2013-08-01	M	20	iOS	08
1	2013-08-01	28598	2013-07-16	M	50	iOS	08
2	2013-08-01	30306	2013-07-20	F	30	iOS	08
3	2013-08-01	117	2013-04-17	F	20	iOS	08
4	2013-08-01	6605	2013-05-02	M	20	iOS	08
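Slicing the string works here because every log_date has the same 'YYYY-MM-DD' layout. If that cannot be guaranteed, one alternative (a sketch on a made-up three-row frame, not the author's code) is to parse the dates once and derive the month from the parsed values:

```python
import pandas as pd

# Toy frame with the same date format as the log data.
df = pd.DataFrame({"log_date": ["2013-08-01", "2013-08-31", "2013-09-15"]})

# Parse once; malformed dates raise instead of silently yielding a wrong month.
parsed = pd.to_datetime(df["log_date"], format="%Y-%m-%d")

# Zero-padded month string, matching the "08" / "09" values used in this post.
df["log_month"] = parsed.dt.strftime("%m")

# A year-month key (hypothetical extra column) stays unambiguous
# if the log ever spans more than one year.
df["log_ym"] = parsed.dt.strftime("%Y-%m")
print(df)
```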

Now let's dig into the data. First, we check whether there is any difference by gender.

all_df.groupby(['log_month', 'gender']).count()

		log_date	user_id	install_date	generation	device_type
log_month	gender
08	F	47343	47343	47343	47343	47343
08	M	46842	46842	46842	46842	46842
09	F	38027	38027	38027	38027	38027
09	M	38148	38148	38148	38148	38148
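The same number is repeated in every column because `count()` reports the non-null values of each remaining column. A more compact view, sketched here on a tiny made-up frame, is to count rows with `size()` and move one key into the columns with `unstack()`:

```python
import pandas as pd

# Toy data: one row per log record (illustrative values only).
toy = pd.DataFrame({
    "log_month": ["08", "08", "08", "09", "09"],
    "gender":    ["F",  "M",  "M",  "F",  "M"],
})

# size() returns one number per group; unstack() turns gender into columns,
# giving a month x gender table that is easy to compare row by row.
counts = toy.groupby(["log_month", "gender"]).size().unstack("gender")
print(counts)
```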

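There is essentially no difference: the counts fall by about 20% for F and about 19% for M. I'll plot it anyway, just in case.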

sns.countplot(x="log_month", data=all_df, hue='gender')

[Figure: record counts by log_month, split by gender (f:id:icchy333:20180214180239p:plain)]

Next, let's see whether there is any difference by age group.

all_df.groupby(['log_month', 'generation']).count()

		log_date	user_id	install_date	gender	device_type
log_month	generation
08	10	18785	18785	18785	18785	18785
	20	33671	33671	33671	33671	33671
	30	28072	28072	28072	28072	28072
	40	8828	8828	8828	8828	8828
	50	4829	4829	4829	4829	4829
09	10	15391	15391	15391	15391	15391
	20	27229	27229	27229	27229	27229
	30	22226	22226	22226	22226	22226
	40	7494	7494	7494	7494	7494
	50	3835	3835	3835	3835	3835

This table is very hard to read, so a plot seems worthwhile here.

sns.countplot(x="log_month", data=all_df, hue='generation')

[Figure: record counts by log_month, split by generation (f:id:icchy333:20180214180532p:plain)]

The colors are pretty, but it is still a little hard to read, so this time I put generation on the x-axis and stratify by "log_month" instead.

sns.countplot(x="generation", data=all_df, hue='log_month')

[Figure: record counts by generation, split by log_month (f:id:icchy333:20180214180751p:plain)]

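The drop looks larger for users in their 10s to 30s, but those groups are larger to begin with. In relative terms every age group falls by roughly 15 to 21%, so age is unlikely to be the root cause.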
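To make the "bigger base" point concrete, the counts from the groupby output above can be turned into absolute and relative changes. This is only a small sketch; the numbers are copied by hand from that table, and the column names `abs_change` and `pct_change` are my own:

```python
import pandas as pd

# Record counts per generation, copied from the groupby output above.
counts = pd.DataFrame(
    {
        "08": [18785, 33671, 28072, 8828, 4829],
        "09": [15391, 27229, 22226, 7494, 3835],
    },
    index=pd.Index([10, 20, 30, 40, 50], name="generation"),
)

# Absolute change favors large groups; relative change puts groups on one scale.
counts["abs_change"] = counts["09"] - counts["08"]
counts["pct_change"] = (counts["abs_change"] / counts["08"] * 100).round(1)
print(counts)
```

Note that these are counts of log records (one row per user per day), not counts of distinct users.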

The theme of this chapter is actually cross tabulation, apparently, so let's do something along those lines.

grouped = all_df.groupby(['log_month', 'gender', 'generation']).count()

grouped_df = grouped.reset_index()

grouped.log_date.plot(kind='bar')

[Figure: bar chart of record counts per (log_month, gender, generation) (f:id:icchy333:20180214181133p:plain)]

Whether this plot is actually easy to read is debatable, but judging from it, nothing stands out as a major cause...
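Since cross tabulation is the theme, a table with the month in the columns may be easier to scan than a bar per group. Here is a minimal sketch using `pd.crosstab` on made-up data; the `ratio_09_to_08` column is a name I introduce for illustration:

```python
import pandas as pd

# Toy log records (illustrative values only).
toy = pd.DataFrame({
    "log_month":  ["08", "08", "08", "08", "08", "09", "09", "09"],
    "gender":     ["F",  "F",  "M",  "M",  "M",  "F",  "M",  "M"],
    "generation": [20,   30,   20,   20,   30,   20,   20,   30],
})

# Rows: gender x generation segments; columns: month.
table = pd.crosstab([toy["gender"], toy["generation"]], toy["log_month"])

# Ratio of September to August per segment; a value far below the others
# would point at a segment worth investigating.
table["ratio_09_to_08"] = table["09"] / table["08"]
print(table)
```

With the real data, the two frames passed to `pd.crosstab` would be columns of all_df.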

Let's look at the full data once more.

all_df.head()

	log_date	user_id	install_date	gender	generation	device_type	log_month
0	2013-08-01	33754	2013-08-01	M	20	iOS	08
1	2013-08-01	28598	2013-07-16	M	50	iOS	08
2	2013-08-01	30306	2013-07-20	F	30	iOS	08
3	2013-08-01	117	2013-04-17	F	20	iOS	08
4	2013-08-01	6605	2013-05-02	M	20	iOS	08

The data also includes 'device_type'. Let's analyze that as well.

sns.countplot(x="device_type", data=all_df, hue='log_month')

[Figure: record counts by device_type, split by log_month (f:id:icchy333:20180214181510p:plain)]

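Oh?

This result is clearly out of line with the others.

The number of log records from Android users drops sharply in September.

The cause seems to be somewhere around here.

Let's see whether it is related to gender or age group.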

android_df = all_df[all_df['device_type'] == 'Android']

sns.countplot(x="generation", data=android_df, hue='log_month')

[Figure: Android users only, record counts by generation, split by log_month (f:id:icchy333:20180214181815p:plain)]

sns.countplot(x="gender", data=android_df, hue='log_month')

[Figure: Android users only, record counts by gender, split by log_month (f:id:icchy333:20180214181841p:plain)]

There does not seem to be any relationship here. It seems all but certain that something went wrong with the Android app, so let's plot by day instead of by month.

# Count log records per device type and day.
# size() counts rows directly, so there are no extra columns to drop or rename.
device_count = (
    all_df.groupby(['device_type', 'log_date'])
    .size()
    .reset_index(name='count')
)

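import matplotlib.pyplot as plt

sns.set()
sns.set_context("notebook")
plt.figure(figsize=(24, 12))
sns.pointplot(x='log_date', y='count', hue='device_type', data=device_count, markers=["^", "o"], linestyles=["-", "--"])
plt.xticks(rotation=90)  # daily date labels overlap without rotation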
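Besides eyeballing the daily plot, the date on which the drop starts can also be located numerically. The following is a rough sketch on made-up daily counts (not the real device_count values): pivot to one column per device, compute the day-over-day change, and take the day with the steepest decline.

```python
import pandas as pd

# Toy daily counts in the same long format as device_count (illustrative values).
daily = pd.DataFrame({
    "log_date": ["2013-09-01", "2013-09-02", "2013-09-03", "2013-09-04"] * 2,
    "device_type": ["Android"] * 4 + ["iOS"] * 4,
    "count": [100, 98, 40, 38, 120, 118, 121, 119],
})
daily["log_date"] = pd.to_datetime(daily["log_date"])

# One row per day, one column per device type.
wide = daily.pivot(index="log_date", columns="device_type", values="count")

# Day-over-day relative change; the first day is NaN by construction.
change = wide.pct_change()

# Day with the steepest relative decline for Android (idxmin skips NaN).
worst_day = change["Android"].idxmin()
print(worst_day.date(), round(change.loc[worst_day, "Android"], 2))
```

On real data a single bad day can be noise, so comparing against iOS on the same day, as the plot does, is a useful cross-check.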