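Following up on my previous post, "Playing around with kaggle" (肉眼天文台),

I realized I don't actually understand the basics of Python at all, so

today I played around with iris.csv.

After setting up the path the way I learned last time, I ran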

import pandas as pd

df = pd.read_csv("iris.csv")
print(df)

to print the data.

While I was at it, I also used the freshly learned

print(df.dtypes)

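to check what kinds of data the file contains.

The four measurement columns are all float, of course!! Knew it!! (The unnamed first column of sample labels is the exception: it holds strings like sample1, as the outputs further down show.)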
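A side note on that dtype check. The outputs later in this post show an 'Unnamed: 0' column holding labels like sample1, so the file apparently starts with an unnamed column of sample names. Here is a minimal sketch of how that plays out, using a tiny made-up CSV (the three rows are invented stand-ins, not the real iris.csv). Passing index_col=0 is one possible way to turn the label column into the index. I haven't changed the code in this post to do that.

```python
import io
import pandas as pd

# Made-up stand-in for iris.csv: an unnamed first column of sample labels
# followed by numeric measurements. The values are invented for illustration.
csv_text = """,Sepal.Length,Sepal.Width
sample1,5.1,3.5
sample2,4.9,3.0
sample3,4.7,3.2
"""

# Default read: the unnamed column shows up as 'Unnamed: 0' and holds strings
# (dtype object, or a dedicated string dtype depending on the pandas version).
df_default = pd.read_csv(io.StringIO(csv_text))
print(df_default.dtypes)

# With index_col=0 the labels become the index, leaving only float columns.
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(df_indexed.dtypes)
```

With the labels in the index, things like df.mean() only ever see numeric columns.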

And while I was at it again, I also used the freshly learned

import matplotlib.pyplot as plt
import missingno as msno
msno.matrix(df, figsize=(20, 14), color=(0.5, 0, 0))
plt.show()  # display the missing-value matrix when run as a script

to check whether any data is missing.

No missing values, of course!! Knew it!!
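If you'd rather see numbers than a picture, counting the missing cells per column is another option. This is a minimal sketch on a made-up CSV with two holes punched in it on purpose (the data and the name df_demo are invented for illustration). On the real iris.csv I'd expect every count to be 0.

```python
import io
import pandas as pd

# Invented data: one missing Sepal.Length and one missing Sepal.Width.
csv_text = """,Sepal.Length,Sepal.Width
sample1,5.1,3.5
sample2,,3.0
sample3,4.7,
"""

df_demo = pd.read_csv(io.StringIO(csv_text), index_col=0)

# isna() marks each missing cell as True; sum() counts them per column.
missing_per_column = df_demo.isna().sum()
print(missing_per_column)

# A single yes/no answer for the whole table.
has_missing = bool(df_demo.isna().any().any())
print("any missing:", has_missing)
```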

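From here on I'll play a little more seriously.

I computed and printed the max, min, mean, median, variance, and standard deviation.

What even are variance and standard deviation...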

kusanagi.hatenablog.jp

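# max/min also cover the string label column; the other statistics are limited
# to numeric columns, because newer pandas raises a TypeError when asked for
# the mean of a string column. ddof=1 gives the sample (n - 1) variance and std.
basic_statistics = pd.concat(
    [df.max(), df.min(), df.mean(numeric_only=True), df.median(numeric_only=True),
     df.var(ddof=1, numeric_only=True), df.std(ddof=1, numeric_only=True)],
    axis=1)
basic_statistics.columns = ['max', 'min', 'average', 'median', 'var', 'std']
basic_statistics.to_csv('basic_statistics.csv')
out = pd.read_csv("basic_statistics.csv")
print(out)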

I don't really get variance and standard deviation, but

hey, you type it in and the computer calculates it and prints the result, so it's OK!!???

     Unnamed: 0       max      min   average  median       var       std
0  Petal.Length       6.9      1.0  3.758000    4.35  3.116278  1.765298
1   Petal.Width       2.5      0.1  1.199333    1.30  0.581006  0.762238
2  Sepal.Length       7.9      4.3  5.843333    5.80  0.685694  0.828066
3   Sepal.Width       4.4      2.0  3.057333    3.00  0.189979  0.435866
4    Unnamed: 0  sample99  sample1       NaN     NaN       NaN       NaN
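For the record, neither number is magic. The sample variance is the average squared distance from the mean (dividing by n - 1, which is what ddof=1 asks for), and the standard deviation is just its square root. Below is a minimal sketch in plain Python on a made-up list of numbers (the list and the helper name sample_variance are mine, not from any library), plus a check against the Petal.Length row in the table above.

```python
import math

def sample_variance(values):
    """Sample variance: squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)

# Made-up data, only to show the mechanics.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
var = sample_variance(values)
std = math.sqrt(var)
print("variance:", var)
print("standard deviation:", std)

# Check against the table: the square root of the Petal.Length variance
# should come out close to the 1.765298 in the std column.
petal_length_std = math.sqrt(3.116278)
print(petal_length_std)
```

So a bigger var or std just means the values sit further from the mean on average.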

By the way, here is what each one looks like when visualized

import seaborn as sns

y_df_1 = df['Petal.Length']
y_df_2 = df['Petal.Width']
y_df_3 = df['Sepal.Length']
y_df_4 = df['Sepal.Width']

# sns.distplot is deprecated in recent seaborn; histplot with kde=True and
# stat="density" draws a comparable histogram plus density curve.
for y in (y_df_1, y_df_2, y_df_3, y_df_4):
    ax = sns.histplot(y, kde=True, stat="density")
    plt.show()

[Images: distribution plots for the four columns]

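So, can you tell at a glance how big the variance and standard deviation are from how tidy (or spread out) the graph looks... ( ^ω^ )?

By the way, the numbers here aren't large enough to need it, but apparently this is how you take the log.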

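import numpy as np

# Natural log of one column; Petal.Length is picked here just as an example.
y_df = np.log(df['Petal.Length'])

ax = sns.histplot(y_df, kde=True, stat="density")
plt.show()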

Next I computed and printed the covariance and the correlation coefficient.

What even are covariance and correlation coefficients...

www.sekkachi.com

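# Keep only the numeric columns; newer pandas no longer silently skips
# the string label column in cov() and corr().
numeric_df = df.select_dtypes(include='number')

covariance = numeric_df.cov()
covariance.to_csv('covariance.csv')
out2 = pd.read_csv("covariance.csv")
print(out2)

correlation_coefficient = numeric_df.corr()
correlation_coefficient.to_csv('correlation_coefficient.csv')
out3 = pd.read_csv("correlation_coefficient.csv")
print(out3)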

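I don't really get covariance and correlation coefficients either, but

hey, you type it in and the computer calculates it and prints the result, so it's OK!!???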

     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
0  Sepal.Length      0.685694    -0.042434      1.274315     0.516271
1   Sepal.Width     -0.042434     0.189979     -0.329656    -0.121639
2  Petal.Length      1.274315    -0.329656      3.116278     1.295609
3   Petal.Width      0.516271    -0.121639      1.295609     0.581006
     Unnamed: 0  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
0  Sepal.Length      1.000000    -0.117570      0.871754     0.817941
1   Sepal.Width     -0.117570     1.000000     -0.428440    -0.366126
2  Petal.Length      0.871754    -0.428440      1.000000     0.962865
3   Petal.Width      0.817941    -0.366126      0.962865     1.000000
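The two tables above are actually the same information at different scales. Covariance averages the product of each pair's deviations from their means, and the (Pearson) correlation coefficient is that covariance divided by the two standard deviations, which squeezes it into the range -1 to 1. Here is a minimal sketch in plain Python. The helper names and the toy data are made up by me, and the last part plugs in numbers copied from the tables in this post.

```python
import math

def sample_covariance(xs, ys):
    """Sample covariance: mean product of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations."""
    std_x = math.sqrt(sample_covariance(xs, xs))  # cov(x, x) is the variance
    std_y = math.sqrt(sample_covariance(ys, ys))
    return sample_covariance(xs, ys) / (std_x * std_y)

# Toy data where y is exactly 2 * x, so the correlation should be 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(sample_covariance(xs, ys))
print(correlation(xs, ys))

# Numbers from this post: cov(Sepal.Length, Petal.Length) divided by the two
# std values should land close to the 0.871754 in the correlation table.
r = 1.274315 / (0.828066 * 1.765298)
print(r)
```

That's why the diagonal of the correlation table is all 1.000000: every column is perfectly correlated with itself.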

By the way, here is what it looks like when I run a random forest on each one

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.ensemble import RandomForestRegressor

# Predict each measurement from the other three and plot the feature importances.
for target in ['Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width']:
    y_df = df[target]
    # Drop the sample-label column and the target itself from the features.
    X_df = df.drop(['Unnamed: 0', target], axis=1)

    # max_features=1.0 considers every feature at each split, which is what
    # 'auto' meant for regressors before newer scikit-learn removed it.
    rf = RandomForestRegressor(n_estimators=80, max_features=1.0)
    rf.fit(X_df, y_df)
    print('Training done using Random Forest')

    # Sort the features from most to least important.
    ranking = np.argsort(-rf.feature_importances_)
    f, ax = plt.subplots(figsize=(11, 9))
    sns.barplot(x=rf.feature_importances_[ranking], y=X_df.columns.values[ranking], orient='h')
    ax.set_xlabel("feature importance")
    plt.tight_layout()
    plt.show()

[Images: random forest feature-importance plots]