
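Data Analysis Basics in Python: Getting Started with Data Science Using Pandas, NumPy, and Matplotlib

Data analysis has become an essential skill in modern business and research. With its rich libraries and readable syntax, Python is one of the most popular programming languages for data analysis. This article walks Python beginners through the fundamentals and practical techniques needed to get started, with working code examples throughout.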

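What Is Data Analysis?

Why Data Analysis Matters

Data analysis is the practice of extracting useful information and patterns from large volumes of data and using them to support decisions. It is applied in areas such as:

Business analytics: sales forecasting, customer behavior analysis, measuring marketing effectiveness
Research: analysis of scientific data, hypothesis testing
Finance: risk assessment, investment decisions, fraud detection
Healthcare: clinical data analysis, drug efficacy evaluation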

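Why Python Suits Data Analysis

Rich libraries: NumPy, Pandas, Matplotlib, Scikit-learn, and more
Open source: free to use
Community: an active developer community and plenty of learning material
Versatility: usable end to end, from data collection to web application development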

Environment Setup

Before starting, install the required libraries. For setting up Python itself, see the article on Python environment setup.

Installing the Required Libraries

# With Poetry (scipy and scikit-learn are used in the later sections)
poetry add pandas numpy matplotlib seaborn scipy scikit-learn jupyter

# With pip
pip install pandas numpy matplotlib seaborn scipy scikit-learn jupyter

Starting Jupyter Notebook

# With Poetry
poetry run jupyter notebook

# With pip
jupyter notebook

NumPy: The Foundation of Numerical Computing

What Is NumPy?

NumPy (Numerical Python) is a library for efficient numerical computation. It provides a multidimensional array object and functions that operate on it.

Basic Array Operations

import numpy as np

# Create arrays
arr1 = np.array([1, 2, 3, 4, 5])
arr2 = np.array([[1, 2, 3], [4, 5, 6]])

print("1-D array:", arr1)
print("2-D array:")
print(arr2)

# Array metadata
print("Shape:", arr2.shape)           # (2, 3)
print("Dimensions:", arr2.ndim)       # 2
print("Size:", arr2.size)             # 6
print("Data type:", arr2.dtype)       # int64 on most platforms

Generating Arrays

# Generate arrays with specific patterns
zeros = np.zeros((3, 4))           # 3x4 array filled with 0
ones = np.ones((2, 3))             # 2x3 array filled with 1
full = np.full((2, 2), 7)          # 2x2 array filled with 7
identity = np.eye(3)               # 3x3 identity matrix

# Generate sequences
range_arr = np.arange(0, 10, 2)    # [0, 2, 4, 6, 8]
linspace = np.linspace(0, 1, 5)    # 5 evenly spaced points from 0 to 1, inclusive

print("arange:", range_arr)
print("linspace:", linspace)

Mathematical Operations

# Basic element-wise operations
arr = np.array([1, 2, 3, 4, 5])

print("Original array:", arr)
print("Each element doubled:", arr * 2)
print("Each element squared:", arr ** 2)
print("Square root:", np.sqrt(arr))

# Statistical functions
print("Sum:", np.sum(arr))
print("Mean:", np.mean(arr))
print("Standard deviation:", np.std(arr))
print("Max:", np.max(arr))
print("Min:", np.min(arr))
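
One detail worth checking when mixing the two libraries: np.std computes the population standard deviation by default (ddof=0), while pandas' Series.std computes the sample standard deviation (ddof=1). A minimal sketch with the same five values; the variable names are illustrative only:

```python
import numpy as np
import pandas as pd

values = [1, 2, 3, 4, 5]

population_std = np.std(values)            # divides by n (ddof=0)
sample_std_np = np.std(values, ddof=1)     # divides by n - 1
sample_std_pd = pd.Series(values).std()    # pandas defaults to ddof=1

print("NumPy default (ddof=0):", population_std)
print("NumPy with ddof=1:", sample_std_np)
print("pandas default (ddof=1):", sample_std_pd)
```

On small samples the two conventions differ noticeably, so it helps to state which one a reported figure uses.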

Array Indexing and Slicing

# Working with a 2-D array
matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

print("Original matrix:")
print(matrix)
print("First row:", matrix[0])
print("First column:", matrix[:, 0])
print("2x2 submatrix:")
print(matrix[:2, :2])

# Selection by condition
print("Elements greater than 5:", matrix[matrix > 5])
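
Conditional selection works because the comparison itself produces a boolean array of the same shape, which is then used as a mask. The sketch below spells out that intermediate step and shows two common variations (np.where and combined conditions); the variable names are illustrative:

```python
import numpy as np

matrix = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

mask = matrix > 5                      # boolean array with the same shape
selected = matrix[mask]                # 1-D array of the matching elements
capped = np.where(mask, 5, matrix)     # replace matches with 5, keep the rest
even_and_large = matrix[(matrix > 5) & (matrix % 2 == 0)]  # combine with &

print(mask)
print(selected)
print(capped)
print(even_and_large)
```

Note that conditions are combined with & and |, each wrapped in parentheses, rather than with Python's and / or.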

Pandas: The Core of Data Manipulation

What Is Pandas?

Pandas is a library for data manipulation and analysis. It handles tabular data (DataFrame) and one-dimensional labeled data (Series) efficiently.

DataFrame Basics

import pandas as pd

# Create sample data
data = {
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'city': ['Tokyo', 'Osaka', 'Tokyo', 'Kyoto', 'Osaka'],
    'salary': [50000, 60000, 70000, 55000, 65000]
}

df = pd.DataFrame(data)
print("DataFrame:")
print(df)

# Basic information about the data
print("\nShape:", df.shape)
print("\nData types:")
print(df.dtypes)
print("\nSummary statistics:")
print(df.describe())

Reading and Saving Data

# Read a CSV file
# df = pd.read_csv('data.csv')

# Read an Excel file
# df = pd.read_excel('data.xlsx')

# Read a JSON file
# df = pd.read_json('data.json')

# Save as a CSV file
# df.to_csv('output.csv', index=False)

# Sample CSV data (stands in for reading an actual file)
sample_csv_data = """name,age,department,salary
Alice,25,Engineering,75000
Bob,30,Marketing,65000
Charlie,35,Engineering,80000
Diana,28,Sales,60000
Eve,32,Marketing,70000"""

from io import StringIO
df_csv = pd.read_csv(StringIO(sample_csv_data))
print("Data read from CSV:")
print(df_csv)

Selecting and Filtering Data

# Select a column
print("Name column:")
print(df['name'])

# Select multiple columns
print("\nName and age:")
print(df[['name', 'age']])

# Select rows by position
print("\nFirst 3 rows:")
print(df.head(3))

print("\nLast 2 rows:")
print(df.tail(2))

# Filter by condition
print("\nPeople aged 30 or older:")
print(df[df['age'] >= 30])

print("\nLiving in Tokyo with a salary of at least 60000:")
print(df[(df['city'] == 'Tokyo') & (df['salary'] >= 60000)])
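
Beyond bracket-style selection, pandas offers .loc, .iloc, isin(), and query() for the same kinds of lookups. The following is a minimal sketch on a copy of the sample data; the frame name people and the result names are illustrative, not part of the original example:

```python
import pandas as pd

people = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'age': [25, 30, 35, 28, 32],
    'city': ['Tokyo', 'Osaka', 'Tokyo', 'Kyoto', 'Osaka'],
    'salary': [50000, 60000, 70000, 55000, 65000],
})

# .loc selects by label or boolean mask, and can pick columns at the same time
tokyo_names = people.loc[people['city'] == 'Tokyo', 'name']

# .iloc selects purely by integer position (rows 1-2, first two columns)
middle_rows = people.iloc[1:3, :2]

# isin() replaces a chain of == comparisons joined with |
kansai = people[people['city'].isin(['Osaka', 'Kyoto'])]

# query() expresses the same kind of filter as a string
well_paid_tokyo = people.query("city == 'Tokyo' and salary >= 60000")

print(tokyo_names)
print(middle_rows)
print(kansai)
print(well_paid_tokyo)
```

Which form reads best is largely a matter of taste; .loc tends to be the safest choice when rows and columns are selected together.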

Aggregation and Grouping

# Basic aggregation
print("Mean age:", df['age'].mean())
print("Highest salary:", df['salary'].max())
print("Count by city:")
print(df['city'].value_counts())

# Aggregation by group
print("\nMean age by city:")
print(df.groupby('city')['age'].mean())

print("\nStatistics by city:")
print(df.groupby('city').agg({
    'age': ['mean', 'max', 'min'],
    'salary': ['mean', 'sum']
}))
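
Passing a dict of lists to agg() produces two-level column names, which can be awkward to reference afterwards. One alternative is named aggregation, sketched below on the same sample values; the output column names (mean_age and so on) are made up for illustration:

```python
import pandas as pd

people = pd.DataFrame({
    'city': ['Tokyo', 'Osaka', 'Tokyo', 'Kyoto', 'Osaka'],
    'age': [25, 30, 35, 28, 32],
    'salary': [50000, 60000, 70000, 55000, 65000],
})

# Each keyword becomes a flat output column: name=(source column, function)
city_summary = people.groupby('city').agg(
    mean_age=('age', 'mean'),
    max_age=('age', 'max'),
    total_salary=('salary', 'sum'),
)
print(city_summary)
```

The result has ordinary single-level columns, so city_summary['mean_age'] works directly.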

Transforming and Reshaping Data

# Add a new column
df['age_category'] = df['age'].apply(lambda x: 'Young' if x < 30 else 'Adult')
print("With age category added:")
print(df)

# Drop a column
df_copy = df.copy()
df_copy = df_copy.drop('age_category', axis=1)
print("\nWith age category dropped:")
print(df_copy)

# Sort the data
print("\nSorted by salary (descending):")
print(df.sort_values('salary', ascending=False))
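
The lambda-based apply is easy to read, but the same categorization can also be written with vectorized helpers. The sketch below shows np.where for two categories and pd.cut for more; the three-level bin edges and labels are hypothetical choices for illustration:

```python
import numpy as np
import pandas as pd

ages = pd.Series([25, 30, 35, 28, 32], name='age')

# Two categories: a vectorized equivalent of the lambda-based apply
two_level = pd.Series(np.where(ages < 30, 'Young', 'Adult'), index=ages.index)

# More than two categories: pd.cut with explicit bin edges.
# right=False makes each bin include its left edge, so 30 falls in '30-34'.
three_level = pd.cut(
    ages,
    bins=[0, 30, 35, np.inf],
    labels=['under 30', '30-34', '35+'],
    right=False,
)

print(two_level)
print(three_level)
```

On a five-row frame the difference is cosmetic; on large frames the vectorized forms usually run faster than a per-row Python function.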

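Matplotlib: Data Visualization

What Is Matplotlib?

Matplotlib is a library for creating graphs and charts. Visualizing data makes patterns and trends easier to grasp intuitively.

Creating a Basic Plot

import matplotlib.pyplot as plt

# Font setting (optional). DejaVu Sans is Matplotlib's default and has no
# Japanese glyphs; pick a Japanese-capable font if you use Japanese labels.
plt.rcParams['font.family'] = 'DejaVu Sans'

# Line plot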
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)

plt.figure(figsize=(10, 6))
plt.plot(x, y1, label='sin(x)')
plt.plot(x, y2, label='cos(x)')
plt.xlabel('x')
plt.ylabel('y')
plt.title('Sine and Cosine Functions')
plt.legend()
plt.grid(True)
plt.show()

Different Chart Types

# A 2x2 grid of common chart types
plt.figure(figsize=(12, 8))

# Subplot 1: scatter plot
plt.subplot(2, 2, 1)
plt.scatter(df['age'], df['salary'])
plt.xlabel('Age')
plt.ylabel('Salary')
plt.title('Age vs Salary')

# Subplot 2: histogram
plt.subplot(2, 2, 2)
plt.hist(df['age'], bins=5, alpha=0.7)
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.title('Age Distribution')

# Subplot 3: bar chart
plt.subplot(2, 2, 3)
city_counts = df['city'].value_counts()
plt.bar(city_counts.index, city_counts.values)
plt.xlabel('City')
plt.ylabel('Count')
plt.title('City Distribution')

# Subplot 4: pie chart
plt.subplot(2, 2, 4)
plt.pie(city_counts.values, labels=city_counts.index, autopct='%1.1f%%')
plt.title('City Distribution (Pie Chart)')

plt.tight_layout()
plt.show()

A Practical Data Analysis Example

Analyzing Sales Data

Let's analyze sales data modeled on a real business scenario.

# Create sample sales data
np.random.seed(42)
dates = pd.date_range('2024-01-01', periods=365, freq='D')
sales_data = {
    'date': dates,
    'sales': np.random.normal(100000, 20000, 365) + 
             np.sin(np.arange(365) * 2 * np.pi / 365) * 10000,  # seasonal variation
    'product_category': np.random.choice(['A', 'B', 'C'], 365),
    'region': np.random.choice(['East', 'West', 'North', 'South'], 365)
}

sales_df = pd.DataFrame(sales_data)
sales_df['sales'] = sales_df['sales'].clip(lower=0)  # clip negative values to 0

print("First 5 rows of the sales data:")
print(sales_df.head())

Time Series Analysis

# Aggregate sales by month
sales_df['month'] = sales_df['date'].dt.month
monthly_sales = sales_df.groupby('month')['sales'].sum()

plt.figure(figsize=(12, 6))
plt.plot(monthly_sales.index, monthly_sales.values, marker='o')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.title('Monthly Sales Trend')
plt.grid(True)
plt.show()

print("Sales by month:")
print(monthly_sales)
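
Grouping by the month number works for a single calendar year, but it merges the same month from different years into one bucket. If the data may span more than one year, resampling on the date is one alternative. A minimal sketch with hypothetical data (one unit of sales per day across 13 months):

```python
import pandas as pd

# Hypothetical data: 1.0 in sales per day, January 2023 through January 2024
dates = pd.date_range('2023-01-01', '2024-01-31', freq='D')
daily = pd.DataFrame({'date': dates, 'sales': 1.0})

# Month number only: both Januaries end up in the same bucket
by_month_number = daily.groupby(daily['date'].dt.month)['sales'].sum()

# Calendar month: 'MS' labels each bucket by its month-start date
by_calendar_month = daily.set_index('date')['sales'].resample('MS').sum()

print(by_month_number)
print(by_calendar_month)
```

Here the first approach reports 62 for month 1, while the second keeps January 2023 and January 2024 as separate rows of 31 each.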

Analysis by Category

# Sales analysis by product category
category_analysis = sales_df.groupby('product_category').agg({
    'sales': ['sum', 'mean', 'count']
}).round(2)

print("Analysis by product category:")
print(category_analysis)

# Cross-tabulation by region and category
cross_table = pd.crosstab(sales_df['region'], sales_df['product_category'],
                         values=sales_df['sales'], aggfunc='sum')

print("\nSales by region and category:")
print(cross_table.round(2))

Correlation Analysis

import seaborn as sns

# Correlation analysis of numeric data
# Add extra indicators derived from the date
sales_df['day_of_year'] = sales_df['date'].dt.dayofyear
sales_df['is_weekend'] = sales_df['date'].dt.dayofweek >= 5

# Compute the correlation matrix
correlation_matrix = sales_df[['sales', 'day_of_year']].corr()

print("Correlation matrix:")
print(correlation_matrix)

# Visualize as a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', center=0)
plt.title('Correlation Matrix')
plt.show()
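
By default, corr() computes Pearson's correlation coefficient, which measures linear association only. To make the number less of a black box, the sketch below computes it by hand on a small made-up dataset and compares the result with pandas:

```python
import numpy as np
import pandas as pd

# Small hypothetical dataset, chosen so the arithmetic is easy to follow
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
y = pd.Series([2.0, 4.0, 5.0, 4.0, 5.0])

# Pearson's r = sum of co-deviations / sqrt(product of squared deviations)
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

r_pandas = x.corr(y)   # Pearson by default

print("manual:", r_manual)
print("pandas:", r_pandas)
```

Because only linear association is captured, a strong but curved relationship can still yield a modest coefficient, so a scatter plot is a useful companion check.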

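Basics of Statistical Analysis

Descriptive Statistics

# Descriptive statistics for the sales data
print("Descriptive statistics for sales:")
print(sales_df['sales'].describe())

# Additional statistics
print(f"\nSkewness: {sales_df['sales'].skew():.3f}")
print(f"Kurtosis: {sales_df['sales'].kurtosis():.3f}")  # excess kurtosis (0 for a normal distribution)
print(f"Median: {sales_df['sales'].median():.2f}")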

Visualizing the Distribution

plt.figure(figsize=(15, 5))

# Histogram
plt.subplot(1, 3, 1)
plt.hist(sales_df['sales'], bins=30, alpha=0.7, density=True)
plt.xlabel('Sales')
plt.ylabel('Density')
plt.title('Sales Distribution')

# Box plot
plt.subplot(1, 3, 2)
sales_df.boxplot(column='sales', by='product_category', ax=plt.gca())
plt.title('Sales by Product Category')
plt.suptitle('')  # remove the automatic figure-level title

# Q-Q plot (normality check)
from scipy import stats
plt.subplot(1, 3, 3)
stats.probplot(sales_df['sales'], dist="norm", plot=plt)
plt.title('Q-Q Plot (Normal Distribution)')

plt.tight_layout()
plt.show()

Basics of Hypothesis Testing

from scipy import stats

# t-test: test for a difference in sales between regions
east_sales = sales_df[sales_df['region'] == 'East']['sales']
west_sales = sales_df[sales_df['region'] == 'West']['sales']

# Test for equal variances
levene_stat, levene_p = stats.levene(east_sales, west_sales)
print(f"Equal-variance test - Levene statistic: {levene_stat:.3f}, p-value: {levene_p:.3f}")

# Run the t-test
if levene_p > 0.05:  # assume equal variances
    t_stat, t_p = stats.ttest_ind(east_sales, west_sales)
else:  # do not assume equal variances (Welch's t-test)
    t_stat, t_p = stats.ttest_ind(east_sales, west_sales, equal_var=False)

print("\nt-test results:")
print(f"t statistic: {t_stat:.3f}")
print(f"p-value: {t_p:.3f}")
print(f"East region mean sales: {east_sales.mean():.2f}")
print(f"West region mean sales: {west_sales.mean():.2f}")

if t_p < 0.05:
    print("Significant difference (p < 0.05)")
else:
    print("No significant difference (p >= 0.05)")
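
To see what the unequal-variance branch (equal_var=False, Welch's t-test) actually computes, the statistic can be worked out by hand. This is a sketch on two small made-up groups, not the sales data; only NumPy is used:

```python
import numpy as np

# Hypothetical groups chosen so the numbers are easy to verify
group_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
group_b = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

var_a = group_a.var(ddof=1)   # sample variance of each group
var_b = group_b.var(ddof=1)
n_a, n_b = len(group_a), len(group_b)

# Standard error of the difference in means, without pooling the variances
standard_error = np.sqrt(var_a / n_a + var_b / n_b)
t_welch = (group_a.mean() - group_b.mean()) / standard_error

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (var_a / n_a + var_b / n_b) ** 2 / (
    (var_a / n_a) ** 2 / (n_a - 1) + (var_b / n_b) ** 2 / (n_b - 1)
)

print("t statistic:", t_welch)
print("degrees of freedom:", dof)
```

The p-value then comes from a t distribution with these (generally non-integer) degrees of freedom, which is the part the library call handles.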

Data Cleaning

Handling Missing Values

# Create sample data containing missing values
dirty_data = {
    'name': ['Alice', 'Bob', None, 'Diana', 'Eve'],
    'age': [25, None, 35, 28, 32],
    'salary': [50000, 60000, 70000, None, 65000]
}

dirty_df = pd.DataFrame(dirty_data)
print("Data with missing values:")
print(dirty_df)

# Check for missing values
print("\nMissing values per column:")
print(dirty_df.isnull().sum())

# Ways to handle missing values
# 1. Drop
print("\nDrop rows containing missing values:")
print(dirty_df.dropna())

# 2. Fill with the mean
df_filled_mean = dirty_df.copy()
df_filled_mean['age'] = df_filled_mean['age'].fillna(df_filled_mean['age'].mean())
df_filled_mean['salary'] = df_filled_mean['salary'].fillna(df_filled_mean['salary'].mean())
print("\nFilled with the mean:")
print(df_filled_mean)

# 3. Fill with the previous value (forward fill).
# fillna(method='ffill') is deprecated in recent pandas; ffill() is the replacement.
df_filled_forward = dirty_df.ffill()
print("\nFilled with the previous value:")
print(df_filled_forward)
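
Two further options are sometimes useful: filling with the median, which is less sensitive to extreme values than the mean, and linear interpolation for ordered data. The sketch below uses a small hypothetical series and also records which entries were originally missing:

```python
import numpy as np
import pandas as pd

# Hypothetical ages with one gap
ages = pd.Series([25, np.nan, 35, 28, 40], name='age')

was_missing = ages.isna()                    # keep a record before filling
filled_median = ages.fillna(ages.median())   # median of 25, 28, 35, 40
filled_linear = ages.interpolate()           # linear between neighbours 25 and 35

print(was_missing)
print(filled_median)
print(filled_linear)
```

Interpolation only makes sense when row order is meaningful (for example, a time series); for unordered records the median or mean is the more defensible choice.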

Detecting and Handling Outliers

# Detect outliers in the sales data
Q1 = sales_df['sales'].quantile(0.25)
Q3 = sales_df['sales'].quantile(0.75)
IQR = Q3 - Q1

# Outlier thresholds
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR

print(f"Q1: {Q1:.2f}")
print(f"Q3: {Q3:.2f}")
print(f"IQR: {IQR:.2f}")
print(f"Non-outlier range: {lower_bound:.2f} to {upper_bound:.2f}")

# Detect outliers
outliers = sales_df[(sales_df['sales'] < lower_bound) |
                   (sales_df['sales'] > upper_bound)]
print(f"\nNumber of outliers: {len(outliers)}")

# Data with outliers excluded
clean_sales = sales_df[(sales_df['sales'] >= lower_bound) &
                      (sales_df['sales'] <= upper_bound)]
print(f"Rows remaining after removing outliers: {len(clean_sales)}")
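
Dropping rows is not the only way to handle values outside the IQR fences. When every row needs to be kept, capping values at the fences is one alternative. A minimal sketch on a small made-up series:

```python
import pandas as pd

# Hypothetical values with one obvious outlier
values = pd.Series([10, 12, 11, 13, 12, 100], dtype=float)

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (values < lower) | (values > upper)
capped = values.clip(lower=lower, upper=upper)   # keeps every row

print("fences:", lower, upper)
print("outliers flagged:", int(is_outlier.sum()))
print(capped)
```

Whether to drop, cap, or keep outliers depends on whether they are data errors or genuine extreme observations, so it is worth inspecting them before choosing.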

The Data Analysis Workflow

1. Understanding the Data

def data_overview(df):
    """Print an overview of the data."""
    print("=" * 50)
    print("Data overview")
    print("=" * 50)
    print(f"Rows: {df.shape[0]}")
    print(f"Columns: {df.shape[1]}")
    print(f"\nColumn names: {list(df.columns)}")
    print("\nData types:")
    print(df.dtypes)
    print("\nMissing values:")
    print(df.isnull().sum())
    print("\nSummary statistics:")
    print(df.describe())

# Usage
data_overview(sales_df)

2. Exploratory Data Analysis (EDA)

def exploratory_analysis(df, target_column):
    """Run a basic exploratory data analysis."""
    plt.figure(figsize=(15, 10))

    # Histogram
    plt.subplot(2, 3, 1)
    plt.hist(df[target_column], bins=30, alpha=0.7)
    plt.title(f'{target_column} Distribution')
    plt.xlabel(target_column)
    plt.ylabel('Frequency')

    # Box plot
    plt.subplot(2, 3, 2)
    plt.boxplot(df[target_column])
    plt.title(f'{target_column} Boxplot')
    plt.ylabel(target_column)

    # Time series plot (if a date column exists)
    if 'date' in df.columns:
        plt.subplot(2, 3, 3)
        daily_avg = df.groupby('date')[target_column].mean()
        plt.plot(daily_avg.index, daily_avg.values)
        plt.title(f'{target_column} Time Series')
        plt.xlabel('Date')
        plt.ylabel(target_column)
        plt.xticks(rotation=45)

    # Per-category analysis (if a categorical column exists)
    categorical_columns = df.select_dtypes(include=['object']).columns
    if len(categorical_columns) > 0:
        plt.subplot(2, 3, 4)
        cat_col = categorical_columns[0]
        category_means = df.groupby(cat_col)[target_column].mean()
        plt.bar(category_means.index, category_means.values)
        plt.title(f'{target_column} by {cat_col}')
        plt.xlabel(cat_col)
        plt.ylabel(f'Average {target_column}')

    plt.tight_layout()
    plt.show()

# Usage
exploratory_analysis(sales_df, 'sales')

3. Generating a Report

def generate_analysis_report(df, target_column):
    """Build a plain-text analysis report."""
    report = f"""
Data Analysis Report
====================

Overview:
- Total records: {len(df):,}
- Target column: {target_column}

Summary statistics:
- Mean: {df[target_column].mean():.2f}
- Median: {df[target_column].median():.2f}
- Standard deviation: {df[target_column].std():.2f}
- Min: {df[target_column].min():.2f}
- Max: {df[target_column].max():.2f}

Distribution:
- Skewness: {df[target_column].skew():.3f}
- Kurtosis: {df[target_column].kurtosis():.3f}
"""

    # Per-category analysis
    categorical_columns = df.select_dtypes(include=['object']).columns
    if len(categorical_columns) > 0:
        report += "\nPer-category analysis:\n"
        for cat_col in categorical_columns:
            if cat_col != 'date':  # skip the date column
                category_stats = df.groupby(cat_col)[target_column].agg(['mean', 'count'])
                report += f"\nBy {cat_col}:\n"
                for idx, row in category_stats.iterrows():
                    # iterrows() upcasts the count to float, so convert it back
                    report += f"  {idx}: mean {row['mean']:.2f} (count: {int(row['count'])})\n"

    return report

# Usage
print(generate_analysis_report(sales_df, 'sales'))

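Common Analysis Patterns

RFM Analysis (Customer Analysis)

# RFM analysis sample (analyzing customer purchasing behavior)
# Recency (days since last purchase), Frequency (purchase count), Monetary (amount spent)

# Create sample customer data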
customer_data = {
    'customer_id': range(1, 101),
    'last_purchase_days': np.random.randint(1, 365, 100),
    'purchase_frequency': np.random.randint(1, 20, 100),
    'total_spent': np.random.normal(50000, 20000, 100)
}

customer_df = pd.DataFrame(customer_data)

# Compute RFM scores
def calculate_rfm_score(value, quartiles, reverse=False):
    """Compute an RFM score (1-4)."""
    if reverse:  # Recency is reversed (smaller is better)
        if value <= quartiles[0.25]:
            return 4
        elif value <= quartiles[0.5]:
            return 3
        elif value <= quartiles[0.75]:
            return 2
        else:
            return 1
    else:  # Frequency and Monetary are ascending (larger is better)
        if value <= quartiles[0.25]:
            return 1
        elif value <= quartiles[0.5]:
            return 2
        elif value <= quartiles[0.75]:
            return 3
        else:
            return 4

# Compute quartiles
r_quartiles = customer_df['last_purchase_days'].quantile([0.25, 0.5, 0.75])
f_quartiles = customer_df['purchase_frequency'].quantile([0.25, 0.5, 0.75])
m_quartiles = customer_df['total_spent'].quantile([0.25, 0.5, 0.75])

# Apply the RFM scores
customer_df['R_score'] = customer_df['last_purchase_days'].apply(
    lambda x: calculate_rfm_score(x, r_quartiles, reverse=True))
customer_df['F_score'] = customer_df['purchase_frequency'].apply(
    lambda x: calculate_rfm_score(x, f_quartiles))
customer_df['M_score'] = customer_df['total_spent'].apply(
    lambda x: calculate_rfm_score(x, m_quartiles))

# Combined score
customer_df['RFM_score'] = (customer_df['R_score'] + 
                           customer_df['F_score'] + 
                           customer_df['M_score'])

print("RFM analysis results (top 10 customers):")
print(customer_df.sort_values('RFM_score', ascending=False).head(10))
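
The hand-written quartile function makes the scoring rule explicit. As an alternative, pd.qcut can assign quartile scores in one step. The sketch below is one possible formulation, not the only one: the helper quartile_score and the eight-customer dataset are hypothetical, and values are ranked first so that ties cannot produce duplicate bin edges:

```python
import pandas as pd

# Hypothetical customers; column names mirror the example above
customers = pd.DataFrame({
    'customer_id': range(1, 9),
    'last_purchase_days': [5, 40, 100, 200, 10, 300, 60, 150],
    'total_spent': [900, 300, 500, 100, 800, 200, 700, 400],
})

def quartile_score(series, higher_is_better=True):
    """Score 1-4 by quartile; ranking first avoids duplicate bin edges."""
    labels = [1, 2, 3, 4] if higher_is_better else [4, 3, 2, 1]
    ranks = series.rank(method='first')
    return pd.qcut(ranks, 4, labels=labels).astype(int)

# Recency: fewer days since the last purchase is better, so labels are reversed
customers['R_score'] = quartile_score(customers['last_purchase_days'],
                                      higher_is_better=False)
customers['M_score'] = quartile_score(customers['total_spent'])

print(customers)
```

Ranking with method='first' breaks ties by row order, which guarantees equal-sized groups but means tied customers can receive different scores; the explicit threshold function does not have that property.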

A/B Test Analysis

# Example A/B test analysis
np.random.seed(42)

# Sample data (two groups, A and B)
ab_test_data = {
    'user_id': range(1, 1001),
    'group': np.random.choice(['A', 'B'], 1000),
    'conversion': np.random.choice([0, 1], 1000, p=[0.85, 0.15])  # 15% conversion rate
}

# Give group B a slightly higher conversion rate
ab_df = pd.DataFrame(ab_test_data)
b_indices = ab_df['group'] == 'B'
ab_df.loc[b_indices, 'conversion'] = np.random.choice([0, 1],
                                                      sum(b_indices),
                                                      p=[0.82, 0.18])  # 18% conversion rate

# Summarize the results
conversion_by_group = ab_df.groupby('group')['conversion'].agg(['count', 'sum', 'mean'])
conversion_by_group.columns = ['total_users', 'conversions', 'conversion_rate']

print("A/B test results:")
print(conversion_by_group)

# Test for statistical significance
from scipy.stats import chi2_contingency

# Build the contingency table
contingency_table = pd.crosstab(ab_df['group'], ab_df['conversion'])
print("\nContingency table:")
print(contingency_table)

# Chi-square test
chi2, p_value, dof, expected = chi2_contingency(contingency_table)

print("\nChi-square test results:")
print(f"Chi-square statistic: {chi2:.3f}")
print(f"p-value: {p_value:.3f}")
print(f"Degrees of freedom: {dof}")

if p_value < 0.05:
    print("Significant difference (p < 0.05)")
else:
    print("No significant difference (p >= 0.05)")
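
A closely related way to compare two conversion rates is the two-proportion z-test. The sketch below implements it with only the standard library; the function name and the counts are hypothetical. Note that chi2_contingency applies Yates' continuity correction to 2x2 tables by default, so its p-value will typically be somewhat larger than the uncorrected z-test's:

```python
import math

def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    rate_a = conversions_a / n_a
    rate_b = conversions_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / standard_error
    # Standard normal CDF expressed through the error function
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: 15% vs 18% conversion with 1,000 users per group
z, p_value = two_proportion_z_test(150, 1000, 180, 1000)
print(f"z = {z:.3f}, p = {p_value:.3f}")
```

With these counts the p-value comes out at roughly 0.07, which illustrates that a three-point lift is not necessarily significant at the 0.05 level with 1,000 users per group.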

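Best Practices for Visualizing Analysis Results

Building Effective Charts

# Example of a polished, information-rich figure
plt.style.use('seaborn-v0_8')  # set the style (name used by Matplotlib 3.6 and later)

fig, axes = plt.subplots(2, 2, figsize=(15, 12))

# 1. Sales trend (with moving average)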
sales_df['sales_ma7'] = sales_df['sales'].rolling(window=7).mean()
axes[0, 0].plot(sales_df['date'], sales_df['sales'], alpha=0.3, label='Daily Sales')
axes[0, 0].plot(sales_df['date'], sales_df['sales_ma7'], linewidth=2, label='7-day Moving Average')
axes[0, 0].set_title('Sales Trend with Moving Average', fontsize=14, fontweight='bold')
axes[0, 0].set_xlabel('Date')
axes[0, 0].set_ylabel('Sales')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Sales by category (bar chart with error bars)
category_stats = sales_df.groupby('product_category')['sales'].agg(['mean', 'std'])
x_pos = range(len(category_stats))
axes[0, 1].bar(x_pos, category_stats['mean'], 
               yerr=category_stats['std'], capsize=5, alpha=0.8)
axes[0, 1].set_title('Sales by Product Category', fontsize=14, fontweight='bold')
axes[0, 1].set_xlabel('Product Category')
axes[0, 1].set_ylabel('Average Sales')
axes[0, 1].set_xticks(x_pos)
axes[0, 1].set_xticklabels(category_stats.index)
axes[0, 1].grid(True, alpha=0.3)

# 3. Distribution by region (pie chart)
region_sales = sales_df.groupby('region')['sales'].sum()
colors = plt.cm.Set3(np.linspace(0, 1, len(region_sales)))
wedges, texts, autotexts = axes[1, 0].pie(region_sales.values, 
                                         labels=region_sales.index,
                                         autopct='%1.1f%%',
                                         colors=colors)
axes[1, 0].set_title('Sales Distribution by Region', fontsize=14, fontweight='bold')

# 4. Correlation heatmap
numeric_cols = sales_df.select_dtypes(include=[np.number]).columns
correlation_matrix = sales_df[numeric_cols].corr()
im = axes[1, 1].imshow(correlation_matrix, cmap='coolwarm', vmin=-1, vmax=1)
axes[1, 1].set_xticks(range(len(numeric_cols)))
axes[1, 1].set_yticks(range(len(numeric_cols)))
axes[1, 1].set_xticklabels(numeric_cols, rotation=45)
axes[1, 1].set_yticklabels(numeric_cols)
axes[1, 1].set_title('Correlation Matrix', fontsize=14, fontweight='bold')

# Add a colorbar
plt.colorbar(im, ax=axes[1, 1])

# Show the correlation coefficients
for i in range(len(numeric_cols)):
    for j in range(len(numeric_cols)):
        text = axes[1, 1].text(j, i, f'{correlation_matrix.iloc[i, j]:.2f}',
                              ha="center", va="center", color="black")

plt.tight_layout()
plt.show()
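
The moving average in the first panel uses rolling(window=7).mean(), which leaves the first six values empty because a full window is not yet available. The sketch below shows that behavior on a short made-up series with a window of 3, along with two common variations:

```python
import pandas as pd

# Hypothetical daily values
daily = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

strict = daily.rolling(window=3).mean()                  # NaN until 3 values exist
lenient = daily.rolling(window=3, min_periods=1).mean()  # averages what is available
centered = daily.rolling(window=3, center=True).mean()   # window centred on each point

print(pd.DataFrame({'strict': strict, 'lenient': lenient, 'centered': centered}))
```

The default trailing window lags behind turning points, while a centered window aligns better with the data but is undefined at both ends; which trade-off is acceptable depends on whether the chart is for monitoring or for retrospective analysis.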

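Next Steps

Moving On to Machine Learning

Once you have the basics of data analysis down, machine learning is a natural next step:

# A simple machine learning example (linear regression)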
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

# Example sales prediction model
# Prepare the features
features = sales_df[['day_of_year']].values
target = sales_df['sales'].values

# Split into training and test data
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, random_state=42)

# Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Predict
y_pred = model.predict(X_test)

# Evaluate
r2 = r2_score(y_test, y_pred)
print(f"R² score: {r2:.3f}")

# Visualize
plt.figure(figsize=(10, 6))
plt.scatter(X_test, y_test, alpha=0.5, label='Actual')
plt.scatter(X_test, y_pred, alpha=0.5, label='Predicted')
plt.xlabel('Day of Year')
plt.ylabel('Sales')
plt.title('Sales Prediction Model')
plt.legend()
plt.show()
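
A straight line in day_of_year cannot follow a yearly cycle, so a model like the one above may fit seasonal data poorly. One common remedy is to encode the day as sine and cosine features. The sketch below illustrates the idea with NumPy's least-squares solver on noise-free, made-up sales; the helper fit_r2 is hypothetical, and R² is computed in-sample purely for illustration:

```python
import numpy as np

day_of_year = np.arange(1, 366)
angle = 2 * np.pi * day_of_year / 365
# Hypothetical noise-free sales with a yearly cycle
sales = 100_000 + 10_000 * np.sin(angle)

def fit_r2(design, target):
    """Ordinary least squares via np.linalg.lstsq; returns (coefficients, R^2)."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    r2 = 1 - (residual ** 2).sum() / ((target - target.mean()) ** 2).sum()
    return coef, r2

ones = np.ones_like(angle)
linear_design = np.column_stack([ones, day_of_year])                    # intercept + day
cyclic_design = np.column_stack([ones, np.sin(angle), np.cos(angle)])   # intercept + cycle

_, r2_linear = fit_r2(linear_design, sales)
cyclic_coef, r2_cyclic = fit_r2(cyclic_design, sales)

print(f"R^2 with day_of_year only: {r2_linear:.3f}")
print(f"R^2 with sin/cos features: {r2_cyclic:.3f}")
```

The same two columns can be passed to LinearRegression as extra features; with real, noisy data the improvement will be smaller and should be checked on held-out data.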

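Learning Resources

To build on these data analysis skills, consider studying the following topics:

Statistics fundamentals: hypothesis testing, confidence intervals, regression analysis
Machine learning: Scikit-learn, supervised learning, unsupervised learning
Databases: SQL, data warehouses
Big data: Apache Spark, distributed processing
Deep learning: TensorFlow, PyTorch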

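Summary

This article covered the basics of data analysis in Python with practical code examples. The key points are summarized below:

Skills Covered

NumPy: fundamentals of numerical computing and array operations
Pandas: techniques for manipulating, aggregating, and transforming data
Matplotlib: data visualization techniques
Statistical analysis: descriptive statistics, hypothesis testing, correlation analysis
Practical analysis: RFM analysis, A/B testing, time series analysis

The Basic Data Analysis Process

Understanding the data: getting an overview, checking quality
Data cleaning: handling missing values and outliers
Exploratory data analysis: discovering patterns and trends
Statistical analysis: hypothesis testing, correlation analysis
Visualization: presenting results effectively
Reporting: sharing findings

Best Practices

Reproducibility: version control for code and data
Readability: comments and documentation
Validation: checking that results are sound
Visualization: choosing appropriate charts
Continuous learning: picking up new methods and tools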
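
For the reproducibility point, a habit that complements version control is fixing random seeds so that simulated or sampled data can be regenerated exactly. A minimal sketch using NumPy's Generator API; the function simulate_sales and its parameters are hypothetical:

```python
import numpy as np

def simulate_sales(seed, n_days=5):
    """Hypothetical generator of daily sales; the seed fixes the random draw."""
    rng = np.random.default_rng(seed)
    return rng.normal(loc=100_000, scale=20_000, size=n_days)

run_1 = simulate_sales(seed=42)
run_2 = simulate_sales(seed=42)   # same seed, same numbers
run_3 = simulate_sales(seed=7)    # different seed, different numbers

print(np.array_equal(run_1, run_2))
print(np.array_equal(run_1, run_3))
```

Passing a generator or seed explicitly, rather than relying on global state set by np.random.seed, keeps each function's randomness independent of the order in which notebook cells are run.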

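Data analysis is a powerful skill for drawing valuable insights from data. Building on the fundamentals in this article, keep gaining experience with real data to develop more advanced analysis techniques.

If you have questions or feedback, feel free to get in touch through the contact page!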
