Bag of Words Analysis with Python for Blog Posts

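Text preprocessing:
- Cleaning: remove special characters and punctuation so that only words remain.
- Lowercasing: convert every word to lowercase so that "This" and "this" count as the same word.
- Stop word removal: remove words that occur frequently but carry little meaning on their own (e.g. "a", "the", "in").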
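
The three steps above can be sketched with the standard library alone. This is a minimal illustration, not a production tokenizer: the `preprocess` function and the tiny `STOP_WORDS` set are hypothetical names made up for this example, and a real project would likely use a fuller stop word list.

```python
import re

# Hypothetical, deliberately tiny stop word list (for illustration only).
STOP_WORDS = {"a", "an", "the", "in", "is", "of", "and"}

def preprocess(text):
    # 1. Cleaning: replace anything that is not a word character or whitespace.
    text = re.sub(r"[^\w\s]", " ", text)
    # 2. Lowercasing.
    text = text.lower()
    # 3. Stop word removal on whitespace-separated tokens.
    return [token for token in text.split() if token not in STOP_WORDS]

tokens = preprocess("The cat sat in a box, and the box was RED!")
print(tokens)  # ['cat', 'sat', 'box', 'box', 'was', 'red']
```

Note that word order is kept here only because the result is a list; a bag of words discards it once the tokens are counted.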

Vectorizing words:

Using CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.',
          'This document is the second document.',
          'And this is the third one.',
          'Is this the first document?']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # learn the vocabulary and count words per document
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())  # show the vocabulary
print(X.toarray())  # show the count vector for each document
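
To make clear what the count matrix contains, here is a rough standard-library sketch of the same idea. It is an approximation of CountVectorizer's default behavior (lowercasing, tokens of two or more word characters, alphabetically sorted vocabulary), not its actual implementation; `tokenize`, `docs`, `vocabulary`, and `matrix` are names chosen for this example.

```python
import re
from collections import Counter

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

# One Counter per document: word -> number of occurrences.
counts = [Counter(tokenize(doc)) for doc in docs]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted(set().union(*counts))

# Each row is one document, each column one vocabulary word.
matrix = [[c[term] for term in vocabulary] for c in counts]

print(vocabulary)
for row in matrix:
    print(row)
```

The second row has a 2 in the "document" column because that word appears twice in the second sentence; word order plays no role.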

Using TfidfVectorizer (weights each word by how informative it is across the corpus):

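from sklearn.feature_extraction.text import TfidfVectorizer
# Reuses the `corpus` list defined in the CountVectorizer example
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
print(vectorizer.get_feature_names_out())  # show the vocabulary
print(X.toarray())  # show the TF-IDF vector for each document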
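
The weighting can be worked through by hand. The sketch below assumes TfidfVectorizer's default settings (raw term counts, smoothed idf = ln((1 + n) / (1 + df)) + 1, then L2 normalization of each row); it is a simplified reconstruction for illustration, and the helper names `tokenize` and `tfidf_row` are made up for this example.

```python
import math
import re
from collections import Counter

docs = ['This is the first document.',
        'This document is the second document.',
        'And this is the third one.',
        'Is this the first document?']

def tokenize(doc):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", doc.lower())

counts = [Counter(tokenize(doc)) for doc in docs]
vocabulary = sorted(set().union(*counts))
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = {t: sum(1 for c in counts if t in c) for t in vocabulary}

# Smoothed inverse document frequency: rare terms get larger values.
idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocabulary}

def tfidf_row(c):
    # Term count times idf, then scale the row to unit Euclidean length.
    raw = [c[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

weights = [tfidf_row(c) for c in counts]

for term, w in zip(vocabulary, weights[1]):
    print(f"{term:10s} {w:.3f}")
```

In the second document, "second" appears once and so does "the", yet "second" receives the larger weight because it occurs in only one document while "the" occurs in all four.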

Document classification:

Use a classifier (e.g. naive Bayes, logistic regression, or a support vector machine) to assign each text to a category.

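from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
# Build a pipeline that vectorizes the text and then classifies it
# (`vectorizer` and `corpus` come from the examples above; fit() refits the vectorizer)
classifier = make_pipeline(vectorizer, MultinomialNB())
# Train on the labeled texts, one label per document
labels = ['category1', 'category2', 'category1', 'category2']
classifier.fit(corpus, labels)
# Predict on the training texts (a sanity check, not a real evaluation)
predictions = classifier.predict(corpus)
print(predictions)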
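
Predicting on the same texts used for training says little about how the model handles new posts. As a rough sketch, the same kind of pipeline can be applied to unseen text; the two sentences in `new_texts` are invented for this example, and with only four training documents the predicted labels are not meaningful in themselves.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ['This is the first document.',
               'This document is the second document.',
               'And this is the third one.',
               'Is this the first document?']
train_labels = ['category1', 'category2', 'category1', 'category2']

# Invented, unseen texts for illustration only.
new_texts = ['A fifth document appears.', 'Is this the last one?']

# A fresh pipeline: vectorize with TF-IDF, then classify with naive Bayes.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

# Words not seen during training (e.g. "fifth") are ignored by the vectorizer.
new_predictions = model.predict(new_texts)
probabilities = model.predict_proba(new_texts)  # one row per text, one column per class

for text, label, probs in zip(new_texts, new_predictions, probabilities):
    print(text, '->', label, probs.round(3))
```

For a real blog corpus, holding out part of the labeled posts and measuring accuracy on them would give a more honest picture than either of these tiny examples.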