Hi everyone,
I'm facing the following challenge. It concerns sentiment classification; dataset: https://keras.io/api/datasets/reuters
Is anyone familiar with this and willing to help me with the code (see below)?
You can implement your algorithm using any library.
Pre-processing
- You will create feature vector using histogram (word count) technique
Method
- Multinomial naive Bayes classifier
Problem Statement
In this project your task is to design a naive Bayes classifier for sentiment classification. You will use
the Reuters newswire dataset provided by Keras. In the dataset there are 46 class labels. You should
construct your feature vectors by taking the histogram (word count) of the text. When you call the
dataset from the Keras library, you have several options to pass to the constructor.
Options that you need to play with are:
num_words: integer or None. Words are ranked by how often they occur (in the training set) and
only the num_words most frequent words are kept. Any less frequent words will appear as oov_char
value in the sequence data. If None, all words are kept. Defaults to None, so all words are kept.
skip_top: skip the top N most frequently occurring words (which may not be informative).
These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
Remember that num_words will affect the number of bins in your dataset. You may set skip_top to
any small integer since the most frequently used words are redundant, e.g. "the", "a", "an", "it".
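To make the two options concrete, here is a minimal sketch that imitates what `num_words` and `skip_top` do to one already encoded newswire. `apply_vocab_limits` is a made-up helper, not a Keras function, and it assumes the default `oov_char=2`; it only mirrors the behaviour described in the option text above.

```python
def apply_vocab_limits(sequence, num_words=None, skip_top=0, oov_char=2):
    """Replace word indices outside [skip_top, num_words) with oov_char.

    Hypothetical helper that imitates the documented effect of the
    num_words and skip_top options on one integer-encoded newswire.
    """
    if num_words is None:
        # No upper limit: every index that is not skipped is kept.
        num_words = max(sequence) + 1 if sequence else 0
    return [w if skip_top <= w < num_words else oov_char for w in sequence]


# Toy sequence: small indices stand for frequent words, large ones for rare words.
toy = [1, 4, 27595, 8, 43, 3095]
limited = apply_vocab_limits(toy, num_words=1000, skip_top=5)
print(limited)
```

Every index that falls out of range collapses into the single `oov_char` bin, which is why `num_words` effectively sets the number of histogram bins.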
Code so far:
{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"id": "85b3eaef",
"metadata": {},
"outputs": [],
"source": [
"# i"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "98fc9696",
"metadata": {},
"outputs": [],
"source": [
"import keras\n",
"import keras_utils"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d4fa4a38",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x_train : (8982,)\n",
"x_test : (2246,)\n"
]
}
],
"source": [
"print(\"x_train :\", x_train.shape)\n",
"print(\"x_test :\", x_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "91b495e9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape of y_train is : (8982,)\n",
"shape of y_test is : (2246,)\n"
]
}
],
"source": [
"print(\"shape of y_train is :\", y_train.shape)\n",
"print(\"shape of y_test is :\", y_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bdeeda6b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 27595, 28842, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]\n",
"3\n"
]
}
],
"source": [
"print(x_train[0])\n",
"print(y_train[0])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1c0edb8f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"62"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_index = reuters.get_word_index()\n",
"word_index['oil']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4e646da8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'oil'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Loop to build the reverse mapping from index to word\n",
"index_to_word = {} \n",
"for key, value in word_index.items():\n",
" index_to_word[value] = key\n",
" \n",
"index_to_word[62] # index 62 is \"oil\"; \"the\" (index 1) is the most frequent word in the 11,228 news articles"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c68f266b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs\n",
"3\n"
]
}
],
"source": [
"# check which words x_train contains\n",
"print(' '.join([index_to_word[x] for x in x_train[0]]))\n",
"print(y_train[0]) # label of sample 0 in x_train"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6f324a7b",
"metadata": {},
"outputs": [],
"source": [
"from keras.preprocessing.text import Tokenizer\n",
"\n",
"max_words = 100\n",
"\n",
"tokenizer = Tokenizer(num_words=100)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b198365d",
"metadata": {},
"outputs": [],
"source": [
"y_train = tokenizer.sequences_to_matrix(x_train, mode='binary')\n",
"y_test = tokenizer.sequences_to_matrix(x_test, mode='binary')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "d88cd974",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape of x_train is (8982,)\n",
"shape of x_test is (2246,)\n",
"data in training sample 1: [1, 27595, 28842, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]\n"
]
}
],
"source": [
"print(\"shape of x_train is \", x_train.shape)\n",
"print(\"shape of x_test is \",x_test.shape)\n",
"print(\"data in training sample 1: \", x_train[0])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e7075e43",
"metadata": {},
"outputs": [
{
"ename": "ImportError",
"evalue": "cannot import name 'to_categorical' from 'keras.utils' (C:\\Users\\xmcc\\Anaconda3\\lib\\site-packages\\keras\\utils\\__init__.py)",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mImportError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m~\\AppData\\Local\\Temp/ipykernel_14400/3408643968.py\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mutils\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mto_categorical\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0mnum_classes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m46\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0my_train\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mto_categorical\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_train\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_classes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0my_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mto_categorical\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_classes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mImportError\u001b[0m: cannot import name 'to_categorical' from 'keras.utils' (C:\\Users\\xmcc\\Anaconda3\\lib\\site-packages\\keras\\utils\\__init__.py)"
]
}
],
"source": [
"from keras.utils import to_categorical\n",
"num_classes = 46\n",
"\n",
"y_train = to_categorical(y_train, num_classes)\n",
"y_test = to_categorical(y_test, num_classes)\n",
"\n",
"print(y_train.shape)\n",
"print(y_train[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d2ee35",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Sentiment Classification
This looks like an exported IPython notebook to me.
Nobody can read it like this.
If the export doesn't work, do it by hand: copy and paste the code (not the output) into Notepad and then post all of it here.
Use the full editor, click the </> button there, and put the code between the tags.
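If copying by hand is tedious, a small script can pull just the code cells out of the .ipynb file. This is a minimal sketch, assuming the usual nbformat 4 layout (a top-level "cells" list whose entries carry "cell_type" and "source"); `extract_code_cells` is an invented helper name.

```python
import json


def extract_code_cells(notebook_json):
    """Return the source of every non-empty code cell in an .ipynb document."""
    notebook = json.loads(notebook_json)
    cells = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", [])
        # nbformat 4 stores the source either as a list of lines or as one string.
        text = "".join(source) if isinstance(source, list) else source
        if text.strip():
            cells.append(text)
    return cells


# Tiny stand-in for a real notebook file (normally read with open(path).read()).
example = json.dumps({
    "cells": [
        {"cell_type": "code", "source": ["import keras\n", "import keras_utils"], "outputs": []},
        {"cell_type": "markdown", "source": ["# notes"]},
        {"cell_type": "code", "source": [], "outputs": []},
    ],
    "nbformat": 4,
    "nbformat_minor": 5,
})
code_cells = extract_code_cells(example)
print("\n\n".join(code_cells))
```

From the command line, `jupyter nbconvert --to script` on the notebook file gives a similar result.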
I am a pacifist and attack no one, not even with words.
All my code examples come with the caveat: "There is always a better way."
https://projecteuler.net/profile/Brotherluii.png
Hi ThomasL
Thanks for the info. Here is my code so far, as requested. Further down there is also a complete example of a multinomial naive Bayes classifier.
That second example is complete, fully working code; I would like to implement nearly the same thing for the Reuters dataset.
Code: Select all
import numpy as np
import pandas as pd
import tensorflow as tf
# Use the Keras that ships with TensorFlow; the extra "import keras" and the unused keras_utils are dropped
from tensorflow import keras
Code: Select all
from keras.datasets import reuters
Code: Select all
# load_data returns the encoded newswires and labels; get_word_index only returns the vocabulary
(x_train, y_train), (x_test, y_test) = reuters.load_data(test_split=0.2)
Code: Select all
print("x_train :", x_train.shape)
print("x_test :", x_test.shape)
Code: Select all
print(x_train[0])
print(y_train[0])
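Code: Select all
# Reverse the vocabulary: map each index back to its word
word_index = reuters.get_word_index()
index_to_word = {}
for key, value in word_index.items():
    index_to_word[value] = key
index_to_word[62]
Code: Select all
# Indices in x_train are shifted by 3 (0, 1 and 2 are reserved for padding, start and unknown)
print(' '.join([index_to_word.get(i - 3, '?') for i in x_train[0]]))
print(y_train[0])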
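Code: Select all
# Same call with the options spelled out (all set to their defaults); num_words and skip_top are the ones to vary
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(
    path="reuters.npz",
    num_words=None,
    skip_top=0,
    maxlen=None,
    test_split=0.2,
    seed=113,
    start_char=1
)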
Code: Select all
from keras.preprocessing.text import Tokenizer
max_words = 100
tokenizer = Tokenizer(num_words=max_words)
Code: Select all
# Build word-count features from the sequences; y_train and y_test must stay the labels
x_train_counts = tokenizer.sequences_to_matrix(x_train, mode='count')
x_test_counts = tokenizer.sequences_to_matrix(x_test, mode='count')
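The assignment asks for word counts rather than binary indicators. A histogram of that kind can also be built directly with NumPy; the following is only a sketch, `sequences_to_counts` is an illustrative helper (not part of Keras), and dropping indices without a bin is my assumption about the desired behaviour.

```python
import numpy as np


def sequences_to_counts(sequences, num_words):
    """Build a (n_samples, num_words) word-count matrix from index sequences."""
    counts = np.zeros((len(sequences), num_words), dtype=np.int64)
    for row, seq in enumerate(sequences):
        idx = np.asarray(seq, dtype=np.int64)
        idx = idx[idx < num_words]  # drop indices that have no bin
        counts[row] = np.bincount(idx, minlength=num_words)
    return counts


# Two toy "newswires"; index 120 has no bin when num_words=10 and is ignored.
toy_sequences = [[1, 5, 5, 7], [1, 2, 2, 2, 120]]
X = sequences_to_counts(toy_sequences, num_words=10)
print(X)
```

Each row is one document and each column one word index, so the number of columns equals the `num_words` value chosen for the experiment.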
Code: Select all
print("shape of x_train is ", x_train.shape)
print("shape of x_test is ", x_test.shape)
print("data in training sample 1: ", x_train[0])
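Code: Select all
# Importing from tensorflow.keras usually avoids the ImportError that keras.utils raises in some installations
from tensorflow.keras.utils import to_categorical
num_classes = 46
# One-hot labels are only needed for neural networks; a naive Bayes classifier works with the integer labels
y_train_onehot = to_categorical(y_train, num_classes)
y_test_onehot = to_categorical(y_test, num_classes)
print(y_train_onehot.shape)
print(y_train_onehot[0])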
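One possible way to finish this for Reuters is to feed a word-count matrix and the integer class labels straight into scikit-learn's `MultinomialNB`. The sketch below uses small toy arrays so it runs without downloading anything; `train_and_score` is a hypothetical helper, and the toy data merely stands in for the real histograms.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB


def train_and_score(X_train, y_train, X_test, y_test, alpha=1.0):
    """Fit multinomial naive Bayes on count features and return (model, accuracy)."""
    model = MultinomialNB(alpha=alpha)  # alpha=1.0 is Laplace smoothing
    model.fit(X_train, y_train)         # y_train: integer class labels, not one-hot
    accuracy = accuracy_score(y_test, model.predict(X_test))
    return model, accuracy


# Toy count matrices standing in for the Reuters histograms (3 word bins, 2 classes).
X_train = np.array([[3, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[5, 0, 0], [0, 2, 2]])
y_test = np.array([0, 1])

model, accuracy = train_and_score(X_train, y_train, X_test, y_test)
print("accuracy:", accuracy)
```

For the real data, replace the toy arrays with the count matrices of the training and test newswires and the labels as returned by `reuters.load_data`.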
Code: Select all
# Import some relevant packages and datasets
# Import train-test-split
from sklearn.model_selection import train_test_split
# Import naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB
# Define multinomial naive Bayes classifier as mnb
mnb = MultinomialNB()
import pandas as pd
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
import seaborn as sns
iris = sns.load_dataset("iris")
Code: Select all
# Consider newsgroups texts
from sklearn.datasets import fetch_20newsgroups
import numpy as np
Code: Select all
# Load train subset data
newsgroups_train = fetch_20newsgroups(subset='train')
Code: Select all
from pprint import pprint
pprint(list(newsgroups_train.target_names))
print("\nThere are ", len(list(newsgroups_train.target_names)), " categories.")
Code: Select all
newsgroups_train.filenames
Code: Select all
newsgroups_train.target
Code: Select all
#categories = ['talk.religion.misc', 'sci.med','sci.space', 'comp.graphics']
categories = list(newsgroups_train.target_names)
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)
Code: Select all
print(train.data[3])
Code: Select all
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix
Code: Select all
# Chain two (or, more generally, several) steps in a pipeline:
# the text is first vectorized and then passed to multinomial naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
Code: Select all
model
Code: Select all
# Fit model
model.fit(train.data, train.target)
# Predicted labels
predictedlabels = model.predict(test.data)
Code: Select all
mat = confusion_matrix(test.target, predictedlabels)
fig = plt.figure(figsize=(12, 12))
# Plot heatmap (generalized confusion matrix, i.e. confusion matrix for multiclass classification)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)
plt.xlabel('true label')
plt.ylabel('predicted label');
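For anyone who wants to see what the classifier computes rather than treat it as a black box: multinomial naive Bayes scores each class c as log P(c) plus the sum over words of count(w) * log P(w|c), with additive smoothing on the word counts. The following NumPy sketch is my own illustration of that rule on toy data, not scikit-learn's implementation; both function names are made up.

```python
import numpy as np


def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log word likelihoods per class."""
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    # Summed word counts per class, plus alpha for additive smoothing.
    word_counts = np.array([X[y == c].sum(axis=0) for c in classes]) + alpha
    log_likelihood = np.log(word_counts / word_counts.sum(axis=1, keepdims=True))
    return classes, log_prior, log_likelihood


def predict_multinomial_nb(X, classes, log_prior, log_likelihood):
    """Pick the class with the highest log posterior for each row of X."""
    scores = X @ log_likelihood.T + log_prior
    return classes[np.argmax(scores, axis=1)]


# Toy word-count matrix: 4 documents, 3 word bins, 2 classes.
X_train = np.array([[3, 0, 1], [4, 1, 0], [0, 3, 2], [1, 4, 3]])
y_train = np.array([0, 0, 1, 1])
params = fit_multinomial_nb(X_train, y_train)
predictions = predict_multinomial_nb(np.array([[5, 0, 0], [0, 2, 2]]), *params)
print(predictions)
```

Working in log space keeps the product of many small probabilities from underflowing, which matters once documents contain hundreds of words.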