Sentiment Classification

lev
User
Posts: 2
Registered: Wednesday, 12 January 2022, 17:47

Hello everyone

I am facing the following challenge :o :D . It is about: sentiment classification, dataset: https://keras.io/api/datasets/reuters

Does anyone know their way around this and would like to help me with the code below?

You can implement your algorithm using any library.

Pre-processing
- You will create feature vectors using the histogram (word count) technique

Method
- Multinomial naive Bayes classifier

Problem Statement
In this project your task is to design a naive Bayes classifier for sentiment classification. You will use
the Reuters newswire dataset provided by Keras. The dataset has 46 class labels. You should
construct your feature vectors by taking the histogram (word count) of the text. When you load the
dataset from the Keras library, you can pass several options to the loader.

Options that you need to play with are:

num_words: integer or None. Words are ranked by how often they occur (in the training set) and
only the num_words most frequent words are kept. Any less frequent words will appear as oov_char
value in the sequence data. If None, all words are kept. Defaults to None, so all words are kept.

skip_top: skip the top N most frequently occurring words (which may not be informative).
These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
Remember that num_words will affect the number of bins in your histograms. You may set skip_top to
any small integer, since the most frequently used words carry little information, e.g. "the", "a", "an", "it".
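
For illustration, both options go straight into the Keras loader, and num_words then fixes the length of each histogram. A minimal sketch (the values 10_000 and 20 are example choices, not given by the assignment):

Code: Select all

import numpy as np
from tensorflow.keras.datasets import reuters

num_words = 10_000  # example: keep only the 10,000 most frequent words
(x_train, y_train), (x_test, y_test) = reuters.load_data(
    num_words=num_words, skip_top=20, test_split=0.2
)

# histogram (word count) feature vector for one document: every index is
# below num_words, so np.bincount with minlength gives a fixed-size vector
hist = np.bincount(x_train[0], minlength=num_words)
print(hist.shape)  # (10000,)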

My code so far (pasted from an exported Jupyter notebook; the cell sources and their outputs are extracted below):

# In [ ]:
# i

# In [2]:
import keras
import keras_utils

# In [3]:
# (x_train/y_train are used below, but the notebook is missing the
# reuters.load_data cell that defines them)
print("x_train :", x_train.shape)
print("x_test :", x_test.shape)
# Output:
# x_train : (8982,)
# x_test : (2246,)

# In [4]:
print("shape of y_train is :", y_train.shape)
print("shape of y_test is :", y_test.shape)
# Output:
# shape of y_train is : (8982,)
# shape of y_test is : (2246,)

# In [5]:
print(x_train[0])
print(y_train[0])
# Output:
# [1, 27595, 28842, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]
# 3

# In [6]:
word_index = reuters.get_word_index()
word_index['oil']
# Output: 62

# In [7]:
# loop to invert the mapping: index -> word
index_to_word = {}
for key, value in word_index.items():
    index_to_word[value] = key

index_to_word[62]  # 'oil' has rank 62 among the words of the 11,228 news articles
# Output: 'oil'

# In [8]:
# to check which words x_train[0] contains
print(' '.join([index_to_word[x] for x in x_train[0]]))
print(y_train[0])  # label of sample 0 of x_train
# Output:
# the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs
# 3

# In [9]:
from keras.preprocessing.text import Tokenizer

max_words = 100

tokenizer = Tokenizer(num_words=100)

# In [16]:
# (note: this overwrites the label vectors y_train/y_test)
y_train = tokenizer.sequences_to_matrix(x_train, mode='binary')
y_test = tokenizer.sequences_to_matrix(x_test, mode='binary')

# In [19]:
print("shape of x_train is ", x_train.shape)
print("shape of x_test is ", x_test.shape)
print("data in training sample 1: ", x_train[0])
# Output: same shapes and first sample as above

# In [12]:
from keras.utils import to_categorical
num_classes = 46

y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

print(y_train.shape)
print(y_train[0])
# Raises:
# ImportError: cannot import name 'to_categorical' from 'keras.utils'
# (C:\Users\xmcc\Anaconda3\lib\site-packages\keras\utils\__init__.py)
ThomasL
User
Posts: 1367
Registered: Monday, 14 May 2018, 14:44
Location: Kreis Unna NRW

That looks to me like an exported IPython notebook.
Nobody can read it like that.
If the export doesn't work, do it by hand: copy and paste the code (not the output) into Notepad, then paste all of it here.
Use the full editor, click the </> button there, and put the code between the tags.
I am a pacifist and attack nobody, not even with words.
For all my code examples: "There is always a better way."
https://projecteuler.net/profile/Brotherluii.png
lev
User
Posts: 2
Registered: Wednesday, 12 January 2022, 17:47

Hi ThomasL

Thanks for the info. Here is my code so far, as requested:
Please also see, further down, a complete analogous example of a multinomial naive Bayes classifier :geek:

Code: Select all

import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
# import keras_utils  # course-specific helper, not needed for this task

Code: Select all

from keras.datasets import reuters 

Code: Select all

# load_data returns the train/test split; get_word_index only returns the vocabulary dict
(x_train, y_train), (x_test, y_test) = reuters.load_data(test_split=0.2)

Code: Select all

print("x_train :", x_train.shape)
print("x_test :", x_test.shape)

Code: Select all

print(x_train[0])
print(y_train[0])

Code: Select all

word_index = reuters.get_word_index()  # word -> index mapping

# invert the mapping: index -> word
index_to_word = {}
for key, value in word_index.items():
    index_to_word[value] = key

index_to_word[62]  # -> 'oil'

Code: Select all

print(' '.join([index_to_word[x] for x in x_train[0]]))
print(y_train[0])
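
One caveat, based on the Keras docs rather than the assignment: load_data offsets the real word indices by index_from=3 (index 1 is the start marker, 2 the out-of-vocabulary marker), so the mapping above is shifted by three positions. A sketch of a faithful decoding:

Code: Select all

# index_from=3 is the load_data default; indices 1 and 2 are the start
# and OOV markers, which have no dictionary entry
decoded = ' '.join(index_to_word.get(i - 3, '?') for i in x_train[0])
print(decoded)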

Code: Select all

# assign the returned tuples, otherwise the loaded data is discarded
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.reuters.load_data(
    path="reuters.npz",
    num_words=None,
    skip_top=0,
    maxlen=None,
    test_split=0.2,
    seed=113,
    start_char=1
)

Code: Select all

from keras.preprocessing.text import Tokenizer

max_words = 100

tokenizer = Tokenizer(num_words=max_words)  # use the variable, not a magic number
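
A caveat, assuming the usual Tokenizer behaviour: sequences_to_matrix only counts indices below num_words, so with num_words=100 almost the whole vocabulary is dropped. For the histogram features it should presumably match the num_words used in load_data, e.g.:

Code: Select all

max_words = 10_000  # example value; match the num_words passed to load_data
tokenizer = Tokenizer(num_words=max_words)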

Code: Select all

# assign the matrices to x_train/x_test (assigning them to y_train/y_test would
# overwrite the labels); mode='count' gives the word-count histogram the task asks for
x_train = tokenizer.sequences_to_matrix(x_train, mode='count')
x_test = tokenizer.sequences_to_matrix(x_test, mode='count')

Code: Select all

print("shape of x_train is ", x_train.shape)
print("shape of x_test is ",x_test.shape)
print("data in training sample 1: ", x_train[0])

Code: Select all

# with TF 2.x this import lives under tensorflow.keras.utils
# (keras.utils.to_categorical raises the ImportError seen in the notebook)
from tensorflow.keras.utils import to_categorical
num_classes = 46

y_train = to_categorical(y_train, num_classes)
y_test = to_categorical(y_test, num_classes)

print(y_train.shape)
print(y_train[0])
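
A side note, assuming the classifier ends up being scikit-learn's MultinomialNB as in the example below: it takes the integer class labels 0..45 directly, so the one-hot step is not needed at all. A sketch:

Code: Select all

from sklearn.naive_bayes import MultinomialNB

# uses the integer labels exactly as returned by load_data,
# i.e. skipping the to_categorical step above
mnb = MultinomialNB()
mnb.fit(x_train, y_train)         # x_train: word-count matrix
print(mnb.score(x_test, y_test))  # mean accuracy on the test set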
Below is a complete, fully working example of a multinomial naive Bayes classifier. I would essentially like to implement almost the same thing for the Reuters dataset; a sketch adapting it follows after the example.

Code: Select all

# Import some relevant packages and datasets

# train-test split
from sklearn.model_selection import train_test_split

# naive Bayes classifier
from sklearn.naive_bayes import MultinomialNB

# define the multinomial naive Bayes classifier as mnb
mnb = MultinomialNB()

import pandas as pd
from sklearn.metrics import accuracy_score

import matplotlib.pyplot as plt
import seaborn as sns

iris = sns.load_dataset("iris")  # loaded here but not used below

Code: Select all

# Consider the 20 newsgroups texts
from sklearn.datasets import fetch_20newsgroups

import numpy as np

Code: Select all

newsgroups_train = fetch_20newsgroups(subset='train')  # load the train subset

Code: Select all

from pprint import pprint

pprint(list(newsgroups_train.target_names))
print("\nThere are ", len(list(newsgroups_train.target_names)), " categories.")

Code: Select all

newsgroups_train.filenames

Code: Select all

newsgroups_train.target

Code: Select all

#categories = ['talk.religion.misc', 'sci.med','sci.space', 'comp.graphics']
categories = list(newsgroups_train.target_names)

train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

Code: Select all

print(train.data[3])

Code: Select all

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import confusion_matrix

Code: Select all

# chain two (or, more generally, several) models in a pipeline:
# the data is first vectorized, then fed to multinomial naive Bayes
model = make_pipeline(TfidfVectorizer(), MultinomialNB())

Code: Select all

model

Code: Select all

# fit the model
model.fit(train.data, train.target)

# predict labels for the test set
predictedlabels = model.predict(test.data)

Code: Select all

mat = confusion_matrix(test.target, predictedlabels)
fig = plt.figure(figsize=(12, 12))
# heatmap of the confusion matrix (generalized to multiclass classification)
sns.heatmap(mat.T, square=True, annot=True, fmt='d', cbar=False,
            xticklabels=train.target_names, yticklabels=train.target_names)

plt.xlabel('true label')
plt.ylabel('predicted label');
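
For the Reuters task itself, here is a minimal end-to-end sketch along the same lines, but with raw word-count histograms instead of TF-IDF, since the assignment asks for counts (num_words=10_000 and skip_top=20 are example choices):

Code: Select all

import numpy as np
from tensorflow.keras.datasets import reuters
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

num_words = 10_000  # example value; also the number of histogram bins
(x_train, y_train), (x_test, y_test) = reuters.load_data(
    num_words=num_words, skip_top=20, test_split=0.2
)

def to_histograms(sequences, n_bins):
    # word-count histogram per document; int32 keeps the dense matrix small
    return np.array([np.bincount(seq, minlength=n_bins) for seq in sequences],
                    dtype=np.int32)

X_train = to_histograms(x_train, num_words)
X_test = to_histograms(x_test, num_words)

mnb = MultinomialNB()      # default alpha=1.0 Laplace smoothing
mnb.fit(X_train, y_train)  # integer labels 0..45, no one-hot needed
pred = mnb.predict(X_test)
print("accuracy:", accuracy_score(y_test, pred))

For a large num_words a scipy.sparse matrix would be more memory-friendly, but the dense version keeps the sketch simple.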