I am facing the following challenge:


Does anyone have experience with this and would be willing to help me with the code below?
You can implement your algorithm using any library.
Pre-processing
- You will create feature vectors using the histogram (word count) technique
Method
- Multinomial naive Bayes classifier
Problem Statement
In this project your task is to design a naive Bayes classifier for sentiment classification. You will use
the Reuters newswire dataset provided by Keras. In the dataset there are 46 class labels. You should
construct your feature vectors by taking the histogram (word count) of the text. When you call the
dataset from the Keras library, you have several options to pass to the constructor.
Options that you need to play with are:
num_words: integer or None. Words are ranked by how often they occur (in the training set) and
only the num_words most frequent words are kept. Any less frequent words will appear as oov_char
value in the sequence data. If None, all words are kept. Defaults to None, so all words are kept.
skip_top: skip the top N most frequently occurring words (which may not be informative).
These words will appear as oov_char value in the dataset. Defaults to 0, so no words are skipped.
Remember that num_words will affect the number of bins in your dataset. You may set skip_top to
any small integer, since the most frequently used words are redundant, e.g. "the", "a", "an", "it".
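The pipeline described above (word-count histograms fed into a multinomial naive Bayes classifier) can be sketched end to end on a toy example. This is a minimal sketch, not the assignment's solution: scikit-learn's `MultinomialNB` stands in for "any library", and two tiny hand-made index sequences replace the Reuters data so the example runs without downloading anything; in the real task the sequences would come from `keras.datasets.reuters.load_data(num_words=..., skip_top=...)` and `vocab_size` would equal `num_words`.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Vocabulary size = number of histogram bins (num_words in the real task).
num_words = 6

def to_histogram(sequences, vocab_size):
    """One row per document: raw count of each word index (bag-of-words histogram)."""
    X = np.zeros((len(sequences), vocab_size), dtype=np.int64)
    for i, seq in enumerate(sequences):
        for idx in seq:
            X[i, idx] += 1
    return X

# Toy stand-in for the Reuters index sequences: class 0 uses words 1-3,
# class 1 uses words 4-5.
train_seqs = [[1, 2, 2, 3], [4, 5, 5, 5], [1, 3, 3], [4, 4, 5]]
y_train = np.array([0, 1, 0, 1])

X_train = to_histogram(train_seqs, num_words)
clf = MultinomialNB().fit(X_train, y_train)

# A document built from words 1-3 should land in class 0.
print(clf.predict(to_histogram([[2, 3, 1]], num_words)))
```

The same `to_histogram` step applied to the real `x_train` gives an `(n_documents, num_words)` count matrix, which is exactly the input shape `MultinomialNB.fit` expects alongside the integer labels.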
My code so far:
{
"cells": [
{
"cell_type": "code",
"execution_count": 2,
"id": "98fc9696",
"metadata": {},
"outputs": [],
"source": [
"from keras.datasets import reuters\n",
"\n",
"# load the Reuters newswire dataset (num_words / skip_top can be tuned later)\n",
"(x_train, y_train), (x_test, y_test) = reuters.load_data()"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "d4fa4a38",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"x_train : (8982,)\n",
"x_test : (2246,)\n"
]
}
],
"source": [
"print(\"x_train :\", x_train.shape)\n",
"print(\"x_test :\", x_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "91b495e9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape of y_train is : (8982,)\n",
"shape of y_test is : (2246,)\n"
]
}
],
"source": [
"print(\"shape of y_train is :\", y_train.shape)\n",
"print(\"shape of y_test is :\", y_test.shape)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "bdeeda6b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1, 27595, 28842, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]\n",
"3\n"
]
}
],
"source": [
"print(x_train[0])\n",
"print(y_train[0])"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "1c0edb8f",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"62"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"word_index = reuters.get_word_index()\n",
"word_index['oil']"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "4e646da8",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'oil'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# invert word_index: map each frequency rank back to its word\n",
"index_to_word = {} \n",
"for key, value in word_index.items():\n",
" index_to_word[value] = key\n",
" \n",
"index_to_word[62]  # index 62 maps back to 'oil' across the 11,228 news articles"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "c68f266b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"the wattie nondiscriminatory mln loss for plc said at only ended said commonwealth could 1 traders now april 0 a after said from 1985 and from foreign 000 april 0 prices its account year a but in this mln home an states earlier and rise and revs vs 000 its 16 vs 000 a but 3 psbr oils several and shareholders and dividend vs 000 its all 4 vs 000 1 mln agreed largely april 0 are 2 states will billion total and against 000 pct dlrs\n",
"3\n"
]
}
],
"source": [
"# check which words are contained in x_train\n",
"print(' '.join([index_to_word[x] for x in x_train[0]]))\n",
"print(y_train[0])  # label of sample 0 of x_train"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "6f324a7b",
"metadata": {},
"outputs": [],
"source": [
"from keras.preprocessing.text import Tokenizer\n",
"\n",
"max_words = 100\n",
"\n",
"tokenizer = Tokenizer(num_words=max_words)"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b198365d",
"metadata": {},
"outputs": [],
"source": [
"# vectorize the *features* (x), not the labels (y), so y_train/y_test survive;\n",
"# mode='count' yields the word-count histogram the assignment asks for\n",
"x_train = tokenizer.sequences_to_matrix(x_train, mode='count')\n",
"x_test = tokenizer.sequences_to_matrix(x_test, mode='count')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "d88cd974",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"shape of x_train is (8982,)\n",
"shape of x_test is (2246,)\n",
"data in training sample 1: [1, 27595, 28842, 8, 43, 10, 447, 5, 25, 207, 270, 5, 3095, 111, 16, 369, 186, 90, 67, 7, 89, 5, 19, 102, 6, 19, 124, 15, 90, 67, 84, 22, 482, 26, 7, 48, 4, 49, 8, 864, 39, 209, 154, 6, 151, 6, 83, 11, 15, 22, 155, 11, 15, 7, 48, 9, 4579, 1005, 504, 6, 258, 6, 272, 11, 15, 22, 134, 44, 11, 15, 16, 8, 197, 1245, 90, 67, 52, 29, 209, 30, 32, 132, 6, 109, 15, 17, 12]\n"
]
}
],
"source": [
"print(\"shape of x_train is \", x_train.shape)\n",
"print(\"shape of x_test is \",x_test.shape)\n",
"print(\"data in training sample 1: \", x_train[0])"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "e7075e43",
"metadata": {},
"outputs": [
{
"ename": "ImportError",
"evalue": "cannot import name 'to_categorical' from 'keras.utils' (C:\\Users\\xmcc\\Anaconda3\\lib\\site-packages\\keras\\utils\\__init__.py)",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mImportError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m~\\AppData\\Local\\Temp/ipykernel_14400/3408643968.py\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[1;32mfrom\u001b[0m \u001b[0mkeras\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mutils\u001b[0m \u001b[1;32mimport\u001b[0m \u001b[0mto_categorical\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0mnum_classes\u001b[0m \u001b[1;33m=\u001b[0m \u001b[1;36m46\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 3\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 4\u001b[0m \u001b[0my_train\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mto_categorical\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_train\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_classes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 5\u001b[0m \u001b[0my_test\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mto_categorical\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0my_test\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnum_classes\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;31mImportError\u001b[0m: cannot import name 'to_categorical' from 'keras.utils' (C:\\Users\\xmcc\\Anaconda3\\lib\\site-packages\\keras\\utils\\__init__.py)"
]
}
],
"source": [
"# this Keras version exposes to_categorical under np_utils, not keras.utils\n",
"from keras.utils.np_utils import to_categorical\n",
"num_classes = 46\n",
"\n",
"y_train = to_categorical(y_train, num_classes)\n",
"y_test = to_categorical(y_test, num_classes)\n",
"\n",
"print(y_train.shape)\n",
"print(y_train[0])"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7d2ee35",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.7"
}
},
"nbformat": 4,
"nbformat_minor": 5
}