Hi,
I'm having a strange issue with the StringToWordVector filter (weka.filters.unsupervised.attribute) when preprocessing datasets.
No matter which dataset I use, when I complete the preprocessing and save the dataset, the first label of the class attribute is removed from the instances.
For example, using the SMS Spam Collection:
Code:
http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/
With this header:
Code:
@attribute Text string
@attribute class-att {ham,spam}
Taking two instances, before preprocessing:
Code:
'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',ham
'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv',spam
After preprocessing:
Code:
{82 1,380 1,423 1,501 1,504 1,557 1,668 1,703 1,736 1,873 1,919 1,945 1,987 1}
{12 1,137 1,253 1,557 1,748 1,769 1,894 1,974 1,1092 1,1160 1,1235 1,1259 1,1271 1,1435 1,1453 1,1522 1,1559 1,1595 1,1602 1,1716 1,1720 1,1756 1,1765 1,1772 1,1803 1,1832 spam}
I would expect both instances to carry their label in the last position ("ham" and "spam"), but the "ham" label is missing from the first instance; the same happens for every instance labeled "ham".
Instances labeled "spam" are not affected.
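The difference can be checked mechanically. Below is a minimal Python sketch, not part of Weka, that splits each saved row into index/value pairs and looks for the class attribute. The names `parse_sparse_row` and `CLASS_INDEX` are hypothetical, and the index 1832 is an assumption taken from the position where "spam" appears in the second row.

```python
def parse_sparse_row(row):
    """Split one '{index value,index value,...}' row into a dict.

    Simplified on purpose: assumes values contain no commas or quotes,
    which holds for the two rows shown in this post.
    """
    body = row.strip().lstrip("{").rstrip("}")
    pairs = {}
    for item in body.split(","):
        index, value = item.strip().split(" ", 1)
        pairs[int(index)] = value
    return pairs


# The two rows exactly as they were saved after preprocessing.
ham_row = "{82 1,380 1,423 1,501 1,504 1,557 1,668 1,703 1,736 1,873 1,919 1,945 1,987 1}"
spam_row = ("{12 1,137 1,253 1,557 1,748 1,769 1,894 1,974 1,1092 1,1160 1,1235 1,"
            "1259 1,1271 1,1435 1,1453 1,1522 1,1559 1,1595 1,1602 1,1716 1,1720 1,"
            "1756 1,1765 1,1772 1,1803 1,1832 spam}")

# Assumption: the class attribute sits at index 1832, where "spam" shows up.
CLASS_INDEX = 1832

ham = parse_sparse_row(ham_row)
spam = parse_sparse_row(spam_row)

print(CLASS_INDEX in ham)      # False: the first row has no entry for the class
print(spam.get(CLASS_INDEX))   # spam
```

The first row has no entry at all at the class index, while the second row stores "spam" there.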
Is this normal?