Channel: Pentaho Community Forums

Preprocessing: first label removed

Hi,
I'm having a strange issue with the StringToWordVector filter (package weka.filters.unsupervised.attribute) when preprocessing datasets.


No matter which dataset I use, when I complete the preprocessing and save the dataset, the first label of the class attribute is removed from the instances.


For example, using the SMS Spam Collection:
Code:

http://www.dt.fee.unicamp.br/~tiago/smsspamcollection/

With this header:


Code:

@attribute Text string
@attribute class-att {ham,spam}

Taking two instances, before preprocessing:


Code:

'Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',ham

'FreeMsg Hey there darling it\'s been 3 week\'s now and no word back! I\'d like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv',spam

After preprocessing:


Code:

{82 1,380 1,423 1,501 1,504 1,557 1,668 1,703 1,736 1,873 1,919 1,945 1,987 1}


{12 1,137 1,253 1,557 1,748 1,769 1,894 1,974 1,1092 1,1160 1,1235 1,1259 1,1271 1,1435 1,1453 1,1522 1,1559 1,1595 1,1602 1,1716 1,1720 1,1756 1,1765 1,1772 1,1803 1,1832 spam}


I would expect both instances to keep their label in the last position ("ham" and "spam"), but the "ham" label is missing from the first instance. The same happens for every instance labeled "ham".
Instances labeled "spam" are not affected.
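
To make the pattern concrete, here is a small Python sketch of one possible reading of the rows above. It is only a guess on my part, not something I have confirmed: it assumes the saved file uses a sparse row format in which any value stored internally as 0 is left out, and that the first nominal label ("ham") is stored as index 0. The helper names and the class index 1832 (taken from the "spam" row above) are hypothetical, and the naive comma split would not handle quoted string values.

```python
def parse_sparse_row(row):
    """Parse a sparse row such as '{3 1,7 spam}' into {index: raw_value}."""
    body = row.strip().lstrip("{").rstrip("}").strip()
    entries = {}
    if not body:
        return entries
    for pair in body.split(","):
        # Each entry is "<attribute index> <value>".
        index, value = pair.strip().split(None, 1)
        entries[int(index)] = value
    return entries


def class_label(row, class_index, labels):
    """Return the class label of a sparse row.

    Assumption: an attribute that is absent from the row has internal
    value 0, which for a nominal attribute means its first declared label.
    """
    entries = parse_sparse_row(row)
    return entries.get(class_index, labels[0])


labels = ["ham", "spam"]   # order from '@attribute class-att {ham,spam}'
CLASS_INDEX = 1832         # hypothetical, inferred from the "spam" row

ham_row = "{82 1,380 1,423 1,987 1}"        # shortened copy of the first row
spam_row = "{12 1,137 1,1803 1,1832 spam}"  # shortened copy of the second row

print(class_label(ham_row, CLASS_INDEX, labels))
print(class_label(spam_row, CLASS_INDEX, labels))
```

If that assumption holds, the "ham" label would not really be lost, only implied by its absence from the row.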


Is this normal?
