Hello all,
I have started a data mining research project with Weka (my first research project) and I am struggling with a problem.
First, my data set is imbalanced (class "yes" has about 96% of the instances and class "no" has 4%), so many algorithms are not useful. Naive Bayes gives an excellent result, or at least that is what I thought. I divided the data set into 2 parts:
the first one includes data from the years 2008 to 2012, and the other one includes data from 2013.
I applied the Naive Bayes algorithm to the first data set (years 2008-2012) using cross-validation and got good results.
I also applied Naive Bayes to the second data set (year 2013) using cross-validation and got good results.
The problem starts when I try to use the model learned from the first data set on the second data set: then I get terrible results.
At first I thought the problem was that the attribute distributions differ between the two data sets, but the means and standard deviations differ only by a few tenths of a percent. I now think the reason for the poor results is that the proportion of the minority class decreased over the years (from 6% to 4% across the whole period). What approach should I use to forecast the 2013 data set using the data from 2008-2012?
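To make the class-proportion hypothesis concrete, here is a minimal sketch in Python (not Weka) of how a predicted minority-class probability would move if only the class prior changed from 6% to 4%. The function name `adjust_posterior` is hypothetical, and the sketch assumes that the class-conditional attribute distributions stay the same between the two periods, which is only an assumption here.

```python
def adjust_posterior(p_minority, old_prior, new_prior):
    """Re-weight a predicted minority-class probability for a new class prior.

    Assumes only the class prior changed; the per-class attribute
    distributions are taken to be identical in both periods.
    """
    # Scale each class's posterior by (new prior / old prior), then renormalize.
    w_min = p_minority * new_prior / old_prior
    w_maj = (1.0 - p_minority) * (1.0 - new_prior) / (1.0 - old_prior)
    return w_min / (w_min + w_maj)


if __name__ == "__main__":
    # 0.06 and 0.04 are the minority-class proportions mentioned in the post.
    for p in (0.3, 0.5, 0.7):
        print(p, round(adjust_posterior(p, old_prior=0.06, new_prior=0.04), 3))
```

Comparing the printed values with the original probabilities gives a rough idea of how much of the drop in performance a prior change of this size could plausibly explain on its own.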
Thank you all, and sorry about my poor English.