Project # 2
Due Date: 26th December (noon)

You are required to analyze a real-world data mining problem using the techniques studied in this course. The problem can be related to banks, hospitals, cinemas, restaurants, superstores, academic institutions, etc. If you cannot find data on your own, you can pick any competition of interest from the following website:

http://www.kaggle.com/

The website lists several projects posted by real companies, which also pay a cash reward for the best solution to their problem. The project deadlines posted on Kaggle may be different, but you need to make sure that you submit your Project # 2 to me by 26th December.

In addition to submitting a comprehensive report describing your analysis, you will be required to give a detailed presentation (15-20 mins.) on the problem you chose, the nature of the data and how you cleaned/prepared it, and your findings. The presentations will be held on 26th and 29th December, but you are required to submit your report by noon on 26th December.

As was the case with Project 1, this is a group-based project (max 3 persons), but you may also do it alone if you prefer. The report will be submitted via Turnitin, and IBA's zero-tolerance policy towards plagiarism applies. Any two reports found to be similar will result in a straight F for both groups, and further action will be decided by the Examination department.



Project # 1
Due Date: 13th November

You are required to analyze the PAKDD 2009 competition data. It belongs to the credit risk domain, and your task is to identify a classification technique (along with a suitable set of features/attributes) that provides the best solution to the problem of identifying bad customers. You can read more about the project on its website:
http://sede.neurotech.com.br:443/PAKDD2009/

The data set (provided below) is a slight modification of the original data set. The training and testing data sets have 40,000 and 10,000 records respectively.

Training Data:

Testing Data:

Variable List:

You need to explain each and every step you took to clean, normalize, and discretize the data. How did you perform feature selection, and why did you decide to retain/remove certain features? If you discretized the data, why did you stick to a particular discretization method? If you did not, why didn't you feel the need to discretize it? How were missing values handled? And so on.
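
As an illustration of the kind of preparation steps you would be documenting, here is a minimal sketch in Python of median imputation for missing values followed by equal-width discretization. Python is not one of the tools named in this brief, and the column values, function names, and the choice of three bins are assumptions made for the example. They are not taken from the PAKDD 2009 data, and this is not a prescribed method.

```python
from statistics import median


def impute_median(values):
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]


def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width intervals."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no information; put everything in bin 0.
        return [0 for _ in values]
    width = (hi - lo) / n_bins
    # The maximum value would fall just outside the last interval, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical numeric column with two missing values.
ages = [23, 35, None, 51, 47, None, 62, 29]

ages_filled = impute_median(ages)
age_bins = equal_width_bins(ages_filled, 3)

print(ages_filled)
print(age_bins)
```

Whatever method you choose, the report should state why it was preferred over alternatives such as equal-frequency binning or dropping incomplete records.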

The evaluation of different models (with different combinations of features) will be based on the Area Under the Curve reported by the ROC Curve option available in KNIME. However, feel free to make use of any other available tools, be it MS Excel, Weka, SQL Server, R, etc. Just make sure that they are mentioned in your explanation.
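
To make clear what the Area Under the Curve measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen bad customer receives a higher score than a randomly chosen good customer, with ties counted as one half. The label encoding (1 = bad, 0 = good), the scores, and the function name are assumptions made for illustration. KNIME computes this value for you, so this is only a way to cross-check your understanding.

```python
def roc_auc(labels, scores):
    """AUC via pairwise comparison of positive and negative scores.

    labels: 1 for the positive class (assumed here to be bad customers), 0 otherwise.
    scores: model scores, where higher means more likely to be positive.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and scores for six customers.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.4, 0.35, 0.2, 0.6, 0.8]

auc = roc_auc(labels, scores)
print(round(auc, 4))
```

The double loop compares every positive with every negative, which is easy to follow but slow on large data. A rank-based implementation gives the same value more efficiently.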

In addition to the ROC Curve value, also make use of the F-measure for both bad and good customers in deciding the best model.
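
The F-measure for a class is the harmonic mean of that class's precision and recall, so it has to be computed once with "bad" as the target class and once with "good". The following is a minimal sketch in Python. The class labels, the example predictions, and the function name are hypothetical and only illustrate the calculation.

```python
def f_measure(actual, predicted, target):
    """F-measure (F1) for one class, treating `target` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == target and p == target)
    fp = sum(1 for a, p in pairs if a != target and p == target)
    fn = sum(1 for a, p in pairs if a == target and p != target)
    if tp == 0:
        # No correct predictions for this class: precision or recall is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical true classes and model predictions for eight customers.
actual    = ["bad", "good", "good", "bad", "good", "good", "bad", "good"]
predicted = ["bad", "good", "bad", "good", "good", "good", "bad", "good"]

f_bad = f_measure(actual, predicted, "bad")
f_good = f_measure(actual, predicted, "good")

print(round(f_bad, 4), round(f_good, 4))
```

Reporting both values matters because a model can score well on the majority class while performing poorly on the minority class.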

You will be submitting a report describing your analysis. A few sample research papers posted in the reading list section will make it clear how to present your findings. Also prepare a short presentation (max 5 mins.) that you will present on 14th November. The deadline for the report, however, is 13th November.

This is a group-based project (max 3 persons), but you may also do it alone if you prefer. The report will be submitted via Turnitin, and IBA's zero-tolerance policy towards plagiarism applies. Any two reports found to be similar will result in a straight F for both groups, and further action will be decided by the Examination department.