Assignment #2

Association Rules and Decision Trees

 

 

Requirements:

Two algorithms are to be implemented: Association Rules and Decision Trees. For this assignment, you may assume that you have enough memory to hold the data. Assignment #3 will remove that assumption, but for now it is acceptable.

Part I.   Association Rules

Implement a single-dimensional Boolean association rule algorithm as described in Chapter 6.

There are a total of 100 items, which are listed in the “Products.txt” file. The input file, “InputData.txt”, will contain records. Each record corresponds to a “market basket” (one transaction). A transaction may have one or more items, which are tab delimited. Each transaction has fewer than 10 items in it. A typical dataset would be of the form:

File:

5          6          9          13       

6          11        14       

This means that the first basket has four items (5, 6, 9, and 13). If you look in the products file, you will see that these correspond to “Kraft Macaroni & Cheese”, “Mazola Corn Oil”, “Pillsbury Biscuits”, and “Yo-Ho Potato Chips”. The second basket has three items in it (6, 11, and 14).
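As a hint, reading the baskets can be sketched as follows. This is only a minimal illustration, not a required design: the function name `parse_baskets` is made up for this example, and splitting on any whitespace (rather than strictly on tabs) is an assumption made so that trailing tabs and blank lines are tolerated.

```python
from typing import List


def parse_baskets(text: str) -> List[List[int]]:
    """Turn transaction text (one basket per line) into lists of item numbers."""
    baskets = []
    for line in text.splitlines():
        # split() with no argument splits on tabs and spaces and ignores
        # leading/trailing whitespace (assumption: item numbers contain no spaces).
        fields = line.split()
        if fields:  # skip blank lines
            baskets.append([int(field) for field in fields])
    return baskets


# The two sample baskets shown above, with tabs between the items.
sample = "5\t6\t9\t13\t\n\n6\t11\t14\t\n"
print(parse_baskets(sample))  # [[5, 6, 9, 13], [6, 11, 14]]
```

To read the real input, the contents of “InputData.txt” would be passed in, for example `parse_baskets(open("InputData.txt").read())`.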

First, sort the itemsets by support and take only the top 10. As you build your frequent itemsets, eliminate every itemset that does not have an itemcount of at least 5. Once you have your top ten itemsets, generate only the ten rules with the highest confidence.
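One possible reading of these steps is sketched below. It is a hedged illustration, not the required solution: it assumes a level-wise (Apriori-style) search, takes the itemcount of an itemset to be the number of baskets that contain it, and computes the confidence of a rule A => B as count(A and B together) / count(A). The names `frequent_itemsets`, `top_itemsets`, and `top_rules` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[int]
Rule = Tuple[List[int], List[int], float]  # (antecedent, consequent, confidence)


def frequent_itemsets(baskets: List[List[int]], min_count: int) -> Dict[Itemset, int]:
    """Level-wise search for every itemset found in at least min_count baskets."""
    basket_sets = [frozenset(basket) for basket in baskets]
    counts: Dict[Itemset, int] = {}
    for basket in basket_sets:  # level 1: count single items
        for item in basket:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: n for s, n in counts.items() if n >= min_count}
    frequent = dict(current)
    size = 2
    while current:
        # Join step: merge pairs of frequent (size - 1)-itemsets into size-itemsets.
        candidates = {a | b for a, b in combinations(current, 2) if len(a | b) == size}
        # Prune step: every (size - 1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, size - 1))
        }
        current = {}
        for candidate in candidates:
            n = sum(1 for basket in basket_sets if candidate <= basket)
            if n >= min_count:
                current[candidate] = n
        frequent.update(current)
        size += 1
    return frequent


def top_itemsets(frequent: Dict[Itemset, int], k: int) -> List[Tuple[List[int], int]]:
    """Return the k itemsets with the highest itemcount (ties broken by item numbers)."""
    ranked = sorted(frequent.items(), key=lambda pair: (-pair[1], sorted(pair[0])))
    return [(sorted(itemset), n) for itemset, n in ranked[:k]]


def top_rules(frequent: Dict[Itemset, int], k: int) -> List[Rule]:
    """Return the k rules with the highest confidence."""
    rules: List[Rule] = []
    for itemset, count in frequent.items():
        for r in range(1, len(itemset)):  # itemsets of size 1 yield no rules
            for lhs in combinations(sorted(itemset), r):
                antecedent = frozenset(lhs)
                # Every subset of a frequent itemset is frequent, so the lookup succeeds.
                confidence = count / frequent[antecedent]
                rules.append((sorted(antecedent), sorted(itemset - antecedent), confidence))
    rules.sort(key=lambda rule: (-rule[2], rule[0], rule[1]))
    return rules[:k]
```

For this assignment the calls would look like `frequent_itemsets(baskets, 5)`, then `top_itemsets(frequent, 10)` and `top_rules(frequent, 10)`. Whether single items count among the "top ten itemsets" is a design decision you should state in your design document.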

Part II.  Decision Trees

Implement a decision tree algorithm as described in Section 7.3.  Output the decision tree in a readable fashion that makes it clear what is in each node of the decision tree. 
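Section 7.3 is not reproduced in this handout, so the following sketch assumes that the split criterion is information gain (the entropy-based measure used by ID3). If the text prescribes a different measure, use that instead. The function names and the row layout (a tuple of already discretized attribute values per record) are assumptions made for illustration.

```python
import math
from collections import Counter
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def information_gain(rows: Sequence[Sequence[Hashable]],
                     labels: Sequence[Hashable],
                     attribute_index: int) -> float:
    """Reduction in entropy obtained by splitting on one attribute."""
    total = len(labels)
    # Group the class labels by the value the attribute takes in each row.
    partitions: Dict[Hashable, List[Hashable]] = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute_index], []).append(label)
    # Weighted average entropy of the partitions after the split.
    remainder = sum(len(part) / total * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder
```

At each node, a tree builder under this assumption would pick the attribute with the largest `information_gain` and recurse on each partition. Continuous measurements would first need to be grouped into ranges.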

Also, show how accurate your decision tree is.  Do this by testing your decision tree.  Build your decision tree on the first 50% of the data records and test your decision tree on the second half.  For this, your dataset will be the same dataset as used in Assignment #1. 
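The split itself can be sketched in a few lines. The function name `split_in_half` is hypothetical, and sending the extra record to the test half when the record count is odd is an assumption; the handout does not say which half should receive it.

```python
from typing import List, Sequence, Tuple, TypeVar

Record = TypeVar("Record")


def split_in_half(records: Sequence[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (training, test): the first half of the records and the rest."""
    midpoint = len(records) // 2  # with an odd count, the test half is one larger
    return list(records[:midpoint]), list(records[midpoint:])


training, test = split_in_half([1, 2, 3, 4, 5])
print(training, test)  # [1, 2] [3, 4, 5]
```

The records are kept in file order, so the tree is built on the first half exactly as the file presents it.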

Use any variables you would like and use any smoothing or generalization in order to improve your results. Your goal is to classify an abalone as “Male” or “Female”.

 

Deliverables:

I. Summary (10%)

Start with your project status. State clearly what is working and what is not working. Make sure you include the status of association rules and the status of decision trees. Next, write a careful summary of the problems you encountered and the things you wish you had been told before being given the assignment. It is suggested that you keep a log of events that occur while working on the assignment so that you will have useful information for your summary. Do not simply list the problems; discuss the largest problems you encountered and what you did to fix them. Make sure this summary is extremely well written and very easy to read. A typical summary will be one to three pages in length.

II. Design Document (15%)

The design document should be written prior to the coding. Clearly describe all the objects you will create and all the methods. Clearly describe how they will interact with each other. Do not simply list objects and messages. The idea is that your design document should be so detailed that a programmer can take it and develop the code solely from the design document.

III. Source Code (5%)

Give a listing of your source code. It is strongly recommended that you use some form of source code utility that will print your listing in a very clear fashion. Make sure your code is VERY well documented. In general, indent properly, use good variable names, use nouns for object names and verbs for methods. An object called "do_sorting" is really not a meaningful object. Your methods should be short -- usually if a method has more than 15-20 lines of code it is trying to do too many things. Document each object and method thoroughly. Use the same names for objects and methods as given in your design document.

IV. Output (70%)

For the association rules, list all of your top ten itemsets with their associated itemcount. Then list the top ten rules, that is, the ten rules with the highest confidence as required in Part I.

For decision trees, output the decision tree in an easy-to-read and understandable fashion. Also output the accuracy of the decision tree. Your output must include the following statistics:

·        Total Classified Correct:

·        % Classified Correct:

·        Total Classified Incorrect:

·        % Classified Incorrect:

·        Total Males: 

·        Total Females:

·        Total Males Correct:

·        % Males Correct:

·        Total Males Incorrect:

·        % Males Incorrect:

·        Total Females Correct:

·        % Females Correct:

·        Total Females Incorrect:

·        % Females Incorrect:
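The statistics above can be computed from two parallel lists, one with the actual class of each test record and one with the class predicted by the tree. The sketch below is only an illustration: the function names and the label values "M" and "F" are assumptions, so substitute whatever encoding your dataset uses.

```python
from typing import Dict, Sequence


def percent(part: int, whole: int) -> float:
    """Percentage of part in whole; 0.0 when whole is zero."""
    return 100.0 * part / whole if whole else 0.0


def accuracy_report(actual: Sequence[str], predicted: Sequence[str]) -> Dict[str, float]:
    """Build the required statistics, keyed by the labels used in the list above."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    total = len(actual)
    correct = sum(1 for a, p in zip(actual, predicted) if a == p)
    # Assumption: males are labelled "M" and females "F" in the data.
    males = sum(1 for a in actual if a == "M")
    females = sum(1 for a in actual if a == "F")
    report: Dict[str, float] = {
        "Total Classified Correct": correct,
        "% Classified Correct": percent(correct, total),
        "Total Classified Incorrect": total - correct,
        "% Classified Incorrect": percent(total - correct, total),
        "Total Males": males,
        "Total Females": females,
    }
    for label, name, class_total in (("M", "Males", males), ("F", "Females", females)):
        class_correct = sum(1 for a, p in zip(actual, predicted) if a == label and p == label)
        report[f"Total {name} Correct"] = class_correct
        report[f"% {name} Correct"] = percent(class_correct, class_total)
        report[f"Total {name} Incorrect"] = class_total - class_correct
        report[f"% {name} Incorrect"] = percent(class_total - class_correct, class_total)
    return report


def print_report(report: Dict[str, float]) -> None:
    """Print one statistic per line, in the order required above."""
    for name, value in report.items():
        print(f"{name}: {value:.1f}" if name.startswith("%") else f"{name}: {value}")
```

Here the per-class percentages are taken relative to the number of records of that class, for example % Males Correct = Total Males Correct / Total Males. State in your summary if you use a different denominator.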

Submission Details:

 

Please submit your assignment in an envelope with each section clearly labeled.  Also, e-mail a zip file to the TA that includes all your source code with a README file that gives clear directions on how to run your source code. The file name should be A2_nnnn.zip (where nnnn is your name).