Assignment #2
Association Rules and Decision Trees
Requirements:
Two algorithms are to be implemented: Association Rules and Decision Trees. For this assignment, it is acceptable to assume you have enough memory. For assignment #3 we will remove that restriction, but for now it is OK.
Part I. Association Rules
Implement a single-dimensional Boolean association rule algorithm as described in Chapter 6.
There are a total of 100 items, which are listed in the “Products.txt” file. The input file, “InputData.txt”, will contain records. Each record corresponds to a “market basket” transaction. A transaction may have one or more items, which are tab delimited. Each transaction has fewer than 10 items in it. A typical dataset would be of the form:
File:
5 6 9 13
6 11 14
This means that the first basket has four items (5, 6, 9, and 13). If you look in the products file, you will see that these correspond to “Kraft Macaroni & Cheese”, “Mazola Corn Oil”, “Pillsbury Biscuits”, and “Yo-Ho Potato Chips”. The second basket has three items in it (6, 11, and 14).
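The record format described above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_baskets is hypothetical, and it splits on any whitespace so that it accepts both tab-delimited input and the space-separated sample shown above.

```python
from typing import List, Set


def parse_baskets(text: str) -> List[Set[int]]:
    """Turn transaction lines into sets of item numbers (one set per basket)."""
    baskets = []
    for line in text.splitlines():
        fields = line.split()  # splits on tabs or spaces; blank lines give []
        if fields:
            baskets.append({int(field) for field in fields})
    return baskets


# The two sample baskets from the text, written here with tab delimiters.
sample = "5\t6\t9\t13\n6\t11\t14\n"
baskets = parse_baskets(sample)
```

Storing each basket as a set means an item is counted at most once per transaction, which is what a Boolean (present or absent) association rule algorithm needs. To read the real input, pass the contents of “InputData.txt” instead of the sample string.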
First sort itemsets by support and take only the top 10. As you build your frequent itemsets, eliminate all itemsets that do not have an itemcount of at least 5. Once you have your top ten itemsets, generate only the top ten rules with the highest confidence.
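One possible reading of these steps is sketched below in Python. It is not the required design: it assumes a level-wise (Apriori-style) search, assumes baskets are sets of item numbers, and assumes rules are generated from the ten most frequent itemsets that contain at least two items. The names frequent_itemsets and top_rules are hypothetical.

```python
from collections import Counter
from itertools import combinations

MIN_COUNT = 5  # minimum itemcount stated in the assignment
TOP_N = 10     # "top ten" itemsets and rules


def frequent_itemsets(baskets, min_count=MIN_COUNT):
    """Level-wise search; returns {frozenset of items: itemcount}."""
    counts = Counter(frozenset([item]) for basket in baskets for item in basket)
    current = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(current)
    k = 2
    while current:
        # Only items that survived the previous level can appear in a k-itemset.
        items = sorted(set().union(*current))
        counts = Counter()
        for basket in baskets:
            for combo in combinations(sorted(basket.intersection(items)), k):
                # Pruning: every (k-1)-subset must already be frequent.
                if all(frozenset(sub) in current for sub in combinations(combo, k - 1)):
                    counts[frozenset(combo)] += 1
        current = {s: c for s, c in counts.items() if c >= min_count}
        frequent.update(current)
        k += 1
    return frequent


def top_rules(frequent, top_n=TOP_N):
    """Rules (lhs, rhs, confidence) from the top_n multi-item itemsets."""
    ranked = sorted((s for s in frequent if len(s) > 1),
                    key=lambda s: (-frequent[s], sorted(s)))[:top_n]
    rules = []
    for itemset in ranked:
        for size in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), size):
                lhs = frozenset(lhs)
                # confidence(A -> B) = count(A and B together) / count(A)
                confidence = frequent[itemset] / frequent[lhs]
                rules.append((sorted(lhs), sorted(itemset - lhs), confidence))
    rules.sort(key=lambda rule: -rule[2])
    return rules[:top_n]
```

Because every subset of a frequent itemset is itself frequent, the count needed for the left-hand side of each rule is always already in the dictionary. If your reading of the requirement also counts single items among the top ten itemsets, adjust the ranking step accordingly.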
Part II. Decision Trees
Implement a decision tree algorithm as described in Section 7.3. Output the decision tree in a readable fashion that makes it clear what is in each node of the decision tree.
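As an illustration only, the Python sketch below builds a small tree using entropy and information gain with binary threshold splits on numeric attributes. Whether this matches the algorithm in Section 7.3 is an assumption, and every name in it (entropy, best_split, build_tree, classify, format_tree, max_depth) is hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, attribute index, threshold) for the best split, or None."""
    base, best = entropy(labels), None
    for attr in range(len(rows[0])):
        values = sorted(set(row[attr] for row in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2  # midpoint between neighbouring values
            left = [lab for row, lab in zip(rows, labels) if row[attr] <= threshold]
            right = [lab for row, lab in zip(rows, labels) if row[attr] > threshold]
            remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = base - remainder
            if best is None or gain > best[0]:
                best = (gain, attr, threshold)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """Internal nodes are dicts; leaves are the majority class label."""
    split = best_split(rows, labels) if depth < max_depth else None
    if split is None or split[0] <= 0:
        return Counter(labels).most_common(1)[0][0]
    _, attr, threshold = split
    left = [i for i, row in enumerate(rows) if row[attr] <= threshold]
    right = [i for i, row in enumerate(rows) if row[attr] > threshold]
    return {
        "attr": attr,
        "threshold": threshold,
        "left": build_tree([rows[i] for i in left], [labels[i] for i in left], depth + 1, max_depth),
        "right": build_tree([rows[i] for i in right], [labels[i] for i in right], depth + 1, max_depth),
    }


def classify(tree, row):
    """Follow the tests from the root until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree["left"] if row[tree["attr"]] <= tree["threshold"] else tree["right"]
    return tree


def format_tree(tree, indent=""):
    """Render the tree as indented text, one test or prediction per line."""
    if not isinstance(tree, dict):
        return f"{indent}predict {tree}\n"
    test = f"attribute[{tree['attr']}] <= {tree['threshold']}"
    return (f"{indent}if {test}:\n" + format_tree(tree["left"], indent + "  ")
            + f"{indent}else:\n" + format_tree(tree["right"], indent + "  "))
```

The depth limit is one simple form of generalization; the indented text from format_tree is one way to make each node's test and each leaf's prediction visible, and replacing attribute indexes with attribute names would make it more readable still.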
Also, show how accurate your decision tree is. Do this by testing your decision tree. Build your decision tree on the first 50% of the data records and test your decision tree on the second half. For this, your dataset will be the same dataset as used in Assignment #1.
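The train/test division can be as simple as slicing the record list in file order. In this sketch the function name split_first_half is hypothetical, and giving the extra record to the test half when the count is odd is an assumption, since the text does not say.

```python
def split_first_half(records):
    """Return (training, testing): the first 50% and the remaining records."""
    middle = len(records) // 2  # assumption: an odd leftover record goes to testing
    return records[:middle], records[middle:]


# Stand-in records; real code would pass the parsed dataset rows.
training, testing = split_first_half(list(range(10)))
```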
Use any variables you would like and use any smoothing or generalization in order to improve your results. Your goal is to classify an abalone as “Male” or “Female”.
Deliverables:
I. Summary (10%)
First, start with your project status. State clearly what is working and what is not working. Make sure you include the status of association rules and the status of decision trees. Next, write a careful summary of the problems you encountered and the things you wish you had been told prior to being given the assignment. It is suggested that you keep a log of events that occur while working on the assignment so that you will have useful information for your summary. Do not simply list the problems; talk about the largest problems encountered and what you did to fix them. Make sure this summary is extremely well written and very easy to read. A typical summary will be one to three pages in length.
II. Design Document (15%)
The design document should be written prior to the coding. Clearly describe all the objects you will create and all the methods. Clearly describe how they will interact with each other. Do not simply list objects and messages. The idea is that your design document should be so detailed that a programmer can take it and develop the code solely from the design document.
III. Source Code (5%)
Give a listing of your source code. It is strongly recommended that you use some form of source code utility that will print your listing in a very clear fashion. Make sure your code is VERY well documented. In general, indent properly, use good variable names, and use nouns for object names and verbs for methods. An object called "do_sorting" is really not a meaningful object. Your methods should be short -- usually if a method has more than 15-20 lines of code it is trying to do too many things. Document each object and method thoroughly. Use the same names for objects and methods as given in your design document.
IV. Output (70%)
For the association rules, indicate all of your top ten itemsets with their associated itemcount. Then indicate the top ten rules with the highest confidence.
For decision trees, output the decision tree in an easy-to-read and understandable fashion. Also output the accuracy of the decision tree. Your output must include the following statistics:
· Total Classified Correct:
· % Classified Correct:
· Total Classified Incorrect:
· % Classified Incorrect:
· Total Males:
· Total Females:
· Total Males Correct:
· % Males Correct:
· Total Males Incorrect:
· % Males Incorrect:
· Total Females Correct:
· % Females Correct:
· Total Females Incorrect:
· % Females Incorrect:
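The statistics listed above can all be derived from two parallel lists of actual and predicted labels. The Python sketch below is one way to do it, under stated assumptions: the labels are the strings "M" and "F", "Total Males" and "Total Females" count the actual classes in the test half, and the function name accuracy_report is hypothetical.

```python
def accuracy_report(actual, predicted):
    """Return the required statistics as an ordered list of (label, value)."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0  # avoid dividing by zero

    pairs = list(zip(actual, predicted))
    total = len(pairs)
    correct = sum(1 for a, p in pairs if a == p)
    report = [
        ("Total Classified Correct", correct),
        ("% Classified Correct", pct(correct, total)),
        ("Total Classified Incorrect", total - correct),
        ("% Classified Incorrect", pct(total - correct, total)),
    ]
    by_sex = {}
    for name, code in (("Males", "M"), ("Females", "F")):  # assumed label codes
        members = [(a, p) for a, p in pairs if a == code]
        by_sex[name] = (len(members), sum(1 for a, p in members if a == p))
    report += [("Total Males", by_sex["Males"][0]),
               ("Total Females", by_sex["Females"][0])]
    for name in ("Males", "Females"):
        count, right = by_sex[name]
        report += [
            (f"Total {name} Correct", right),
            (f"% {name} Correct", pct(right, count)),
            (f"Total {name} Incorrect", count - right),
            (f"% {name} Incorrect", pct(count - right, count)),
        ]
    return report


def print_report(report):
    """Print one 'label: value' line per statistic, in the order listed."""
    for label, value in report:
        shown = f"{value:.2f}" if isinstance(value, float) else str(value)
        print(f"{label}: {shown}")
```

Returning the statistics in the same order as the list makes it easy to compare the program's output against the requirements line by line.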
Submission Details:
Please submit your assignment in an envelope with each section clearly labeled. Also, e-mail a zip file to the TA that includes all your source code with a README file that gives clear directions on how to run your source code. The file name should be A2_nnnn.zip (where nnnn is your name).