Assignment #3

Scalable Association Rules

 

Requirements:

For this assignment, you will implement a scalable association rule algorithm.  The algorithm used in assignment 2 requires repeated scans of the input data file, which is not feasible for large datasets.  Hence, the algorithm referred to as “partitioning” can be used instead (p. 237, with a flow diagram in figure 6.7).  In this algorithm, a block of data is first read into memory, and the local frequent itemsets for that block are computed.  Once every block has been processed, the local itemsets from all blocks are used as candidates, and the final global itemsets are obtained by counting those candidates against the full data set.  To test your program, you will define a user input, M, that sets how many records fit into main memory.  For instance, if M is set to 1000, you will read the first 1000 records and compute their local itemsets, then do the same for the next 1000, and so on.
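The two phases described above can be sketched as follows. This is only a minimal illustration, not a reference solution. It assumes one record per line with whitespace-separated items (the actual format of the input data set may differ), treats S as a fraction between 0 and 1, and treats the threshold as inclusive (support >= S). The function names and the open_data callable are hypothetical.

```python
from collections import Counter
from itertools import combinations, islice


def read_partitions(lines, m):
    """Yield lists of at most m transactions (assumed: one per line, items split on whitespace)."""
    it = iter(lines)
    while True:
        block = [frozenset(line.split()) for line in islice(it, m)]
        if not block:
            return
        yield block


def local_frequent_itemsets(partition, min_support):
    """Level-wise (Apriori-style) search over one in-memory partition."""
    n = len(partition)
    counts = Counter(frozenset([item]) for t in partition for item in t)
    current = {s for s, c in counts.items() if c / n >= min_support}
    frequent = set()
    k = 1
    while current:
        frequent |= current
        k += 1
        # Join: unions of two frequent (k-1)-itemsets that contain exactly k items.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in current for sub in combinations(c, k - 1))}
        counts = Counter(c for t in partition for c in candidates if c <= t)
        current = {c for c, cnt in counts.items() if cnt / n >= min_support}
    return frequent


def partition_frequent_itemsets(open_data, m, min_support):
    """open_data is a hypothetical callable returning a fresh iterable of lines per scan."""
    # Phase 1: the union of all local frequent itemsets forms the global candidate set.
    candidates = set()
    for block in read_partitions(open_data(), m):
        candidates |= local_frequent_itemsets(block, min_support)
    # Phase 2: one more scan counts every candidate against the full data set.
    counts = Counter()
    total = 0
    for block in read_partitions(open_data(), m):
        total += len(block)
        for t in block:
            counts.update(c for c in candidates if c <= t)
    return {c: counts[c] / total for c in candidates
            if counts[c] / total >= min_support}
```

The local phase is safe because an itemset that is frequent over the whole file must be frequent in at least one block, so it always ends up among the candidates. A call might look like partition_frequent_itemsets(lambda: open("data.txt"), 1000, 0.05), where the file name and support value are placeholders.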

 

Click here for the input data set.

 

 

Deliverables

I. Summary (10%)

First, start with your project status. State clearly what is working and what is not working.   Write a careful summary of the problems you encountered and the things you wish you had been told before being given the assignment. It is suggested that you keep a log of events that occur while you work on the assignment so that you will have useful information for your summary. Do not simply list the problems; discuss the largest problems you encountered and what you did to fix them. Make sure this summary is extremely well written and very easy to read.   A typical summary will be one to three pages in length.

II. Design Document (15%)

The design document should be written prior to the coding. Clearly describe all the objects you will create and all the methods. Clearly describe how they will interact with each other. Do not simply list objects and messages. The idea is that your design document should be so detailed that a programmer can take it and develop the code solely from the design document.

III. Source Code (5%)

Give a listing of your source code. Include a floppy disk with your source code and a README file on how to compile and execute your source code.  It is strongly recommended that you use some form of source code utility that will print your listing in a very clear fashion. Make sure your code is VERY well documented. In general, indent properly, use good variable names, and use nouns for object names and verbs for methods.


IV. Output (70%)

For the association rules, report the itemsets with a support higher than S.  Test values for support will be provided on the web site.  Your output should include only the global frequent itemsets found (there is no need to generate the actual rules this time).  You should also output the execution time, and include a table that shows the execution time for M=1000, 5000, and 10000.
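One possible way to collect the numbers for the timing table is sketched below. It assumes your program exposes a function that takes M and returns the global frequent itemsets. The name mine is a hypothetical stand-in for that function, and the column layout is only a suggestion.

```python
import time


def time_runs(mine, m_values=(1000, 5000, 10000)):
    """Run mine(m) once per value of M; return (m, seconds, itemset_count) rows."""
    rows = []
    for m in m_values:
        start = time.perf_counter()
        itemsets = mine(m)  # mine is a hypothetical callable supplied by your program
        elapsed = time.perf_counter() - start
        rows.append((m, elapsed, len(itemsets)))
    return rows


def format_table(rows):
    """Render the rows as a fixed-width plain-text table."""
    lines = [f"{'M':>8}  {'seconds':>10}  {'itemsets':>8}"]
    for m, seconds, count in rows:
        lines.append(f"{m:>8}  {seconds:>10.3f}  {count:>8}")
    return "\n".join(lines)
```

Timing each run with time.perf_counter measures wall-clock time, which includes the time spent reading the data file. A single run per value of M can be noisy, so repeating each run a few times and reporting the average may give a steadier table.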