Project #1

Most data mining projects spend a large amount of time on data preparation and cleanup. This project requires you to write a generic data cleaner-upper for use with data mining projects.

Input:

In this project, you are given two files. The first file is the data input file with space delimited columns. Character strings are delimited by double quotes. A single data record will be stored on a single line of the file. The first record will contain column headers for the columns in the file. The column headers will be space delimited. An example file might be:

sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings
M 455 365 95 51.4 224.5 101 150 15
F 53 420 135 37.7 256.5 141.5 210 9
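One way to read a file in this format is sketched below. It is a minimal illustration, not a required design: the names split_record and read_table are hypothetical, and it assumes that quoted strings never contain an embedded double quote.

```python
import re

# A token is either a double-quoted string (which may contain spaces)
# or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')

def split_record(line):
    """Split one space-delimited record, honoring double-quoted strings."""
    return [quoted if unquoted == "" else unquoted
            for quoted, unquoted in TOKEN.findall(line)]

def read_table(lines):
    """Return (header, rows); each row is a dict keyed by column name."""
    records = [split_record(line) for line in lines if line.strip()]
    header, data = records[0], records[1:]
    return header, [dict(zip(header, record)) for record in data]

header, rows = read_table([
    "sex length diameter height",
    "M 455 365 95",
    '"F" 53 420 135',
])
```

All fields come back as strings, so a numeric rule would still need to convert the column it works on (for example with float).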

The second file has the data preparation rules for the first file. The rules will refer to the column names in the header of the data file. Assume that at most one rule applies to any given column. The following rule types may appear in the file:

Rule Type 1: Generalize

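Syntax: Generalize <column name>: n1 = e1; ni = si, ei; nk = sk
    where "k" is the number of intervals,

<column name> is a valid column name, ni is the name of interval i, and si and ei are the start and end values of its range. The first interval gives only an end value and the last interval gives only a start value.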

        Example: Manual Generalize rings: Small = 10; Medium = 10,20; Large = 20.

        This indicates that all values for "rings" will be transformed into a new character string as Small, Medium or Large.

An arbitrary number of ranges may be defined, but a rule will always fit on a single record of the rules file. You may assume ranges will be defined in ascending order.
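The Generalize rule could be parsed and applied roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, intervals are treated as half-open (a value equal to a shared boundary goes to the higher interval), and the column name is taken to be the last word before the colon, so a leading word such as "Manual" is tolerated.

```python
import math

def parse_generalize(rule):
    """Parse a Generalize rule into (column, [(label, low, high), ...])."""
    head, body = rule.split(":", 1)
    column = head.split()[-1]                  # last word before the colon
    parts = [p.strip() for p in body.strip().rstrip(".").split(";") if p.strip()]
    intervals = []
    for i, part in enumerate(parts):
        label, bounds = (s.strip() for s in part.split("=", 1))
        values = [float(v) for v in bounds.split(",")]
        if len(values) == 2:                   # middle interval: start, end
            low, high = values
        elif i == 0:                           # first interval: end only
            low, high = -math.inf, values[0]
        else:                                  # last interval: start only
            low, high = values[0], math.inf
        intervals.append((label, low, high))
    return column, intervals

def generalize(value, intervals):
    """Return the label whose half-open range [low, high) contains value."""
    for label, low, high in intervals:
        if low <= value < high:
            return label
    return None                                # value falls in no interval

column, intervals = parse_generalize(
    "Generalize rings: Small = 10; Medium = 10,20; Large = 20.")
labels = [generalize(v, intervals) for v in (9, 10, 15, 20)]
```

Whether a boundary value such as 10 belongs to Small or Medium is not stated in the rule syntax, so that choice should be documented in the design document.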

Rule Type 2: Normalize (Generally applicable to numeric data only)

Syntax: Normalize <column name>: [minmax, zscore]

        Example: Normalize length    : minmax

                        Normalize diameter: zscore

This will use normalizations as given in equations 3.2 and 3.3 of the book.
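The book's equations are not reproduced here, so the sketch below uses the commonly used definitions of the two normalizations, which is an assumption: min-max scaling maps the column linearly onto a target range (taken here to be 0 to 1), and z-score subtracts the column mean and divides by the standard deviation (taken here to be the population standard deviation). The function names are hypothetical.

```python
import statistics

def minmax(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:                       # constant column: avoid dividing by zero
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def zscore(values):
    """Center values on the mean and scale by the population std deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:                       # constant column: every z-score is 0
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

scaled = minmax([10, 20, 30])
standardized = zscore([2, 4, 4, 4, 5, 5, 7, 9])
```

If the book defines the standard deviation with the sample (n - 1) form, statistics.stdev would be the corresponding call.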

Rule Type 3: Smooth

Syntax: Smooth <column name>: [bins = xx, < mean | minmax >]

            Example: Smooth rings: bins = 3, minmax

This will smooth a numeric column into bins of size xx using either mean (average) smoothing or minmax smoothing, as given in the first paragraph of section 3.2.2 of the book.
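A possible reading of the Smooth rule is sketched below, with several assumptions that should be checked against the book: the values are sorted and cut into consecutive bins of xx values each (the last bin may be shorter), "mean" replaces every value with its bin mean, and "minmax" replaces every value with whichever of the bin minimum or maximum is closer, with ties going to the minimum. The function name smooth and the sample values are illustrative only.

```python
def smooth(values, bin_size, method="mean"):
    """Smooth values by sorted bins of bin_size, keeping the original order."""
    if method not in ("mean", "minmax"):
        raise ValueError(f"unknown smoothing method: {method}")
    # Indices of the values in ascending order of value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [None] * len(values)
    for start in range(0, len(order), bin_size):
        idx = order[start:start + bin_size]
        bin_vals = [values[i] for i in idx]
        lo, hi = bin_vals[0], bin_vals[-1]     # bin minimum and maximum
        mean = sum(bin_vals) / len(bin_vals)
        for i in idx:
            if method == "mean":
                result[i] = mean
            else:                              # nearest bin boundary
                v = values[i]
                result[i] = lo if v - lo <= hi - v else hi
    return result

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]
by_mean = smooth(data, 3, "mean")
by_minmax = smooth(data, 3, "minmax")
```

Because the results are written back by original index, the smoothed column lines up with the other columns of each record even when the input is not sorted.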

Read the rule file first, then apply the rules to an input file and output a file with the rule transformations applied.
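Reading the rule file first suggests building a lookup from column name to rule before touching the data. The sketch below shows one way to do that; index_rules is a hypothetical name, and it assumes every rule has the form "<rule type> <column name>: <arguments>", with the rule type and column name being the last two words before the colon.

```python
def index_rules(lines):
    """Map each column name to its (rule_type, arguments) pair."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue                           # skip blank lines
        head, _, args = line.partition(":")
        words = head.split()
        if len(words) < 2:
            raise ValueError(f"cannot parse rule: {line!r}")
        rule_type, column = words[-2].lower(), words[-1]
        if column in rules:                    # only one rule per column
            raise ValueError(f"more than one rule for column {column!r}")
        rules[column] = (rule_type, args.strip())
    return rules

rules = index_rules([
    "Normalize length    : minmax",
    "Normalize diameter: zscore",
    "Smooth rings: bins = 3, minmax",
])
```

With this index in hand, the data file can be processed column by column, dispatching on the rule type stored for each column and leaving columns without a rule unchanged.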

Deliverables:

With every assignment in this course, a design document, output, source code listing, and detailed summary of your work on the assignment are required. It is important that these are labeled clearly. If you are taking the class on campus, you are required to submit these assignments in a clearly labeled envelope with each section carefully delimited.

I. Summary (10%)

First, start with your project status. State clearly what is working and what is not working. Next, write a careful summary of the problems you encountered and the things you wish you had been told before being given the assignment. It is suggested that you keep a log of events that occur while working on the assignment so that you will have useful information for your summary. Do not simply list the problems; talk about the largest problems encountered and what you did to fix them. Make sure this summary is extremely well written and very easy to read. A typical summary will be one to three pages in length.

II. Design Document (10%)

The design document should be written prior to the coding. Clearly describe all the objects you will create and all the methods. Clearly describe how they will interact with each other. Do not simply list objects and messages. The idea is that your design document should be so detailed that a programmer can take it and develop the code solely from the design document.

III. Source Code (5%)

Give a listing of your source code. It is strongly recommended that you use some form of source code utility that will print your listing in a very clear fashion. Make sure your code is VERY well documented. In general, indent properly, use good variable names, use nouns for object names and verbs for methods. An object called "do_sorting" is really not a meaningful object. Your methods should be short -- usually if a method has more than 15-20 lines of code it is trying to do too many things. Document each object and method thoroughly. Use the same names for objects and methods as given in your design document.

IV. Output (75%)

Give a listing of all output generated when you run your code on the specified input file.

Some additional values to output:

Preprocess data from an input file for us

Here are the input files: the Data Input File, the Rules File, and a Header file describing what the data is all about.