Most data mining projects spend a large amount of time on data preparation and
cleanup. This project requires you to write a generic data cleaner-upper for
use with data mining projects.
Input:
In this project, you are given two files. The first file is the data input
file with space delimited columns. Character strings are delimited by
double quotes. A single data record will be stored on a single line of the
file. The first record will contain column headers for the columns in the
file. The column headers will be space delimited. An example file might be:
sex length diameter height whole_weight shucked_weight viscera_weight shell_weight rings
M 455 365 95 51.4 224.5 101 150 15
F 53 420 135 37.7 256.5 141.5 210 9
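One way to read such a file is sketched below in Python. It assumes that Python's standard csv module, configured with a space delimiter and a double-quote quote character, matches the file format described above. The function name read_records and the cut-down sample data are hypothetical, chosen only for illustration.

```python
import csv
import io

def read_records(lines):
    """Parse space-delimited lines; double quotes delimit character strings."""
    reader = csv.reader(lines, delimiter=" ", quotechar='"',
                        skipinitialspace=True)
    rows = [row for row in reader if row]      # skip blank lines
    header, data_rows = rows[0], rows[1:]      # first record holds the headers
    # Pair each value with its column name; values stay strings at this stage.
    return header, [dict(zip(header, row)) for row in data_rows]

# Hypothetical three-column sample, not the real data file.
sample = io.StringIO(
    'sex length rings\n'
    '"M" 455 15\n'
    '"F" 53 9\n'
)
header, records = read_records(sample)
```

Values are left as strings here, so a numeric rule would need to convert its column with float() before use.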
The second file has the data preparation rules for the first file. The rules
will refer to the column names in the header of the data file. Assume only one
rule would be put on any given column. The following rule types may exist in
the file...
Rule Type 1: Generalize
Syntax:
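Generalize <column name>: n1 = e1; ...; ni = si, ei; ...; nk = sk
where k is the number of intervals, <column name> is a valid column name, ni is the name of the i-th interval, and si and ei are the start and end values of its range. The first interval gives only an end value and the last interval gives only a start value.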
Example: Manual Generalize rings: Small = 10; Medium = 10,20; Large = 20.
This indicates that all values for "rings" will be transformed into a new character string as Small, Medium or Large.
An arbitrary number of values may be assigned to a range, but they will fit on a
single record of the rules file. You may assume ranges will be defined in
ascending order.
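A minimal sketch of parsing and applying a Generalize rule follows. The assignment does not say which group a boundary value such as 10 belongs to, so the sketch assumes each range includes its start value and excludes its end value. The function names parse_generalize and generalize are hypothetical.

```python
def parse_generalize(rule):
    """Return (column, [(name, start, end), ...]); None marks an open end."""
    head, body = rule.split(":", 1)
    column = head.split()[-1]                  # last word before the colon
    parts = [p.strip() for p in body.strip().rstrip(".").split(";")]
    parts = [p for p in parts if p]
    intervals = []
    for index, part in enumerate(parts):
        name, bounds = (s.strip() for s in part.split("="))
        numbers = [float(b) for b in bounds.split(",")]
        if len(numbers) == 2:
            start, end = numbers               # middle interval: start, end
        elif index == 0:
            start, end = None, numbers[0]      # first interval: end value only
        else:
            start, end = numbers[0], None      # last interval: start value only
        intervals.append((name, start, end))
    return column, intervals

def generalize(value, intervals):
    """Map a number to an interval name (assumption: start <= value < end)."""
    for name, start, end in intervals:
        if (start is None or value >= start) and (end is None or value < end):
            return name
    return None                                # value falls in a gap

column, intervals = parse_generalize(
    "Generalize rings: Small = 10; Medium = 10,20; Large = 20.")
labels = [generalize(v, intervals) for v in (9, 10, 15, 20)]
```

With the boundary assumption above, 9 maps to Small, 10 and 15 to Medium, and 20 to Large. If the intended boundaries are different, only the comparison in generalize needs to change.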
Rule Type 2: Normalize (generally applicable to numeric data only)
Syntax:
Normalize <column name>: [minmax, zscore]
Example: Normalize length: minmax
Normalize diameter: zscore
This will use normalizations as given in equations 3.2 and 3.3 of the book.
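The two normalizations can be sketched as follows. The sketch assumes the book's equations are the usual min-max and z-score definitions, that min-max maps to the range 0 to 1 by default, and that the z-score uses the population standard deviation. Check each assumption against the book before relying on it. The function names are hypothetical.

```python
from statistics import mean, pstdev

def minmax_normalize(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so min maps to new_min and max to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:                               # constant column: avoid 0/0
        return [new_min for _ in values]
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def zscore_normalize(values):
    """Subtract the mean, then divide by the standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)                     # assumption: population std dev
    if sigma == 0:                             # constant column: avoid x/0
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

lengths = [53, 455]                            # the two lengths in the example
scaled = minmax_normalize(lengths)
standardized = zscore_normalize(lengths)
```

Both functions guard against a constant column, where the denominator would otherwise be zero.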
Rule Type 3: Smooth
Syntax: Smooth <column name>: [bins = xx, < mean | minmax >]
Example: Smooth rings: bins = 3, minmax
This will smooth a numeric column into bins of size xx using either average
smoothing or minmax smoothing as given in 3.2.2 (in the first paragraph).
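A sketch of both smoothing modes follows. It assumes that xx is the number of values per bin (equal-depth bins over the sorted column), that mean smoothing replaces each value with its bin average, and that minmax smoothing replaces each value with the closer of its bin's minimum and maximum. The book's section 3.2.2 is the authority if it differs. The function name smooth and the sample values are hypothetical.

```python
def smooth(values, bin_size, method="mean"):
    """Smooth by equal-depth bins, keeping the original record order."""
    # Indices of the records, ordered by value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    smoothed = list(values)
    for start in range(0, len(order), bin_size):
        bin_indices = order[start:start + bin_size]
        bin_values = [values[i] for i in bin_indices]   # ascending
        if method == "mean":
            average = sum(bin_values) / len(bin_values)
            for i in bin_indices:
                smoothed[i] = average
        else:                                  # "minmax": nearest bin boundary
            lo, hi = bin_values[0], bin_values[-1]
            for i in bin_indices:
                # Ties go to the lower boundary (an arbitrary choice).
                smoothed[i] = lo if values[i] - lo <= hi - values[i] else hi
    return smoothed

data = [4, 8, 15, 21, 21, 24, 25, 28, 34]      # illustrative values only
by_mean = smooth(data, 3, "mean")
by_minmax = smooth(data, 3, "minmax")
```

Sorting indices rather than values lets the smoothed column line up with the original records, so the other columns of each record are not disturbed.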
Read the rule file first, then apply the rules to the input file and output a
file with the rule transformations applied.
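Reading the rule file first could look like the sketch below. It assumes one rule per line, that the column name is the last word before the colon, and that the rule type is the word just before it. The function names parse_rule_line and load_rules are hypothetical, and the rule arguments are left as raw text for a rule-specific parser to handle.

```python
def parse_rule_line(line):
    """Split a rule line into (rule type, column name, argument text)."""
    head, _, argument = line.partition(":")
    words = head.split()
    if len(words) < 2:
        raise ValueError("not a rule line: " + line)
    column = words[-1]                         # last word before the colon
    kind = words[-2].lower()                   # word just before the column
    return kind, column, argument.strip()

def load_rules(lines):
    """Map each column name to its single rule, as the assignment assumes."""
    rules = {}
    for line in lines:
        if ":" not in line:
            continue                           # skip blank or non-rule lines
        kind, column, argument = parse_rule_line(line)
        rules[column] = (kind, argument)
    return rules

rules = load_rules([
    "Normalize length: minmax",
    "Normalize diameter: zscore",
    "Smooth rings: bins = 3, minmax",
])
```

Keying the rules by column name fits the stated assumption that only one rule is put on any given column.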
Deliverables:
With every assignment in this course, a design document, output, source code
listing, and detailed summary of your work on the assignment are required. It is
important that these are labeled clearly. If you are taking the class on
campus, you are required to submit the assignment in a clearly labeled
envelope with each section carefully delimited.
I. Summary (10%)
First, start with your project status. State clearly what is working and what is not
working. Next, write a careful summary of the problems you encountered and the
things you wish you had been told before being given the assignment.
It is suggested that you keep a log of events that occur while working on
the assignment so that you will have useful information for your summary. Do
not simply list the problems; talk about the largest problems encountered and
what you did to fix them. Make sure this summary is extremely well written and
very easy to read. A typical summary will be one to three pages in length.
II. Design Document (10%)
The design document should be written prior to the coding. Clearly describe all
the objects you will create and all the methods. Clearly describe how they
will interact with each other. Do not simply list objects and messages. The
idea is that your design document should be so detailed that a programmer can
take it and develop the code solely from the design document.
III. Source Code (5%)
Give a listing of your source code. It is strongly recommended that you use some form
of source code utility that will print your listing in a very clear fashion.
Make sure your code is VERY well documented. In general, indent
properly, use good variable names, and use nouns for object names and verbs for
methods. An object called "do_sorting" is really not a meaningful
object. Your methods should be short; usually, if a method has more than
15-20 lines of code, it is trying to do too many things. Document each
object and method thoroughly. Use the same names for objects and methods as
given in your design document.
IV. Output (75%)
Give a listing of all output generated when you run your code on the specified
input file.
Some additional values to output:
Number of records processed
For any Generalize rule, output the column name and the number of records in each group.
For any Normalize or Smooth rule, give the mean and standard deviation for the entire attribute.
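These additional values could be computed as sketched below. The assignment does not say whether the standard deviation is the sample or the population form, so the sketch assumes the population form. The function names and sample values are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def group_counts(labels):
    """Count the records that fell into each generalized group."""
    return Counter(labels)

def attribute_summary(values):
    """Return (mean, standard deviation) for an entire attribute."""
    return mean(values), pstdev(values)        # assumption: population std dev

labels = ["Small", "Medium", "Medium", "Large"]   # illustrative labels
counts = group_counts(labels)
records_processed = len(labels)
average, deviation = attribute_summary([2, 4, 4, 4, 5, 5, 7, 9])
```

Swapping pstdev for statistics.stdev gives the sample form if that is what the course expects.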
Preprocess data from an input file for us.
Here are the input files: the Data Input File, the Rules File, and a Header
file saying what the data is all about.