CS 529 / Project / Part 1
Requirements

This is the first stage of our IR system. An input set of documents will be provided. Run the Porter algorithm (posted on the web site in Java) on the documents. Modify the Porter algorithm to do the following:

For numeric or monetary terms, keep the entire term (e.g., 35, 35.66, $45.29)
For hyphenated terms, keep a term whole if one side of the hyphen is a single letter (a-z), such as "F-16"; separate all other hyphenated terms into two terms.
Keep dates in any of the following formats as a single term:
a. MM/DD/YYYY
b. MM-DD-YYYY
c. MM/DD/YY
d. MM-DD-YY
e. Month Name DD, YYYY (e.g., January 11, 1995)
Avoid parsing non-text content found in the documents, such as .tar files, other binaries, or images.
Only parse text between <TEXT> and </TEXT>
Do not include stop words (use the stop word list given in stopword.txt)
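
The term rules above can be sketched as a small set of regular expressions. This is only an illustration under stated assumptions, not part of the posted Porter code: the class name SpecialTerms and all method names are hypothetical, months and days are assumed to have one or two digits, and the hyphen rule is read as "keep the term whole if either side is a single letter". Because MM-DD-YYYY dates also contain hyphens, a tokenizer built this way would need to test for dates before applying the hyphen rule.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: recognizes terms that should bypass normal stemming. */
final class SpecialTerms {
    // Numeric or monetary terms such as 35, 35.66, $45.29.
    private static final Pattern NUMERIC = Pattern.compile("\\$?\\d+(\\.\\d+)?");

    // MM/DD/YYYY, MM-DD-YYYY, MM/DD/YY, MM-DD-YY; \1 forces both separators to match.
    // Assumption: one- or two-digit months and days are both accepted.
    private static final Pattern NUMERIC_DATE =
        Pattern.compile("\\d{1,2}([/-])\\d{1,2}\\1(\\d{4}|\\d{2})");

    // Month Name DD, YYYY, e.g. January 11, 1995 (spans several whitespace-separated tokens).
    private static final Pattern MONTH_DATE = Pattern.compile(
        "(January|February|March|April|May|June|July|August|September"
        + "|October|November|December)\\s+\\d{1,2},\\s+\\d{4}");

    // Everything between <TEXT> and </TEXT>; DOTALL lets '.' cross line breaks.
    private static final Pattern TEXT_REGION =
        Pattern.compile("<TEXT>(.*?)</TEXT>", Pattern.DOTALL);

    static boolean isNumeric(String term) {
        return NUMERIC.matcher(term).matches();
    }

    static boolean isNumericDate(String term) {
        return NUMERIC_DATE.matcher(term).matches();
    }

    static boolean isMonthDate(String phrase) {
        return MONTH_DATE.matcher(phrase).matches();
    }

    /** Keeps "F-16" whole (one side is a single letter); otherwise splits on hyphens. */
    static List<String> splitHyphenated(String term) {
        String[] parts = term.split("-");
        for (String part : parts) {
            if (part.matches("[A-Za-z]")) {
                return Arrays.asList(term);
            }
        }
        List<String> pieces = new ArrayList<>();
        for (String part : parts) {
            if (!part.isEmpty()) {
                pieces.add(part);
            }
        }
        return pieces;
    }

    /** Returns the contents of every <TEXT>...</TEXT> region in a document. */
    static List<String> textRegions(String document) {
        List<String> regions = new ArrayList<>();
        Matcher m = TEXT_REGION.matcher(document);
        while (m.find()) {
            regions.add(m.group(1));
        }
        return regions;
    }
}
```

Stop-word removal is not shown; one option would be to load stopword.txt into a set and discard any token found in it before stemming.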

Output the list of the top 50 stems found with the number of occurrences of each stem. Output the list sorted by stem and also output it sorted by document frequency.
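
One possible way to produce the two top-50 listings is sketched here. The class StemCounts and its methods are hypothetical names, and the sketch counts total occurrences of each stem; ranking by document frequency would instead require recording which documents each stem appears in. Ties in frequency are broken alphabetically, which is an assumption rather than a stated requirement.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical tally of how often each stem occurs. */
final class StemCounts {
    private final Map<String, Integer> counts = new HashMap<>();

    /** Records one occurrence of the given stem. */
    void add(String stem) {
        counts.merge(stem, 1, Integer::sum);
    }

    /** The n most frequent stems, most frequent first; ties broken alphabetically. */
    List<Map.Entry<String, Integer>> topByFrequency(int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    /** The same n stems, re-sorted alphabetically by stem. */
    List<Map.Entry<String, Integer>> topByStem(int n) {
        List<Map.Entry<String, Integer>> top = topByFrequency(n);
        top.sort((a, b) -> a.getKey().compareTo(b.getKey()));
        return top;
    }
}
```

Calling topByFrequency(50) and topByStem(50) after all documents have been processed would give the two required orderings of the same 50 stems.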

Also output the time taken to run the stemmer and the number of times each rule in the stemmer has fired. For example, if a rule exists to remove "ing" and the stemmer encounters "jumping" and stems it to "jump", increment the counter for the "ing" rule.
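
The counting and timing described above might look like the following sketch. RuleStats, applySuffixRule, timesFired, and timeMillis are hypothetical names, and the suffix rule shown is deliberately simplified: the real Porter rules also check conditions on the remaining stem, which are omitted here.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper showing rule counting and timing; not the posted Porter code. */
final class RuleStats {
    // Rule name (here, the suffix it removes) mapped to the number of times it fired.
    private final Map<String, Integer> fired = new TreeMap<>();

    /** Replaces the suffix if the word ends with it, and records that the rule fired. */
    String applySuffixRule(String word, String suffix, String replacement) {
        if (!word.endsWith(suffix)) {
            return word;
        }
        fired.merge(suffix, 1, Integer::sum);
        return word.substring(0, word.length() - suffix.length()) + replacement;
    }

    /** Number of times the rule for this suffix has fired so far. */
    int timesFired(String suffix) {
        return fired.getOrDefault(suffix, 0);
    }

    /** Runs the task and returns the elapsed wall-clock time in milliseconds. */
    static long timeMillis(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return (System.nanoTime() - start) / 1_000_000L;
    }
}
```

With this sketch, applySuffixRule("jumping", "ing", "") returns "jump" and increments the "ing" counter, while a word that does not end in "ing" is returned unchanged and leaves the counter alone.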

Deliverables

As with every assignment in this course, a design document, output, source code listing, and detailed summary of your work on the assignment are required. It is important that these are labeled clearly. If you are taking the class on campus, you are required to submit the assignment in a clearly labeled envelope with each section carefully delimited.

If you are taking the class at a remote site, simply send e-mail to the TA with four attachments. Each attachment should be given a name that indicates exactly what it is. Make sure you get an e-mail from the TA that indicates your assignment has been successfully received.

I. Design Document (10%)
The design document should be written prior to the coding. Clearly describe all the objects you will create and all the methods. Describe each parameter to each method and what it will be used for. Clearly describe all variables associated with each object. Do not simply list objects and messages. The idea is that your design document should be so detailed that a programmer can take it and develop the code solely from the design document.
II. Source Code (5%)
Give a listing of your source code. Use a source code listing utility to print your listing neatly with line numbers and emphasized keywords. If you are e-mailing the file, use the source code utility to generate a PDF file and e-mail the PDF file. Make sure your code is VERY well documented. Our grading standard will be: "Could a programmer who has been hired to fix your bugs figure out how to do this without having to look at the code?"
III. Output (75%)
Give a listing of all output. Submit the following listings:
a. All stems identified for a small one document test file
b. All stems identified for the larger sample file, followed by the top 50 stems listed in order of stem name and then the same list ordered by term frequency.
c. A list of the top 5 rules ordered by stem occurrence.
d. Time taken to run the stemmer.
e. Modify the stemmer so that it has only the top 5 rules, then reprint the list of top 50 stems (in alphabetical and term frequency order).
f. Time taken to run the stemmer with only the top 5 rules.
g. Modify the stemmer so that it has no special modifications for numbers, times, or dates.
h. Output the results of the stemmer.
IV. Summary (10%)
Write a careful summary of the problems you encountered and the things you wish you had been told prior to being given the assignment. It is suggested that you keep a log of events that occurred while working on the assignment so that you will have useful information for your summary. Feel free to indicate how you approached the assignment, as well as the time spent on it. Do not simply list the problems; talk about the largest problems encountered and what you did to fix them. Also, identify problems you encountered that you wish you had been told about in class. Make sure this summary is extremely well written and very easy to read. A typical summary will be one to three pages in length.
Assignment Submission
1. Design Document.
2. Complete source listing of your code.
3. Required output.
4. Summary.
Coding Conventions
An object-oriented approach is required. Objects should do as little as possible so that they may be reused by other parts of your project. Methods should be clearly labeled as well. Methods should, in general, be less than one page of code. If a method is larger than that, it should probably be partitioned or should call other functions to simplify its task. Object names should make sense and be designed with reuse in mind. An object called "DoStemmerForAssignment1" is really not so great. An object hierarchy with the Stemmer at the root and PorterStemmer as a subclass makes more sense.
Contact lee@iit.edu with any questions, comments, or suggestions.
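
The suggested hierarchy could be laid out roughly as follows. This is a minimal sketch and not a required design: the stem method signature and the StemmingPipeline class are assumptions introduced for illustration, and the PorterStemmer body is a stand-in that only lower-cases its input where the actual Porter steps would go.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Root of the hierarchy: any strategy that maps a word to its stem. */
abstract class Stemmer {
    abstract String stem(String word);
}

/** Placeholder subclass. Assumption: the real body would run the posted Porter steps. */
class PorterStemmer extends Stemmer {
    @Override
    String stem(String word) {
        // Stand-in logic only: lower-case the word so the sketch runs end to end.
        return word.toLowerCase(Locale.ROOT);
    }
}

/** Reusable client that depends only on the abstract Stemmer, not on Porter specifically. */
final class StemmingPipeline {
    private final Stemmer stemmer;

    StemmingPipeline(Stemmer stemmer) {
        this.stemmer = stemmer;
    }

    /** Stems every token in order and returns the resulting list. */
    List<String> stemAll(List<String> tokens) {
        List<String> stems = new ArrayList<>();
        for (String token : tokens) {
            stems.add(stemmer.stem(token));
        }
        return stems;
    }
}
```

Because StemmingPipeline sees only the Stemmer type, a later part of the project could swap in a different subclass without changing the pipeline code.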