| Requirements |
 |
|
This is the first stage of our IR system. An input set of documents
will be provided. Run the Porter algorithm (posted on the
web site in Java) on the documents. Modify the Porter algorithm
to do the following:
 |
For numeric or monetary terms, keep the entire term (e.g.,
35, 35.66, $45.29). |
 |
For hyphenated terms, keep terms that contain a single letter
(a-z), such as "F-16", as one term, but separate
all other cases into two terms. |
 |
Keep dates written in any of the following formats:
a. MM/DD/YYYY
b. MM-DD-YYYY
c. MM/DD/YY
d. MM-DD-YY
e. Month Name DD, YYYY (e.g., January 11, 1995) |
 |
Avoid parsing binary content found in the text, such as
executables, .tar files, or images. |
 |
Only parse text between <TEXT> and </TEXT>. |
 |
Do not include stop words (use the stop word list given in
stopword.txt).
|
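One possible way to apply the numeric, hyphen, and date rules before a token reaches the stemmer is sketched below in Java. It is only an illustration under stated assumptions: the class TokenRules and its methods are hypothetical (they are not part of the posted Porter code), months and days are assumed to be one or two digits, and "a single letter" is read as a single letter on one side of the hyphen. The "Month Name DD, YYYY" format spans several tokens and is not handled here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: the class and method names are assumptions,
// not part of the posted Porter stemmer.
final class TokenRules {
    // Plain or monetary numbers such as 35, 35.66, $45.29
    private static final Pattern NUMERIC =
        Pattern.compile("\\$?\\d+(\\.\\d+)?");

    // MM/DD/YYYY, MM-DD-YYYY, MM/DD/YY, MM-DD-YY.
    // The back-reference \1 forces the same separator in both positions.
    private static final Pattern NUMERIC_DATE =
        Pattern.compile("\\d{1,2}([/-])\\d{1,2}\\1(\\d{4}|\\d{2})");

    // Hyphenated term with a single letter on one side, e.g. F-16
    private static final Pattern SINGLE_LETTER_HYPHEN =
        Pattern.compile("(?i)[a-z]-[a-z0-9]+|[a-z0-9]+-[a-z]");

    static boolean isNumeric(String token) {
        return NUMERIC.matcher(token).matches();
    }

    static boolean isNumericDate(String token) {
        return NUMERIC_DATE.matcher(token).matches();
    }

    /** Returns the term or terms to keep for one whitespace-delimited token. */
    static List<String> split(String token) {
        List<String> terms = new ArrayList<>();
        if (isNumeric(token) || isNumericDate(token)
                || SINGLE_LETTER_HYPHEN.matcher(token).matches()) {
            terms.add(token); // kept whole; such terms would bypass the stemmer
        } else {
            for (String part : token.split("-")) {
                if (!part.isEmpty()) {
                    terms.add(part); // every other hyphenated term is separated
                }
            }
        }
        return terms;
    }
}
```

Under these assumptions, TokenRules.split("F-16") keeps the term whole, while TokenRules.split("well-known") yields the two terms "well" and "known".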
Output the list of the top 50 stems found, with the number of
occurrences of each stem. Output the list sorted by stem and also
output it sorted by document frequency.
Also output the time taken to run the stemmer and the number of
times each rule in the stemmer has fired (for example, if
a rule exists to remove "ing" and the stemmer encounters
"jumping" and stems it to "jump",
increment the counter for the "ing" rule).
|
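The bookkeeping for stem counts and rule counters could be kept in one small object, sketched below in Java. The class StemStats and its method names are hypothetical, and the counts are total occurrences of each stem; a true document frequency would additionally need the set of distinct stems seen in each document.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping class: the names are illustrative and are not
// taken from the posted stemmer.
final class StemStats {
    private final Map<String, Integer> stemCounts = new HashMap<>();
    private final Map<String, Integer> ruleCounts = new HashMap<>();

    /** Call once for every stem the stemmer produces. */
    void recordStem(String stem) {
        stemCounts.merge(stem, 1, Integer::sum);
    }

    /** Call once each time a rule fires, e.g. recordRule("ing"). */
    void recordRule(String rule) {
        ruleCounts.merge(rule, 1, Integer::sum);
    }

    int ruleCount(String rule) {
        return ruleCounts.getOrDefault(rule, 0);
    }

    /** The n most frequent stems, most frequent first; ties in stem order. */
    List<Map.Entry<String, Integer>> topByCount(int n) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(stemCounts.entrySet());
        entries.sort(
            Map.Entry.<String, Integer>comparingByValue(Comparator.reverseOrder())
                .thenComparing(Map.Entry.comparingByKey()));
        return new ArrayList<>(entries.subList(0, Math.min(n, entries.size())));
    }

    /** The same n stems, re-sorted alphabetically by stem. */
    List<Map.Entry<String, Integer>> topByStem(int n) {
        List<Map.Entry<String, Integer>> top = topByCount(n);
        top.sort(Map.Entry.comparingByKey());
        return top;
    }
}
```

In this sketch, the stemmer would call recordRule("ing") at the point where "jumping" becomes "jump", and the two required listings would come from topByStem(50) and topByCount(50).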
 |
| Deliverables |
 |
|
As with
every assignment in this course, a design document, output,
source code listing, and detailed summary of your work on
the assignment are required. It is important that these are labeled
clearly. If you are taking the class on campus, you are required
to submit the assignment in a clearly labeled envelope with
each section carefully delimited.
If you
are taking the class at a remote site, simply send e-mail to
the TA with four attachments. Each attachment should be given
a name that indicates exactly what it is. Make sure you get
an e-mail from the TA that indicates your assignment has been
successfully received.
| I. Design Document (10%) |
| The design document should be written prior to the coding.
Clearly describe all the objects you will create and all
the methods. Describe each parameter to each method and
what it will be used for. Clearly describe all variables
associated with each object. Do not simply list objects
and messages. The idea is that your design document should
be so detailed that a programmer can take it and develop
the code solely from the design document. |
 |
| II. Source Code (5%) |
| Give
a listing of your source code. Use a source code listing
utility to print your listing neatly with line numbers and
emphasized keywords. If you are e-mailing the file, use
the source code utility to generate a PDF file and e-mail
the PDF file. Make sure your code is VERY well documented.
Our grading standard will be: "Could a programmer
who has been hired to fix your bugs figure out how to
do this without having to look at the code?" |
 |
| III. Output (75%) |
Give a listing of all output. Submit the following listings:
a. All stems identified for a small one-document test file.
b. All stems identified for the larger sample file. Also list the
top 50 stems, first in order of stem name and
then ordered by term frequency.
c. A list of the top 5 rules ordered by stem occurrence.
d. Time taken to run the stemmer.
e. Now modify the stemmer so that it has only the top 5
rules, then reprint the list of top 50 stems (in
alphabetic and term frequency order).
f. Time taken to run the stemmer with only the top 5 rules.
g. Now modify the stemmer so it has no special modifications
for numbers, times, or dates.
h. Output the results of the stemmer.
|
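Items c through f could be supported by a small helper like the Java sketch below. RuleSelector, topRules, and timeMillis are hypothetical names, and the rule labels (such as "ing") are whatever your own stemmer uses for its counters; the sketch only picks the most frequently fired rules and measures the wall-clock time of a run.

```java
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical helper: the names are assumptions made for illustration.
final class RuleSelector {
    /** Names of the n most frequently fired rules, most frequent first. */
    static Set<String> topRules(Map<String, Integer> ruleCounts, int n) {
        return ruleCounts.entrySet().stream()
            .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                .thenComparing(Map.Entry.comparingByKey()))
            .limit(n)
            .map(Map.Entry::getKey)
            // LinkedHashSet preserves the sorted order of the rule names
            .collect(Collectors.toCollection(LinkedHashSet::new));
    }

    /** Wall-clock time of one run, in milliseconds. */
    static long timeMillis(Runnable run) {
        long start = System.nanoTime();
        run.run();
        return (System.nanoTime() - start) / 1_000_000;
    }
}
```

One way to use this is to have the stemmer consult the returned set before applying a rule, so that the full and the reduced stemmer share the same code and differ only in which rules are enabled. Both runs can then be timed with the same timeMillis call.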
 |
| IV. Summary (10%) |
| Write
a careful summary of problems you have encountered and things
you wish you had been told prior to being given the assignment.
It is suggested that you keep a log of events that occurred
while working on the assignment so that you will have
useful information for your summary. Feel free to indicate
how you approached the assignment, as well as time spent
on the assignment. Do not simply list the problems; talk
about the largest problems encountered and what you did
to fix them. Also, identify problems you encountered that
you wish you had been told about in class. Make sure this
summary is extremely well written and is very easy to
read. A typical summary will be one to three pages in
length. |
|
 |
| Assignment Submission |
 |
1. Design document
2. Complete source listing of your code
3. Required output
4. Summary |
 |
| Coding Conventions |
 |
| An object-oriented
approach is required. Objects should do as little as
possible so that these objects may be reused by other parts
of your projects. Methods should be clearly labeled as well.
Methods should, in general, be less than one page of code. If
a method is larger than that, it should probably be partitioned or
call other functions to simplify its tasks. Object names
should make sense and be designed with reuse in mind. An object
called "DoStemmerForAssignment1" is really not so
great. An object hierarchy with the Stemmer at the root and
PorterStemmer as a subclass makes more sense. |
|
|
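The suggested hierarchy might look roughly like the Java sketch below. Only the names Stemmer and PorterStemmer come from the text above; the method names stem and stemAll are assumptions, and the body of PorterStemmer.stem is a placeholder rather than the Porter algorithm.

```java
import java.util.ArrayList;
import java.util.List;

// Root of the hierarchy: code elsewhere in the project depends only on
// this type, so another stemmer can be swapped in later.
abstract class Stemmer {
    /** Returns the stem of a single term. */
    abstract String stem(String term);

    /** Reusable convenience method shared by every subclass. */
    List<String> stemAll(List<String> terms) {
        List<String> stems = new ArrayList<>();
        for (String term : terms) {
            stems.add(stem(term));
        }
        return stems;
    }
}

final class PorterStemmer extends Stemmer {
    @Override
    String stem(String term) {
        // Placeholder only: a real subclass would run the posted Porter
        // code here. Returning the term unchanged keeps the sketch runnable.
        return term;
    }
}
```

With this shape, the rest of the system can hold a Stemmer reference and never mention the assignment-specific subclass by name.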