As with every assignment in this course, a design document, output,
source code listing, and detailed summary of your work are
required. It is important that these be labeled clearly. If
you are taking the class on campus, you are required to submit
these assignments in clearly labeled envelopes.
I. Design Document (10%)
The design document should be written prior to coding. Clearly
describe all classes you will create. This requires a listing
of all attributes, methods, and algorithms that will be used.
Furthermore, you should clearly relate how your objects will
interact with each other. There should not be any code
in your design document. The goal of your design document
should be for any programmer to be able to implement the project
in a language of their choosing using only your design document
as a reference.
II. Source Code (10%)
Give a listing of your source code. Make use of some form
of source code utility that will print your listing in a very
clear fashion. We have been lax in taking points off for badly
formatted source code thus far, but will start to do so. Line
numbers by themselves do not make for legible source
printouts --- make sure the utility you use highlights reserved
words, comments, etc. Additionally, documentation is very
important.
III. Output (70%)
Using the vector space model cosine measure without relevance
feedback, run your indexer on the same document collection
used in assignment #2 (all the documents in the files available
here). Run your indexer five separate times, once for each of
the following values of M: 1MB, 2MB, 4MB, 8MB, and nMB (where
n = the largest amount of memory available on your machine).
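For reference, the cosine measure can be sketched as follows. This is a minimal illustration, not a required implementation: it assumes each query and document is already a sparse vector stored as a dict mapping terms to weights, and the function name is hypothetical. How the weights are computed (for example tf-idf) is left to your design.

```python
import math


def cosine_similarity(query_vec, doc_vec):
    """Cosine of the angle between two sparse term-weight vectors (dicts)."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    # An empty vector has no direction; treat its similarity as zero.
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0
    return dot / (q_norm * d_norm)
```

Ranking a query then amounts to scoring every candidate document with this function and sorting by descending score.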
After each indexing phase, you should also run all the queries
against the collection to obtain average precision using treceval.
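The number you report must come from treceval, but a small independent check can help catch ranking bugs early. The sketch below assumes uninterpolated average precision per query, averaged over all queries; the function names and the input layout (a ranked list of document ids plus the set of relevant ids) are illustrative assumptions.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Uninterpolated average precision for a single query."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where this relevant document appears.
            precision_sum += hits / rank
    # Relevant documents that were never retrieved contribute zero.
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """runs: list of (ranked_doc_ids, relevant_doc_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)
```

If this value and the treceval figure disagree noticeably, the difference is more likely in the sketch's assumptions or in your result-file format than in treceval.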
For each run, output:
1. The value of M specified
2. The maximum number of bytes of memory used during indexing (should not be bigger than M)
3. The size of your inverted index on disk (with and without compression)
4. Time taken to build the inverted index (with and without compression)
5. Time taken to run all queries (with and without compression)
6. Average precision as determined by treceval
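Item 2 implies that the indexer must stop growing its in-memory structures before they reach M. One possible approach, sketched below under stated assumptions, is to estimate the size of the in-memory postings and flush a sorted partial index whenever the next posting would exceed the budget. The class name, the 8-byte posting cost, and the in-memory `runs` list (standing in for files on disk) are all hypothetical; a real implementation would measure actual memory use and merge the flushed runs into the final index.

```python
class BoundedIndexer:
    """Hypothetical indexer that keeps estimated memory use under a budget."""

    POSTING_BYTES = 8  # assumed cost of one (doc_id, tf) posting

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.postings = {}   # term -> list of (doc_id, tf)
        self.used_bytes = 0  # estimated size of self.postings
        self.peak_bytes = 0  # candidate value for item 2 of the report
        self.runs = []       # flushed partial indexes (stand-in for disk)

    def add_document(self, doc_id, terms):
        counts = {}
        for term in terms:
            counts[term] = counts.get(term, 0) + 1
        for term, tf in counts.items():
            cost = self.POSTING_BYTES
            if term not in self.postings:
                cost += len(term.encode("utf-8"))
            # Flush before the estimate would exceed the budget.
            if self.postings and self.used_bytes + cost > self.budget_bytes:
                self.flush()
                # After a flush the term is new again, so pay for it in full.
                cost = self.POSTING_BYTES + len(term.encode("utf-8"))
            self.postings.setdefault(term, []).append((doc_id, tf))
            self.used_bytes += cost
            self.peak_bytes = max(self.peak_bytes, self.used_bytes)

    def flush(self):
        """Move the current postings into a sorted run and reset the estimate."""
        if self.postings:
            self.runs.append(sorted(self.postings.items()))
            self.postings = {}
            self.used_bytes = 0
```

For M = 1MB the budget would be `BoundedIndexer(1 * 1024 * 1024)`, assuming MB means 2^20 bytes. Note that a single posting larger than the whole budget would still be accepted by this sketch, so very small budgets need extra handling.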
It may be necessary (depending on your implementation) to index twice
per value of M to obtain the differences in the values
required for items 3-5 above. You should not submit separate
reports for these runs --- you need only submit separate reports
for discrete values of M.
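The assignment does not prescribe a compression scheme. Purely as an illustration of why the compressed and uncompressed figures differ, the sketch below applies variable-byte encoding to the gaps between sorted document ids. The choice of variable-byte coding, the convention that the high bit marks the last byte of each number, and all function names are assumptions for this example.

```python
def to_gaps(doc_ids):
    """Turn a sorted list of doc ids into differences between neighbors."""
    prev, gaps = 0, []
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def vbyte_encode(numbers):
    """Encode non-negative ints, 7 bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        chunk = [n & 0x7F]
        n >>= 7
        while n:
            chunk.append(n & 0x7F)
            n >>= 7
        chunk.reverse()    # most significant 7-bit group first
        chunk[-1] |= 0x80  # flag the final (low-order) byte
        out.extend(chunk)
    return bytes(out)


def vbyte_decode(data):
    """Inverse of vbyte_encode."""
    numbers, n = [], 0
    for byte in data:
        if byte & 0x80:
            numbers.append((n << 7) | (byte & 0x7F))
            n = 0
        else:
            n = (n << 7) | byte
    return numbers
```

Small gaps fit in one byte each, so a posting list of frequent terms usually shrinks well below a fixed four bytes per document id; the sizes and times for items 3-5 can then be taken from one run with encoding enabled and one without.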
Sample Report: You may view a sample report for a run using
M = 1MB here (your values will be different). Remember, you will
have to generate five such reports, for differing values of M.
IV. Summary (10%)
Write a careful summary of problems you encountered. Proofread
and spellcheck this summary! The first item in the
summary must include the status of your project, i.e., a listing
of what works and what doesn't work. Failing to mention a
non-working part of your assignment is grounds for massive
point deductions.
Also include in your summary things you wish you had been told
prior to being given the assignment. It is suggested that
you keep a log of events that occurred while working on the
assignment so that you will have useful information for your
summary. Do not simply list problems; rather, discuss the
main problems encountered and what you did to resolve them.
Make sure this summary is extremely well written and very
easy to read. A typical summary will range from one to three
pages in length.