CS 529 / Project / Part 3
Requirements

This is the third stage of our IR system. We no longer assume that the entire inverted index fits in memory. Start with your completed assignment #2 and implement one of the index-merging algorithms discussed in class. Accept a command-line parameter M that specifies how many bytes of memory are available for indexing. Make sure you use no more than M bytes during the indexing phase.
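For orientation only, here is a minimal sketch of one run-based merging approach. It is not necessarily the algorithm presented in class. It assumes postings arrive as (term, doc_id, tf) tuples, and the names write_run, read_run, and build_index are hypothetical. The byte count uses sys.getsizeof, which is a rough estimate, so a real submission needs stricter accounting to stay within M.

```python
import heapq
import os
import sys
import tempfile


def write_run(postings, run_dir, run_no):
    """Sort one in-memory batch of (term, doc_id, tf) tuples and write it as a run file."""
    path = os.path.join(run_dir, f"run{run_no}.txt")
    with open(path, "w", encoding="utf-8") as f:
        for term, doc_id, tf in sorted(postings):
            f.write(f"{term}\t{doc_id}\t{tf}\n")
    return path


def read_run(path):
    """Stream postings back from a run file one line at a time."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            term, doc_id, tf = line.rstrip("\n").split("\t")
            yield term, int(doc_id), int(tf)


def build_index(postings, budget_bytes, run_dir):
    """Return an iterator over all postings in (term, doc_id) order.

    Postings are buffered until the estimated size reaches budget_bytes,
    then flushed to disk as a sorted run. The runs are merged at the end.
    """
    runs, buffer, used = [], [], 0
    for posting in postings:
        buffer.append(posting)
        # Rough estimate only: counts the tuple, not the objects inside it.
        used += sys.getsizeof(posting)
        if used >= budget_bytes:
            runs.append(write_run(buffer, run_dir, len(runs)))
            buffer, used = [], 0
    if buffer:
        runs.append(write_run(buffer, run_dir, len(runs)))
    # heapq.merge keeps only one pending posting per run in memory.
    return heapq.merge(*(read_run(p) for p in runs))


if __name__ == "__main__":
    sample = [("b", 2, 1), ("a", 1, 3), ("c", 1, 1), ("a", 2, 1), ("b", 1, 2)]
    with tempfile.TemporaryDirectory() as run_dir:
        for posting in build_index(sample, 100, run_dir):
            print(posting)
```

The merged stream is sorted by term and then by document, so the final inverted file can be written in a single sequential pass.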

During query processing, again take a command line parameter M and use only M bytes of memory during this phase. You may assume you have sufficient memory for the array used to hold scores for the top 100 documents.
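One way to keep the top-100 structure small is a bounded min-heap. The sketch below assumes that final scores arrive one document at a time as (doc_id, score) pairs, which depends on how your query processor traverses the index. The name top_k is hypothetical.

```python
import heapq


def top_k(scored_docs, k=100):
    """Keep only the k best-scoring documents while streaming over candidates."""
    heap = []  # min-heap of (score, doc_id); heap[0] is the weakest kept entry
    for doc_id, score in scored_docs:
        if len(heap) < k:
            heapq.heappush(heap, (score, doc_id))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, doc_id))
    # Return the survivors in descending score order.
    return [(doc_id, score) for score, doc_id in sorted(heap, reverse=True)]


if __name__ == "__main__":
    print(top_k([(1, 0.2), (2, 0.9), (3, 0.5), (4, 0.7)], k=2))
```

The heap never holds more than k entries, so its memory use is fixed regardless of collection size.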

Note that your index should be compressed. The choice of compression method is up to you, but your grade will depend on the compression ratio you achieve.
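As one illustration of a compression method (not a required or recommended choice), the sketch below stores the gaps between sorted document IDs using variable-byte codes. All function names are hypothetical, and the byte layout (low-order bits first, high bit set on the final byte) is just one common convention.

```python
def to_gaps(doc_ids):
    """Convert a sorted list of document IDs into gaps between neighbors."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Rebuild the original document IDs from their gaps."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def vbyte_encode(numbers):
    """Encode non-negative integers using 7 data bits per byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)  # low 7 bits, more bytes follow
            n >>= 7
        out.append(n | 0x80)      # high bit marks the final byte
    return bytes(out)


def vbyte_decode(data):
    """Decode a byte string produced by vbyte_encode."""
    numbers, n, shift = [], 0, 0
    for byte in data:
        if byte & 0x80:
            numbers.append(n | ((byte & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= byte << shift
            shift += 7
    return numbers


if __name__ == "__main__":
    doc_ids = [3, 7, 300, 301, 70000]
    encoded = vbyte_encode(to_gaps(doc_ids))
    print(len(encoded), "bytes")
    print(from_gaps(vbyte_decode(encoded)))
```

Small gaps fit in a single byte, so frequent terms with dense postings lists benefit the most.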

Finally, run treceval on the results produced by your system to compute average precision.
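To sanity-check the numbers you get back, it can help to compute average precision by hand for a single query. The sketch below follows the usual textbook definition of non-interpolated average precision. The function name is hypothetical, and the figures reported by treceval remain the ones to submit.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average of the precision values at each rank where a relevant document appears.

    Relevant documents that are never retrieved contribute zero.
    """
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


if __name__ == "__main__":
    ranking = ["d1", "d2", "d3", "d4"]
    judged_relevant = {"d1", "d3", "d9"}
    print(average_precision(ranking, judged_relevant))
```

In this example, relevant documents appear at ranks 1 and 3, giving precisions of 1/1 and 2/3. Dividing their sum by the three relevant documents yields roughly 0.56.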

Deliverables

As with every assignment in this course, a design document, output, source code listing, and detailed summary of your work are required. It is important that these be labeled clearly. If you are taking the class on campus, you are required to submit these items in clearly labeled envelopes.

I. Design Document (10%)
The design document should be written prior to coding. Clearly describe all classes you will create. This requires a listing of all attributes, methods, and algorithms that will be used. Furthermore, you should clearly describe how your objects will interact with each other. There should not be any code in your design document. The goal of your design document should be for any programmer to be able to implement the project in a language of their choosing using only your design document as a reference.

II. Source Code (10%)
Give a listing of your source code. Make use of some form of source code utility that will print your listing in a very clear fashion. We have been lax in taking points off for badly formatted source code thus far, but will start to do so. Line numbers by themselves do not make for legible source printouts --- make sure the utility you use highlights reserved words, comments, etc. Additionally, documentation is very important.

III. Output (70%)
Using the vector space model cosine measure without relevance feedback, run your indexer on the same document collection used in assignment #2 (all the documents in the files available here). Run your indexer five separate times, once for each of the following values of M:

     1MB, 2MB, 4MB, 8MB, nMB (where n = the largest amount of memory available on your machine)

After each indexing phase, you should also run all the queries against the collection to obtain average precision using treceval.
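As a reminder of what the cosine measure computes, here is a minimal sketch for one query and one document. It assumes both are represented as dictionaries mapping terms to weights. The weighting scheme itself is whatever you used in assignment #2, and the function name is hypothetical.

```python
import math


def cosine(query_weights, doc_weights):
    """Cosine of the angle between two sparse term-weight vectors."""
    dot = sum(w * doc_weights.get(term, 0.0) for term, w in query_weights.items())
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_weights.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0  # an empty vector matches nothing
    return dot / (q_norm * d_norm)


if __name__ == "__main__":
    print(cosine({"index": 1.0}, {"index": 3.0, "merge": 4.0}))
```

Because the document norm does not depend on the query, it can be computed once at indexing time and stored, which avoids reading a whole document vector at query time.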

For each run, output:
1. The value of M specified
2. The maximum number of bytes of memory used during indexing (should not be bigger than M)
3. The size of your inverted index on disk (with and without compression)
4. Time taken to build the inverted index (with and without compression)
5. Time taken to run all queries (with and without compression)
6. Average precision as determined by treceval
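If your system is written in Python, items 2, 4, and 5 could be collected with a small wrapper like the sketch below. This assumes the standard tracemalloc module is an acceptable way to measure memory for your report. It counts only allocations made through Python's allocator, and tracing slows the program down, so you may prefer to take timings in a separate untraced run. The name measure is hypothetical.

```python
import time
import tracemalloc


def measure(func, *args, **kwargs):
    """Run func and return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


if __name__ == "__main__":
    _, seconds, peak = measure(bytearray, 1_000_000)
    print(f"elapsed: {seconds:.4f} s, peak: {peak} bytes")
```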

It may be necessary (depending on your implementation) to index twice per value of M to obtain the differences in the values required for items 3-5 above. You should not submit separate reports for these runs --- you need only submit separate reports for discrete values of M.

Sample Report: You may view a sample report for a run using M = 1MB here (your values will be different). Remember, you will have to generate five such reports, for differing values of M.

IV. Summary (10%)
Write a careful summary of problems you encountered. Proofread and spellcheck this summary! The first item in the summary must include the status of your project, i.e., a listing of what works and what doesn't work. Failing to mention a non-working part of your assignment is grounds for massive point deductions.

Also include in your summary things you wish you had been told prior to being given the assignment. It is suggested that you keep a log of events that occurred while working on the assignment so that you will have useful information for your summary. Do not simply list problems; rather, discuss the main problems encountered and what you did to resolve them. Make sure this summary is extremely well written and very easy to read. A typical summary will range from one to three pages in length.

Contact lee@iit.edu with any questions, comments, or suggestions.