CS 529 / Project / Part 4
Requirements

Now that you have a scalable search engine working, this assignment makes it better. Add a scalable implementation of phrase processing and relevance feedback, where scalable means that the data structures do not have to fit into main memory.

Phrases are to be identified as two-term sequences that do not cross stopwords or punctuation marks and that occur more than 25 times in the document collection.
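The phrase rule above can be sketched as follows. This is only an illustration under stated assumptions: the stopword list, the tokenizer, and the function names are hypothetical, and the in-memory Counter would have to be replaced by a disk-based structure to meet the scalability requirement.

```python
import re
from collections import Counter

# Hypothetical stopword list; the assignment does not specify one.
STOPWORDS = {"the", "of", "a", "an", "and", "in", "to", "is"}

# Group 1 matches a word; the second alternative matches one punctuation character.
TOKEN_RE = re.compile(r"(\w+)|[^\w\s]")

def candidate_phrases(text):
    """Yield two-term sequences that do not cross a stopword or punctuation mark."""
    prev = None
    for match in TOKEN_RE.finditer(text.lower()):
        word = match.group(1)
        if word is None or word in STOPWORDS:
            prev = None  # boundary: the next term starts a new run
            continue
        if prev is not None:
            yield (prev, word)
        prev = word

def frequent_phrases(documents, min_count=25):
    """Return phrases that occur more than min_count times in the collection."""
    counts = Counter()
    for text in documents:
        # Each document is scanned separately, so phrases never span documents.
        counts.update(candidate_phrases(text))
    return {phrase: n for phrase, n in counts.items() if n > min_count}
```

Note that the threshold test is strictly greater than 25, matching "more than 25 times".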

You will also need to implement a relevance feedback index which identifies, for a given document, terms that are found in the document. Use the following relevance feedback algorithm:

1) Using the Singhal pivoted document length normalization similarity measure, obtain an initial result set. We will assume that the top 10 documents from this result set are relevant and will refer to them in the remainder of the algorithm as the relevant set.

2) Sort the terms in each document in the relevant set in the following fashion.

First, for each term in a document in the relevant set, calculate a noise measure , where:

= number of terms in the collection
= number of occurrences of term in the collection
= number of occurrences of term in document i

Sort the terms in the document using the following formula:

= the noise measure
= the number of occurrences of term in the relevant set
= the number of documents in the relevant set that contain term

After sorting the terms this way for a particular document in the relevant set, add the top 20 terms to the query as relevance feedback terms. Obtain feedback terms from each document in the relevant set in the same way.

3) Use the Singhal pivoted document length measure to obtain a new result set using the new expanded query.
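The control flow of steps 1 to 3 can be sketched as below. The similarity measure and the term-sorting formula are deliberately passed in as functions rather than implemented, so nothing here should be read as the required formulas. All names (expand_query, search, doc_terms, term_score) are hypothetical. The sketch also assumes that a term selected from several documents is added to the query once per document, which the assignment does not state.

```python
def expand_query(query_terms, search, doc_terms, term_score,
                 num_relevant=10, terms_per_doc=20):
    """Return the original query plus feedback terms.

    search(query_terms)        -> list of document ids, best first
    doc_terms(doc_id)          -> dict of term -> count (relevance feedback index)
    term_score(term, doc_id, relevant_set) -> float, higher sorts first
    """
    # Step 1: the top-ranked documents are assumed to be relevant.
    relevant_set = search(query_terms)[:num_relevant]
    expanded = list(query_terms)
    # Step 2: sort each document's terms and keep the best ones.
    for doc_id in relevant_set:
        terms = doc_terms(doc_id)
        ranked = sorted(
            terms,
            key=lambda t: (-term_score(t, doc_id, relevant_set), t),
        )
        expanded.extend(ranked[:terms_per_doc])  # assumption: duplicates are kept
    return expanded

def feedback_search(query_terms, search, doc_terms, term_score):
    """Step 3: rerun the same search with the expanded query."""
    expanded = expand_query(query_terms, search, doc_terms, term_score)
    return search(expanded)
```

Ties in the term score are broken alphabetically here so that runs are repeatable; any deterministic tie-break would serve.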

Deliverables

As with every assignment in this course, a design document, output, source code listing, and detailed summary of your work are required. It is important that these be labeled clearly. If you are taking the class on campus, you are required to submit these items in clearly labeled envelopes.

I. Design Document (10%)
The design document should be written prior to coding. Clearly describe all classes you will create. This requires a listing of all attributes, methods, and algorithms that will be used. Furthermore, you should clearly relate how your objects will interact with each other. There should not be any code in your design document. The goal of your design document should be for any programmer to be able to implement the project in a language of his choosing using only your design document as a reference.

II. Source Code (10%)
Give a listing of your source code. Make use of some form of source code utility that will print your listing in a very clear fashion. We have been lax in taking points off for badly formatted source code thus far, but will start to do so. Line numbers by themselves do not make for legible source printouts --- make sure the utility you use highlights reserved words, comments, etc. Additionally, documentation is very important.

III. Output (70%)

Report 1
Set the amount of memory to be used during the indexing phase to 8MB. Output the following items:

1. The maximum number of bytes of memory used during indexing (should not be bigger than 8MB)
2. The size of your inverted index on disk
3. The size of your relevance feedback index on disk
4. The top 50 phrases (by term frequency) found in your index
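One common way to stay inside a fixed memory budget while counting terms or phrases is to spill sorted runs to disk and merge them afterwards. The sketch below illustrates that idea only. The function names are hypothetical, and it limits the number of distinct keys held in memory (max_entries) as a stand-in for a byte budget; mapping that limit to 8MB would require measuring actual memory use. Keys are assumed to contain no tab or newline characters.

```python
import heapq
import os
import tempfile

def spill(counts, run_dir):
    """Write one sorted run of 'key<TAB>count' lines and return its path."""
    fd, path = tempfile.mkstemp(dir=run_dir, suffix=".run", text=True)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        for key in sorted(counts):
            f.write(f"{key}\t{counts[key]}\n")
    return path

def read_run(path):
    """Stream (key, count) pairs from a run file in key order."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            key, count = line.rstrip("\n").split("\t")
            yield key, int(count)

def external_count(keys, run_dir, max_entries=100_000):
    """Yield (key, total) in key order, holding at most max_entries keys in memory."""
    counts, runs = {}, []
    for key in keys:
        counts[key] = counts.get(key, 0) + 1
        if len(counts) >= max_entries:
            runs.append(spill(counts, run_dir))
            counts = {}
    if counts:
        runs.append(spill(counts, run_dir))
    try:
        current, total = None, 0
        # heapq.merge streams the runs, so only one line per run is in memory.
        for key, count in heapq.merge(*(read_run(p) for p in runs)):
            if key != current:
                if current is not None:
                    yield current, total
                current, total = key, 0
            total += count
        if current is not None:
            yield current, total
    finally:
        for p in runs:
            os.remove(p)
```

Because the merged output is a stream, the top 50 entries by count can be selected without loading everything, for example with heapq.nlargest(50, external_count(keys, run_dir), key=lambda kv: kv[1]).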

Report 2
Conduct query processing runs using indices constructed with the following specifications:

1) terms-only, without relevance feedback
2) phrases-only, without relevance feedback
3) terms + phrases, without relevance feedback
4) terms + phrases, with relevance feedback

For each run, use treceval to calculate average precision. Generate a table of average precision following the template given here.

Use the qrels file provided here - we know that it contains documents that do not exist in the collection, but we want to have a standard qrels file so that results can be compared. For each query, obtain the top 100 documents for the query if at all possible.
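Assuming that "treceval" refers to the standard trec_eval tool, each result line it reads has six whitespace-separated fields: query id, the literal Q0, document number, rank, score, and a run tag. A minimal sketch of producing those lines for the top 100 documents follows; the function name format_run and the default run tag are hypothetical.

```python
def format_run(query_id, ranked, run_tag="myrun", limit=100):
    """Format results for one query.

    ranked is a list of (docno, score) pairs, best first. At most `limit`
    lines are returned; fewer if the query retrieved fewer documents.
    """
    lines = []
    for rank, (docno, score) in enumerate(ranked[:limit], start=1):
        lines.append(f"{query_id} Q0 {docno} {rank} {score:.4f} {run_tag}")
    return lines
```

Writing one such file per run (terms-only, phrases-only, and so on) keeps the four configurations separate when they are evaluated against the shared qrels file.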

IV. Summary (10%)
Write a careful summary of problems you encountered. Proofread and spellcheck this summary! The first item in the summary must include the status of your project, i.e., a listing of what works and what doesn't work. Failing to mention a non-working part of your assignment is grounds for massive point deductions.

Also include in your summary things you wish you had been told prior to being given the assignment. It is suggested that you keep a log of events that occurred while working on the assignment so that you will have useful information for your summary. Do not simply list problems; rather, talk about the main problems encountered and what you did to resolve them. Make sure this summary is extremely well written and very easy to read. A typical summary will range from one to three pages in length.

Submission Date
This assignment is to be handed in on Wednesday, December 5th at the beginning of class. Off-campus students are required to e-mail the project to the TA before 6:25PM (CST) on the due date. Submissions on November 28th will receive 20 points extra credit.
Contact lee@iit.edu with any questions, comments, or suggestions.