CS 529 / Project / Part 2
Requirements

This is the second stage of our IR system. Augment your parser so that it accepts input documents and a list of queries and returns a ranked list of documents for each query. You may use the SimpleIR examples given in class, which already do this for the dot product measure. Incorporate the parser you developed in assignment 1.

Because there was some uncertainty about which document collection to use in project part 1, note again that the document collection consists of all documents in the files provided in the zip file posted here. Queries and treceval can be found on the main project page. Each query is delimited by a "top" element and is identified by its number. Use only the title part of each query for retrieving documents.
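
As a rough illustration of the query handling described above, the sketch below pulls the number and title out of each "top" element. It assumes the usual TREC topic layout (`<num>`, `<title>`, and `<desc>` markers without closing tags); the actual query file may differ, and the sample text and the name `parse_topics` are made up for the example.

```python
import re

# Made-up sample in an assumed TREC-style layout; not the real project queries.
SAMPLE_TOPICS = """
<top>
<num> Number: 351
<title> sample oil exploration
<desc> Description:
Longer description text that must not be used for retrieval.
</top>
<top>
<num> Number: 352
<title> another sample title
<desc> Description:
More description text.
</top>
"""

TOP_RE = re.compile(r"<top>(.*?)</top>", re.DOTALL)
NUM_RE = re.compile(r"<num>\s*(?:Number:)?\s*(\d+)")
# The title runs from the <title> marker up to the next tag (or end of block).
TITLE_RE = re.compile(r"<title>\s*(?:Topic:)?\s*(.*?)\s*(?=<|\Z)", re.DOTALL)


def parse_topics(text):
    """Return {query_number: title_text} for every <top> element in text."""
    queries = {}
    for block in TOP_RE.findall(text):
        num = NUM_RE.search(block)
        title = TITLE_RE.search(block)
        if num and title:
            # Collapse internal whitespace so multi-line titles become one line.
            queries[int(num.group(1))] = " ".join(title.group(1).split())
    return queries
```

The title text returned here would still need to go through the same tokenizing and stemming steps as the documents before it is matched against the index.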

Your project should use the following similarity measures:
a.  number of terms that match the document
b.  vector space model, dot product
c.  vector space model, cosine
d.  vector space model, pivoted document length normalization (5 point bonus)
e.  probabilistic model (Robertson, p.38 in textbook)
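
To make the differences between measures (a) through (d) concrete, here is a minimal sketch that scores one query against one document. It assumes both are already represented as sparse `{term: weight}` dictionaries; the weighting scheme, the function names, and the `pivot` and `slope` parameters are assumptions for illustration, not the SimpleIR code. Measure (e) is left out because its formula comes from the textbook.

```python
import math


def match_count(query_vec, doc_vec):
    """(a) Number of query terms that also occur in the document."""
    return sum(1 for term in query_vec if term in doc_vec)


def dot_product(query_vec, doc_vec):
    """(b) Sum of weight products over the terms the two vectors share."""
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)


def _length(vec):
    """Euclidean length of a sparse vector."""
    return math.sqrt(sum(w * w for w in vec.values()))


def cosine(query_vec, doc_vec):
    """(c) Dot product divided by the product of the two vector lengths."""
    denom = _length(query_vec) * _length(doc_vec)
    return dot_product(query_vec, doc_vec) / denom if denom else 0.0


def pivoted(query_vec, doc_vec, pivot, slope):
    """(d) Dot product with a pivoted document length normalization.

    Assumed form: the document length is replaced by
    (1 - slope) * pivot + slope * length, where pivot is typically the
    average document length in the collection. Both values must be
    supplied by the caller; none is suggested here.
    """
    denom = (1.0 - slope) * pivot + slope * _length(doc_vec)
    return dot_product(query_vec, doc_vec) / denom if denom else 0.0
```

With `slope` set to 1 the pivoted score reduces to the dot product divided by the document length alone, which is one quick sanity check when tuning it.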

Obtain average precision for your similarity measures using treceval. A description of the format and how to use treceval is given here.
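
The linked description is the authority on the format treceval expects. As a hedged sketch, the function below writes results in the common six-column TREC run layout (query number, the literal `Q0`, document ID, rank, score, run tag). The name `format_run`, the `limit` default, and the tie-breaking rule are assumptions for illustration.

```python
def format_run(query_id, scored_docs, run_tag, limit=1000):
    """Return result lines for one query, best score first.

    scored_docs maps document ID -> similarity score.
    limit is an assumed cap on how many documents are reported per query.
    """
    # Sort by descending score; break ties by document ID so output is stable.
    ranked = sorted(scored_docs.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]
    return [
        f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}"
        for rank, (doc_id, score) in enumerate(ranked, start=1)
    ]
```

Writing the lines for all queries of one run into a single file gives one input file per treceval report.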

Assume you have enough memory to store the entire inverted index (as does SimpleIR). For each query, take the top 20 documents returned by the similarity measure and collect all stems from those documents. Then sort the stems by their frequency in the top 20 documents, that is, by the number of those documents in which each stem occurs. Hence, a term that occurs in all 20 documents will have a frequency of 20. Now take the top 5 stems based on this frequency and do relevance feedback using these five stems.
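
The stem selection step above can be sketched as follows. The function and parameter names are hypothetical, ties are broken alphabetically (the assignment does not say how to break them), and simply appending the chosen stems to the query is only one possible way to apply the feedback.

```python
from collections import Counter


def top_feedback_stems(ranked_doc_ids, doc_stems, num_docs=20, num_stems=5):
    """Return [(stem, frequency)] for the most widespread stems.

    ranked_doc_ids: document IDs for one query, best first.
    doc_stems: maps document ID -> list of stems in that document.
    Frequency is the number of top documents containing the stem,
    so it can never exceed num_docs.
    """
    freq = Counter()
    for doc_id in ranked_doc_ids[:num_docs]:
        # set() ensures each document counts a stem at most once.
        freq.update(set(doc_stems[doc_id]))
    # Highest frequency first; alphabetical order on ties (an assumption).
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    return ordered[:num_stems]


def expand_query(query_stems, feedback):
    """Append feedback stems that the query does not already contain."""
    expanded = list(query_stems)
    for stem, _ in feedback:
        if stem not in expanded:
            expanded.append(stem)
    return expanded
```

The expanded query is then run again with the same similarity measure to produce the "with RF" result for that measure.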

Relevance feedback should only be done for similarity measures (b), (c), and (d) above.

Report #1

Submit 8 runs in the following order: (a), (b), (b) with feedback, (c), (c) with feedback, (d), (d) with feedback, (e). For each of these runs, you should submit a treceval report. On a separate page, submit the following "Average Precision" table (with your own results):

Sim. Measure    Ave. Prec. without RF    Ave. Prec. with RF
a               0.2578                   N/A
b               0.3333                   0.3666
c               0.3500                   0.3675
d               0.3560                   0.3702
e               0.3124                   N/A

Your software may produce the table automatically, or you may manually create the table from your results. The 8 treceval summary pages should be submitted one per page. Each page must be clearly labeled with the similarity measure used, and whether or not relevance feedback was implemented. A sample output from treceval can be found here. The average precision you should use is in bold in the sample treceval output (0.0926 in the example).

Report #2
On a new page, for the first query (query number 351), output the top five relevance feedback stems. Include their document frequencies in the top 20 documents for this query when using the vector space model (dot product). The format of this report should be as follows (with your own values):

Top five stems for query 351:

Stem     Frequency
list     18
top      15
five     13
stems    13
here     8

Report #3
Make a table on a separate page that indicates, for each similarity measure, the time in seconds required to run the similarity measure with and without relevance feedback. The time in seconds should include the time to run all the queries, but should not include index time. This table should appear as follows (with your own values):

Sim. Measure    Time without RF    Time with RF
a               30.0 s             N/A
b               45.2 s             49.0 s
c               48.5 s             52.3 s
d               48.0 s             50.7 s
e               37.0 s             N/A
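
One way to collect the timings for this table is to wrap the query loop in a wall-clock timer that starts only after the index has been built. The sketch below is illustrative; `time_queries` and the `run_query` callback are hypothetical names standing in for your own code.

```python
import time


def time_queries(run_query, queries):
    """Run every query and return (results, elapsed_seconds).

    run_query: callable taking the query text and returning its ranking.
    queries: maps query number -> query text.
    Build the index before calling this so index time is excluded.
    """
    start = time.perf_counter()
    results = {qid: run_query(text) for qid, text in queries.items()}
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Calling it once per similarity measure, with and without relevance feedback, yields one value for each cell of the table.
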
Deliverables

Submit the design document and reports #1, #2, and #3. Output, a source code listing, and a detailed summary of your work are also required. Your summary must indicate the status of your assignment: clearly state what works and what does not work. The status should appear as the first item in the summary.

If you are taking the class on campus you are required to submit this assignment in a clearly labeled envelope with each section carefully delimited. If you are taking the class at a remote site, you are encouraged to submit your project by way of IIT courier on or before the due date. Internet students should bring their projects to class and submit them in clearly labeled envelopes. If this is not possible, then send e-mail to the TA with all reports in a zip file, and a table of contents clearly indicating where all the attachments can be found. Make sure you get an e-mail from the TA that indicates your assignment has been successfully received.

The assignment is due at the beginning of class on Wednesday, October 17, 2001.

Final Deliverables
1. Design Document.
2. Complete source listing of your code.
3. Reports 1-3.
4. Summary.
Contact lee@iit.edu with any questions, comments, or suggestions.