|
This is the second stage of our IR system. Augment your parser
to accept input documents and a list of queries, and to identify
ranked documents for these queries. You may use the SimpleIR
examples given in class, which already do this for the dot
product measure. Incorporate the parser you developed in
assignment 1.
As there was some uncertainty as to which document collection was
to be used in project part I, it is reiterated here that the
document collection consists of all documents in the files
provided in the zip file posted here.
Queries and treceval can be found on the main project page. The
queries are delimited by the "top" elements and are identified
by their numbers. You should use only the title part of the
queries for retrieving documents.
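The query-reading step described above can be sketched as follows. This is only an illustration: it assumes the query file uses the usual TREC topic layout, with `<num>` and `<title>` fields inside each `<top>` element, and the function name `parse_titles` and the sample text are made up for the example. Check the actual query file before relying on the patterns.

```python
import re

def parse_titles(topics_text):
    """Return {query_number: title_text} for each <top> element.

    Assumes (not confirmed by the assignment) that each topic looks like:
        <top> <num> Number: 351 <title> some words <desc> ... </top>
    """
    titles = {}
    for block in re.findall(r"<top>(.*?)</top>", topics_text, flags=re.DOTALL):
        num = re.search(r"<num>\s*(?:Number:)?\s*(\d+)", block)
        # The title runs until the next tag (or the end of the block).
        title = re.search(r"<title>\s*(?:Topic:)?\s*(.*?)\s*(?=<|\Z)",
                          block, flags=re.DOTALL)
        if num and title:
            # Collapse internal whitespace so multi-line titles become one line.
            titles[int(num.group(1))] = " ".join(title.group(1).split())
    return titles
```

Only the title text is returned, so the description and narrative parts of each query never reach the retrieval code.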
Your project should use the following similarity measures:
| a. | number of terms that match the document |
| b. | vector space model, dot product |
| c. | vector space model, cosine |
| d. | vector space model, pivoted document length normalization (5 point bonus) |
| e. | probabilistic model (Robertson, p.38 in textbook) |
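As a rough illustration of measures (a), (b), and (c), the sketch below scores one document against one query. It assumes raw term frequencies as vector weights and counts distinct matching terms for (a); the assignment does not fix either choice, and SimpleIR or the textbook may use different weights (for example tf-idf). Measures (d) and (e) are left out because their formulas come from the textbook.

```python
import math
from collections import Counter

def match_count(query_terms, doc_terms):
    """(a) Number of distinct query terms that occur in the document."""
    return len(set(query_terms) & set(doc_terms))

def dot_product(q_vec, d_vec):
    """(b) Sum of products of weights for terms shared by query and document."""
    return sum(w * d_vec.get(term, 0.0) for term, w in q_vec.items())

def cosine(q_vec, d_vec):
    """(c) Dot product divided by the product of the two vector lengths."""
    q_len = math.sqrt(sum(w * w for w in q_vec.values()))
    d_len = math.sqrt(sum(w * w for w in d_vec.values()))
    if q_len == 0 or d_len == 0:
        return 0.0
    return dot_product(q_vec, d_vec) / (q_len * d_len)

# Assumed weighting for this example: raw term frequency.
query = Counter(["oil", "oil", "gas"])
doc = Counter(["oil", "rig", "rig"])
scores = (match_count(query, doc), dot_product(query, doc), cosine(query, doc))
```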
Obtain average precision for your similarity measures using
treceval. A description of the format and how to use treceval
is given here.
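For orientation, here is a small sketch of writing ranked results as text lines. It assumes the six-column layout commonly used with trec_eval (query number, the literal `Q0`, document identifier, rank, score, run tag); the linked description is the authority on the format, and `format_run_lines` and the run tag are hypothetical names.

```python
def format_run_lines(query_id, ranked, run_tag="myrun"):
    """Turn [(doc_id, score), ...] (best first) into result-file lines.

    Assumed layout: "<query> Q0 <doc> <rank> <score> <tag>".
    """
    lines = []
    for rank, (doc_id, score) in enumerate(ranked, start=1):
        lines.append(f"{query_id} Q0 {doc_id} {rank} {score:.4f} {run_tag}")
    return lines
```

Writing one file per run (for example one for cosine without feedback and one with) keeps each treceval report tied to a single similarity measure.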
Assume you have enough memory to store the entire inverted index
(as does SimpleIR). For each query, use the top 20 documents
obtained by the similarity measure and obtain all stems from
those documents. Once you have all the stems, sort them by
their frequency in the top 20 documents, that is, by the number
of those documents in which each stem occurs. Hence, a term that
occurs in all 20 documents will have a frequency of 20. Now
take the top 5 stems based on this frequency and do relevance
feedback using these five stems.
Relevance feedback should only be done for similarity measures
(b), (c), and (d) above.
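The stem selection step can be sketched as below. It assumes the stems of each document are available in a dictionary (`doc_stems`, a hypothetical name) and breaks frequency ties alphabetically, which the assignment does not specify. How the selected stems are then used in the second retrieval pass is left to your feedback implementation.

```python
from collections import Counter

def top_feedback_stems(ranked_doc_ids, doc_stems, num_docs=20, num_stems=5):
    """Return the most frequent stems among the top-ranked documents.

    ranked_doc_ids: document ids, best first, from the initial retrieval.
    doc_stems: {doc_id: list of stems in that document} (assumed structure).
    Frequency is the number of top documents containing the stem, so a stem
    that occurs in all 20 documents has frequency 20.
    """
    doc_freq = Counter()
    for doc_id in ranked_doc_ids[:num_docs]:
        # set() ensures each document counts a stem at most once.
        doc_freq.update(set(doc_stems[doc_id]))
    # Highest frequency first; ties broken alphabetically (an assumption).
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:num_stems]
```

The returned (stem, frequency) pairs are also the values Report #2 asks for.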
Report #1
Submit 8 runs in the following order: (a), (b), (b) with feedback,
(c), (c) with feedback, (d), (d) with feedback, (e). For each
of these runs, you should submit a treceval report. On a separate
page, submit the following "Average Precision" table
(with your own results):
| Sim. Measure | Ave. Prec. without RF | Ave. Prec. with RF |
| a | 0.2578 | N/A |
| b | 0.3333 | 0.3666 |
| c | 0.3500 | 0.3675 |
| d | 0.3560 | 0.3702 |
| e | 0.3124 | N/A |
Your software may produce the table automatically, or you may
manually create the table from your results. The 8 treceval
summary pages should be submitted one per page. Each page must be
clearly labeled with the similarity measure used and whether
or not relevance feedback was used. A sample output
from treceval can be found here.
The average precision you should use is in bold in the sample
treceval output (0.0926 in the example).
Report #2
On a new page, for the first query (query number 351), output
the top five relevance feedback stems. Include their document
frequencies in the top 20 documents for this query when using
the vector space model (dot product). The format of this report
should be as follows (with your own values):
Top five stems for query 351:
| Stem | Frequency |
| list | 18 |
| top | 15 |
| five | 13 |
| stems | 13 |
| here | 8 |
Report #3
Make a table on a separate page that indicates, for each
similarity measure, the time in seconds required to run the
similarity measure with and without relevance feedback. The
time in seconds should include the time to run all the queries,
but should not include index time. This table should appear
as follows (with your own values):
| Sim. Measure | Time without RF | Time with RF |
| a | 30.0 s | N/A |
| b | 45.2 s | 49.0 s |
| c | 48.5 s | 52.3 s |
| d | 48.0 s | 50.7 s |
| e | 37.0 s | N/A |
|
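One way to measure query time without including index time is to start the clock only after the index has been built. The sketch below assumes a function `run_query` (a hypothetical name) that scores a single query against an index already held in memory.

```python
import time

def time_queries(run_query, queries):
    """Run every query and return (results, elapsed_seconds).

    The index must be built before this function is called, so that
    indexing time is not part of the measurement.
    """
    start = time.perf_counter()
    results = [run_query(q) for q in queries]
    elapsed = time.perf_counter() - start
    return results, elapsed
```

Calling it once per run (for example once for cosine without feedback and once with feedback) gives one value per cell of the timing table.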