Complex Document Image Processing (CDIP)
The IIT CDIP Test Collection
IIT CDIP 1.0 (Illinois Institute of Technology Complex Document Information Processing
Test Collection, version 1.0) is a data set supporting research in information retrieval,
document analysis, computational linguistics, data mining, and related fields. It consists of:
- IIT CDIP Records 1.0: 6,910,192 XML records describing documents that were released in various
  lawsuits against the US tobacco companies and research institutes
- IIT CDIP Queries 1.0: 40 topic descriptions used in the TREC 2006 Legal Track (see below)
- IIT CDIP Assessments 1.0: Relevance judgments on the 40 topics against pooling-based
  samples of the records
The records contain both text and metadata. The text was produced by applying optical
character recognition (OCR) to document images in TIFF format. The metadata was
produced by the tobacco organizations using a variety of techniques. It includes a title,
a listing of the senders and recipients of the document, important names mentioned in the document,
controlled vocabulary categories, geographical and organizational context data,
and other information. Not all metadata fields are available for all documents, the
formatting is inconsistent, and there is an unknown level of errors and omissions.
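Because fields may be missing and formatting varies, code that reads the records should not assume any field is present. The following is a minimal sketch of such defensive parsing, not a description of the actual file layout: the tag names (`record`, `title`, `sender`, `recipient`, `text`) and the sample record are hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical record: these tag names and values are illustrative only
# and are NOT taken from the real IIT CDIP Records 1.0 schema.
SAMPLE = """<record>
  <title>  Memo on filter research </title>
  <sender>Smith, J</sender>
  <text>OCR text g0es here</text>
</record>"""

# Hypothetical field names to extract from each record.
FIELDS = ("title", "sender", "recipient", "text")


def parse_record(xml_string):
    """Return a dict mapping each field name to a list of cleaned values.

    A missing or empty field yields an empty list rather than an error,
    and runs of whitespace inside a value are collapsed to single spaces.
    """
    root = ET.fromstring(xml_string)
    record = {}
    for name in FIELDS:
        record[name] = [
            " ".join(element.text.split())
            for element in root.iter(name)
            if element.text and element.text.strip()
        ]
    return record


if __name__ == "__main__":
    parsed = parse_record(SAMPLE)
    for field, values in parsed.items():
        print(field, values)
```

Returning a list for every field keeps downstream code uniform whether a field is absent, appears once, or repeats. OCR errors in the text itself (such as the "g0es" in the sample) are left untouched by this sketch.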
IIT CDIP 1.0 was used in text retrieval experiments in the TREC 2006 Legal Track evaluation, as described in this paper:
- Baron, J.; Lewis, D.; Oard, D. TREC 2006 Legal Track Overview. The Fifteenth
  Text REtrieval Conference Proceedings (TREC 2006). To Appear.
Further details on the collection can be found in this paper:
- Lewis, D.; Agam, G.; Argamon, S.; Frieder, O.; Grossman, D.; and Heard, J. Building
  a Test Collection for Complex Document Information Processing. SIGIR '06, pp 665-666.
An incomplete but useful description of the IIT CDIP Records 1.0 file is also available.
Researchers may request access to the CDIP collection by completing and submitting the form at the bottom of the page.
Users and potential users of the collection are encouraged to join the IIT CDIP discussion list
and to direct questions about collection details there.
Future Releases
The following updates to the collection will be released in 2007:
- IIT CDIP Queries 2.0: This will add a set of known item queries to the current TREC 2006 set.
Other releases of queries and assessments based on the TREC 2007 Legal Track are likely
towards the end of 2007. Details will be added here as they become available. Depending
on resource availability, we may also do additional cleanup of the metadata and release
the cleaned version when it is ready. Suggestions on ways to improve the collection are very welcome.
Gady Agam, IIT
Shlomo Argamon, IIT
Ophir Frieder, IIT
David Grossman, IIT
Dave Lewis, David D. Lewis Consulting
Contact address: 
Request Current Release
Request access to the CDIP TREC collection by completing and submitting the following form.