Complex Document Information Processing (CDIP)

The IIT CDIP Test Collection

IIT CDIP 1.0 (Illinois Institute of Technology Complex Document Information Processing Test Collection, version 1.0) is a data set supporting research in information retrieval, document analysis, computational linguistics, data mining, and related fields. It consists of:
  • IIT CDIP Records 1.0: 6,910,192 XML records describing documents that were released in various lawsuits against the US tobacco companies and research institutes
  • IIT CDIP Queries 1.0: 40 topic descriptions used in the TREC 2006 Legal Track (see below)
  • IIT CDIP Assessments 1.0: Relevance judgments on the 40 topics against pooling-based samples of the records
The records contain both text and metadata. The text was produced by applying optical character recognition (OCR) to document images in TIFF format. The metadata was produced by the tobacco organizations using a variety of techniques. It includes a title, a listing of the senders and recipients of the document, important names mentioned in the document, controlled vocabulary categories, geographical and organizational context data, and other information. Not all metadata fields are available for all documents, the formatting is inconsistent, and the rate of errors and omissions is unknown.
IIT CDIP 1.0 was used in text retrieval experiments in the TREC 2006 Legal Track evaluation, as described in this paper:
  • Baron, J.; Lewis, D.; Oard, D. TREC 2006 Legal Track Overview. The Fifteenth Text REtrieval Conference Proceedings (TREC 2006). To Appear.
Further details on the collection can be found in this paper:
  • Lewis, D.; Agam, G.; Argamon, S.; Frieder, O.; Grossman, D.; Heard, J. Building a Test Collection for Complex Document Information Processing. SIGIR '06, pp. 665-666.

An incomplete but useful description of the IIT CDIP Records 1.0 file is also available.
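
Because metadata fields may be missing or inconsistently formatted, code that reads the records should not assume any field is present. The following is a minimal sketch of that idea in Python. The element names (record, title, sender, ocr_text) and the sample content are hypothetical placeholders chosen for illustration, not the actual IIT CDIP Records 1.0 schema; consult the description of the records file for the real field names.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout: the element names and sample content below
# are invented placeholders, NOT the actual IIT CDIP Records 1.0 schema.
SAMPLE = """
<records>
  <record>
    <title>Example title</title>
    <sender>Example Sender</sender>
    <ocr_text>Example OCR text with a typica1 recognition error.</ocr_text>
  </record>
  <record>
    <ocr_text>Page 2 of 4</ocr_text>
  </record>
</records>
"""


def field(record, name):
    """Return the stripped text of a child element, or None if absent or empty."""
    node = record.find(name)
    if node is None or node.text is None:
        return None
    text = node.text.strip()
    return text or None


def parse_records(xml_string):
    """Yield one dict per record, tolerating missing metadata fields."""
    root = ET.fromstring(xml_string)
    for record in root.iter("record"):
        yield {
            "title": field(record, "title"),
            "sender": field(record, "sender"),
            # Treat missing OCR text as an empty string so that
            # downstream indexing code always receives a str.
            "text": field(record, "ocr_text") or "",
        }


records = list(parse_records(SAMPLE))
```

In this sketch the second record has no title or sender, so those keys come back as None rather than raising an error. For a file holding millions of records, a streaming parser such as xml.etree.ElementTree.iterparse would likely be a better fit than loading the whole document at once.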

Researchers may request access to the CDIP collection by completing and submitting the form at the bottom of the page. Users and potential users of the collection are encouraged to join the IIT CDIP discussion list and direct questions about the collection details there.

Future Releases

The following updates to the collection will be released in 2007:
  • IIT CDIP Queries 2.0: This will add a set of known item queries to the current TREC 2006 set.
Other releases of queries and assessments based on the TREC 2007 Legal Track are likely towards the end of 2007. Details will be added here as they become available. Depending on resource availability, we may also do additional cleanup of the metadata and release the cleaned version when it is ready. Suggestions on ways to improve the collection are very welcome.

Gady Agam, IIT

Shlomo Argamon, IIT

Ophir Frieder, IIT

David Grossman, IIT

Dave Lewis, David D. Lewis Consulting

Contact address: the letters i, i, t, c, d, i, p in order, followed by at iit dot edu.

Request Current Release

Request access to the IIT CDIP collection by completing and submitting the following form.
  • Your Full Name*
  • Your Email*
  • Reason: What is your planned usage of the test collection?