Duplicate Document Detection Test Collection
Abdur Chowdhury, Aleksander Kolcz, February 5, 2005
One of the major issues encountered in the 1990’s for retrieval systems was the shear growth in volume of data, as seen by the growth of the web. This growth in data for retrieval systems also increased the importance of problems such as duplicate data detection both from the users perspective and retrieval systems. To facilitate research in this area we propose a duplicate document test collection based on a standard NIST data set. First, we examine the complexities of determining what constitutes a duplicate document and second we present our methodology and test collection.
The definition of what constitutes a duplicate is unclear. For instance, a duplicate can be defined as the exact syntactic terms and sequence, with or without formatting differences. Thus, there are either no-differences or only formatting differences, and the content of the documents is exactly the same. This definition of duplicate is quite restrictive, but is the most common approach used by retrieval systems. Duplicate detection is achieved by calculating a unique hash value for each document. Each document is then examined for duplication by looking up the value (hash) in either an in-memory hash or persistent lookup system. Several common hash functions used are MD2, MD5, or SHA (1995). These functions or types of functions are used because they have three desirable properties, can be calculated on arbitrary data / document lengths, easy to compute, and have very low probabilities of collisions.
While this approach is both fast and easy to implement, it is very brittle, potentially any change in formatting (if formatting is used), word order, or slight variations in content produces a different document signature (hash). For example, a web page that displays the number of visitors along with the content of the page will continually produce different signatures even though the document is the same. A news article, which has a typo where the word “the” is repeated twice, will produce a different signature when the typo is fixed, even though the content would be considered duplicated (2004).
While formatting and small typos are easily interpreted to a human as a duplicate machines must be able to handle these types of errors without false positives or missing the duplicate. Thus, as this fuzziness is quantified we come to the issue of at what point are two document different. Since every application has a different goal, e.g. plagiarism detection seeks to find even small overlaps of documents, where duplicate detection seeks to find highly similar documents, we propose to focus only on the second type of problem, duplication.
To that end we use a common metric for similarity cosine. Additionally, we define our similarity focusing solely on the binary feature inclusion without magnitude:
![]()
Thus, we can define the similarity of any two documents as the intersection over the square root of the union of their features, in this case set of words. The next question is at what point do documents become just similar in terms of words and not in terms of content. Since there is no right answer we chose the use of a .9 similarity threshold. So, if two documents equal or exceed that similarity value we will consider them as duplicates.
Next, given the difficulty of developing a general purpose collection, we use the NIST WT10G collection. This collection is designed to emulate a small sub-collection of the web in terms of inter-link and server sizes. Given the ease of accessing this collection we have build a test collection using it.
The WT10G dataset from NIST was used for our web page similarity collection. While this 10GB 1.7 million document collection is synthetic in nature, it was developed to possess characteristics of the larger web for text retrieval effectiveness research. Additionally, the WT10G corpus has been examined for similarity of the collection to the web as a whole (e.g., using metrics such as link connectivity) and was found to be representative. These factors make it an attractive for use in duplicate similarity experiments in which web pages are being examined.
We used a random sample of 115,342 documents (approximately 7% of the total collection) as queries against the full WT10G collection. For each query, a set of documents for which the cosine similarity exceeded or equaled the 0.9 threshold was identified. Only 26,491 of the queries had a cosine neighborhood containing documents other than themselves and only those are included in this duplicate document test collection. Note that the computational of cosine similarity is quite expensive and would generally be unsuitable in an operational setting and the calculation of this collection took ~3 weeks.
Data File:
Similarity file format:
|
File |
Similarity File |
Similarity Score |
|---|---|---|
|
wtx085-b33-52 |
WTX032-B02-211 |
0.9561089365214079 |
Size ~85.1MB gziped: WT10G Duplicate Document Test Collection