File Deduplication Software Tips for Document Collections

Jason Krause


Big Data is making life increasingly complicated for lawyers. In eDiscovery, a big part of the challenge is simply eliminating duplicate copies of emails, Word documents, spreadsheets, and other files, along with their metadata, which often appear in duplicate, triplicate, or many more copies in any collection. File deduplication software is the answer.

Nextpoint File Deduplication Software

Fortunately, there are a couple of basic tools that deduplicate a collection by eliminating unnecessary copies of files. However, each of these tools has its own purposes and limitations that affect how well unnecessary files are eliminated.

Nextpoint provides deduping technology for eliminating redundant copies of documents. When you open a new Nextpoint case instance, file deduplication is turned on by default.

Deduplication settings in the application allow your data to be deduped by MD5 hash values. When deduplication is enabled, documents with the same content hash are always considered an Exact Match. Container files such as email archives (.pst, .mbox) or .zip files are considered duplicates only if their contained (child) files are also duplicates.
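As a rough illustration of hash-based matching, here is a minimal Python sketch. It is not Nextpoint's implementation; the function names (`md5_hex`, `group_by_hash`, `containers_match`) and the sample file names are hypothetical, and the container rule is modeled under the assumption that two containers match when their children have the same set of content hashes.

```python
import hashlib
from collections import defaultdict


def md5_hex(data: bytes) -> str:
    """Return the MD5 content hash of a file's bytes as a hex string."""
    return hashlib.md5(data).hexdigest()


def group_by_hash(docs: dict[str, bytes]) -> dict[str, list[str]]:
    """Group document names by content hash; names sharing a hash are exact matches."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, content in docs.items():
        groups[md5_hex(content)].append(name)
    return dict(groups)


def containers_match(children_a: list[bytes], children_b: list[bytes]) -> bool:
    """Assumed rule: containers are duplicates only if their children are duplicates."""
    hashes_a = sorted(md5_hex(child) for child in children_a)
    hashes_b = sorted(md5_hex(child) for child in children_b)
    return hashes_a == hashes_b


# Hypothetical sample collection: two identical files and one revised file.
docs = {
    "a.docx": b"quarterly report",
    "b.docx": b"quarterly report",
    "c.docx": b"quarterly report v2",
}
duplicates = [names for names in group_by_hash(docs).values() if len(names) > 1]
```

In this sketch only `a.docx` and `b.docx` land in the same group, because a single changed byte produces a different hash.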

You can also choose an additional level of deduplication by adding “Email-Message-ID” to your dedupe criteria. When “Email-Message-ID” is added, documents/emails with the same “Email-Message-ID” are also considered to be an Exact Match, even if their content hashes and headers do not exactly match. You can review how file deduplication software settings work in Nextpoint.
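The effect of adding the Message-ID criterion can be sketched in a few lines of Python. This is an illustrative model only, not Nextpoint's code: the `Doc` class, the `is_exact_match` function, the `use_message_id` flag, and the sample values are all hypothetical, and the sketch assumes that an empty or missing Message-ID never counts as a match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Doc:
    content_hash: str
    message_id: Optional[str] = None  # None for non-email documents


def is_exact_match(a: Doc, b: Doc, use_message_id: bool = False) -> bool:
    """Match on content hash; optionally also match on a shared Message-ID."""
    if a.content_hash == b.content_hash:
        return True
    # Assumption: a missing Message-ID is never treated as a match.
    if use_message_id and a.message_id and a.message_id == b.message_id:
        return True
    return False


# Two copies of one email whose bytes differ (for example, different headers).
original = Doc("hash-1", "<msg-001@example.com>")
other_copy = Doc("hash-2", "<msg-001@example.com>")

hash_only = is_exact_match(original, other_copy)
with_message_id = is_exact_match(original, other_copy, use_message_id=True)
```

Here `hash_only` is `False` and `with_message_id` is `True`, which mirrors the behavior described above: the extra criterion catches copies that hashing alone would treat as different.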

Deduplication Is Unique in Every Case

But one common question we hear is, “Why do I still have duplicate copies in my document collection if I already deduped?”

The answer is that a single document is often introduced into a document collection multiple ways. Different people will attach the same document to an email and send it to different recipients.

Once those separate emails are collected and introduced into a document collection, the attached document is also captured along with each email. Those emails are kept in different families, and deduplication will not eliminate copies in different families. That is because most customers need to keep duplicates in a collection to establish how the content was distributed and who may have been privy to the information.
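One way to picture this is to include the family in the dedupe key, so that identical attachments survive when they belong to different emails. The Python sketch below is a simplified model under that assumption, not a description of Nextpoint's internals; the field names (`id`, `family_id`, `md5`) and sample records are hypothetical.

```python
def dedupe_within_families(docs: list[dict[str, str]]) -> list[dict[str, str]]:
    """Drop a copy only when the same hash was already seen in the same family."""
    seen: set[tuple[str, str]] = set()
    kept = []
    for doc in docs:
        key = (doc["family_id"], doc["md5"])
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


# Two different emails (families A and B) that each attach the same report.
collection = [
    {"id": "email-1", "family_id": "fam-A", "md5": "e1"},
    {"id": "report-in-A", "family_id": "fam-A", "md5": "r"},
    {"id": "email-2", "family_id": "fam-B", "md5": "e2"},
    {"id": "report-in-B", "family_id": "fam-B", "md5": "r"},
    {"id": "report-in-A-again", "family_id": "fam-A", "md5": "r"},
]
kept_ids = [doc["id"] for doc in dedupe_within_families(collection)]
```

Both `report-in-A` and `report-in-B` are kept even though their hashes match, which preserves the record of who received the attachment. Only the repeat inside family A is dropped.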

The Importance of Metadata

Metadata is also used to either reject duplicates (in which case they are “near duplicates”) or merge them into a single copy. An example of merged metadata would be when two identical emails have been collected from different sources. 

Since the documents are exact duplicates, the values for coding fields such as “Email Datetime,” “Mailbox Path” and “Batch” will be concatenated and merged, and a single document will remain.
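The merge step can be sketched as follows. This is a hedged illustration rather than Nextpoint's actual logic: the `merge_duplicates` function, the `"; "` separator, the sample field values, and the choice to list a repeated value only once are all assumptions.

```python
def merge_duplicates(copies: list[dict[str, str]], sep: str = "; ") -> dict[str, str]:
    """Merge the coding fields of exact duplicates into a single record."""
    merged: dict[str, list[str]] = {}
    for copy in copies:
        for field, value in copy.items():
            values = merged.setdefault(field, [])
            # Assumption: an identical value is listed once, not repeated.
            if value not in values:
                values.append(value)
    return {field: sep.join(values) for field, values in merged.items()}


# The same email collected from two hypothetical sources.
copies = [
    {"Email Datetime": "2015-03-01 09:00", "Mailbox Path": "/Inbox", "Batch": "Batch 1"},
    {"Email Datetime": "2015-03-01 09:00", "Mailbox Path": "/Sent Items", "Batch": "Batch 2"},
]
merged = merge_duplicates(copies)
```

The result is one record whose “Mailbox Path” reads `/Inbox; /Sent Items` and whose “Batch” reads `Batch 1; Batch 2`, so no source information is lost when the second copy is removed.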

An example of metadata rejecting a duplicate would be where Document 1a was coded with a value of “John’s Desktop” in a field called “source” and an identical Document 1b was imported.

Because of the different values in the “source” coding field, you would end up with two unique documents. Within Nextpoint, reviewers can see in the sidebar how many copies of a document exist in a collection.

In this example the documents would be linked as “Documents with matching MD5 hashes” in the Related Documents section. Once a document is marked as responsive, non-responsive, privileged, or otherwise, reviewers will know that it has been reviewed. However, there is obviously no need to then keep large numbers of copies.
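The “source” example can be modeled by treating certain coding fields as part of the dedupe key, while still linking the surviving copies by hash. The Python sketch below is an assumption-laden illustration, not Nextpoint's implementation: the `dedupe_with_fields` function, the `distinguishing_fields` parameter, and the “Shared Drive” value for Document 1b are hypothetical.

```python
from collections import defaultdict


def dedupe_with_fields(docs, distinguishing_fields=("source",)):
    """Keep one copy per (hash, distinguishing field values); link survivors by hash."""
    unique = {}
    for doc in docs:
        key = (doc["md5"],) + tuple(doc.get(f) for f in distinguishing_fields)
        unique.setdefault(key, doc)  # the first copy seen for a key is kept
    related = defaultdict(list)
    for doc in unique.values():
        related[doc["md5"]].append(doc["id"])
    return list(unique.values()), dict(related)


docs = [
    {"id": "1a", "md5": "abc", "source": "John's Desktop"},
    {"id": "1b", "md5": "abc", "source": "Shared Drive"},  # hypothetical value
    {"id": "1c", "md5": "abc", "source": "John's Desktop"},
]
unique_docs, related = dedupe_with_fields(docs)
```

Documents 1a and 1b both survive because their “source” values differ, and `related` groups them under the shared hash, much like a list of documents with matching MD5 hashes. Document 1c is dropped because it matches 1a on both hash and “source.”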

Make Mine Custom, Please

If the standard deduplication settings do not quite fit your needs, Nextpoint offers custom deduplication services as well.

Document populations can be deduped by custodian or across an entire population, and rules can be implemented to determine how to handle duplicate documents with different coding. Those rules can include criteria like whether to merge the metadata and remove one of the duplicates, or to leave the duplicates separate.
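The difference between the two scopes comes down to what goes into the dedupe key. The following Python sketch illustrates the idea under stated assumptions; the `dedupe` function, the `scope` parameter and its values, and the sample records are hypothetical and are not taken from Nextpoint's product.

```python
def dedupe(docs: list[dict[str, str]], scope: str = "global") -> list[dict[str, str]]:
    """Dedupe by hash across the whole population or separately per custodian."""
    if scope not in ("global", "custodian"):
        raise ValueError("scope must be 'global' or 'custodian'")
    seen = set()
    kept = []
    for doc in docs:
        # Per-custodian scope adds the custodian to the key.
        key = doc["md5"] if scope == "global" else (doc["custodian"], doc["md5"])
        if key not in seen:
            seen.add(key)
            kept.append(doc)
    return kept


population = [
    {"id": "d1", "custodian": "Alice", "md5": "abc"},
    {"id": "d2", "custodian": "Bob", "md5": "abc"},
    {"id": "d3", "custodian": "Alice", "md5": "abc"},
]
global_ids = [d["id"] for d in dedupe(population, scope="global")]
custodian_ids = [d["id"] for d in dedupe(population, scope="custodian")]
```

Global scope keeps only `d1`, while custodian scope keeps `d1` and `d2`, one copy for each person who held the document. Which scope fits depends on whether the case needs to show who had access to a given file.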

Deduplication is often one of the first and most important steps to narrowing a document collection. Sadly, many document reviewers are enamored of expensive new technology for winnowing large data sets, and forget the simple but effective deduping tools available.

Taking advantage of Nextpoint’s file deduplication software will help eliminate redundant copies of documents and streamline your review process.