Advanced eDiscovery Data Filtering Techniques

Michael Beumer

eDiscovery processing is an often overlooked phase of litigation because it is considered a technical, not a legal, practice. However, lawyers who understand processing and advanced data filtering techniques are in a position to control both the costs and the scope of any matter.


eDiscovery Data Filtering In Three Parts

In the first two posts in our series on filtering, we looked at the technologies to employ, including deNISTing and email threading, as well as the tough decisions lawyers need to consider in processing data for litigation. In this post, we will look at some more advanced data filtering techniques that make it possible to process evidence in the most defensible and cost-effective manner.


Defending Your Data Filtering Techniques

Whenever a collection of documents is filtered prior to review, lawyers must validate the authenticity of the Electronically Stored Information (ESI) that has been processed. As with any evidence in litigation, there are two validation methods: 

  • Creating documentation (the chain-of-custody log)
  • Testimony by the ESI collector about what was done

Keep in mind that opposing counsel can ask questions or even serve discovery about your culling methods at any time. Keep a detailed log of all the methods used in a case. 

A chain of custody documents every step from receipt of the source media through storage and processing, showing that proper steps were taken to preserve the integrity of the original data. Key steps to take: 

  • Write-protect all files during collection. Write blocking prevents alteration of the source media's metadata when a device is attached to collect the source data.
  • Preserve all metadata. The software and processes used must protect the files during the collection process and maintain the integrity of both the system and file metadata associated with each record.
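
As a rough illustration of what a chain-of-custody log entry might capture, here is a minimal Python sketch. The field names, the `custody_entry` and `append_entry` helpers, and the choice of a JSON-lines file are all assumptions made for the example, not a description of any particular eDiscovery platform.

```python
import hashlib
import json
from datetime import datetime, timezone


def custody_entry(data: bytes, item_id: str, handler: str, action: str) -> dict:
    """Build one chain-of-custody record (all field names are illustrative)."""
    return {
        "item_id": item_id,
        "handler": handler,
        "action": action,
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "size_bytes": len(data),
        # The hash ties this log line to the exact bytes that were handled.
        "sha256": hashlib.sha256(data).hexdigest(),
    }


def append_entry(log_path: str, entry: dict) -> None:
    """Append the record as one JSON line; earlier entries are never rewritten."""
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")
```

Recording a hash at every hand-off means that a later mismatch points to the step where the data changed.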

Unfortunately, there are no standard protocols for forensic inspection. It is essential for the parties to agree to the filtering technology and tactics employed, ideally during the Rule 26(f) Conference.


Know Your eDiscovery Data

In an era of deepfakes and Photoshop, it is increasingly difficult to authenticate and validate evidence. In addition, lawyers will have to deal with custodians who willfully or negligently delete documents and files that are potentially relevant to their matter. 

The best defense is to understand hash values. A hash value is a fixed-length string that an algorithm generates from a specified set of data. For practical purposes it is unique to that data, like a digital fingerprint or the VIN of a car. 

Digital documents, whatever their format, are at the most basic level just numbers. Because every format breaks down to numbers, we can use algorithms to compare those numbers. 
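
To make the fingerprint idea concrete, here is a small Python sketch using the standard `hashlib` module. The `fingerprint` helper name is an assumption for the example. MD5 is used because it is the algorithm discussed in this post; the same function accepts other algorithm names such as `"sha256"`.

```python
import hashlib


def fingerprint(data: bytes, algorithm: str = "md5") -> str:
    """Return the hex digest (hash value) of the given bytes."""
    return hashlib.new(algorithm, data).hexdigest()


original = b"The quick brown fox jumps over the lazy dog"
altered = b"The quick brown fox jumps over the lazy dog."  # one added period

print(fingerprint(original))  # 9e107d9d372bb6826bd81d3542a419d6
print(fingerprint(altered))   # e4d909c290d0fb1ca068ffaddf22cbd0
print(fingerprint(original) == fingerprint(original))  # True
```

A single added character produces a completely different value, while identical bytes always produce the same one.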

Establishing a valid chain of custody means being able to show where the evidence has been, who has handled it, and its condition at all times in order to establish that there has been no alteration or tampering of the evidence prior to the time it is presented to the court.

Hash values are the underpinning of much of the technology we have been discussing. As described in our first filtering post, deNISTing is the process of removing all system files that are deemed to have no evidentiary value. Such files are identified by matching their cryptographic hash values. In addition, hash values are important for: 

  • Data integrity: Computing an MD5 hash value for each document means that any later change to the document produces a different hash value, exposing any attempt to manipulate potentially relevant evidence.
  • De-duplication: Hash values are created and compared for each document; exact copies have the same hash value, so duplicates can be identified and eliminated.
  • Near-deduplication: Documents that are almost 100% identical can be suppressed or removed from the collection. Because even a one-character difference produces a different hash value, this step relies on comparing document content for similarity rather than on exact hash matches.
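
The filters in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names are hypothetical, `known_hashes` stands in for a reference list of system-file hashes, and the word-shingle Jaccard score is just one simple way to measure near-duplication, not the method any particular product uses.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()


def denist(files: dict[str, bytes], known_hashes: set[str]) -> dict[str, bytes]:
    """Drop files whose hash appears in a list of known system-file hashes."""
    return {name: data for name, data in files.items()
            if md5_hex(data) not in known_hashes}


def deduplicate(files: dict[str, bytes]) -> dict[str, bytes]:
    """Keep the first file seen for each hash value; suppress exact copies."""
    seen: set[str] = set()
    unique: dict[str, bytes] = {}
    for name, data in files.items():
        digest = md5_hex(data)
        if digest not in seen:
            seen.add(digest)
            unique[name] = data
    return unique


def jaccard(text_a: str, text_b: str, k: int = 3) -> float:
    """Similarity of two texts as the overlap of their k-word shingles (0.0 to 1.0)."""
    def shingles(text: str) -> set:
        words = text.lower().split()
        return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}
    a, b = shingles(text_a), shingles(text_b)
    return len(a & b) / len(a | b)


files = {
    "kernel32.dll": b"system",          # stands in for a known system file
    "memo.docx": b"Q3 forecast",
    "memo_copy.docx": b"Q3 forecast",   # exact duplicate of memo.docx
}
known_hashes = {md5_hex(b"system")}     # hypothetical reference list
kept = deduplicate(denist(files, known_hashes))
print(list(kept))  # ['memo.docx']
```

Near-duplicates need the similarity score instead: two drafts that differ by one word hash differently but still score well above zero with `jaccard`, and the cutoff for treating them as near-duplicates is a judgment call.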


Can Computers Replace Lawyers?

In cases with extremely large data collections, Technology-Assisted Review (TAR) or predictive coding may be needed to supplement the filtering technologies we have described already. Predictive coding is the use of machine learning in document review, usually applying statistical analysis to measure effectiveness. 

Deduplication and deNISTing are technical engineering filters. Technology-Assisted Review (TAR), by contrast, is a process for coding documents using a computerized system that harnesses human judgments. 

Reviewers code a smaller set of documents to identify those that are potentially relevant, and then train the computer to identify similar documents. These processes generally incorporate statistical models or sampling techniques to guide the process and to measure effectiveness.
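
To give a feel for the "train the computer" step, here is a toy Python sketch that learns from a handful of reviewer-coded documents and then scores new ones. It uses a simple naive Bayes word model chosen purely for illustration. Commercial TAR tools use their own, more sophisticated models, and the seed documents below are invented.

```python
import math
from collections import Counter


def train(labeled: list[tuple[str, bool]]):
    """Count word frequencies in reviewer-coded documents (True = relevant)."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, relevant in labeled:
        doc_counts[relevant] += 1
        word_counts[relevant].update(text.lower().split())
    return word_counts, doc_counts


def relevance_probability(text: str, model) -> float:
    """Naive Bayes estimate of P(relevant | text) with add-one smoothing."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    total_docs = sum(doc_counts.values())
    log_score = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        score = math.log((doc_counts[label] + 1) / (total_docs + 2))
        for word in text.lower().split():
            score += math.log((word_counts[label][word] + 1)
                              / (total_words + len(vocab) + 1))
        log_score[label] = score
    # Turn the two log scores into a probability; clamp to avoid overflow.
    diff = max(min(log_score[False] - log_score[True], 700.0), -700.0)
    return 1 / (1 + math.exp(diff))


# Invented seed set coded by human reviewers.
seed = [
    ("merger price negotiation", True),
    ("merger agreement draft", True),
    ("lunch menu friday", False),
    ("office party friday", False),
]
model = train(seed)
print(relevance_probability("merger draft price", model) > 0.5)   # True
print(relevance_probability("friday lunch party", model) > 0.5)   # False
```

The human judgments in the seed set drive everything: the model only generalizes the coding decisions reviewers have already made.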


These May Not Be the Droids You’re Looking For

Our apologies to George Lucas for the subhead, but simply try to remain practical and reasonable in your efforts. In the vast majority of matters, the volume of collected data can be sufficiently handled with the “traditional” technical filtering and culling methods. TAR or predictive coding is by no means a panacea, and will increase eDiscovery costs. But in the largest of cases, these advanced techniques can be used to enhance traditional methods.

Ralph Losey, a practicing litigator who has written extensively on the topic, suggests that in cases with high volumes of documents, predictive coding can be used to identify documents likely to be irrelevant as a second filter, after deNISTing, de-duping and other technical filters. 

Legal teams can identify a pool of documents unlikely to be relevant and discard them. It will take negotiation to determine which documents to discard, but Losey suggests retaining documents with a 90% or higher probability of relevance when using a process called Continuous Active Learning.  
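
Once each document carries a relevance score, splitting the collection at a negotiated cutoff is mechanically simple. The Python sketch below assumes scores are already available as a mapping from a document ID to a probability. The function names are hypothetical, the 0.90 default simply mirrors the figure quoted above, and drawing a random sample from the set-aside pile for human spot-checking is one common-sense safeguard rather than a prescribed protocol.

```python
import random


def split_by_score(scores: dict[str, float], cutoff: float = 0.90):
    """Partition document IDs into those at or above the cutoff and the rest."""
    retained = {doc_id for doc_id, p in scores.items() if p >= cutoff}
    set_aside = set(scores) - retained
    return retained, set_aside


def sample_for_check(doc_ids: set[str], sample_size: int, seed: int = 0) -> list[str]:
    """Draw a reproducible random sample of set-aside documents for human review."""
    rng = random.Random(seed)
    ordered = sorted(doc_ids)  # sort first so the sample is repeatable
    return rng.sample(ordered, min(sample_size, len(ordered)))


# Invented scores for illustration.
scores = {"DOC-001": 0.97, "DOC-002": 0.90, "DOC-003": 0.41, "DOC-004": 0.05}
retained, set_aside = split_by_score(scores)
print(sorted(retained))   # ['DOC-001', 'DOC-002']
print(sorted(set_aside))  # ['DOC-003', 'DOC-004']
```

Fixing the random seed and logging it keeps the spot-check sample reproducible if opposing counsel later asks how it was drawn.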

In addition, there are special considerations that may trip up even the most well-meaning eDiscovery filtering and processing. Advanced search technology can provide language identification analysis, which checks for foreign languages before you move forward with your review. It is also important to identify when a collection includes encrypted data for which you do not possess the encryption key. If your interviews with custodians indicate relevant information may be encrypted, forensic services may be required to crack the encryption, adding time and expense to your process.
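
One rough way to flag files that might be encrypted is to measure how random their bytes look. The Python sketch below computes Shannon entropy per byte; the 7.5 threshold and the helper names are assumptions for illustration. High entropy is only a hint, since compressed files such as ZIP archives or JPEG images score just as high, so a flagged file still needs a closer look.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Bits of entropy per byte: 0.0 for constant data, 8.0 for uniformly spread bytes."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def looks_encrypted_or_compressed(data: bytes, threshold: float = 7.5) -> bool:
    """Heuristic flag only; the threshold is an assumed starting point."""
    return shannon_entropy(data) >= threshold


print(shannon_entropy(b"aaaaaaaa"))                            # 0.0
print(looks_encrypted_or_compressed(b"Plain English text."))   # False
```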


Closing the Circle on Filtering

There is obviously more to be said on these topics, but I hope this is a useful introduction to some of the advanced eDiscovery techniques. And don’t forget about the earlier posts in the series, The Art of eDiscovery Data Filtering and Culling, and The Human Role in Data Filtering.

In the meantime, if you have questions regarding data reduction methods, reach out to the experts at Nextpoint. We are here to help.


Much more on filtering and culling standards in our new eBook: The Art of eDiscovery Data Filtering.