You can tell a lot about files on a computer before you even open them.
That’s the premise that makes data reduction possible prior to discovery review—in a stage commonly referred to as early data assessment.
As summarized by Inside Counsel,
“The idea behind EDA is to determine the types of data to be potentially preserved, gathered and analyzed, maybe to identify gaps or overlaps in the data, and …to help scope the project.”
To determine which data should be “preserved, gathered and analyzed,” data specialists can start by examining the metadata—descriptive information a computer system maintains about each of its documents. Here are some techniques and strategies that help parties reduce their data burden:
1. Date range analysis
One of the quickest ways to exclude large amounts of unnecessary data from review is to establish a time range in which files could possibly be relevant to the matter at hand.
For example, if a discovery request seeks all documents relevant to a marketing promotion that ran from October 2013 to February 2014, you could try to reach an agreement with opposing counsel to exclude documents created before 2013 or after February 2014 from discovery.
EDA tools, such as those used by the Nextpoint Data Strategy team, can filter out documents created before or after a specified date range so you don’t waste time processing them later in discovery. They can also identify any chronological gaps that exist in your data.
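As a rough illustration of the idea, and not a description of how any particular EDA tool works, the sketch below filters a handful of hypothetical metadata records by creation date and then reports any long stretches with no documents. The document IDs, dates, and the 60-day gap threshold are all made up for the example.

```python
from datetime import date, timedelta

# Hypothetical metadata records: (document id, creation date).
docs = [
    ("memo-001", date(2012, 11, 3)),
    ("plan-002", date(2013, 10, 15)),
    ("deck-003", date(2013, 12, 1)),
    ("mail-004", date(2014, 2, 20)),
    ("mail-005", date(2014, 6, 9)),
]

def in_range(docs, start, end):
    """Keep documents whose creation date falls within [start, end]."""
    return [(doc_id, created) for doc_id, created in docs if start <= created <= end]

def find_gaps(docs, max_gap_days):
    """Return (earlier, later) date pairs separated by more than max_gap_days."""
    dates = sorted(created for _, created in docs)
    limit = timedelta(days=max_gap_days)
    return [(a, b) for a, b in zip(dates, dates[1:]) if b - a > limit]

# Agreed range from the example above: nothing before 2013 or after February 2014.
kept = in_range(docs, date(2013, 1, 1), date(2014, 2, 28))
gaps = find_gaps(kept, max_gap_days=60)

print([doc_id for doc_id, _ in kept])
print(gaps)
```

A gap reported here is only a prompt for follow-up questions; it may mean data is missing, or simply that nothing was written during that period.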
2. DeNISTing
Logic dictates that only files created or modified by users are potentially relevant to a review. But computers also contain lots of common system files that users never touch.
The process of filtering out those non-evidentiary files is called “deNISTing,” a reference to the National Institute of Standards and Technology, whose National Software Reference Library maintains a reference list of known software files. DeNIST filters work by cross-referencing the collection against that list and eliminating the matches.
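Here is a minimal sketch of that cross-referencing step, assuming the comparison is made on content hashes (a common approach, though the details vary by tool). The file paths, file contents, and the one-entry "known files" list are hypothetical stand-ins for a real reference list.

```python
import hashlib

def sha1_hex(content: bytes) -> str:
    """Return the SHA-1 digest of a file's content as a hex string."""
    return hashlib.sha1(content).hexdigest()

# Hypothetical stand-in for a reference list of known system-file hashes.
known_system_hashes = {sha1_hex(b"stock operating system library")}

# Hypothetical collection: path -> file content.
collected = {
    "C:/Windows/System32/example.dll": b"stock operating system library",
    "C:/Users/jdoe/Documents/budget.xlsx": b"user-authored spreadsheet",
}

def denist(files, known_hashes):
    """Drop every file whose content hash appears in the known-file list."""
    return {
        path: data
        for path, data in files.items()
        if sha1_hex(data) not in known_hashes
    }

remaining = denist(collected, known_system_hashes)
print(list(remaining))
```

Matching on content rather than file name means a renamed system file is still filtered out, while a user file that merely shares a system file's name is kept.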
3. Deduplication
If identical copies of a document exist in a collection, it doesn’t make sense to waste time reviewing all of them. Deduplication filters make note of these copies and identify them for exclusion.
This is especially useful when reviewing emails. Each sent email will typically create two copies of itself—one in the sender’s sent-items folder and another in the recipient’s inbox. This phenomenon is multiplied when there are many recipients or file attachments for the original email. Deduplication ensures only the original, master copies are subject to review.
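The sketch below shows one simple way to do this, assuming two copies count as duplicates when their content is byte-for-byte identical. Real email deduplication usually hashes a chosen set of fields (sender, recipients, subject, body, attachments) instead, so treat the document IDs and the matching rule here as illustrative only.

```python
import hashlib
from collections import defaultdict

def dedupe(documents):
    """Split documents into masters and duplicates by content hash.

    The first copy seen for each hash becomes the master; later copies
    are recorded against that master so they can be excluded from review.
    """
    master_for_hash = {}
    duplicates = defaultdict(list)
    for doc_id, content in documents:
        digest = hashlib.sha256(content).hexdigest()
        if digest in master_for_hash:
            duplicates[master_for_hash[digest]].append(doc_id)
        else:
            master_for_hash[digest] = doc_id
    return list(master_for_hash.values()), dict(duplicates)

# Hypothetical collection: the same message in a sent folder and an inbox.
documents = [
    ("alice/sent/1", b"Q3 numbers attached"),
    ("bob/inbox/7", b"Q3 numbers attached"),
    ("carol/inbox/2", b"Lunch?"),
]

masters, duplicates = dedupe(documents)
print(masters)
print(duplicates)
```

Keeping the duplicate list, rather than discarding the copies silently, preserves a record of who else held each document.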
4. Email threading
This technique uses contextual information embedded in the metadata of email files to re-organize them into conversation threads. That means reviewers can see the chronological progression of a conversation, which can make a big difference in the speed and accuracy of an email review.
Imagine, for example, that your reviewers identify a “hot” keyword in a question posed in the first email in a conversation between three participants. Later in the thread, the second and third participants reply, “Yes,” and “Can you elaborate?”
Reviewed separately without the context provided by email threading, those second and third messages would seem insignificant. But if a reviewer read them in order, she would immediately know to flag them as responsive. In fact, the whole thread could be flagged for closer examination.
Threading can also enable teams to review fewer emails overall by focusing on the last email in a thread, which often contains a record of the entire conversation.
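To make the mechanics concrete, here is a small sketch that assumes each message carries a message ID and, for replies, the ID of the message it answers (email headers such as Message-ID and In-Reply-To carry this kind of information). The messages, field names, and keyword are hypothetical, and production threading logic handles many more edge cases.

```python
from collections import defaultdict

# Hypothetical messages; "sent" is a simple ordering value.
messages = [
    {"id": "<1@ex>", "in_reply_to": None, "sent": 1,
     "body": "Should we extend the promotion discount?"},
    {"id": "<2@ex>", "in_reply_to": "<1@ex>", "sent": 2, "body": "Yes"},
    {"id": "<3@ex>", "in_reply_to": "<2@ex>", "sent": 3, "body": "Can you elaborate?"},
    {"id": "<9@ex>", "in_reply_to": None, "sent": 4, "body": "Lunch?"},
]

def build_threads(messages):
    """Group message ids by thread root, in chronological order."""
    by_id = {m["id"]: m for m in messages}

    def root_of(message):
        visited = set()  # guards against malformed reply loops
        while message["in_reply_to"] in by_id and message["id"] not in visited:
            visited.add(message["id"])
            message = by_id[message["in_reply_to"]]
        return message["id"]

    threads = defaultdict(list)
    for m in sorted(messages, key=lambda m: m["sent"]):
        threads[root_of(m)].append(m["id"])
    return dict(threads)

def flag_threads(messages, threads, keyword):
    """Return roots of threads where any message mentions the keyword."""
    by_id = {m["id"]: m for m in messages}
    return [
        root for root, ids in threads.items()
        if any(keyword.lower() in by_id[i]["body"].lower() for i in ids)
    ]

threads = build_threads(messages)
flagged = flag_threads(messages, threads, "discount")
print(threads)
print(flagged)
```

Because the flag applies to the whole thread, the short "Yes" and "Can you elaborate?" replies are pulled in along with the message that contains the keyword.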
5. Custodial analysis
EDRM defines a custodian as “[A] person having administrative control of a document or electronic file.”
While the author of an email is the sender, the custodian of that email is the person with access to the mailbox file which contains the message—and they’re not always the same. At many businesses, the custodian might be an IT administrator, as opposed to an account rep.
A custodial analysis can determine if a custodian possesses data from any of the important people in a case. It can also determine if multiple custodians are in possession of duplicate documents, preventing unnecessary over-collection.
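Both questions can be sketched with a few lines of grouping logic. The example assumes each collected item is described by its custodian, its author, and a content hash; the names and hash values below are placeholders.

```python
from collections import defaultdict

# Hypothetical holdings: (custodian, author, content hash).
holdings = [
    ("IT Admin", "Employee F", "h1"),
    ("IT Admin", "Employee G", "h2"),
    ("Custodian D", "Employee F", "h1"),
    ("Custodian D", "Vendor X", "h3"),
]

def custodians_holding(holdings, key_people):
    """Map each custodian to the key people whose data they hold."""
    result = defaultdict(set)
    for custodian, author, _ in holdings:
        if author in key_people:
            result[custodian].add(author)
    return dict(result)

def shared_documents(holdings):
    """Map each content hash held by more than one custodian to its holders."""
    holders = defaultdict(set)
    for custodian, _, digest in holdings:
        holders[digest].add(custodian)
    return {digest: who for digest, who in holders.items() if len(who) > 1}

coverage = custodians_holding(holdings, {"Employee F", "Employee G"})
overlap = shared_documents(holdings)
print(coverage)
print(overlap)
```

In this toy data, the IT administrator already holds everything from both key employees, so collecting the same document again from Custodian D would add nothing new.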
6. Search term filtering
Just as in review, data can be pared down to some degree by keyword searches during early data assessment. One key advantage of doing so is minimizing the per-GB processing fees charged by eDiscovery providers.
Let’s say you’re looking into financial documents involving three people. You also know that one of those people frequently exchanged gigantic spreadsheets about fantasy football with one of the mailbox custodians, and those spreadsheets are clearly irrelevant to the discovery request. To avoid the cost of bringing these files into discovery, you could create search filters that match any email with a subject including “ESPN Fantasy Football” or a sender domain of “espn.com” and exclude the matches from review.
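A filter like that could look something like the following sketch. The email records, addresses, and sizes are invented, and the matching is a plain case-insensitive substring check, not the search syntax of any specific product.

```python
def is_excluded(email, subject_phrases, sender_domains):
    """True if the subject contains an excluded phrase or the sender's
    domain is on the exclusion list (case-insensitive)."""
    subject = email["subject"].lower()
    domain = email["sender"].rsplit("@", 1)[-1].lower()
    return any(p.lower() in subject for p in subject_phrases) or domain in sender_domains

# Hypothetical emails with attachment sizes in megabytes.
emails = [
    {"sender": "league@espn.com", "subject": "Your weekly recap", "size_mb": 1},
    {"sender": "pat@example.com",
     "subject": "FW: ESPN Fantasy Football draft sheet", "size_mb": 48},
    {"sender": "pat@example.com", "subject": "Q4 ledger", "size_mb": 3},
]

phrases = ["ESPN Fantasy Football"]
domains = {"espn.com"}

kept = [e for e in emails if not is_excluded(e, phrases, domains)]
excluded_mb = sum(e["size_mb"] for e in emails if is_excluded(e, phrases, domains))

print([e["subject"] for e in kept])
print(excluded_mb)
```

Totaling the size of the excluded items gives a quick estimate of how much data never has to be processed.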
7. Selective sets
Finally, advanced EDA tools often include “selective set” functionality, which enables data experts to filter data using clever combinations of all the techniques above.
What makes selective sets truly powerful is the ability to create a selective set of other selective sets—a process sometimes referred to as “stacking.”
For example, you may have two sets with conditions like this:
- Set 1: Emails sent by or to Employee F or Employee G with XLS, XLSX, or CSV attachments
- Set 2: Emails kept by Custodian D before June 1, 2015 with PDF, PSD, TIFF, DICOM, XCF, PPT, or PPTX file extensions
With the right software tools, a data analyst can search for documents that match Set 1 but do not match Set 2, or documents that match both Set 1 and 2.
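Stacking maps naturally onto ordinary set operations, as the sketch below suggests. It assumes the file extensions in the second set refer to attachments and that "before June 1, 2015" refers to the sent date; the email records themselves are hypothetical.

```python
from datetime import date

# Hypothetical email records.
emails = [
    {"id": "e1", "sender": "Employee F", "recipients": ["Employee H"],
     "custodian": "Custodian D", "sent": date(2015, 3, 2),
     "attachments": ["budget.xlsx", "scan.pdf"]},
    {"id": "e2", "sender": "Employee H", "recipients": ["Employee G"],
     "custodian": "Custodian A", "sent": date(2015, 7, 9),
     "attachments": ["export.csv"]},
    {"id": "e3", "sender": "Employee H", "recipients": ["Employee J"],
     "custodian": "Custodian D", "sent": date(2015, 1, 15),
     "attachments": ["slides.pptx"]},
]

def extensions(email):
    """Upper-cased attachment extensions for one email."""
    return {n.rsplit(".", 1)[-1].upper() for n in email["attachments"] if "." in n}

def in_set_1(email):
    people = {email["sender"], *email["recipients"]}
    return bool(people & {"Employee F", "Employee G"}) and \
        bool(extensions(email) & {"XLS", "XLSX", "CSV"})

def in_set_2(email):
    wanted = {"PDF", "PSD", "TIFF", "DICOM", "XCF", "PPT", "PPTX"}
    return (email["custodian"] == "Custodian D"
            and email["sent"] < date(2015, 6, 1)
            and bool(extensions(email) & wanted))

set_1 = {e["id"] for e in emails if in_set_1(e)}
set_2 = {e["id"] for e in emails if in_set_2(e)}

only_set_1 = set_1 - set_2   # matches Set 1 but not Set 2
both_sets = set_1 & set_2    # matches both Set 1 and Set 2

print(sorted(only_set_1), sorted(both_sets))
```

Because each selective set is just a collection of document IDs, the result of one combination can be fed into another, which is what makes stacking possible.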
In summary, using a broad range of EDA techniques and selective sets can help you greatly minimize your data burden and avoid a never-ending eDiscovery horror story. In our era of digital evidence, a little data science know-how can save a whole lot of money.
Have a question about EDA techniques? Contact our Data Strategy team for more information or a case-specific consultation.