Between collecting and preserving electronically stored information (ESI) and reviewing and producing it, many people overlook the critical steps involved in processing the data. If you remain oblivious to the ediscovery data processing workflow, you will miss opportunities to gain better insight into the collected files, to minimize review and production costs, and to streamline the logistics of the project. Here are the steps to keep in mind for a successful ediscovery processing workflow.
As a legal practitioner, you understandably want to start looking at files and documents as soon as possible so you can start developing your legal strategy. You want to see who was talking to whom, what they were saying, and generally gain a better, high-level understanding about the people, places, and events involved in the matter. This Early Case Assessment (ECA) also involves helping your clients understand the legal risks as well as the costs of the matter.
In today’s litigation world, a massive cost component revolves around the amount of data that must be collected, processed, reviewed, and produced. We refer to this as Early Data Assessment (EDA), where the data processing phase gives you the ability to comprehend the amount of data involved in the matter so you can better inform your client about the time, effort, and costs required.
Legal teams may prefer to use specialized ECA software to get the full benefits of the processing stage, especially when dealing with large data volumes. For example, Nextpoint recently launched Data Mining, an Early Case Assessment software that offers all the tools you need to follow this ediscovery processing workflow and dive deep into your data.
Step 1: Normalizing Your Data (There’s No Such Thing as Normal Data!)
The first step in an ediscovery processing workflow is to “normalize” all the collected data so that the review is consistent and straightforward. For example, when you’re looking at a mix of email, Word documents, PDF files, pictures, sound recordings, spreadsheets, and more, you need to be able to view, read, and listen to all of those files in an approachable manner. It seems like it shouldn’t be complicated, but it’s important to accurately identify every file type so each one can be properly formatted for review.
For email, we also have to ensure all attachments are linked to their proper messages (what we call the parent/child relationship). Even more important, we need to “normalize” the time zones associated with all the messages. If we processed every email according to the time zone where you practice law, you might find emails that appear to have been sent after they were received, which is obviously confusing (and even more so once Daylight Saving Time or international time zones enter the picture). For this reason, we usually process everything according to Coordinated Universal Time (UTC), and you’ll need to be comfortable with that format.
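To see why UTC normalization matters, here’s a minimal Python sketch. The timestamps are invented, and fixed offsets stand in for full DST-aware time zone handling:

```python
from datetime import datetime, timezone, timedelta

# Hypothetical example: the same email exchange recorded in two local zones.
chicago = timezone(timedelta(hours=-6))  # fixed UTC-6 offset for brevity
tokyo = timezone(timedelta(hours=9))     # fixed UTC+9 offset

sent_local = datetime(2023, 3, 1, 17, 30, tzinfo=chicago)   # sender in Chicago
received_local = datetime(2023, 3, 2, 8, 31, tzinfo=tokyo)  # recipient in Tokyo

# On their faces, the email looks like it was received "a day after" it was
# sent. Normalizing both to UTC makes the true ordering unambiguous.
sent_utc = sent_local.astimezone(timezone.utc)
received_utc = received_local.astimezone(timezone.utc)

print(sent_utc.isoformat())      # 2023-03-01T23:30:00+00:00
print(received_utc.isoformat())  # 2023-03-01T23:31:00+00:00
print(sent_utc < received_utc)   # True
```

In UTC, the two timestamps turn out to be one minute apart, even though the local wall-clock dates differ by a day.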
Additionally, we have to make sure any zipped/compressed files are uncompressed and properly listed. And we have to extract any embedded objects that might have been inserted into Microsoft Word documents or Excel spreadsheets. Another important step is to assign each file a “DocumentID” or control number so we can provide analytics and audit trails in the platform. Note this is NOT a Bates number, since those are typically assigned when you generate a production set.
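As an illustration, control-number assignment can be as simple as numbering files sequentially. The `assign_control_numbers` helper and the `CTRL` prefix below are hypothetical, not any real platform’s scheme:

```python
def assign_control_numbers(filenames, prefix="CTRL"):
    # Sequential DocumentIDs support analytics and audit trails. These are
    # NOT Bates numbers, which are applied later when a production set is
    # generated.
    return {f"{prefix}{i:07d}": name for i, name in enumerate(filenames, start=1)}

ids = assign_control_numbers(["memo.docx", "budget.xlsx", "photo.jpg"])
print(ids["CTRL0000001"])  # memo.docx
```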
Step 2: Metadata Extraction and File Culling (De-Mystifying the Content)
While lawyers are understandably focused on reading the content of emails and documents, it’s critical that all of the metadata from those files is properly extracted so that it can all be populated into a database. The processing stage extracts all the information from the From, To, CC, BCC fields, along with the Sent & Received dates/times, the Subject line, and several more properties such as whether the message was opened or replied to, and what conversation thread it belongs in. Having all the metadata extracted into a spreadsheet-like database view means you can easily sort and filter data to focus on just the communications you need to investigate.
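As a rough illustration of header extraction, Python’s standard `email` library can pull these fields out of a raw message and flatten them into a database-style record. The message below is invented for the example:

```python
import email
from datetime import timezone
from email.utils import parsedate_to_datetime

raw = b"""From: alice@example.com
To: bob@example.com
Cc: carol@example.com
Subject: Q3 forecast
Date: Wed, 01 Mar 2023 17:30:00 -0600

Numbers attached.
"""

msg = email.message_from_bytes(raw)

# Flatten the key header fields into one row of a sortable, filterable table.
record = {
    "from": msg["From"],
    "to": msg["To"],
    "cc": msg["Cc"],
    "subject": msg["Subject"],
    "sent_utc": parsedate_to_datetime(msg["Date"]).astimezone(timezone.utc).isoformat(),
}
print(record["sent_utc"])  # 2023-03-01T23:30:00+00:00
```

Note that the Date header is converted to UTC here, consistent with the normalization step above.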
But even before you look at the metadata, there are several critical filters that the data must go through so you’re not wasting time looking at files that don’t contain any content. There are hundreds of thousands of computer “system” files that sometimes get swept up in the collection process, and there is typically no reason you need them for the purposes of litigation.
The data processing stage will “De-NIST” and cull out those files. The “NIST” here refers to the National Institute of Standards and Technology, which maintains the National Software Reference Library (NSRL) that catalogs the digital signatures of files in known software applications. Any software executable file or other system file that would appear as gibberish in a review database is De-NISTed according to that official list.
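Conceptually, De-NISTing is a hash lookup against the NSRL list. The sketch below uses a tiny made-up hash set and MD5 for brevity; real De-NISTing runs file hashes against the full NSRL Reference Data Set:

```python
import hashlib

# Hypothetical stand-in for the NSRL hash set; the real library catalogs
# the signatures of millions of known-software files.
nsrl_hashes = {
    "5d41402abc4b2a76b9719d911017c592",  # MD5 of b"hello" (pretend system file)
}

def is_nist_file(content: bytes) -> bool:
    """True if this file's hash matches a known system/software file."""
    return hashlib.md5(content).hexdigest() in nsrl_hashes

files = {"setup.dll": b"hello", "memo.txt": b"Meet at noon."}
kept = {name for name, data in files.items() if not is_nist_file(data)}
print(kept)  # {'memo.txt'}
```

The known system file is culled; only the user-generated document survives into review.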
Your ediscovery processing workflow should also include deduplicating files, and this is where you need to provide some input to your vendor. Let’s say you’ve collected emails from 10 different individuals/custodians, and you realize each of those 10 individuals may have received the same email – do you want to read that same email message 10 times? Or would you rather the duplicates be removed with an indicator to each individual who received that message? These are important decisions you need to discuss with your vendor, who can help you understand your options so you get what works best for your review needs.
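A simplified sketch of that dedup-with-tracking approach follows. Real systems typically hash a normalized combination of metadata and body, not just the body text as here:

```python
import hashlib
from collections import defaultdict

# Invented sample collection: three custodians, one duplicated message.
emails = [
    {"custodian": "alice", "body": "Budget approved."},
    {"custodian": "bob",   "body": "Budget approved."},
    {"custodian": "carol", "body": "Lunch?"},
]

unique = {}                      # one review copy per content hash
custodians = defaultdict(list)   # who held each duplicate group

for msg in emails:
    h = hashlib.sha1(msg["body"].encode()).hexdigest()
    if h not in unique:
        unique[h] = msg
    custodians[h].append(msg["custodian"])

# Reviewers read each message once, but still see every custodian who had it.
for h, msg in unique.items():
    print(msg["body"], "->", custodians[h])
```

The reviewer reads “Budget approved.” once instead of twice, and the indicator still records that both alice and bob held a copy.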
Lastly, this is the step where any non-searchable files are OCR’d so they are readable and searchable in the platform. Some scanned paper documents or images contain text that humans can read easily, but a computer must first recognize that text before it becomes searchable. A computer can attempt to OCR handwriting, but the results won’t be perfect, which means your searches may be incomplete.
Step 3: Indexing and Searching (You Can’t Search What You Don’t Index)
When attorneys think about “searching” documents, they envision typing in a word and having the computer check for that word in every single file. You can’t be blamed for visualizing the task that way, but the reality is that it would take so long for a computer to search every document that it would be a time-wasting disaster.
Instead, when you type in a word and hit the search button, the computer is actually scanning an “index” or dictionary of words that has been generated based on all the words found in the files during the data processing stage. That way, it’s only searching for words found in the files you collected, and it only has to inquire with that index rather than laboriously explore every document every single time. This is much more efficient and gets you the results you’re looking for in fractions of a second. The index knows every file where a word is found, and so it can highlight your search terms in the files during your review.
But there’s a flip side: to keep searches fast, many search indexes ignore the most common words such as “and,” “to,” and “is.” These “noise” words or “stop” words appear at a vastly higher rate than all other words. Since we’re rarely searching for those conjunctions, determiners, and prepositions, the indexes simply skip them.
This is standard procedure, but you should be aware of these limitations if you ever come across a situation where you might need to search for those specific words. Craig Ball has an excellent example that in most ediscovery document review platforms, you won’t be able to find the phrase “to be or not to be” even if you put it in quotations, because those are noise words that would not be indexed in the data processing phase.
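The indexing-plus-stop-words behavior can be sketched in a few lines of Python. The stop-word list and documents here are invented for illustration; real indexes are far more sophisticated:

```python
STOP_WORDS = {"and", "to", "is", "the", "or", "not", "be", "a"}

docs = {
    "DOC001": "The merger is approved and final",
    "DOC002": "Merger talks to resume",
}

# Build an inverted index: each word maps to the set of documents holding it.
index = {}
for doc_id, text in docs.items():
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue  # noise words are never indexed
        index.setdefault(word, set()).add(doc_id)

print(sorted(index["merger"]))  # ['DOC001', 'DOC002']
print("to" in index)            # False -- "to be or not to be" is unfindable
```

A search for “merger” consults only this small dictionary, never the documents themselves, which is why results come back in fractions of a second. And because “to,” “be,” “or,” and “not” were skipped at indexing time, no query can ever retrieve them.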
At this step, you should consider proactively giving your provider (like Nextpoint) a list of keywords or search terms that you’re interested in so you can receive a “hit report” after processing. This report can be helpful to show you how many occurrences of certain words are found in the data and allows you to filter chosen keywords before diving directly into a manual review.
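At its core, a hit report is a count of keyword occurrences per document. Here’s a toy sketch; the sample documents and search terms are made up:

```python
import re
from collections import Counter

docs = {
    "DOC001": "The merger was discussed. Merger terms attached.",
    "DOC002": "Lunch next week?",
}
keywords = ["merger", "terms", "fraud"]

hits = Counter()
docs_with_hits = {kw: set() for kw in keywords}

for doc_id, text in docs.items():
    for kw in keywords:
        # Whole-word, case-insensitive matching.
        n = len(re.findall(rf"\b{re.escape(kw)}\b", text, re.IGNORECASE))
        if n:
            hits[kw] += n
            docs_with_hits[kw].add(doc_id)

for kw in keywords:
    print(f"{kw}: {hits[kw]} hits in {len(docs_with_hits[kw])} docs")
```

A report like this tells you immediately that “fraud” yields zero hits, so you can refine your terms before anyone starts a manual review.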
Step 4: Data Mining and Analytics (Examine What The Data is Telling You)
Lastly, an ediscovery processing workflow should enable you to take advantage of deep-dive analysis of your data. Computational tools, like Nextpoint’s Data Mining, can be used to highlight interesting or significant patterns in your data to provide you with better angles to approach review. There are several advanced tools utilizing artificial intelligence (AI), machine learning (ML), natural language processing (NLP), and a host of other mind-blowing technologies. Just ask your vendor what basic analytical tools they have that can help you.
For example, immediately after data processing, Nextpoint provides you with a set of statistics on how many files and documents you’re faced with, how many email messages, how many attachments, and how many email threads or conversations in total. You can also view a visual, interactive timeline of files and email messages so you can focus on a specific date range. There are data widgets that break down the different file types found in your data, as well as the authors and email domains.
In addition to these features, Nextpoint offers Data Mining, the new groundbreaking technology for Early Case Assessment and comprehensive data analysis. The app generates snapshots of key themes in your data and offers advanced search features that can be used to create custom visual reports. As volumes of electronic data explode in the legal field, advanced tools like this are becoming key parts of handling potential evidence in litigation.
All of these tools provide you with a much better place to start your review rather than just blindly diving into a collection of files and clicking the first one, then the next, then the next. These analytics are available in the processing stage, and it would be incredibly beneficial for you to inquire with your platform provider about what tools they can provide for analyzing your data.
An Ediscovery Processing Workflow To Simplify Your Data Load
As you can see, the “processing” stage of ediscovery has a lot more happening under the hood, and while some of these tasks are standard and run-of-the-mill, it’s also important that you become comfortable with all the options and processes so you can make the best decisions for your clients and their data. With these strategies, there’s no need to be overwhelmed by discovery data – you can simplify and understand it before diving into document review.
Data Mining: The Processing Tool of the Future
Data Mining is Nextpoint’s new technology for ediscovery processing and Early Case Assessment. It can process massive amounts of data at speeds 30 times faster than current technologies. Click here to learn how Data Mining can simplify your ediscovery data.