Within eDiscovery, processing comes after collection and before review. This step is all about extracting metadata from the collected data: who created a document, when it was created, its file format and size, and so on. That metadata helps legal teams organize a seemingly endless sea of data into the right buckets so they can make informed decisions about what to do next.
Unfortunately, there are many things that can go wrong within the processing stage that an untrained eye wouldn’t notice. Some teams load data into a program, click a few buttons, and tada! They get their coveted metadata. Without a forensics data engineer dotting the I’s and crossing the T’s, you could be missing out on mission-critical information and not even realize it until further along in the discovery process.
Data engineers have the knowledge and experience to identify, isolate, and remedy a wide variety of processing errors. Handling these errors appropriately ensures that the maximum amount of text and metadata is extracted from the source data. Keeping an eye out for the common rookie mistakes below can help you catch them early, saving time and money.
Mistake 1: “If there was an error, I would’ve gotten an error message!”
Errors can exist in an imaging set even when the processing tool never throws an error message. A data engineer can identify the affected documents based on fielded metadata and remedy the imaging issues using third-party applications. Proceeding without isolating these errors can result in incorrect or blank OCR text for documents.
Potential Fixes:
To avoid some of these mistakes, familiarize yourself with all available system fields and error messages, and with when to look at each. That knowledge is a huge help in isolating incorrect imaging and OCR (optical character recognition) results.
Additionally, an experienced data engineer will recognize when the same issues crop up on the same file types over and over again. Rather than manually searching every time data is processed, set up saved searches keyed on metadata fields (File Type, File Description, Doc Extension, etc.). Those searches will surface documents that likely have errors without your needing to sift through every system field and error message.
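As a minimal illustration of that idea (the export path, column names, and "suspect" extensions here are all hypothetical and will vary by tool), a short script can flag documents that match known trouble patterns, such as file types that have historically imaged poorly or documents whose OCR text came back empty:

```python
import csv

# Hypothetical export of processing metadata; column names will vary by tool.
METADATA_EXPORT = "processing_metadata.csv"

# File types that, in past projects, have imaged or OCR'd poorly (illustrative only).
SUSPECT_EXTENSIONS = {".msg", ".xlsb", ".one", ".dwg"}

def flag_suspect_documents(path):
    """Return document IDs matching patterns commonly tied to imaging/OCR problems."""
    flagged = []
    with open(path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            extension = row.get("DocExtension", "").lower()
            ocr_length = int(row.get("OcrTextLength", "0") or 0)
            # Known-problem file types, or documents that imaged but produced no text.
            if extension in SUSPECT_EXTENSIONS or ocr_length == 0:
                flagged.append(row.get("DocID", "<unknown>"))
    return flagged

if __name__ == "__main__":
    print(flag_suspect_documents(METADATA_EXPORT))
```

The same logic can live in a saved search inside the processing tool itself; the point is that the criteria are written down once and reused on every load, not rebuilt by hand each time.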
Mistake 2: “Maybe I did get an error message, but the data’s still all here. I should be able to move on to review now.”
Well… not exactly. ZIP files are a common format within forensics, so extracting them in a forensically sound manner is a big part of processing. Oftentimes there is an enormous number of ZIP files in a case, so it’s easy for a few corrupt files to fall through the cracks.
Much the same way a large haystack with a needle in it looks identical to one without, the output of 10,000 perfectly converted files can look very similar to the output of 9,999 perfectly converted files plus one corrupt file. Data engineers know how to confirm whether all of the data was actually extracted or whether files are missing, and whether the extracted content is intact or corrupt. With a few files this may be obvious even to the inexperienced, but with thousands of files it is easy for issues to slip by unless you know exactly what to look for.
Potential Fixes:
A good place to start is running a “sanity check”: compare the properties reported inside the ZIP file (file count, folder count, and total size) before extraction against the same properties of the extracted data. This comparison can either confirm that everything extracted cleanly or shed light on corrupt files and inconsistencies before they make it any further in the discovery process.
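As a rough sketch of that comparison (assuming a plain ZIP archive and a hypothetical extraction folder; folder counts are left out because tools differ in how they record directory entries), a short script can tally what the archive claims against what actually landed on disk:

```python
import os
import zipfile

def archive_properties(zip_path):
    """File count and total uncompressed bytes as reported inside the archive."""
    with zipfile.ZipFile(zip_path) as archive:
        files = [entry for entry in archive.infolist() if not entry.is_dir()]
        return len(files), sum(entry.file_size for entry in files)

def extracted_properties(root_dir):
    """File count and total bytes actually present on disk after extraction."""
    file_count = total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root_dir):
        file_count += len(filenames)
        total_bytes += sum(os.path.getsize(os.path.join(dirpath, name)) for name in filenames)
    return file_count, total_bytes

if __name__ == "__main__":
    # Hypothetical paths; substitute the archive and extraction target from your matter.
    expected = archive_properties("evidence_set.zip")
    actual = extracted_properties("evidence_set_extracted")
    print("match" if expected == actual else f"mismatch: archive {expected} vs disk {actual}")
```

A mismatch does not always mean corruption (nested archives and empty folders can skew the totals), but it tells you exactly where to look before the data moves downstream.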
Mistake 3: “We’ve removed all the duplicates thanks to our metadata. Now we can throw the dupes out and move on to review.”
All eDiscovery professionals are familiar with deduping (or at least the good ones are). Figuring out which documents are duplicates allows teams to better understand the scope of their review needs. How many attorneys are needed to review the documents before a deadline? How costly will that be? In some cases it may even help determine whether a client is better off litigating or settling out of court, so an accurate picture of how many documents are duplicates is crucial.
However, discarding duplicates too early in the process can sometimes come back to haunt you. Some clients ask us which documents in a workspace other vendors or internal team members have labeled as duplicates, and understandably so: incorrect deduplication can lead to drastically different decisions than a team would make if they had the correct information.
Potential Fix:
The solution is to run custom SQL scripts that can scan an eDiscovery environment and find the documents a rookie might have thrown out. From there, the metadata can be double-checked to confirm whether or not those documents are in fact duplicates.
Ensuring your team is familiar with the backend SQL tables of your processing tool is an enormous benefit. The more comfortable a data engineer is with the backend, the more flexible and time-efficient their custom solutions will be.
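The exact schema depends entirely on the platform, but as a rough sketch (assuming a SQL Server backend and an invented Document table with MD5Hash and DedupeStatus columns), a script along these lines could surface hash groups whose duplicate flags don't add up:

```python
import pyodbc  # assumes a SQL Server backend reachable via ODBC

# Hypothetical connection string and table/column names; every platform's schema differs.
CONNECTION_STRING = (
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=edisco-sql;DATABASE=Workspace;Trusted_Connection=yes"
)

# For a group of N identical hashes, we typically expect N-1 copies to be
# flagged as duplicates (one master kept). Anything else deserves a second look.
QUERY = """
SELECT MD5Hash,
       COUNT(*) AS Copies,
       SUM(CASE WHEN DedupeStatus = 'Duplicate' THEN 1 ELSE 0 END) AS FlaggedCopies
FROM Document
GROUP BY MD5Hash
HAVING COUNT(*) > 1
   AND SUM(CASE WHEN DedupeStatus = 'Duplicate' THEN 1 ELSE 0 END) <> COUNT(*) - 1
"""

def find_suspect_dedupe_groups():
    with pyodbc.connect(CONNECTION_STRING) as connection:
        cursor = connection.cursor()
        cursor.execute(QUERY)
        return cursor.fetchall()

if __name__ == "__main__":
    for md5_hash, copies, flagged in find_suspect_dedupe_groups():
        print(f"{md5_hash}: {copies} copies, {flagged} flagged as duplicates")
```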
Mistake 4: “My role is eDiscovery processing. When it comes time for production, that’s someone else’s problem.”
An all-too-common issue for both legal and eDiscovery professionals is not taking a holistic approach to discovery. They focus solely on the piece of the puzzle they’re responsible for, without a strong sense of how that piece fits into the bigger picture. Data processing pros understand that it will eventually come time to produce this data, and those productions have to adhere to specific, previously agreed-upon requirements. That could mean customized slipsheets, metadata formatting, production field creation, custom file-naming procedures, and much more.
Potential Fix:
Make sure you’re communicating with the people on your team who will be handling the rest of discovery after you’re done with processing. Ask what file formats they’ll need, and learn as much as you can about the “big picture” goals of the case. Constantly learning and “getting into the weeds” on both the front end and back end of your processing tools will expand the number of tricks up your sleeve. With those tricks, data engineers can get the job done in a timely fashion, where less experienced processing specialists may hit limits on what they can achieve and spend more time than they have.
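As one small illustration of how those agreed-upon requirements translate into processing work, here is a minimal sketch of a custom file-naming pass using a Bates-style sequence. The prefix, padding, and folder names are invented examples; the real values come from the production specification the parties negotiated.

```python
import shutil
from pathlib import Path

def bates_rename(source_dir, output_dir, prefix="ABC", start=1, padding=7):
    """Copy produced files into output_dir with sequential Bates-style names.

    The prefix, starting number, and zero-padding are placeholder values;
    substitute whatever the production specification actually requires.
    """
    output = Path(output_dir)
    output.mkdir(parents=True, exist_ok=True)
    number = start
    for source in sorted(Path(source_dir).iterdir()):
        if not source.is_file():
            continue
        new_name = f"{prefix}{number:0{padding}d}{source.suffix}"
        shutil.copy2(source, output / new_name)
        number += 1
    return number - start  # how many files were named

if __name__ == "__main__":
    count = bates_rename("production_staging", "production_volume_001")
    print(f"Named {count} files")
```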