Data experts: Yes, Hillary’s emails can be deduped in 9 days

November 8, 2016 Joe Barbato

The FBI reviewed all of the 650,000 emails from the laptop belonging to ex-Congressman Anthony Weiner. Read about how they were able to finish the process so quickly.


FBI Director James Comey aroused suspicion late on the Sunday before Election Day when he announced that the agency had quickly reviewed a reported store of 650,000 emails recovered from a laptop belonging to disgraced ex-Congressman Anthony Weiner (husband of top Hillary Clinton aide Huma Abedin) and reconfirmed that Clinton would face no criminal charges. Donald Trump and his supporters were suspicious: how would it be possible to review so many emails in such a short time?

After the New York Times, Wired and even Edward Snowden confirmed that deduplication was at the core of the rapid review, we spoke with John Kosturos of RingLead to get more details on the data science behind the speedy review.

How could the FBI examine 650,000 emails and be sure that these are the same ones that they already had?

Kosturos: You’re talking about finding patterns in subject lines, email addresses and the body text of an email. What we do in identifying duplicates in a database is to run algorithms on field values to see if they’re identical or if they’re very similar, and this is similar to deduping a database. If you have access to the data in the emails, where the subject is in the same place every time, the email address is available and the body of the email is available, then you could run basic pattern-matching algorithms on those emails en masse and do a real-time crawl through that inbox. Once you have the data, you can process it quickly because it’s just a series of checks: Is the subject exactly the same? Is it the same email? Are the first 10 words the same? Running a technology like that on the database, it probably only takes a few hours to crunch the results and flag any emails that were duplicates based on the script that was written.
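A minimal sketch of the kind of field-by-field check described above, written in Python. The record layout (`sender`, `subject` and `body` keys), the function names and the sample data are assumptions made for illustration; nothing here reflects the FBI's or RingLead's actual tooling.

```python
def match_key(email, n_words=10):
    """Build a comparison key: sender, subject, and the first n_words of the body."""
    sender = email["sender"].strip().lower()
    subject = email["subject"].strip().lower()
    first_words = " ".join(email["body"].lower().split()[:n_words])
    return (sender, subject, first_words)


def find_duplicates(known, candidates):
    """Return the candidates whose key matches an email that is already known."""
    seen = {match_key(e) for e in known}  # set membership is O(1) on average
    return [e for e in candidates if match_key(e) in seen]


# Hypothetical sample data.
known = [
    {"sender": "alice@example.com", "subject": "Schedule",
     "body": "Please print the schedule."},
]
candidates = [
    {"sender": "Alice@Example.com ", "subject": "schedule",
     "body": "Please  print the schedule."},
    {"sender": "bob@example.com", "subject": "Lunch",
     "body": "Noon on Friday?"},
]
duplicates = find_duplicates(known, candidates)
```

Because each email is reduced to a key and looked up in a set, the work grows roughly linearly with the number of emails, which is one reason a job of this size could plausibly run in hours rather than weeks.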

If the data is not in such a consistent format, does that add a wrinkle to the process? If they came from Hillary Clinton’s server, they may have been downloaded by a different email client. Some were recovered, not from Hillary’s server, but from the people that she was emailing with.

Data can be in all sorts of formats. If you think about a company name, it can be spelled 20 or 30 different ways. If we’re trying to identify companies that are duplicates in a database, before we actually even do the lookup, we standardize the values, and we’ve got some patented technology that will basically get rid of headings and move the words around.

The goal is to identify the true convention for that particular piece of data. If you’re talking about identifying emails that are similar, there may be some formatting that’s different or it may be in a different type of file. But once you can break it down into, “hey, this piece is what the subject would be,” and strip out any stray characters at the beginning or the end, you can again process that for the pattern matching. Some of the data needs standardization or normalization that you would want to apply prior to searching, because the matching wouldn’t pick up emails in different formats if you didn’t strip out what was not uniform.
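As a rough illustration of that normalization step, the Python sketch below cleans a single field before it is compared. The function name `normalize_field` and the particular set of characters stripped from the ends are assumptions chosen for the example, not a description of any real product.

```python
import re
import unicodedata


def normalize_field(value):
    """Reduce a subject line or address to a canonical form for comparison."""
    # Fold compatibility characters (e.g. non-breaking spaces) to plain forms.
    text = unicodedata.normalize("NFKC", value)
    # Collapse runs of whitespace, including tabs and newlines, to one space.
    text = re.sub(r"\s+", " ", text)
    # Strip stray characters at the beginning or the end (assumed set).
    text = text.strip(" \"'<>[]")
    # Case-insensitive comparison.
    return text.casefold()


# Hypothetical examples of the same value arriving in different shapes.
subject_a = normalize_field('  "Schedule   for\tTuesday" ')
subject_b = normalize_field("Schedule\u00a0for Tuesday")
address = normalize_field("<Alice@Example.com>")
```

Here `subject_a` and `subject_b` end up identical, so an exact-match comparison would treat them as the same subject even though the raw strings differ.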

Would that be a time-consuming process if you’re talking about potentially 650,000 emails?

It just depends on how many different patterns you’re normalizing against. If you had four or five different formats, then you only have to create four or five different formats in your language. If you have millions of variations, you’d have to create millions of variations. But for hers, assuming there’s only three, four, five different formats that the data could be in, it wouldn’t take long to create that normalization structure. Then, once that’s finished, running the script against the emails? Two, three, four hours, tops, is what I would say it takes to actually execute the job.
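One way to picture "four or five different formats" is one small parser per format, each producing the same common record. The Python sketch below assumes two hypothetical source formats: a raw message with standard `From:` and `Subject:` headers, and a JSON export whose field names (`from_address`, `title`, `text`) are invented for the example.

```python
import json
from email import message_from_string


def from_raw_message(raw):
    """Parse a plain, single-part message with standard headers."""
    msg = message_from_string(raw)
    return {
        "sender": msg["From"],
        "subject": msg["Subject"],
        "body": msg.get_payload().strip(),
    }


def from_json_export(raw):
    """Parse a hypothetical JSON export that uses different field names."""
    data = json.loads(raw)
    return {
        "sender": data["from_address"],
        "subject": data["title"],
        "body": data["text"].strip(),
    }


# One entry per known format; supporting a new format means adding one parser.
PARSERS = {"raw": from_raw_message, "json": from_json_export}


def to_common_record(raw, source_format):
    """Convert an email in any supported format to the common record shape."""
    return PARSERS[source_format](raw)


raw_message = "From: alice@example.com\nSubject: Schedule\n\nPlease print the schedule.\n"
raw_json = ('{"from_address": "alice@example.com", "title": "Schedule", '
            '"text": "Please print the schedule."}')
record_a = to_common_record(raw_message, "raw")
record_b = to_common_record(raw_json, "json")
```

Once both copies are in the same shape, `record_a` and `record_b` compare equal, and the effort of supporting more sources scales with the number of formats rather than the number of emails.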

So when you hear that the FBI reviewed that volume of emails over the course of a week and came to this conclusion, that doesn’t raise any red flags in your mind or make you suspicious about whether that was a practical task for them to do?

No, not at all. I think that they can do it. The level of computing power and the data scientists that they have, they would be able to do that pretty quickly.

Would it surprise you that that volume of emails would turn out to all be perfectly matched? That they would come back in a week and say, “Really, there’s just nothing different. This is all clear.”

If it’s somebody else’s computer, they might not all be the same, but they could be parts of email streams where somebody gets added to the stream. Possibly a piece of the message is similar but not the whole thing. I would say there’s probably some overlap when the person was added to an email chain or forwarded a message.
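Partial overlap of this kind, where a known message reappears inside a longer forward or reply, can be measured rather than matched exactly. The Python sketch below uses word shingles and a containment score; this is a common near-duplicate technique swapped in for illustration, not a method the interview attributes to anyone, and the function names and sample text are assumptions.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def containment(known_body, candidate_body, k=3):
    """Fraction of the known message's shingles that appear in the candidate."""
    known_set = shingles(known_body, k)
    if not known_set:
        return 0.0  # too short to compare
    return len(known_set & shingles(candidate_body, k)) / len(known_set)


# Hypothetical sample text.
original = "please print the attached schedule for tomorrow"
forwarded = "fyi see below " + original
unrelated = "lunch at noon on friday works for me"

score_forwarded = containment(original, forwarded)
score_unrelated = containment(original, unrelated)
```

A score near 1.0 suggests the known message is fully contained in the candidate (as with the forward above), while a score near 0.0 suggests new material that a reviewer would need to read.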

Could there be uniformity if the machine that they’re looking at was just perfectly synchronizing with another database, i.e., email on another server?

Highly possible. With cloud computing, you can pretty much share a profile on any machine. I have people that are my assistants, they’ll log into my email so they can send emails for me and they’re on a different machine. With cloud computing, it wouldn’t be too far-fetched.

Does anything in this scenario tap into the kind of work that you do at RingLead? Are there any analogies to the sales and marketing universe?

For us, it’s educating all sales and marketing people that time is talent and energy is their money. By investing a small amount of time working on data (deduping it, normalizing it, giving it a format that they can actually digest and utilize), they’ll have more time to actually go out and work on that information. What RingLead does is help solve highly complex data challenges, which requires a high level of expertise. If you don’t have that, you could spend hundreds or thousands of hours on a task that could be automated into just a few hours.

This FBI story is just a good example of a highly complex data challenge. If you went through and manually looked up all those emails, it would take you hundreds of hours, but if you work with a company like RingLead that has a background in data science and matching data, we can get it to a more digestible state in a much faster time.

The post Data experts: Yes, Hillary’s emails can be deduped in 9 days appeared first on RingLead.
