
Guest Post: E-Discovery and The Enron E-Mail Dataset Research

10.21.09

GUEST BLOG FROM VICTORIA VANBUREN

Before Dave Grant joined Gardere as the Director of e-Discovery, he was responsible for e-Discovery at Enron in the last few years before its total meltdown, managing more than 1.25 million documents. While at Enron, Dave responded to more than 100 subpoenas from various states and federal agencies. The Enron database has become a focal point of e-Discovery research. This Guest Blog about the Enron database is part of a bigger picture regarding academic research into developing efficient tools to improve e-Discovery.

 

I welcome Victoria VanBuren as the first Guest Blogger with her blog concerning the Enron e-mail database. Victoria runs the DISPUTING blog with Karl Bayer in Austin and has a great knack for posting interesting blogs and finding blogs on important topics. She is also a co-founder of, and an active participant in, the LinkedIn Commercial and Industry Arbitration and Mediation Group. In addition to being a lawyer, Victoria is working on a degree in computer science, so I’m sure we will see Guest Blogs from her in the future.

 

GUEST POST: E-DISCOVERY AND THE ENRON E-MAIL DATASET RESEARCH

 

By Victoria VanBuren  

 

The U.S. Supreme Court’s grant of certiorari to former Enron CEO Jeffrey Skilling dominated the news headlines last week. Interestingly, the Federal Energy Regulatory Commission (FERC), during its investigation into Enron’s involvement in the energy crisis of 2000-01, made available to the public a large database called the “Enron Corpus.” This dataset consists of about half a million e-mail communications from former Enron senior executives and energy traders.

 

Enron E-mail Dataset Research

 

Because of its size and public status, the Enron Corpus is a rare and valuable tool for experimenting with text classification methods. Since FERC posted it to the web, this dataset has been the subject of research by the computer science departments of several universities, including the Massachusetts Institute of Technology and Stanford University. In the summer of 2009, the team at TREC Legal Track, an organization co-sponsored by the U.S. Department of Defense, started conducting research on the Enron Corpus with the purpose of improving large-scale search techniques.

 

Our Research – Bayesian Text Classifier

 

In the spring of 2009, computer science students at Texas State University (David Villarreal, Thomas McMillen, Andrew Minnick, and I), under the supervision of computer forensic expert Wilbon Davis, used the Enron Corpus to train a Bayes-based algorithm to classify the Enron e-mails as relevant or irrelevant to a given legal issue. This type of algorithm is commonly used by e-mail spam filters.
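For readers curious about how a Bayes-based classifier of this kind works, here is a minimal sketch in Python. It is not the team's actual code: it assumes a multinomial naive Bayes model with add-one smoothing, and the class name, the tokenizer, and the four training e-mails are hypothetical examples made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing.

    Illustrative sketch only; not the implementation used in the study.
    """

    def __init__(self):
        self.word_counts = {}        # label -> Counter of word frequencies
        self.doc_counts = Counter()  # label -> number of training e-mails
        self.vocabulary = set()      # every word seen during training

    def train(self, labeled_emails):
        """Count words per label from (text, label) pairs."""
        for text, label in labeled_emails:
            self.doc_counts[label] += 1
            counts = self.word_counts.setdefault(label, Counter())
            for word in tokenize(text):
                counts[word] += 1
                self.vocabulary.add(word)

    def log_score(self, text, label):
        """Return log P(label) + sum of log P(word | label)."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        counts = self.word_counts[label]
        denominator = sum(counts.values()) + len(self.vocabulary)
        for word in tokenize(text):
            if word in self.vocabulary:  # words never seen in training are skipped
                score += math.log((counts[word] + 1) / denominator)
        return score

    def classify(self, text):
        """Pick the label with the highest log score."""
        return max(self.doc_counts, key=lambda label: self.log_score(text, label))


# Hypothetical, hand-labeled training e-mails (not taken from the Enron Corpus).
training = [
    ("please adjust the california power schedule before the price cap", "relevant"),
    ("trading desk wants to move power out of california tonight", "relevant"),
    ("lunch on friday at the usual place", "irrelevant"),
    ("the fantasy football draft is friday after lunch", "irrelevant"),
]

classifier = NaiveBayesClassifier()
classifier.train(training)
print(classifier.classify("california power price"))
print(classifier.classify("friday lunch"))
```

A spam filter uses the same idea with the labels "spam" and "not spam"; here the labels are simply "relevant" and "irrelevant" to the legal issue at hand.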

 

The Results

 

The team hoped that this mathematical approach would achieve better accuracy levels than the roughly 20% found using Boolean keyword searching, a method employed by many lawyers. Surprisingly, the Bayesian filter correctly identified known relevant e-mails at average rates ranging between 43% and 66%. And as expected, the accuracy on irrelevant e-mails was even higher, with averages ranging between 44% and 77%. Texas State University published the Technical Report last week and it can be downloaded for free here.
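The relevant and irrelevant accuracy figures above are reported separately for each class. The sketch below shows one way such per-class figures can be computed. It assumes that accuracy for a class means the share of e-mails with that known label that the classifier labeled correctly; the post does not spell out the exact definition, and the labels and predictions in the example are invented for illustration.

```python
from collections import Counter


def per_class_accuracy(true_labels, predicted_labels):
    """Return, for each known label, the fraction of e-mails labeled correctly."""
    totals = Counter(true_labels)
    correct = Counter(
        truth for truth, guess in zip(true_labels, predicted_labels) if truth == guess
    )
    return {label: correct[label] / totals[label] for label in totals}


# Hypothetical known labels and classifier output for nine e-mails.
truth = ["relevant"] * 4 + ["irrelevant"] * 5
predicted = [
    "relevant", "relevant", "irrelevant", "irrelevant",                    # 2 of 4 relevant found
    "irrelevant", "irrelevant", "irrelevant", "irrelevant", "relevant",   # 4 of 5 irrelevant found
]

results = per_class_accuracy(truth, predicted)
print(results)
```

Reporting the two classes separately matters because irrelevant e-mails usually far outnumber relevant ones, so a single overall accuracy number can hide poor performance on the relevant set.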

 

 

 

