Vals AI Releases Benchmarking Report Assessing Capabilities of Top Legal GenAI Platforms

Once the initial hype around generative AI started to subside and AI conversations in the legal industry matured, much attention shifted to evaluating the true efficacy of GenAI for legal work. 

In October 2024, artificial intelligence benchmarking company Vals AI, which regularly benchmarks public enterprise LLMs, announced it was undertaking a new benchmarking study to evaluate the accuracy and efficiency of some of the legal industry’s most-used GenAI platforms, with support from Legaltech Hub.

“As soon as Jeroen and I met Vals we knew there was a huge opportunity to survey legal AI tools. Firms were doing a ton of work to validate that these actually worked before even being able to start assessing their suitability for use cases,” Legaltech Hub CEO and Co-Founder Nikki Shaver said. “It didn’t make sense for firms to be doing that work—independent benchmarking was necessary.” 

Today, Vals AI released the results of its benchmarking study. 

The Vals Legal AI Report (VLAIR) assesses four legal AI platforms on seven discrete skills. The participating platforms were Harvey Assistant by Harvey, CoCounsel by Thomson Reuters, Oliver by Vecflow, and Vincent AI by vLex. The seven skills assessed were: Data Extraction, Document Q&A, Document Summarization, Redlining, Transcript Analysis, Chronology Generation, and EDGAR Research.

As part of the terms of participation, vendors were allowed to withdraw their products from evaluation for any single task or from the entire study prior to publication. Harvey withdrew from the EDGAR Research task, while vLex withdrew from the Chronology Generation task. 

Initially, the study was intended to include Legal Research as a skill. That skill will now be addressed in a separate report, which will include all of the same vendors, as well as Lexis+AI by LexisNexis, which withdrew from all other skills and therefore does not appear in this first report.

Prior to VLAIR, the most well-known benchmarking study of GenAI legal tools was "Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools," published by Stanford HAI in May 2024. That study found that popular AI-powered legal research platforms from LexisNexis and Thomson Reuters had significant hallucination rates. Its methodology drew controversy, however, prompting follow-up studies and a fair amount of industry speculation.

The Vals study was already in its planning stages when the Stanford HAI study came out, Shaver explained. “When that was released, we felt it had caused a problem of trust in the market. The Vals study became even more important, and vendors were keen to be involved because they recognized the benefit that greater transparency would bring them as well as buyers,” she said. 

The Methodology 

The study began by identifying key legal tasks where GenAI was thought to have the most potential impact, based on input from participating law firms (the Consortium Firms), which included Reed Smith, Fisher Phillips, McDermott Will & Emery, Ogletree Deakins, and others. The Consortium Firms created over 500 questions across the assessed tasks, along with ideal answers to those questions.

“It made sense to involve law firms,” Shaver noted. “We know that this kind of study only makes sense if we are using real data and testing on tasks that lawyers would actually undertake in their practice.” 

The study aimed to prioritize real legal work over purely academic exercises, evaluating AI tools in real-world legal contexts to provide meaningful insights into their capabilities and limitations for law firms. 

To assess the quality of legal generative AI tools, the study established a Lawyer Baseline by evaluating work produced by independent lawyers working without AI. Cognia Law sourced lawyers skilled in the relevant tasks, who answered the questions formulated by the Consortium Firms under real-world conditions. The lawyers were unaware they were participating in a benchmarking study, had two weeks to complete their assignments, and reported their time so it could be used for latency comparisons. Some tasks involved reviewing documents ranging from 100 to 900 pages.

Vals AI then used an automated evaluation system to score responses, ensuring consistency and eliminating human bias. Manual scoring would have taken over 400 hours. The AI evaluation model assessed correctness by comparing responses against predefined elements, determining pass/fail verdicts with explanations. Accuracy scores were calculated based on passed checks, with specialized methods for Data Extraction and EDGAR Research. 
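To make the scoring mechanics concrete, here is a minimal sketch of how rubric-style automated grading of this kind can work. This is an illustration only, not Vals AI's actual evaluation code; the criteria, data structures, and names below are hypothetical. Each response is compared against a set of predefined elements, each check receives a pass/fail verdict with an explanation, and the accuracy score is the fraction of checks passed:

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    criterion: str     # the predefined element being checked
    passed: bool       # pass/fail verdict from the evaluator
    explanation: str   # the evaluator's rationale for the verdict

def accuracy_score(checks: list[CheckResult]) -> float:
    """Return the fraction of predefined checks a response passed."""
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)

# Hypothetical example: one response graded against three rubric elements.
results = [
    CheckResult("Identifies the governing-law clause", True, "Clause cited correctly."),
    CheckResult("States the termination notice period", True, "30-day period stated."),
    CheckResult("Flags the indemnification carve-out", False, "Carve-out not mentioned."),
]
print(f"Accuracy: {accuracy_score(results):.0%}")  # -> Accuracy: 67%
```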

Overview of the Results 

Overall, the study’s results support the conclusion that legal GenAI tools have value for lawyers and law firms, though some room for improvement remains.  

A few of the key summary findings include:  


  • Harvey received the top score among the participating AI tools on five tasks and placed second on one task. It outperformed the Lawyer Baseline on four tasks.

  • CoCounsel was the only other tested GenAI tool to receive a top score, and it consistently ranked among the top-performing tools on its other evaluated tasks.

  • The Lawyer Baseline outperformed the AI tools on two tasks (Redlining and EDGAR Research) and matched the best-performing tool on one task (Chronology Generation). In the four remaining tasks, at least one AI tool surpassed the Lawyer Baseline. 


EDGAR Research was one of the most challenging tasks, and only one tool, Oliver by Vecflow, ultimately opted in for evaluation. It did not beat the Lawyer Baseline. This suggests that challenging research tasks remain an area where legal GenAI tools still fall short of meeting law firm expectations, a notion that will be tested more fully in the second iteration of VLAIR, which will focus solely on legal research.

The study results also suggest that, for the time being, lawyers should remain actively involved in Redlining and Chronology Generation, given the GenAI tools' weaker performance in these areas.

In contrast, GenAI tools showed the most potential in Document Summarization and Transcript Analysis as compared to the Lawyer Baseline, with all tools outperforming the lawyers on these tasks. Some tools also surpassed the Lawyer Baseline in Data Extraction and Document Q&A, again indicating potential value for law firms in these areas.

While some of the tested GenAI tools responded much more quickly than others, all of them, unsurprisingly, performed dramatically faster than the lawyer control group: six times faster at their slowest and 80 times faster at their fastest. Even where a tool's output quality is not the highest, these GenAI tools can therefore be a significant time-saver for lawyers when used as a starting point for more efficient work.

You can download the full report here.

Going forward, Vals plans to undertake further iterations of VLAIR. In addition to the upcoming study focused solely on legal research, plans are underway to add new vendors and skills and to expand jurisdictional coverage. Lawyers, firms, and vendors interested in future participation should connect with Vals at contact@vals.ai.
