29 April 2022

Contract migration and data extraction – could this become your own success story?



Companies seeking to get a handle on their contracts, track the salient terms and data points, and create templates for contract types, clause libraries, workflows, and more: what can they do? Ignore the task ahead and face worse problems later? If you need someone who can guide you in the right direction, you might consider the insights that follow. Samir Bhatia faces questions like these, and they're typical, he says.

What is legacy contract migration and why do I need to do it?

If you are looking to install a Contract Lifecycle Management (CLM) system [1], what good is it without any data, or with documents merely stored in it as if it were a document repository? That setup is not much better than a shared folder, and as such it delivers little value.

The best way to get the maximum benefit is to migrate the legacy documents, a.k.a. already executed contracts, into the CLM after making them text-searchable through Optical Character Recognition (OCR) [2]. You need to associate these documents with an account within the CLM. To do this, you first need to know the following attributes of each contract:

  • its contract type and the contracting party;
  • its effective date;
  • whether it is automatically renewed;
  • what the initial term is; and
  • what the renewal term is, etc.

And to answer each one, you need to:

  • extract this meta-data (attributes) from within the contracts,
  • associate it with the document and the account name,
  • create the contractual hierarchy and ingest it all into the CLM.

The CLM can then give you reports on this data customized for you, trigger renewals, even track obligations and rights -- assuming that data has been accurately extracted and ingested (absorbed, documented) into the CLM.

What attributes do I extract and ingest into a CLM during the legacy contract migration process?

There is no single correct answer to this, and it varies by company. Many like to know everything within ONE contract, but for legacy contract migration this mindset needs to change. Ask instead: which reports are you asked to produce regularly or occasionally? Most CLMs require some standard attributes, such as mandatory fields, plus some basic attributes that you should track (e.g., a set of attributes that will reveal whether a specific contract is current on any given date).

For example, you might need to determine the following contract attributes:

  1. Contract type – NDA, MSA, SOW, Order Form, etc.
  2. Party name – Your entity name, as you may have different entities, including companies you have purchased or merged with
  3. Counterparty name(s) – Who is the contract with?
  4. Effective date – What date was the contract effective from?
  5. Initial term – The term (years/months) starting from the effective date
  6. Initial term end date – When does the initial term expire?
  7. Renewal term period – In months or days
  8. Term type – Auto-renew? Perpetual? Evergreen?
  9. Termination for convenience?
  10. Master effective date – Taken from amendments, to match them to their masters and create the contract hierarchy

The above apply to all contract types. But clients may want to track many other items (Force Majeure? Limitation of Liability? etc.). A detailed in-house discussion among groups needs to be conducted for this. Brightleaf does have a scoping spreadsheet that would facilitate this discussion.
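The ten basic attributes listed above can be pictured as one record per contract. The following is a minimal sketch, not a CLM vendor's schema: the class name, the field names, and the term-type labels are illustrative assumptions, and it assumes the initial term ends on the same calendar day it started, which individual contracts may define differently.

```python
import calendar
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


def add_months(start: date, months: int) -> date:
    """Return the date `months` after `start`, clamping the day to the month's length."""
    index = start.month - 1 + months
    year = start.year + index // 12
    month = index % 12 + 1
    day = min(start.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


@dataclass
class ContractRecord:
    # Hypothetical field names for the ten attributes in the list above.
    contract_type: str                  # e.g. "NDA", "MSA", "SOW"
    party_name: str
    counterparty_names: List[str]
    effective_date: date
    initial_term_months: int
    renewal_term_months: Optional[int]
    term_type: str                      # assumed labels: "fixed", "auto-renew", "perpetual"
    termination_for_convenience: bool
    master_effective_date: Optional[date] = None  # only set for amendments

    @property
    def initial_term_end_date(self) -> date:
        # Derived rather than stored, so it cannot drift from the other fields.
        return add_months(self.effective_date, self.initial_term_months)

    def is_current(self, on: date) -> bool:
        """Simplified check: ignores termination notices and non-renewal."""
        if on < self.effective_date:
            return False
        if self.term_type in ("auto-renew", "perpetual"):
            return True
        return on <= self.initial_term_end_date


record = ContractRecord(
    contract_type="MSA",
    party_name="Example Holdings",          # made-up names for illustration
    counterparty_names=["Example Supplier"],
    effective_date=date(2019, 10, 19),
    initial_term_months=12,
    renewal_term_months=None,
    term_type="fixed",
    termination_for_convenience=True,
)
print(record.initial_term_end_date, record.is_current(date(2021, 1, 1)))
```

Deriving the end date from the effective date and the term, instead of extracting it as a third independent value, is one design choice among several; it keeps the "is this contract current?" report consistent by construction.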

Can I extract and track my business-specific information?

Yes, and in fact, 30-40 percent of the information that Brightleaf extracts is client business-specific information. For example, for a Class 1 Railroad [3] company, we extracted, among other attributes, the mileage markers at which a stretch of railroad track started and ended, as well as the length of track in miles and who was obligated to maintain that stretch of track.

Word of advice: don't let AI software limit the information that you require and would like to track or report upon. Ask yourself, what is most important to you? If your list of attributes grows too fast and exceeds your budget, some of these attributes can be deferred to a “phase two” of the project, keeping the first phase for the contracts and clients that are most relevant to you.

For example, when we extracted the length of track and the mileage markers for our Class 1 Railroad company [4], neither our AI software nor our process for understanding these attributes limited the effort. We processed them all using the phase 1 and phase 2 approach.

Can Artificial Intelligence (AI) software handle it all?

How can legal teams gain a better understanding of the information included in all their contracts? Whether AI can do it all is the most asked question, if not one simply assumed to have the answer "yes"! And it is a very pertinent one, especially for legal professionals who may hesitate to drop their darling child, litigation, into the hands of AI.

And yet legal is not wrong to be cautious. True, AI software is essential for meta-data extraction from legacy contracts. But AI alone is not enough for a complete solution. So, what else do you need on top of AI? The answer is two more dimensions:

  1. legacy contract review by qualified individuals (lawyers); and
  2. a stringent process (people/process).

With the combination of AI software, trained lawyers, and a stringent process, highly accurate data can be extracted. After all, putting bad data into a large CLM investment could be damaging and, at the very least, a waste of time.

So, to further answer why AI software is necessary, but not sufficient, consider these possible nightmares:

  • Smudgy scans of documents. If the document has handwritten or smudgy information that will not OCR accurately even with the best OCR/ICR engines, what can the AI software do? Nothing! You must have someone look at the erring documents to correct this information.  Time-consuming!
  • What if the contract is missing information? Let’s say the effective date is missing, partial, or incorrect: someone put in 2021 instead of 2022, an obvious mistake, especially at the beginning of the year. Or someone omitted the month. Mistakes like these can lead to your needing legal review. Expensive!
  • What if there is conflicting information? One part of the document says that the contract is effective as of 10/19/19 for one year. And the termination clause says that this contract will terminate on 5/25/21. Which is true?  You need a thorough legal review. More expensive!

AI software can correctly handle most scenarios once it has encountered them. But exceptions and interpretations can only be handled by qualified legal personnel.
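The missing-date and conflicting-date cases above are the kind of exception that software can detect but not resolve. Here is a minimal sketch of such a check, using the example figures from the list; the function names and flag wording are hypothetical, and 10/19/19 is assumed to mean 19 October 2019. The check only routes the contract to a reviewer; deciding which date governs remains a legal judgment.

```python
from datetime import date
from typing import List, Optional


def add_years(start: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        return start.replace(year=start.year + years, day=28)


def review_flags(effective: Optional[date],
                 term_years: Optional[int],
                 stated_end: Optional[date]) -> List[str]:
    """List the reasons a contract should be routed to a legal reviewer."""
    flags = []
    if effective is None:
        flags.append("effective date missing")
    if term_years is None and stated_end is None:
        flags.append("no term or end date found")
    if effective and term_years is not None and stated_end:
        # Compare the end date implied by the term with the one stated elsewhere.
        computed_end = add_years(effective, term_years)
        if computed_end != stated_end:
            flags.append(
                f"term implies {computed_end.isoformat()} but the termination "
                f"clause says {stated_end.isoformat()}"
            )
    return flags


# Effective 10/19/19 for one year, yet the termination clause says 5/25/21.
for reason in review_flags(date(2019, 10, 19), 1, date(2021, 5, 25)):
    print("needs legal review:", reason)
```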

The other dimension is process, but what does this mean?  A stringent process requires trained legal personnel to review EVERY attribute after the AI software finishes its job. There are other elements of a stringent process as well.

Let’s say I have 300,000 (300k) documents! What do I migrate into a new CLM system?

The first reaction of most clients we work with is to assume that EVERY document and EVERY data point needs to be extracted and everything migrated into a new CLM. This is great until you factor in the cost and effort of the project. Migrating 300k contracts into CLM, with an average of ten basic attributes, turns into three million data points that first need to be extracted!

As discussed above, the only way to get accurate results is to configure the AI software for the extraction and couple it with a legal team to check EVERY attribute against the original contract after extraction. A HUGE task with 3 million data points to be checked -- not to mention these:

  • de-duplicating the documents [5];
  • segregating documents into different contract types (MSAs, Order forms, SoWs, etc.);
  • removing unwanted documents, drafts, partially signed documents; and
  • creating contractual hierarchy, etc.

When you set out to save cost and time, remember: don’t expect to throw AI software at the problem and see perfect results!
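Of the preparatory tasks above, de-duplication is the most mechanical. One common approach, sketched here as an assumption rather than as Brightleaf's method, is to group files by a hash of their contents. The function names are hypothetical, and the technique only catches byte-identical copies: two separate scans of the same paper contract would hash differently and still need other means of comparison.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes, read in chunks to keep memory use flat."""
    hasher = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def group_duplicates(root: Path) -> Dict[str, List[Path]]:
    """Map each digest shared by two or more files under `root` to those files."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {digest: paths for digest, paths in groups.items() if len(paths) > 1}
```

Calling `group_duplicates(Path("contracts"))` on a hypothetical `contracts` folder returns only the groups with more than one file, so a reviewer can keep one copy from each group and set the rest aside.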

So, the question remains: how can you break down this volume of 300,000 documents with three million attributes? Here are some things to consider.

  1. Are all the documents contracts that need to be in your CLM? Typically, only PDFs are signed documents. Can JPGs, MSG files, Doc and XLS files, etc. be ignored?
  2. Are they all signed documents?
  3. Are there duplicates?
  4. Are there partially signed documents?
  5. Is there a set of clients or contracts that I need to see in the system immediately?  Are they all current clients? Or are some from the 90s that don’t even exist anymore?

The answer: Take a subset of the documents based on the above. Then pare that down further: “Maybe I can start with my highest-spend clients” or “Maybe I can start from a certain year of contract signature and work forward to the present.”

Phase the project, both at the document level and in the number of attributes. You can start by extracting just the basic attributes across all the clients. If that is cost-prohibitive, then take a subset of attributes only for the high-spend clients. Or start with some important document types (maybe MSAs and SoWs are more important than NDAs). Then expand on the attributes and documents as the budget frees up.
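The effect of phasing on the review workload is simple multiplication, which makes it easy to compare options before committing budget. In the sketch below, the full-scope figures (300,000 documents, ten attributes) come from the example above, while the phase-one figures are purely hypothetical placeholders to be replaced with your own counts.

```python
from typing import List, NamedTuple


class Phase(NamedTuple):
    name: str
    documents: int
    attributes_per_document: int


def review_workload(phases: List[Phase]) -> int:
    """Total data points to extract and then check against the original contracts."""
    return sum(p.documents * p.attributes_per_document for p in phases)


# Figures from the example above: every document, ten basic attributes each.
FULL_SCOPE = [Phase("everything at once", 300_000, 10)]

# Hypothetical first phase: a subset of high-spend clients' MSAs and SoWs.
PHASE_ONE = [Phase("high-spend MSAs and SoWs", 20_000, 10)]

print(f"{review_workload(FULL_SCOPE):,}")  # 3,000,000
print(f"{review_workload(PHASE_ONE):,}")   # 200,000
```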

For the “irrelevant” documents, you could simply push them into the agreement record field as supporting documents without any metadata extracted, so that at least all of them are stored in the system.

How do I prepare my legacy documents for ease of migration into a CLM?

The larger the company, the more the contracts – that is a given. The larger the company, the more the divisions. The larger the company, the more these divisions work in silos. All companies run on contracts – buy-side and sell-side.

How can you organize all the contracts?

Create a shared repository, either a shared network folder, SharePoint, or another document sharing platform. For each department, division, or geographic location, assign a person who is responsible for “herding” the contracts from different individuals.

Design a consistent folder structure – e.g.

  • \All Contracts\Country\Division\MSA\Masters (or Amendments) or
  • \All Contracts\Country\Division\Client Name\MSA (or Supplier etc.).
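A consistent folder structure also pays off later, because the folder a document sits in already tells you some of its attributes. As a minimal sketch of the first layout above, using Python's `pathlib` (the function names and the country, division, and file values are hypothetical examples):

```python
from pathlib import PureWindowsPath
from typing import Dict

ROOT = PureWindowsPath(r"\All Contracts")


def contract_folder(country: str, division: str, contract_type: str, level: str) -> PureWindowsPath:
    """Build \\All Contracts\\Country\\Division\\<type>\\<Masters or Amendments>."""
    return ROOT / country / division / contract_type / level


def describe(path: PureWindowsPath) -> Dict[str, str]:
    """Read the attributes back from where a document was filed."""
    country, division, contract_type, level = path.relative_to(ROOT).parts[:4]
    return {
        "country": country,
        "division": division,
        "contract_type": contract_type,
        "level": level,
    }


folder = contract_folder("US", "Freight", "MSA", "Masters")
print(folder)                             # \All Contracts\US\Freight\MSA\Masters
print(describe(folder / "example.pdf"))
```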

Assign the responsible person a deadline to collect all the documents. During that time, some curation can be done: de-duplicating, or removing unsigned or partially signed documents. The aim is to decide which documents need to be migrated into the CLM.

With the documents now collected centrally, work with all groups to decide the data points (attributes) that need to be extracted from these to migrate into the CLM agreement record.

Success Story

Class 1 Railroad Company Gets an All-New Technique for Contract Compliance

The client is a Class 1 railroad company with rail tracks in North America that provides freight and passenger train services covering most of the United States of America.

Problem client faced

The company always intended to digitize their legacy contracts to better manage their contractual obligations hidden in thousands of contracts, some of which were hard to trace because of the large number of documents.  Management knew that as time passed, the problem would increase.

Digging through these files every time for information was extremely tedious, because about 500,000 legacy contracts existed and some of them dated back to the 19th century.  The company had no idea how much revenue existed within those legacy contracts and they had no clue which contract was due or when its maintenance obligation had expired.

This turmoil caused them to start looking for a reliable vendor to help them with this urgent business need. They selected Brightleaf for its reputation in the marketplace.

Brightleaf solution

Brightleaf management spent time with the company’s business process owners to understand the business requirements. Brightleaf used its proprietary software, its staff, and its own process to extract vital information from the client's contracts, ranging from Industrial Track Maintenance and Rail Crossing Agreements to Haulage and Real Estate.

After the client approved Brightleaf’s understanding of its business requirements, Brightleaf customized its engine, using Natural Language Processing and other semantic techniques, to automate the extraction of metadata from each type of contract. Using its team of experts and following its Six Sigma methods, Brightleaf delivered high-quality, error-free data.

The project not only helped the company manage its obligations better but also helped it track lost revenue. The client realized that some of these legacy contracts had not been billed for years, so it gained huge revenue after Brightleaf analyzed the documents and provided the information to the company.


  1. What is a Contract Lifecycle Management (CLM) system?  SAP Ariba article
  2. What is OCR?  NECC article
  3. Definition of Class 1 railroad companies.  Wikipedia
  4. Class 1 Railroad company
  5. De-duplication: What Is It and Why Should I Use It?  JD Supra article (Farrell Fritz, P.C.)


Samir Bhatia, Founder of Brightleaf Solutions, is an experienced entrepreneur with a strong and extensive track record in client management, product management, sales and marketing, product development, and operations.  His focus is on the introduction and growth of technology businesses (products and services).  He has a strong interest in working with companies that have, or intend to set up, an offshore (India) services/technology arm.


Brightleaf Solutions, Inc. delivers Artificial Intelligence (AI) powered solutions for data extraction from contracts using their own proprietary semantic intelligence and natural language processing technology.  They use internal software applications to extract and migrate legacy data into a Contract Lifecycle Management (CLM) system for tracking and reporting. Their broader scope of work involves customizing data for all types of contracts including all meta-data, terms and conditions, legal provisions, and obligations.  Extracted data-points are checked by the legal team against the original documents using a Six-Sigma quality process, which delivers highly accurate results.  Brightleaf was voted one of the top five data mining companies, and the only one in the legal space.

Samir Bhatia