CLIENT'S REQUEST: Proof of concept (PoC) in Machine Learning to parse multiple legal documents.



    The process of manually parsing and interpreting various contracts can be a time-consuming and error-prone task. Contracts often come in diverse formats, making it challenging for organizations to efficiently extract relevant information and ensure consistency across documents.

    Our approach harnesses the power of Machine Learning (ML) to address the challenges posed by manual contract parsing and standardization.


    We used Natural Language Processing (NLP) algorithms and techniques to label a new dataset from contract documents, convert unstructured legal contract text into structured data using trained models, and demonstrate this on a sample set of documents.

    To train an ML model that accurately parses legal contracts, we put significant effort into creating a properly labeled dataset, implementing the training and testing processes with efficient Python tools for predictive data analysis, and integrating the result into an end-to-end PoC application.

    For the purposes of this PoC, we sought to extract the following information:

    • the name or title of the agreement;
    • the effective date;
    • all the parties to the contract;
    • all the addresses listed in the contract.
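    The labeled examples behind these four targets can be sketched as character-offset annotations over raw contract text, in the style of spaCy-like NER training data. The sample text, label names, and helper below are illustrative assumptions, not the actual dataset:

    ```python
    # Illustrative only: sample contract text and label names are assumptions,
    # not taken from the real labeled dataset described in this PoC.
    text = (
        "Service Agreement effective as of January 1, 2023 "
        "between Acme Corp and Beta LLC, 100 Main St, Springfield."
    )

    def span(snippet, label):
        # Locate the snippet and return (start, end, label) character offsets.
        start = text.index(snippet)
        return (start, start + len(snippet), label)

    example = (text, {"entities": [
        span("Service Agreement", "TITLE"),
        span("January 1, 2023", "EFFECTIVE_DATE"),
        span("Acme Corp", "PARTY"),
        span("Beta LLC", "PARTY"),
        span("100 Main St, Springfield", "ADDRESS"),
    ]})
    ```

    Offset-based annotations like these let the same raw text feed any token- or span-based NER model without re-labeling.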

    Prior to training the model, the labeled dataset underwent essential preprocessing steps such as:

    1. cleaning up the data to remove any unnecessary elements;
    2. handling cases where data was missing;
    3. breaking down the text into smaller units through tokenization.
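    The three steps above can be sketched as a single preprocessing function. This is a minimal illustration using a naive regex tokenizer; the actual pipeline would use a proper NLP library's tokenizer:

    ```python
    import re

    def preprocess(raw_text):
        """Sketch of the three preprocessing steps (illustrative only)."""
        # 1. Clean up: collapse whitespace and stray layout artifacts.
        text = re.sub(r"\s+", " ", raw_text).strip()
        # 2. Handle missing data: drop empty documents entirely.
        if not text:
            return None
        # 3. Tokenize: split into word-level tokens (simple word/punctuation
        #    split; a production pipeline would use an NLP tokenizer).
        return re.findall(r"\w+|[^\w\s]", text)
    ```

    Returning `None` for empty input lets the caller filter out unusable documents before they reach the model.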

    In selecting the right NLP model, we ran a series of trials and comparisons.

    We experimented with different models by exposing them to the dataset. By comparing their performance metrics, we gained insights into their individual capabilities and selected the model that struck the best balance between resource consumption and accuracy.
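    One way to sketch that selection step: score each candidate by accuracy minus a penalty for normalized resource cost, then pick the highest score. The model names, numbers, and weighting below are hypothetical placeholders, not our measured metrics:

    ```python
    # Hypothetical trial results (illustrative numbers only):
    # each candidate maps to (accuracy, relative resource cost).
    candidates = {
        "model_a": (0.88, 1.0),
        "model_b": (0.91, 3.5),
        "model_c": (0.90, 1.4),
    }

    def balance_score(name, cost_weight=0.1):
        # Reward accuracy, penalize resource cost normalized to the
        # most expensive candidate.
        accuracy, cost = candidates[name]
        max_cost = max(c for _, c in candidates.values())
        return accuracy - cost_weight * cost / max_cost

    best = max(candidates, key=balance_score)
    ```

    The `cost_weight` knob expresses how much accuracy one is willing to trade for cheaper inference; tuning it shifts which candidate wins.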

    The model was trained on 90% of the dataset, and its performance was subsequently evaluated on the remaining 10%.
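    A 90/10 holdout split like this can be sketched in a few lines of plain Python (a library utility such as scikit-learn's `train_test_split` would typically be used instead; the seeded shuffle below is just for reproducibility of the sketch):

    ```python
    import random

    def train_test_split(examples, test_fraction=0.10, seed=42):
        """Shuffle and split, mirroring the 90/10 split described above."""
        shuffled = list(examples)
        random.Random(seed).shuffle(shuffled)  # deterministic shuffle
        cut = int(len(shuffled) * (1 - test_fraction))
        return shuffled[:cut], shuffled[cut:]

    train, test = train_test_split(range(100))
    ```

    Evaluating on held-out documents the model never saw during training is what makes the reported accuracy meaningful.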


    The results obtained by our trained NLP model are quite impressive, with an accuracy rate of around 90%.

    The accuracy of the ADDRESS labelling is slightly lower, at around 70%, which can be explained by the fact that less than half as much labelled data was available for this entity: around 150 documents.

    It’s important to note the context in which these results were obtained.

    Our dataset wasn’t particularly large to begin with, yet our model still managed to perform remarkably well. In addition, the documents we worked with spanned a wide range of legal areas, which posed a significant challenge in uncovering meaningful patterns and relationships.

    Given these factors, our ability to achieve such high accuracy rates underscores the robustness of the model and highlights the effectiveness of our approach in dealing with the intricacies of diverse, real-world legal documents.


