Machine Learning Based Antibody Design

Technology #19148

Image Gallery
The integration of novel machine learning and data generation enables a new framework for antibody design. (Left panel) Antibodies contain three hypervariable complementarity-determining regions (CDRs) that are major determinants of their target affinity and specificity. (Middle panel) We employ machine learning methods to iteratively improve antibody designs through a cycle of testing antibody affinity against targets and controls, labeling the sequencing data from these distinct populations, using these sequencing data to train our models, and creating novel antibodies to test by model generalization and high-throughput oligonucleotide synthesis. (Right panel) Deep learning has been successfully adapted to biological tasks and can infer functional properties directly from sequence.

Panning results are consistent across replicates and can separate antibody sequences by affinity. (Left panel) CDR sequences have almost identical enrichment from Pre-Pan to Pan-1 across two technical replicates. We plot the counts of sequences obtained by concatenating the three CDR sequences as representative proxies for each underlying complete antibody sequence. (Middle panel) Antibody sequences that were not enriched in Pan-1 compared to Pre-Pan were labeled non-binders. (Right panel) Antibody sequences that were enriched in Pan-1 were assigned one of three labels: weak-binders (B), mid-binders (C), and strong-binders (D), depending upon their enrichment in Pan-1.

A CNN outperforms other methods in classifying weak vs. strong binders, and performance improves with more training data. (Left panel) A CNN (seq_64x2_5_4) outperforms other methods in identifying high binders, and performance is random when training labels are randomly permuted, showing that the CNN is not simply memorizing the input. (Right panel) Training on random downsamplings of the training data shows a monotonic increase in classification performance with increasing amounts of training data.

A CNN accurately predicts the binding affinity to influenza hemagglutinin. Each point represents a sequence held out from training. The x-axis denotes the observed binding affinity and the y-axis shows the prediction from a CNN trained to predict affinity to influenza hemagglutinin from amino acid sequence.

A CNN can identify sequences with higher scores than it has observed in training. (Left panel) Our optimal CNN (seq_64x2_5_4), when trained on labeled B and C antibody sequences, was able to distinguish D sequences from held-out C sequences. On the test set, novel D sequences have a higher median score than C sequences (Mann-Whitney U test p-value = 1.4x10^-2). (Right panel) ROC classification performance for training on labeled B and C and testing on held-out C vs. D using CNN and KNN machine learning methods and a CNN control with permuted training labels.

A CNN can suggest novel high-scoring sequences. We used the optimal CNN (seq_64x2_5_4) trained on labeled B and C antibody sequences to suggest alternative residues that would lead to higher-scoring sequences starting from a high-scoring sequence (below x-axis). The suggestions are summarized above the axis with residue letters proportional in size to their suggested probability of incorporation.
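The captions describe labeling sequences by their enrichment between the pre-panning and post-panning rounds, with enriched sequences split into B, C, and D classes. A minimal sketch of that kind of labeling is shown below. The pseudocount, the log2 thresholds, the function names, and the example sequences and counts are all illustrative assumptions and do not come from the source.

```python
import math

def log2_enrichment(pre_counts, post_counts, pseudocount=1.0):
    """Log2 ratio of post-panning to pre-panning frequency per sequence.

    The pseudocount (an assumption) avoids division by zero for
    sequences that are absent from one of the two rounds.
    """
    seqs = set(pre_counts) | set(post_counts)
    pre_total = sum(pre_counts.values()) + pseudocount * len(seqs)
    post_total = sum(post_counts.values()) + pseudocount * len(seqs)
    result = {}
    for seq in seqs:
        pre_freq = (pre_counts.get(seq, 0) + pseudocount) / pre_total
        post_freq = (post_counts.get(seq, 0) + pseudocount) / post_total
        result[seq] = math.log2(post_freq / pre_freq)
    return result

def label(enrichment, weak_max=1.0, mid_max=3.0):
    """Map a log2 enrichment to a class; thresholds are hypothetical."""
    if enrichment <= 0:
        return "non-binder"  # not enriched after panning
    if enrichment <= weak_max:
        return "B"           # weak binder
    if enrichment <= mid_max:
        return "C"           # mid binder
    return "D"               # strong binder

# Made-up read counts for four hypothetical concatenated CDR sequences.
pre = {"ARDGYW": 900, "ARDSYW": 50, "ARDTYW": 40, "ARDWYW": 10}
post = {"ARDGYW": 100, "ARDSYW": 80, "ARDTYW": 200, "ARDWYW": 620}

labels = {seq: label(e) for seq, e in log2_enrichment(pre, post).items()}
```

In practice the class boundaries would be chosen from the observed enrichment distribution rather than fixed in advance as they are here.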
Professor David Gifford
Department of Electrical Engineering and Computer Science, MIT
Haoyang Zeng
Department of Electrical Engineering and Computer Science, MIT
Ge Liu
Department of Electrical Engineering and Computer Science, MIT
Managed By
Jon Gilbert
MIT Technology Licensing Officer
Patent Protection

Machine Learning Based Antibody Design

Provisional Patent Application Filed

Applications

This technology is a high-throughput methodology for designing antibodies or proteins for use in a variety of therapeutic and diagnostic settings. These settings range from the development of chimeric antigen receptors for T cells (CAR-T cells) to the engineering of highly specific antibodies or proteins for diagnostic and/or therapeutic purposes.

Problem Addressed

Existing antibody design methods use randomization schemes to create new peptide sequences and then test their affinity for targets. However, these methods are time-consuming and often expensive. Deep learning models that integrate experimental data to create antibody sequences would greatly streamline the development of effective antibodies.

Technology

To train models to identify antibody sequences with high affinity for the target molecule, an iterative process is used to improve antibody designs. First, millions of computationally designed antibody sequences are produced using large-scale commercial oligonucleotide synthesis. They are then tested with phage or yeast display assays. The resulting data are fed into various machine learning models, running on high-performance graphics processing units (GPUs), to generate new antibody sequences for testing. The antibody properties selected for in the machine learning models include improved affinity for the target protein and the absence of cross-reactivity with other proteins. This iterative framework is able to identify highly effective antibodies with a minimal number of experiments. The approach not only streamlines the antibody design process, but also gives scientists the ability to predict and improve the affinity of an antibody with a known sequence for any target of interest.
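The test, train, and propose cycle described above can be illustrated with a toy loop. This is only a sketch under stated assumptions: `simulated_assay` is a hypothetical stand-in for a phage or yeast display experiment, the per-position averaging model stands in for the deep learning models, and only affinity (not cross-reactivity) is modeled. None of the names or parameters come from the source.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def simulated_assay(seq, hidden_motif="WYRDG"):
    """Hypothetical assay: fraction of positions matching a hidden motif."""
    return sum(a == b for a, b in zip(seq, hidden_motif)) / len(hidden_motif)

def fit_model(measured):
    """Toy surrogate model: mean measured score per (position, residue)."""
    sums, counts = {}, {}
    for seq, score in measured.items():
        for key in enumerate(seq):
            sums[key] = sums.get(key, 0.0) + score
            counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}

def predict(model, seq):
    """Average the learned per-position scores; unseen residues score 0."""
    return sum(model.get(key, 0.0) for key in enumerate(seq)) / len(seq)

def mutate(seq, rng):
    """Replace one randomly chosen position with a random residue."""
    pos = rng.randrange(len(seq))
    return seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]

def design_loop(rounds=5, library_size=200, batch_size=20, length=5, seed=0):
    rng = random.Random(seed)
    measured = {}
    batch = ["".join(rng.choice(AMINO_ACIDS) for _ in range(length))
             for _ in range(batch_size)]
    best_per_round = []
    for _ in range(rounds):
        # 1. Test the current batch in the (simulated) assay.
        for seq in batch:
            measured[seq] = simulated_assay(seq)
        best_per_round.append(max(measured.values()))
        # 2. Train the surrogate model on all data collected so far.
        model = fit_model(measured)
        # 3. Propose unmeasured variants of the best sequences and keep
        #    the ones the model scores highest for the next round.
        parents = sorted(measured, key=measured.get, reverse=True)[:batch_size]
        candidates = {mutate(rng.choice(parents), rng)
                      for _ in range(library_size)}
        candidates -= set(measured)
        batch = sorted(sorted(candidates),
                       key=lambda s: predict(model, s),
                       reverse=True)[:batch_size]
    return measured, best_per_round

measured, best_per_round = design_loop()
```

Because every measurement is retained, the best observed score can only stay the same or rise from round to round. How quickly it rises depends on the surrogate model, which is deliberately simplistic here.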

Advantages

  • High-throughput testing of millions of new antibody sequences
  • Expansion of the antibody sequence space by a factor of ten to one hundred compared to current methods
  • Ability to predict and engineer sequences for CAR-T cells and other therapeutic applications
  • Faster, more effective, lower-cost method to engineer antibodies suitable for in vivo therapeutic studies