AI for protein structure: past, present and future

Last updated: March 2023

  1. Background: why do we care about protein structure?
  2. A little bit of history
  3. Exploiting sequence data
  4. Deep learning and structure prediction
  5. An introduction to AlphaFold 2
  6. Advances over AlphaFold
  7. An overview of doors opened
  8. Open challenges
  9. Conclusions
  10. Bibliography

I started these lecture notes to prepare for lectures in AI for protein structure at the Systems Approaches to Biological Science Doctoral Training Centre, and at the 58th International School of Crystallography. One thing led to another and I found myself having written twenty pages of lecture notes and nearly 100 references. The career-savvy move would have been to publish it as a review, but as I suspect it would have been outdated by the time it was in print, I instead decided to publish it as a blog post that I keep continuously updating. I hope this is useful. If you spot any errors, please contact me.

1. Background: why do we care about protein structure?

Proteins are intricate molecular machines whose function underlies all of life. Almost any impressive behaviour of biological systems – how our eyes transform light into nerve impulses, how our immune systems recognise and fend off potential invaders, or how our muscles convert chemical energy into physical force – ultimately relies on the ability of proteins to catalyse chemical reactions, transport nutrients, transduce signals or even serve as structural building blocks in complex scaffolds. Proteins are made up of amino acids, which are linked together in long chains to form a polypeptide. There are twenty natural or proteinogenic amino acids, each with a characteristic chemical behaviour – meaning that proteins can leverage all kinds of chemical groups to exert their function. The enormous biochemical versatility of proteins has meant they are involved in virtually every aspect of cellular life and are the focus of intense research across many scientific disciplines.

The sequence of amino acids in a protein uniquely determines its three-dimensional structure, which in turn dictates its function in the cell. For example, structural proteins like collagen and keratin assemble in extended, braided scaffolds that resemble ropes and confer enormous resilience to the tissues they cover. Transporters like hemoglobin implement “switches” that alternate between different states: one with avidity for the transported molecule (for example, the “relaxed” state of hemoglobin binds oxygen strongly in the lungs) and another which binds it less strongly (for example, the “tense” state of hemoglobin has little affinity for oxygen and releases it in the tissues). Enzymes, on the other hand, have active sites which provide a chemical environment easing the hardest steps in a chemical reaction, thus optimising the process. Understanding the relationship between protein structure and function is a fundamental problem in molecular biology and is essential for developing treatments for a wide range of diseases.

Determining the structure of a protein is, unfortunately, very challenging. As of March 2023, there were over 230 million protein sequences deposited in the TrEMBL database [1], but given the difficulties in experimentally characterising protein structure, little more than 200,000 structures were available in the Protein DataBank [2]. Bridging the gap between sequence and structure availability enables exploring the functional landscape of these proteins, and is a fundamental step to understand the connection between genotype (which determines the sequence of the protein) and phenotype (the observable, physical and macroscopic behaviours determined by the genetic information). This pursuit has motivated significant efforts to develop computational methods able to produce accurate predictions of protein structure directly from sequence. The progress of this pursuit is the focus of these lecture notes.

In more practical terms, here are a few problems where knowing a protein’s structure is paramount:

  • Drug discovery. Most drugs target proteins, either by blocking something they do, or interacting with them and making them change their ways. One of the most successful approaches to find new drugs is structure-based drug discovery [3], where a molecule is designed to fit within the desired binding pocket in a protein. A classical hurdle in structure-based drug discovery has been access to a good enough structure of the protein target [4].

  • Molecular simulations. Our tools to investigate the behaviour of biomolecules at the molecular level – how they interact, how they change, how they form part of complex systems, etc. – are very limited. This gap is typically filled by simulations, often using molecular dynamics [5]. However, to sample the phenomena we are interested in, these simulations require an initial structure that closely resembles the process we want to study.

  • Evolutionary studies. Genetic information is subject to mutations, meaning that sequences rapidly diverge – but as protein structure is indispensable for function, it tends to be much more strongly conserved. Understanding how protein structure varies across the tree of life allows unearthing complex evolutionary pressures and relations.

  • Protein design. One of the most interesting pursuits in modern synthetic biology is protein design, where an artificial protein is devised to carry out some desired function. For example, while there are many types of natural enzymes, there are many reactions for which no natural enzymes are known. Designing artificial enzymes for these reactions would significantly enhance the production of many valuable chemicals. A key challenge in coming up with enzymes and other proteins is to validate whether the proposed protein sequences fold to the desired structure.

  • Property predictions. Many interesting properties of proteins are highly dependent on the structure of the protein. Protein stability, solubility, function, and many others, depend on the structure of the protein, often because they result from the thermodynamics of the protein, which are strongly governed by the structure of the native state.

These arguments support the importance of predicting the structure of a protein from its amino acid sequence. In the next sections, we will explore how this can be done.

2. A little bit of history

Protein structure prediction methods have traditionally been divided into two main areas. The first family of methods, known as template-based modelling (TBM), relies on finding a protein of known structure that is closely related to the target sequence; we say that this protein has a high degree of homology to the target. The rationale is that, since protein structure is a lot more conserved than sequence, many diverse protein sequences will fold to similar structures, increasing the chance that someone, somewhere, has identified a similar structure before. Sequence similarity is often used as a proxy for homology, and it has generally been accepted that proteins with over 25-30% sequence identity have enough homology to be similar in structure [6]. After a good reference structure, known as a template, has been identified, it is used as an initial guess and refined to account for the differences with the target. The first successful attempt to predict the three-dimensional structure of a protein was an example of template-based modelling: the experimental X-ray structure of hen-egg lysozyme was used to propose a conformation for bovine alpha-lactalbumin, a protein with 40% sequence identity to the former [7]. Template-based modelling has been considered generally successful, even for complex applications like structure-based drug discovery [4], whenever good templates were available.

Template-free modelling or free modelling (FM) attempts to predict a protein’s structure without a suitable template, often because one is not available. These methods have invariably been inspired by Anfinsen’s thermodynamic hypothesis that the native structure of a protein corresponds to the lowest free energy conformation of the chain. Free modelling protocols typically use a custom score, for example the Rosetta energy function [8], and employ an optimisation algorithm to find conformations that minimise said score. These methods have tended to be highly diverse, including approaches that tend towards ab initio protocols inspired by molecular dynamics [9], and hybrid methods using knowledge-based potentials. Although free modelling has traditionally been recognised as different from template-based modelling, in practice concepts and methods have permeated between the two. For example, fragment generation processes for free modelling have tended to use methods very similar to threading and template search (and similarly template-based modelling has inherited energy functions and minimization procedures from free modelling). Nevertheless, free modelling has remained relatively unsuccessful and has only recently (in the early 2020s) become accurate enough to be generally useful.

Progress in structure prediction has been driven and measured by the Critical Assessment of protein Structure Prediction (CASP), a biennial, community-led experiment. In every edition of CASP, independent research groups in protein structure prediction are asked to predict the structures of several proteins that have been determined experimentally, but not made public at the time of the exercise. Comparison of predictions and experimental results, evaluated by an independent team of assessors, produces an unbiased estimate of the state of the field. CASP is not only a bona fide appraisal of the state-of-the-art, but also an opportunity for the latest advances to permeate throughout the community. The CASP assessment exercise has witnessed multiple step changes in accuracy as novel ideas were incorporated into the participants’ pipelines. For example, significant progress was made between CASP4 and CASP5 as a result of the widespread use of metaserver refinement [10], and the success of deep learning in CASP13 led to generalised adoption of this technique in CASP14 [11].

One important idea in template-free structure prediction is fragment replacement, an algorithm proposed by Bowie and Eisenberg [12] in 1994, the same year that CASP was initiated. Rather than trying to explore the full energetic landscape, the authors proposed a stochastic search where small peptide fragments sampled from a curated structural database were randomly replaced at set positions of the protein, to rapidly change the conformation. After each substitution, conformations could be scored with a custom potential, and accepted or rejected with a Monte Carlo criterion. The authors’ intuition was that, whereas structure databases are unlikely – even today – to have explored every possible protein fold, the local conformations of small groups of amino acids have far fewer available conformations, and are probably more thoroughly sampled. This approach was highly competitive with other methods (missing reference). Fragment replacement was a key step in the popular protein structure prediction method Rosetta [13], which was the top performer in most CASPs until recent editions, and remains widely used at the time of writing.
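The search loop described above can be sketched in a few lines of Python. This is a toy illustration under strong assumptions, not the algorithm of Bowie and Eisenberg or of Rosetta: the conformation is reduced to a list of (phi, psi) backbone torsions, the "fragment library" is a hand-written list, and `toy_score` is a hypothetical stand-in for an energy function that simply measures the squared deviation from a known target (ignoring angle periodicity).

```python
import math
import random

def toy_score(conformation, target):
    """Hypothetical score: squared torsion deviation from a known target."""
    return sum((phi - t_phi) ** 2 + (psi - t_psi) ** 2
               for (phi, psi), (t_phi, t_psi) in zip(conformation, target))

def fragment_replacement(conformation, fragments, score, steps=2000,
                         temperature=50.0, seed=0):
    """Stochastic search: insert random fragments, accept with Metropolis."""
    rng = random.Random(seed)
    current = list(conformation)
    current_score = score(current)
    best, best_score = list(current), current_score
    for _ in range(steps):
        # Pick a fragment and the window of residues it will overwrite.
        fragment = rng.choice(fragments)
        start = rng.randrange(len(current) - len(fragment) + 1)
        trial = current[:start] + list(fragment) + current[start + len(fragment):]
        delta = score(trial) - current_score
        # Metropolis criterion: always accept improvements, and sometimes
        # accept a worse conformation to escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, current_score + delta
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score

# Toy problem: six helix-like residues followed by six strand-like residues.
helix, strand = (-60.0, -45.0), (-120.0, 130.0)
target = [helix] * 6 + [strand] * 6
start = [(180.0, 180.0)] * 12          # fully extended starting chain
fragments = [[helix] * 3, [strand] * 3]  # three-residue "library"

best, best_score = fragment_replacement(
    start, fragments, lambda conf: toy_score(conf, target))
print(f"initial score: {toy_score(start, target):.0f}, best score: {best_score:.0f}")
```

The point of the sketch is the control flow: the move set only ever proposes locally plausible pieces, and the acceptance rule decides which proposals survive.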

Figure 1. Historical evaluation of progress throughout the fourteen editions of CASP. The vertical axis displays the global distance test [14], a measure of structural similarity that is less sensitive to large distortions in small portions of the protein than the more common RMSD; and the horizontal axis displays target difficulty, in terms of the number of available templates. This figure has been taken from John Moult’s keynote at the CASP14 conference.
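To make the difference between the two metrics mentioned in the caption concrete, here is a minimal Python sketch. It assumes that the model and the experimental structure have already been superposed (the real global distance test searches over many superpositions, which is omitted here), and the function names and toy coordinates are illustrative rather than the code used by the CASP assessors.

```python
import math

def rmsd(model, reference):
    """Root-mean-square deviation between two pre-superposed structures."""
    squared = [math.dist(a, b) ** 2 for a, b in zip(model, reference)]
    return math.sqrt(sum(squared) / len(squared))

def gdt_ts(model, reference, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Simplified GDT_TS: mean percentage of residues within each cutoff.

    Assumption: the structures are already superposed, so no search over
    superpositions is performed.
    """
    deviations = [math.dist(a, b) for a, b in zip(model, reference)]
    fractions = [sum(d <= c for d in deviations) / len(deviations)
                 for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)

# Toy example: ten residues along a line, 3.8 Å apart.
reference = [(3.8 * i, 0.0, 0.0) for i in range(10)]
# The model is perfect except for the last residue, displaced by 20 Å.
model = reference[:-1] + [(3.8 * 9, 20.0, 0.0)]

print(f"RMSD   = {rmsd(model, reference):.2f} Å")
print(f"GDT_TS = {gdt_ts(model, reference):.1f}")
```

In this toy case a single badly placed residue pushes the RMSD above 6 Å, while the simplified GDT_TS still reports that 90% of the residues are well placed.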

Another central idea was predicting pairs of residues spatially close in the structure of the protein and incorporating them as an optimisation constraint, an approach first proposed by Göbel and co-workers [15], also in 1994. The authors’ approach was data-driven, and aimed to leverage the fact that, even in 1994, protein sequences were easier to determine than three-dimensional structures. Their method postulated that, since evolution holds a prime interest in maintaining the structures of proteins, if two amino acids are in contact in the folded protein, they will be likely to experience correlated mutations. The authors suggested constructing a multiple sequence alignment (MSA) from the target protein and using the correlation between columns as a proxy for structural proximity. Although the algorithm proposed by Göbel et al. was beset by biases that precluded good performance, the idea of using multiple sequence alignments to predict distance constraints remained.
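As a rough illustration of "correlation between columns", the sketch below scores pairs of alignment columns by their mutual information. Note that mutual information is swapped in here as a simple, widely used covariation measure; it is not the score used in the original 1994 paper. The four-column alignment is invented for the example.

```python
import math
from collections import Counter

def column(msa, i):
    """Return column i of an alignment given as equal-length strings."""
    return [sequence[i] for sequence in msa]

def mutual_information(msa, i, j):
    """Mutual information (in nats) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(column(msa, i))
    counts_j = Counter(column(msa, j))
    counts_ij = Counter(zip(column(msa, i), column(msa, j)))
    mi = 0.0
    for (a, b), count in counts_ij.items():
        p_ab = count / n
        mi += p_ab * math.log(p_ab / ((counts_i[a] / n) * (counts_j[b] / n)))
    return mi

# Invented toy alignment: columns 0 and 3 mutate together (R with D, K with E),
# column 1 is fully conserved, and column 2 varies on its own.
msa = ["RALD", "RAVD", "KALE", "KAVE", "RALD", "KAVE"]

for i, j in [(0, 3), (0, 2), (0, 1)]:
    print(f"columns {i} and {j}: MI = {mutual_information(msa, i, j):.3f}")
```

The covarying pair of columns receives the highest score, which is the signal one hopes to read as spatial proximity. A raw score like this one suffers from exactly the kind of biases that limited the early methods.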

Powered by these and other ideas, the first five editions of CASP saw steady improvement in the free modelling category (see Figure 1), progressing from near-random models of all new folds in CASP1, to reasonable predictions of proteins of less than 100 residues in CASP6 [10]. In contrast, progress comparatively stalled between CASP6 and CASP10 [16], and would only resume after significant advances in contact prediction led to notable improvement in free modelling in CASP11 [17]. Novel statistical algorithms, such as direct coupling analysis (missing reference) and inverse covariance analysis [18] (discussed shortly below) were able to overcome the phylogenetic and indirect coupling biases present in prior approaches [19]. Powered by rapidly growing sequence databases, more accurate predicted contacts were incorporated into most pipelines, either contributing to the energy function [20] or used directly in modelling [21].

The methods described thus far were all made relatively obsolete with the advent of AlphaFold 2 in the early 2020s (missing reference). Nevertheless, like any scientific advance, many of the central ideas in AlphaFold 2 rely on previous advances. In the following section, we discuss coevolutionary analysis, one of the most powerful ideas in structure prediction.

3. Exploiting sequence data

In the previous section we introduced the idea of predicting amino acids that were in close proximity from analysis of multiple sequence alignments. This method forms part of a family of sequence analysis algorithms known as coevolutionary analysis, which underlies the success of AlphaFold 2. We explore this concept in more detail here.

The idea behind coevolutionary analysis is that interactions between amino acids in the sequence leave a footprint in the evolutionary history of a protein. For example, if the structure of a protein relies on a salt bridge between an arginine and an aspartate, a mutation on the former will likely be accompanied by a mutation on the latter: if the arginine is mutated to a hydrophobic amino acid, then the aspartate will have to become hydrophobic as well. Should this not happen, the protein would be unable to fold, “selecting out” the organism. Hence, the only sequences present in nature are those with correlated mutations. Coevolutionary analysis employs powerful statistical methods to extract the causative correlations between amino acids in a protein.

The initial attempts at extracting amino acid interactions from the multiple sequence alignment were beset by the presence of three intrinsic biases in the data. The first one is the entropic bias, which arises from limits in sampling. The second one is phylogenetic bias, which refers to spurious correlations due to the evolutionary tree of the organisms, rather than the interactions between amino acids. The third, and most important, bias is chaining bias, which refers to the difference between true evolutionary covariation, or causative correlations, and indirect correlations. For example, if residues A and B are in contact in the structure, and residues B and C are also in contact, then there is a chaining effect between residues A and C even though these two are not directly interacting. Correcting these biases has been an active area of research in structural bioinformatics.

In the early 2010s, two families of algorithms were developed that could correct for these biases. The first family, direct coupling analysis (DCA) [19], uses a global graphical model of the protein sequence where every position is correlated with every other position, and the parameters are estimated from the multiple sequence alignment under strong regularization. The second family, inverse covariance analysis (ICA) [18], constructs a massive correlation matrix of every possible amino acid at every position of the sequence, and estimates correlations between pairs of variables while controlling for every other variable, using the inverse of the covariance matrix. These approaches had an enormous impact on the quality of protein structure prediction.
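The effect of "controlling for every other variable" is easiest to see with three variables. The sketch below is a simplified illustration, not the PSICOV or DCA procedure: it uses real-valued toy variables instead of one-hot encoded amino acids, and the closed-form partial correlation, which for three variables coincides (up to sign and normalisation) with the off-diagonal entry of the inverse covariance matrix. Variables `a` and `c` both depend on `b` but never on each other, mimicking the A-B-C chaining example above.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((u - mean_x) * (v - mean_y) for u, v in zip(x, y))
    var_x = sum((u - mean_x) ** 2 for u in x)
    var_y = sum((v - mean_y) ** 2 for v in y)
    return cov / math.sqrt(var_x * var_y)

def partial_correlation(x, z, given):
    """Correlation between x and z after controlling for `given`."""
    r_xz = pearson(x, z)
    r_xg = pearson(x, given)
    r_zg = pearson(z, given)
    return (r_xz - r_xg * r_zg) / math.sqrt((1 - r_xg ** 2) * (1 - r_zg ** 2))

# Toy chain: a and c are each coupled to b, but not directly to each other.
rng = random.Random(0)
b = [rng.gauss(0.0, 1.0) for _ in range(5000)]
a = [value + rng.gauss(0.0, 1.0) for value in b]
c = [value + rng.gauss(0.0, 1.0) for value in b]

print(f"raw correlation a-c:      {pearson(a, c):.2f}")
print(f"partial correlation a-c:  {partial_correlation(a, c, b):.2f}")
```

The raw correlation between `a` and `c` is substantial even though they never interact, whereas the partial correlation is close to zero: the indirect, chained signal is explained away by `b`.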

Exploiting evolutionary contacts to predict protein structure has been one of the most influential ideas in structural bioinformatics. Progress in the field was powered by the release of multiple packages implementing complementary statistical approaches, including FreeContact [22], PSICOV [18] and CCMpred [23], as well as metapredictors combining several approaches, such as MetaPSICOV [24]. Coevolution analysis has also been used to solve problems like predicting flexible conformations [25], allostery [26] and mutation effects [27]. In proteins for which significant sequence information is not available, it is also possible to generate artificial sequence data from deep mutational scans (missing reference).

The concept of exploiting large amounts of sequence information remains strong in modern approaches to protein structure prediction. As we will see, both AlphaFold and AlphaFold 2 rely strongly on leveraging large multiple sequence alignments to make predictions of protein structure.

4. Deep learning and structure prediction

The introduction of deep learning in protein structure prediction sparked a dramatic increase in accuracy between CASP12 and CASP14. In this section, we introduce the concept of deep learning and discuss how it was first applied to contact prediction using coevolutionary analysis. The next section will then discuss the architecture of AlphaFold 2.

Deep learning is an approach to statistical learning where multiple layers of a model are stacked sequentially, allegedly achieving an understanding of complicated concepts by building them from simpler ones in a hierarchical fashion [28]. In computer vision, for example, it has been proposed that deep learning systems can, in their first layers, learn simple concepts like edges and corners, which by composition can describe complex shapes like human faces (e.g. Figure 2). Deep learning approaches have for the most part relied on neural network layers, although the term does not preclude other types of statistical learning techniques from being used as building blocks (e.g. deep Gaussian processes [29]).

Figure 2. Illustration of the hierarchical nature of deep learning. Directly mapping an image (a collection of three $$N\times M$$ arrays with real-valued numbers for the red, blue and green intensity at every pixel) to its class is a complicated function. Deep learning can surmount this problem by learning a series of nested simple mappings which can be composed to perform powerful inferences [30]. This image has been reproduced from [31].

Deep neural networks are often heavily overparametrised. For example, AlphaFold 2 has approximately 93 million parameters [32], while it has been trained on approximately 170,000 protein structures with an average of 400-500 residues per protein. In classical statistical learning theory, one would expect the high ratio of parameters to training examples – more than one parameter per residue – to result in overfitting, preventing any useful performance when extrapolating from the training set. In practice, neural networks with very high overparametrisation are still able to generalise, and it has even been proposed that the higher the overparametrisation, the higher the generalisation ability of the neural network (missing reference). The uncanny generalization ability of deep neural networks has been the object of intense theoretical study, with numerous proposed theories such as a tendency towards simple functions [33], a bias introduced by the stochastic gradient descent algorithms used to train them [34], or the flatness of typical loss function minima [35]. However, a definitive explanation to how or why deep neural networks can generalise from data despite overparametrisation is still lacking.
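The "more than one parameter per residue" claim is a quick back-of-the-envelope calculation, spelled out below using only the approximate figures quoted in the paragraph above (the helper function name is just for illustration).

```python
def parameters_per_residue(n_parameters, n_structures, mean_length):
    """Ratio of model parameters to residues in the training set."""
    return n_parameters / (n_structures * mean_length)

N_PARAMETERS = 93_000_000   # approximate size of AlphaFold 2, as quoted above
N_STRUCTURES = 170_000      # approximate number of training structures

# The text gives a range of 400-500 residues per protein, so try both ends.
for mean_length in (400, 500):
    ratio = parameters_per_residue(N_PARAMETERS, N_STRUCTURES, mean_length)
    print(f"{mean_length} residues per protein: "
          f"{ratio:.2f} parameters per residue")
```

Both ends of the range give a ratio slightly above one, roughly 1.4 and 1.1 respectively.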

Deep neural networks proved to be much more efficient than fixed statistical approaches at extracting coevolutionary information from multiple sequence alignments. The first significant application of deep learning to protein structure prediction is thought to be RaptorX-Contact (missing reference), a predictive pipeline which used deep residual convolutional neural networks trained on multiple sequence alignments to predict binary interresidue contacts. The twelfth CASP assessment, when this method was presented, saw an impressive improvement in precision for contact prediction: from 27% in CASP11 to 47% in CASP12 [36], which also translated into some improvement for template-free predictions (missing reference). The success of RaptorX at CASP12 showcased deep learning as a potentially powerful tool in protein structure prediction.

The next conceptual improvement was the realisation that deep neural networks were powerful enough to move beyond binary contacts, which CASP assessments had considered for over a decade, and embrace a much more powerful representation: distances (traditionally, two amino acids were considered to be in contact if their β-carbons, or α-carbons in glycine, were less than 8.0 Å apart). In CASP13, three predictors, RaptorX [37], DMPfold [38] and AlphaFold [39], used predicted distance distributions with great success. These methods not only predicted contacts with much higher precision, from 47% in CASP12 to 70% in CASP13 [40], perhaps due to the higher expressiveness of the distance prediction task, but also improved template-free structure prediction, where “for the first time in CASP, there [was] at least one model that roughly captures global topology for all [targets]” [41].
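The difference between the two prediction targets can be made concrete with a short sketch. Given one representative atom per residue, a binary contact map only records whether each pair is closer than 8 Å, while a binned distance map keeps a coarse version of the distance itself. The coordinates and the bin edges below are invented for illustration and are not those used by any of the methods cited above.

```python
import bisect
import math

def distance_matrix(coords):
    """All-against-all Euclidean distances between representative atoms."""
    return [[math.dist(a, b) for b in coords] for a in coords]

def contact_map(distances, threshold=8.0):
    """Binary target: True where two residues are closer than the threshold."""
    return [[d < threshold for d in row] for row in distances]

def distance_bins(distances, edges):
    """Richer target: index of the distance bin each pair falls into.

    Bin k collects distances in [edges[k-1], edges[k]); the last index,
    len(edges), collects everything beyond the final edge.
    """
    return [[bisect.bisect_right(edges, d) for d in row] for row in distances]

# Toy chain: four residues along a line, 3.8 Å apart.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
distances = distance_matrix(coords)
contacts = contact_map(distances)
bins = distance_bins(distances, edges=[4.0, 8.0, 12.0, 16.0])

print("contacts of residue 0:", contacts[0])
print("distance bins of residue 0:", bins[0])
```

For the first residue, the contact map cannot tell its immediate neighbour (3.8 Å) from the residue two positions away (7.6 Å), whereas the binned distances place them in different classes. A network trained on the binned target is asked to explain strictly more about the geometry.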

Despite the progress up to CASP13, in hindsight the overarching methodology for template-free protein structure prediction had remained roughly the same. Assumptions about the free energy of the protein were still encoded in a potential, which was now derived from the probability distributions of distances [39], and a structure was determined by minimising this potential. Methods like AlphaFold and DMPfold also allowed faster minimisation, since the distance restraints tended to smoothen the energy surface, permitting gradient-based methods that are much faster than fragment-based algorithms [6]. However, this methodology was lacking a final conceptual breakthrough: making the methodology end-to-end.

5. An introduction to AlphaFold 2

Figure 3. Architectural diagram of the AlphaFold 2 protein structure prediction system. The protein sequence is contrasted against sequence and structure databases, in order to produce, respectively, a multiple sequence alignment (MSA) and a list of potential templates. This information is passed to the Evoformer module, which extracts information from the MSA into the geometric representation of the protein (and vice versa), and ultimately to the structure module, which builds the final structure. Image modified from Jumper et al. [32].

AlphaFold 2 represented a dramatic change in template-free protein structure prediction: instead of following standard pipelines, where multiple predictions were combined in a final minimization step, the model was trained in an end-to-end fashion. In deep learning parlance, end-to-end means that a fully differentiable algorithm can map directly from a given input (in protein structure prediction, a sequence or multiple sequence alignment) to the desired output (in protein structure prediction, the atomic coordinates of the protein). Since the algorithm is fully differentiable, it is possible to optimise the entire process and, assuming that a powerful enough architecture is in use, it could learn the best way to map inputs to outputs, without the biases of whatever the authors consider most appropriate. The end-to-end methodology was one of the keys to the success of AlphaFold 2.

In 2005, when examining the progress of the structure prediction field after a decade of CASP, John Moult rejoiced that the most successful approaches, “for proteins of less than about 100 residues [...] may produce one or a few approximately correct structures (4-6 Å Cα-RMSD)” [10]. Fifteen years later, AlphaFold 2 managed to predict most structures in CASP14 to GDT_TS over 80, indicating overall agreement in both the protein backbone and the side chains, and even in the worst cases, a structure that is either globally consistent with the experimental structure [42] or with a dynamically similar structure (missing reference). Although end-to-end deep learning methods had been proposed previously e.g. (missing reference), nothing had come even close to the success of AlphaFold 2.

The architecture of AlphaFold 2, depicted in Figure 3, is also quite different to previously proposed models. The flow of information can be expressed as follows. The input sequence is used to generate multiple sequence alignments and to identify potential templates, as in most pipelines. This information is in turn used to construct two representations. In this context, the term representation or embedding refers to a tensor in a high-dimensional space that is obtained from passing some information through a neural network. The neural network depends on some parameters, which are adapted iteratively during training. Intuitively, since the model is end-to-end, given enough data it should be able to “learn” a good function to represent the input data. These representations are thought to capture all the information corresponding to the multiple sequence alignment and an initial proposal of the three-dimensional structure. This information is then transferred to the “Evoformer” module, which undertakes the function of standard coevolution methods; and to a “structure module” that takes in coevolution information and outputs a structure. These modules include multiple feedback loops where the information is recycled, leading to a powerful machine that extracts as much information as possible from the multiple sequence alignment and projects it all into a final prediction of the structure.

The central idea behind the Evoformer is that the information flows back and forth throughout the network. Before AlphaFold 2, most deep learning models would take a multiple sequence alignment and output some inference about geometric proximity. Geometric information was therefore a product of the network. In the Evoformer, instead, the pair representation is both a product and an intermediate layer. At every cycle, the model leverages the current structural hypothesis to improve the assessment of the multiple sequence alignment, which in turn leads to a new structural hypothesis, and so on, and so on. Both representations, sequence and structure, exchange information until the network “reaches” a solid inference. This is enabled by a particular deep neural network architecture known as the transformer [43], which can dynamically regulate the flow of information.

The structure module considers the protein as a “residue gas”. Every amino acid is modelled as a solid triangle, representing the three atoms of the backbone. These triangles float around in space, and are displaced and rotated by the network via affine matrices to form the structure. At the beginning of the structure module, all of the residues are placed at the origin of coordinates. At every step of the iterative process, AlphaFold 2 produces a set of affine matrices that displace and rotate the residues in space. One of the key ingredients is invariant point attention (IPA), an attention mechanism whose output does not change under global rotations and translations of the structure. This function enables the network to unify geometrically equivalent conformations of a protein (e.g. two structures that have been displaced or rotated would be considered to be the same one), effectively allowing the network to “learn more” from the same amount of data. The final output of the structure module, after three recycling iterations, is an all-atom structure of a protein.
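The "residue gas" picture, and what it means for two displaced or rotated structures to count as the same one, can be sketched with plain rigid-body transforms. This is an illustration of the geometry only, not AlphaFold 2 code: the backbone coordinates are approximate idealised values, rotations are restricted to the z axis to keep the example short, and all function names are hypothetical.

```python
import math

# Approximate backbone triangle in the residue's local frame (in Å), with the
# alpha carbon at the origin. Illustrative values only.
LOCAL_BACKBONE = {"N": (-0.53, 1.36, 0.0), "CA": (0.0, 0.0, 0.0), "C": (1.53, 0.0, 0.0)}

def rotation_z(angle):
    """Rotation matrix for a rotation of `angle` radians about the z axis."""
    c, s = math.cos(angle), math.sin(angle)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def transform(rotation, translation, point):
    """Apply the rigid motion x -> R x + t to a single point."""
    return tuple(
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    )

def place_residue(rotation, translation):
    """Move the rigid backbone triangle from its local frame into space."""
    return {atom: transform(rotation, translation, xyz)
            for atom, xyz in LOCAL_BACKBONE.items()}

# Two residues start at the origin and are moved by their own frames.
res1 = place_residue(rotation_z(0.3), (1.0, 2.0, 0.5))
res2 = place_residue(rotation_z(-1.1), (4.0, 1.0, -0.5))

# Now rotate and translate the whole "structure" at once.
global_rot, global_shift = rotation_z(2.0), (10.0, -3.0, 7.0)
moved1 = {a: transform(global_rot, global_shift, x) for a, x in res1.items()}
moved2 = {a: transform(global_rot, global_shift, x) for a, x in res2.items()}

before = math.dist(res1["CA"], res2["CA"])
after = math.dist(moved1["CA"], moved2["CA"])
print(f"CA-CA distance before: {before:.3f} Å, after: {after:.3f} Å")
```

Internal distances are untouched by the global motion, so any quantity computed from them is the same for both copies. A network built from operations with this property does not have to spend capacity, or training data, on learning that the two copies describe the same protein.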

Some groups have also considered alternative versions of the AlphaFold 2 architecture that may show better performance for other problems [44]; or that may provide results at a much faster pace [45]. Nevertheless, the performance of AlphaFold 2 has led many to declare the problem of predicting the structure of individual protein chains essentially solved.

6. Advances over AlphaFold

The AlphaFold 2 paper, alongside the code and weights of the model, was published in Nature in July 2021 [32], sparking significant research in structural bioinformatics. In this section, we outline some of the current research advances over AlphaFold 2.

The performance of AlphaFold 2 is reliant on the input multiple sequence alignments, so much work was dedicated to improving these. Multiple sequence alignments are typically produced using hidden Markov models, with tools like HHblits. Some groups used deep learning-based methods to produce MSAs (missing reference), or enriched the databases in some way. Other groups introduced additional constraints, or generated a large number of structures and used some sophisticated quality assessment method to identify the best models. While the special issue of the Proteins journal describing the methods in CASP15 has not yet been published, the results suggest that, although there is some improvement, especially in multimer prediction, the standard AlphaFold 2 protocol was better than about half of the submitted predictions [46].

One of the problems with protein structure prediction methods like AlphaFold 2 is that they require deep multiple sequence alignments which can be mined using coevolutionary analysis. Unfortunately, this is a problem for many proteins for which the multiple sequence alignment is not very deep, or particular proteins, such as antibodies, where the evolutionary history is not meaningful. One potential approach to solve this is to substitute the multiple sequence alignment with the embedding of a language model (missing reference), which, though slightly worse in quality than AlphaFold 2, is able to produce useful predictions at a much faster rate. Nevertheless, it seems that these models do not perform well for orphan proteins, which has led some authors to suggest these language models are just memorising multiple sequence alignments [46].

Some other methods have been developed that try to reproduce AlphaFold 2, such as OpenFold [47] and UniFold [48]. The OpenFold paper, in particular, offers a fascinating analysis of how AlphaFold learns to predict protein structures, based on one technical fact: the model can be trained to 95% accuracy in only about 5% of the total training time (about 4-5 days on 25 GPUs), allowing the team to experiment with multiple settings. For example, they observed that the model could be trained on proteins containing only α-helices, and still generalise to β-sheets reasonably well (and vice versa).

7. An overview of doors opened

The ability to accurately predict the structure of proteins has had an enormous impact on structural bioinformatics, and on biology in general. In this section, we look at several fields where the impact of AlphaFold 2 has been truly groundbreaking.

  • Proteome-wide structure prediction. Concomitantly with the publication of AlphaFold 2, DeepMind released a database with predictions of full proteomes [49]. This database has since scaled to nearly 200 million proteins. A similar effort has been carried out by Facebook (now Meta) AI Research, which used their faster ESMFold pipeline to predict 600 million protein structures, including many metagenomic proteins [50]. These efforts have enabled evolutionary studies (missing reference), and have also offered a perspective on the limits of protein-based architecture [51].

  • Protein search. Many successes in computational biology stem from algorithms that perform sequence searches, which make it possible to find evolutionarily and functionally related sequences. Since protein structure is more highly conserved than sequence, it is to be expected that comparing a protein’s structure with other structures may reveal functional features that would otherwise have been blurred by mutational drift. This idea has powered multiple approaches to rapidly search large databases of protein structures (missing reference).

  • Experimental model building. Computational researchers tend to think of experimental data as “perfect” – nothing could be further from the truth. Whether obtained from X-ray crystallography, nuclear magnetic resonance, or cryogenic electron microscopy, every experimental structure is ultimately a model in reasonable agreement with the data. In many cases, the information is insufficient. In X-ray crystallography, for example, a big part of the data is intrinsically missing. The diffraction pattern is an image of the Fourier transform of the crystal structure, but due to the experimental setup, the phase information is lost – one of the major headaches in crystallography (the phase problem). Using a predicted AlphaFold structure as an initial model to “guess” the phase information has proven a very successful strategy to solve the phase problem (missing reference). In cryogenic electron microscopy, the resolution is often very low and it may be difficult to appreciate details close to atomic level. This can be remedied by combining the rough map from the microscopy experiment with high-resolution predictions from AlphaFold, an approach that was successfully applied to model the structure of the nuclear pore complex (missing reference), a massive cellular macrostructure composed of about 500 protein chains.

  • Protein design. One of the challenges in creating artificial proteins for custom functions is that it is hard to verify their structure. Early studies showed that AlphaFold can successfully be used to predict the structure of designed proteins (missing reference), and it has become a de facto test to check the validity of proposed designs (missing reference). Alternatively, a research team from Meta AI has trained a model on 12 million high-confidence structures predicted by AlphaFold 2, finding that the model is able to find sequences that – when predicted with AlphaFold 2 – fold to a given design with high confidence [52].

  • Downstream tasks. Protein structures contain valuable information about the function and physicochemical properties of these molecules. Multiple papers have developed methods that incorporate structural features into machine learning models for predicting function [53], binding sites [54], flexibility [55] or immunogenicity (missing reference), with high levels of success.
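
The phase problem described under “Experimental model building” can be illustrated with a toy one-dimensional “crystal”. The sketch below is pure Python and makes strong simplifying assumptions: a 1D density on eight grid points, a naive discrete Fourier transform and noise-free data. The function names are hypothetical. It shows that the measured amplitudes only reproduce the density when they are paired with good phases, which is why phases borrowed from an accurate predicted model are so valuable.

```python
import cmath


def dft(density):
    """Naive discrete Fourier transform of a real 1D density."""
    n = len(density)
    return [
        sum(density[x] * cmath.exp(-2j * cmath.pi * k * x / n) for x in range(n))
        for k in range(n)
    ]


def inverse_dft(coefficients):
    """Naive inverse transform; returns the real part at each grid point."""
    n = len(coefficients)
    return [
        (sum(coefficients[k] * cmath.exp(2j * cmath.pi * k * x / n)
             for k in range(n)) / n).real
        for x in range(n)
    ]


def amplitudes(density):
    """What the experiment records: |F(k)|, with the phase discarded."""
    return [abs(f) for f in dft(density)]


def combine(measured_amplitudes, model_density):
    """Pair measured amplitudes with phases computed from a model density."""
    model_phases = [cmath.phase(f) for f in dft(model_density)]
    return [cmath.rect(a, p) for a, p in zip(measured_amplitudes, model_phases)]


# Toy "true" density and the amplitudes an experiment would measure from it.
true_density = [0.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0]
measured = amplitudes(true_density)

# Phases from a perfect model recover the density.
good_map = inverse_dft(combine(measured, true_density))

# Phases from a poor model (a single atom at the origin, so every phase
# is zero) give a very different map from the same amplitudes.
poor_model = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
bad_map = inverse_dft(combine(measured, poor_model))

print("true:", true_density)
print("good:", [round(v, 3) for v in good_map])
print("bad: ", [round(v, 3) for v in bad_map])
```

Real molecular replacement also has to place the model in the unit cell and refine it against the data. The toy only isolates the role played by the phases.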

8. Open challenges

While AlphaFold 2 has had an enormous impact on research in structural biology and bioinformatics, there are still many unsolved problems that are at the forefront of modern research:

  • Multiple conformations. AlphaFold 2 has been trained to predict a protein crystal structure, but most proteins have multiple possible conformations [56]. These conformations are functionally relevant, for example as proteins change their activity in response to signals, and thus understanding them has enormous biological value. There is evidence that AlphaFold 2 is biased towards particular types of conformations, and that it tends to perform worse when proteins have a wider conformational ensemble [57]. Multiple methods have been proposed to predict multiple conformations, for example by altering the input multiple sequence alignment, be it by clustering [58], subsampling [59], or alanine scanning [60]. One of the main challenges with these approaches is validating them, as there is limited high-quality data on multiple conformations, and when available, it is often biased towards specific protein families.

  • Post-translational modifications. Extrapolations from proteomics experiments suggest that as many as 70% of all proteins are subject to post-translational modifications [61]. Since AlphaFold 2 only predicts proteins consisting of the 20 proteinogenic amino acids, it cannot account for processes like phosphorylation, which have a dramatic effect on protein structure and conformation [62].

  • Mutations. The structure and behaviour of proteins can be significantly altered by even minute changes in sequence – a single mutation is sometimes enough to make a protein misfold or lose its function, as in sickle-cell anemia [63]. Several papers have examined this fact, and though one recent piece of work found some correlation for the small structural deformations that follow a change in an amino acid [64], the overall consensus seems to be that AlphaFold 2 knows little about the effect of mutations (missing reference). Some studies have, however, found that they could use the predicted structure to estimate changes in the free energy of folding of a protein upon mutation (missing reference).

  • Protein-ligand interactions. Many proteins, particularly those of clinical significance, interact with ligands in some way. Understanding how (and how strongly) a ligand binds to a protein, as well as how the protein reacts to the interaction with the ligand, are problems of enormous importance for drug discovery and other areas of biology. Some progress has been made in the latter problem, with a method to “correct” AlphaFold predictions by comparing them to experimental structures where we can see the effect of the ligands on the protein’s conformation [65]. Much less progress has been made in predicting the conformation of a ligand in a protein, or predicting how strongly it binds – a problem exacerbated by the biases in the datasets [66].

  • Protein-protein interactions. Many proteins form multimers, and most proteins interact with another protein in some form. Predicting which proteins interact is useful to understand biological regulation networks in the cell, and elucidating the geometry of the interaction is useful for structure-based drug discovery of protein therapeutics, which are one of the most rapidly growing classes of medicines. The same team at DeepMind that developed AlphaFold also produced AlphaFold-Multimer [67], a version that can predict protein complexes (and, by extension, protein-protein interactions). While this model outperformed every previous protein-protein docking tool, its accuracy lags behind the success at single chain prediction, to the point that the problem is not considered solved [51]. Nevertheless, there are specific subproblems in protein-protein interaction prediction that have seen notable progress, such as protein-peptide interactions (missing reference).

  • Protein-nucleic acid interactions. Many proteins also interact with DNA or RNA. For example, one of the most important elements in gene regulation is the rate of transcription initiation, much of which is controlled by the binding of proteins known as transcription factors to promoter regions in DNA. Recent work has adapted RoseTTAFold [44], a variant of AlphaFold, to predict protein-DNA and protein-RNA complexes [68]. While this model achieves very high average accuracy for the geometry, it struggles to model the interface: about half of the models are missing more than half of the contacts between protein and DNA.

  • Protein folding. There is a stark difference between protein folding and protein structure prediction, which has often been overlooked by popular and even scientific media. Protein structure prediction is the prediction of the equilibrium three-dimensional structure of a protein; protein folding is a description of the mechanism whereby an unfolded protein attains this three-dimensional structure [69]. While AlphaFold 2 and derivatives provide a very effective solution to the former, they have not provided any insights into the latter. One study examined the intermediate conformations that appear on the way to the folded structure, and found essentially no correlation with experimental data on protein folding mechanisms [70].
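
Of the strategies listed under “Multiple conformations”, subsampling the input alignment is the simplest to sketch. The snippet below is only an illustration under stated assumptions: the alignment is a plain list of aligned strings with the query in the first row, `subsample_msa` is a hypothetical helper, and the step of running the structure predictor on each shallow alignment is left out. It is not the protocol of any of the cited papers.

```python
import random


def subsample_msa(msa, depth, seed=None):
    """Return the query (first row) plus up to depth - 1 other rows.

    The extra rows are drawn at random without replacement.
    """
    if not msa:
        raise ValueError("the alignment is empty")
    if depth < 1:
        raise ValueError("depth must be at least 1")
    query, others = msa[0], msa[1:]
    rng = random.Random(seed)  # local generator, so runs are reproducible
    n_extra = min(depth - 1, len(others))
    return [query] + rng.sample(others, n_extra)


# Hypothetical toy alignment; real alignments have thousands of rows.
toy_msa = [
    "MKTAYIAKQR",
    "MKTAYLAKQR",
    "MRTAYIAKQK",
    "MKSAYIGKQR",
    "MKTGFIAKER",
    "LKTAYIAKQR",
]

# Each shallow alignment would be passed to the structure predictor on its
# own; differences between the resulting models are candidate alternative
# conformations that still need experimental validation.
shallow_alignments = [subsample_msa(toy_msa, depth=3, seed=s) for s in range(4)]
for alignment in shallow_alignments:
    print(alignment)
```

Keeping the query in every subsample matters, because the predictor still needs to know which sequence it is modelling. The depth and the number of repeats are tuning choices left open here.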

9. Conclusions

Protein structure prediction has been one of the most active fields of research in computational biology. For nearly thirty years, progress in predicting the three-dimensional structure of proteins was slow and stalling. In the early 2020s, technical improvements that eventually crystallised into the AlphaFold 2 architecture led to the problem of predicting the conformation of single chains being essentially solved. This achievement has had a transformative effect on the field of structural biology, and many tasks that were thought to be almost unsolvable – obtaining the structure of almost any protein domain, à la carte molecular replacement for experimental model building, etc. – are now commonplace.

Despite this success, there are many open challenges for structural bioinformatics. Work in structure prediction has continued, with some improvements over AlphaFold, although there have been no major breakthroughs in the quality of predictions. Many other problems still remain, such as predicting multiple conformations of proteins; exploring how proteins interact with ligands, nucleic acids and other proteins; describing the dynamic interactions of proteins; and, of course, one of the most long-standing problems in protein science: explaining how proteins dynamically attain their equilibrium three-dimensional structure.

10. Bibliography

  1. Nelson, D.L., Lehninger, A.L., Cox, M.M., 2008. Lehninger’s Principles of Biochemistry. Macmillan.
  2. Jhoti, H., Leach, A.R., 2007. Structure-based Drug Discovery, 1st ed. Springer.
  3. The UniProt Consortium, 2019. UniProt: a worldwide hub of protein knowledge. Nucleic Acids Research 47, D506–D515.
  4. Dill, K.A., MacCallum, J.L., 2012. The protein folding problem, 50 years on. Science 338, 1042–1046.
  5. Young, D.C., 2009. Computational Drug Design: a guide for computational and medicinal chemists. John Wiley & Sons.
  6. Berman, H.M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T.N., Weissig, H., Shindyalov, I.N., Bourne, P.E., 2000. The Protein Data Bank. Nucleic Acids Research 28, 235–242.
  7. Greener, J.G., Kandathil, S.M., Jones, D.T., 2019. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nature Communications 10, 1–13.
  8. Kryshtafovych, A., Schwede, T., Topf, M., Fidelis, K., Moult, J., 2019. Critical assessment of methods of protein structure prediction (CASP)—Round XIII. Proteins: Structure, Function, and Bioinformatics 87, 1011–1020.
  9. Moult, J., 2005. A decade of CASP: progress, bottlenecks and prognosis in protein structure prediction. Current Opinion in Structural Biology 15, 285–289.
  10. Del Alamo, D., Govaerts, C., Mchaourab, H.S., 2021. AlphaFold2 predicts the inward-facing conformation of the multidrug transporter LmrP. Proteins: Structure, Function, and Bioinformatics.
  11. Kryshtafovych, A., Fidelis, K., Moult, J., 2014. CASP10 results compared to those of previous CASP experiments. Proteins: Structure, Function, and Bioinformatics 82, 164–174.
  12. Browne, W.J., North, A.C.T., Phillips, D.C., Brew, K., Vanaman, T.C., Hill, R.L., 1969. A possible three-dimensional structure of bovine α-lactalbumin based on that of hen’s egg-white lysozyme. Journal of Molecular Biology 42, 65–86.
  13. Pearce, R., Zhang, Y., 2021. Toward the solution of the protein structure prediction problem. Journal of Biological Chemistry 100870.
  14. Xu, J., 2019. Distance-based protein folding powered by deep learning. Proceedings of the National Academy of Sciences 116, 16856–16865.
  15. Senior, A.W., Evans, R., Jumper, J., Kirkpatrick, J., Sifre, L., Green, T., Qin, C., Žídek, A., Nelson, A.W.R., Bridgland, A., others, 2020. Improved protein structure prediction using potentials from deep learning. Nature 577, 706–710.
  16. Outeiral, C., Nissley, D.A., Deane, C.M., 2022. Current structure predictors are not learning the physics of protein folding. Bioinformatics.
  17. Bowie, J.U., Eisenberg, D., 1994. An evolutionary approach to folding small alpha-helical proteins that uses sequence information and an empirical guiding fitness function. Proceedings of the National Academy of Sciences 91, 4436–4440.
  18. Göbel, U., Sander, C., Schneider, R., Valencia, A., 1994. Correlated mutations and residue contacts in proteins. Proteins: Structure, Function, and Bioinformatics 18, 309–317.
  19. Simons, K.T., Kooperberg, C., Huang, E., Baker, D., 1997. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. Journal of Molecular Biology 268, 209–225.
  20. Alford, R.F., Leaver-Fay, A., Jeliazkov, J.R., O’Meara, M.J., DiMaio, F.P., Park, H., Shapovalov, M.V., Renfrew, P.D., Mulligan, V.K., Kappel, K., others, 2017. The Rosetta all-atom energy function for macromolecular modeling and design. Journal of Chemical Theory and Computation 13, 3031–3048.
  21. Pereira, J., Simpkin, A.J., Hartmann, M.D., Rigden, D.J., Keegan, R.M., Lupas, A.N., 2021. High-accuracy protein structure prediction in CASP14. Proteins: Structure, Function, and Bioinformatics.
  22. Zemla, A., 2003. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Research 31, 3370–3374.
  23. Kinch, L.N., Li, W., Monastyrskyy, B., Kryshtafovych, A., Grishin, N.V., 2016. Evaluation of free modeling targets in CASP11 and ROLL. Proteins: Structure, Function, and Bioinformatics 84, 51–66.
  24. Jones, D.T., Buchan, D.W.A., Cozzetto, D., Pontil, M., 2012. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190.
  25. Ovchinnikov, S., Kim, D.E., Wang, R.Y.-R., Liu, Y., DiMaio, F., Baker, D., 2016. Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins: Structure, Function, and Bioinformatics 84, 67–75.
  26. Kosciolek, T., Jones, D.T., 2016. Accurate contact predictions using covariation techniques and machine learning. Proteins: Structure, Function, and Bioinformatics 84, 145–151.
  27. Zeiler, M.D., Fergus, R., 2014. Visualizing and understanding convolutional networks. In: European Conference on Computer Vision. Springer, pp. 818–833.
  28. Goodfellow, I., Bengio, Y., Courville, A., 2016. Deep Learning. MIT Press.
  29. Damianou, A., Lawrence, N.D., 2013. Deep gaussian processes. In: Artificial Intelligence and Statistics. PMLR, pp. 207–215.
  30. Valle-Perez, G., Camargo, C.Q., Louis, A.A., 2018. Deep learning generalizes because the parameter-function map is biased towards simple functions. arXiv preprint arXiv:1805.08522.
  31. Hochreiter, S., Schmidhuber, J., 1997. Flat minima. Neural Computation 9, 1–42.
  32. Soudry, D., Hoffer, E., Nacson, M.S., Gunasekar, S., Srebro, N., 2018. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research 19, 2822–2878.
  33. Schaarschmidt, J., Monastyrskyy, B., Kryshtafovych, A., Bonvin, A.M.J.J., 2018. Assessment of contact predictions in CASP12: co-evolution and deep learning coming of age. Proteins: Structure, Function, and Bioinformatics 86, 51–66.
  34. Abriata, L.A., Tamò, G.E., Dal Peraro, M., 2019. A further leap of improvement in tertiary structure prediction in CASP13 prompts new routes for future assessments. Proteins: Structure, Function, and Bioinformatics 87, 1100–1112.
  35. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., others, 2021. Applying and improving AlphaFold at CASP14. Proteins: Structure, Function, and Bioinformatics.
  36. Evans, R., O’Neill, M., Pritzel, A., Antropova, N., Senior, A.W., Green, T., Žídek, A., Bates, R., Blackwell, S., Yim, J., others, 2021. Protein complex prediction with AlphaFold-Multimer. bioRxiv.
  37. Baek, M., DiMaio, F., Anishchenko, I., Dauparas, J., Ovchinnikov, S., Lee, G.R., Wang, J., Cong, Q., Kinch, L.N., Schaeffer, R.D., others, 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373, 871–876.
  38. Kandathil, S.M., Greener, J.G., Lau, A.M., Jones, D.T., 2021. Ultrafast end-to-end protein structure prediction enables high-throughput exploration of uncharacterised proteins. bioRxiv.
  39. Ołdziej, S., Czaplewski, C., Liwo, A., Chinchio, M., Nanias, M., Vila, J.A., Khalili, M., Arnautova, Y.A., Jagielska, A., Makowski, M., others, 2005. Physics-based protein-structure prediction using a hierarchical protocol based on the UNRES force field: assessment in two blind tests. Proceedings of the National Academy of Sciences 102, 7547–7552.
  40. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I., 2017. Attention is all you need. In: Advances in Neural Information Processing Systems. pp. 5998–6008.
  41. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., Ronneberger, O., Tunyasuvunakool, K., Bates, R., Žídek, A., Potapenko, A., others, 2021. Highly accurate protein structure prediction with AlphaFold. Nature 1–11.
  42. Waldrop, M.M., 2019. News Feature: What are the limits of deep learning? Proceedings of the National Academy of Sciences 116, 1074–1077.
  43. Marks, D.S., Hopf, T.A., Sander, C., 2012. Protein structure prediction from sequence variation. Nature Biotechnology 30, 1072–1080.
  44. Schwarz, D., Merget, B., Deane, C., Fulle, S., 2019. Modeling conformational flexibility of kinases in inactive states. Proteins: Structure, Function, and Bioinformatics 87, 943–951.
  45. Toth-Petroczy, A., Palmedo, P., Ingraham, J., Hopf, T.A., Berger, B., Sander, C., Marks, D.S., 2016. Structured states of disordered proteins from genomic sequences. Cell 167, 158–170.
  46. Hopf, T.A., Ingraham, J.B., Poelwijk, F.J., Schärfe, C.P.I., Springer, M., Sander, C., Marks, D.S., 2017. Mutation effects predicted from sequence co-variation. Nature Biotechnology 35, 128–135.
  47. Kaján, L., Hopf, T.A., Kalaš, M., Marks, D.S., Rost, B., 2014. FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinformatics 15, 1–6.
  48. Seemayer, S., Gruber, M., Söding, J., 2014. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics 30, 3128–3130.
  49. Jones, D.T., Singh, T., Kosciolek, T., Tetchner, S., 2015. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999–1006.
  50. Lin, Z., Akin, H., Rao, R., Hie, B., Zhu, Z., Lu, W., Smetanin, N., Verkuil, R., Kabeli, O., Shmueli, Y., dos Santos Costa, A., Fazel-Zarandi, M., Sercu, T., Candido, S., Rives, A., 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130.
  51. Elofsson, A., 2022. Protein Structure Prediction until CASP15. arXiv preprint arXiv:2212.07702.
  52. Baek, M., McHugh, R., Anishchenko, I., Baker, D., DiMaio, F., 2022. Accurate prediction of nucleic acid and protein-nucleic acid complexes using RoseTTAFoldNA. bioRxiv 2022–09.
  53. Ahdritz, G., Bouatta, N., Kadyan, S., Xia, Q., Gerecke, W., O’Donnell, T.J., Berenberg, D., Fisk, I., Zanichelli, N., Zhang, B., others, 2022. OpenFold: Retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. bioRxiv 2022–11.
  54. Li, Z., Liu, X., Chen, W., Shen, F., Bi, H., Ke, G., Zhang, L., 2022. Uni-Fold: an open-source platform for developing protein folding models beyond AlphaFold. bioRxiv 2022–08.
  55. Tunyasuvunakool, K., Adler, J., Wu, Z., Green, T., Zielinski, M., Žídek, A., Bridgland, A., Cowie, A., Meyer, C., Laydon, A., others, 2021. Highly accurate protein structure prediction for the human proteome. Nature 596, 590–596.
  56. Rapaport, D.C., 2004. The art of molecular dynamics simulation. Cambridge University Press.
  57. Lane, T.J., 2023. Protein structure prediction has reached the single-structure frontier. Nature Methods 1–4.
  58. Hekkelman, M.L., de Vries, I., Joosten, R.P., Perrakis, A., 2023. AlphaFill: enriching AlphaFold models with ligands and cofactors. Nature Methods 20, 205–213.
  59. Olsen, J.V., Mann, M., 2013. Status of large-scale analysis of post-translational modifications by mass spectrometry. Molecular & Cellular Proteomics 12, 3444–3452.
  60. Bagdonas, H., Fogarty, C.A., Fadda, E., Agirre, J., 2021. The case for post-predictional modifications in the AlphaFold Protein Structure Database. Nature Structural & Molecular Biology 28, 869–870.
  61. McBride, J.M., Polev, K., Reinharz, V., Grzybowski, B.A., Tlusty, T., 2022. AlphaFold2 can predict single-mutation effects on structure and phenotype. bioRxiv 2022–04.
  62. Akdel, M., Pires, D.E.V., Pardo, E.P., Jänes, J., Zalevsky, A.O., Mészáros, B., Bryant, P., Good, L.L., Laskowski, R.A., Pozzati, G., others, 2022. A structural biology community assessment of AlphaFold2 applications. Nature Structural & Molecular Biology 1–12.
  63. Ma, W., Zhang, S., Li, Z., Jiang, M., Wang, S., Lu, W., Bi, X., Jiang, H., Zhang, H., Wei, Z., 2022. Enhancing Protein Function Prediction Performance by Utilizing AlphaFold-Predicted Protein Structures. Journal of Chemical Information and Modeling 62, 4008–4017.
  64. Liu, Z., Pan, W., Li, W., Zhen, X., Liang, J., Cai, W., Xu, F., Yuan, K., Lin, G.N., 2022. Evaluation of the Effectiveness of Derived Features of AlphaFold2 on Single-Sequence Protein Binding Site Prediction. Biology 11, 1454.
  65. Ma, P., Li, D.-W., Brüschweiler, R., 2023. Predicting protein flexibility with AlphaFold. Proteins: Structure, Function, and Bioinformatics.
  66. Bordin, N., Sillitoe, I., Nallapareddy, V., Rauer, C., Lam, S.D., Waman, V.P., Sen, N., Heinzinger, M., Littmann, M., Kim, S., others, 2023. AlphaFold2 reveals commonalities and novelties in protein structure space for 21 model organisms. Communications Biology 6, 160.
  67. Hsu, C., Verkuil, R., Liu, J., Lin, Z., Hie, B., Sercu, T., Lerer, A., Rives, A., 2022. Learning inverse folding from millions of predicted structures. In: International Conference on Machine Learning. PMLR, pp. 8946–8970.
  68. Saldaño, T., Escobedo, N., Marchetti, J., Zea, D.J., Mac Donagh, J., Velez Rueda, A.J., Gonik, E., García Melani, A., Novomisky Nechcoff, J., Salas, M.N., others, 2022. Impact of protein conformational diversity on AlphaFold predictions. Bioinformatics 38, 2742–2748.
  69. Wayment-Steele, H.K., Ovchinnikov, S., Colwell, L., Kern, D., 2022. Prediction of multiple conformational states by combining sequence clustering with AlphaFold2. bioRxiv 2022–10.
  70. Stein, R.A., Mchaourab, H.S., 2022. SPEACH_AF: Sampling protein ensembles and conformational heterogeneity with Alphafold2. PLOS Computational Biology 18, e1010483.
  71. Del Alamo, D., Sala, D., Mchaourab, H.S., Meiler, J., 2022. Sampling alternative conformational states of transporters and receptors with AlphaFold2. eLife 11, e75751.
  72. Klarner, L., Reutlinger, M., Schindler, T., Deane, C., Morris, G., n.d. Bias in the Benchmark: Systematic experimental errors in bioactivity databases confound multi-task and meta-learning algorithms. In: ICML 2022 2nd AI for Science Workshop.

Carlos Outeiral

I am an Eric Schmidt AI in Science fellow at the University of Oxford, where I am also a Stipendiary Lecturer in Biochemistry. I write about AI, protein science and the future of techbio.