Computational Biology and Systems Pharmacology

In a nutshell

Our research enables both rational drug discovery and design, based on detailed knowledge of the structures and functions of molecular targets and the complex biochemical sequences, complexes, and systems in which they play their parts.

Computation is also applied to quantitative and systems pharmacology—developing mathematical models to simulate, analyze, and predict the interplay between the multiple variables affecting a treatment’s efficacy, from a drug’s rates of absorption, distribution, metabolism, and excretion (pharmacokinetics) based on patient genetics and health status, to the effect of a given dose size over time against a particular disease phenotype (pharmacodynamics) and even interactions with other drugs and off-target effects.

Researchers here develop and apply computation and bioinformatics to:

Create analytical and predictive models, integrating data from in vitro, animal, and clinical studies, to more efficiently evaluate drugs against tuberculosis, thus speeding development of faster acting and more effective regimens.
Design new biomolecules that can act as sensors and actuators, dynamically probing and controlling complex biological processes in living cells.
Analyze genetic data from thousands of different types of bacteria that live on or in our bodies to discover potential new drugs among the small molecules they generate to interact with each other and with their human hosts.
Help to determine the precise, accurate functions of millions of unknown enzymes—molecules that catalyze biochemical reactions essential to life—discovered via genome sequencing.
Integrate data from experimental sources, biophysical theories, and statistical analyses to create 3D models of macromolecular complexes—vital sub-cellular machinery comprised of hundreds of varied proteins.

Discovering potential drugs among molecules expressed by human-associated bacteria

The bacteria on and in our bodies outnumber our own cells by about 10 to 1. This microbial community (microbiome) and its relationship to human health and disease are under increasing study.

The National Institutes of Health-funded Human Microbiome Project (HMP) generated data on hundreds of species and thousands of strains of human-associated bacteria, with further analyses showing the import of commensal bacterial communities and how changes in their composition are associated with disorders, including pathogenic infections (bacterial vaginosis) and inflammatory diseases (Crohn's disease).

Concurrently, bacteria found in soil and marine environments have long been known to generate small molecules (e.g., peptides) to regulate or fend off fellow microbes and mediate host-organism responses to their presence. Indeed, such molecules have been a major source of antibiotics, immunosuppressants, anti-cancer agents, and other drugs.

Department scientists are pioneering efforts to discover and characterize drug-like small molecules expressed by bacteria in the human microbiota. They seek to find the specific molecular interactions that underlie health-enhancing microbiome effects. Those molecules could be optimized to treat disease as new drugs or via engineered bacterial strains (probiotics).

Researchers here have developed an algorithm called ClusterFinder that analyzes bacterial genomic data for biosynthetic gene clusters (BGCs), related genes that encode functionally similar drug-like polypeptides.

Numbers and locations of biosynthetic gene clusters found in metagenomic sequencing data from the Human Microbiome Project (drawn from 752 samples collected from five body sites of healthy subjects).

Using ClusterFinder, department scientists and their collaborators performed the first systematic, global identification and significance analysis of such gene clusters in the human microbiome. They identified 14,000 BGCs in nearly 2,500 genomes of human-associated bacteria from different body sites, then studied their representation in hundreds of metagenomic samples (mixed bacterial community genomes) from healthy subjects in the Human Microbiome Project. Reasoning that the most widely distributed small-molecule products of BGCs are the most likely to mediate evolutionarily conserved, and thus significant, microbe-host and microbe-microbe interactions, they applied another algorithm to identify more than 3,000 such BGCs that are present in healthy individuals' microbiota but lack known functions.

Further analysis revealed that some of those gene clusters express thiopeptides, small molecules previously isolated from soil and marine bacteria that are currently in clinical trials as a new class of antibiotics. The researchers purified one such product from a common vaginal bacterium, Lactobacillus gasseri, and showed that it had potent antibacterial activity against a range of pathogenic vaginal bacteria.
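The prevalence reasoning described above (that the most widely distributed BGCs are the most likely to matter) can be illustrated with a minimal sketch. This is not ClusterFinder or the second algorithm the researchers used, neither of which is specified here. The function name, the `min_fraction` cutoff, and the sample and BGC identifiers are all hypothetical, and the sketch assumes that each metagenomic sample has already been reduced to a list of the BGCs detected in it.

```python
from collections import Counter


def rank_bgcs_by_prevalence(sample_to_bgcs, min_fraction=0.5):
    """Rank BGCs by the fraction of samples in which they were detected.

    sample_to_bgcs maps a sample ID to the BGC IDs found in that sample.
    Returns (bgc_id, fraction) pairs, most widespread first, keeping only
    BGCs present in at least min_fraction of the samples.
    """
    n_samples = len(sample_to_bgcs)
    if n_samples == 0:
        return []
    counts = Counter()
    for bgcs in sample_to_bgcs.values():
        counts.update(set(bgcs))  # count each BGC at most once per sample
    ranked = [
        (bgc, n / n_samples)
        for bgc, n in counts.items()
        if n / n_samples >= min_fraction
    ]
    # Highest prevalence first; ties broken alphabetically for stable output.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked


# Hypothetical toy data: four samples, three candidate BGCs.
samples = {
    "gut_01": ["bgc_A", "bgc_B"],
    "gut_02": ["bgc_A", "bgc_C"],
    "oral_01": ["bgc_A", "bgc_B"],
    "skin_01": ["bgc_B"],
}
widespread = rank_bgcs_by_prevalence(samples, min_fraction=0.5)
```

In this toy data set, `bgc_A` and `bgc_B` each appear in three of four samples and pass the cutoff, while `bgc_C` appears in only one and is dropped. A real analysis would also need to account for sequencing depth and body site, which this sketch ignores.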

Computational design and engineering of biological sensors and actuators

In living cells and organisms, key protein molecules sense environmental signals in the form of small molecules and respond by activating or adjusting biological pathways: complex biochemical sequences that, for example, regulate blood sugar levels, alter the behavior of bacterial communities in our bodies (see section above), or cause unneeded and aberrant cells to self-destruct.

Department scientists are pioneering the computer-aided design of new biomolecules that act as sensor/actuators, allowing them to detect diverse, small molecule signals (even currently undetectable ones such as new man-made compounds useful for biological engineering, or environmental toxins), and dynamically control complex biology in living cells. This will enable them to probe diseases, re-engineer metabolic processes, or even create new types of biological programs from scratch.

Diagram of sensor/actuator design strategy. Protein-protein interfaces are engineered so they interact in the presence of a small molecule binding target, thus activating a reporter output (e.g., green fluorescent protein, catalytic action, or gene expression). Design of specific binding sites uses computation to program defined binding site geometries (oval) into the interface.

To achieve this, department scientists are developing a technology for reprogramming protein-protein interactions, using computation to precisely alter protein binding interfaces (re-designing them or transplanting binding sites from other molecules) such that they can only join together in the presence of a specific small molecule, ranging from an introduced drug to an endogenous neurotransmitter. The conjoined proteins can thus act as a sensor, for example by being attached to a split reporter such as a fluorescent protein that glows when its split halves are combined, signaling the presence and quantity of a small molecule in real time in living systems. Because the engineered systems are modular, the sensor can also be linked to a split enzyme, which can then act as an actuator by catalyzing chemical reactions. Further still, the signal could activate a split gene expression system, potentially coupling the sensors to any selected biological response.

The work being done here draws on a number of disciplines, including mathematics used in robotics to precisely position robot arms given set joints (accounting for all the different ways in which such an arm must twist, bend, and rotate), which can be applied to the positioning of functional binding sites given a protein molecule’s chemical bond lengths, angles, and complex atomic interactions.
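As a rough illustration of the robotics analogy, the sketch below solves the textbook inverse kinematics problem for a planar arm with two rigid links: given a target position for the tip, it computes the joint angles that reach it. This is only a toy stand-in, not the method the department uses. Real protein backbones have many more degrees of freedom and move in three dimensions. The function names and the link lengths, which play the role of fixed bond lengths here, are illustrative assumptions.

```python
import math


def two_link_ik(x, y, l1, l2):
    """Return (shoulder, elbow) angles in radians that place the tip of a
    planar two-link arm with link lengths l1 and l2 at the point (x, y).

    Raises ValueError if the target is out of reach. Only one of the two
    mirror-image solutions is returned; negating the elbow angle and
    recomputing the shoulder angle gives the other.
    """
    d2 = x * x + y * y
    # Law of cosines gives the elbow angle from the target distance.
    cos_elbow = (d2 - l1 * l1 - l2 * l2) / (2.0 * l1 * l2)
    if not -1.0 <= cos_elbow <= 1.0:
        raise ValueError("target is out of reach for these link lengths")
    elbow = math.acos(cos_elbow)
    shoulder = math.atan2(y, x) - math.atan2(
        l2 * math.sin(elbow), l1 + l2 * math.cos(elbow)
    )
    return shoulder, elbow


def forward(shoulder, elbow, l1, l2):
    """Forward kinematics: tip position for the given joint angles."""
    x = l1 * math.cos(shoulder) + l2 * math.cos(shoulder + elbow)
    y = l1 * math.sin(shoulder) + l2 * math.sin(shoulder + elbow)
    return x, y


# Place the tip at (1, 1) with two unit-length links, then verify the
# answer by running the angles back through the forward model.
shoulder, elbow = two_link_ik(1.0, 1.0, 1.0, 1.0)
tip_x, tip_y = forward(shoulder, elbow, 1.0, 1.0)
```

The same pattern of solving for angles under fixed lengths and then checking the result with the forward model carries over conceptually to positioning binding-site residues, although the actual geometry is far more constrained.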

An early version of the technology designed a biosensor that was experimentally successful at detecting farnesyl pyrophosphate (FPP) in living bacteria. FPP is a central metabolic building block used to produce the anti-malaria drug artemisinin, so its detection could be used to help engineer cells to optimize its production, as well as that of other drugs, valuable chemicals, and fuels.

Laying the foundation for genomic enzymology: Using curation and bioinformatics to integrate sequence, structure, and function

Genome sequencing technology has discovered the amino acid sequences for tens of millions of protein molecules in thousands of species, with databases growing at exponential rates. These include the sequences for millions of enzymes, which catalyze biochemical reactions essential to health and life.

If scientists could readily assign function to (annotate) those enzymes from genomic sequence and/or 3D structures, the knowledge could greatly enable therapeutic interventions, either by targeting key enzymes’ activity with small molecule drugs or engineering proteins based on discoveries about sequence-structure-function relationships for use as treatments.

However, the vast majority of enzyme sequences have uncertain, unknown, or incorrectly assigned function. Enzymes may be misannotated by less than reliable automated efforts that are based solely on their overall sequence similarity to those with known functions. But because functional properties are often unique to the sequence and structural characteristics of each set of evolutionarily related enzymes, there is no sequence similarity threshold for accurately inferring functional equivalence.

Web page for the dipeptide epimerase enzyme family in the SFLD.

Department scientists address this challenge by developing the Structure-Function Linkage Database (SFLD). The SFLD describes sequence-structure-function relationships within functionally diverse enzyme superfamilies. The enzymes in such a superfamily can catalyze very different overall reactions, but share a common evolutionary ancestor and an aspect of chemical function, such as a partial reaction (a mechanistic step, such as removing a proton from a carbon adjacent to a carboxylic acid) as carried out by a characteristic set of catalytic site amino acids (i.e., active site residues). Superfamilies are further hierarchically classified in the database into subgroups (based on sequence), then into specific functional families, which use the same mechanistic strategy to catalyze the same overall reaction. For enzymes of unknown function, identification of an SFLD superfamily to which they belong can provide clues that can help guide experiments to deduce their specific reactions.

In this sequence similarity network, the unknown protein sequence from *M. capsulatus* (yellow rectangle) clusters more with the dipeptide epimerases (light green) than with the chloromuconate cycloisomerases (pink), or other family subsets of the enolase superfamily.

The SFLD’s core focus is a dozen such superfamilies (comprising more than one-half million enzymes as of October 2014) manually curated by department scientists and external collaborators with deep expertise in each superfamily such that they are reliably annotated (with evidence coding) and can thus serve as a “gold standard” for developing and evaluating the automated methods that are ultimately needed to handle functional inference for the vast volumes of genomic data that continue to be generated.

The database provides multiple bioinformatics and visualization tools for researchers developing hypotheses about an unknown enzyme’s function. Examples include:

Algorithms that allow users to find and score regions of sequence similarity in their sequences of interest (e.g., BLAST, hidden Markov models (HMMs)) compared to enzymes classified in the database
Links to structures and to homology models of undetermined enzyme structures created using the department-developed program Modeller
Downloadable Sequence Similarity Networks, which map proteins as nodes visually linked to each other to allow interactive exploration of sequence similarity among thousands of proteins, thus potentially mapping unknown sequences to SFLD family clusters
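The idea behind a sequence similarity network can be sketched in a few lines: proteins are nodes, an edge joins two proteins whose pairwise similarity score meets a threshold, and connected groups of nodes suggest families. The sketch below is not the SFLD's implementation. The scores, the 0.5 threshold, and the protein names are invented for illustration, and in practice the scores would come from a tool such as BLAST.

```python
def build_network(scores, threshold):
    """Build an adjacency map from pairwise similarity scores.

    scores maps (protein_a, protein_b) to a similarity value; an edge is
    added only when the value is at least `threshold`.
    """
    adjacency = {}
    for (a, b), value in scores.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if value >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def connected_components(adjacency):
    """Return the clusters (sets of node names) of the network."""
    seen = set()
    clusters = []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:  # depth-first walk over everything reachable from start
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(adjacency[node] - members)
        seen |= members
        clusters.append(members)
    return clusters


# Hypothetical similarity scores on a 0-1 scale.
scores = {
    ("epimerase_1", "epimerase_2"): 0.90,
    ("cycloisomerase_1", "cycloisomerase_2"): 0.85,
    ("unknown", "epimerase_1"): 0.70,
    ("unknown", "cycloisomerase_1"): 0.30,
    ("epimerase_1", "cycloisomerase_1"): 0.35,
}
clusters = connected_components(build_network(scores, threshold=0.5))
```

With these made-up scores the unknown sequence lands in the same cluster as the two epimerases. Raising or lowering the threshold changes how finely the network separates, which is why such networks are usually explored interactively at several cutoffs.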

Developing software to model protein assembly structures by integrating data from multiple sources

Protein molecules routinely assemble into complexes to carry out their vital tasks. Detailed knowledge of such structures aids in understanding the healthy roles of proteins as part of larger systems, as well as providing effective targets for small-molecule drugs and information about how such assemblies’ functions could be engineered. But such cellular machinery is too large, dynamic, flexible, and fragile to be fully captured by any one technique of structural biology.

Department scientists develop software for building 3D models of macromolecular assemblies by integrating all the information available about them. This combines physical theories, statistical analyses, and/or experimental data from sources such as:

Electron microscopy and small-angle X-ray scattering that capture relatively larger and coarser structural information
X-ray crystallography that provides atomic-level detail
Fluorescence resonance energy transfer (FRET), which determines structural distances, providing measures of dynamic conformational change
Cross-linking data that probes protein-protein interactions

Researchers here work to extend, enhance, and distribute the open source Integrative Modeling Platform (IMP), which approaches assembly modeling as a computational optimization problem: available information about the assembly is encoded into a scoring function that is used to evaluate candidate models generated by a configurational sampling algorithm. The scoring function is based on a Bayesian approach, which seeks to extract the maximum usable information from available data.
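The score-and-sample pattern described above can be shown with a deliberately small toy. It does not use IMP or its API. It assumes three components represented as 2D points, three hypothetical distance restraints of the kind cross-linking or FRET might supply, and a simple Metropolis Monte Carlo sampler. The sum of squared restraint violations used as the score corresponds, up to constants, to a negative log-likelihood under an assumed Gaussian noise model, which is one simple way to read a Bayesian scoring function.

```python
import math
import random


def score(coords, restraints):
    """Sum of squared violations of distance restraints (lower is better).

    Each restraint is (i, j, target_distance) between components i and j.
    """
    total = 0.0
    for i, j, target in restraints:
        total += (math.dist(coords[i], coords[j]) - target) ** 2
    return total


def sample(coords, restraints, steps=2000, step_size=0.5,
           temperature=0.1, seed=0):
    """Metropolis Monte Carlo search; returns the best model and its score."""
    rng = random.Random(seed)
    current = [tuple(p) for p in coords]
    current_score = score(current, restraints)
    best, best_score = list(current), current_score
    for _ in range(steps):
        k = rng.randrange(len(current))  # perturb one component at a time
        x, y = current[k]
        trial = list(current)
        trial[k] = (x + rng.uniform(-step_size, step_size),
                    y + rng.uniform(-step_size, step_size))
        trial_score = score(trial, restraints)
        delta = trial_score - current_score
        # Always accept improvements; sometimes accept worse models so the
        # search can escape local minima.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_score = trial, trial_score
            if current_score < best_score:
                best, best_score = list(current), current_score
    return best, best_score


# Hypothetical restraints: all three components should sit 1.0 apart.
restraints = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)]
start = [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)]
initial_score = score(start, restraints)
best, best_score = sample(start, restraints)
```

Real integrative modeling differs in scale and rigor: representations are three-dimensional and multi-resolution, restraints carry explicit uncertainty, and the output is usually an ensemble of well-scoring models rather than a single best one.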

IMP has been demonstrated by application to over 20 different macromolecular systems, including the nuclear pore complex (comprised of hundreds of proteins, the complex mediates transport of macromolecules in and out of cell nuclei) and the 26S proteasome (the cellular machine that degrades unneeded proteins). A major current application is construction of a pseudo-atomic model of the yeast spindle pole body, a microtubule organizing center.

Assembly of protein complexes from varied information — Department-developed software builds 3D models of protein complexes by combining different types of information generated by varied experimental and theoretical techniques noted in gray box (e.g., mass spectrometry proteomics can help determine relative quantities of assembly protein components). The data are converted into spatial constraints (e.g., avoiding physical overlap, accounting for non-covalent interactions, seeking lowest energy configurations), which are combined into a scoring function that guides sampling algorithms to obtain a detailed structural model.

Using quantitative systems pharmacology to speed development of better tuberculosis treatments

Tuberculosis (TB), an infection of the lungs by varied strains of Mycobacterium tuberculosis (MTB), remains a deadly worldwide scourge. In 2013, 1.5 million people died from TB, or about one person every 21 seconds. Tuberculosis is second only to HIV as an infectious killer of adults.

Scanning electron micrograph of Mycobacterium tuberculosis, the bacterium that causes tuberculosis.

TB poses a major challenge due to:

Disease dynamics: Bacterial sub-types respond differently to different drugs due to genetic and metabolic differences (e.g., rapidly dividing vs. slower-dividing, persistent populations).
Immune response: Immune cell aggregations called granulomas enclose the infection to prevent its spread, but bacteria inside are less accessible to treatment and can go into drug-resistant hibernation.
Regimen length and complexity: Treatment takes at least six months of multi-drug therapy with side effects. This gives rise to poor compliance and treatment failure, which has led to multi-drug-resistant strains that require longer treatment and carry high mortality.
Lack of reliable clinical endpoints: Current use of sputum bacterial cultures after two months to determine if new regimens should be put into costly larger trials is not reliably predictive of a durable cure.
Disease and drug interactions: In 2013, one-fourth of those who died from TB were HIV positive, yet treatments for both can negatively interact.

Department scientists and their interdisciplinary collaborators are tackling this multivariate complexity with a quantitative and systems pharmacology approach. They are developing computer models that combine multi-scale data (from cell cultures, different species, and human lung tissue, via different lab techniques) to assess a given drug's distribution from blood to lung to infection site (pharmacokinetics), and the complex mechanisms of drug interactions with the different pathogen sub-types, with each other, and with animal models as compared to humans (pharmacodynamics). Such models draw on major international databases of non-clinical (animal) and clinical (human) study findings.

The goal is to create models to more efficiently evaluate new drug regimens for whether they will work faster, be effective against drug-resistant TB, and/or be compatible with drugs for HIV. Such models will provide analytical tools to rationally guide which drugs and combinations, at which doses and dosing schedules, should move forward at each step of development from lab cultures, to animals, and to clinical studies.
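To make the pharmacokinetic and pharmacodynamic terms concrete, here is a minimal sketch of two standard textbook building blocks: a one-compartment model with first-order absorption and elimination for drug concentration over time, and an Emax model linking concentration to effect. These are generic illustrations, not the department's TB models, which combine multi-scale data and are considerably more elaborate. All parameter values are hypothetical and do not describe any real drug.

```python
import math


def concentration(t, dose, ka, ke, volume, bioavailability=1.0):
    """Plasma concentration at time t after a single oral dose.

    One-compartment model with first-order absorption (rate ka) and
    elimination (rate ke). Units here are assumed to be: dose in mg,
    volume in L, rates in 1/h, time in h.
    """
    if ka == ke:
        raise ValueError("this closed form requires ka != ke")
    scale = (bioavailability * dose * ka) / (volume * (ka - ke))
    return scale * (math.exp(-ke * t) - math.exp(-ka * t))


def time_of_peak(ka, ke):
    """Time at which the concentration curve above reaches its maximum."""
    return math.log(ka / ke) / (ka - ke)


def effect(conc, emax, ec50):
    """Emax model: the effect saturates at emax and is half-maximal when
    the concentration equals ec50."""
    return emax * conc / (ec50 + conc)


# Hypothetical parameters for illustration only.
params = dict(dose=600.0, ka=1.2, ke=0.2, volume=50.0)
t_peak = time_of_peak(params["ka"], params["ke"])
c_peak = concentration(t_peak, **params)
peak_effect = effect(c_peak, emax=1.0, ec50=2.0)
```

Chaining the two functions, as in the last lines, is the simplest form of a PK/PD link: the dose and schedule determine exposure, and exposure determines the predicted effect. Models of the kind described above extend this with tissue distribution, multiple bacterial sub-populations, and drug-drug interactions.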