Biochemical Society Transactions

Joint Sino–U.K. Protein Symposium: a Meeting to Celebrate the Centenary of the Biochemical Society

Collation and data-mining of literature bioactivity data for drug discovery

Louisa J. Bellis , Ruth Akhtar , Bissan Al-Lazikani , Francis Atkinson , A. Patricia Bento , Jon Chambers , Mark Davies , Anna Gaulton , Anne Hersey , Kazuyoshi Ikeda , Felix A. Krüger , Yvonne Light , Shaun McGlinchey , Rita Santos , Benjamin Stauch , John P. Overington


Translating the huge amount of genomic and biochemical data into new drugs is a costly and challenging task. Historically, there has been comparatively little focus on linking the biochemical and chemical worlds. To address this need, we have developed ChEMBL, an online resource of small-molecule SAR (structure–activity relationship) data, which can be used to support chemical biology, lead discovery and target selection in drug discovery. The database contains the abstracted structures, properties and biological activities for over 700000 distinct compounds and more than 3 million bioactivity records abstracted from over 40000 publications. Additional public domain resources can be readily integrated into the same data model (e.g. PubChem BioAssay data). The compounds in ChEMBL are largely extracted from the primary medicinal chemistry literature, and are therefore usually ‘drug-like’ or ‘lead-like’ small molecules with full experimental context. The data cover a significant fraction of the discovery of modern drugs, and are useful in a wide range of drug design and discovery tasks. In addition to the compound data, ChEMBL also contains information for over 8000 protein, cell line and whole-organism ‘targets’, with over 4000 of those being proteins linked to their underlying genes. The database is searchable both chemically, using an interactive compound sketch tool, protein sequences, family hierarchies, SMILES strings, compound research codes and key words, and biologically, using a variety of gene identifiers, protein sequence similarity and protein families. The information retrieved can then be readily filtered and downloaded into various formats. ChEMBL can be accessed online at

  • bioactivity data
  • ChEMBL
  • data-mining
  • drug discovery
  • structure–activity relationship (SAR)


Despite great advances in the development of biotechnology approaches, the majority of current therapeutics are still small-molecule drugs [1]; a small-molecule drug can be usefully defined as a non-polymeric low-molecular-mass (<1000 Da) organic compound. Several web-based databases and archives are already available that allow searching of small molecules to aid drug discovery. Examples of such databases include ChEBI [2], ZINC [3], PubChem [4], DrugBank [5], IUPHAR-DB [6] and KEGG [7]. Historically, however, there has been a lack of databases that contain the associated experimentally defined quantitative bioactivity data for these compounds. The link between the chemical and biological worlds is therefore lacking, and this is arguably a significant constraint in the translation of investments in genomics programs through to new therapeutics. Quantitative bioactivity data are essential for assisting in the design and implementation of chemical biology experiments and are also crucial for informed target selection during drug discovery. Following its initial development as part of a computational technology platform within a biotechnology company, the data within ChEMBL were released into the public domain in 2009. The trend of development of ‘open’ chemistry resources is the subject of a number of reviews [8,9].

The ChEMBL database

The ChEMBL database is implemented as an Oracle database with a chemical structure cartridge to allow chemical similarity and substructure searching. The core ChEMBL data are extracted manually from the primary literature, and curation is performed to enhance the usability of the original published data. For example, the molecular targets of a reported assay are assigned using UniProt identifiers as the reference protein resource [10], and provided with searchable amino acid sequences. Chemical structures are represented in a standard format and normalized using a series of business rules (for example, the representation of protonation states on salts or nitro groups). Duplicated compound structures from different publications are consolidated using the InChI representation [11]. A set of standard physicochemical descriptors is computed (e.g. logP, the number of rotatable bonds or polar surface area); these are typically used in the selection of more drug-like compounds [12]. Assays are curated in thematic areas, and attempts are made to apply consistent descriptions to the same or related assays from different publications. A key advantage of the organization of the data is that assays from the same publication or depositor are run with the same protocol, and can be considered comparable, whereas data from nominally similar assays, but performed in different laboratories, at different times, can still be pooled if required. One issue with early literature is that our understanding of the molecular basis of pharmacology was incomplete; for example, although there are many publications related to muscarinic antagonists from the 1980s, it was only much later, once the field had identified the molecular target and discovered five receptor subtypes [13–15], that it became possible to assign a specific gene to the bioassay. We have developed a way of encoding the necessary ambiguity in the target assignment for certain studies, and alerting the user to this issue.
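The consolidation of duplicated structures by InChI can be illustrated with a minimal sketch. The record layout, the function name `consolidate_by_inchi` and the document identifiers are assumptions made for illustration; they are not taken from the ChEMBL schema or curation pipeline. The only idea carried over from the text is that two records with an identical InChI string are treated as the same compound.

```python
from collections import defaultdict

# Hypothetical records: (document id, name as reported, standard InChI).
# The InChI strings are the standard ones for ethanol and benzene.
records = [
    ("doc_A", "ethanol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("doc_B", "ethyl alcohol", "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3"),
    ("doc_B", "benzene", "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H"),
]


def consolidate_by_inchi(records):
    """Merge records that share an identical InChI string.

    Returns a dict keyed by InChI, holding every name and source
    document under which the structure was reported.
    """
    merged = defaultdict(lambda: {"names": set(), "documents": set()})
    for doc_id, name, inchi in records:
        merged[inchi]["names"].add(name)
        merged[inchi]["documents"].add(doc_id)
    return dict(merged)


compounds = consolidate_by_inchi(records)
for inchi, info in compounds.items():
    print(inchi, sorted(info["names"]), sorted(info["documents"]))
```

In this toy input the two ethanol records collapse into one entry that remembers both source documents, which is the behaviour needed to pool bioactivity data for a compound reported in several papers.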

On top of the database, we have built a web application to allow easy querying, analysis and retrieval of data, using a variety of search strategies. Additional programmatic access is now also available through a REST web-services API (see for details). Given the sensitivity of chemical structure searches, all data access to the database is routed through industry-standard TLS/SSL protocols (also known as https). Additionally, the database is freely available for download from ftp servers in a variety of formats (Oracle and MySQL), allowing local data integration, calculation of enhanced descriptor sets and so forth. Finally, the data are also available in a pre-configured virtual machine as a MySQL database for installation in ‘the Cloud’ (see for more details). The ChEMBL database is licensed under a non-restrictive Creative Commons licence (

An overview of the data within ChEMBL

In the current release (June 2011), including the pooled bioactivity data from PubChem, there were 955004 distinct compounds that modulate ~8370 targets. The data model maps the small-molecule structures to their targets and also their related functional effects, from either in vitro or in vivo experiments. It captures a broad (but necessarily incomplete) range of compound SAR (structure–activity relationship) data for synthetic small molecules, short peptides and natural products. The bioassay data reported are typically a mixture of human and model organism (primarily rat) binding and pharmacological data, as well as pre-clinical validation and safety pharmacology testing. We also capture and integrate compounds that reach the assignment of generic names [USANs (U.S. Adopted Names) and INNs (International Nonproprietary Names)], and also those that become approved drugs. Ultimately, as we increase the scope and depth of curation, the database will allow ready searching from pre-clinical discovery through clinical development and post-marketing studies.
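The mapping from compounds through assays to targets can be sketched with a small relational model. The table and column names below are hypothetical and far simpler than the real ChEMBL schema, and the rows are placeholder values rather than ChEMBL data; SQLite is used only so that the example runs without a database server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE compound (compound_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE target   (target_id INTEGER PRIMARY KEY, name TEXT, target_type TEXT);
CREATE TABLE assay    (assay_id INTEGER PRIMARY KEY,
                       target_id INTEGER REFERENCES target(target_id),
                       description TEXT);
CREATE TABLE activity (activity_id INTEGER PRIMARY KEY,
                       compound_id INTEGER REFERENCES compound(compound_id),
                       assay_id INTEGER REFERENCES assay(assay_id),
                       measure TEXT, value REAL, units TEXT);

-- Placeholder rows for illustration only.
INSERT INTO compound VALUES (1, 'compound X'), (2, 'compound Y');
INSERT INTO target   VALUES (1, 'protein P', 'protein'),
                            (2, 'cell line C', 'cell line');
INSERT INTO assay    VALUES (1, 1, 'binding assay'), (2, 2, 'growth inhibition');
INSERT INTO activity VALUES (1, 1, 1, 'IC50', 50.0, 'nM'),
                            (2, 2, 1, 'IC50', 800.0, 'nM'),
                            (3, 1, 2, 'GI50', 2000.0, 'nM');
""")


def activities_for_target(conn, target_name):
    """Return (compound, measure, value, units) rows for one target,
    most potent (lowest value) first."""
    query = """
        SELECT c.name, a.measure, a.value, a.units
        FROM activity a
        JOIN compound c ON c.compound_id = a.compound_id
        JOIN assay s    ON s.assay_id = a.assay_id
        JOIN target t   ON t.target_id = s.target_id
        WHERE t.name = ?
        ORDER BY a.value
    """
    return conn.execute(query, (target_name,)).fetchall()


print(activities_for_target(conn, "protein P"))
```

Routing every activity through an assay record, rather than linking compounds to targets directly, keeps the experimental context attached to each measurement, so that results from the same assay can be compared while results from different assays remain distinguishable.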

We have additionally developed a series of thematic portals to the underlying unified ChEMBL data (Figure 1); these are targeted towards users with a particular focus to their research, either in a specific disease area or around a particular gene family. This latter view is particularly important given the common assays, selectivity concerns, shared features of molecular recognition and the historical success of particular gene families for drug discovery {e.g. rhodopsin-like GPCRs (G-protein-coupled receptors) [16] and protein kinases [17]}. The engineering of these portals allows their ready reconfiguration to cover either other disease areas or new gene families. As an independent example of the sort of integrative and analytical power when this chemogenomics paradigm is applied to a disease area and expanded to include expression, pathway and protein–protein interaction data, see

Figure 1 Architecture of the ChEMBL database

Examples of these thematic access portals include the following.

(i) ChEMBL-NTD ( is a repository for open access/open data primary screening and medicinal chemistry data for ‘neglected diseases’, primarily the endemic tropical diseases of the developing regions of Africa, Asia and the Americas (for example, malaria). The primary purpose of ChEMBL-NTD is to provide an annotated and focused subset of the ChEMBL data, as a freely accessible and permanent public archive and download site. GlaxoSmithKline [18], Novartis [19] and St. Jude Children's Research Hospital [20] have all donated data so far.

(ii) Kinase SARfari ( is an integrated chemogenomics workbench focused on protein kinases; these multi-domain enzymes are a large family of signalling proteins that control many important cellular processes [21]. Given that the enzymes all use ATP as a cofactor, it was an initial surprise that ‘selective’ inhibitors could be developed [22], and subsequent exploration of the protein family from a medicinal chemistry perspective has led to the development of 11 currently approved drugs (primarily for oncology indications), and in excess of 320 protein kinase inhibitors that have been progressed as far as clinical trials. The system incorporates and links protein kinase sequence (extracted and curated from UniProt) and structure (extracted and curated from PDBe), alongside compound and screening data from ChEMBL (restricted to compounds that bind to the shared protein kinase domain). In total, there are 33389 compounds that are specific for the kinase regions of targets.

(iii) GPCR SARfari is a similar integrated chemogenomics workbench focused on rhodopsin-like GPCRs. This family of integral membrane receptors has proved to be the single most successful drug target family to date [16], with many clinically successful drug classes targeting it. Owing to the common architecture of the binding sites for GPCR ligands, and the presence of a large number of family members, often showing tissue-specific expression, drugs often show binding promiscuity across the family [23] that can subsequently be optimized further to give new therapeutics. The interest in GPCRs is enhanced further by the presence of many orphan receptors, for which the endogenous ligand and role in biology still have to be established. Arguably, the lack of three-dimensional structures for this family has held back the application of a full range of approaches to drug discovery. However, this has changed substantially, with over 48 GPCR structure entries now in the PDB, covering seven distinct receptors [24–30]. Owing to the flexibility of the family, and the nature of ligand-induced structural changes associated with agonist or antagonist conformations, comparative structural studies and comparative modelling are receiving more attention and, more importantly, are starting to have an impact on compound design. The current GPCR SARfari portal ( incorporates and links GPCR sequence, structure, compounds and screening data. In total, there are 118834 compounds that are specific for the GPCR regions of targets.

The web front end to the ChEMBL database allows a series of flexible query strategies, for example browsing by target class to retrieve sets of known active compounds (Figure 2), and querying by substructure or chemical similarity to a known compound. Results are returned in a spreadsheet format, with a set of standard download formats (Figure 3). Compounds that have progressed through clinical development and are marketed drugs have specific annotation on delivery route, pharmaceutical properties and links through to current clinical trials at (Figure 4).
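Querying by chemical similarity can be sketched as follows. The text does not state which similarity measure or fingerprint the structure cartridge uses; the Tanimoto coefficient over sets of fingerprint bits is assumed here because it is a common choice, and the fingerprints, compound names and threshold are invented for illustration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits:
    shared bits divided by bits present in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints: define similarity as zero
    return len(a & b) / union


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs scoring at or above the threshold,
    most similar first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints (sets of bit positions); not derived from real structures.
library = {
    "cmpd_1": {1, 2, 3, 4},
    "cmpd_2": {1, 2, 3, 9},
    "cmpd_3": {7, 8},
}
query = {1, 2, 3, 4}

for name, score in similarity_search(query, library, threshold=0.5):
    print(f"{name}: {score:.2f}")
```

A production system would compute fingerprints from the drawn structure and use an index rather than a linear scan, but the ranking-by-threshold logic is the same.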

Figure 2 Browsing of data by target family
Figure 3 Bioactivity results page
Figure 4 Compound report and external database links


Given the data in a suitably ordered form, it is straightforward to extract data to investigate a wide range of areas, e.g. target-class/compound property trends [31] or prediction of novel targets [32]. Our own work has addressed the relationship of physical properties with the factors involved in drug attrition during pre-clinical and clinical development [33]. For example, Figure 5 displays a scatterplot of compound affinities as a function of molecular mass and calculated logP. The plot shows a clear general trend between lipophilicity and molecular mass for compounds within ChEMBL, but also between both molecular mass and lipophilicity and potency, so larger ‘greasier’ compounds are, on average, more potent. The mean logP and molecular mass of drugs are also marked [34]. It is of interest to note that, if the focus is on optimizing potency during lead optimization, this strategy leads to larger, more lipophilic compounds, with increased risk of eventual development failure. Strategies to counteract this trend include the incorporation of ligand efficiency concepts [35] for both molecular mass and lipophilicity [36], which can be used to optimize drug efficacy while balancing compound properties with affinity requirements.
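The efficiency concepts mentioned above can be made concrete with a short sketch. The text cites the concepts without giving formulae, so the definitions below are assumptions based on commonly used forms: ligand efficiency as approximately 1.37 × pIC50 per heavy atom (kcal/mol per atom near room temperature, using pIC50 as a stand-in for binding affinity) and lipophilic efficiency as pIC50 minus logP. The function names and the example values are illustrative only.

```python
import math


def pic50_from_nm(ic50_nm):
    """Convert an IC50 in nanomolar to pIC50 = -log10(IC50 in mol/l)."""
    return -math.log10(ic50_nm * 1e-9)


def ligand_efficiency(pic50, heavy_atoms):
    """Approximate binding free energy per heavy atom (kcal/mol per atom).

    1.37 is roughly 2.303 * R * T near room temperature; using pIC50 in
    place of a true dissociation constant is itself an approximation.
    """
    return 1.37 * pic50 / heavy_atoms


def lipophilic_efficiency(pic50, logp):
    """Potency corrected for lipophilicity: pIC50 - logP."""
    return pic50 - logp


# Two hypothetical compounds with the same potency but different size
# and lipophilicity.
pic50 = pic50_from_nm(10.0)  # a 10 nM compound
small = (ligand_efficiency(pic50, 25), lipophilic_efficiency(pic50, 2.5))
large = (ligand_efficiency(pic50, 40), lipophilic_efficiency(pic50, 5.0))
print(f"pIC50={pic50:.1f} small={small} large={large}")
```

At equal potency, the smaller and less lipophilic compound scores higher on both measures, which is how these metrics counteract the drift towards large ‘greasy’ molecules when potency alone is optimized.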

Figure 5 Scatterplot of compound affinity (pIC50 by colour) against logP (AlogP) and molecular mass (MWt)

The broken lines indicate the boundary of two of the Rule of Five parameters, and the circled area indicates the typical properties of historically successful drugs.
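For reference, the Rule of Five thresholds can be expressed as a simple check. The sketch below uses the standard published cut-offs (molecular mass of at most 500 Da, logP of at most 5, at most 5 hydrogen-bond donors and at most 10 hydrogen-bond acceptors); the function name and the example inputs are illustrative. The two boundaries drawn in Figure 5 are presumably the molecular mass and logP limits, given the axes of the plot.

```python
def rule_of_five_violations(mol_mass, logp, h_donors, h_acceptors):
    """Count how many of the four Rule of Five criteria a compound breaks."""
    checks = [
        mol_mass > 500,    # molecular mass above 500 Da
        logp > 5,          # calculated logP above 5
        h_donors > 5,      # more than 5 hydrogen-bond donors
        h_acceptors > 10,  # more than 10 hydrogen-bond acceptors
    ]
    return sum(checks)


# Hypothetical property values for two compounds.
print(rule_of_five_violations(350.0, 2.5, 2, 5))  # within all four limits
print(rule_of_five_violations(650.0, 6.1, 2, 5))  # breaks the mass and logP limits
```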

Analysis of the compounds in the ChEMBL database also allows us to investigate how the properties of molecules synthesized to support drug discovery programmes have changed over the period covered by the literature data. Figure 6 shows that the mean molecular mass, PSA (polar surface area) and AlogP of compounds reported in the scientific literature have all increased since the mid-1980s. There was a step increase in these properties in the early 1990s which can potentially be attributed to the use of combinatorial chemistry to synthesize large libraries of compounds, the ability to screen large numbers of compounds using high-throughput technologies, or a switch in target classes, among a very wide range of possible causes. The increases are substantial; for example, mean AlogP increased from 2.5 in the mid-1980s to over 3 by the mid-1990s and, similarly, the mean molecular mass increased from ~340 to over 400 in the same period. Although there will be no one single factor to explain or predict the decreasing rate of discovery of new drugs, having data available to test and propose hypotheses is a key advance for the field [33].
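The kind of aggregation behind such a trend plot can be sketched in a few lines. The row layout and function name are assumptions, and the three rows are placeholder values chosen for illustration, not values extracted from ChEMBL.

```python
from collections import defaultdict
from statistics import mean


def mean_property_by_year(rows, prop):
    """Average one numeric property per publication year, in year order."""
    by_year = defaultdict(list)
    for row in rows:
        by_year[row["year"]].append(row[prop])
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Placeholder compound records; a real analysis would read one row per
# compound with its year of first publication and computed descriptors.
rows = [
    {"year": 1985, "mol_mass": 330.0, "alogp": 2.4},
    {"year": 1985, "mol_mass": 350.0, "alogp": 2.6},
    {"year": 1995, "mol_mass": 410.0, "alogp": 3.1},
]

print(mean_property_by_year(rows, "mol_mass"))
print(mean_property_by_year(rows, "alogp"))
```

Because the means are computed per year rather than per publication, a single large paper can dominate a year; weighting or medians may be preferable depending on the question being asked.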

Figure 6 Plot of publication year, molecular mass (square, MWT), AlogP (triangle) and PSA (circle)


This work is supported by the Wellcome Trust, and the member states of EMBL. F.A.K. is a student of Fitzwilliam College, R.S. is a student of Girton College, and B.S. is a student of Robinson College, University of Cambridge.


We greatly thank the authors of all of the papers we have currently abstracted, and the diverse community of ChEMBL users for their valuable feedback.


  • Joint Sino–U.K. Protein Symposium: a Meeting to Celebrate the Centenary of the Biochemical Society: A Biochemical Society Focused Meeting held at Shanghai University, Shanghai, China, 5–7 May 2011. Organized by Tom Blundell (Cambridge, U.K.), Zengyi Chang (Peking University, China), Ian Dransfield (Edinburgh, U.K.), Neil Isaacs (Glasgow, U.K.), Glenn King (University of Queensland, Australia), Sheena Radford (Leeds, U.K.), Zihe Rao (Nankai University, China), Yi-Gong Shi (Tsinghua University, China), Chihchen (Zhizhen) Wang (Institute of Biophysics, Chinese Academy of Sciences, China), Jiarui Wu (Shanghai Institute of Biological Sciences, China) and Xian-En Zhang (Ministry of Science and Technology, China). Edited by Zengyi Chang and Neil Isaacs.

Abbreviations: GPCR, G-protein-coupled receptor; PSA, polar surface area

