Thursday 19 December 2013

 

Simplified molecular-input line-entry system (SMILES)

 

 


 
 
 

Graph-based definition

 The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes.
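The procedure above can be illustrated with a minimal sketch. It assumes a hydrogen-free graph with single bonds only and atoms from the organic subset; the function name graph_to_smiles and its input format (a list of symbols plus a list of index pairs) are hypothetical choices made for this example, not part of any SMILES toolkit. A depth-first search builds the spanning tree, and every bond that would close a cycle gets a numeric label on both of its atoms.

```python
def graph_to_smiles(atoms, bonds, start=0):
    """Write a SMILES string for a hydrogen-free graph with single bonds only.

    atoms: list of element symbols, e.g. ["C", "C", "O"] (hypothetical format).
    bonds: list of (i, j) index pairs into atoms.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    children = {i: [] for i in neighbors}  # spanning-tree edges
    closures = {i: [] for i in neighbors}  # ring closure labels per atom
    visited = set()
    broken = set()                         # cycle bonds already labelled

    def explore(atom, parent):
        visited.add(atom)
        for nb in neighbors[atom]:
            if nb == parent:
                continue
            if nb not in visited:
                # Tree edge: nb becomes a child of atom in the spanning tree.
                children[atom].append(nb)
                explore(nb, atom)
            elif frozenset((atom, nb)) not in broken:
                # This bond would close a cycle: break it and label both ends.
                broken.add(frozenset((atom, nb)))
                label = len(broken)
                closures[atom].append(label)
                closures[nb].append(label)

    def write(atom):
        text = atoms[atom]
        for label in closures[atom]:
            text += str(label) if label < 10 else f"%{label}"
        # All children but the last are written as parenthesised branches.
        for child in children[atom][:-1]:
            text += "(" + write(child) + ")"
        if children[atom]:
            text += write(children[atom][-1])
        return text

    explore(start, None)
    return write(start)


ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]
print(graph_to_smiles(["C"] * 6, ring))                         # C1CCCCC1
print(graph_to_smiles(["O", "C", "C", "O", "C", "C"], ring))    # O1CCOCC1
```

The sketch never reuses labels and ignores bond orders, aromaticity and charges, so it only covers the simplest molecules.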
 
 

Atoms

 
 
 
Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.
An atom holding one or more electrical charges is enclosed in brackets. The element symbol is followed by H if the atom is bonded to one or more hydrogen atoms, then by the number of hydrogen atoms (as usual, a count of one is omitted; for example, [NH4+] for ammonium), then by the sign '+' for a positive charge or '-' for a negative charge. The number of charges is specified after the sign, unless there is only one. It is also possible to write the sign as many times as the ion has charges: instead of [Ti+4], one can also write [Ti++++] (titanium(IV), Ti4+). Thus, the hydroxide anion is represented by [OH-], the oxonium cation by [OH3+], and the cobalt(III) cation (Co3+) by either [Co+3] or [Co+++].
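As a rough illustration of the charge rules, the following sketch reads a single bracket atom and returns its symbol, hydrogen count and charge. The function name parse_bracket_atom and the regular expression are assumptions made for this example; they cover only the symbol, H count and charge described here, not the full bracket-atom syntax.

```python
import re

# Symbol, optional H with optional count, optional sign run with optional number.
BRACKET_ATOM = re.compile(r"\[([A-Z][a-z]?)(?:H(\d*))?(?:(\++|-+)(\d+)?)?\]")


def parse_bracket_atom(text):
    """Return (symbol, hydrogen_count, charge) for a bracket atom like "[OH-]"."""
    match = BRACKET_ATOM.fullmatch(text)
    if match is None:
        raise ValueError(f"not a supported bracket atom: {text!r}")
    symbol, h_count, signs, magnitude = match.groups()
    # "H" alone means one hydrogen; no "H" at all means none.
    hydrogens = 0 if h_count is None else int(h_count or 1)
    if signs is None:
        charge = 0
    else:
        # "+3" gives the size as a number, "+++" gives it by repetition.
        size = int(magnitude) if magnitude else len(signs)
        charge = size if signs[0] == "+" else -size
    return symbol, hydrogens, charge


print(parse_bracket_atom("[OH-]"))    # ('O', 1, -1)
print(parse_bracket_atom("[Co+3]"))   # ('Co', 0, 3)
print(parse_bracket_atom("[Co+++]"))  # ('Co', 0, 3)
```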
 
 

 Bonds

 

 Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. For example, the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES string; cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2 (naphthalene: c1cccc2c1cccc2; note the lower case for aromatic compounds), and so on. After reaching 9, the label must be preceded by a '%' in order to differentiate it from two different labels attached to the same atom (~C12~ means the carbon atom holds the ring closure labels 1 and 2, whereas ~C%12~ indicates one label only, 12). Double, triple, and quadruple bonds are represented by the symbols '=', '#', and '$' respectively, as illustrated by the SMILES O=C=O (carbon dioxide), C#N (hydrogen cyanide) and [Ga-]$[As+] (gallium arsenide).
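The difference between C12 and C%12 can be made concrete with a small sketch that lists the ring closure labels in a SMILES string. The function name ring_closure_labels is hypothetical, and the sketch assumes that every '%' is followed by exactly two digits; digits inside square brackets are skipped because they are not ring labels.

```python
def ring_closure_labels(smiles):
    """Return the ring closure labels of a SMILES string in order of appearance."""
    labels = []
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            if ch == "%":
                # Two-digit label: read both digits after the percent sign.
                labels.append(int(smiles[i + 1:i + 3]))
                i += 2
            elif ch in "0123456789":
                # Each bare digit is its own single-digit label.
                labels.append(int(ch))
        i += 1
    return labels


print(ring_closure_labels("c1cccc2c1cccc2"))  # [1, 2, 1, 2]
print(ring_closure_labels("C12"))             # [1, 2]
print(ring_closure_labels("C%12"))            # [12]
```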
 
 
 

 Branching

 
Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
 

 

 Isotopes

 
 Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
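A short sketch of how the isotope prefix could be read back out of a SMILES string follows. The name isotope_labels and the regular expression are assumptions made for this example; the pattern only looks for a number directly after an opening bracket, followed by an element symbol.

```python
import re

# A mass number right after "[" followed by an element symbol (upper or lower case).
ISOTOPE_ATOM = re.compile(r"\[(\d+)([A-Za-z][a-z]?)")


def isotope_labels(smiles):
    """Return (mass, symbol) pairs for every isotope-labelled atom."""
    return [(int(mass), symbol) for mass, symbol in ISOTOPE_ATOM.findall(smiles)]


print(isotope_labels("[14c]1ccccc1"))     # [(14, 'c')]
print(isotope_labels("[2H]C(Cl)(Cl)Cl"))  # [(2, 'H')]
```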
 
 
 

 



 

 

Molecule | Structure | SMILES formula
Dinitrogen | N≡N | N#N
Methyl isocyanate (MIC) | CH3–N=C=O | CN=C=O
Copper(II) sulfate | Cu²⁺ SO₄²⁻ | [Cu+2].[O-]S(=O)(=O)[O-]
Œnanthotoxin (C17H22O2) | (image) | CCC[C@@H](O)CC\C=C\C=C\C#CC#C\C=C\CO
Pyrethrin II (C22H28O5) | (image) | COC(=O)C(\C)=C\C1C(C)(C)[C@H]1C(=O)O[C@@H]2C(C)=C(C(=O)C2)CC=CC=C
Aflatoxin B1 (C17H12O6) | (image) | O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Glucose (glucopyranose) (C6H12O6) | (image) | OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
Bergenin (cuscutin) (a resin) (C14H16O9) | (image) | OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2
A pheromone of the Californian scale insect | (image) | CC(=O)OCCC(/C)=C\C[C@H](C(C)=C)CCC=C
 
 
 
 
 
 

About the PDB Archive and the RCSB PDB

The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it works. This knowledge can be used to help deduce a structure's role in human health and disease, and in drug development. The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.
The PDB archive is available at no cost to users. The PDB archive is updated each week at the target time of Wednesday 00:00 UTC (Coordinated Universal Time). The most recent release is timestamped and linked on every page in the top right header.
The PDB was established in 1971 at Brookhaven National Laboratory and originally contained 7 structures. In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the PDB. In 2003, the wwPDB was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community. It consists of organizations that act as deposition, data processing and distribution centers for PDB data.
In addition, the RCSB PDB supports a website where visitors can perform simple and complex queries on the data, analyze, and visualize the results. Details about the history, function, progress, and future goals of the RCSB PDB can be found in our Annual Reports and Newsletters.
The PDB Advisory Notice defines the conditions for using data from the PDB archive.
RCSB PDB staff are located at Rutgers, The State University of New Jersey and the University of California, San Diego.

Some of the proteins from the RCSB Protein Data Bank (PDB)
  • Amylase

Crystal structure of a catalytic-site mutant alpha-amylase from Bacillus subtilis complexed with maltopentaose.
  • Trypsin

The Geometry of the Reactive Site and of the Peptide Groups in Trypsin, Trypsinogen and its Complexes with Inhibitors.
  • Pepsin

X-ray analyses of aspartic proteinases. II. Three-dimensional structure of the hexagonal crystal form of porcine pepsin at 2.3 Å resolution.


  • HtrA

Solution structure of HtrA PDZ domain from Streptococcus pneumoniae and its interaction with YYF-COOH containing peptides.


  • Carboxypeptidase

Insight into the stereochemistry in the inhibition of carboxypeptidase A with N-(hydroxyaminocarbonyl)phenylalanine: binding modes of an enantiomeric pair of the inhibitor to carboxypeptidase A.



The RCSB Protein Data Bank: site functionality and bioinformatics use cases

Annotation | Ontology/Database | Description
Biological process | GO Consortium | Controlled vocabulary that describes biological processes
Cellular component | GO Consortium | Controlled vocabulary that describes the cellular location
Molecular function | GO Consortium | Controlled vocabulary that describes the molecular function
Enzyme classification | Enzyme Commission (EC) system recommended by the IUBMB | Classification system for the reactions catalyzed by enzymes
Transporter classification | Transporter Classification (TC) system recommended by the IUBMB | Classification system for membrane transport proteins that incorporates both functional and phylogenetic information
Medical subject terms | MeSH terms developed by the National Library of Medicine | Controlled vocabulary to describe medical terms; used for indexing of PubMed abstracts
Source organism | NCBI taxonomy | A curated set of names and classifications of organisms from superkingdoms to subspecies
Genome location | Entrez Gene | Location of genes in the genomes of various organisms for proteins in the PDB. The top level in the hierarchy is the organism's genome. Each genome expands into chromosomes, which in turn expand into a list of loci on the chromosomes.
Fold/domain classification | SCOP | Hierarchical structural classification of proteins that provides a description of structural and evolutionary relationships of proteins of known structure
Fold/domain classification | CATH | Hierarchical structural classification of protein domains. Each protein has been divided into structural domains and assigned into homologous superfamilies.