Thursday, 19 December 2013

Simplified molecular-input line-entry system (SMILES)

For further reading:

http://www.daylight.com/meetings/summerschool98/course/dave/smiles-intro.html

http://en.wikipedia.org/wiki/Simplified_molecular-input_line-entry_system

http://chemical-quantum-images.blogspot.com/2008/03/simplified-molecular-input-line-entry.html

SMILES is a specification in the form of a line notation for describing the structure of chemical molecules using short ASCII strings.
SMILES strings can be imported by most molecule editors for conversion back into two-dimensional drawings or three-dimensional models of the molecules.

Graph-based definition

The chemical graph is first trimmed to remove hydrogen atoms, and cycles are broken to turn it into a spanning tree. Where cycles have been broken, numeric suffix labels are included to indicate the connected nodes.
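The cycle-breaking step can be illustrated with a small depth-first search. This is only a sketch under stated assumptions: the function name spanning_tree and the adjacency-dictionary input are made up for this example, and real SMILES writers also handle atom ordering, bond orders and stereochemistry.

```python
def spanning_tree(adjacency, root=0):
    """Split the edges of a connected graph into spanning-tree edges
    and the 'broken' edges that would receive ring-closure labels."""
    visited = set()
    tree_edges = []
    closure_edges = []

    def visit(node, parent):
        visited.add(node)
        for neighbour in adjacency[node]:
            if neighbour == parent:
                continue  # do not walk straight back along the tree edge
            if neighbour in visited:
                # This edge closes a cycle: record it once, as a sorted pair.
                edge = tuple(sorted((node, neighbour)))
                if edge not in closure_edges:
                    closure_edges.append(edge)
            else:
                tree_edges.append((node, neighbour))
                visit(neighbour, node)

    visit(root, None)
    return tree_edges, closure_edges


# Hydrogen-suppressed cyclohexane: six carbon atoms in a ring.
cyclohexane = {0: [1, 5], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4, 0]}
tree, closures = spanning_tree(cyclohexane)
print(tree)      # five tree edges forming a chain
print(closures)  # one broken edge, the pair that gets label 1 in C1CCCCC1
```

For the six-membered ring the search keeps five edges as the tree and reports a single broken edge between the first and last atom, which corresponds to the two "1" labels in C1CCCCC1.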

Atoms

Atoms are represented by the standard abbreviation of the chemical elements, in square brackets, such as [Au] for gold. Brackets can be omitted for the "organic subset" of B, C, N, O, P, S, F, Cl, Br, and I. All other elements must be enclosed in brackets. If the brackets are omitted, the proper number of implicit hydrogen atoms is assumed; for instance the SMILES for water is simply O.
An atom carrying one or more electrical charges is enclosed in brackets. Inside the brackets, the element symbol is followed by the symbol H if the atom is bonded to one or more hydrogen atoms, then by the number of hydrogen atoms (omitted, as usual, when there is only one; for example, [NH4+] for ammonium), then by the sign '+' for a positive charge or '-' for a negative charge. The number of charges is specified after the sign (unless there is only one); however, it is also possible to write the sign as many times as the ion has charges: instead of [Ti+4], one can also write [Ti++++] (titanium(IV), Ti⁴⁺). Thus, the hydroxide anion is represented by [OH-], the oxonium cation is [OH3+] and the cobalt(III) cation (Co³⁺) is either [Co+3] or [Co+++].
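The bracket-atom rules above can be sketched with a regular expression. This is a simplified illustration, not a full SMILES parser: the name parse_bracket_atom is invented for this example, and isotopes, aromatic symbols and chirality marks inside the brackets are deliberately not handled.

```python
import re

# Element symbol, optional H count, optional charge, all inside [ ].
BRACKET_ATOM = re.compile(
    r"^\[(?P<symbol>[A-Z][a-z]?)"
    r"(?:H(?P<hcount>\d*))?"
    r"(?P<charge>\+\d+|-\d+|\++|-+)?\]$"
)


def parse_bracket_atom(text):
    """Return (symbol, hydrogen count, formal charge) for a bracket atom."""
    match = BRACKET_ATOM.match(text)
    if match is None:
        raise ValueError(f"unsupported bracket atom: {text!r}")

    hcount_text = match.group("hcount")
    if hcount_text is None:
        hcount = 0            # no H written at all
    elif hcount_text == "":
        hcount = 1            # a bare H means exactly one hydrogen
    else:
        hcount = int(hcount_text)

    charge_text = match.group("charge") or ""
    if charge_text == "":
        charge = 0
    elif charge_text[1:].isdigit():
        charge = int(charge_text)               # "+3" or "-2"
    else:
        sign = 1 if charge_text[0] == "+" else -1
        charge = sign * len(charge_text)        # "+++" or "--"

    return match.group("symbol"), hcount, charge


for atom in ["[Au]", "[OH-]", "[OH3+]", "[NH4+]", "[Co+3]", "[Co+++]", "[Ti++++]"]:
    print(atom, parse_bracket_atom(atom))
```

Both spellings of the cobalt(III) cation, [Co+3] and [Co+++], come out with the same charge of 3.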

Bonds

Bonds between aliphatic atoms are assumed to be single unless specified otherwise and are implied by adjacency in the SMILES string. For example, the SMILES for ethanol can be written as CCO. Ring closure labels are used to indicate connectivity between non-adjacent atoms in the SMILES string, which for cyclohexane and dioxane can be written as C1CCCCC1 and O1CCOCC1 respectively. For a second ring, the label will be 2, and so on; naphthalene, for example, is c1cccc2c1cccc2 (note the lower case for aromatic atoms). After reaching 9, the label must be preceded by a '%' in order to differentiate it from two different labels bonded to the same atom: C12 means the carbon atom holds the ring closure labels 1 and 2, whereas C%12 indicates one label only, 12. Double, triple, and quadruple bonds are represented by the symbols '=', '#', and '$' respectively, as illustrated by the SMILES O=C=O (carbon dioxide), C#N (hydrogen cyanide) and [Ga-]$[As+] (gallium arsenide).
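The difference between single-digit labels and '%' labels can be shown with a short scanner. This is a rough sketch with an invented function name, ring_closure_labels; it only collects the labels in order of appearance and skips digits inside brackets, where they mean isotopes, hydrogen counts or charges.

```python
def ring_closure_labels(smiles):
    """List the ring-closure labels of a SMILES string in reading order."""
    labels = []
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside brackets are not ring closures: skip to the "]".
            i = smiles.index("]", i) + 1
            continue
        if ch == "%":
            # "%" introduces a two-digit label such as %12.
            labels.append(int(smiles[i + 1:i + 3]))
            i += 3
            continue
        if ch.isdigit():
            labels.append(int(ch))
        i += 1
    return labels


print(ring_closure_labels("C1CCCCC1"))        # cyclohexane
print(ring_closure_labels("c1cccc2c1cccc2"))  # naphthalene
print(ring_closure_labels("C12"))             # two labels: 1 and 2
print(ring_closure_labels("C%12"))            # one label: 12
```

In a complete SMILES string every label opened must later be closed, so a quick sanity check is that each label occurs an even number of times.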

Branching

Branches are described with parentheses, as in CCC(=O)O for propionic acid and C(F)(F)F for fluoroform. Substituted rings can be written with the branching point in the ring, as illustrated by the SMILES COc(c1)cccc1C#N and COc(cc1)ccc1C#N, which encode the 3- and 4-cyanoanisole isomers. Writing SMILES for substituted rings in this way can make them more human-readable.
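How parentheses map onto branches can be sketched with a stack. The parser below is a deliberately limited illustration (the name parse_acyclic is hypothetical): it accepts only unbracketed organic-subset atoms, the bond symbols '-', '=' and '#', and parentheses, so rings, aromatic atoms and bracket atoms raise an error.

```python
BOND_ORDERS = {"-": 1, "=": 2, "#": 3}


def parse_acyclic(smiles):
    """Return (atoms, bonds) for a ring-free, organic-subset SMILES string.

    atoms is a list of element symbols; bonds is a list of
    (atom index, atom index, bond order) tuples.
    """
    atoms, bonds = [], []
    stack = []        # atoms to return to when a branch closes
    previous = None   # index of the atom the next atom bonds to
    order = 1         # order of the pending bond
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        pair = smiles[i:i + 2]
        if ch == "(":
            stack.append(previous)      # remember the branching point
        elif ch == ")":
            previous = stack.pop()      # go back to the branching point
        elif ch in BOND_ORDERS:
            order = BOND_ORDERS[ch]
        elif pair in ("Cl", "Br") or ch in "BCNOPSFI":
            symbol = pair if pair in ("Cl", "Br") else ch
            atoms.append(symbol)
            if previous is not None:
                bonds.append((previous, len(atoms) - 1, order))
            previous = len(atoms) - 1
            order = 1
            i += len(symbol) - 1        # skip the second letter of Cl/Br
        else:
            raise ValueError(f"unsupported character {ch!r} at position {i}")
        i += 1
    return atoms, bonds


print(parse_acyclic("CCC(=O)O"))   # propionic acid
print(parse_acyclic("C(F)(F)F"))   # fluoroform
```

In propionic acid both oxygen atoms end up bonded to the third carbon (index 2), one with order 2 and one with order 1; in fluoroform all three fluorine atoms bond to atom 0.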

Isotopes

Isotopes are specified with a number equal to the integer isotopic mass preceding the atomic symbol. Benzene in which one atom is carbon-14 is written as [14c]1ccccc1 and deuterochloroform is [2H]C(Cl)(Cl)Cl.
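As a small illustration of the isotope rule, the snippet below pulls out every bracket atom that starts with a mass number. The helper name isotope_labels and the regular expression are assumptions made for this sketch; it does not validate the rest of the SMILES string.

```python
import re

# A "[" followed by a mass number and then an element symbol
# (upper case, or lower case for aromatic atoms).
ISOTOPE = re.compile(r"\[(\d+)([A-Z][a-z]?|[a-z]{1,2})")


def isotope_labels(smiles):
    """Return (mass number, symbol) for each isotope-labelled atom."""
    return [(int(mass), symbol) for mass, symbol in ISOTOPE.findall(smiles)]


print(isotope_labels("[14c]1ccccc1"))     # benzene with one carbon-14
print(isotope_labels("[2H]C(Cl)(Cl)Cl"))  # deuterochloroform
print(isotope_labels("CCO"))              # ethanol: no isotope labels
```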

Molecule	Structure	SMILES Formula
Dinitrogen	N≡N	N#N
Methyl isocyanate (MIC)	CH₃–N=C=O	CN=C=O
Copper(II) sulfate	Cu²⁺ SO₄²⁻	[Cu+2].[O-]S(=O)(=O)[O-]
Œnanthotoxin (C₁₇H₂₂O₂)		CCC[C@@H](O)CC\C=C\C=C\C#CC#C\C=C\CO
Pyrethrin II (C₂₂H₂₈O₅)		COC(=O)C(\C)=C\C1C(C)(C)[C@H]1C(=O)O[C@@H]2C(C)=C(C(=O)C2)CC=CC=C
Aflatoxin B1 (C₁₇H₁₂O₆)		O1C=C[C@H]([C@H]1O2)c3c2cc(OC)c4c3OC(=O)C5=C4CCC(=O)5
Glucose (glucopyranose) (C₆H₁₂O₆)		OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H](O)[C@@H](O)1
Bergenin (cuscutin) (a resin) (C₁₄H₁₆O₉)		OC[C@@H](O1)[C@@H](O)[C@H](O)[C@@H]2[C@@H]1c3c(O)c(OC)c(O)cc3C(=O)O2
A pheromone of the Californian scale insect		CC(=O)OCCC(/C)=C\C[C@H](C(C)=C)CCC=C

About the PDB Archive and the RCSB PDB

The Protein Data Bank (PDB) archive is the single worldwide repository of information about the 3D structures of large biological molecules, including proteins and nucleic acids. These are the molecules of life that are found in all organisms including bacteria, yeast, plants, flies, other animals, and humans. Understanding the shape of a molecule helps to understand how it works. This knowledge can be used to help deduce a structure's role in human health and disease, and in drug development. The structures in the archive range from tiny proteins and bits of DNA to complex molecular machines like the ribosome.

The PDB archive is available at no cost to users. It is updated each week at the target time of Wednesday 00:00 UTC (Coordinated Universal Time). The most recent release is timestamped and linked in the top right header of every page on the RCSB PDB website.
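The weekly schedule is easy to work out in code. The sketch below computes the first Wednesday 00:00 UTC after a given moment; the function name next_pdb_update is made up for this example, it expects a timezone-aware datetime, and the stated time is only a target, so an actual release may differ.

```python
from datetime import datetime, timedelta, timezone


def next_pdb_update(now):
    """Return the first Wednesday 00:00 UTC strictly after `now`.

    `now` must be a timezone-aware datetime.
    """
    now = now.astimezone(timezone.utc)
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)
    days_ahead = (2 - midnight.weekday()) % 7   # Monday is 0, Wednesday is 2
    candidate = midnight + timedelta(days=days_ahead)
    if candidate <= now:
        candidate += timedelta(days=7)          # already past this week's slot
    return candidate


# The date of this post: Thursday, 19 December 2013, at noon UTC.
posted = datetime(2013, 12, 19, 12, 0, tzinfo=timezone.utc)
print(next_pdb_update(posted))  # 2013-12-25 00:00:00+00:00
```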

The PDB was established in 1971 at Brookhaven National Laboratory and originally contained 7 structures. In 1998, the Research Collaboratory for Structural Bioinformatics (RCSB) became responsible for the management of the PDB. In 2003, the wwPDB was formed to maintain a single PDB archive of macromolecular structural data that is freely and publicly available to the global community. It consists of organizations that act as deposition, data processing and distribution centers for PDB data.

In addition, the RCSB PDB supports a website where visitors can perform simple and complex queries on the data, and analyze and visualize the results. Details about the history, function, progress, and future goals of the RCSB PDB can be found in its Annual Reports and Newsletters.

The PDB Advisory Notice defines the conditions for using data from the PDB archive.

RCSB PDB staff are located at Rutgers, The State University of New Jersey and the University of California, San Diego. Watch this video for a tour of the Rutgers site:

http://www.youtube.com/watch?v=akmXIy7XQwA

Some proteins from the RCSB Protein Data Bank (PDB)

Amylase

Crystal structure of a catalytic-site mutant alpha-amylase from Bacillus subtilis complexed with maltopentaose.

Trypsin

The Geometry of the Reactive Site and of the Peptide Groups in Trypsin, Trypsinogen and its Complexes with Inhibitors.

Pepsin

X-ray analyses of aspartic proteinases. II. Three-dimensional structure of the hexagonal crystal form of porcine pepsin at 2.3 Å resolution.

HtrA

Solution structure of HtrA PDZ domain from Streptococcus pneumoniae and its interaction with YYF-COOH containing peptides.

Carboxypeptidase

Insight into the stereochemistry in the inhibition of carboxypeptidase A with N-(hydroxyaminocarbonyl)phenylalanine: binding modes of an enantiomeric pair of the inhibitor to carboxypeptidase A.

The RCSB Protein Data Bank: site functionality and bioinformatics use cases

Annotation	Ontology/Database	Description
Biological process	GO Consortium	Controlled vocabulary that describes biological processes
Cellular component	GO Consortium	Controlled vocabulary that describes the cellular location
Molecular function	GO Consortium	Controlled vocabulary that describes the molecular function
Enzyme classification	Enzyme Commission (EC) system recommended by the IUBMB	Classification system for the reactions catalyzed by enzymes
Transporter classification	Transporter Classification (TC) system recommended by the IUBMB	Classification system for membrane transport proteins that incorporates both functional and phylogenetic information
Medical subject terms	MeSH terms developed by the National Library of Medicine	Controlled vocabulary to describe medical terms; used for indexing of PubMed abstracts
Source organism	NCBI taxonomy	A curated set of names and classifications of organisms from superkingdoms to subspecies
Genome location	Entrez Gene	Location of genes in the genomes of various organisms for proteins in the PDB. The top level in the hierarchy is the organism's genome. Each genome expands into chromosomes, which in turn expand into a list of loci on the chromosomes.
Fold/domain classification	SCOP	Hierarchical structural classification of proteins that provides a description of structural and evolutionary relationships of proteins of known structure
Fold/domain classification	CATH	Hierarchical structural classification of protein domains. Each protein has been divided into structural domains and assigned into homologous superfamilies.