Glossary of Bioinformatics Terms

A Quick Guide to Some Common Terms Used in Molecular Biology Databases

 

General Terms from Sequence Databases and Other Gene Resources

accession number (GenBank) - The accession number is the unique identifier assigned to the entire sequence record when the record is submitted to GenBank. The GenBank accession number is a combination of letters and numbers, usually one letter followed by five digits (e.g., M12345) or two letters followed by six digits (e.g., AC123456). The accession number for a particular record will not change even if the author submits a request to change some of the information in the record. Note that an accession number is a unique identifier for a complete sequence record, while a Sequence Identifier, such as a Version, GI, or ProteinID, is an identification number assigned just to the sequence data. The NCBI Entrez System is searchable by accession number using the Accession [ACCN] search field.
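The two usual accession formats described above can be checked with a short pattern. This is only a minimal sketch: the function names are hypothetical, and it assumes that only the two formats named in this entry need to be recognized (other formats exist and would be rejected here).

```python
import re

# Assumption: only the two "usual" formats from the glossary entry are covered:
# one letter + five digits, or two letters + six digits.
GENBANK_ACCESSION = re.compile(r"(?:[A-Z]\d{5}|[A-Z]{2}\d{6})")


def looks_like_genbank_accession(text):
    """Return True if text matches one of the two usual accession formats."""
    return GENBANK_ACCESSION.fullmatch(text.strip().upper()) is not None


def entrez_accession_term(accession):
    """Build a search term restricted to the Accession [ACCN] field."""
    return f"{accession.strip().upper()}[ACCN]"


if __name__ == "__main__":
    for candidate in ("M12345", "AC123456", "M12345.1"):
        print(candidate, looks_like_genbank_accession(candidate))
    print(entrez_accession_term("M12345"))
```

An identifier such as M12345.1 is rejected on purpose, because the ".1" suffix makes it a version (a sequence identifier) rather than a bare accession number.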

accession number (RefSeq) - This accession number is the unique identification number for a complete RefSeq sequence record. RefSeq accession numbers are written in the following format: two letters followed by an underscore and six digits (e.g., NT_123456). The first two letters of the RefSeq accession number indicate the type of sequence included in the record as described below:

  • NT_123456 constructed genomic contigs

  • NM_123456 mRNAs (actually the cDNA sequences constructed from mRNA)

  • NP_123456 proteins

  • NC_123456 chromosomes
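The prefix list above maps naturally onto a lookup table. The following sketch assumes only the four prefixes listed in this glossary; the function name is hypothetical, and any other prefix is simply reported as unknown.

```python
# Only the four prefixes listed in this glossary; RefSeq defines others.
REFSEQ_PREFIXES = {
    "NT": "constructed genomic contig",
    "NM": "mRNA",
    "NP": "protein",
    "NC": "chromosome",
}


def refseq_record_type(accession):
    """Return the record type implied by the two-letter RefSeq prefix."""
    prefix, underscore, rest = accession.strip().partition("_")
    number = rest.split(".")[0]  # ignore a trailing ".version" if present
    if not underscore or len(prefix) != 2 or not number.isdigit():
        raise ValueError(f"not in RefSeq prefix_number format: {accession!r}")
    return REFSEQ_PREFIXES.get(prefix.upper(), "unknown")


if __name__ == "__main__":
    for acc in ("NT_123456", "NM_123456", "NP_123456", "NC_123456"):
        print(acc, "->", refseq_record_type(acc))
```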

cDNA or complementary DNA - DNA that is synthesized in the laboratory from a messenger RNA template [Source: Genome Glossary]

CDS - The coding sequence or the portion of a nucleotide sequence that makes up the triplet codons that actually code for amino acids.
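As a rough illustration of "triplet codons", the sketch below splits a coding sequence into codons and applies a simple plausibility check. It rests on assumptions not stated in the entry: the standard genetic code, an ATG start codon, and a CDS that includes its terminating stop codon. The function names are hypothetical.

```python
# Stop codons of the standard genetic code, written in the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def split_codons(cds):
    """Split a coding sequence into consecutive three-base codons."""
    seq = cds.strip().upper()
    if len(seq) % 3 != 0:
        raise ValueError("CDS length is not a multiple of three")
    return [seq[i:i + 3] for i in range(0, len(seq), 3)]


def looks_like_complete_cds(cds):
    """Assumed check: ATG start, stop codon at the end, no internal stop."""
    codons = split_codons(cds)
    if len(codons) < 2:
        return False
    has_internal_stop = any(c in STOP_CODONS for c in codons[1:-1])
    return codons[0] == "ATG" and codons[-1] in STOP_CODONS and not has_internal_stop


if __name__ == "__main__":
    print(split_codons("atggcctaa"))
    print(looks_like_complete_cds("ATGGCCTAA"))
```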

contig - Group of cloned (copied) pieces of DNA representing overlapping regions of a particular chromosome [Source: Genome Glossary]

expressed sequence tag or EST - A short strand of DNA that is a part of a cDNA molecule and can act as an identifier of a gene. Used in locating and mapping genes.

gene locus (pl. loci) - Gene's position on a chromosome or other chromosome marker; also, the DNA at that position. The use of locus is sometimes restricted to mean expressed DNA regions.

Gene name - Official name assigned to a gene. According to the Guidelines for Human Gene Nomenclature developed by the HUGO Gene Nomenclature Committee, it should be brief and describe the function of the gene.

Gene Ontology - A controlled vocabulary of terms relating to molecular function, biological process, or cellular components developed by the Gene Ontology Consortium. A controlled vocabulary allows scientists to use consistent terminology when describing the roles of genes and proteins in cells.

Gene symbol - Symbols for human genes are usually designated by the scientists who discover the genes. The symbols are created using the Guidelines for Human Gene Nomenclature developed by the HUGO Gene Nomenclature Committee. Gene symbols usually consist of no more than six uppercase letters or a combination of uppercase letters and Arabic numerals. Gene symbols should start with the first letters of the gene name. For example, the gene symbol for insulin is "INS." A gene symbol must be submitted to HUGO for approval before it can be considered an official gene symbol.

GI (GenBank) - A GI or "GenInfo Identifier" is a sequence identifier that can be assigned to a nucleotide sequence or protein translation. Each GI is a numeric value of one or more digits. The protein translation and the nucleotide sequence contained in the same record will each be assigned different GI numbers. Every time the sequence data for a particular record is changed, its version number increases and it receives a new GI. However, while each new version number is based upon the previous version number, a new GI for an altered sequence may be completely different from the previous GI. For example, in the GenBank record M12345, the original GI might be 7654321, but after a change in the sequence is submitted, the new GI for the changed sequence could be 10529376. Individuals can search for nucleotide sequences and protein translations by GI using the UID search field in the NCBI sequence databases. Use NCBI's Sequence Revision History page to view the different GI numbers, version numbers, or update dates associated with a particular GenBank record.

MIM number (also MIM#, OMIM number, or McKusick Code) - The unique six-digit number assigned to each entry listed in the catalog of human genes and genetic disorders, Online Mendelian Inheritance in Man (OMIM). The first digit of a MIM number describes a gene's mode of inheritance as outlined in the table below:


First Digit   Format (where X is any digit)   Mode of Inheritance
1             1XXXXX                          Autosomal dominant (for entries created before May 15, 1994)
2             2XXXXX                          Autosomal recessive (for entries created before May 15, 1994)
3             3XXXXX                          X-linked loci or phenotypes
4             4XXXXX                          Y-linked loci or phenotypes
5             5XXXXX                          Mitochondrial loci or phenotypes
6             6XXXXX                          Autosomal loci or phenotypes (for entries created after May 15, 1994)
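The first-digit rule in the table can be expressed as a small lookup. This is a sketch only: the function name is hypothetical, and the sample numbers used below are made up to show the format rather than taken from OMIM.

```python
# First digit of a MIM number -> mode of inheritance, as given in the table.
MIM_FIRST_DIGIT = {
    "1": "Autosomal dominant (entries created before May 15, 1994)",
    "2": "Autosomal recessive (entries created before May 15, 1994)",
    "3": "X-linked loci or phenotypes",
    "4": "Y-linked loci or phenotypes",
    "5": "Mitochondrial loci or phenotypes",
    "6": "Autosomal loci or phenotypes (entries created after May 15, 1994)",
}


def mim_inheritance(mim_number):
    """Look up the mode of inheritance from the first digit of a MIM number."""
    text = str(mim_number).strip()
    if len(text) != 6 or not text.isdigit():
        raise ValueError(f"a MIM number has six digits: {mim_number!r}")
    return MIM_FIRST_DIGIT.get(text[0], "first digit not covered by the table")


if __name__ == "__main__":
    # 312345 and 612345 are invented numbers used only to show the format.
    print(mim_inheritance("312345"))
    print(mim_inheritance(612345))
```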

For more information about OMIM numbers see the OMIM FAQ

Protein ID (GenBank) - The Protein ID is an identification number assigned to the amino acid sequence data included within a sequence record. This sequence identifier uses the accession.version format. Each protein ID is made up of three letters followed by five digits, a period, and a version number. For example, in a sequence record M12345, the Protein ID for the sequence translation could be AAA35650.1. If the protein sequence data changes in any way (even by just one amino acid), the version number in the Protein ID will be increased by an increment of one, while the accession number base remains constant. For example, AAA12345.1 would become AAA12345.2. Each amino acid sequence change also results in the assignment of a new GI number to the altered protein translation.

version (GenBank) - Similar to the Protein ID for protein sequences, the version is a nucleotide sequence identification number assigned to each GenBank sequence. The format for this sequence identifier is accession.version (e.g., M12345.1). Whenever the author of a particular sequence record changes the sequence data in any way (even if just a single nucleotide is altered), the version number will be increased by an increment of one, while the accession number base remains constant. For example, M12345.1 would become M12345.2. Each sequence change also results in the assignment of a new GI number. Whenever an individual searches an NCBI sequence database, only the most recent version of a record is retrieved. Use NCBI's Sequence Revision History page to view the different GI numbers, version numbers, or update dates associated with a particular GenBank record.
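The accession.version convention, shared by the Protein ID and version entries, is easy to take apart in code. The sketch below uses hypothetical helper names and only models the rule stated above: the accession base stays constant and the version increases by one with each sequence change.

```python
def split_accession_version(identifier):
    """Split 'M12345.1' into ('M12345', 1)."""
    accession, dot, version = identifier.strip().rpartition(".")
    if not dot or not accession or not version.isdigit():
        raise ValueError(f"not in accession.version format: {identifier!r}")
    return accession, int(version)


def next_version(identifier):
    """Return the identifier a record would carry after one sequence change."""
    accession, version = split_accession_version(identifier)
    return f"{accession}.{version + 1}"


if __name__ == "__main__":
    print(split_accession_version("M12345.1"))
    print(next_version("AAA12345.1"))
```

The GI is deliberately left out of this sketch: as the GI entry explains, a new GI cannot be derived from the previous one.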

sequence tagged site or STS - Short (200 to 500 base pairs) DNA sequence that has a single occurrence in the human genome and whose location and base sequence are known. Detectable by polymerase chain reaction, STSs are useful for localizing and orienting the mapping and sequence data reported from many different laboratories and serve as landmarks for developing physical maps of the human genome. Expressed sequence tags (ESTs) are STSs derived from cDNAs.

 


 

Protein Structure Terms

domain - a discrete portion of a protein assumed to fold independently of the rest of the protein and possessing its own function [from NCBI BLAST Guide Glossary]

ligand - A small molecule noncovalently bonded to a larger macromolecule.

motif - A short conserved region in a protein sequence. Motifs are frequently highly conserved parts of domains [from NCBI BLAST Guide Glossary]. In protein structure, some common types of motifs are made up of two or more alpha helices or beta sheets.

Display Options - The display options available for viewing molecular structures in Protein Explorer are listed below (illustrated in the original with graphics of the small protein hen egg-white lysozyme). The cartoons, ribbons, and strands display options are useful for viewing protein secondary structure (alpha helices and beta pleated sheets).

  • Cartoons

  • Ribbons

  • Strands

  • Backbone (default)

  • Wireframe

  • Stick

  • Ball and Stick

  • Spacefill

polypeptide chain - a chain of peptides or amino acids. A polypeptide chain usually consists of 100 or fewer amino acids. A protein is made up of one or several polypeptide chains. [Source: BioTech's Life Science Dictionary]

primary structure - the amino acid sequence of a polypeptide chain. Of the four levels of protein structure, this is the most basic. [Source: Primary through Quaternary Structure in the MIT Biology Hypertextbook]

secondary structure - the folded, coiled, or twisted shape of a polypeptide that results from hydrogen bonding between parts of the molecule. There are two types of secondary structure: the alpha helix and the beta pleated sheet. [Source: Biotech's Life Science Dictionary]

tertiary structure - the three-dimensional structure of a polypeptide chain that results from the way that the alpha helices and beta pleated sheets are folded and arranged [Source: The Dictionary of Cell and Molecular Biology, Third Edition]

quaternary structure - the interconnection and arrangement of polypeptide chains within a protein. Only proteins with more than one polypeptide chain can have quaternary structure. [Source: Primary through Quaternary Structure in the MIT Biology Hypertextbook]

alpha helix - one of two types of protein secondary structure. An alpha helix is a tight helix that results from the hydrogen bonding of the backbone carbonyl (CO) group of one amino acid to the backbone amino (NH) group of another amino acid four residues further along the chain. [Source: Primary through Quaternary Structure in the MIT Biology Hypertextbook]

 


 

Sequence Similarity Searching (BLAST) Terms

[Definitions adapted from the NCBI BLAST Guide Glossary]

algorithm - a fixed procedure embodied in a computer program. The Basic Local Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI uses to search sequence databases for optimal local alignments with a query sequence. FASTA is another type of algorithm used for database similarity searching.

conservation - when the substitution of one amino acid for another preserves the physicochemical properties of the original residue. For example, when a hydrophobic amino acid residue is replaced by another hydrophobic residue.

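E value - the number of different alignments with a score equal to or better than S (the score of the alignment being evaluated) that can be expected to occur in a database search simply by chance. Also referred to as the expectation value. The lower the E value, the more significant the score.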
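For ungapped local alignments, the expected number of chance hits is commonly modeled with the Karlin-Altschul relation E = K * m * n * exp(-lambda * S), where m is the query length, n is the database length, and K and lambda are statistical parameters that depend on the scoring system. The sketch below simply evaluates that relation; the function name is hypothetical and the parameter values in the usage lines are placeholders chosen for illustration, not values reported by any particular BLAST run.

```python
import math


def expected_chance_alignments(score, query_len, database_len, k, lam):
    """Evaluate E = K * m * n * exp(-lambda * S) for a raw score S."""
    return k * query_len * database_len * math.exp(-lam * score)


if __name__ == "__main__":
    # Placeholder parameters (assumptions for illustration only).
    k, lam = 0.1, 0.3
    for s in (40, 60, 80):
        e = expected_chance_alignments(s, query_len=250, database_len=10**7, k=k, lam=lam)
        print(f"S = {s}: E = {e:.3g}")
```

Two properties follow directly from the formula: E falls exponentially as the score rises, and it grows in proportion to the size of the search space (m times n).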

FASTA format - sequence format that begins with a single-line description followed by lines of sequence data. This format can be used as query input for bioinformatics tools such as BLAST or ClustalW. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. It is recommended that all lines of text be shorter than 80 characters in length. Blank lines are not allowed in the middle of FASTA input.

An example of a protein sequence in FASTA format is:

>GI|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
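A minimal reader for this format might look like the sketch below. The function name is hypothetical, and the reader is more lenient than the format description above in one respect: it skips blank lines instead of rejecting them.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA text."""
    records = []
    description = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # lenient: skip blank lines rather than fail
        if line.startswith(">"):
            if description is not None:
                records.append((description, "".join(chunks)))
            description = line[1:]  # drop the leading ">"
            chunks = []
        elif description is None:
            raise ValueError("sequence data found before a '>' description line")
        else:
            chunks.append(line)
    if description is not None:
        records.append((description, "".join(chunks)))
    return records


EXAMPLE = """>GI|129295|sp|P01013|OVAX_CHICK GENE X PROTEIN (OVALBUMIN-RELATED)
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWKTAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLTEWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
"""

if __name__ == "__main__":
    for desc, seq in parse_fasta(EXAMPLE):
        print(desc)
        print(len(seq), "residues")
```

Joining the sequence lines into one string is a design choice for convenience; the line breaks in FASTA input carry no meaning.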

gap - A space introduced into an alignment to compensate for insertions or deletions in one sequence relative to another

global alignment - when two nucleic acid or amino acid sequences are lined up along their entire length. See also local alignment

homology - similarity in sequence that is based on descent from a common ancestor

identity - the extent to which two sequences are invariant

local alignment - the alignment of portions (rather than the entire sequence length) of two nucleic acid or amino acid sequences
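The difference between global and local alignment can be seen by scoring the same pair of sequences both ways. The sketch below uses textbook dynamic programming (Needleman-Wunsch for the global case, Smith-Waterman for the local case) with a simple scoring scheme that is an assumption made for illustration: +1 for a match, -1 for a mismatch, -1 per gap position. This is not how BLAST itself searches; BLAST uses a faster heuristic.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-1):
    """Best alignment score of a and b by dynamic programming.

    local=False: global alignment, both sequences aligned end to end.
    local=True:  local alignment, best-scoring pair of subsequences.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # In a global alignment, leading gaps must be paid for.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
            if local:
                cell = max(cell, 0)  # a local alignment may start anywhere
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


if __name__ == "__main__":
    x, y = "TTTACGTAAA", "GGGACGTCCC"
    print("global:", alignment_score(x, y))
    print("local: ", alignment_score(x, y, local=True))
```

For the two sequences in the usage lines, which share only the core ACGT, the local score is 4 while the global score is -2, because the global alignment must also account for the unrelated flanking regions.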

masking - the removal of repeated or low complexity regions from a sequence so that only the informative regions are compared, which improves the sensitivity of sequence similarity searches

orthologous - homologous sequences in different species that arose from a common ancestral gene during speciation. Orthologous genes may or may not have similar functions.

paralogous - homologous sequences within a single species that are the result of gene duplication

query - the input sequence (in FASTA format or as bare sequence data) or sequence identifier with which all the sequences in a database are compared during a BLAST search

similarity - how related one nucleotide or protein sequence is to another. The extent of similarity between two sequences is based on the percent of sequence identity and/or conservation.
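Identity and similarity can be computed from a pair of already aligned sequences. In the sketch below, the residue classes used to decide whether a substitution is conservative are one illustrative grouping and an assumption of this example (real tools typically use substitution matrices instead). Percentages are taken over all alignment columns, gaps included, which is also a convention chosen here.

```python
# Illustrative residue classes (an assumption; other groupings are in use).
RESIDUE_CLASSES = [
    set("AVLIMFW"),  # hydrophobic
    set("KRH"),      # positively charged
    set("DE"),       # negatively charged
    set("STNQ"),     # polar, uncharged
]


def is_conservative(a, b):
    """True if a and b differ but belong to the same illustrative class."""
    return a != b and any(a in group and b in group for group in RESIDUE_CLASSES)


def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned sequences."""
    if not aligned_a or len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must be non-empty and equal in length")
    identical = conserved = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # a gap column counts toward the length but matches nothing
        if a == b:
            identical += 1
        elif is_conservative(a, b):
            conserved += 1
    columns = len(aligned_a)
    return 100.0 * identical / columns, 100.0 * (identical + conserved) / columns


if __name__ == "__main__":
    print(identity_and_similarity("AVKD", "ALKE"))
    print(identity_and_similarity("AC-T", "ACGT"))
```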

 
Reference: http://www.ornl.gov/sci/techresources/Human_Genome/posters/chromosome/genejargon.shtml