Blog: Research Data Point

Working together, communicating Data - Information - Knowledge - Insight

One simple mistake that ruins protein research projects [new tool]

How are you supposed to deal with unwieldy construct monsters that contain hundreds of characters? These monsters are the engineered amino acid sequences that protein scientists create when producing recombinant proteins. Researchers’ pragmatic solution is to shorten the sequence into an alias, a human-readable abbreviation that they can manipulate electronically and use to label sample tubes. Short identifiers are useful because they let researchers communicate efficiently about engineered protein constructs: in conversation with others, in lab notebooks and as labels on physical containers. The practice of using unsystematic abbreviations for expression constructs, however, can create substantial confusion and has ruined many research projects (talking from experience here).
Many aspects of protein research critically depend on consistent use of identifiers. Gene and protein-based research projects and collaborations are doomed when construct names and designations for research samples get mixed up. The problem often gets worse as throughput increases, and there is no generally agreed method for naming the many constructs, open reading frames and engineered proteins. Keeping track of these identifiers is critical, though, when handling multiple samples in parallel and when working together as a team. If it is not done properly, results don’t match up with materials and research projects fail.

Based on our own experience and data in the literature, there are four main reasons why a protein other than the one anticipated (henceforth called an artifact) may be found in a crystal structure. (….) Fourth, human errors such as mislabeling of samples (unfortunately not an uncommon event in high-throughput environments) may produce crystals of the wrong protein.

taken from:

Niedzialkowska, E., Gasiorowska, O., Handing, K. B., Majorek, K. A., Porebski, P. J., Shabalin, I. G., Zasadzinska, E., Cymborowski, M. and Minor, W.
Protein purification and crystallization artifacts: The tale usually not told
Protein Science, 25: 720–733. (2016) doi:10.1002/pro.2861

Surprisingly, existing guidelines (see below under ‘Further reading’) for naming sequence variants don’t seem to be followed in protein research labs. Hence the generation of identifiers for engineered protein constructs is not standardized and researchers end up using different names for identical amino acid sequences, or using the same name for different sequences.

Mixing up abbreviated names and mislabeling sample containers can lead to embarrassing retractions and clean-up efforts.

Let’s have a look at how researchers typically go about assigning a name to a particular amino acid sequence. What happens in many labs goes something like this: researchers assign an abbreviated name that is based on their own experience, institutional convention and some ad-hoc creativity. For instance, a researcher may come up with a name for an engineered protein construct by assembling it from the components that matter. These may include:

  • some abbreviated name for the protein, often based on the gene (e.g. AKT1 for RAC-alpha serine/threonine-protein kinase),
  • affinity tags for purification or recognition (e.g. ‘FLAG’ designates an octapeptide with the sequence DYKDDDDK)
  • protease cutting sites (e.g. ‘rTEV’ designates ENLYFQ\S)
  • sequence modifications vs wild type (e.g. A241S designating an alanine to serine mutation)
  • a number to keep track of the variations that are not expressed within the name, or a numbering scheme that has been agreed on within an institution.

Such a process can produce identifiers such as 6His-tag-L11V-INS_HUM_6441, which remind researchers that they are dealing with a protein that
1. is derived from the human gene coding for insulin,
2. carries a leucine to valine mutation at position 11,
3. is equipped with a purification tag containing 6 histidine residues, and
4. is the 6441st gene that was created within the organization.
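
The ad-hoc assembly described above can be sketched in a few lines of Python. This is only an illustration of the naming habit, not a recommended scheme: the class `ConstructName`, its field names and the joining rules are assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConstructName:
    """Hypothetical container for the components that go into an alias."""
    gene: str                                            # e.g. "INS_HUM"
    tags: List[str] = field(default_factory=list)        # e.g. ["6His-tag"]
    mutations: List[str] = field(default_factory=list)   # e.g. ["L11V"]
    serial: Optional[int] = None                         # institutional running number

    def alias(self) -> str:
        # Tags, mutations and gene are joined with "-";
        # the running number, if any, is appended with "_".
        name = "-".join([*self.tags, *self.mutations, self.gene])
        if self.serial is not None:
            name = f"{name}_{self.serial}"
        return name


example = ConstructName(gene="INS_HUM", tags=["6His-tag"],
                        mutations=["L11V"], serial=6441)
print(example.alias())  # 6His-tag-L11V-INS_HUM_6441
```

The weak point is visible in the code: the order of the parts, the separators and the spelling of each component are all personal choices, so two people describing the same construct can easily produce two different strings.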

“Not too bad”, one may think. The trouble is that a colleague may come up with alternative names such as His-L11V-Insulin or His-Insulin (mut1)_6441 or N-term-HIS-INS_HUM_6441. Even if this does not cause outright confusion when discussing these constructs, dealing with the ‘which construct are you talking about’ issue is mind-numbing. There has to be a better way.
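
If every alias is recorded together with its full amino acid sequence, both kinds of mix-up (several names for one sequence, one name for several sequences) can be detected mechanically. The following is a minimal sketch under that assumption; the function name `find_naming_conflicts` and the short sequences are made up for illustration and are not real construct sequences.

```python
from collections import defaultdict


def find_naming_conflicts(records):
    """Group (alias, sequence) pairs and report naming conflicts.

    Returns two dicts:
      synonyms   - sequence -> aliases, where one sequence has several names
      collisions - alias -> sequences, where one name covers several sequences
    """
    names_by_seq = defaultdict(set)
    seqs_by_name = defaultdict(set)
    for alias, sequence in records:
        seq = sequence.strip().upper()  # ignore case and stray whitespace
        names_by_seq[seq].add(alias)
        seqs_by_name[alias].add(seq)
    synonyms = {s: sorted(n) for s, n in names_by_seq.items() if len(n) > 1}
    collisions = {a: sorted(s) for a, s in seqs_by_name.items() if len(s) > 1}
    return synonyms, collisions


# Toy records: the sequences are placeholders, not real insulin constructs.
records = [
    ("6His-tag-L11V-INS_HUM_6441", "HHHHHHMALWV"),
    ("His-L11V-Insulin", "hhhhhhmalwv"),
    ("N-term-HIS-INS_HUM_6441", "HHHHHHMALWV"),
    ("His-Insulin (mut1)_6441", "HHHHHHMALWV"),
    ("His-Insulin (mut1)_6441", "HHHHHHGSMALWV"),  # same name, extra linker
]
synonyms, collisions = find_naming_conflicts(records)
for seq, names in synonyms.items():
    print(f"{seq} is known under {len(names)} names: {names}")
for name, seqs in collisions.items():
    print(f"{name} refers to {len(seqs)} different sequences")
```

Such a check only finds conflicts after they have happened; it does not prevent colleagues from inventing new names.
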
As a side note: while the identifier 6His-tag-L11V-INS_HUM_6441 is human-readable and explains ‘the gist of it’, an algorithm cannot reliably deduce the precise amino acid sequence from it. This is because certain sequence modifications that seem minor when the name is created are left out. Such modifications may be linkers between the tag and the open reading frame, or what are sometimes called ‘cloning artifacts’: stretches of amino acid sequence that are dictated by the molecular biology of the cloning experiment. To get the accurate amino acid sequence, one would need to look it up, hopefully in a central online repository (ideally along with any associated experiments, their outcomes, material availability etc.).

Regardless, while we can’t deduce the sequence from the abbreviated name, at least we can reduce the number of variations of the identifier. This is what we’re attempting to do with pro2nick (pro2nick.proteindata.cloud).
The pro2nick web application clarifies the naming of engineered proteins by generating recognizable nicknames. It reads an amino acid sequence and automatically assembles a nickname based on the most closely related gene and the modifications to it, such as tags, deletions and mutations (go here for a technical description of how pro2nick works).
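
To make the idea of deriving a name from the sequence itself more concrete, here is a deliberately simplified sketch. It is not pro2nick's actual algorithm or output format, which are not described in this post. The lookup tables, the function names and the short reference fragment are assumptions for illustration, and the sketch only handles N-terminal tags and point substitutions, not deletions or insertions.

```python
# Hypothetical lookup tables; a real tool would rely on curated databases.
KNOWN_TAGS = {"6His": "HHHHHH", "FLAG": "DYKDDDDK"}
# Toy reference fragment for illustration only, not an authoritative sequence.
REFERENCES = {"INS_HUM": "MALWMRLLPLLALLALWGPDPAAA"}


def strip_n_terminal_tag(sequence):
    """Return (tag_name, remainder) if the sequence starts with a known tag."""
    for name, motif in KNOWN_TAGS.items():
        if sequence.startswith(motif):
            return name, sequence[len(motif):]
    return None, sequence


def point_mutations(reference, variant):
    """List substitutions such as 'L11V' (1-based) for equal-length sequences."""
    return [f"{r}{i}{v}"
            for i, (r, v) in enumerate(zip(reference, variant), start=1)
            if r != v]


def nickname(sequence):
    """Assemble tag, mutations and closest reference gene into one name."""
    tag, core = strip_n_terminal_tag(sequence.strip().upper())
    # Choose the equal-length reference with the fewest mismatches.
    candidates = [(len(point_mutations(ref, core)), gene)
                  for gene, ref in REFERENCES.items()
                  if len(ref) == len(core)]
    if not candidates:
        raise ValueError("no reference of matching length")
    _, gene = min(candidates)
    parts = ([tag] if tag else []) + point_mutations(REFERENCES[gene], core) + [gene]
    return "-".join(parts)


reference = REFERENCES["INS_HUM"]
variant = reference[:10] + "V" + reference[11:]  # substitute residue 11
print(nickname("HHHHHH" + variant))
```

Because the name is computed from the sequence rather than typed by hand, everyone who submits the same sequence gets the same string, which is the property that matters here.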


To get an impression of its simplicity, we’d like to encourage you to give pro2nick a try and let us know how to further improve the functionality.
Adhering to pro2nick-generated names reduces uncertainty in your lab and clarifies communication with your colleagues. The outcome of your research project may depend on it.

Further reading and best practices for naming gene and protein sequence variants:

Guidelines for naming genes are provided by the HUGO Gene Nomenclature Committee’s HGNC Guidelines

Hester M. Wain, Elspeth A. Bruford, Ruth C. Lovering, Michael J. Lush, Mathew W. Wright and Sue Povey
Guidelines for Human Gene Nomenclature
Genomics 79(4):464-470 (2002)

and guidelines for naming sequence variants, including protein variants, by the HGVS recommendations

Den Dunnen JT, Dalgleish R, Maglott DR, Hart RK, Greenblatt MS, McGowan-Jordan J, Roux AF, Smith T, Antonarakis SE, Taschner PEM
HGVS recommendations for the description of sequence variants: 2016 update.
Hum. Mutat. 37:564-569 (2016)

Fewer screw-ups in the molecular biology lab: Automated naming of protein constructs from sequence with pro2nick

Molecular biologists find it awkward to communicate with colleagues because the naming of protein constructs is not standardized. Indeed, life scientists often use different names when referring to identical amino acid sequences, which makes it difficult to communicate effectively about engineered protein constructs. The pro2nick web application clarifies the naming of engineered proteins by generating recognizable nicknames for their sequences. Pro2nick reads an amino acid sequence and assembles a nickname based on the most closely related gene and the modifications to it, such as tags, deletions, and mutations. Adhering to pro2nick-generated names reduces uncertainty in the lab and clarifies communication between protein scientists and molecular biologists.