General Workflow

As part of its weekly release procedure, the PDB publishes the sequences of the entries to be released the following Wednesday four days in advance. This pre-release is scheduled every Saturday at 3:00 UTC. CAMEO collects the pre-release and, after some pre-processing of the sequences and filtering steps, submits a selected set of targets to the registered servers. Participants have until the following Wednesday at 0:00 UTC to return their predictions. Once the PDB has released the reference structures that Wednesday, the evaluation is performed.
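The weekly timing described above can be sketched in a few lines of Python. This is only an illustration: the function name `submission_deadline` is hypothetical and not part of any CAMEO tooling.

```python
from datetime import datetime, timedelta, timezone


def submission_deadline(prerelease: datetime) -> datetime:
    """Return the Wednesday 00:00 UTC that follows a Saturday pre-release."""
    if prerelease.tzinfo is None:
        raise ValueError("prerelease must be timezone-aware")
    prerelease = prerelease.astimezone(timezone.utc)
    if prerelease.weekday() != 5:  # Monday is 0, so Saturday is 5
        raise ValueError("expected a Saturday pre-release")
    wednesday = prerelease.date() + timedelta(days=4)
    return datetime(wednesday.year, wednesday.month, wednesday.day,
                    tzinfo=timezone.utc)


# Example: the pre-release of Saturday 2025-07-05 at 03:00 UTC.
prerelease = datetime(2025, 7, 5, 3, 0, tzinfo=timezone.utc)
deadline = submission_deadline(prerelease)
print(deadline.isoformat())   # 2025-07-09T00:00:00+00:00
print(deadline - prerelease)  # 3 days, 21:00:00
```

In other words, servers have a little under four days between receiving a target and the submission deadline.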

Currently CAMEO contains a single category: complex structures modeling (3D). Protein model quality assessment (QE), the old single-chain protein structure (3D), protein contact prediction (CP) and ligand binding site (LB) have been discontinued.

CAMEO servers can be registered as public servers, with their full names and results available to everyone, or as development servers, where the name is disguised ('serverX') and all scores are computed and visible to other method developers, but not to the public. See our complete list of registered servers.

A CAMEO target is a pre-released PDB entry, which is submitted to registered servers. In CAMEO, a target consists of one or more peptide, protein, DNA or RNA sequence(s), and zero or more free ligands belonging to the same pre-released PDB entry. A target can, thus, be a monomer, a homo-oligomer or a hetero-oligomer, and contain ligand(s) or not.

CAMEO considers any pre-released sequence containing 30 or more amino acids to be a protein. Amino acid sequences strictly shorter than 30 residues are named peptides. Free ligands are small, non-polymer molecules that are pre-released as InChI codes and SMILES by the PDB.

CAMEO does not know the stoichiometry of complexes, as this information is not part of the PDB pre-release. Participants are expected to predict the complex stoichiometry as part of the modeling.

Pre-processing

After downloading the pre-released sequences from the PDB on Saturday, CAMEO performs the following steps so that only a limited number of high-quality targets are submitted to the participants:

Filtering of the sequences
Clustering of similar targets
Classification of targets

1. Filtering of the sequences

CAMEO Complete Modeling only submits filtered nucleic and amino-acid sequences to the participants. The filtering step removes targets if any of their sequences:

contain unknown residues in the canonical sequence;
contain non-canonical or modified residues that don't have a parent amino or nucleic acid in the PDB Chemical Component Dictionary;
contain non-canonical nucleic or amino acids (selenocysteine, etc.) after the cleanup;
contain sequence caps (such as N-terminal acetylation, C-terminal amidation) or contain residues annotated as non-linking by the PDB Chemical Component Dictionary, or linking with a linking type inconsistent with the rest of the sequence;
or otherwise couldn't be fully converted to sequences of canonical amino acids.

2. Clustering of similar targets

In order to avoid "duplicate" submissions of very similar targets, CAMEO clusters the remaining targets.

First, CAMEO clusters individual polymer sequences from the targets:

All protein sequences of 30 amino acids or more are clustered with CD-HIT at 99% sequence identity and 90% alignment coverage. Before clustering, common expression artifacts (such as His-tags, linker peptides, and protease cleavage sites) are trimmed from protein sequence termini so that these artificial additions do not affect cluster assignments.
All other sequences (nucleic acids, and peptides shorter than 30 amino acids) are clustered based on exact identity.

Then, complexes are clustered based on the set of individual sequences they contain. Complexes that contain the exact same set of sequences are grouped together (clustered complexes).

Finally, a second level of clustering is added to the clustered complexes, taking non-polymer ligands into account. Complexes in the same clustered complex are sub-divided into clustered complexes with ligands, each of them containing the exact same set of ligands.
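The two levels of complex clustering can be illustrated with a short Python sketch. It assumes that each polymer sequence has already been assigned a sequence cluster ID (for example by CD-HIT or by exact identity); the function name, the input layout and the IDs below are hypothetical.

```python
from collections import defaultdict


def cluster_complexes(targets):
    """Group targets by their set of sequence clusters, then by their ligands.

    `targets` maps a target ID to a pair (sequence_cluster_ids, ligand_ids).
    Returns {sequence set: {ligand set: [target IDs]}}.
    """
    clustered = defaultdict(lambda: defaultdict(list))
    for target_id, (seq_clusters, ligands) in targets.items():
        # frozenset ignores order and copy number, so only the *set* matters.
        clustered[frozenset(seq_clusters)][frozenset(ligands)].append(target_id)
    return {seqs: dict(by_ligand) for seqs, by_ligand in clustered.items()}


# Hypothetical targets: sequence cluster IDs "A"/"B", ligand IDs as CCD codes.
targets = {
    "T1": (["A", "B"], ["ATP"]),
    "T2": (["B", "A"], ["ATP"]),
    "T3": (["A", "B"], []),
    "T4": (["A"], ["ATP"]),
}
clusters = cluster_complexes(targets)
for seqs, by_ligand in clusters.items():
    for ligands, members in by_ligand.items():
        print(sorted(seqs), sorted(ligands), members)
```

Here T1 and T2 end up in the same clustered complex with ligands, T3 shares their clustered complex but forms its own ligand sub-cluster, and T4 is a separate clustered complex.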

3. Classification of targets

3a. Classification of individual polymer sequences

Target complexes are first classified by difficulty into easy, medium and hard targets.

First, all protein sequences of 30 amino acids or more are searched separately for templates with BLAST against the full list of protein sequences currently in the PDB. Templates are classified into one of three categories:

"Easy", if the template has 85% or more sequence identity to the target, and additionally:
1. at least 70% of the target sequence is covered by the template
2. and, for target sequences longer than 250 residues, less than 45 amino acids of the target are not covered by the target-template alignment.
"Medium", if the template is not "easy" but has a BLAST hit with an e-value of 10⁻⁴ or less, with the same coverage requirements as for easy targets (70% coverage and, for targets longer than 250 residues, less than 45 amino acids not covered).
"Hard" otherwise.

Second, all nucleic acid sequences and peptide sequences shorter than 30 amino acids are subjected to a template search against all the sequences currently in the PDB. Templates are identified based on exact sequence identity. If a template has the exact same sequence in the PDB it is classified as "easy".
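The rules for protein templates can be expressed as a small decision function. The sketch below is not CAMEO's implementation: the function names and arguments are hypothetical, the alignment statistics are assumed to come from a BLAST hit, and the e-value cutoff is passed in explicitly (the `1e-4` used in the example is an assumed value).

```python
def coverage_ok(target_length, aligned_residues):
    """Check the coverage requirements shared by "easy" and "medium"."""
    if aligned_residues / target_length < 0.70:
        return False
    uncovered = target_length - aligned_residues
    if target_length > 250 and uncovered >= 45:
        return False
    return True


def classify_template(target_length, aligned_residues, identity_percent,
                      evalue, max_evalue):
    """Classify one protein template as "easy", "medium" or "hard"."""
    if coverage_ok(target_length, aligned_residues):
        if identity_percent >= 85.0:
            return "easy"
        if evalue <= max_evalue:
            return "medium"
    return "hard"


MAX_EVALUE = 1e-4  # assumed cutoff for this example

print(classify_template(300, 280, 90.0, 1e-50, MAX_EVALUE))  # easy
print(classify_template(300, 240, 90.0, 1e-50, MAX_EVALUE))  # hard: 60 residues uncovered
print(classify_template(200, 150, 40.0, 1e-10, MAX_EVALUE))  # medium
print(classify_template(200, 150, 40.0, 1.0, MAX_EVALUE))    # hard
```

Note that a highly identical template that fails the coverage requirements is neither "easy" nor "medium", since both categories share the same coverage rules.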

3b. Classification of complexes

CAMEO uses the template information obtained from individual sequences and integrates it into a classification of whole complexes.

A template is considered for the complex only if it covers all the sequences of the target and has an exhaustive (1:1) mapping between every sequence of the target and of the template. The template complex may not contain additional sequences that are not part of the target (or not included in the mapping).

The complex difficulty is defined as follows:

"Easy" if a template can be found that is "easy" for all the individual sequences.
1. "Ligand" targets are target complexes classified as "easy" which contain a novel ligand, that is, a ligand which is not present in any of the templates that make the complex "easy". (The ligand may however exist in a "medium" or "hard" template, or a similar ligand may be observed in an "easy" template.) Known crystallographic artifacts (such as GOL, PEG, 1PE and other buffer or crystallization components) are excluded from the ligand comparison on both the target and the template side, using the plinder curated artifact list (May 2024 update).
"Medium" if the complex is not "easy", but a template can be found that is either "easy" or "medium" for all the individual sequences.
"Hard" if at least one sequence can only be mapped to a "hard" template, or if no template covers the whole complex.

Note: targets classified as "medium" or "hard" may include ligands, which will be submitted to servers that can receive ligands. Other servers will receive the target without the ligand information.

Also note: known crystallographic artifacts may still be submitted as part of a complex with other, relevant ligands. You may choose to model them or not, as they are excluded from scoring.
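The complex-level difficulty rules above can be sketched as follows. The sketch assumes that the candidate templates have already been restricted to those with a complete 1:1 mapping to the target, and that each candidate is represented by the list of per-sequence difficulties it yields; this representation and the function name are illustrative only.

```python
def classify_complex(candidate_templates):
    """Return "easy", "medium" or "hard" for a whole complex.

    `candidate_templates` is a list with one entry per complete template;
    each entry lists the difficulty of that template for every target sequence.
    """
    candidates = [t for t in candidate_templates if t]  # ignore empty entries
    if any(all(d == "easy" for d in t) for t in candidates):
        return "easy"
    if any(all(d in ("easy", "medium") for d in t) for t in candidates):
        return "medium"
    return "hard"  # also reached when no template covers the whole complex


print(classify_complex([["easy", "easy"]]))                      # easy
print(classify_complex([["easy", "medium"], ["hard", "easy"]]))  # medium
print(classify_complex([["easy", "hard"]]))                      # hard
print(classify_complex([]))                                      # hard
```

The "ligand" sub-category is not covered here, because it additionally requires comparing the ligands of the target with those of the "easy" templates.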

Target Submission

After pre-processing, CAMEO submits the selected targets to the registered servers. CAMEO only submits complete targets to participating servers, that is, targets that only contain types of sequences that the participant can model (protein, DNA, RNA and peptides). CAMEO will never submit part of a heteromeric complex.

For instance, a server supporting heteromeric protein modeling will not receive RNA-protein complexes; similarly a server capable of modeling only single protein chains will not receive a heteromeric protein complex as target.

The only exception to this is ligands: servers that cannot model ligands can still receive complexes containing ligands (just without the ligand information).

In order to receive submissions, your server should be ready to:

Receive the target(s) by HTTP POST or GET request.
Return a 200 HTTP status code as soon as it receives the target. If your server encounters an error and returns a 4XX or 5XX status code, CAMEO will interrupt the submission. We can manually restart failed submissions on a best-effort basis.

The exact contents of the request depend on the capabilities of your server, and can be customized to some extent.
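As an illustration of the two requirements above, here is a minimal receiving endpoint built on Python's standard library. It is a sketch under several assumptions: the payload layout (`target=example`) is made up, the handler simply queues the raw request for later processing, and a production server would add parsing, validation and persistent storage.

```python
import http.client
import queue
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

JOBS = queue.Queue()  # payloads waiting to be modeled outside the handler


class TargetHandler(BaseHTTPRequestHandler):
    def _accept(self, payload):
        JOBS.put(payload)        # defer the slow modeling work
        self.send_response(200)  # acknowledge as soon as the target is stored
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"received\n")

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        self._accept(self.rfile.read(length))

    def do_GET(self):
        self._accept(self.path.encode())  # parameters arrive in the query string


# Local demonstration: start the server, send one request, shut down again.
server = HTTPServer(("127.0.0.1", 0), TargetHandler)  # port 0 picks a free port
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_port, timeout=5)
conn.request("POST", "/", body=b"target=example")  # hypothetical payload
response = conn.getresponse()
status = response.status
response.read()
conn.close()

server.shutdown()
server.server_close()
received = JOBS.get_nowait()
print(status, received)
```

Returning the 200 status before any modeling starts keeps the submission from being interrupted by long-running jobs.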

Prediction Format

Participants return their predictions by email to the address provided during target submission (the "Results Email" variable registered with CAMEO). Up to 5 models can be returned; only model 1 is used for aggregated scores and comparisons on the website, while all 5 models are scored individually and available in the data downloads. The deadline is Wednesday at 0:00 UTC, with a grace period until 3:00 CET/CEST.

You can return several emails. Only the last email received within the submission window will be considered in the evaluation.

Supported formats

Legacy PDB or mmCIF format for polymer targets, optionally gzipped.
- The model number (1–5) is determined by the order of the attachments in the email.
- Residues must be numbered after the target sequence (1-based numbering). Insertion codes are not allowed.
- Ligands contained in legacy PDB files are ignored. Use SDF, mmCIF or the CASP15 format for ligand predictions.
- For multi-model PDB files, each model should begin with a MODEL record and end with ENDMDL. The file should end with an END record.
SDF format for ligand predictions, optionally gzipped, separately from the polymer prediction.
- Multiple distinct ligands and ligand copies must be attached in different files.
- In SDF files containing multiple ligands (separated by $$$$ per SDF format), each molecule will be treated as a separate pose of the ligand.
- All the ligands and poses submitted in SDF format will be scored against model 1.
The CASP15 ligand prediction format is also supported for both polymer and ligand submissions:
- MODEL and POSE keywords contain a 1-based integer. Only models and poses 1–5 will be processed.
- The LIGAND keyword is mandatory to mark the beginning of a ligand block. The number and identification of the ligand are ignored.
- Up to 5 models OR 5 poses of each ligand per model are supported. Unlike CASP15, it is not possible to match up MODELs and POSEs to return up to 25 predictions per ligand. Only 5 models/poses in total will be analyzed.
- Unlike CASP16, it is possible to return 1 MODEL with 5 POSEs. There is no need to duplicate the model: CAMEO will do it automatically, and rename the POSEs to the corresponding MODEL number.
- The model number is read from the MODEL keyword, not from the ordering of the attachment.
- Predictions should preferably be sent as a single attachment. Separate attachments are also supported, provided that all the data for one model (including ligands) is included in the same file. Ligands in separate files from the polymer model cannot be processed; for that, use the SDF format described above.
- Other special keywords (such as PFRMAT, TARGET, AUTHOR, METHOD and PARENT) are ignored.
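For the multi-model legacy PDB files mentioned above, the record layout can be produced with a small helper. This is a sketch only: `join_models` is a hypothetical name, and the coordinates in the example are placeholders.

```python
RESERVED = {"MODEL", "ENDMDL", "END"}


def join_models(models):
    """Combine up to 5 single-model PDB bodies into one multi-model file."""
    if not 1 <= len(models) <= 5:
        raise ValueError("expected between 1 and 5 models")
    lines = []
    for number, body in enumerate(models, start=1):
        lines.append(f"MODEL     {number:>4}")  # serial number in columns 11-14
        lines.extend(
            line for line in body.splitlines()
            if line.strip() and line.split()[0] not in RESERVED
        )
        lines.append("ENDMDL")
    lines.append("END")
    return "\n".join(lines) + "\n"


# Placeholder coordinates for a single CA atom per model.
model_1 = "ATOM      1  CA  ALA A   1      11.104   6.134  -6.504  1.00  0.00           C"
model_2 = "ATOM      1  CA  ALA A   1      11.500   6.300  -6.100  1.00  0.00           C"
pdb_text = join_models([model_1, model_2])
print(pdb_text)
```

Any MODEL, ENDMDL or END records already present in the inputs are dropped, so that the output contains exactly one MODEL/ENDMDL pair per model and a single final END.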

General Guidelines

Models should be returned as MIME attachments. Alternatively, a single uncompressed PDB file can be included in the email body.
Residues must be numbered according to the submitted target sequence. Insertion codes are not allowed and models containing them will be ignored.
Chain naming is free, but every chain must have a unique name. Free ligands should be placed in separate chains.
Only residues and atoms whose names follow the PDB Chemical Component Dictionary notation are scored. Others are treated as not modeled.
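A quick pre-submission check of the numbering rules for legacy PDB files might look like the sketch below. It only inspects ATOM records, relies on the fixed column positions of the PDB format, and expects the caller to supply the length of the target sequence modeled by each chain; the function names are hypothetical.

```python
def check_numbering(pdb_text, chain_lengths):
    """Return a list of numbering problems found in ATOM records.

    `chain_lengths` maps each chain name to the length of the target
    sequence that the chain models.
    """
    problems = []
    for line_number, line in enumerate(pdb_text.splitlines(), start=1):
        if not line.startswith("ATOM"):
            continue
        if len(line) < 27:
            problems.append(f"line {line_number}: truncated ATOM record")
            continue
        chain = line[21]                       # column 22
        insertion_code = line[26].strip()      # column 27
        try:
            residue_number = int(line[22:26])  # columns 23-26
        except ValueError:
            problems.append(f"line {line_number}: non-numeric residue number")
            continue
        if insertion_code:
            problems.append(f"line {line_number}: insertion code '{insertion_code}'")
        if chain not in chain_lengths:
            problems.append(f"line {line_number}: unexpected chain '{chain}'")
        elif not 1 <= residue_number <= chain_lengths[chain]:
            problems.append(
                f"line {line_number}: residue {residue_number} "
                f"outside 1..{chain_lengths[chain]}"
            )
    return problems


def atom_record(serial, chain, resseq, icode=" "):
    """Build a minimal CA ATOM record with placeholder coordinates."""
    return (f"ATOM  {serial:>5}  CA  ALA {chain}{resseq:>4}{icode}   "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}")


example = "\n".join([
    atom_record(1, "A", 1),
    atom_record(2, "A", 2, icode="A"),  # insertion code: not allowed
    atom_record(3, "A", 7),             # beyond a 5-residue target
    atom_record(4, "B", 1),             # chain not expected by the caller
])
problems = check_numbering(example, {"A": 5})
print("\n".join(problems))
```

Such a check does not replace proper format validation, but it catches the two most common reasons for a model to be ignored or misaligned.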

Note: if your server is registered for the CASP experiment, it should already fulfill most of these technical requirements and you can use the same technology for CAMEO.

mmCIF Format Guidelines

CAMEO supports predictions in PDBx/mmCIF format. We read files with the LoadMMCIF function of OpenStructure. In addition to the atom_site category, the reader uses entity information from the entity, entity_poly, and entity_poly_seq categories to identify polymer/non-polymer chains and their sequences.

This function can tolerate some errors in the mmCIF format. However, we encourage participants to submit valid mmCIF files and strongly recommend validating them before submission. Below are several ways to validate your files.

1. Docker container

If you have access to Docker, we provide a convenient validation script as a container. The mmCIF file to validate must be in the current working directory or one of its subdirectories.

docker run --rm -v "$(pwd)":/tmp --user $(id -u):$(id -g) registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:latest validate-mmcif-file -a . -r --dict-sdb /usr/local/share/mmcif-dict-suite/mmcif_pdbx_v50.sdb [FILE.cif]

To update to the latest version of the container image, run:

docker pull registry.scicore.unibas.ch/schwede/modelcif-converters/mmcif-dict-suite:latest

The script is based on the RCSB mmCIF Dictionary Suite (see below) and includes a report mode which is easier to read.

2. PDBe mmCIF Validator extension for VS Code

PDBe provides an easy-to-install mmCIF Validator extension for VS Code.

Once installed, the validator can also be used from the command line.

3. RCSB mmCIF Dictionary Suite

The RCSB provides an mmCIF Dictionary Suite that can be run from the command line and requires minimal dependencies to install. It is the reference implementation for validating mmCIF files as it comes from the same source as the mmCIF format.

Evaluation

Once the PDB releases the experimental structures on Wednesday (00:00 UTC), we automatically start the evaluation process. Structural information is downloaded from the PDB and the following additional filtering steps are performed:

Only X-ray diffraction and high-resolution (≤ 4.0Å) EM structures are included in the evaluation. Targets derived from solution NMR experiments have been excluded since week 2025-07-05 but are present in earlier evaluations. Targets derived from other experimental methods are excluded.
Very large complexes (> 200 polymer chains, or > 100 chains of a single entity) are excluded, as current scoring methods are not able to compute them in reasonable time.
Emails from participating servers are collected and analyzed, and predictions are scored if the emails were received on time or during the grace period (until Wednesday at 3:00 CET/CEST).
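The structure-level filters above can be summarized in a short predicate. The sketch assumes that the experimental method is given as the PDB method string and that the resolution cutoff applies to EM structures only; the function name and the input layout are hypothetical.

```python
def is_evaluated(method, resolution, chains_per_entity):
    """Decide whether a released structure enters the evaluation.

    `chains_per_entity` maps each polymer entity ID to its number of chains.
    """
    if method == "X-RAY DIFFRACTION":
        pass
    elif method == "ELECTRON MICROSCOPY":
        if resolution is None or resolution > 4.0:
            return False
    else:
        return False  # solution NMR and all other methods are excluded
    total_chains = sum(chains_per_entity.values())
    largest_entity = max(chains_per_entity.values(), default=0)
    if total_chains > 200 or largest_entity > 100:
        return False  # too large to score in reasonable time
    return True


print(is_evaluated("X-RAY DIFFRACTION", 2.1, {"1": 2}))      # True
print(is_evaluated("ELECTRON MICROSCOPY", 4.5, {"1": 1}))    # False
print(is_evaluated("ELECTRON MICROSCOPY", 3.2, {"1": 120}))  # False
print(is_evaluated("SOLUTION NMR", None, {"1": 1}))          # False
```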

Scores

All the scores are computed with OpenStructure, as described in Studer et al. (2025), with alignments based on residue numbers. Scores are initially computed on every biounit. For polymer scores, only the scores against the biounit yielding the highest LDDT are kept; for ligand scores, only those against the biounit resulting in the most ligands scored, the highest sum of LDDT-PLI and the lowest sum of BiSyRMSD are kept.

Most scores include a penalty for missing chains and ligands, penalizing methods that fail to predict stoichiometry. Variants considering only mapped chains alleviate this penalty. Besides LDDT, these mapped scores are provided as "Advanced scores" (see below for more details). Note that there are no penalties for added chains or ligands.

The following scores are available:

The LDDT score (Local Distance Difference Test) evaluates the quality of the local atomic environment of a model of protein or nucleic acid chains. LDDT is superposition independent and considers all atoms in one or more protein and nucleotide chains. LDDT rewards the fraction of correctly predicted inter-atomic distances in a model at different threshold levels. Chains are mapped automatically between the model and the reference, with alignments based on residue numbers. A filter based on the Engh and Huber bond lengths and angles removes stereochemical violations and steric clashes. We use the default inclusion radius (15 Å) and distance difference thresholds (0.5Å, 1Å, 2Å, 4Å).
LDDT (mapped chains) is a variation of LDDT that only considers mapped chains and hence doesn't penalize for wrong stoichiometry.
The iLDDT score (interface LDDT) is a variation of LDDT evaluating the accuracy of interfaces by focussing on inter-chain contacts only. iLDDT uses the default LDDT distance cutoffs (0.5Å, 1Å, 2Å, 4Å) and is therefore very stringent and quickly tends to 0 if interfaces are modeled incorrectly.
The TM-score (Template modeling score) is a backbone-only score that depends on a global superposition to assess the overall accuracy of a complex. It mitigates the effect of outlier regions by focusing on maximizing the alignment of correctly predicted regions, limiting the influence of erroneous regions by treating them as outliers. The TM-score contains a scaling factor d₀ that makes it independent of the protein length. As a result, for large oligomeric complexes, local accuracy can be fairly low even with a high TM-score. In CAMEO, the TM-score is computed with US-align (Zhang et al. (2022)) via OpenStructure.
The LDDT-PLI score (LDDT Protein-Ligand Interfaces) is a variation of LDDT evaluating the accuracy of polymer-ligand contacts.
The number of successes is the number of ligand predictions with a BiSyRMSD < 2Å. BiSyRMSD is a symmetry-corrected RMSD computed after superposition of the binding site, similar to what is commonly used in the docking community. The number is reported together with the total number of ligands.
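To make the LDDT definition above more concrete, here is a deliberately simplified sketch of its core idea: the fraction of reference distances within the inclusion radius that are preserved in the model, averaged over the four thresholds. It is not the OpenStructure implementation: it omits chain mapping, the stereochemistry filter and the handling of symmetric side chains, and it assumes atoms are identified by hypothetical `(chain, residue number, atom name)` keys that already match between model and reference.

```python
from itertools import combinations
from math import dist


def simple_lddt(reference, model, radius=15.0, thresholds=(0.5, 1.0, 2.0, 4.0)):
    """Simplified LDDT between two dicts {(chain, resnum, atom): (x, y, z)}."""
    pairs = [
        (a, b) for a, b in combinations(reference, 2)
        if a[:2] != b[:2]                              # skip pairs within one residue
        and dist(reference[a], reference[b]) <= radius
    ]
    if not pairs:
        return 0.0
    preserved = 0
    for a, b in pairs:
        if a in model and b in model:                  # missing atoms score nothing
            difference = abs(dist(reference[a], reference[b])
                             - dist(model[a], model[b]))
            preserved += sum(difference <= t for t in thresholds)
    return preserved / (len(pairs) * len(thresholds))


reference = {
    ("A", 1, "CA"): (0.0, 0.0, 0.0),
    ("A", 2, "CA"): (3.0, 0.0, 0.0),
    ("A", 3, "CA"): (6.0, 0.0, 0.0),
}
shifted = dict(reference)
shifted[("A", 3, "CA")] = (7.5, 0.0, 0.0)  # third atom displaced by 1.5 Å

score_identical = simple_lddt(reference, reference)
score_shifted = simple_lddt(reference, shifted)
print(score_identical, round(score_shifted, 3))
```

In the shifted model, two of the three distances are off by 1.5 Å, so they only satisfy the 2 Å and 4 Å thresholds, giving 8 preserved checks out of 12. Because only distances are compared, no superposition of model and reference is needed.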

Scores aggregation

Only model-1 scores are displayed on the website. Scores for models 2–5 are available in the full data downloads.

When comparing multiple servers, aggregated scores are computed on the common subset of targets for which all selected servers have submitted predictions. This ensures a fair comparison by avoiding biases that would arise from evaluating servers on different sets of targets.
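The common-subset comparison can be sketched as follows. The aggregation function is assumed to be a plain mean for the purpose of this example, and the data layout and names are hypothetical.

```python
def common_subset_means(scores, servers):
    """Average each server's scores over the targets shared by all servers.

    `scores` maps a server name to a dict {target ID: score}.
    Returns (means per server, sorted list of common targets).
    """
    if not servers:
        return {}, []
    common = set.intersection(*(set(scores[s]) for s in servers))
    if not common:
        return {s: None for s in servers}, []
    means = {s: sum(scores[s][t] for t in common) / len(common) for s in servers}
    return means, sorted(common)


# Hypothetical per-target scores for two servers.
scores = {
    "server1": {"T1": 80.0, "T2": 60.0, "T3": 40.0},
    "server2": {"T1": 70.0, "T2": 50.0},
}
means, common = common_subset_means(scores, ["server1", "server2"])
print(common)  # ['T1', 'T2']
print(means)   # {'server1': 70.0, 'server2': 60.0}
```

Target T3 is left out because server2 did not return a prediction for it; averaging server1 over all three targets would have lowered its score to 60.0 and made the comparison uneven.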

Ligand scores displayed on the website (LDDT-PLI, Ligand Success Rate) are aggregated over all relevant ligands in the target, excluding common crystallographic artifacts (such as GOL, PEG, 1PE and other buffer or crystallization components) identified using the plinder curated artifact list (May 2024 update). Per-ligand scores for all ligands, including artifacts, are available in the full data downloads.
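As an illustration of how the ligand success count relates to the artifact filter, consider the sketch below. The artifact set contains only the three codes named above as an illustrative subset (the actual list is much longer), a ligand that was not predicted is represented by `None`, and the function name is hypothetical.

```python
ARTIFACTS = frozenset({"GOL", "PEG", "1PE"})  # illustrative subset only


def ligand_success(ligand_rmsds, threshold=2.0, artifacts=ARTIFACTS):
    """Count successful ligand predictions among the relevant ligands.

    `ligand_rmsds` is a list of (ligand code, BiSyRMSD or None) pairs.
    Returns (number of successes, number of relevant ligands).
    """
    relevant = [(code, rmsd) for code, rmsd in ligand_rmsds
                if code not in artifacts]
    successes = sum(1 for _, rmsd in relevant
                    if rmsd is not None and rmsd < threshold)
    return successes, len(relevant)


# Hypothetical target with three relevant ligands and one artifact.
result = ligand_success([("ATP", 1.2), ("GOL", 0.5), ("NAD", 3.4), ("HEM", None)])
print(result)  # (1, 3)
```

The glycerol pose is ignored even though it is accurate, while the unpredicted heme still counts towards the total, which reflects the penalty for missing ligands.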

Advanced Scores

The following scores are considered "advanced". They are available in the full data downloads, but are not displayed on the web:

The QS-score (Quaternary Structure score), initially described by Bertoni et al. (2017), quantifies the fraction of shared interface contacts (residues on different chains with a Cβ-Cβ distance < 12 Å) between two complexes. A QS-score close to 1 translates to very similar interfaces, matching stoichiometry and a majority of identical interfacial contacts. A QS-score close to 0 indicates a radically diverse quaternary structure, probably different stoichiometries and potentially representing alternative binding conformations. The QS-score is symmetrical and considers reference and model identically. Two flavors of QS-score are reported: QS-global where, if the reference is incomplete, contacts that are present only in the model will penalize the score, even though the involved residues are not covered by experiment; and QS-best, which will not penalize contacts missing in the model, even though the involved residues are covered by the reference structure.
The Complex RMSD is the root mean square deviation of Cα positions (C3' for nucleic acids) after Kabsch superposition, with a chain mapping optimizing RMSD. RMSD doesn't penalize for missing residues in the model. Because RMSD relies on a global superposition, it can be very sensitive to flexibility.
The LDDT-LP (LDDT Ligand Pocket) assesses the accuracy of the residues in the ligand-binding pocket, irrespective of the ligand, and with a chain mapping optimizing the ligand RMSD. Unlike the LDDT-BS (LDDT Binding Site) computed by earlier versions of CAMEO, it is only computed when a ligand was predicted in or near the binding pocket.
The mapped scores are scores computed only on chains mapped by the initial chain mapping of the whole complex. They are meant to mitigate the penalty of missing chains for servers that do not predict the stoichiometry of the complexes.