Browse the SLAM gene predictions
Introduction
We have run the SLAM program on the human (NCBI Build 30,
June 2002) and mouse (MGSC v3, February 2002) genomes. Orthologous
regions from the two genomes as specified by a symmetric synteny map were used
as input to SLAM. Because the map used was symmetric and nonoverlapping,
the annotations produced by SLAM are also symmetric and nonoverlapping.
For each gene prediction made in the human genome there is a corresponding
gene prediction in the mouse genome with identical exon structure. In
addition to predicting genes, SLAM also outputs regions that it considers to
be conserved non-coding sequence (CNS). Here we present the annotations
made by SLAM on the whole human and mouse genomes.
Methods
- The symmetric synteny map used in the mouse paper (constructed by
Michael Kamal) was obtained.
The map pairs segments in the human and mouse genomes with continuity
broken only by blocks of length less than 300 kb.
- The syntenic segments were broken into smaller syntenic pieces
(length < 300 kb) for easier processing by SLAM.
- The syntenic pieces were aligned using AVID. The AVID run was completed on
a cluster at Affymetrix in approximately 2.5 hours.
- SLAM was run on all syntenic pieces using the AVID alignments as guides.
The SLAM run was completed on a cluster at Affymetrix in approximately
2 days.
- SLAM gene predictions with coding length less than 120 bp were filtered
out.
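The final filtering step can be illustrated with a minimal Python sketch. It assumes, hypothetically, that each prediction is held as a list of (start, end) coding-exon coordinates (1-based and inclusive, as in GFF) and that coding length is the sum of the exon lengths. The names `coding_length` and `filter_short_predictions` and the example data are illustrative only and are not part of SLAM.

```python
MIN_CODING_LENGTH = 120  # bp threshold stated in the Methods


def coding_length(exons):
    """Sum exon lengths; coordinates are assumed 1-based and inclusive."""
    return sum(end - start + 1 for start, end in exons)


def filter_short_predictions(genes, min_length=MIN_CODING_LENGTH):
    """Keep only predictions whose coding length is at least min_length bp."""
    return {
        gene_id: exons
        for gene_id, exons in genes.items()
        if coding_length(exons) >= min_length
    }


# Hypothetical example data, not real SLAM output.
example = {
    "geneA": [(100, 180), (300, 360)],  # 81 + 61 = 142 bp, kept
    "geneB": [(1000, 1089)],            # 90 bp, filtered out
}
kept = filter_short_predictions(example)
```

Because the human and mouse predictions share exon structure, a filter of this kind would presumably be applied to a prediction pair as a unit; how SLAM's own post-processing handles this is not described here.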
Results
Summary Statistics
| Number of syntenic segments | 342 |
| Number of syntenic pieces | 10,613 |
| Number of predicted genes | 29,283 |
| Number of predicted exons | 178,750 |
| Number of predicted CNS (conserved non-coding sequence) | 511,895 |
Discussion
The de novo SLAM predictions are orthologous in the sense that they are symmetric: there is a bijective correspondence between human and mouse gene predictions and their structures. This symmetry increases confidence in the predictions, because each human gene prediction must have consistent ORFs, splice sites and exon lengths in mouse, and vice versa.
At the exon level, SLAM covers 79.8% of the RefSeq human exons and 77.5% of the exons in the ENSEMBL human gene set. These numbers are only slightly lower than the Genscan and Twinscan coverage of these gene sets. The shortfall arises because orthologous predictions are not possible where there have been local rearrangements smaller than 300 kb, or where the synteny map is wrong (a region mapped either to the wrong place or to a paralogous region). As expected, SLAM is specific, with fewer coding exon predictions than other programs.
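The exon-level coverage figures can be made concrete with a small Python sketch. The text does not define "covers", so this sketch assumes that a reference exon counts as covered when it overlaps any predicted exon on the same chromosome by at least one base, ignoring strand. The function names and the example intervals are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(predicted):
    """Index predicted exons (chrom, start, end) by chromosome."""
    by_chrom = defaultdict(list)
    for chrom, start, end in predicted:
        by_chrom[chrom].append((start, end))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        starts = [s for s, _ in intervals]
        # Running maximum of end coordinates over the sorted intervals.
        max_ends, best = [], intervals[0][1]
        for _, e in intervals:
            best = max(best, e)
            max_ends.append(best)
        index[chrom] = (starts, max_ends)
    return index


def is_covered(index, chrom, start, end):
    """True if [start, end] overlaps any indexed interval by >= 1 base."""
    if chrom not in index:
        return False
    starts, max_ends = index[chrom]
    i = bisect_right(starts, end)  # intervals whose start <= end
    return i > 0 and max_ends[i - 1] >= start


def coverage_fraction(reference, predicted):
    """Fraction of reference exons overlapped by some predicted exon."""
    if not reference:
        return 0.0
    index = build_index(predicted)
    covered = sum(is_covered(index, c, s, e) for c, s, e in reference)
    return covered / len(reference)


# Hypothetical intervals, not real annotation data.
reference = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 10, 50)]
predicted = [("chr1", 150, 250), ("chr1", 700, 800)]
fraction = coverage_fraction(reference, predicted)  # one of three covered
```

A stricter definition (for example, requiring matching exon boundaries or a minimum overlap) would give lower figures; the choice made here is only one plausible reading.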
It is interesting to note that 151,770 ENSEMBL exons are covered by SLAM in human and 152,548 in mouse, suggesting that the sensitivity of ENSEMBL is very consistent between human and mouse. On the other hand, only 119,275 SLAM exons are covered by ENSEMBL in mouse, versus 125,773 in human, implying a small (but not insignificant) difference in specificity. Twinscan and Genscan display similar human/mouse discrepancies in sensitivity and specificity.
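The size of the specificity difference can be worked out from the counts above. This Python sketch assumes that the 178,750 predicted exons reported in the summary statistics apply to each genome (consistent with the identical exon structure of the human and mouse predictions, but an assumption nonetheless), and it uses the fraction of SLAM exons covered by ENSEMBL as a rough proxy for specificity.

```python
SLAM_EXONS = 178_750  # predicted exons, assumed to apply per genome

# SLAM exons covered by ENSEMBL, as quoted in the Discussion.
slam_covered_by_ensembl = {"human": 125_773, "mouse": 119_275}

# Fraction of SLAM exons confirmed by ENSEMBL in each species.
covered_fraction = {
    species: count / SLAM_EXONS
    for species, count in slam_covered_by_ensembl.items()
}

# Difference between species, in percentage points.
gap_points = 100 * (covered_fraction["human"] - covered_fraction["mouse"])

for species, value in covered_fraction.items():
    print(f"{species}: {100 * value:.1f}% of SLAM exons covered by ENSEMBL")
print(f"difference: {gap_points:.1f} percentage points")
```

Under these assumptions, about 70.4% of SLAM exons are covered by ENSEMBL in human versus about 66.7% in mouse, a gap of roughly 3.6 percentage points.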
In summary, the SLAM whole-genome human/mouse run demonstrates the feasibility of de novo prediction of orthologous genes in the human and mouse genomes and yields thousands of new coding exons not predicted by other methods. The SLAM CNS set is the first de novo prediction of conserved non-coding regions in the human and mouse genomes, and should be useful for many applications. In addition, the symmetric nature of SLAM allows for inferences about problems in the existing human and/or mouse gene sets.
Downloads
- All annotations are in GFF format.
- Download all human genes (2.6 MB gzipped file) and CNS (6.7 MB gzipped file)
predicted by SLAM.
- Download all mouse genes (2.6 MB gzipped file) and CNS (6.7 MB gzipped file)
predicted by SLAM.
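Since the annotations are gzipped GFF files, they can be read with a few lines of Python. The sketch below assumes the standard GFF layout of tab-separated columns (seqname, source, feature, start, end, score, strand, frame, and an optional attributes column). The feature names used in the SLAM files are not documented here, so the example line is hypothetical, as are the function names.

```python
import gzip
from collections import namedtuple

GffRecord = namedtuple(
    "GffRecord",
    "seqname source feature start end score strand frame attributes",
)


def parse_gff_line(line):
    """Parse one GFF line; return None for blank and comment lines."""
    line = line.rstrip("\r\n")
    if not line or line.startswith("#"):
        return None
    fields = line.split("\t")
    if len(fields) < 8:
        raise ValueError(f"expected at least 8 tab-separated fields: {line!r}")
    attributes = fields[8] if len(fields) > 8 else ""
    return GffRecord(
        fields[0], fields[1], fields[2],
        int(fields[3]), int(fields[4]),  # 1-based, inclusive coordinates
        fields[5], fields[6], fields[7], attributes,
    )


def read_gff(path_or_file):
    """Yield GffRecord objects from a gzipped GFF file."""
    with gzip.open(path_or_file, "rt") as handle:
        for line in handle:
            record = parse_gff_line(line)
            if record is not None:
                yield record


# Hypothetical line for illustration; not taken from the SLAM files.
sample = "chr1\tSLAM\tCDS\t1000\t1200\t.\t+\t0\tgene_id \"example\""
record = parse_gff_line(sample)
```

A downloaded file could then be iterated with `read_gff("human_genes.gff.gz")`, where the file name is a placeholder for whatever name the download is saved under.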
Acknowledgements
- SLAM was written by Marina Alexandersson, Simon Cawley and
Lior Pachter.
- The whole genome run was engineered by Simon Cawley.
- This Web site and the associated analyses were produced by Colin Dewey, with
help from Lior Pachter.
Citations
If you use any of the results on these pages in a publication, please cite the following papers:
- Mouse Genome Consortium, Initial sequencing and comparative analysis of the
mouse genome, Nature, 2002. 420(6915): p. 520-562.
- M. Alexandersson, S. Cawley, L. Pachter, SLAM: Cross-species gene finding
and alignment with a generalized pair hidden Markov model, Genome Research,
in press.
- N. Bray, I. Dubchak, L. Pachter, AVID: A Global Alignment Program, Genome
Research, 2003. 13(1): p. 97-102.