Genomic island datasets derived using comparative genomics
Genomic islands (GIs) are clusters of genes in prokaryotic genomes of probable horizontal origin. GIs are disproportionately associated with microbial adaptations of medical or environmental interest.
Recently, multiple programs for automated detection of GIs have been developed that utilize sequence composition characteristics (e.g. oligonucleotide bias). To robustly evaluate the accuracy of such methods, we propose that a dataset of GIs be constructed that is not based on artificially generated sequences and that uses criteria independent of sequence composition-based analysis.
We have developed a comparative genomics approach, named IslandPick, that identifies both very probable islands and non-island regions, and so permits an independent assessment of the accuracy of sequence composition-based GI prediction tools. The approach involves:
1) The flexible, automated selection of comparative genomes for each query genome, using a distance function that picks appropriate genomes for identification of GIs
2) The identification of regions unique to the query genome in comparison with the previously chosen genomes (positive dataset)
3) The identification of conserved genomic regions that are common to all genomes being compared (negative dataset)
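The three steps above can be sketched with simple interval arithmetic on query-genome coordinates. This is only an illustrative sketch, not the IslandPick source code: the function names, the distance window (`min_dist`, `max_dist`), the `min_length` filter and the input format (aligned blocks per comparison genome, as half-open intervals) are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the query genome


def select_genomes(distances: Dict[str, float], min_dist: float, max_dist: float) -> List[str]:
    """Step 1 (hypothetical rule): keep genomes whose distance to the query is in a window."""
    return sorted(name for name, d in distances.items() if min_dist <= d <= max_dist)


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals into a sorted, disjoint list."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def complement(intervals: List[Interval], genome_length: int) -> List[Interval]:
    """Return the parts of [0, genome_length) not covered by the intervals."""
    gaps: List[Interval] = []
    position = 0
    for start, end in merge(intervals):
        if start > position:
            gaps.append((position, start))
        position = max(position, end)
    if position < genome_length:
        gaps.append((position, genome_length))
    return gaps


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Intersect two sorted, disjoint interval lists."""
    result: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        start, end = max(a[i][0], b[j][0]), min(a[i][1], b[j][1])
        if start < end:
            result.append((start, end))
        # Advance whichever interval finishes first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


def split_regions(aligned: Dict[str, List[Interval]], genome_length: int, min_length: int):
    """Steps 2 and 3: regions aligned to no comparison genome (positive candidates)
    and regions aligned to every comparison genome (negative candidates)."""
    per_genome = [merge(blocks) for blocks in aligned.values()]
    covered_by_any = merge([iv for blocks in per_genome for iv in blocks])
    unique = complement(covered_by_any, genome_length)
    conserved: List[Interval] = [(0, genome_length)]
    for blocks in per_genome:
        conserved = intersect(conserved, blocks)
    long_enough = lambda regions: [(s, e) for s, e in regions if e - s >= min_length]
    return long_enough(unique), long_enough(conserved)
```

For example, with two comparison genomes aligned to a 100 bp toy query, `split_regions({"A": [(0, 40), (60, 100)], "B": [(0, 30), (70, 100)]}, 100, 10)` reports the region 40-60 as unique to the query and the regions 0-30 and 70-100 as conserved in all genomes. Regions aligned to only some of the comparison genomes fall into neither dataset in this sketch.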
Using our constructed datasets, we investigated the accuracy of several sequence composition-based GI prediction tools.
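As a rough illustration of how such an evaluation could be scored, the sketch below compares a tool's predicted islands with the positive and negative datasets at the nucleotide level. The choice of precision, recall and overall accuracy as measures, and the function names, are assumptions for this example; the text above does not specify which measures were used. Predicted bases that fall in neither dataset are ignored.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # half-open [start, end) on the query genome


def total_length(intervals: List[Interval]) -> int:
    """Total number of bases covered by a list of disjoint intervals."""
    return sum(end - start for start, end in intervals)


def overlap(a: List[Interval], b: List[Interval]) -> int:
    """Number of bases shared by two interval lists (each list must be disjoint)."""
    return sum(max(0, min(e1, e2) - max(s1, s2)) for s1, e1 in a for s2, e2 in b)


def evaluate(predicted: List[Interval],
             positives: List[Interval],
             negatives: List[Interval]) -> Dict[str, float]:
    """Score predictions against the positive (GI) and negative (non-GI) datasets."""
    tp = overlap(predicted, positives)      # predicted bases inside known islands
    fp = overlap(predicted, negatives)      # predicted bases inside conserved regions
    fn = total_length(positives) - tp       # island bases the tool missed
    tn = total_length(negatives) - fp       # conserved bases correctly left alone
    scored = tp + fp + fn + tn
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / scored if scored else 0.0,
    }
```

With predictions `[(45, 65), (80, 90)]`, a positive dataset `[(40, 60)]` and a negative dataset `[(0, 30), (70, 100)]`, this yields a precision of 0.6, a recall of 0.75 and an overall accuracy of 0.8125.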
A manuscript describing the generation of the positive and negative datasets, and the accuracy results for other sequence composition-based GI prediction tools, is currently under review.
An overview of our approach to produce these datasets is available here.
Datasets and source code
The positive datasets (GIs) and negative datasets (non-GIs) for 118 bacterial chromosomes, which can be used to evaluate the accuracy of other GI predictors, are available under a GPL license, as is the source code for our comparative genomics approach, IslandPick. An additional positive dataset of GIs that uses more divergent genomes to predict more ancient GIs is also freely available.
Questions or problems can be emailed to firstname.lastname@example.org (contact name Morgan Langille).