Data

Sorghum bicolor GBS SNP data for the NSF project “BREAD: Platform, pipeline, and analytical tools for next generation genotyping to serve breeding efforts in Africa” (IOS-0965342).

Release date:  November 30, 2012.

SNP data from this project have been released in several forms to meet the needs of a variety of users.

Please read the following descriptions carefully to select the appropriate version: 

Data Set A: Imputed Genotype Calls (Hapmap files, sorghum genome v.1 coordinates)

If you are (i) a user with limited bioinformatics capabilities, (ii) interested in a quick first-pass analysis, or (iii) interested in direct comparisons with our published findings, we recommend the imputed genotypes for the 265K SNP set (Data Set A). Missing data in this set have been imputed using NPUTE. Keep in mind that methods for SNP calling and imputation from GBS/RAD data are rapidly evolving and likely to improve in the near future. Also, the SNP calling and imputation process involves many tradeoffs (e.g. accuracy versus coverage), and no single build of the SNP data set will be optimal for all downstream applications.

Data Set B: Raw Genotype Calls (Hapmap files, sorghum genome v.1 coordinates)

If you have the bioinformatics capabilities to carry out imputation yourself (e.g. using one of the many free software packages available for this), we recommend you download the Raw Genotype Calls (Data Set B) so you can test the effects of imputation methods/parameters and missing data on your results. For instance, you may want to check whether a genotype call underlying one of your key findings was based on direct observation from a sequencing read or on an inference generated during imputation. Note that this data set includes a large proportion of missing data.
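As a rough illustration of the check described above, the following Python sketch reads a Hapmap file, reports the proportion of missing calls, and reports whether a given call in the raw file was directly observed or would have to come from imputation. It is a minimal sketch under stated assumptions, not part of our pipeline: it assumes a tab-delimited Hapmap layout with 11 metadata columns before the sample columns and missing calls coded as "N" or "NN", and the function names are hypothetical. Please verify both assumptions against the files you download.

```python
import csv

# Assumption: standard Hapmap layout, 11 metadata columns (rs#, alleles,
# chrom, pos, ...) followed by one column per sample.
N_META = 11
# Assumption: missing genotype calls are coded as "N" (or "NN").
MISSING = {"N", "NN"}


def read_hapmap(handle):
    """Return (sample names, {snp_id: {sample: call}}) from an open Hapmap file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    samples = header[N_META:]
    calls = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        calls[row[0]] = dict(zip(samples, row[N_META:]))
    return samples, calls


def missing_fraction(calls):
    """Proportion of genotype calls that are missing across all SNPs and samples."""
    total = 0
    missing = 0
    for by_sample in calls.values():
        for call in by_sample.values():
            total += 1
            if call in MISSING:
                missing += 1
    return missing / total if total else 0.0


def call_origin(raw_calls, snp_id, sample):
    """Return "observed" if the raw file has a call for this SNP and sample,
    or "imputed" if the raw call is missing (so any value in Data Set A
    at this position must have been inferred)."""
    return "imputed" if raw_calls[snp_id][sample] in MISSING else "observed"
```

To use it, open the Data Set B file with `open(path, newline="")`, pass the handle to `read_hapmap`, and then query `call_origin` with the SNP identifier and sample name behind the result you want to check.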

Data Set C: Raw Sequence Reads (Illumina FASTQ files) 

If you are a user with advanced bioinformatics expertise and an interest in developing the most appropriate SNP calls for your study, we recommend downloading the sequence reads (Data Set C) and testing SNP calling and imputation methods in the context of your study. This will require you to run the TASSEL GBS pipeline or another GBS/RAD SNP calling method (e.g. Stacks), then impute missing data.

Known issues that may affect the appropriateness of our Genotype Calls (Data Sets A & B), and are targets of ongoing development in our GBS pipeline: 

  • Ability to detect rare alleles;
  • Ability to call SNPs at high-diversity loci;
  • Ability to quantify or map presence/absence variation (PAV) due to restriction site polymorphism;
  • Coordinates refer to an old genome version (sorghum genome v.1).

PROCEED TO SNP DATA DOWNLOAD (Data Sets A & B)

PROCEED TO RAW DATA DOWNLOAD (Data Set C) and KEYFILE for TASSEL analysis