Genotype by Sequencing (GBS) Data, over 100K markers
Genotype Experiment
- For loading into the T3 website, the submitter should include a description of the experiment, a description of how the GBS data was processed, a marker load file, and a genotype results file.
- Additional files such as the TagsByTaxa and VCF files can be included as raw files and will be available for download from the T3 website.
- The marker_name and sequence should be unique.
- The marker sequence should be checked for synonyms to existing entries in the database (there is a BLAST tool to check sequence synonyms on the import page).
- We will BLAST newly submitted marker flanking sequences against flanking sequences previously submitted. For sequence pairs that are identical along the full length of the shorter of the two sequences, we will assume identity of the SNP. Thus, the newly submitted marker will become a synonym of the previously submitted marker.
- The sequence for each marker should be long enough to uniquely define the marker within the genome. For the wheat genome anchored to the IWGSC assembly, we use a marker sequence of 128 bases.
- file format - comma separated
WCSS1_marker_name,marker_type,A_allele,B_allele,sequence
WCSS1_contig3917765_1al-5470,GBS,G,A,GCCGGACTGAGGCGGCAACTTGATGCGGCGGATGCCAACATTGCGCTTGTGAACAAGCGGCTTG[G/A]CGAGGCACAGGGTATGTATTTTCGGGTGGTCAACAAATATTAAGAGGAGCATGATGCTAGTAT
WCSS1_contig3917765_1al-5481,GBS,G,T,GCGGCAACTTGATGCGGCGGATGCCAACATTGCGCTTGTGAACAAGCGGCTTGGCGAGGCACAG[G/T]GTATGTATTTTCGGGTGGTCAACAAATATTAAGAGGAGCATGATGCTAGTATCTATAATATGC
WCSS1_contig3917765_1al-5493,GBS,C,T,TGCGGCGGATGCCAACATTGCGCTTGTGAACAAGCGGCTTGGCGAGGCACAGGGTATGTATTTT[C/T]GGGTGGTCAACAAATATTAAGAGGAGCATGATGCTAGTATCTATAATATGCTGTGACTGCAGA
- fields
marker_name = valid characters are alphanumeric and "_-."
marker_type = GBS
A_allele = reference allele
B_allele = alternate allele
sequence = ACTG, the SNP should be embedded in the sequence with the reference allele first and the alternate allele second
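The synonym rule in this section (identity along the full length of the shorter of two sequences) can be illustrated with a minimal sketch. This is not the BLAST procedure T3 runs. It assumes the rule can be read as "with the SNP positions aligned, every overlapping flanking base matches", it ignores reverse complements, and the function names `split_marker` and `is_synonym` are hypothetical. The shortened query sequence in the usage lines is made up for illustration.

```python
import re

# Matches a marker sequence with one embedded SNP, e.g. "ACGT[A/G]TTGC".
SNP_PATTERN = re.compile(r"^([ACGT]*)\[([ACGT])/([ACGT])\]([ACGT]*)$")


def split_marker(sequence):
    """Split 'LEFT[A/B]RIGHT' into (left flank, allele set, right flank)."""
    match = SNP_PATTERN.match(sequence.upper())
    if match is None:
        raise ValueError(f"no embedded SNP of the form [A/B] in: {sequence}")
    left, a_allele, b_allele, right = match.groups()
    return left, frozenset((a_allele, b_allele)), right


def is_synonym(seq_1, seq_2):
    """True if both markers carry the same SNP and their flanks agree.

    Assumption: the flanks are compared only over the length they share,
    counted outward from the SNP on each side.
    """
    left_1, alleles_1, right_1 = split_marker(seq_1)
    left_2, alleles_2, right_2 = split_marker(seq_2)
    if alleles_1 != alleles_2:
        return False
    n_left = min(len(left_1), len(left_2))
    n_right = min(len(right_1), len(right_2))
    # Keep the bases nearest the SNP: the end of each left flank
    # and the start of each right flank.
    left_ok = left_1[len(left_1) - n_left:] == left_2[len(left_2) - n_left:]
    right_ok = right_1[:n_right] == right_2[:n_right]
    return left_ok and right_ok


existing = "TGCAGAAAAAAAAACT[A/G]CAATAAGACATGTGTTGTGATGGTGGAGGGTGCCGCTCGGCCATTCG"
shorter_query = "AAAACT[A/G]CAATAAGACATG"  # hypothetical new submission
print(is_synonym(existing, shorter_query))
```

A real check would also need to handle reverse-complemented submissions and near-identical hits, which is why the BLAST tool on the import page remains the reference.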
- The marker_name and sequence should be unique.
- The A and B alleles should be ordered alphabetically (there is a tool to order the alleles on the import page).
- The marker sequence should be checked for synonyms to existing entries in the database (there is a BLAST tool to check sequence synonyms on the import page).
- We will BLAST newly submitted marker flanking sequences against flanking sequences previously submitted. For sequence pairs that are identical along the full length of the shorter of the two sequences, we will assume identity of the SNP. Thus, the newly submitted marker will become a synonym of the previously submitted marker.
- The sequence for each marker should be long enough to uniquely define the marker within the genome. For markers not anchored to a reference assembly, the marker sequences are typically less than 64 bases.
- file format - comma separated
marker_name,marker_type,A_allele,B_allele,sequence
gbsCNLmaster1,GBS,A,G,TGCAGAAAAAAAAACT[A/G]CAATAAGACATGTGTTGTGATGGTGGAGGGTGCCGCTCGGCCATTCG
gbsCNLmaster2,GBS,A,G,TGCAGAAAAAAAAACTACAATAAG[A/G]CATGTGTTGTGATGGTGGAGGGTGCCGCTCGGCCATTCG
gbsCNLmaster3,GBS,C,G,TGCAGAAAAAAAACAACT[C/G]GCAGGTTCTCAAAGTAGGATCCAGAAGACTCAGGGAGGTGGCCGC
gbsCNLmaster4,GBS,A,G,TGCAGAAAAAAAACTGTTAGACACGTGTAAATGTAGAACCAATTGATTGGATGCAC[A/G]AGGAAGG
- fields
marker_name = valid characters are alphanumeric and "_-."
marker_type = GBS
A_allele = ACTG
B_allele = ACTG
sequence = ACTG, the SNP should be embedded in the sequence with the reference allele first and the alternate allele second
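Before submitting, the field rules above can be checked locally. The following is a minimal sketch and not the T3 import code: the function name `check_marker_file` is hypothetical, and it assumes the header row is exactly `marker_name,marker_type,A_allele,B_allele,sequence` and that the embedded SNP should list A_allele first and B_allele second.

```python
import csv
import io
import re

NAME_PATTERN = re.compile(r"^[A-Za-z0-9_.\-]+$")
SEQUENCE_PATTERN = re.compile(r"^[ACGT]*\[([ACGT])/([ACGT])\][ACGT]*$")
BASES = {"A", "C", "G", "T"}


def check_marker_file(text):
    """Return a list of (line_number, message) problems found in the file."""
    problems = []
    seen_names, seen_sequences = set(), set()
    rows = csv.DictReader(io.StringIO(text))
    for line_number, row in enumerate(rows, start=2):  # line 1 is the header
        name = row.get("marker_name") or ""
        marker_type = row.get("marker_type") or ""
        a_allele = row.get("A_allele") or ""
        b_allele = row.get("B_allele") or ""
        sequence = row.get("sequence") or ""

        def report(message):
            problems.append((line_number, message))

        if not NAME_PATTERN.match(name):
            report(f"marker_name has invalid characters: {name!r}")
        if marker_type != "GBS":
            report(f"marker_type should be GBS, found {marker_type!r}")
        if a_allele not in BASES or b_allele not in BASES:
            report("A_allele and B_allele should each be one of A, C, G, T")
        elif a_allele > b_allele:
            report("alleles are not in alphabetical order")
        match = SEQUENCE_PATTERN.match(sequence)
        if match is None:
            report("sequence should be ACGT with one embedded [A/B] SNP")
        elif match.groups() != (a_allele, b_allele):
            report("embedded SNP does not match A_allele/B_allele")
        if name in seen_names:
            report(f"duplicate marker_name: {name!r}")
        if sequence in seen_sequences:
            report("duplicate sequence")
        seen_names.add(name)
        seen_sequences.add(sequence)
    return problems


example = (
    "marker_name,marker_type,A_allele,B_allele,sequence\n"
    "gbsCNLmaster4,GBS,A,G,"
    "TGCAGAAAAAAAACTGTTAGACACGTGTAAATGTAGAACCAATTGATTGGATGCAC[A/G]AGGAAGG\n"
)
for line_number, message in check_marker_file(example):
    print(f"line {line_number}: {message}")
```

The sketch only flags problems. Reordering alleles and checking synonyms are still best done with the tools on the import page.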
- The import file is tab-delimited, similar to the HapMap format.
- The columns contain the lines and the rows contain the markers.
- Each cell of the matrix should be an IUPAC nucleotide code: A, T, C, or G for a homozygote; K, Y, W, S, R, or M for a heterozygote; or "N" for missing data.
- The Chrom and Pos fields can be blank if not available.
- Genotype files with over 100K markers should be imported via the command line, as described in the GBS import instructions.
- file format - tab separated
SNP	Chrom	Pos	2174-05	2180	Above	Agate	Alice	Alliance
WCSS1_contig3917765_1AL-5470	1AL	5905	N	A	N	N
WCSS1_contig3917765_1AL-5481	1AL	5916	N	G	G	T
WCSS1_contig3917765_1AL-5493	1AL	5928	N	C	C	C
- fields
SNP = identifier
Chrom = chromosome that SNP maps to (optional)
Pos = chromosome position of the SNP (optional), for example the position on the map of ordered contigs
Col4 and on = observed genotypes of samples
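To make the cell encoding concrete, here is a minimal sketch that parses one data row of the genotype file and expands each IUPAC code into its pair of alleles. It is an illustration only, not part of the T3 loader. The function name `parse_genotype_row` is hypothetical, and the sketch keeps Pos as text because the source does not say whether positions are always integers.

```python
# Standard IUPAC codes for the two-allele calls allowed in the file.
# "N" (missing data) maps to None.
IUPAC_GENOTYPES = {
    "A": ("A", "A"), "C": ("C", "C"), "G": ("G", "G"), "T": ("T", "T"),
    "R": ("A", "G"), "Y": ("C", "T"), "S": ("C", "G"),
    "W": ("A", "T"), "K": ("G", "T"), "M": ("A", "C"),
    "N": None,
}


def parse_genotype_row(line):
    """Parse 'SNP<TAB>Chrom<TAB>Pos<TAB>calls...' into a tuple.

    Returns (snp, chrom, pos, calls). Blank Chrom or Pos become None,
    and each call is an allele pair or None for missing data.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 4:
        raise ValueError("expected SNP, Chrom, Pos and at least one genotype")
    snp, chrom, pos = fields[:3]
    calls = []
    for code in fields[3:]:
        if code not in IUPAC_GENOTYPES:
            raise ValueError(f"{snp}: invalid genotype code {code!r}")
        calls.append(IUPAC_GENOTYPES[code])
    return snp, chrom or None, pos or None, calls


row = "WCSS1_contig3917765_1AL-5481\t1AL\t5916\tN\tG\tG\tT"
print(parse_genotype_row(row))
```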
- Load the file for the TASSEL and rrBLUP formats with the script load_gbs_bymarker.
- Calculate and load allele frequencies with the script load_gbs_frequencies.
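For readers who want to sanity-check frequencies before loading, the sketch below shows one way allele frequencies could be computed from a row of IUPAC calls. It is not the load_gbs_frequencies script and makes no claim about how that script works. The function name `allele_frequencies` is hypothetical, and the sketch assumes each call represents two allele copies and that missing calls ("N") are excluded.

```python
from collections import Counter

HOMOZYGOTES = {"A", "C", "G", "T"}
# Standard IUPAC two-base ambiguity codes.
HETEROZYGOTES = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC"}


def allele_frequencies(calls):
    """Return {allele: frequency} for one marker, skipping missing calls."""
    counts = Counter()
    for code in calls:
        if code == "N":
            continue  # missing data does not contribute
        if code in HOMOZYGOTES:
            counts[code] += 2  # two copies of the same allele
        elif code in HETEROZYGOTES:
            counts.update(HETEROZYGOTES[code])  # one copy of each allele
        else:
            raise ValueError(f"invalid genotype code {code!r}")
    total = sum(counts.values())
    if total == 0:
        return {}
    return {allele: n / total for allele, n in counts.items()}


# Calls taken from the WCSS1_contig3917765_1AL-5481 example row.
print(allele_frequencies(["N", "G", "G", "T"]))
```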