The noncoding gene model annotations for Zm-B73-REFERENCE-NAM-5.0 and the NAM Consortium founder lines (REFERENCE-NAM-1.0) did not contain CDS annotations, which are required for transcript creation. MaizeGDB used the program genometools to add CDS annotations to the noncoding gene model annotation gff3 files. The updated gff3 files are included in the directory "noncoding_gff3_cds_added". CDS and protein fasta files were then created for each genome.

genometools does not carry the transcript IDs over to the CDS IDs. Instead, it gives each transcript its own unique identifier and associates each CDS annotation with that identifier. MaizeGDB therefore used the identifier that genometools gave each noncoding transcript to map the sequences back to the correct assigned gene model IDs. The genometools identifier is kept in each fasta header as predicted_transcript_ID_from_genometools_CDS.

The script used to generate the CDS and protein fasta files is:

##get_transcripts.sh:

#!/bin/bash
ml genometools
ml cufflinks
export -f cufflinks
FILE1=${1}
FILE2=${2}

#sh get_transcripts.sh gff_filename genome_filename
#(e.g. sh get_transcripts.sh Zm-M162W-REFERENCE-NAM-1.0_Zm00033ab.1.nc Zm-M162W-REFERENCE-NAM-1.0)

#Drop lines containing "#" or "chromosome" and any existing CDS features, sort by sequence and start position, and add a gff3 header
cat ${1}.gff3 | grep -v "#" | grep -v "chromosome" | awk -v OFS="\t" '{if($3 != "CDS")print}' | sort -k1,1 -k4,4n | awk 'BEGIN { OFS="\t"; print "##gff-version 3"} { print $0}' > ${1}_sorted.gff3

#Sort and tidy the gff3 with genometools, keeping the original IDs where available
gt gff3 -retainids -addids -sort -tidy -force -o ${1}_sorted_ids.gff3 ${1}_sorted.gff3

#Add CDS features with genometools
gt cds -matchdescstart -force -seqfile ${2}.fa -o ${1}.cds.gff3 ${1}_sorted_ids.gff3

#Extract the CDS sequences
gffread -x ${1}_temp -g ${2}.fa ${1}.cds.gff3

#List the ID and transcript_id attributes of each noncoding_transcript feature
cat ${1}.cds.gff3 | tr ';' '\t' | awk -v OFS="\t" '{if($3 == "noncoding_transcript")print ">"$9,$12}' | sed -e 's/ID=//g; s/transcript_id=//g' | sort -k1,1 > ${1}_list

#Put each CDS record on one line, join it to the list on the ID, and rewrite the header
cat ${1}_temp | tr ' ' '\t' | cut -f 1 | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' - | sort -k1,1 | join ${1}_list - | tr ' ' '\t' | sed -e 's/>//g' | awk -v OFS="\t" '{print ">"$2" predicted_transcript_ID_from_genometools_CDS="$1,$3}' | sort -k1,1 | tr '\t' '\n' > ${1}.cds.fa

#Extract the protein sequences
gffread -y ${1}_protein-temp -g ${2}.fa ${1}.cds.gff3

#Same as for the CDS records, after removing "." and "*" characters
cat ${1}_protein-temp| tr ' ' '\t' | cut -f 1 | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' - | tr '.' '*' | sed -e 's/*//g' | sort -k1,1 | join ${1}_list - | tr ' ' '\t' | sed -e 's/>//g' | awk -v OFS="\t" '{print ">"$2" predicted_transcript_ID_from_genometools_CDS="$1,$3}' | sort -k1,1 | tr '\t' '\n' > ${1}.protein.fa
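
The ID-renaming step in the script can be hard to follow in its one-line form, so here is a minimal sketch of the same idea on toy data. It is an illustration only, not part of the MaizeGDB pipeline. The IDs (t1, GeneA_T001, ...), the sequences, and the function name rename_fasta are invented for the example. It uses only standard tools (cut, awk, sort, join, sed, tr), so it runs without genometools or gffread.

```bash
#!/bin/bash
# Toy illustration of the renaming step. All IDs and sequences are made up.
set -eu
export LC_ALL=C   # make sort and join use the same collation order

workdir=$(mktemp -d)

# Lookup list: tool-assigned ID, then original transcript ID (hypothetical values)
printf '>t2\tGeneB_T001\n>t1\tGeneA_T001\n' | sort -k1,1 > "$workdir/list"

# A fasta file keyed by the tool-assigned ID, with wrapped sequence lines
printf '>t1 CDS=1-6\nATG\nTAA\n>t2 CDS=1-9\nATGCCC\nTGA\n' > "$workdir/temp.fa"

# rename_fasta LIST FASTA
# 1. keep only the first word of each header
# 2. put each record on a single line: header, tab, sequence
# 3. join the records to the lookup list on the tool-assigned ID
# 4. write the original ID as the header and keep the tool-assigned ID as a tag
rename_fasta() {
  cut -d ' ' -f 1 "$2" |
    awk '/^>/ {printf("%s%s\t", (N>0 ? "\n" : ""), $0); N++; next} {printf("%s", $0)} END {printf("\n")}' |
    sort -k1,1 |
    join "$1" - |
    sed -e 's/^>//' |
    awk -v OFS="\t" '{print ">"$2" predicted_transcript_ID_from_genometools_CDS="$1, $3}' |
    sort -k1,1 |
    tr '\t' '\n'
}

rename_fasta "$workdir/list" "$workdir/temp.fa" > "$workdir/renamed.fa"
cat "$workdir/renamed.fa"
```

Both inputs to join must be sorted on the join field in the same collation order, which is why the sketch sets LC_ALL=C before sorting. Records whose ID is missing from the lookup list are silently dropped by join, so comparing the record counts of the input and output files is a reasonable sanity check.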