The noncoding gene model annotations for Zm-B73-REFERENCE-NAM-5.0 and the NAM Consortium founder lines (REFERENCE-NAM-1.0) did not contain CDS annotations, which are required for transcript creation. MaizeGDB used the program genometools to add CDS annotations to the noncoding gene model annotation gff3 files. These updated gff3 files are included in the directory "noncoding_gff3_cds_added". CDS and protein fasta files were then created for each genome. However, genometools does not carry the original transcript IDs over to the CDS IDs; instead, it appends a unique identifier to each transcript and associates each CDS annotation with that transcript. MaizeGDB therefore used the identifier that genometools gave each noncoding transcript to look up the originally assigned gene model ID and renamed each sequence accordingly. The genometools identifier is kept in each fasta header as "predicted_transcript_ID_from_genometools_CDS=".
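The renaming step described above is essentially a key-based join between a lookup table and the extracted sequences. The following is a minimal sketch of that idea on toy data, not the script MaizeGDB ran. The function name rename_records, the file names, and the IDs (transcript1, GENE0001_T001) are hypothetical and chosen only for illustration.

```shell
#!/bin/bash
# Sketch only: rename sequence records via a two-column lookup table.
# All IDs and file names below are made up for illustration.

rename_records() {
  # $1: lookup table  (tool-assigned ID <TAB> original transcript ID)
  # $2: sequence table (tool-assigned ID <TAB> sequence)
  tab=$(printf '\t')
  # join requires both inputs sorted on the key under the same locale
  LC_ALL=C sort -t "$tab" -k1,1 "$1" > "$1.sorted"
  LC_ALL=C sort -t "$tab" -k1,1 "$2" > "$2.sorted"
  # joined fields: 1 = tool-assigned ID, 2 = original ID, 3 = sequence
  LC_ALL=C join -t "$tab" "$1.sorted" "$2.sorted" |
    awk -F '\t' '{ print ">" $2 " predicted_transcript_ID_from_genometools_CDS=" $1; print $3 }'
}

workdir=$(mktemp -d)
printf 'transcript2\tGENE0002_T001\ntranscript1\tGENE0001_T001\n' > "$workdir/list.tsv"
printf 'transcript1\tATGGCC\ntranscript2\tATGTTT\n' > "$workdir/seqs.tsv"

rename_records "$workdir/list.tsv" "$workdir/seqs.tsv"
```

Each output record carries the original ID as the fasta header and keeps the tool-assigned ID as a trailing attribute, which is the header layout the script below produces.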

The script used to generate the CDS and protein fasta files is:

##get_transcripts.sh:

#!/bin/bash

ml genometools
ml cufflinks
export -f cufflinks

FILE1=${1}
FILE2=${2}

#Usage: sh get_transcripts.sh gff_filename genome_filename (e.g. sh get_transcripts.sh Zm-M162W-REFERENCE-NAM-1.0_Zm00033ab.1.nc Zm-M162W-REFERENCE-NAM-1.0)
#Give both filenames without their extensions; the script appends .gff3 and .fa itself.

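#Drop lines containing "#" or "chromosome" and any existing CDS features, sort by sequence ID and start position, and prepend the gff3 version header
cat ${1}.gff3 | grep -v "#" | grep -v "chromosome" | awk -v OFS="\t" '{if($3 != "CDS")print}' | sort -k1,1 -k4,4n | awk 'BEGIN { OFS="\t"; print "##gff-version 3"} { print $0}' > ${1}_sorted.gff3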

#Sort and tidy the gff3 with genometools, keeping the original feature IDs where available
gt gff3 -retainids -addids -sort -tidy -force -o ${1}_sorted_ids.gff3 ${1}_sorted.gff3

#Add CDS features with genometools, using the genome fasta as the sequence source
gt cds -matchdescstart -force -seqfile ${2}.fa -o ${1}.cds.gff3 ${1}_sorted_ids.gff3

#Extract the spliced CDS sequence of each transcript
gffread -x ${1}_temp -g ${2}.fa ${1}.cds.gff3

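#Build a sorted two-column lookup list for every noncoding_transcript: its ID attribute (prefixed with ">" to match the fasta headers) and its transcript_id attribute
cat ${1}.cds.gff3 | tr ';' '\t' | awk -v OFS="\t" '{if($3 == "noncoding_transcript")print ">"$9,$12}' | sed -e 's/ID=//g; s/transcript_id=//g' | sort -k1,1 > ${1}_list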

#Put each CDS fasta record on one line, join it to the lookup list on the header ID, and write fasta records named by the original transcript ID
cat ${1}_temp | tr ' ' '\t' | cut -f 1 | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' - | sort -k1,1 | join ${1}_list - | tr ' ' '\t' | sed -e 's/>//g' | awk -v OFS="\t" '{print ">"$2" predicted_transcript_ID_from_genometools_CDS="$1,$3}' | sort -k1,1 | tr '\t' '\n' > ${1}.cds.fa

#Translate the CDS of each transcript to protein
gffread -y ${1}_protein-temp -g ${2}.fa ${1}.cds.gff3

#Same renaming as for the CDS fasta; all "." characters (the stop-codon symbol in gffread protein output) are removed before the join
cat ${1}_protein-temp| tr ' ' '\t' | cut -f 1 | awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' - | tr '.' '*' | sed -e 's/*//g' | sort -k1,1 | join ${1}_list - | tr ' ' '\t' | sed -e 's/>//g' | awk -v OFS="\t" '{print ">"$2" predicted_transcript_ID_from_genometools_CDS="$1,$3}' | sort -k1,1 | tr '\t' '\n' > ${1}.protein.fa
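
The awk step in the script that puts each fasta record on a single line (header, tab, sequence) is what lets sort and join treat a record as one row. It can be tried in isolation on toy input. The following is a sketch, not part of get_transcripts.sh; the function name linearize_fasta and the sequences are hypothetical.

```shell
#!/bin/bash
# Sketch only: collapse multi-line fasta records to "header<TAB>sequence" rows.

linearize_fasta() {
  awk '
    /^>/ { printf("%s%s\t", (n > 0 ? "\n" : ""), $0); n++; next }  # header: start a new row
         { printf("%s", $0) }                                      # sequence line: append to current row
    END  { if (n > 0) printf("\n") }                               # terminate the last row
  ' "$@"
}

# Toy input: seq1 is wrapped over two lines, seq2 is on one line
printf '>seq1\nATG\nGCC\n>seq2\nTTT\n' | linearize_fasta
```

Unlike the one-liner in the script, this version prints nothing for empty input instead of a lone newline; that difference is a design choice of the sketch.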