TY - JOUR
T1 - Improving transcriptome construction in non-model organisms
T2 - Integrating manual and automated gene definition in Emiliania huxleyi
AU - Feldmesser, Ester
AU - Rosenwasser, Shilo
AU - Vardi, Assaf
AU - Ben-Dor, Shifra
N1 - European Research Council (ERC) StG (INFOTROPHIC grant) [280991]; Israeli Science Foundation (ISF) Legacy Heritage fund [1716/09]; International Reintegration Grant (IRG) Marie Curie grant; Edith and Nathan Goldenberg Career Development Chair. The authors would like to thank Dr. Gilgi Friedlander for providing scripts, and Ruth Khait and Adva Shemi for help with the manual definition of genes. AV would like to acknowledge the support of: the European Research Council (ERC) StG (INFOTROPHIC grant #280991), the Israeli Science Foundation (ISF) Legacy Heritage fund (grant #1716/09), International Reintegration Grant (IRG) Marie Curie grant and the generous support of Edith and Nathan Goldenberg Career Development Chair.
PY - 2014/2/22
Y1 - 2014/2/22
AB - Background: The advent of Next Generation Sequencing technologies and corresponding bioinformatics tools allows the definition of transcriptomes in non-model organisms. Non-model organisms are of great ecological and biotechnological significance, and consequently the understanding of their unique metabolic pathways is essential. Several methods that integrate de novo assembly with genome-based assembly have been proposed. Yet, there are many open challenges in defining genes, particularly where genomes are not available or incomplete. Despite the large number of transcriptome assemblies that have been performed, quality control of the transcript building process, particularly on the protein level, is rarely, if ever, performed. To test and improve the quality of the automated transcriptome reconstruction, we used manually defined and curated genes, several of them experimentally validated. Results: Several approaches to transcript construction were utilized, based on the available data: a draft genome, high-quality RNAseq reads, and ESTs. In order to maximize the contribution of the various data, we integrated methods including de novo and genome-based assembly, as well as EST clustering. After each step a set of manually curated genes was used for quality assessment of the transcripts. The interplay between the automated pipeline and the quality control indicated which additional processes were required to improve the transcriptome reconstruction. We discovered that E. huxleyi has a very high percentage of non-canonical splice junctions, and relatively high rates of intron retention, which caused unique issues with the currently available tools. While individual tools missed genes and artificially joined overlapping transcripts, combining the results of several tools improved the completeness and quality considerably.
The final collection, created from the integration of several quality control and improvement rounds, was compared to the manually defined set both on the DNA and protein levels, and resulted in an improvement of 20% versus any of the read-based approaches alone. Conclusions: To the best of our knowledge, this is the first time that an automated transcript definition has been subjected to quality control using manually defined and curated genes, with the process subsequently improved on that basis. We recommend using a set of manually curated genes to troubleshoot transcriptome reconstruction.
KW - Emiliania huxleyi
KW - Manual curation
KW - Non-model organism
KW - RNAseq
KW - Transcriptome assembly
UR - http://www.scopus.com/inward/record.url?scp=84895486781&partnerID=8YFLogxK
U2 - 10.1186/1471-2164-15-148
DO - 10.1186/1471-2164-15-148
M3 - Article
C2 - 24559402
SN - 1471-2164
VL - 15
JO - BMC Genomics
JF - BMC Genomics
IS - 1
M1 - 148
ER -