Abstract

It is a challenge to classify protein-coding or non-coding transcripts, particularly those reconstructed from high-throughput sequencing data of poorly annotated species. This study developed and evaluated a powerful signature tool, Coding-Non-Coding Index (CNCI), which profiles adjoining nucleotide triplets to effectively distinguish protein-coding and non-coding sequences independent of known annotations. CNCI is effective for classifying incomplete transcripts and sense–antisense pairs. The implementation of CNCI offered highly accurate classification of transcripts assembled from whole-transcriptome sequencing data in a cross-species manner, demonstrated gene evolutionary divergence between vertebrates and invertebrates, or between plants, and provided a long non-coding RNA catalog of orangutan. CNCI software is available at http://www.bioinfo.org/software/cnci.

INTRODUCTION

The recent progress in the ENCODE project suggests that >70% of human genome sequences are transcribed into processed or primary RNAs (1, 2). For other species, numerous novel transcripts are identified by advances in RNA sequencing techniques (RNA-seq) (3). It remains a challenge to identify sequence differences between protein-coding and non-coding transcripts. During the past 5 years, several tools, such as CPC and phyloCSF, have been developed to classify protein-coding or non-coding transcripts using information on open reading frame (ORF), known protein database, or evolutionary signatures (4–6). These approaches have been effective in identifying long coding transcripts with a putative ORF or peptide hits. However, they are not suitable for identifying long non-coding RNAs (lncRNAs), which may contain short protein-like sub-sequences or long putative ORFs (6, 7). Moreover, these available tools search for the best segment of evolutionary signatures but may also lead to significant false-positive and false-negative discoveries, because most of the functional non-coding RNAs have higher evolutionary conservation relative to introns, which suggests that conserved elements exist in these non-coding sequences (8), and in contrast, many newly evolved proteins do not contain a conserved ORF (7). In addition, for poorly annotated species or those without whole-genome sequence, it is hard to ascertain the transcript status using the existing tools.

To overcome these challenges, we developed Coding-Non-Coding Index (CNCI) software, a powerful signature tool that profiles adjoining nucleotide triplets (ANT) to effectively distinguish between protein-coding and non-coding sequences independent of known annotations. In comparison with the existing tools, CNCI showed better performance in many aspects, especially for classification of incomplete transcripts and sense–antisense transcript pairs. Notably, CNCI performed uniformly well on all the vertebrate species, but relatively poorly for invertebrates and plants, when using human data as training sets. Because CNCI can classify protein-coding and non-coding RNAs solely based on sequence intrinsic composition, it is potentially applicable to a variety of species without whole-genome sequence or with poorly annotated data. As an example of application for poorly annotated species, we tested CNCI on a published RNA-Seq data set from six organs of orangutan. As a result, CNCI annotated 7697 novel transcripts as lncRNAs, which contributed to the first comprehensive orangutan lncRNA catalog.

MATERIALS AND METHODS

Data description

For human training sets, protein-coding genes were collected from the RefSeq database and long non-coding genes were collected from Gencode (v11) (9). For mouse testing sets, both protein-coding and non-coding genes were collected from the Ensembl (v65) (10) database. As other testing sets, gene annotation of other vertebrates or plants was downloaded from the Ensembl (v69) and EnsemblPlants (v16) (10) databases, respectively. The lincRNA catalog was obtained from the Human Body Map (11). Whole-transcriptome sequencing data of the six organs of orangutan were obtained from the study of David Brawand et al. (3) and downloaded from Gene Expression Omnibus under accession code GSE30352. All the data are summarized in Supplementary Table S1, and the length distributions of the human and mouse transcript collections are depicted in Supplementary Table S2.

Calculation of usage frequency of ANT

In this study, we began by analyzing the usage frequency of ANT in coding domain sequences (CDS) and non-coding RNA sequences. There are 64*64 kinds of ANT, and we calculated the usage frequency of each ANT as follows:

$$X_iN = \sum_{j=1}^{n} S_j(X_i), \qquad T = \sum_{i=1}^{m} X_iN, \qquad X_iF = \frac{X_iN}{T}$$

where $X_i$ indicates an ANT, $S_j(X_i)$ is the occurrence number of $X_i$ in sequence j, n is the number of sequences in the data set, m = 64*64 is the number of kinds of ANT, $X_iN$ is the total occurrence number of $X_i$ in the data set, T is the total occurrence number of all kinds of ANT in the data set and $X_iF$ is the usage frequency of the ANT. In human sequences, the usage frequency of ANT was calculated in 30 507 CDS and 18 566 long non-coding transcripts. In mouse sequences, the usage frequency was calculated in 25 316 CDS and 8696 long non-coding transcripts (Supplementary Figure S1). The log-ratio of the usage frequency of all kinds of ANT constituted a 64*64 ANT score matrix (Figure 1a and b).
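
As an illustration, the usage frequency and log-ratio computation described above can be sketched in Python. This is a minimal sketch and not the CNCI implementation: the function names, the choice to read every sequence in frame 0, the natural logarithm and the pseudocount of 1 (added so that an unseen ANT does not cause a division by zero) are all assumptions made for this example.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]  # the 64 triplets
CODON_SET = set(CODONS)


def ant_counts(sequences):
    """Count adjoining nucleotide triplets (ANT), reading each sequence in frame 0."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3
        triplets = [seq[i:i + 3] for i in range(0, usable, 3)]
        for first, second in zip(triplets, triplets[1:]):
            # Skip pairs containing ambiguous bases such as N.
            if first in CODON_SET and second in CODON_SET:
                counts[(first, second)] += 1
    return counts


def usage_frequency(counts, pseudocount=1.0):
    """Usage frequency of each of the 64*64 ANT: occurrences divided by the total.

    The pseudocount is an assumption of this sketch, not part of the source.
    """
    total = sum(counts.values()) + pseudocount * len(CODONS) ** 2
    return {
        (a, b): (counts.get((a, b), 0) + pseudocount) / total
        for a in CODONS
        for b in CODONS
    }


def ant_score_matrix(coding_seqs, noncoding_seqs):
    """Log-ratio of coding to non-coding usage frequency for every ANT."""
    coding = usage_frequency(ant_counts(coding_seqs))
    noncoding = usage_frequency(ant_counts(noncoding_seqs))
    return {ant: math.log(coding[ant] / noncoding[ant]) for ant in coding}
```

With this layout, a positive entry marks an ANT that is more frequent in the coding set and a negative entry marks one that is more frequent in the non-coding set.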

Figure 1.

 Illustration of ANT score matrix and CNCI framework. The score of each ANT is calculated for human ( a ) or mouse ( b ). The three black rows or columns represent three stop codons, including UAA, UAG and UGA (corresponding to ATT, ATC and ACT in cDNA sequence, respectively), which shows low frequency in protein-coding sequence. ( c ) The framework of CNCI. The top panel shows the process of a sequence in a testing set. For a given sequence, six MLCDS regions (represented by six lines) are identified from six reading frames (represented by six color arrow lines) using a sliding window and dynamic programming algorithm. Then, an MLCDS region with a maximal S-score is selected to incorporate into an SVM. The bottom panel shows the training and classification process. Reliable protein-coding and non-coding sequences are used as a training set, and five features are extracted to train SVM, which classifies the incorporating sequence into protein-coding or non-coding sequence.

Utilization of the sliding window to scan each transcript

An essential step of our approach is to identify the most-like CDS (MLCDS) region of each transcript. We first used a sliding window to analyze each transcript, setting the scan step of the sliding window to one ANT. To choose the size of the sliding window (represented by parameter N) that gives a robust final classification result, we defined a series of N values with different lengths (from 30 nt to 300 nt, with 30 nt as the step interval). For each value of N, a classifier was trained with human training data. The sensitivity–specificity curves of the classification were then calculated on the testing set (Supplementary Figure S2). In our classification model, N of 150 nt was found to be the most robust, and thus we chose N = 150 nt, which is a proper size longer than small RNAs but shorter than lncRNAs, for the classification model.

The window scanned each transcript six times to generate six reading frames. At the same time, CNCI calculated the sequence-score (S-score) of each window based on the ANT score matrix; thus, in the sliding window process, each transcript was converted into six discrete numerical arrays, each composed of the S-scores of the sliding windows in one reading frame. The S-score reflects the coding ability of each sliding window, and our initial purpose was to find the sub-sequence (described as MLCDS earlier in the text) with the greatest coding ability in each of the six reading frames.
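
The conversion of a transcript into six numerical arrays can be sketched as follows. This is a hedged illustration rather than the CNCI code: the helper names are hypothetical, the S-score of a window is taken to be the sum of the ANT scores of its adjoining triplet pairs, a window of 50 triplets stands in for N = 150 nt, and ANT missing from the score matrix are scored as 0. The six frames are assumed to be the three offsets on the given strand followed by the three offsets on its reverse complement.

```python
def reverse_complement(seq):
    """Reverse complement of a DNA sequence (bases other than ACGT are left as-is)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def frame_triplets(seq, offset):
    """Split seq into consecutive triplets starting at the given offset (0, 1 or 2)."""
    usable = len(seq) - (len(seq) - offset) % 3
    return [seq[i:i + 3] for i in range(offset, usable, 3)]


def window_scores(triplets, score_matrix, window=50):
    """S-score of each window, moving forward one triplet at a time.

    A window of `window` triplets contains window - 1 adjoining pairs.
    """
    pair_scores = [score_matrix.get(p, 0.0) for p in zip(triplets, triplets[1:])]
    width = window - 1
    if len(pair_scores) <= width:
        # Sequence shorter than one window: score it as a single window.
        return [sum(pair_scores)] if pair_scores else []
    current = sum(pair_scores[:width])
    scores = [current]
    for k in range(width, len(pair_scores)):
        current += pair_scores[k] - pair_scores[k - width]  # slide by one pair
        scores.append(current)
    return scores


def six_frame_arrays(seq, score_matrix, window=50):
    """Return six arrays of window S-scores, one per reading frame."""
    seq = seq.upper().replace("U", "T")
    arrays = []
    for strand in (seq, reverse_complement(seq)):
        for offset in range(3):
            triplets = frame_triplets(strand, offset)
            arrays.append(window_scores(triplets, score_matrix, window))
    return arrays
```

Here `score_matrix` is assumed to be a dictionary keyed by pairs of triplets, such as a log-ratio ANT score matrix.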

Prediction of the MLCDS in each of the six reading frames

To identify the MLCDS of each reading frame, we applied a dynamic programming method called Maximum Interval Sum (12). This method scans one numerical array (which contains positive and negative numbers) and identifies the consecutive sub-array whose sum is larger than that of any other sub-array, including the whole array, using the following formula:

$$V = \max_{1 \le i \le j \le n} \sum_{k=i}^{j} a[k]$$

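where V is the maximum interval sum in a reading frame, n is the length of the numerical array, a[k] (i ≤ k ≤ j) are the elements of the maximum sub-array corresponding to V, and i and j represent the start and end positions of this sub-array in the reading frame, respectively. To calculate V, i and j, we introduced a limited local maximum interval sum, b[j], which is defined as:

$$b[j] = \max_{1 \le m \le j} \sum_{k=m}^{j} a[k], \qquad 1 \le j \le n$$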

where m is a variable representing the start position of b[j], j is the end position of b[j] and b[j] is the local maximum interval sum. By combining this with the definition of V, we can deduce that:

$$V = \max_{1 \le j \le n} b[j]$$

This formula means that V is the maximum value over all b[j]. According to the definition of b[j], we can conclude that when b[j − 1] > 0, b[j] = b[j − 1] + a[j] whatever a[j] is, and when b[j − 1] < 0, b[j] = a[j] whatever a[j] is. Thus, we can use dynamic programming to scan the numerical array of one reading frame, as stated in the following rule:

$$b[1] = a[1], \qquad b[j] = \max\{\, b[j-1] + a[j],\; a[j] \,\}, \quad 2 \le j \le n$$
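
The scanning rule can be illustrated with a short Python function. This is a generic sketch of the maximum interval sum recurrence (often known as Kadane's algorithm), not code taken from CNCI; the function name and the use of 0-based, inclusive positions are choices made for this example.

```python
def max_interval_sum(a):
    """Return (V, i, j): the largest sum of a consecutive sub-array of a,
    with its 0-based inclusive start and end positions."""
    if not a:
        raise ValueError("array must not be empty")
    best_sum, best_start, best_end = a[0], 0, 0
    local_sum, local_start = a[0], 0  # b[j] and the start position of b[j]
    for j in range(1, len(a)):
        if local_sum > 0:
            # b[j-1] > 0: extending the previous interval can only help.
            local_sum += a[j]
        else:
            # b[j-1] <= 0: start a new interval at position j.
            local_sum, local_start = a[j], j
        if local_sum > best_sum:
            best_sum, best_start, best_end = local_sum, local_start, j
    return best_sum, best_start, best_end


print(max_interval_sum([-2, 1, -3, 4, -1, 2, 1, -5, 4]))  # (6, 3, 6)
```

The array is scanned once, so the cost grows linearly with the number of windows in a reading frame.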

After the aforementioned processes, six candidate MLCDS regions were derived from the six reading frames of each transcript. The candidate with the maximum S-score, which had a significantly larger length and higher quality percentage than the other five candidates, was taken as the best MLCDS region of the transcript and used for feature extraction. To estimate the quality of the MLCDS, we compared the MLCDS of all human protein-coding transcripts with the corresponding true CDS and evaluated the distribution of the degree of overlap (Supplementary Figure S3).

Feature extraction and classification model construction

To distinguish protein-coding sequences from non-coding sequences, we extracted five features, i.e. the length and S-score of MLCDS, length-percentage, score-distance and codon-bias. The length and S-score of MLCDS were used as the first two features, which assess the extent and quality of the MLCDS, respectively (Supplementary Table S3). Moreover, as demonstrated earlier in the text, protein-coding transcripts possess a special reading frame that is obviously distinct from the other five in the distribution of ANT. We analyzed the six MLCDS candidates output by dynamic programming over the six reading frames of each transcript, with the assumption that there must be one best MLCDS (as described earlier in the text); however, this phenomenon does not generally exist for non-coding transcripts. Thus, we defined two further features, length-percentage and score-distance, as follows:

$$\text{Length-percentage} = \frac{M_l}{\sum_{i=1}^{6} Y_i}$$

where $M_l$ is the length of the best MLCDS (according to S-score value) among those of the six reading frames, and $Y_i$ represents the length of each of the six MLCDS.

$$\text{Score-distance} = \frac{\sum_{j=1}^{5} (S - E_j)}{5}$$

where S is the S-score of the best MLCDS, and $E_j$ represents the S-score of each of the other five MLCDS (Supplementary Table S3).
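
A small worked example may make these two features concrete. The sketch below assumes that length-percentage is the length of the best MLCDS divided by the summed lengths of all six candidates, and that score-distance is the mean difference between the best S-score and the other five; the function names and the toy numbers are hypothetical.

```python
def best_frame(scores):
    """Index of the reading frame whose MLCDS has the maximal S-score."""
    return max(range(len(scores)), key=lambda k: scores[k])


def length_percentage(lengths, best):
    """Length of the best MLCDS as a fraction of all six MLCDS lengths."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("all MLCDS lengths are zero")
    return lengths[best] / total


def score_distance(scores, best):
    """Mean difference between the best S-score and the other S-scores."""
    others = [e for k, e in enumerate(scores) if k != best]
    return sum(scores[best] - e for e in others) / len(others)


# Toy values for the six candidate MLCDS of one transcript (made up for illustration).
scores = [12.0, 2.0, 1.0, 0.0, -1.0, 3.0]
lengths = [300, 60, 45, 30, 30, 135]
best = best_frame(scores)
print(length_percentage(lengths, best))  # 0.5
print(score_distance(scores, best))      # 11.0
```

In this toy case one frame clearly dominates, which is the pattern the text expects for protein-coding transcripts; for a non-coding transcript both values would tend to be smaller.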

All four aforementioned features could, to some extent, distinguish protein-coding from non-coding sequences and were concordantly higher in protein-coding transcripts and lower in non-coding transcripts (Supplementary Figure S4). Finally, we included the frequency of single nucleotide triplets in the MLCDS as the fifth and final feature to complement the construction of the classification model. This feature was defined as codon-bias, which evaluated the coding-non-coding bias for each of the 61 kinds of codons (the three stop codons were ruled out) (Supplementary Figure S5).
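
One possible reading of the triplet-frequency part of this feature is sketched below. The source does not give the exact definition of codon-bias, so this example only computes the relative frequency of each of the 61 non-stop triplets within an MLCDS; the function name, the use of TAA, TAG and TGA as sense-strand stop codons and the handling of empty input are assumptions.

```python
from collections import Counter
from itertools import product

STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumed sense-strand DNA spelling
SENSE_CODONS = [
    c for c in ("".join(p) for p in product("ACGT", repeat=3))
    if c not in STOP_CODONS
]  # the 61 remaining triplets, in a fixed order


def codon_frequencies(mlcds):
    """Relative frequency of each of the 61 non-stop triplets in an MLCDS."""
    mlcds = mlcds.upper().replace("U", "T")
    usable = len(mlcds) - len(mlcds) % 3
    triplets = [mlcds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(t for t in triplets if t in set(SENSE_CODONS))
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(SENSE_CODONS)
    return [counts[c] / total for c in SENSE_CODONS]
```

The result is a 61-element vector in a fixed triplet order, which could be appended to the four scalar features before training a classifier.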

To obtain the positive and negative training sets, we extracted the five features for each best MLCDS from the known protein-coding and non-coding transcript data sets, respectively. We then incorporated these two training sets into a support vector machine (SVM) for model construction (Figure 1c). We used A Library for Support Vector Machines (LIBSVM) (13) to train an SVM model using the standard radial basis function kernel, where the C and gamma parameters were left at their default values.

Identification of orangutan lncRNAs

To identify orangutan lncRNAs, we used the spliced read aligner TopHat (14) (version V1.3.1) to map all sequenced reads to the orangutan genome (ponAbe2) with the following parameters: min-anchor = 5, min-isoform-fraction = 0 and the rest set as default. We then took the aligned reads of each tissue from TopHat and assembled them into transcriptomes separately using Cufflinks (15) (version 1.0.3) with default parameters (and 'min-frags-per-transfrag = 0'). After that, we merged the transcripts constructed from the six tissues to establish the final transcript set of orangutan and then compared them with known genes annotated by the Ensembl database (v69). Novel long transcripts (>200 bp) that do not overlap with any known annotation and are localized in intronic, antisense or intergenic regions were filtered by CNCI and added to the lncRNA catalog of orangutan.

RESULTS

Overview of CNCI methodology

CNCI contains two main steps: scoring the sequence and constructing the classification model. To use the ANT signature to classify protein-coding and non-coding sequences, we constructed an ANT score matrix that represents the degree of coding-non-coding bias. First, we calculated the usage frequency of ANT in true CDS and non-coding sequences separately (Supplementary Figure S1) and the log-ratio of the usage frequency between coding and non-coding sequences for each ANT. Taking human sequences as an example, the usage frequency of ANT was calculated in 30 507 CDS and 18 566 long non-coding transcripts. Then, the log-ratio of the usage frequency of all kinds of ANT constituted a 64*64 ANT score matrix (Figure 1a and b). According to this matrix, the S-score of the six reading frames for each sequence was calculated using the following formula:

$$S = \sum_{i=1}^{n} H_p(X_i)$$

where S is the S-score, $H_p$ is the ANT score matrix, $X_i$ represents the type of the i-th ANT in the sequence and n is the length of the sequence in nucleotide triplet format.

We next evaluated whether the S-score is effective in classifying coding and non-coding sequences. First, we compared the S-score distribution of the true CDS with that of the other five reading frames of human protein-coding transcripts, which revealed a distinct pattern, where the true CDS frame had the highest score among all six frames (Supplementary Figure S6). Then, we normalized the length of the true CDS and the other five reading frames of all coding transcripts and plotted the ANT score for each position. The data showed the same pattern as the distribution of the S-score (Supplementary Figure S7). For non-coding transcripts, the ANT scores across the normalized length showed a pattern similar to those of the other five non-coding reading frames of protein-coding transcripts (Supplementary Figure S7). The results confirmed the classification ability of the S-score and suggested that it is crucial to first identify the MLCDS region and then classify any given sequence according to the best MLCDS among the six reading frames of an input sequence, regardless of whether it is an actual protein-coding transcript or not.

To construct a classification model, we used a sliding window approach to compute the S-score for each reading frame of a sequence in each of the windows. Then, we identified the MLCDS for each reading frame according to the cumulative S-score of the combined windows using a dynamic programming algorithm. Subsequently, we extracted the five statistical features from the best MLCDS, i.e. the one with the maximal S-score among the six reading frames, and incorporated them into an SVM classifier to train the classification model (Figure 1c).

CNCI performance and comparison with existing tools

We applied CNCI to reliable protein-coding and non-coding data sets of both human and mouse and assessed its performance in the cross-species case. We first trained CNCI on human protein-coding and long non-coding transcripts, which showed 97.3% accuracy by 10-fold cross-validation. For all protein-coding transcripts, the correct transcriptional reading frame showed a notable peak of the ANT scores within the CDS region, and had a distinct pattern compared with the other five reading frames (Figure 2a). The coverage of the maximum MLCDS for all the protein-coding transcripts was consistent with the distribution of the ANT scores (Figure 2a), and a high degree of coincidence between MLCDS and CDS was observed (Supplementary Figure S3). However, these phenomena did not occur in non-coding transcripts, suggesting that the CNCI method is robust (Supplementary Figure S7). Next, we applied the learned regularities to classify objects in a test set, which was collected from mouse protein-coding and non-coding transcripts. We found that the minimum average error (MAE) (the cutoff that minimizes the average of the false-positive and false-negative rates) was 0.05 after examination of the receiver operating characteristic (ROC) curve (Figure 2b). The result showed that CNCI worked reasonably well on mouse data, although CNCI was trained on human sequences. Moreover, we also compared the performance of CNCI with that of other available tools by re-analyzing the test data set using CPC and phyloCSF. The ROC curves showed that the MAE of CNCI was lower than that of the other two methods (MAE was 0.11 and 0.28 for CPC and phyloCSF, respectively), indicating that CNCI is a better tool (Figure 2b). In addition, we further tested CNCI, as well as CPC and phyloCSF, on an independent long intergenic non-coding RNA data set from the Human Body Map lincRNA catalog (11). After removing the transcripts overlapping with the training set, we examined their performance across different lengths of transcripts, and found that CNCI had better performance on all non-coding transcripts of various lengths, whereas CPC and phyloCSF had poor performance on transcripts with longer sequences (Figure 2c).
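
To make the MAE measure concrete, the sketch below scans every candidate cutoff and returns the one that minimizes the average of the false-positive and false-negative rates. It is an illustration of the definition given in the text rather than the evaluation code used in the study; the function name, the convention that a transcript is called coding when its score is at or above the cutoff, and the toy scores are assumptions.

```python
def minimum_average_error(scores, labels):
    """Return (error, cutoff) minimizing the mean of the false-positive and
    false-negative rates. labels[k] is True for a protein-coding transcript;
    a transcript is predicted as coding when its score >= cutoff."""
    positives = sum(1 for y in labels if y)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_error, best_cutoff = float("inf"), None
    # Every distinct score is a candidate cutoff; infinity predicts nothing as coding.
    for cutoff in sorted(set(scores)) + [float("inf")]:
        false_neg = sum(1 for s, y in zip(scores, labels) if y and s < cutoff)
        false_pos = sum(1 for s, y in zip(scores, labels) if not y and s >= cutoff)
        error = 0.5 * (false_neg / positives + false_pos / negatives)
        if error < best_error:
            best_error, best_cutoff = error, cutoff
    return best_error, best_cutoff


# Toy classifier scores: three coding (True) and three non-coding (False) transcripts.
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
labels = [True, True, True, False, False, False]
error, cutoff = minimum_average_error(scores, labels)
print(round(error, 3), cutoff)  # 0.167 0.4
```

Averaging the two rates, rather than counting raw errors, keeps the measure from being dominated by whichever class has more transcripts in the test set.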

Figure 2.

 CNCI performance. ( a ) The top panel shows ANT score distribution (the left y -axis) of these six reading frames for each protein-coding transcript, whose length is normalized to 1100 nucleotide triplets in the x -axis. Red line represents the correct transcriptional reading frame and other five lines (blue or green) represent other five reading frames. Green line indicates the distribution of the coverage (the right y -axis) of the MLCDS region for each protein-coding transcript across the normalized length. The three regions marked by blue, yellow and green indicate the mean length of 3′UTR (6%), CDS (56.6%) and 5′UTR (37.4%), respectively, across the normalized length. The bottom panel shows an example of a gene NM_021222. ( b ) The ROC analyses of CNCI, CPC and phyloCSF. The MAE denoted by solid squares is 0.05, 0.11 and 0.28, respectively. ( c ) The accuracy of CNCI, CPC and phyloCSF for classification of different lincRNA lengths. ( d ) The ROC curves and taxonomic tree of 12 species. The minimum error rate is marked following the name of species.

Because both the reference and the reconstructed transcripts in RNA-Seq experiments may be incomplete, we modified known gene annotation by trimming the exon at the 3′- or 5′-end of each transcript to generate a modified transcript data set (to mimic the incomplete RNA-Seq data) and re-evaluated CNCI performance on these incomplete transcripts. There were 28.3 and 45.2% of known protein-coding transcripts with a complete CDS after trimming the 5′ and 3′ exon, respectively (Supplementary Table S4). CNCI maintained its high accuracy in the modified transcript data sets with a mean accuracy of 97.9 and 97.7%, respectively, which was higher than that of CPC (87.1 and 87.9%, respectively) and phyloCSF (82.0 and 82.3%, respectively) (Supplementary Table S5). To address whether CNCI is effective for sense–antisense pairs, we evaluated its performance on antisense lncRNAs and their protein-coding counterparts, as well as on coding–coding pairs and non-coding–non-coding pairs. The results showed that the mean classification accuracy was 98% for coding–non-coding pairs, 87% for coding–coding pairs and 97% for non-coding–non-coding pairs, which was higher than that of CPC (95, 82 and 97%, respectively) and phyloCSF (63, 91 and 55%, respectively) (Supplementary Table S6, Supplementary Figure S8). These results demonstrated that the CNCI tool is not only useful for classifying incomplete transcripts from RNA-Seq data but also performs well in classifying sense–antisense transcript pairs.

Application of CNCI to gene sets of multiple species and RNA-Seq data of poorly annotated species

Because gene annotation in multiple species (such as vertebrates, invertebrates and plants) has been partially completed by the Ensembl project (16), we tested CNCI on a series of species based on taxonomy. Interestingly, we found that using human data as training sets, CNCI performed uniformly well on all the vertebrate species (all MAE < 0.1), but relatively poorly on invertebrates and plants (MAE is 0.18 and 0.24, respectively) (Figure 2d). Although the accuracy and integrity of the known gene annotation varied across different species (i.e. human, mouse, Caenorhabditis elegans and Arabidopsis thaliana have higher quality of gene annotation than others), the distinct features of protein-coding and non-coding sequences between vertebrates, invertebrates and plants were obvious (Figure 2d). These results demonstrated that it is necessary to use invertebrates and plants as the training data to classify transcript sequences of the corresponding species, respectively. Our findings on the sequence characteristics may reflect changes in evolutionary trends of genes between species. Because RNA-Seq experiments have been carried out for many species, including ones that are not well studied, we tested CNCI performance on a published RNA-Seq data set from six organs of orangutan (3). Using the integrative approach to comprehensively reconstruct transcripts (11, 17), we identified 110 154 expressed multiexonic transcripts, of which 88 563 (80%) had been annotated by the Ensembl database, and 20 414 known genes, of which 13 678 (67%) corresponded to known protein-coding genes. CNCI annotated 7697 novel transcripts as lncRNAs, including 631 intronic, 6029 intergenic and 1037 antisense RNAs that contributed to the orangutan lncRNA catalog (Supplementary Data set 1). This approach can be applied to other species irrespective of the current annotation status.

DISCUSSION

A large number of lncRNAs have been identified, facilitated by the rapid progress of high-throughput sequencing technology (11, 18). Previous studies have demonstrated that lncRNAs are involved in diverse cellular processes, such as cell differentiation, imprinting control and immune responses, and a growing number of lncRNAs have been found to be implicated in disease etiology (19–21). However, for most species, it remains a challenge to distinguish lncRNAs from protein-coding genes because of the lack of necessary information such as whole-genome sequence, known protein database or conserved regions. Therefore, it is important to develop a method independent of known annotations to de novo classify lncRNAs and protein-coding genes. In this study, we found a powerful signature, the profile of the pairs of ANTs, which effectively distinguishes protein-coding or non-coding sequence regardless of species. Our finding was consistent with observations that the CDS regions have been under a variety of competing selection pressures, especially the translation optimization force that is associated with the juxtaposition of tRNAs but not required for non-coding regions (22). It is worth mentioning that a previous study used the length of the longest region in the transcript without stop codons to effectively discriminate the coding and non-coding sequences (23). This so-called stop-best feature was included in the ANT score matrix of our method. Similarly, a recent study demonstrated that the hexamer usage bias is a powerful indicator in the assessment of the protein-coding status of a sequence because of the sequence composition constraints introduced in the coding sequences by the genetic code (24). In addition, a gene finding program, GENSCAN, uses a homogeneous fifth-order Markov model for non-coding regions and an inhomogeneous fifth-order Markov model for coding regions of transcripts (25).

Although CNCI would be effective for classifying incomplete transcripts assembled from RNA-Seq data in most cases, caution should be taken in some cases. In mammalian genomes, at least 3′ exons of protein-coding transcripts may not extend significantly into the coding regions of transcripts. Instead, they may extend for several kilobases, occur abundantly in most RNA-Seq libraries, and thus be deemed independent transcript units by most assembly tools. In such cases, CNCI may misclassify these 3′ (or 5′) partial sequences as non-coding RNAs; however, this misclassification is not because of the classification method per se, but because of the accuracy of the assembly method used. Therefore, accurate query sets containing high-quality assembled transcripts are required to reach optimal performance of CNCI.

CNCI is especially well suited to the transcriptome analysis of species that are not well studied because it can effectively classify transcripts solely based on the nucleotide composition of their sequence. The length of sequences we adopted in this work is >200 nt, and thus, theoretically, any sequence >200 nt can be analyzed using our proposed method. Our method differs from previous methods that depend on information of known genome annotation or sequence conservation ( 4 , 5 ). Therefore, CNCI has a key advantage over other methods because genome sequences have been well annotated or completely sequenced only for a limited number of species so far, and for most species, only part or even none of the whole-genome sequence is known. For this large number of species with poorly annotated sequences, it is difficult to use peptide hits or multispecies alignments to classify sequences into protein-coding or non-coding transcripts, as different ORF cutoffs may lead to a high false-negative/positive rate, particularly for lncRNAs ( 7 ). Although sequence search approaches for the discrimination between protein-coding and non-coding transcripts have been available ( 26 ), there is still a lack of an effective de novo approach to accomplish it. Thus, CNCI is a useful tool, not only for predicting protein-coding or non-coding sequences from high-throughput sequencing data of numerous species but also for analyzing sequence features across species as a way to gain insights into evolution.
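
The >200 nt requirement mentioned above can be applied as a simple pre-filter on assembled transcripts. The sketch below is illustrative only: the FASTA reader, the function names and the toy records are hypothetical and are not part of the CNCI software.

```python
MIN_LENGTH = 200  # nt; sequences must be strictly longer than this


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def long_enough(records, min_length=MIN_LENGTH):
    """Keep only records whose sequence is longer than min_length."""
    return [(h, s) for h, s in records if len(s) > min_length]


# Toy input: record "a" spans two lines (250 nt), "b" is exactly 200 nt, "c" is 201 nt.
toy_fasta = [">a", "A" * 150, "C" * 100, ">b", "G" * 200, ">c", "T" * 201]
kept = long_enough(read_fasta(toy_fasta))
```

Here `kept` retains records `a` and `c`; record `b`, at exactly 200 nt, is excluded because the threshold is strict.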

FUNDING

Training Program of the Major Research plan of the National Natural Science Foundation of China (91229120); International Science and Technology Cooperation Projects (2010DFA31840 and 2010DFB33720). Funding for open access charge: Training Program of the Major Research plan of the National Natural Science Foundation of China [91229120].

Conflict of interest statement. None declared.

REFERENCES

1. An integrated encyclopedia of DNA elements in the human genome. Nature, 2012, vol. 489 (pg. 57-74)

2. Landscape of transcription in human cells. Nature, 2012, vol. 489 (pg. 101-108)

3. The evolution of gene expression levels in mammalian organs. Nature, 2011, vol. 478 (pg. 343-348)

4. CPC: assess the protein-coding potential of transcripts using sequence features and support vector machine. Nucleic Acids Res., 2007, vol. 35 (pg. W345-W349)

5. PhyloCSF: a comparative genomics method to distinguish protein coding and non-coding regions. Bioinformatics, 2011, vol. 27 (pg. i275-i282)

6. Differentiating protein-coding and noncoding RNA: challenges and ambiguities. PLoS Comput. Biol., 2008, vol. 4 pg. e1000176

7. Modular regulatory principles of large non-coding RNAs. Nature, 2012, vol. 482 (pg. 339-346)

8. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature, 2009, vol. 458 (pg. 223-227)

9. The GENCODE v7 catalog of human long noncoding RNAs: analysis of their gene structure, evolution, and expression. Genome Res., 2012, vol. 22 (pg. 1775-1789)

10. Ensembl genomes: an integrative resource for genome-scale data from non-vertebrate species. Nucleic Acids Res., 2012, vol. 40 (pg. D91-D97)

11. Integrative annotation of human large intergenic noncoding RNAs reveals global properties and specific subclasses. Genes Dev., 2011, vol. 25 (pg. 1915-1927)

12. MM-align: a quick algorithm for aligning multiple-chain protein complex structures using iterative dynamic programming. Nucleic Acids Res., 2009, vol. 37 pg. e83

13. LIBSVM: A Library for Support Vector Machines. Acm T Intel Syst Tec, 2011, vol. 2

14. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics, 2009, vol. 25 (pg. 1105-1111)

15. Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol., 2010, vol. 28 (pg. 511-515)

16. Ensembl 2011. Nucleic Acids Res., 2011, vol. 39 (pg. D800-D806)

17. Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat. Protoc., 2012, vol. 7 (pg. 562-578)

18. NONCODE v3.0: integrative annotation of long noncoding RNAs. Nucleic Acids Res., 2012, vol. 40 (pg. D210-D215)

19. Large-scale prediction of long non-coding RNA functions in a coding-non-coding gene co-expression network. Nucleic Acids Res., 2011, vol. 39 (pg. 3864-3878)

20. Long non-coding RNAs function annotation: a global prediction method based on bi-colored networks. Nucleic Acids Res., 2013, vol. 41 pg. e35

21. ncFANs: a web server for functional annotation of long non-coding RNAs. Nucleic Acids Res., 2011, vol. 39 (pg. W118-W124)

22. tRNA properties help shape codon pair preferences in open reading frames. Nucleic Acids Res., 2006, vol. 34 (pg. 1015-1027)

23. Accurate discrimination of conserved coding and non-coding regions through multiple indicators of evolutionary dynamics. BMC Bioinformatics, 2009, vol. 10 pg. 282

24. CPAT: Coding-Potential Assessment Tool using an alignment-free logistic regression model. Nucleic Acids Res., 2013, vol. 41 pg. e74

25. Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 1997, vol. 268 (pg. 78-94)

26. BlastR—fast and accurate database searches for non-coding RNAs. Nucleic Acids Res., 2011, vol. 39 (pg. 6886-6895)

Author notes

The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/3.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
