To meet the objectives of the SeqCOVID project, we are sequencing samples from patients diagnosed with COVID-19 at hospitals all over Spain. More than 30 collaborating hospitals send us genetic material (RNA) extracted from clinical samples, and so far we have sequenced 1,440 clinical samples, a number that grows week after week so as to cover all phases of the epidemic.
Clinical detection of SARS-CoV-2 requires a qPCR performed on RNA from nasopharyngeal exudate. One of the parameters used to decide whether a sample is positive for SARS-CoV-2 is the cycle threshold (Ct), which indicates the cycle at which fluorescence begins to be detected above the background noise. Therefore, all other factors being equal, lower Ct values are associated with greater amounts of genetic material.
Ct values depend on the amount of genetic material contained in the sample, but other factors can also influence fluorescence measurements, such as the presence of PCR inhibitors or the kits and instruments used. In our case, we are interested in the inverse relationship between Ct and the amount of genetic material, since it gives us an approximate idea of how successful the sequencing will be: the lower the Ct value, the greater the amount of starting RNA, and therefore the more successful we can expect the sequencing to be.
We define sequencing “success” as the ability to sequence a sample in such a way that at least 90% of the genome is covered (with a depth of at least 30X). Preliminary analysis indicates that we can use the Ct parameter as a guideline to select which samples to prioritize and, in extreme cases, to decide whether or not to proceed with their sequencing.
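As an illustration of this definition, the following minimal sketch computes the fraction of genome positions covered at a given depth and applies the 90% criterion. It assumes that coverage is evaluated position by position from a list of per-position read depths; the function names and the toy depth values are hypothetical and are not part of the SeqCOVID pipeline.

```python
def genome_coverage(depths, min_depth=30):
    """Return the fraction of positions whose depth is at least min_depth.

    `depths` is a list with one read-depth value per genome position
    (an assumed input format, for illustration only).
    """
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def is_sequencing_success(depths, min_depth=30, min_coverage=0.90):
    """Apply the report's criterion: at least 90% of positions at >= 30X."""
    return genome_coverage(depths, min_depth) >= min_coverage


# Hypothetical toy genome of 10 positions; 9 of them reach 30X or more.
toy_depths = [120, 85, 30, 45, 200, 31, 29, 60, 75, 90]
print(genome_coverage(toy_depths))
print(is_sequencing_success(toy_depths))
```

In practice the per-position depths would come from the alignment of the reads to the reference genome; how they are obtained is outside the scope of this sketch.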
Relationship between Cycle threshold and genomic coverage.
A wide variety of qPCR kits and designs is currently available, yet all of them are based on amplifying one or more of three genes: E, N and/or RdRP. At the time of this report, 1,440 samples have been sequenced, of which 632 (44%) have Ct data for at least one of these genes:
- 492 samples with known Ct for the E gene
- 323 samples with known Ct for the N gene
- 237 samples with known Ct for the RdRP gene
We chose the Ct value of the E gene as the criterion for making sequencing decisions, for two reasons: (i) it is the value that hospitals report most frequently, and (ii) on average it is the gene with the lowest Ct values (24.8, versus 27.5 for the N gene and 28.3 for the RdRP gene). This result may indicate that the assay designed to amplify the E gene is more sensitive.
After comparing the Ct values of the E gene with the sequencing results (specifically, the genomic coverage achieved for each sample), we concluded that a high proportion of samples with Ct values greater than 33 had a genomic coverage below 90%, so they are not considered suitable for sequencing (Figure 1A). Specifically, of the 492 samples with known Ct for the E gene, 459 had a Ct value <= 33. Within this range, 442 samples (96%) reached more than 90% coverage. In contrast, only 14 of the 33 samples (42%) with Ct > 33 reached more than 90% coverage.
Figure 1. Relationship between Ct (qPCR) and genomic coverage. (A) Genomic coverage reached by the samples with known Ct for the E gene. The dashed lines mark the established cut-off values (more than 90% coverage as an indicator of good sequencing quality, and Ct <= 33). A high proportion of the samples with Ct above 33 do not reach 90% genomic coverage. (B, C) Genomic coverage in relation to the Ct values of the RdRP and N genes, respectively.
For this reason, the cut-off point was set at E gene Ct <= 33, as indicated by the dashed lines in Figure 1A. This reference helps us decide which samples to prioritize for sequencing. It should be noted that all samples, regardless of their E gene Ct value, are being sequenced, since some of those above the cut-off do achieve adequate quality. At present we use this criterion only to give preference to sequencing the samples that meet it.
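The percentages quoted above, and the way the cut-off can be used to order samples, can be sketched as follows. This is only an illustrative example: the sample counts are those given in this report, while the function names, the sample identifiers and their Ct values are hypothetical.

```python
CT_E_CUTOFF = 33.0  # cut-off for the E gene Ct, as stated in the report


def success_rate(n_success, n_total):
    """Fraction of samples that reached more than 90% coverage."""
    return n_success / n_total


def split_by_cutoff(samples, cutoff=CT_E_CUTOFF):
    """Split (sample_id, ct_e) pairs into priority and deferred lists.

    Samples with Ct <= cutoff are sequenced first; the rest are still
    sequenced, only later, since some of them reach adequate quality.
    """
    priority = [sid for sid, ct in samples if ct <= cutoff]
    deferred = [sid for sid, ct in samples if ct > cutoff]
    return priority, deferred


# Counts taken from the report.
print(round(100 * success_rate(442, 459)))  # samples with Ct <= 33
print(round(100 * success_rate(14, 33)))    # samples with Ct > 33

# Hypothetical sample identifiers and Ct values, for illustration only.
toy_samples = [("S1", 35.2), ("S2", 20.1), ("S3", 33.0), ("S4", 36.8)]
priority, deferred = split_by_cutoff(toy_samples)
print(priority, deferred)
```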
Relationship between Cycle threshold and detection of mutations.
In addition to the influence that the Ct value may have on coverage, we have studied another aspect of sequencing that it can affect. It has been observed that artifacts can be introduced when sequencing samples with high Ct values, in which the amount of starting genetic material is small.
We analyzed this possibility at two key steps: SNP calling and indel detection. To do so, we compared the E gene Ct values with the number of SNPs and the number of indels, respectively. As the scatter plot shows (Figure 2A), this does not appear to be a problem when calling SNPs, since no clear correlation is observed between the Ct value and the number of SNPs in a sample (Pearson = 0.06, not significant, p-value = 0.18). However, it does seem to affect indels (Figure 2B): although not always, and with a low correlation, samples with higher Ct values tend to present more indels in the consensus genome (Pearson = 0.34, significant, p-value < 0.01).
Figure 2. Relationship between E gene Ct values (qPCR) and number of mutations detected by sequencing. (A) Number of SNPs found in each sample in relation to its Ct value; no linear correlation between the two variables is observed. (B) Ct value of the E gene plotted against the number of indels detected in each sample. The number of indels tends to increase as the Ct value of the sample increases.
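For readers who wish to reproduce this kind of comparison, the Pearson coefficient between Ct values and mutation counts can be computed as in the sketch below. It is a plain implementation of the standard formula and does not compute the p-value; a statistics library such as SciPy (scipy.stats.pearsonr) returns both. The function name and the toy values are hypothetical and are not the data of this report.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-deviations and sums of squared deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical E gene Ct values and indel counts, for illustration only.
toy_ct = [18.0, 22.5, 27.0, 31.5, 35.0]
toy_indels = [0, 0, 1, 1, 3]
print(pearson_r(toy_ct, toy_indels))
```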
When no data are available for the E gene, we can estimate its Ct by interpolation from the Ct values of the N and RdRP genes, since they are linearly correlated. The Pearson correlation coefficients are 0.94 between the Cts of the E and N genes (Figure 3A) and 0.91 between the Cts of the E and RdRP genes (Figure 3B), in both cases with p-value < 0.01. We can therefore conclude that there is a direct linear relationship between the variables in both cases. The scatter plots in which this correlation can be seen are shown below.
Figure 3. Scatter plots comparing the Ct values of the E gene with the Ct of the N gene (A) and with the Ct of the RdRP gene (B). The direct linear correlation in both cases allows the Ct value of the E gene to be interpolated when it is missing.
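One possible way to carry out this estimate is an ordinary least-squares line fitted on the samples for which both Ct values are known, as sketched below. The report does not state the fitting method or the fitted coefficients, so the approach, the function names and the toy values here are assumptions for illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_ct_e(ct_other, slope, intercept):
    """Estimate the E gene Ct from the Ct of another gene (N or RdRP)."""
    return slope * ct_other + intercept


# Hypothetical paired values (N gene Ct, E gene Ct), not the report's data.
toy_ct_n = [20.0, 25.0, 30.0]
toy_ct_e = [18.0, 23.0, 28.0]
slope, intercept = fit_line(toy_ct_n, toy_ct_e)

# Estimate the E gene Ct for a sample that only has an N gene Ct of 27.
print(estimate_ct_e(27.0, slope, intercept))
```

A separate line would be fitted for each gene pair (E versus N, and E versus RdRP), and the estimated value could then be compared with the cut-off of 33 in the same way as a measured E gene Ct.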
As new data are obtained, we will be able to carry out a more robust analysis, allowing us to verify how accurate our preliminary results are and thus to establish with greater confidence the Ct range that ensures optimal genomic coverage. Given the high incidence of COVID-19, it is important to set criteria of this kind to help us establish an order of priority for sequencing samples. This lets us streamline our workflow and obtain faster results that can be translated into concrete actions to control the pandemic.
Report written by the members of the consortium: Ana María García Marín, Galo Adrián Goig Serrano. Institute of Biomedicine of Valencia (IBV), Spanish National Research Council (CSIC).
Translation by: Paula Ruiz Rodriguez. Instituto de Biología Integrativa de Sistemas, I2SysBio (CSIC-Universitat de València), Valencia, Spain