Improving missing value estimation in microarray data with gene ontology: Supplementary web page

Johannes Tuikkala(1,3,*), Laura Elo(2,3,4), Olli S. Nevalainen(1,3,4), and Tero Aittokallio(2,3,4)

(1) Department of Information Technology, University of Turku, Lemminkäisenkatu 14A, FIN-20520, Finland,
(2) Department of Mathematics, University of Turku, FIN-20014 Finland,
(3) Turku Centre for Computer Science (TUCS), Lemminkäisenkatu 14A, FIN-20520, Finland,
(4) Turku Centre for Biotechnology, Tykistökatu 6, FIN-20521, Finland
(*) Corresponding author

Matlab and Java codes

Selection of the neighbourhood size


Supplementary figure 1

The neighbourhood size k was selected by evaluating the imputation accuracy of the k-NN and the GO-based k-NN methods with k varying from 1 to 49. Supplementary fig. 1 is consistent with the remark of Feten et al. (2005) that, beyond a certain value of k, the number of neighbours matters little. We observed that 20 neighbours were enough for each of the data sets, and thus k = 20 was used in each test run. In these runs, 5 % of the values of each data set were set as missing.
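The kind of test run described above can be illustrated with a minimal sketch of plain k-NN imputation and an NRMS error measure. This is not the Matlab or Java code of this page. The function names, the unweighted neighbour mean, the distance over shared conditions, and the normalisation of the error by the variance of the true values are all assumptions made for illustration.

```python
import math

def knn_impute(data, k):
    """Fill None entries with the mean of the k nearest genes' values.

    data: list of rows (genes), each a list of floats or None (missing).
    Assumed distance: root mean squared difference over the conditions
    observed in both genes. The input matrix is left unchanged.
    """
    def distance(a, b):
        shared = [(x - y) ** 2 for x, y in zip(a, b)
                  if x is not None and y is not None]
        return math.sqrt(sum(shared) / len(shared)) if shared else math.inf

    imputed = [row[:] for row in data]
    for i, row in enumerate(data):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other genes observed in condition j.
            candidates = sorted(
                (distance(row, other), other[j])
                for m, other in enumerate(data)
                if m != i and other[j] is not None
            )
            nearest = [v for d, v in candidates[:k] if d < math.inf]
            if nearest:  # otherwise the entry stays missing
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed

def nrms_error(true_values, imputed_values):
    """Assumed NRMS definition: sqrt(mean squared error / variance of truth)."""
    n = len(true_values)
    mse = sum((t - p) ** 2 for t, p in zip(true_values, imputed_values)) / n
    mean = sum(true_values) / n
    variance = sum((t - mean) ** 2 for t in true_values) / n
    return math.sqrt(mse / variance)
```

Scanning k then amounts to hiding a fraction of the known values, calling `knn_impute` for each k, and comparing the imputed entries with the hidden ones through `nrms_error`.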


alpha test run for molecular function ontology


Supplementary figure 2

Imputation accuracy of the GO-based k-NN method was tested for different values of alpha using the molecular function (MF) ontology. Larger values of alpha are needed when the percentage of missing values is high. The optimal alpha value also depends considerably on the data set under investigation, suggesting that there is no universal alpha value. In particular, the number of conditions has a marked influence (cf. the diauxic versus elutriation data sets).
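To make the role of alpha concrete, the sketch below shows one way a weighting exponent could blend an expression distance with a GO dissimilarity when ranking neighbour candidates. The combination rule used here (expression distance multiplied by the GO dissimilarity raised to the power alpha) is an assumption for illustration, as are all function names. This page does not state the formula. With alpha = 0 the sketch reduces to a ranking on expression distance alone.

```python
def combined_distance(expr_dist, go_dissim, alpha):
    """Hypothetical blend: expression distance scaled by go_dissim ** alpha.

    alpha = 0 ignores the GO information; larger alpha gives it more weight.
    """
    return expr_dist * (go_dissim ** alpha)

def rank_neighbours(expr_dists, go_dissims, alpha, k):
    """Return the indices of the k candidates with the smallest blended distance."""
    scored = sorted(
        (combined_distance(e, g, alpha), idx)
        for idx, (e, g) in enumerate(zip(expr_dists, go_dissims))
    )
    return [idx for _, idx in scored[:k]]
```

In this toy setting, a candidate that is slightly farther away in expression space but functionally similar (small GO dissimilarity) moves up the ranking as alpha grows.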


alpha test run for biological process ontology


Supplementary figure 3

Imputation accuracy of the GO-based k-NN method was tested for different values of alpha using the biological process (BP) ontology. Larger values of alpha are needed when the percentage of missing values is high. Supplementary fig. 3 also shows that the results with the BP ontology are very similar to those with the MF ontology (Supplementary fig. 2). These two figures reveal the practical usefulness of the GO-based imputation algorithm: larger optimal values of alpha appear when there is more benefit from using the GO information.


Accuracy of the adaptive alpha selection


Supplementary figure 4

The accuracy of the adaptive alpha selection procedure was evaluated by monitoring the values of the selected alphas for fixed percentages of missing values (10 % and 20 %). We evaluated the results of 100 test runs for each missing value percentage (Supplementary fig. 4). The diauxic data set and the BP ontology were used. The selection procedure worked as expected, and the histograms are in good agreement with the results of the alpha test runs above (Supplementary fig. 3).
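The evaluation described above can be sketched as follows, assuming that the adaptive procedure picks, from a set of candidate alphas, the one with the lowest estimated imputation error. That selection rule, the function names, and the `estimate_error` callback are hypothetical. The actual procedure is not spelled out on this page.

```python
from collections import Counter

def select_alpha(candidate_alphas, estimate_error):
    """Return the candidate alpha with the lowest estimated error.

    estimate_error: hypothetical callback mapping alpha to an error estimate,
    for example an NRMS error measured on artificially hidden known values.
    """
    best_alpha, best_error = None, float("inf")
    for alpha in candidate_alphas:
        error = estimate_error(alpha)
        if error < best_error:
            best_alpha, best_error = alpha, error
    return best_alpha

def alpha_histogram(candidate_alphas, error_estimators):
    """Count how often each alpha is selected over repeated test runs."""
    return Counter(select_alpha(candidate_alphas, est) for est in error_estimators)
```

Each element of `error_estimators` stands for one test run with its own random pattern of missing values, so the resulting counts play the role of the histograms in Supplementary fig. 4.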


Eigenvalues of the data sets


Supplementary figure 5

The strength of the correlation structure of a data set was determined as the ratio between the first eigenvalue of the covariance matrix and the sum of all eigenvalues. The individual eigenvalues of the data sets are shown in Supplementary fig. 5, suggesting that the correlation structures of the data sets under study were generally rather strong.
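The ratio defined above can be computed without a full eigendecomposition, because the sum of all eigenvalues of a covariance matrix equals its trace. The sketch below estimates the first eigenvalue by power iteration. The function names, the choice of the covariance between conditions, and the use of power iteration are assumptions for illustration, and the data matrix is assumed to be complete and not constant.

```python
import math

def covariance_matrix(data):
    """Sample covariance between the columns (conditions) of a complete matrix."""
    n, m = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(m)]
    centred = [[row[j] - means[j] for j in range(m)] for row in data]
    return [
        [sum(r[a] * r[b] for r in centred) / (n - 1) for b in range(m)]
        for a in range(m)
    ]

def first_eigenvalue_ratio(cov, iterations=200):
    """Ratio of the largest eigenvalue to the sum of all eigenvalues (the trace)."""
    m = len(cov)
    trace = sum(cov[i][i] for i in range(m))
    # Non-uniform start vector; power iteration can still fail in the rare
    # case where the start is orthogonal to the leading eigenvector.
    vec = [float(i + 1) for i in range(m)]
    eigenvalue = 0.0
    for _ in range(iterations):
        product = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            break
        vec = [x / norm for x in product]
        eigenvalue = norm  # converges to the largest eigenvalue
    return eigenvalue / trace
```

A ratio close to 1 means that a single direction explains almost all of the variance, that is, a strong correlation structure.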


Effect of the number of conditions


Supplementary figure 6

The NRMS errors of the k-NN and the GO-based k-NN methods were compared for four different missing value percentages while the number of conditions was varied. The alpha part of Spellman's yeast cell cycle data set and the biological process ontology were used here (Supplementary fig. 6).


Effect of the release date of the ontology


Supplementary figure 7

We studied how the evolution of the gene ontology affects the imputation accuracy of the GO-based k-NN method. We determined the imputation accuracy of the GO-based k-NN method with old biological process ontology files downloaded from the ftp-server of the Gene Ontology consortium (ftp.geneontology.org). It turned out that the evolution of the function ontology has a significant influence on the imputation accuracy of the GO-based k-NN method only if the percentage of missing values is high (Supplementary fig. 7).


GO Slims


Supplementary figure 8

Imputation accuracy of the GO-based k-NN method was compared between GO Slims (generic GO Slims and yeast GO Slims) and the full ontologies. Supplementary fig. 8 shows that when the missing value percentage is low, the GO Slims are almost as good as the full ontologies, but when the missing value rate is high, the full ontologies produce more accurate results than the Slims, at least when the diauxic data set is used.


Reduced function ontology


Supplementary figure 9

Imputation accuracy of the GO-based k-NN method was compared between a reduced molecular function ontology (terms and subgraphs related to transferase or kinase activity were deleted from the ontology graph) and the full ontology. The objective was to study the effect of such general functional categories on the imputation method. Supplementary fig. 9 shows that there were no significant differences in the imputation accuracy as compared to the original ontology. In the figure, GOkNNr denotes the reduced MF ontology and GOkNN the full MF ontology.
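Deleting terms together with their subgraphs, as described above, can be sketched on a toy graph as follows. The parent-to-children dictionary, the function name, and the rule that every descendant of a deleted term is dropped (even one that has another parent elsewhere) are assumptions for illustration. The term names in the usage note are a simplified toy hierarchy, not the real MF ontology.

```python
def remove_subgraphs(children, roots_to_remove):
    """Return a copy of the graph without the given terms and their descendants.

    children: dict mapping each term to the set of its child terms.
    """
    doomed = set()
    stack = list(roots_to_remove)
    while stack:  # depth-first walk over everything below the removed terms
        term = stack.pop()
        if term in doomed:
            continue
        doomed.add(term)
        stack.extend(children.get(term, ()))
    return {
        term: {child for child in kids if child not in doomed}
        for term, kids in children.items()
        if term not in doomed
    }
```

For example, with a toy graph in which "catalytic activity" has the children "transferase activity" and "hydrolase activity", and "kinase activity" sits below "transferase activity", removing "transferase activity" also removes "kinase activity" and leaves "hydrolase activity" in place.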


Best and worst terms of MF sub-ontology


Supplementary figure 10

Distribution of the 100 best terms (red nodes) and the 100 worst terms (blue nodes) in the MF sub-ontology. The impact of each MF term on the imputation was measured by subtracting the imputation error obtained with the GO-based k-NN method from the error obtained with the pure k-NN method. We repeated this calculation 10 times on the diauxic data set with a 20 % missing rate and calculated the average impact of each term. The alpha value was fixed to 1 in these runs. The 100 best terms and the 100 worst terms can be viewed in the AmiGO ontology browser.
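The impact measure described above (k-NN error minus GO-based k-NN error, averaged over repeated runs) and the ranking into best and worst terms can be sketched as below. How the per-term error of the GO-based method is obtained is not specified here, so it is taken as an input. The function names and data layout are hypothetical.

```python
def average_impacts(knn_errors, go_errors_by_term):
    """Average, over runs, of (k-NN error - GO-based k-NN error) for each term.

    knn_errors: list with one k-NN error per run.
    go_errors_by_term: dict mapping a term to its list of per-run errors.
    A positive impact means the term improved the imputation.
    """
    impacts = {}
    for term, go_errors in go_errors_by_term.items():
        diffs = [knn - go for knn, go in zip(knn_errors, go_errors)]
        impacts[term] = sum(diffs) / len(diffs)
    return impacts

def best_and_worst(impacts, n):
    """Return the n terms with the highest and the n with the lowest impact."""
    ranked = sorted(impacts, key=impacts.get, reverse=True)
    return ranked[:n], ranked[-n:][::-1]
```

With n = 100 this yields the two groups of terms that are coloured in Supplementary fig. 10.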


Effect of using constrained set of GO annotations


Supplementary figure 11

We studied how constraining the set of annotations affects the imputation accuracy of the GO-based k-NN method. Two sets of annotations were used: (i) all available annotations and (ii) only annotations with the TAS or IDA evidence field label (the two most reliable evidence fields). The objective was to investigate the effect of such reliable annotations on the accuracy of the GO-based imputation algorithm. With the MF ontology, the constraint impairs the imputation accuracy on average, whereas with the BP ontology it improves the accuracy. However, these differences were rather small, and we cannot give any recommendation on which evidence codes the user should apply (Supplementary fig. 11).
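Restricting the annotations to the TAS and IDA evidence codes can be sketched as a simple line filter. The sketch assumes that the annotations are in the tab-separated GO annotation file (GAF) layout, where comment lines start with "!" and the evidence code is the seventh column. The file format used for this study is not stated on this page, and the function name is hypothetical.

```python
def filter_annotations(lines, allowed_evidence=("TAS", "IDA")):
    """Keep only annotation lines whose evidence code is in allowed_evidence.

    lines: iterable of text lines in the assumed GAF layout.
    Comment lines (starting with '!') and blank lines are dropped.
    """
    kept = []
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 7 (index 6) holds the evidence code in the assumed layout.
        if len(fields) > 6 and fields[6] in allowed_evidence:
            kept.append(line)
    return kept
```

Passing an open annotation file to `filter_annotations` gives annotation set (ii), while the unfiltered lines correspond to set (i).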


Comparison of execution times against imputation accuracy


Supplementary figure 12

Summary of the execution times (bars) of different versions of the k-NN-based imputation along with the corresponding NRMS errors (line). The nine versions are arranged according to their imputation accuracies. The diauxic data set with 20 % of missing values was used. The results show that 20 % of the genes are enough for the adaptive alpha selection procedure. The execution times do not include the ontology buffer construction times, which were 206 min for the BP ontology, 109 s for the MF ontology, 53 s for the BP yeast slim, and 41 s for the MF yeast slim. An ontology buffer is constructed only once for each ontology and data file, and it can be stored on the hard disk. It contains the GO dissimilarity values for each gene pair in the data set. A stored ontology buffer needs about 300 MB of disk space and about 450 MB of main memory when in use.
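The idea of an ontology buffer, a table of GO dissimilarities precomputed once for every gene pair, can be sketched as a flat upper-triangular array. The layout, the function names, and the `dissimilarity` callback are assumptions for illustration and do not describe the actual storage format of the buffer.

```python
def build_buffer(genes, dissimilarity):
    """Precompute dissimilarity(g_i, g_j) for all pairs i < j, row by row.

    dissimilarity: hypothetical symmetric function on two gene identifiers.
    """
    n = len(genes)
    return [
        dissimilarity(genes[i], genes[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]

def lookup(buffer, n, i, j):
    """Read the stored dissimilarity of genes i and j (0-based indices)."""
    if i == j:
        return 0.0
    if i > j:
        i, j = j, i
    # Row i starts after the i previous rows, which hold n-1, n-2, ... entries.
    return buffer[i * n - i * (i + 1) // 2 + (j - i - 1)]
```

Such a layout stores n(n - 1)/2 values for n genes, so the buffer size grows quadratically with the number of genes in the data set, which is why building it once and reusing it pays off.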