Mapping Binary (On/Off) Gene Expression for Pathway & Tissue Analysis A THESIS SUBMITTED TO THE FACULTY OF UNIVERSITY OF MINNESOTA BY Raymond M Moore IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE Jean-Pierre Kocher, Claudia Neuhauser August 2012 © Raymond Moore 2012 Acknowledgements Jean-Pierre Kocher Chen Wang Claudia Neuhauser Hugues Sicotte Yuji Zhang Special Thanks to: Matthew McCall i ii Dedication This thesis is dedicated to Caitlin & my Parents. iii Abstract Microarray data contain information about the level of expression of genes that can be informative of changes taking place in cells. This information has been widely used to study the changes in gene expression between normal and cancer cells. Gene expression has been used as a biomarker predictive of the progression of a disease and to identify drug targets specifically expressed in cancer tissue. Although the level of expression of a gene can change by multiple folds across tissues, the simplest information (on/off status) about a gene is whether it is or is not expressed in a given tissue. In this study we propose to take advantage of a large set of tissue specific gene expression data to study the profile of gene expression in pathways and tissues. To perform this analysis, we leverage and improve a method that computes the on/off status of a gene from their level of expression. The percent of genes with on status in a given tissue was selected to summarize across bio specimen. The gene state method was applied to sets of tissue specific expression microarray extracted from the GEO database. We then studied the profile of on/off state of genes in KEGG pathways across several tissues. The data were then used to calculate a distance between gene sets. Using all genes, a distance could be calculated between normal and cancer tissue, as well as pairwise comparisons between each tissue type. The gene sets were then narrowed and selected based on pathway annotation from KEGG. This demonstrated an ability to identify known cancer pathways based on their gene signature distances. The results affirm known cancer pathways by calculating relative distances. iv Table of Contents Acknowledgements ......................................................................................................................... iii Dedication ........................................................................................................................................ ii Abstract ........................................................................................................................................... iii List of Tables ................................................................................................................................... v List of Figures ................................................................................................................................. vi Introduction ...................................................................................................................................... 1 Goals of the Project...................................................................................................................... 1 Introduction to Microarray Technology and its Application in Gene Expression ................... 3 Microarray Preparation ............................................................................................................ 4 Design of an Affymetrix Chip ................................................................................................. 4 Affymetrix Standard File to Store Gene Expression Results ................................................... 5 Preprocessing of Gene Expression Data Obtained from Microarray ........................................... 6 Background on Normalization Methods .................................................................................. 6 The Multiple Array Normalization Method ............................................................................. 6 Background on Application of On/Off State of a Gene ............................................................... 7 The Argument for Informative Pathway Level Expression Analysis ...................................... 7 Affymetrix Present/Absent Call ................................................................................................... 8 Background ...................................................................................................................................... 9 Origin and Retrieval of Data ........................................................................................................ 9 The Need for Better Summarization, a New CDF ....................................................................... 9 The Original Affymetrix CDF Annotation File ....................................................................... 9 New CDF: Probe Set Equals a Gene ...................................................................................... 10 Introduction to Barcode ............................................................................................................. 11 The Equation to Creating the Model: ..................................................................................... 11 Estimation of the Noise/Off Distribution ............................................................................... 12 The Decision Boundary ......................................................................................................... 13 Methods ......................................................................................................................................... 13 Barcode Methods & Script Descriptions ................................................................................... 13 Input Data Prep ...................................................................................................................... 14 Round 1: Estimate Distribution without Global Effect .......................................................... 15 Computational Speed Up ....................................................................................................... 16 Round 2: Re-estimation taking Global Effects into Account ................................................ 17 Validation of Barcode Modifications......................................................................................... 18 Results & Discussion ..................................................................................................................... 19 Gene Expression Analysis in Tissues .................................................................................... 19 Comparison between Normal and Cancer Tissue .................................................................. 20 Comparison between Normal Tissues ................................................................................... 21 Pathway Focused Comparison of Normal versus Cancer Tissue........................................... 22 How the Mean is Computed .................................................................................................. 22 Future Work ................................................................................................................................... 23 Application of Other Microarray Platforms ........................................................................... 23 Utilization of More Robust Differential Calculation ............................................................. 24 Integration with Archive Resources ....................................................................................... 24 Expansion to Open Source Software ..................................................................................... 24 References ...................................................................................................................................... 26 i v List of Tables Table 1: Tissue Normal Vs. Cancer .................................................................................. 21 Table 2 Normal Vs. Normal Tissue Comparison.............................................................. 21 Table 3 Pathways Normal Vs. Cancer .............................................................................. 23 vi List of Figures Figure 1: Workflow Overview. ........................................................................................... 2 Figure 2: The process of creating a microarray chip .......................................................... 4 Figure 3: A custom gene CDF ........................................................................................... 10 Figure 4: Prep Input for Barcode ...................................................................................... 14 Figure 5: Model Estimation Workflow. ............................................................................ 16 Figure 6: Validation with original barcode. ...................................................................... 18 Figure 7: Example of Euclidean distance determination .................................................. 20 Equation 1: Barcode Model .............................................................................................. 11 Equation 2: Barcode Decision Factor ............................................................................... 13 1 Introduction Goals of the Project Gene expression has been used as a biomarker in various studies to predict the progression of a disease. It has also been used to identify drug targets in cancer tissue. (Parmigiani, 2002) Expression data is usually acquired through microarray, which provides information about the level of expression by measuring the amount of messenger ribonucleic acid (mRNA). RNA translates to protein, which is the functional mechanism in a cell. Measuring the level of RNA can provide insight into the changes taking place in a cell. The experiment entailed finding a method to calculate the on/off gene state from gene expression data. Population studies, which investigate cellular changes across a population, could utilize this approach which can provide data by viewing all pathways or tissues simultaneously The level of expression of a gene varies greatly across tissues and even within tissues. From a functional point of view, the most basic information about a gene is whether or not it is expressed in a given tissue, which can be summarized as two states: on or off. Reducing gene expression into on and off is important because it simplifies the data. The Boolean description of the on or off state of a gene yields a straightforward interpretation. The gene states reflect the biological mechanism of a gene being on and used or not and unused. The modeling of biological networks into Boolean “control circuits” was originally developed by Kauffman (1969). Despite possible oversimplification, the information obtained from this Boolean approach could reveal 2 insights into relationships between genes, corresponding pathways and across tissue types that would be otherwise hard to interpret (Kauffman, 1973). We discuss the implementation of a specific method that enabled us to model Boolean networks. The primary aim of this work was the assembly of a workflow to process gene expression data from microarrays (Affymetrix) and convert the level of expression into the on/off state of a gene. This workflow included several applications that had been previously developed. However these applications could not be readily assembled without significant modifications. This workflow was designed to process a large number of microarrays from different tissues and different experiments. Therefore, the critical first step was data normalization to adjust the distribution of gene expression to be comparable. This step was accomplished by fRMA (McCall, Irizarry 2011), a recently developed method that allowed for the sequential normalization (preferably by batch) of microarray data. The normalized data are then processed by an application called “Barcode” (McCall, Uppal, Jaffee, Zilliox, Irizarry, 2011) that converts level of expression into on/off state of a gene. By default, these two applications use a standard CDF (cel definition file) annotation file provided by Figure 1: Workflow Overview. Each sample is normalized by fRMA, then applied to the barcoding algorithm, which utilizes the previously created model to calculate Boolean values. 3 Affymetrix that relates each probe to a probe set. (A probe set is a collection of probes intended to target specific sequences in a gene.) Since the knowledge of the human genome is constantly expanding, the CDF provided by Affymetrix has become outdated. A newly updated CDF was proposed by several research groups, to address the changes from the older version. (Wang, Verhaak, Purdom, Spellman, Speed, 2011) The work performed for this thesis includes the 3 following specific aims: 1. Enable fRMA and the “Barcode” application to process data with the new CDF provided by Brain Array lab. (Thompson, Fan, 2011) 2. Build these applications into a workflow and interface to the BORA repository system developed by the Bioinformatics Core at Mayo Clinic. 3. Describe the behavior of the on/off state of gene in pathways across different tissue types. Introduction to Microarray Technology and its Application in Gene Expression Microarray is a technique that captures the level of expression of genes on a genome wide scale by measuring the quantity of RNA molecules bound to their complement. The technique enables fast processing of ribonucleic acid (RNA) molecules, indicating the level of gene expression. Microarray technology creates a snapshot of the level of expression for thousands of genes in a biospecimen. The process of microarray can be summarized into four steps: obtain a selection of RNA molecules from a sample, select a manufactured chip and perform the experiment, preprocess the raw intensity data, then analyze the data according to the experimental type. 4 Microarray Preparation There are two methods for manufacturing microarray chips: DNA fragment deposition and in situ synthesis (Zhang, 2006). One of the best known manufacturers is Affymetrix. In the manufacture of Affymetrix chips two techniques are required: photolithography & solid- phase DNA synthesis. The process begins with photolithographic masks that allow light to react only over specific cells on the chip. The open spots in the mask allow light to convert the protective group on the terminal nucleotide into a hydroxyl group. The hydroxyl group is reactive enough to create a new bond to the next nucleotide in the sequence. The mask changes each round and is designed to correspond to all the cells having the same nucleotide added. It takes many rounds of masking and adding the nucleotides to create a chip with thousands of individual sequences called “probes.” Microarray preparation begins with the extraction of RNA. Then the collections of RNA molecules are applied to the microarray chip. The microarray chip containing RNA molecules are inserted into a machine to be read. Design of an Affymetrix Chip Two types of probes exist on an Affymetrix chip; reference probes (or perfect match) and mismatch probes. During the hybridization step the DNA probes on the Figure 2: The process of creating a microarray chip can really be simplified down to three basis steps. Spotting, where the probes are arranged on the chip. Hybridization, where the sample RNA is applied to the chip and allowed to bind to its compliment. Scanning, where the fluorescence is measured to identify the amount of RNA bound to the chip. (Eichinger, 2009) 5 microarray will form heteroduplexes with their target RNA due to the Watson-Crick base pairing rule. (Zhang, 2006) Fluorophores are added to measure the level of fluorescence, which is proportional to the amount of heteroduplexes. The reference probes (PM) are specifically targeting a known sequence in the genome within an expected gene. The mismatch probes (MM) have a single nucleotide that is altered intentionally. The altered nucleotide is meant to detect sequences that have cross-hybridized. Cross-hybridization happens when a DNA/RNA sequence binds to a probe that is similar but not its perfect complement, even with a single nucleotide difference (Zhang, 2006). RNA hybridization can be influenced by many conditions including temperature, humidity, salt, target solution volume and the sequence itself (Zhang, 2006). Including mismatch probes on a chip is intended to be a strategy for detecting noise generated by the non-specific binding. The concept was that low expressed probes may be entirely made up of the noise or baseline and thus, unreliable. Additionally, some base line is expected for every probe. However, the usefulness of the mismatch probe is degraded by the fact that it is not necessarily removing noise, but potentially, viable information as well. This being understood, few normalization techniques continue to utilize the mismatch probes. Affymetrix Standard File to Store Gene Expression Results The dense intensity data is stored in files that have the extension cel and adhere to a standard format (“.CEL File Extension” 2011). The cel file stores the calculated numeric interpretation from the digital image of the features on the microarray chip. The core of the file is made up of the calculated feature intensity, which is obtained from pixel values of the image file created by digitally reading the light signatures in the 6 microarray reading machine. In addition to the feature data, there are various other values to describe the microarray including flags, outliers, masked features, sub-grids and records of the algorithms used to calculate the data (“.CEL File Extension” 2011). Every value stored in a cel file comes from a single feature, which represents one probe. The information about a single probe is not informative. The microarray chip is designed to contain multiple features to statistically derive useful information about a gene. The cel file is loaded into a normalization algorithm. Preprocessing of Gene Expression Data Obtained from Microarray Background on Normalization Methods Normalization, also known as data transformation or preprocessing, is the method of converting the microarray raw intensities into biologically interpretable signals. Normalization ensures constant variability at all intensity levels and eliminates experimental bias (Zhang). In addition, Microarray can produce values that contain noise specific to the facility and version, which also requires consideration of normalization procedures. The most widely used techniques are robust multi-array analysis (RMA), GeneChip RMA (gcRMA), Model Based Expression Index. (MBEI), and Probe Logarithmic Intensity Error (PLIER), all of which are referred to as multi-array analysis (McCall, Irizarry 2011). The Multiple Array Normalization Method Robust multi-array average (RMA) employs log-transformed PM values, and adjusts those values based on a global background reference computed across all the 7 included arrays (Irizarry, Hobbs, Collin, Beazer-Barclay, Antonellis, Scherf, et al. 2003). It is a well-established methodology that has been shown to be more effective than techniques that include MM values (Irizarry, 2003). The RMA method has an alteration to allow for iterative processing without the need for re-analyzing every array when a new one is added, known as frozen Robust Multi-array Analysis (fRMA). (McCall, Irizarry 2011). The premise of fRMA is to calculate a set of parameters that are essentially frozen, from a very large curated database of microarrays. Background on Application of On/Off State of a Gene Initally, the developers of Barcode came up with two applications; identifying orthologous genes via phylogenetic tissue analysis and determining an unknown tissue or cell type through gene expression. The first application consisted of choosing tissues from mouse and human samples, and looking at clustering of expression values and barcoded values. The results showed that the expression values clustered according to species, while the barcoded values successfully identified orthologous tissues. (McCall, 2011) The other application employed by the Barcode group was to use barcoded values to identify tissue or cell type of an unknown sample. Searching through a library of cell types previously barcoded and matching the unknown sample to the most likely cell type. The result of the application was very successful in matching cell or tissue types by their gene expression barcoded values (McCall, 2011). The Argument for Informative Pathway Level Expression Analysis The Barcode group demonstrated a set of applications for Boolean values, however, there is another group investigating pathway analysis techniques that could 8 benefit from gene state information. A literature search revealed a group using normalized gene intensities that could be investigated further with Boolean data. The technique employed was built on the concept of existing pathway networks, a predefined set of genes selected from other methods. The method used was a three step strategy of obtaining a predefined set of genes, calculating the activity of that pathway and then identifying the differentially expressed pathways. The results of the research were made available via a webserver. The differential analysis includes an evaluation score based on randomly permutated sample labels. (Tomfhor, 2005) Affymetrix Present/Absent Call There exist many approaches of categorizing probe intensities; the quality function provided by the microarray company Affymetrix is the present/absent score. The quality measure was specifically designed as a quality metric and was not meant to contain biologically relevant information. This score was also provided by our automated workflows, meaning that it was readily available along with the initial data. It was proposed that perhaps this easily obtained score had some usefulness in identifying the gene expressions that are noise (absent) or actual (present). A simple way to determine if the score was a viable candidate was to demonstrate a separation of absent and present across expression values. Our analysis indicates that there was no distinct separation between samples identified as absent and those identified as present, thus confirming that it was indeed a poor candidate to identify on/off. 9 Background Origin and Retrieval of Data Gene expression microarrays can add up to a massive amount of data. There is a data management system available at Mayo Clinic called BORA. BORA stands for Biological Oriented Repository Architecture which is a system designed to integrate multiple genomic data to enable integrative analysis. It is intended to be used by researchers to integrate multiple data types to maximize information cross-analysis. The data already loaded into BORA includes normal, cancer and various disease conditions across more than a hundred tissue descriptions. The tissue terminology was collected by experts within the Mayo Clinic group as well as my own text mining scripts. The Need for Better Summarization, a New CDF The Original Affymetrix CDF Annotation File The information used to aggregate which probes belong together in a probe set is located in the Affymetrix chip definition file (CDF). This file is critical to the normalization step, because it defines which features on the chip belonged to which probe set. A probe set is intended to target a gene within the genome. (“CDF FILE” 2009) The target is determined by a selection based on a fully sequenced and annotated genome. However, the understanding of where genes are located is continually updated. As knowledge of the genome advances, it exposes the fact that some probes target the wrong gene. 10 New CDF: Probe Set Equals a Gene Gene level summarization was initially chosen due to publications that suggested gene expression biomarker panels validated better with an updated definition of probe sets. The decision was based on the popularity in the research community. A validation was performed by looking at 246 Glioblastoma (GBM) & 175 Ovarian samples from the Cancer Genome Atlas (TCGA). (Wang, 2011) They used the digital sequence data as a basis of comparison against the three platforms available; Affymatrix U133, Exon, and Agilent. An updated definition removed ambiguity by rearranging the probes to represent the appropriate gene. Genes with long sequences would receive the greatest benefit. The better selectivity should also increase the ratio between signal and noise. Most often a probe set was selected to represent a gene from an array, whereas gene level summarization removed the process of selection and enforce the same microarray analysis. In addition to rearranging probes to their appropriate gene, some probes were simply removed from analysis based on poor performance. The conclusion was that gene level analysis outperformed the simple, probe set analysis for all three platforms. (Wang, 2011) Figure 3: The better validation was what we were trying to show in the plots. Using a custom gene CDF also did significantly better than averaging multiple probe sets. The probes with high hybridization efficiency “rescued” the probes with low affinity based on the robust averaging method. 11 The data retrieved was already preprocessed through fRMA with an alternative CDF, mapped directly to genes. The two main advantages of the new Brain Array CDF were higher probe sensitivity and probes mapped based on the latest sequence knowledge. (Thompson, Fan 2011). Most of the remapping occurs where previous probes were incorrectly identified in the original CDF created nearly a decade ago. An additional motivation for gene level summarization, which evolved during the analysis, was into wanting to integrate gene expression data across platforms, since that is currently impossible. Introduction to Barcode The Equation to Creating the Model: Barcode is a technique developed by McCall, Zilliox & Izzari (2011) to categorize a single gene expression as being 0 or 1 (akin to on or off) and thus creating a barcode effect in a given sample. The strategy entails a mixed model system, per gene (or probe set) specific to a microarray platform. The estimation process employs a variation of the Probability of Expression (POE) algorithm (Parmigiani, Garrett, Anbazhaghan, Gabrielson, 2002). The modified algorithm attempted to determine the proportion (1-pg) of silenced (noise) normal N(µg, τ 2 g) distribution and expressed uniform ),(~ ),(~ ),(),()1(~| ),(~)|( 2 2 2 2     IG N SUpNpjg Ny g g ggggggg gjgjgijg   Equation 1: yijg is the observed log2 intensity value for the g th gene in the i th sample of j th tissue type. The Θjg is the resulting mixed model of some proportion of a normal distribution N(µg, τ 2 g), considered the “silent” distribution of noise and some proportion of the uniform distribution of true expression U(µg, Sg). 12 U(µg, Sg) distribution from the silenced mean to the saturation point Sg, per gene. All gene expression intensity values, per tissue per sample are represented by yijg, the log2 intensity from fRMA, on gene (g) from sample (i) and tissue type (j) (Zilliox, Irizarry 2007). Thus the collection of Θjg is the average gene intensities per gene in a given tissue. It is assumed that the mean µg comes from a normal distribution N(ξ, λ 2 ) on a global scale external to a specific gene, but rather based on biological constraints. Additionally, τ2g is proposed to exist from an inverse gamma distribution, in the same global sense to account for gene intensity outliers. Estimation of the Noise/Off Distribution Highly diverse microarray data were necessary to obtain good results. The EM algorithm employed estimates the off (noise) distribution. In general, most genes are not expressed in every tissue. There are some genes, often referred to as house-keeping genes that are always expressed. Accounting for these genes can be achieved by using a microarray study performed as a cross-hybridization experiment, to obtain expression values that are specifically background noise. Since it was not possible to estimate all of the parameters simultaneously, the programmatic structure consists of three parts. First, estimate a placeholder threshold via EM, leaving out higher level allowing the model to fit gene-by-gene. Second, estimate the highest/global level parameters (µ & τ) minus the off (noise) distribution. Third, re-estimate the parameters in the first step, using the hierarchical model including the highest/global parameters from the second step. This process then creates a model of parameters, which model the normal distributions that represent the noise otherwise known as off. 13 The Decision Boundary It is from the resulting model, that the barcode value 0 or 1 (on or off respectively) could be calculated for a given probe set in a sample. The 1 or 0 value was obtained in multiple steps. The first step was to determine where the observed log2 intensity values lied in the noise distribution. Every probe set is partnered to an off (noise) distribution in the model. A p-value was calculated on the corresponding model‟s off (noise) distribution for that probe set. The p-value was converted into a Boolean value, by being either above or below a decision boundary. The decision boundary is a tunable parameter. The current decision boundary was calculated to be the number that provided the best performance in a leave-one-out categorization of an unknown tissue sample. The decision value was selected by the value which was able to correctly identify a random sample into its correct tissue/cell type. Methods Barcode Methods & Script Descriptions The available Barcode models within BioConductor (Gentleman, Carey, Bates, Bolstad, Dettling, Dudoit, 2004) were calculated from fRMA normalization with the CDF provided by Affymetrix, thus incompatible with our alternative CDF normalization. Thus it became necessary to contact the original developer for the ability to create new Barcode models that use an alternative CDF. Matt McCall, the original script developer,           otherwise Czif b y z g g g gg g 0 1 ˆ ˆ   Equation 2: Calculating the Barcode value is performed as a logic bound to a constant (C). The value yg is the observed gene intensity, µg & τg are the model mean and variance, respectively. 14 shared various pieces of Rscript code. The collection of scripts needed to be reviewed, connected and edited to fit into a more generalized format that utilized the parallel computational and archival resources available. The entire process was to understand and order each piece, then connect them and ensure the entire pipeline was error free. Input Data Prep The first three scripts rearranged and formatted the input data. These three scripts were provided by Matt McCall, and were specific to the original development system. Thus it was necessary to write new scripts to replace the original scripts. The data existed primarily as flat files. The first script copied the fRMA normalized expression data into a Figure 4: Some formatting is necessary to prepare the various inputs. All of the microarray samples must be normalized in the same way. They are they loaded into a R data structure. Additionally the background samples and the sample annotation are loaded into similar R data structures. The annotation contains the information about tissue and disease for each sample. The various data structures are then merged into a single input file. 15 single R structure, where the rows were the genes, labeled with Entrez Id, and the columns were samples, denoted by a GSM number. The script created a matrix style data frame contained within a single file named e.rda, which was the format necessary for the original algorithm. Since the algorithm is dependent on tissue type, a description of the sample‟s origin was required. The text based data was stored in a single flat file consisting of columns of information about the tissue type for each sample. The second Rscript was converted the flat file of text into a matrix data structure named tab-GPL.rda. The third and final piece of the input was the background expression. A Rscript nearly identical to the first compiled all of the background expression files into a single R data structure called ebg.rda. The three input files previously described were the input for a single initiation script. Unlike the previous steps, this preparation script and most of the subsequent scripts were unaltered from the original. This script creates multiple R structures that contain the intersection between the input samples and the tissue descriptions. The results were R structures that ordered values of expression and tissue information, within a file labeled get-bcparams-input.rda. Round 1: Estimate Distribution without Global Effect The first round of estimation employed the EM algorithm. The EM algorithm is computer code for the Probability of Expression algorithm, designed by Parmigiani. Round 1 estimation was designed to make the per gene estimation without the global factors. This step involved an iterative optimization function used to obtain the most likely normal mean and variance values for the probe set log2 intensity. Like most optimization 16 functions it is susceptible to local minima in the stochastic descent. From a computational perspective, any gene estimate could fail optimization. The code was originally written to detect failed optimization, in those cases the initial parameters were adjusted and rerun. The probability of the estimation failing multiple times was very rare. The resulting data structure from this preliminary estimation was a data structure containing each gene‟s mean and variance of the off (noise) distribution, which is a normal distribution. Computational Speed Up Parallelization of the EM algorithm could be achieved in per-gene calculations. The original scripts were adapted to our multi-node computational resource, which was achieved by leveraging our SGE installment. The script accepts two integers that indicate Figure 5: Model Estimation Workflow. 17 the slice of rows, the genes, which the script computes. In a parallel context this allowed the script to be called and run on individual nodes, each working on a separate slice of the input data. Reading from a single file may act as a limiting factor however, it was not the computationally heavy part of the algorithm, compared to the optimization component, and thus the slowdown created by reading from a single file is minimal handled within SGE optimally. The result of the parallelization was a collection of files. Merging happened in the next script, merge-params-gpl.R, recombine multiple files into a single file, bcparams.rda, in preparation for Round 2. Round 2: Re-estimation taking Global Effects into Account As noted, the basis of the approach was to get a rough estimate, and then obtain the global variables, referred to as hyper parameters, to describe the overall trends within the dataset. The script get-hyper-params-all.R simply followed the equation in obtaining the normal distribution and inverse gamma distribution of the entire dataset. The data were stored in the file hyper_params.rda, which is half of the input for the subsequence script. Nearly identical to the first round script, get-params-r2-with-var-shrink.R accepted both the original input space get-bcparams-input.rda and the hyper parameters to perform estimation under the EM algorithm, with the addition of the hyper parameters. This script follows the same parallel strategy as the first round estimation and results in a similar output, as well as a separately designated directory. The same merge script compiles the multiple files into a single file and the result became the barcode model from which all of the results can be computed. 18 Validation of Barcode Modifications In adapting an already established algorithm it was important to show that no significant changes were made that would invalidate the algorithm. The entire data set was run, normalized by fRMA under the old CDF, to inspect the precision of values outputted, with a new input. A minimal amount of difference was expected, for genes close to the decision boundary. The validation model was done by computing data points that represent probe sets rather than previously being summarized into genes. The validation focused on the consistency of the algorithm. The comparison was a measure of exact point-by-point comparison, therefore if the original algorithm produced a 0 and the modified algorithm produced a 0, it was considered an agreement. The same agreement exists with 1‟s matching in both cases. However, if the original produced a 0 and the modified model produced a 1 or vice versa, it was considered a disagreement. A total of ~6,000 microarray files were barcoded by both models (original & modified). The results from these two models where then compared point-by-point comparison resulted in 89% identical results. That meant that 11% of the data had switched from one result to the other. Determining if the 11% mis-match, is significant is not trivial. It Figure 6: Validation with original barcode. 19 would require having a known truth to compare both models against, which does not currently exist. However it is possible to identify the location of the mis-matches in relation to the decision boundary, which both models share. As a result, there is a peak which can be interpreted as the values “Off originally, On in the modified model”. Our claim for this result is the difference in microarrays used in the initial creation of the barcode model. Genes that were originally located near the decision boundary would be highly sensitive to slight changes, such as additional microarrays added to the original collection. The important point was that there is a subset of genes that are influenced by the input space and our modification, which could represent genes with high variability. Results & Discussion Gene Expression Analysis in Tissues The summarization of gene expression into Boolean values is a means to simplify the complex intensity data into more interpretable values with regards to the biology of a cell. The simpler can then be used in additional layers of analysis easily. (Kauffman, 1974)The complexity quickly increases by combining tissue descriptions, curated pathway networks and Boolean gene expression, but this more comprehensive cross- section can provide additional perspectives into the behavior of gene expression. The analysis was narrowed to 14 tissues; Breast, Cervix, Colon, Fallopian Tube, Gastric Tissue, Kidney, Liver, Lung, Ovary, Pancreas, Prostate, Sigmoid Colon Mucosa, Skin, Tongue Squamous Cells. Since the output values for each gene is a Boolean value, either 0 or 1, summarization of the state of a gene in a tissue is done by computing percentage 20 of time the gene is „on‟ in the set of bio specimens in this tissue. This strategy intends to provide a single value that captures the empirical variability of a gene within a tissue. The on /off state of gene can vary within bio specimens a given tissue type. Thus any gene at 0.5 or 50% would reflect the highest variability. When comparing tissues between themselves or pathways state between tissues, the percentage of gene in „on‟ state is used as its „coordinate‟ in the state space. For instance the state of a pathway in a tissue that includes n genes is defined as a n-dimensional vector of this percentage value. By representing the data in this manner the Euclidean distance equation can be applied, to determine the distance between those two pathways. Note that this distance can be similarly computed between all genes in two tissues to compare them. Comparison between Normal and Cancer Tissue We computed the difference between normal tissue and the cancer tissue. Calculating the Euclidean distance between the two signatures of normal and cancer provides a metric of difference. The result of the distances between normal and cancer are shown in the table 1. According to this method the tissue with the shortest distance Breast - Normal Breast - Cancer Figure 7: Example of Euclidean distance determination. This approach uses the tissue signal as a coordinate in an n th dimensional configuration, where n is the number of genes. 21 between normal and cancer is Gastric tissue, suggesting that the change from an on/off perspective is minimal. From a biological perspective, this indicates that the on/off variance between genes in the two tissues quite similar. The tissue with the largest distance is Lung cancer, suggesting that the disease affects the state of a larger number of genes or impact more dramatically the state of some genes. The tissue that appears to have the least amount of gene expression in common is lung normal from lung cancer. Comparison between Normal Tissues The same analysis was performed across tissue types by computing the distance between two normal tissues. For instance, as expected, the distance between Breast and Ovarian displays the closest distance, suggesting a high degree of similarity between these two tissues. Interestingly it is well know that the cancer state of these two tissues is very similar. On the other hand, Breast is more distant from Cervix and Fallopian Tube Epithelium, than Ovarian, which suggests that our metric does not capture a signature specific to female organs. We also observe that Colon and Lung being the farthest away, among all the tissues listed. Table 2 Normal Vs. Normal Tissue Comparison Breast Cervix Colon FTE Gastric Kidney Liver Lung Ovary Pancs Proste SCM Skin TSC Tissue Cancer Gastric Tissue 18.28577 Tongue Squamous Cells 31.62825 Breast 37.11043 Pancreas 38.12178 Colon 40.73352 Prostate 46.14632 Fallopian Tube Epithelium 52.39634 Kidney 55.13081 Sigmoid Colon Mucosa 56.02789 Skin 65.77019 Cervix 69.26119 Ovary 70.78509 Liver 70.8988 Lung 78.3623 Table 1: Tissue Normal Vs. Cancer 22 Breast 0.00 Cervix 61.37 0.00 Colon 49.11 75.24 0.00 Fallopian Tube 56.82 55.83 71.21 0.00 Gastric Tissue 63.83 32.32 76.28 58.09 0.00 Kidney 67.23 37.51 79.85 60.88 34.42 0.00 Liver 64.83 32.72 76.78 59.05 30.51 33.03 0.00 Lung 72.77 44.51 84.50 66.73 45.25 40.71 43.08 0.00 Ovary 28.90 62.99 48.20 60.83 66.89 69.87 66.63 74.46 0.00 Pancreas 69.21 42.82 81.34 61.14 40.30 34.48 39.80 30.49 71.71 0.00 Prostate 48.81 74.63 40.06 69.39 76.08 79.33 76.62 83.91 48.82 80.67 0.00 Colon Mucosa 57.41 43.48 70.16 51.17 41.36 48.91 46.27 57.23 61.68 51.94 71.21 0.00 Skin 42.98 72.21 32.77 67.62 74.10 77.19 74.45 81.96 44.37 78.62 34.76 68.6 0.00 Tongue 41.99 68.93 49.80 63.51 70.68 74.02 71.06 78.63 41.14 75.57 47.34 65.94 45.71 0.00 Pathway Focused Comparison of Normal versus Cancer Tissue We computed the pathway centric distance between normal and cancer of the same tissue type Pathways were obtained from KEGG (Kanehisa, 2012). All the distances were computed between normal and cancer for each tissue type and for all pathways. This analysis was performed on 185 KEGG pathways across the 14 tissue and 2 disease states resulting in 2,590 distances. How the Mean is Computed This metric provides a means to rank pathways in a similar way to differential analysis. The expected pathways that have the greatest distance would be known cancer pathways. As such, the most commonly first ranked pathway is the KEGG “Pathways in 23 Cancer”, which provides a good affirmation. Additionally, the p53, WNT and MapK signaling pathways are ranked highly as well. These three pathways are commonly known to be associated with cancer. In addition to detecting pathways in cancer that are farthest from normal, the pathways closest to normal are also revealed. We would expect that pathways that act very similar between normal and cancer are fundamental for survival, most likely house-keeping pathways. The pathways with the least amount of distance were typically protein and amino acid metabolism, and sugar degradation. Future Work Application of Other Microarray Platforms This analysis contains a significant amount of statistics based work. This project only entailed a single Affymetrix array; U133-Plus2. This single array was selected due to it being the largest represented array in the GEO database, with 66,894 samples publicly available. Of that collection only samples with Tissue and disease details Mean Rank Score Pathway Name 1.6 PATHWAYS_IN_CANCER 2.9 CYTOKINE_CYTOKINE_RECEPTOR_INTER ACTION 5.6 MAPK_SIGNALING_PATHWAY 6.0 NEUROACTIVE_LIGAND_RECEPTOR_INTE RACTION 6.5 FOCAL_ADHESION 6.6 ENDOCYTOSIS 8.9 CHEMOKINE_SIGNALING_PATHWAY 9.4 REGULATION_OF_ACTIN_CYTOSKELETON 9.5 CELL_ADHESION_MOLECULES_CAMS 11.1 PURINE_METABOLISM 13.1 TIGHT_JUNCTION 19.1 WNT_SIGNALING_PATHWAY 19.9 SPLICEOSOME 20.6 PYRIMIDINE_METABOLISM Table 3 Pathways Normal Vs. Cancer 24 available were used. The first step would be to obtain the rest of the samples and tissue annotations (only possible through literature searching and human selection). That would be followed by applying the whole platform of that data to Barcode model training. Once that is completed, the second desire would be to obtain other platforms, Affymetrix or otherwise. By obtaining multiple platforms, it should be feasible to make gene expression comparisons across microarray platforms. Utilization of More Robust Differential Calculation There is a potential pitfall that biases the Euclidean distance calculations for describing the difference between normal and cancer. The summarization over bio specimen of the same tissue type fails to account for the variance of that percentage. Hence a more complicated, more robust method is proposed by determining the difference through permutation probability. The permutation probability approach would be used to calculate a more representative value for gene behavior across bio specimen. Integration with Archive Resources The Barcode workflow was created to initially from data within BORA and built with the intention to loop back to deposit the Boolean results back into BORA. Currently the data used to create the modified Barcode model is being loaded into the repository, but the resulting barcoded values has yet to be stored. Expansion to Open Source Software One of the most widely used open source software used for pathway analysis is Cystoscape (Shannon, Markiel, Ozier, Baliga, Wang, Ramage, et. al., 2003). Cytoscape is 25 a client software program that contains tools to integrate datasets and interact with graphical representations. A direct link between Cystoscape and BORA has been developed, to leverage the stored data in a databank. Once the Barcoded values of gene states are loaded into BORA, this link could be used to perform more targeted, deeper analysis. 26 References Barrett, T., Troup, D. B., Wilhite, S. E., Ledoux, P., Evangelista, C., Kim, I. F., et al. (2011). NCBI GEO: Archive for functional genomics data sets--10 years on. Nucleic Acids Research, 39(Database issue), D1005-10. doi:10.1093/nar/gkq1184 CDF FILE. (2009, March). In Affymetrix® CDF Data File Format. Retrieved April 10, 2012, from http://www.affymetrix.com/support/developer/powertools/changelog/gcos-agcc/cdf.html .CEL File Extension. (2011, May 18). In FileInfo.com. Retrieved February 22, 2012, from http://www.fileinfo.com/extension/cel Eichinger, L. (2009, January 7). Figure 1: Principle of DNA microarray analysis. In Dictyostelium discoideum Microarray analysis. Retrieved April 5, 2012, from http://www.uni-koeln.de/med- fak/biochemie/transcriptomics/07_analysis.shtml Glass, L. (1975). Classification of biological networks by their qualitative dynamics. Journal of Theoretical Biology, 54(1), 85-107. Glass, L., & Kauffman, S. A. (1973). The logical analysis of continuous, non-linear biochemical control networks. Journal of Theoretical Biology, 39(1), 103-129. Gentleman R.C., Carey V.J., Bates D.M., Bolstad B., Dettling M., Dudoit S., Ellis B., Gautier L., Ge Y., Gentry J., Hornik K., Hothorn T., Huber W., Iacus S., Irizarry R., Leisch F., Li C., Maechler M., Rossini A.J., Sawitzki G., Smith C., Smyth G., Tierney L., Yang J.Y. and Zhang J. (2004) Bioconductor: open software development for computational biology and bioinformatics. Genome Biol. 5(10): R80. Hellwig, B., Hengstler, J. G., Schmidt, M., Gehrmann, M. C., Schormann, W., & Rahnenfuhrer, J. (2010). Comparison of scores for bimodality of gene expression distributions and genome-wide evaluation of the prognostic relevance of high-scoring genes. BMC Bioinformatics, 11, 276. doi:10.1186/1471-2105-11-276 Irizarry, R. A., Bolstad, B. M., Collin, F., Cope, L. M., Hobbs, B., & Speed, T. P. (2003). Summaries of affymetrix GeneChip probe level data. Nucleic Acids Research, 31(4), e15. Irizarry, R. A., Hobbs, B., Collin, F., Beazer-Barclay, Y. D., Antonellis, K. J., Scherf, U., et al. (2003). Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (Oxford, England), 4(2), 249-264. doi:10.1093/biostatistics/4.2.249 Jia, P., Zheng, S., Long, J., Zheng, W., & Zhao, Z. (2011). dmGWAS: Dense module searching for genome-wide association studies in protein-protein interaction networks. Bioinformatics (Oxford, England), 27(1), 95-102. doi:10.1093/bioinformatics/btq615 Kanehisa, M., Goto, S., Sato, Y., Furumichi, M., and Tanabe, M.; KEGG for integration and interpretation of large- scale molecular datasets. Nucleic Acids Res. 40, D109-D114 (2012). Kauffman, S. (1969). Homeostasis and differentiation in random genetic control networks. Nature, 224(5215), 177- 178. Kauffman, S. (1974). The large scale structure and dynamics of gene control circuits: An ensemble approach. Journal of Theoretical Biology, 44(1), 167-190. 27 Kim, J., Patel, K., Jung, H., Kuo, W. P., & Ohno-Machado, L. (2011). AnyExpress: Integrated toolkit for analysis of cross-platform gene expression data using a fast interval matching algorithm. BMC Bioinformatics, 12(75), 1. Retrieved April 5, 2012 Lage, K., Hansen, N. T., Karlberg, E. O., Eklund, A. C., Roque, F. S., Donahoe, P. K., et al. (2008). A large-scale analysis of tissue-specific pathology and gene expression of human disease genes and complexes. Proceedings of the National Academy of Sciences of the United States of America, 105(52), 20870-20875. doi:10.1073/pnas.0810772105 McCall, M. N., & Irizarry, R. A. (2011). Thawing frozen robust multi-array analysis (fRMA). BMC Bioinformatics, 12, 369. doi:10.1186/1471-2105-12-369 McCall, M. N., Uppal, K., Jaffee, H. A., Zilliox, M. J., & Irizarry, R. A. (2011). The gene expression barcode: Leveraging public data repositories to begin cataloging the human and murine transcriptomes. Nucleic Acids Research, 39(Database issue), D1011-5. doi:10.1093/nar/gkq1259 Ochab-Marcinek, A., & Tabaka, M. (2010). Bimodal gene expression in noncooperative regulatory systems. Proceedings of the National Academy of Sciences of the United States of America, 107(51), 22096-22101. doi:10.1073/pnas.1008965107 Parmigiani,G., Garrett,E.S., Anbazhaghan,R. and Gabrielson,E. (2002) A statistical framework for expression-based molecular classification in cancer. J. Roy. Stat. Soc. B (Stat. Meth.), 64, 20. Ponten, F., Gry, M., Fagerberg, L., Lundberg, E., Asplund, A., Berglund, L., et al. (2009). A global view of protein expression in human cells, tissues, and organs. Molecular Systems Biology, 5, 337. doi:10.1038/msb.2009.93 Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Research 2003 Nov; 13(11):2498-504 Subramanian, A., Tamayo, P., Mootha, V. K., Mukherjee, S., Ebert, B. L., Gillette, M. A., et al. (2005). Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences of the United States of America, 102(43), 15545-15550. doi:10.1073/pnas.0506580102 Thompson, Robert, and Fan Meng. "Brain Array." MBNI Microarray Lab. University of Michigan - Ann Arbor, Aug. 2011. Web. 7 Feb. 2012. http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/15.0.0/entrezg.asp Wang, X. V., Verhaak, R. G., Purdom, E., Spellman, P. T., & Speed, T. P. (2011). Unifying Gene Expression Measures from Multiple Platforms Using Factor Analysis. Analysis. PLoS ONE, 6(3). Zhang, Aidong. Advanced Analysis of Gene Expression Microarray Data. Singapore: World Scientific Publishing, 2006. N. pag. Print. Zilliox, M. J., & Irizarry, R. A. (2007). A gene expression bar code for microarray data. Nature Methods, 4(11), 911- 913. doi:10.1038/nmeth1102