This is an Open Access article distributed under the following Assignment of Rights http://www.excli.de/documents/assignment_of_rights.pdf. You are free to copy, distribute and transmit the work, provided the original author and source are credited.

In living systems, RNAs perform important biological functions. The functional form of an RNA frequently requires a specific tertiary structure. The scaffold for this structure is provided by secondary structural elements, which are formed by hydrogen bonds within the molecule. Here, we concentrate on the inverse RNA folding problem. In this problem, an RNA secondary structure is given as a target structure and the goal is to design an RNA sequence whose structure is the same as (or very similar to) the given target structure. Different heuristic search methods have been proposed for this problem. One common feature of these methods is that they use a folding algorithm to evaluate the accuracy of the designed RNA sequence during the generation process. The well-known folding algorithms take O(n^{3}) time, where n is the length of the RNA sequence. In this paper, we introduce a new algorithm called GGI-Fold, based on a multi-objective genetic algorithm and the Gibbs sampling method, for the inverse RNA folding problem. Our algorithm generates a sequence whose structure is the same as or very similar to the given target structure. The key feature of our method is that it never uses any folding algorithm to improve the quality of the generated sequences. We compare our algorithm with RNA-SSD on some biological test samples. In all test samples, our algorithm outperforms RNA-SSD in generating a sequence whose structure is more stable.

RNAs perform a wide range of functions in biological systems. The functional form of an RNA frequently requires a specific tertiary structure. The scaffold for this structure is provided by secondary structural elements, which are formed by hydrogen bonds within the molecule. Therefore, the study and analysis of RNA secondary structures are critical to understanding their functional roles inside the cell (Condon et al., 2004[

One of the most important problems in the RNA field is inverse RNA folding. In this problem, a secondary structure of an RNA is given, and the goal is to find a sequence that folds into the given secondary structure. The inverse RNA folding problem can be used to design non-coding RNAs, which are involved in gene regulation, chromosome replication and RNA modification (Knight, 2003[

RNAinverse, available as a part of the Vienna RNA package, is an original approach to solve this problem (Hofacker et al., 1994[

It should be mentioned that the existing methods use a folding algorithm to evaluate and improve the accuracy and quality of the generated sequences. Employing any folding algorithm requires at least O(n^{3}) time steps, which slows the overall running time of all the proposed methods. Moreover, any algorithm that relies on a specific folding method will be biased toward that method. In this paper, we present a new method that solves this problem without using any folding algorithm. Our new algorithm (GGI-Fold) is based on a multi-objective genetic algorithm and the Gibbs sampling method. First, GGI-Fold designs a sub-sequence for each sub-structure of the target structure using the genetic algorithm. Then all sub-sequences are updated by the Gibbs sampling method. Finally, the sub-sequences are assembled to construct a sequence corresponding to the target structure. In this approach, our main effort is to generate feasible sub-sequences for the sub-structures in such a way that the assembled RNA sequence is likely to fold into the target structure. The GGI-Fold algorithm is implemented and tested on some biological data, and the results are compared with those of the RNA-SSD algorithm.

The rest of this paper is organized as follows. Section 2 presents some basic definitions. Section 3 describes our new method for the inverse RNA folding problem, and Section 4 reports the results. Finally, Section 5 presents the conclusion.

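An RNA molecule is composed of a long, usually single-stranded chain of nucleotide units: Adenine (A), Cytosine (C), Guanine (G) and Uracil (U). An RNA sequence in the 5^{′}-3^{′} direction can be represented as S = s_{1}s_{2}...s_{ℓ}, where |S| = ℓ and s_{i} ∈ {A, C, G, U}.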

The RNA secondary structure is formed by the creation of hydrogen bonds between Watson-Crick complementary bases. A secondary structure can be described as a set T of base pairs (i, j) such that for all (i_{1}, j_{1}), (i_{2}, j_{2}) ∈ T, i_{1} = i_{2} if and only if j_{1} = j_{2} (each base can take part in at most one base pairing). The set T is called a pseudoknot-free structure if all distinct pairs (i_{1}, j_{1}), (i_{2}, j_{2}) ∈ T are either nested (i_{1} < i_{2} < j_{2} < j_{1}) or disjoint (i_{1} < j_{1} < i_{2} < j_{2}), as shown in Figure 1.
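The two conditions above (each base in at most one pair; every two pairs nested or disjoint) can be checked directly. The following is a minimal sketch, not part of GGI-Fold; the function name `is_pseudoknot_free` and the index names `i1`, `j1`, `i2`, `j2` are assumptions introduced for illustration.

```python
from itertools import combinations


def is_pseudoknot_free(pairs):
    """Return True if `pairs` is a valid pseudoknot-free set of base pairs."""
    # Normalise so that every pair is (i, j) with i <= j.
    norm = [(min(i, j), max(i, j)) for i, j in pairs]

    # Each base may take part in at most one base pairing.
    positions = [p for pair in norm for p in pair]
    if len(positions) != len(set(positions)):
        return False

    # After sorting, the first pair of each combination opens first (i1 < i2).
    for (i1, j1), (i2, j2) in combinations(sorted(norm), 2):
        nested = i1 < i2 < j2 < j1
        disjoint = i1 < j1 < i2 < j2
        if not (nested or disjoint):
            return False  # the two pairs cross: a pseudoknot
    return True
```

For example, the pairs (1, 10), (2, 9), (12, 15) pass the check, while the crossing pairs (1, 5), (3, 8) do not. The pairwise test is quadratic in the number of pairs, which is acceptable for a sanity check on short structures.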

A pseudoknot-free secondary structure can be described as a string of balanced parentheses. In this representation, each two paired bases _{i}_{j}_{1}_{2}_{n}_{i}_{1}_{1}_{2}_{2}_{1} and ended in position _{1} and the paired positions started from _{2} and ended in _{2}. Also, _{i} = (
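As an illustration of the balanced-parentheses representation, the sketch below decodes such a string into a list of base pairs with a stack. It assumes the common dot-bracket convention in which `(` and `)` mark paired positions and `.` marks an unpaired position; that convention, the 1-based positions and the function name `parse_dot_bracket` are assumptions and may differ from the notation used in this paper.

```python
def parse_dot_bracket(structure):
    """Convert a dot-bracket string into a sorted list of 1-based base pairs."""
    stack = []   # positions of '(' that are still waiting for a partner
    pairs = []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {pos}")
            # The most recent unmatched '(' pairs with this ')'.
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {pos}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return sorted(pairs)
```

Because a stack always matches a closing parenthesis with the most recent open one, any string accepted by this function describes a pseudoknot-free structure.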

Based on the above discussion, the inverse RNA folding problem can be described as follows: an RNA secondary structure is given as an input (target structure) and the goal is to find an RNA sequence S = s_{1}s_{2}...s_{ℓ} such that its secondary structure is the same as (or very similar to) the given target structure.

In this section, we first explain how the real RNA dataset is reconstructed and then we present the details of our proposed methods.

In order to compare the results of our proposed algorithm with the results of the other existing algorithms, we employ the same dataset of RNA sequences as presented in Andronescu et al. (2004[

To determine the structures corresponding to the RNA sequences, we employ the RNAfold program (available as a part of the Vienna RNA package). This program is based on Zuker's algorithm (Zuker, 1994[

As mentioned, the goal of the inverse RNA folding problem is to design a sequence for the given target structure. In Figure 2_{1}_{2}_{n}

Let Z = (z_{1}, z_{2}, ..., z_{n}) be a list, where z_{i} is a sub-sequence corresponding to the component c_{i}. Let also Z_{j} denote the prefix of Z of length j, that is, Z_{j} = (z_{1}, z_{2}, ..., z_{j}) (Z_{0} is an empty list). The main steps of our algorithm (GGI-Fold) are as follows:

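Step 1. The list of components is sorted in decreasing order of their lengths.

Step 2. For each k, MOGA is performed with inputs c_{k} and Z_{k−1}. MOGA makes a sub-sequence z_{k} according to the component c_{k} and the generated sub-sequences in Z_{k−1}.

Step 3. The list Z is updated by the Gibbs sampling method (see section 3.2.2).

Step 4. All the generated sub-sequences in Z are assembled to construct an RNA sequence for the target structure.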

Because longer sub-sequences are more important than shorter ones and require more effort to evaluate, we consider them first. In the first step of the algorithm, the list of components is therefore sorted in decreasing order of length. In the Gibbs sampling step, a sub-sequence z_{k} is randomly removed from the list Z, and MOGA is performed again (for the component c_{k}) to find a new sub-sequence for c_{k}. This raises the dependency not only between small and large components, but among all of them. Finally, when the algorithm cannot produce a better sub-sequence, the sub-sequences are assembled to make an RNA sequence for the given structure.

The details of the genetic algorithm MOGA (Step 2), Gibbs sampling method (Step 3), and the process of assembling the generated sub-sequences (Step 4) are discussed in the following subsections.

A genetic algorithm is a heuristic search method that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms, which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover (Goldberg, 1989[

In this section, our multi-objective genetic algorithm, MOGA, is introduced to design a sub-sequence for each component of the target structure. This algorithm prevents mis-hybridization while keeping uniform chemical characteristics in the generated sub-sequences. Suppose that the component c_{k} and the list Z_{k−1} are given to MOGA as inputs. In the population, each individual is an RNA sequence of length ℓ_{k}, where ℓ_{k} is the length of the component c_{k}. MOGA takes the component c_{k} and the list Z_{k−1} and generates the best sub-sequence (based on the fitness function) with respect to Z_{k−1}; the newly generated sub-sequence is added to Z_{k−1} to produce the list Z_{k}. The stopping condition is a maximum number of regenerations of the population. We use the conventional genetic operations such as

Our genetic algorithm employs a multi-objective fitness function to evaluate the available solutions in the current population. The fitness function is the summation of five different measures. Four of these measures are introduced by Shin et al. (2005[

ƒ_{AU_content} is a partial fitness function for counting the amount of A or U nucleotides in the sequence, which can be used to control the percentage of

ƒ_{Similarity} is a partial fitness function for preventing undesired hybridization by keeping the sequences as unique as possible in order to improve the accuracy of the generated sequences.

ƒ_{Continuity} is a partial fitness function for preventing long runs of the same base in a sequence, in order to obtain biologically relevant sequences.

ƒ_{Hybridization} is a partial fitness function for preventing the potential hybridization in a loop. This can be done in a similar manner as ƒ_{Similarity}, where a sub-sequence is checked against the reverse complement of the other sub-sequences.

The last part of the multi-objective fitness function is the minimum free energy (ƒ_{MFE}

ƒ = w_{1} × ƒ_{AU_content} + w_{2} × ƒ_{Similarity} + w_{3} × ƒ_{Continuity} + w_{4} × ƒ_{Hybridization} + w_{5} × ƒ_{MFE},

where the w_{i} are the weights of each part in the fitness function. Note that the best fitness value is zero; therefore, the genetic algorithm tries to minimize the fitness function.
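The weighted sum can be sketched as follows. The exact definitions of the partial fitness functions are taken from the cited work and are not reproduced in this text, so `au_content_penalty` and `continuity_penalty` below are simplified, hypothetical stand-ins (their names, the `target_fraction` and `max_run` parameters, and the weights in the usage note are all assumptions). Only `weighted_fitness` mirrors the formula itself.

```python
def au_content_penalty(seq, target_fraction=0.5):
    """Hypothetical stand-in: distance of the A/U fraction from a target value."""
    if not seq:
        return 0.0
    au = sum(1 for base in seq if base in "AU")
    return abs(au / len(seq) - target_fraction)


def continuity_penalty(seq, max_run=3):
    """Hypothetical stand-in: count positions that extend a run beyond max_run."""
    penalty, run = 0, 1
    for prev, cur in zip(seq, seq[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            penalty += 1
    return penalty


def weighted_fitness(partials, weights):
    """Weighted sum of partial fitness values; lower is better, zero is best."""
    if len(partials) != len(weights):
        raise ValueError("need exactly one weight per partial fitness value")
    return sum(w * f for w, f in zip(weights, partials))
```

A call such as `weighted_fitness([au_content_penalty(s), continuity_penalty(s)], [1.0, 0.5])` combines two of the five terms; the remaining terms would be added to both lists in the same way. Keeping every partial value non-negative is what makes zero the best attainable fitness.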

In addition to processing the components in decreasing order of their length, the Gibbs sampling method lets us process the components in a random order when making sub-sequences. At first, the sub-sequences in the list Z are assembled to produce a sequence S, and a sub-sequence z_{k} is randomly removed from Z to obtain the list Z^{′}. Then, MOGA is performed with inputs c_{k} and Z^{′} to generate a new sub-sequence z^{′}_{k} corresponding to the component c_{k}, which is added to Z^{′}. Then the new list Z^{′} is assembled to produce a new sequence S^{′}. Later, the minimum free energies of the sequences S and S^{′} over the target structure are computed. If the minimum free energy of S^{′} is less than that of S, then Z is replaced by Z^{′}. This process is repeated for 2 ×
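The update loop described above can be sketched as a simple accept-if-better resampling procedure. This is a minimal sketch under stated assumptions, not the authors' implementation: `regenerate` stands in for a call to MOGA, `assemble` for the assembling step, and `energy` for the free energy of a sequence evaluated over the fixed target structure. All three are hypothetical callbacks supplied by the caller, and the number of `rounds` is left as a parameter.

```python
import random


def gibbs_refine(subseqs, regenerate, assemble, energy, rounds, rng=None):
    """Resample one sub-sequence at a time, keeping a change only if energy drops."""
    rng = rng or random.Random()
    current = list(subseqs)
    current_energy = energy(assemble(current))
    for _ in range(rounds):
        # Pick a random sub-sequence to remove and rebuild.
        k = rng.randrange(len(current))
        others = current[:k] + current[k + 1:]
        # `regenerate` sees the index of the removed component and the rest.
        candidate = current[:k] + [regenerate(k, others)] + current[k + 1:]
        candidate_energy = energy(assemble(candidate))
        # Accept the new list only when the assembled sequence has lower energy.
        if candidate_energy < current_energy:
            current, current_energy = candidate, candidate_energy
    return current
```

Because a candidate is accepted only when its energy is strictly lower, the energy of the returned list never exceeds that of the input list. Passing a seeded `random.Random` as `rng` makes a run reproducible.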

As mentioned, the sub-sequences corresponding to the components are generated by our genetic algorithm. Their quality is then improved by the Gibbs sampling method. To obtain the final result, these sub-sequences are simply placed at the positions of their corresponding components. This process is illustrated in Figure 3
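A minimal sketch of the assembling step is given below. It assumes that each component is represented as a list of 1-based positions in the target structure and that its sub-sequence supplies one base per position, in the same order; this representation and the function name `assemble_sequence` are assumptions made for illustration and may differ from the component encoding used by GGI-Fold.

```python
def assemble_sequence(components, subseqs, length):
    """Place each sub-sequence at the positions of its component."""
    seq = [None] * length
    for positions, sub in zip(components, subseqs):
        if len(positions) != len(sub):
            raise ValueError("sub-sequence length does not match its component")
        for pos, base in zip(positions, sub):
            if seq[pos - 1] is not None:
                raise ValueError(f"position {pos} is covered by two components")
            seq[pos - 1] = base
    if None in seq:
        raise ValueError("some positions are not covered by any component")
    return "".join(seq)
```

For instance, a stem occupying positions 1, 2, 7, 8 with sub-sequence `GCGC` and a loop occupying positions 3 to 6 with sub-sequence `AAUU` assemble into `GCAAUUGC`.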

The GGI-Fold algorithm is implemented in C# on the .NET Framework 3.5. We run GGI-Fold on real RNA sequences taken from the RNA family database. The results of the GGI-Fold algorithm on these sequences are compared to the results of RNA-SSD (Andronescu et al., 2004[

Table 2

In this paper, the inverse RNA folding problem is treated as a multi-objective optimization problem. We used a genetic algorithm and the Gibbs sampling method to address it. We employed the genetic algorithm in an unusual way: instead of considering a population of chromosomes, each representing a whole sequence, we broke the structure down into components and used the genetic algorithm to generate a good sub-sequence for each component. In this way, the generated sub-sequences are as far from each other as possible. Also, instead of using the structural distance as a single measurement, we employed five different measures to obtain more reliable results. It should be mentioned that no folding algorithm was employed to evaluate the accuracy of the generated sequences. As mentioned by Aguirre-Hernández et al. (2007[

This work is dedicated to Prof. Hayedeh Ahrabian, who passed away in July 2011.

This research is supported by the Institute for Studies in Theoretical Physics and Mathematics (IPM) with grant number: 89680067.