This is an Open Access article distributed under the following Assignment of Rights http://www.excli.de/documents/assignment_of_rights.pdf. You are free to copy, distribute and transmit the work, provided the original author and source are credited.

Metabolic syndrome (MS) is a condition that predisposes individuals to the development of cardiovascular diseases and type 2 diabetes mellitus. This study used a cross-sectional data set of 15,365 participants residing in metropolitan Bangkok who had received an annual health checkup in 2007. Individuals were classified as MS or non-MS according to the International Diabetes Federation criteria using a BMI cutoff of ≥ 25 kg/m^{2} plus two or more MS components. This study explores the utility of quantitative population-health relationship (QPHR) modeling for predicting MS status and for discovering variables that frequently occur together. The former was achieved by decision tree (DT) analysis, artificial neural network (ANN), support vector machine (SVM) and principal component analysis (PCA) while the latter was obtained by association analysis (AA). DT outperformed both ANN and SVM in MS classification, with an accuracy of 99 % compared to 98 % and 91 % for ANN and SVM, respectively. Furthermore, PCA effectively separated individuals with and without MS, as observed from the scores plot. Moreover, AA was applied to individuals with MS in order to elucidate pertinent rules from MS components that frequently occur together, which included TG+BP, BP+FPG and TG+FPG, where TG, BP and FPG correspond to triglyceride, blood pressure and fasting plasma glucose, respectively. QPHR was demonstrated to be useful in predicting the MS status of individuals from an urban Thai population. Rules obtained from AA provided general guidelines (i.e. co-occurrences of TG, BP and FPG) that may be used in the prevention of MS in at-risk individuals.

Metabolic syndrome (MS) is defined as a group of metabolic abnormalities comprising central obesity, dyslipidemia, hyperglycemia, and hypertension (Babu and Fogelfeld, 2006[

Data mining is a robust tool for extracting useful knowledge from large quantities of data and can be readily applied to clinical data to help physicians in the decision-making process of diagnosis, prognosis, and treatment of patients. Data mining techniques such as artificial neural network (ANN), support vector machine (SVM), multiple linear regression (MLR), principal component analysis (PCA), self-organizing map (SOM), decision tree (DT) and association analysis (AA) have been successfully used in clinical medicine for predictive modeling of diseases (Chang et al., 2011[

A cross-sectional data set comprising 15,365 individuals who received an annual health check-up in 2007 at the Faculty of Medical Technology, Mahidol University in Bangkok, Thailand was previously reported by Worachartcheewan et al. (2010). MS was identified according to the IDF criteria using BMI ≥ 25 kg/m^{2} (5,638 individuals from the total population) as the first component along with two or more of the following components:

blood pressure (BP) ≥ 130/85 mmHg or previously diagnosed hypertension,

fasting plasma glucose (FPG) ≥ 100 mg/dL or previously diagnosed type 2 DM,

triglyceride (TG) ≥ 150 mg/dL or specific treatment for triglyceride abnormality,

high-density lipoprotein cholesterol (HDL-C) < 40 mg/dL in males or < 50 mg/dL in females or specific treatment for HDL-C abnormality.

Individuals with BMI ≥ 25 kg/m^{2} were selected for QPHR study (encompassing a total of 5,638 individuals) as they met the first requirement of the IDF criteria of central obesity. From this subset of data, individuals with 2 or more MS components were identified as MS (2,991 individuals: 1,598 males and 1,393 females) while healthy individuals were classified as non-MS (2,647 individuals: 1,063 males and 1,584 females).
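The classification rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure actually used to label the data set: the record layout, the field names and the treatment flags are hypothetical, and the BP criterion is assumed to be met when either the systolic or the diastolic value reaches its threshold.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Checkup:
    """One health check-up record (field names are illustrative assumptions)."""
    sex: str      # "M" or "F"
    bmi: float    # kg/m^2
    sbp: float    # systolic blood pressure, mmHg
    dbp: float    # diastolic blood pressure, mmHg
    fpg: float    # fasting plasma glucose, mg/dL
    tg: float     # triglyceride, mg/dL
    hdl_c: float  # HDL cholesterol, mg/dL
    treated_bp: bool = False   # previously diagnosed hypertension
    treated_dm: bool = False   # previously diagnosed type 2 DM
    treated_tg: bool = False   # specific treatment for TG abnormality
    treated_hdl: bool = False  # specific treatment for HDL-C abnormality


def count_components(r: Checkup) -> int:
    """Count how many of the four MS components (BP, FPG, TG, HDL-C) are met."""
    hdl_cutoff = 40.0 if r.sex == "M" else 50.0
    flags = [
        r.sbp >= 130 or r.dbp >= 85 or r.treated_bp,  # assumption: either value suffices
        r.fpg >= 100 or r.treated_dm,
        r.tg >= 150 or r.treated_tg,
        r.hdl_c < hdl_cutoff or r.treated_hdl,
    ]
    return sum(flags)


def classify(r: Checkup) -> Optional[str]:
    """Return "MS" or "non-MS"; None when BMI < 25 (outside the QPHR subset)."""
    if r.bmi < 25:
        return None
    return "MS" if count_components(r) >= 2 else "non-MS"
```

For example, a male with BMI 27 kg/m^{2}, BP 135/80 mmHg and FPG 105 mg/dL meets two components and would be labeled MS under this sketch.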

Determining factors of MS (i.e. BP, FPG, TG and HDL-C) were stratified according to guidelines of the WHO (Wilson, 2009). Individuals with BMI ≥ 25 kg/m^{2} were further divided into six BMI groups, separated by gender (male and female) and stratified into four age groups as presented in Table 1.

Independent variables were adjusted to comparable scale by standardizing variables to zero mean and unit variance. Standardization of variables was performed as described by the following equation:

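$$x_{ij}^{stn} = \frac{x_{ij} - \bar{x}_{j}}{s_{j}} \qquad (1)$$

where x_{ij}^{stn} is the standardized value, x_{ij} is the value of each sample, \bar{x}_{j} is the mean of variable j and s_{j} is the standard deviation of variable j.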
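As an illustration of scaling to zero mean and unit variance, the following minimal Python sketch standardizes each column of a small table. It assumes the population standard deviation (division by N); the function name and the handling of constant columns are choices made for this example, not details taken from the study.

```python
import math


def standardize(rows):
    """Scale every column of `rows` (a list of equal-length lists) to zero mean and unit variance."""
    n = len(rows)
    n_vars = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(n_vars)]
    # Population standard deviation (assumption: divide by N, not N - 1).
    sds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n)
        for j in range(n_vars)
    ]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(row[j] - means[j]) / sds[j] if sds[j] > 0 else 0.0 for j in range(n_vars)]
        for row in rows
    ]
```

After this transformation, variables measured on different scales (for instance TG in mg/dL and BP in mmHg) contribute comparably to distance-based methods such as SVM and PCA.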

Health parameters from annual health check-ups of an urban Thai population served as the data set for multivariate analysis where individuals were classified as MS or non-MS by means of several data mining techniques.

Decision tree (DT) is a supervised technique for classifying data into categorical classes of interest, and the knowledge gained from the learning process is summarized in the form of if-then rules. DT finds the most important independent variable and sets it as the root node, which is followed by a series of bifurcating nodes where decision criteria are evaluated. This is performed iteratively until leaf or terminal nodes are reached, each of which is assigned one of the possible class labels of the dependent variable (i.e. MS or non-MS). This study employs the J48 algorithm (Witten et al., 2011[
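To make the choice of the root node concrete, the sketch below ranks categorical variables by gain ratio, the entropy-based criterion associated with C4.5-style trees. It is a simplified illustration only: it ignores continuous thresholds, missing values and pruning, and the variable names and toy data in the usage note are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of categorical labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, normalized by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the class labels after the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the partition sizes
    return info_gain / split_info if split_info > 0 else 0.0


def best_root(table, labels):
    """Return the name of the variable with the highest gain ratio."""
    return max(table, key=lambda name: gain_ratio(table[name], labels))
```

With a toy table in which a binned TG variable separates the two classes perfectly and gender does not, `best_root` selects TG as the root node.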

Artificial neural network (ANN) is a data mining technique that functions in a manner similar to the learning process of neurons in the human brain. An ANN essentially comprises three layers of nodes: input, hidden and output layers (Zupan and Gasteiger, 1999[

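Support vector machine (SVM) is a statistical learning method developed by Vapnik and co-workers (Cortes and Vapnik, 1995). In classification, the coefficients α_{i} are obtained by maximizing the Lagrangian

$$\sum_{i=1}^{N} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_{i} \alpha_{j} y_{i} y_{j} K(x_{i}, x_{j})$$

with constraints 0 ≤ α_{i} ≤ C and Σ_{i} α_{i} y_{i} = 0,

where y_{i} is the class label, x_{i} is a set of descriptors, and α_{i} is the Lagrange multiplier of sample i.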

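In SVM regression, the decision function used for predicting or approximating the target function is as follows:

$$f(x) = \sum_{i=1}^{N} \alpha_{i} K(x, x_{i}) + b$$

where α_{i} is a real value, x_{i} is a training sample, K is the kernel function and b is the bias term.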

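Linear and non-linear regressions approximate the function by minimizing the regularized risk function

$$R(C) = C \frac{1}{N} \sum_{i=1}^{N} L_{\varepsilon}(f(x_{i}), y_{i}) + \frac{1}{2} \lVert w \rVert^{2}$$

where L_{ε}(f(x), y) = max(0, |f(x) − y| − ε) is the ε-insensitive loss function, w is the weight vector and C is the regularization constant.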

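The three major SVM kernels are the linear, polynomial and radial basis function (RBF) kernels.

The linear kernel is defined by the following equation:

$$K(x_{i}, x_{j}) = x_{i} \cdot x_{j}$$

where x_{i} and x_{j} are input vectors.

The polynomial kernel is described by the following equation:

$$K(x_{i}, x_{j}) = (x_{i} \cdot x_{j} + 1)^{d}$$

where d is the degree of the polynomial.

The radial basis function kernel is defined by the following equation:

$$K(x_{i}, x_{j}) = \exp(-\gamma \lVert x_{i} - x_{j} \rVert^{2})$$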
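For readers who prefer code to notation, the three kernels named above can be sketched as plain Python functions. The forms shown are the common textbook ones and are an assumption here; in particular, the constant term and default degree of the polynomial kernel and the default γ are illustrative choices, not values confirmed by the study.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """K(u, v) = u . v"""
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    """K(u, v) = (u . v + coef0) ** degree (assumed textbook form)."""
    return (dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=1.0):
    """K(u, v) = exp(-gamma * ||u - v||^2)"""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)
```

The RBF kernel equals 1 for identical inputs and decays toward 0 as the inputs move apart, with γ controlling how quickly the similarity falls off.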

Principal component analysis (PCA) was performed using The Unscrambler software package, version 9.6 (Camo Software AS, Norway). Metabolic parameters were used as independent variables while the MS status was used as the dependent variable. Input variables were standardized as described by Eq. (1). The optimal number of PCs was determined according to the method of Haaland and Thomas (1988[

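The number of PCs was selected on the basis of the predicted residual error sum of squares (PRESS):

$$PRESS = \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}$$

where y_{i} is the actual value and \hat{y}_{i} is the predicted value of sample i.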

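Association analysis (AA) was performed using SPSS Clementine, version 11.1 (SPSS Inc., USA). AA is a data mining technique that discovers unknown relationships among items by searching for those that frequently occur together (Wang et al., 2004). Association rules are assessed by their support and confidence against the minimum thresholds min_{sup} and min_{conf}. Let I = {i_{1}, i_{2}, i_{3}, . . ., i_{m}} be a set of items where each item represents a unique literal. A transaction database D is composed of transactions T, each of which contains a set of items such that T ⊆ I. An association rule takes the form X ⇒ Y, where X ⊂ I, Y ⊂ I and X ∩ Y = ∅.

The probability that a transaction in D contains both X and Y is the support of the rule:

$$support(X \Rightarrow Y) = P(X \cup Y)$$

Furthermore, the probability that a transaction in D containing X also contains Y is the confidence of the rule:

$$confidence(X \Rightarrow Y) = P(Y \mid X) = \frac{support(X \cup Y)}{support(X)}$$

Rules that satisfy both min_{sup} and min_{conf} are retained as strong rules.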
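Support and confidence can be computed directly from a list of transactions, as in the following minimal Python sketch. It assumes the standard definitions (support as the fraction of transactions containing an itemset, confidence as a conditional frequency); the function names are chosen for this example and the four transactions in the usage note are made-up data, not records from the study.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing `antecedent` that also contain `consequent`."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante > 0 else 0.0


def is_strong_rule(transactions, antecedent, consequent, min_sup, min_conf):
    """Check a rule against minimum support and minimum confidence thresholds."""
    rule_support = support(transactions, set(antecedent) | set(consequent))
    rule_confidence = confidence(transactions, antecedent, consequent)
    return rule_support >= min_sup and rule_confidence >= min_conf
```

With the hypothetical transactions `[{"TG", "BP"}, {"TG", "BP", "FPG"}, {"BP", "FPG"}, {"TG"}]`, the rule TG ⇒ BP has a support of 0.5 and a confidence of 2/3, since two of the three transactions containing TG also contain BP.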

Data sampling was performed by separating the data set into two subsets: (i) training set and (ii) 10-fold cross-validation (CV) testing set. 10-fold CV essentially separates the data into ten groups, leaves one group out as the testing set and uses the remaining nine groups as the training set. This process was repeated iteratively until all groups had a chance to be used as the testing set.
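The 10-fold CV scheme described above can be sketched as an index generator. This is a generic illustration, not the routine used by the software in this study; the shuffling seed and the interleaved fold assignment are assumptions of the example.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # k groups of near-equal size
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test
```

Each sample appears in exactly one testing fold, so the performance averaged over the ten folds uses every individual once for testing and nine times for training.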

Seven statistical parameters were employed for evaluating the predictive power of the models, including root mean squared error, sensitivity, specificity, accuracy, positive predictive value (PPV) and negative predictive value (NPV) (Kuo et al., 2001[

Root mean square error (RMSE) was used as a measure of the predictive error of the model and was calculated as follows:

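$$RMSE = \sqrt{\frac{\sum_{i=1}^{n} (\hat{y}_{i} - y_{i})^{2}}{n}}$$

where \hat{y}_{i} is the predicted value, y_{i} is the actual value and n is the number of samples.

Sensitivity, specificity, accuracy, PPV and NPV were calculated as follows:

$$Sensitivity = \frac{TP}{TP + FN}$$

$$Specificity = \frac{TN}{TN + FP}$$

$$Accuracy = \frac{TP + TN}{TP + TN + FP + FN}$$

$$PPV = \frac{TP}{TP + FP}$$

$$NPV = \frac{TN}{TN + FN}$$

where TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.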
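The evaluation measures named in this section can be computed from predicted and actual labels as in the sketch below. It assumes the standard confusion-matrix definitions with MS treated as the positive class; the function names and the convention of returning 0.0 for an empty denominator are choices made for this example.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equal-length numeric sequences."""
    n = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)


def confusion_counts(actual, predicted, positive="MS"):
    """Return (TP, TN, FP, FN) with `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        elif a == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def classification_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity, accuracy, PPV and NPV from confusion-matrix counts."""
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Reporting sensitivity and specificity alongside accuracy matters here because the MS and non-MS groups are of similar but not identical size, and the two kinds of misclassification carry different clinical consequences.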

The study population comprised 15,365 participants, of whom 6,005 (39 %) were males and 9,360 (61 %) were females. This population was previously classified into MS and non-MS groups based on a BMI cut-off of ≥ 25 kg/m^{2} following the IDF criteria (Alberti et al., 2009). Of the 5,638 individuals with BMI ≥ 25 kg/m^{2}, 2,991 had MS while 2,647 did not. The clinical and biochemical features of MS and non-MS groups in the Thai population were stratified and summarized in Table 1.

QPHR modeling is a multivariate approach for predicting MS status as a function of health parameters. Thus, the development of QPHR models essentially involves the correlation of biomedical parameters with their respective MS status. Prior to multivariate analysis, the independent variables were pre-processed to comparable scales by means of standardization using WEKA, version 3.4.5. Such standardization of variables was performed according to Eq. (1) in order to scale variables to zero mean and unit variance. In this study, several data mining techniques (i.e. DT, PCA, ANN and SVM) were employed for identifying MS in the investigated Thai population.

Decision tree or DT displayed accuracies of 99.98 % and 98.86 % for the training set and the 10-fold CV set, respectively, as shown in Table 2

ANN parameters were tuned empirically. The optimal values for the number of hidden nodes, learning epochs, learning rate and momentum were found to be 7, 9500, 0.2 and 0.5, respectively. Statistical parameters for assessing the predictive performance of QPHR models are presented in Table 2.
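To illustrate the roles of the learning rate and momentum, the following sketch applies the classical momentum update to a list of weights. It is an assumption that the ANN software used this exact form of the rule; the function name is hypothetical, and the defaults simply mirror the optimal values reported above.

```python
def momentum_step(weights, gradients, velocity, learning_rate=0.2, momentum=0.5):
    """One gradient-descent step with classical momentum.

    The new step is a fraction (`momentum`) of the previous step plus a move
    against the gradient scaled by `learning_rate`.
    """
    new_velocity = [
        momentum * v - learning_rate * g for v, g in zip(velocity, gradients)
    ]
    new_weights = [w + dv for w, dv in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w, v = [0.0], [0.0]
for _ in range(100):
    grad = [2.0 * (w[0] - 3.0)]
    w, v = momentum_step(w, grad, v)
```

The learning rate sets the size of each correction, while the momentum term carries part of the previous correction forward, which damps oscillation and can speed up convergence on smooth error surfaces.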

In order to achieve maximal performance, SVM parameters (i.e. C and γ) were optimized by a global grid search over the range from 2^{-15} to 2^{15} using a step size of 2^{2}. Results from the global grid search indicated that the optimal C and γ were 2^{11} and 2^{3}, respectively. Subsequently, a local grid search was performed by refining the search to regions in the vicinity of the optimal values from the global grid search: the region from 2^{9} to 2^{13} was investigated for the C parameter while the region from 2^{1} to 2^{5} was investigated for the γ parameter, using step sizes of 2^{0.25}. Results from the local grid search indicated that the optimal values for C and γ were 2^{12.5} and 2^{3}, respectively. Statistical parameters for assessing the predictive performance of the QPHR models are presented in Table 2.
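The two-stage search can be sketched as follows, assuming the two tuned parameters are the cost C and the kernel parameter γ. The `score` argument is a hypothetical placeholder for whatever performance estimate is used (for example, 10-fold CV accuracy); it is not an interface of the software employed in the study.

```python
def power_of_two_grid(start_exp, stop_exp, step_exp):
    """Return [2**start_exp, ..., 2**stop_exp] with exponents spaced by step_exp."""
    n_steps = int(round((stop_exp - start_exp) / step_exp))
    return [2.0 ** (start_exp + i * step_exp) for i in range(n_steps + 1)]


def grid_search(c_values, gamma_values, score):
    """Return the (C, gamma) pair with the highest value of score(C, gamma)."""
    candidates = ((c, g) for c in c_values for g in gamma_values)
    return max(candidates, key=lambda pair: score(*pair))


# Global grid: exponents -15, -13, ..., 15 (a step of 2 in the exponent).
global_grid = power_of_two_grid(-15, 15, 2)

# Local grids: finer steps of 0.25 in the exponent around the global optimum.
local_c_grid = power_of_two_grid(9, 13, 0.25)
local_gamma_grid = power_of_two_grid(1, 5, 0.25)
```

Searching coarsely first and then refining keeps the number of model fits manageable: a single grid covering exponents from -15 to 15 in steps of 0.25 would need 121 values per parameter, whereas the global and local grids here contain 16 and 17 values, respectively.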

Three-dimensional displays of the PCA scores plots are shown from the 120° (Figure 1A-1D

Association analysis (AA) was used to discover association rules that elucidate frequently co-occurring metabolic abnormalities leading to MS. Binning was performed on the health parameters by transforming quantitative values into qualitative values. In particular, the values of each variable were stratified into several value ranges (Table 1

Quantitative population-health relationship (QPHR) modeling is proposed herein for elucidating the relationship between the biomedical parameters of individuals and their metabolic syndrome status. Such a QPHR model has useful implications for clinical applications in diagnosis (Firouzi et al., 2007[

A statistical summary of the overall predictive performance of the data mining methods employed in this study, namely DT, SVM and ANN, is presented in Table 2

AA has previously been used in clinical diagnosis for the discovery of risk factors that are associated with the development of diseases such as diabetes (Quentin-Trautvetter et al., 2002[

The findings suggest that data mining methods (i.e. DT, ANN, SVM and PCA) are robust for identifying and classifying individuals with or without MS in an urban Thai population. DT was the best performing method, with accuracies of 99.98 % for the training set and 98.86 % for 10-fold CV. Furthermore, AA provided pertinent information on common MS components (i.e. triglyceride levels, systolic and diastolic blood pressure and fasting plasma glucose) that frequently occur together. The association rules derived from these MS components provide general guidelines that may potentially be used in preventing MS in at-risk individuals, a condition that predisposes them to the development of CVD and type 2 DM.

A.W. is supported by the Royal Golden Jubilee (Ph.D.) scholarship of the Thailand Research Fund under the supervision of V.P. This research project is supported by the Office of the Higher Education Commission and Mahidol University under the National Research Universities Initiative. We thank the Center of Medical Laboratory Services and Mobile Health Unit of the Faculty of Medical Technology for the data set used in this study.

Chanin Nantasenamat and Virapong Prachayasittikul (Department of Clinical Microbiology and Applied Technology, Faculty of Medical Technology, Mahidol University, Bangkok 10700, Thailand; phone: +66 2 441 4376, Fax: +66 2 441 4380; virapong.pra@mahidol.ac.th) contributed equally as corresponding authors.