Darius Dziuda, Central Connecticut State University
Dr. Darius M. Dziuda holds a Ph.D. in Computer Science from Warsaw University of Technology and has extensive academic and biotech experience in data mining and biomarker discovery. His research and professional activities have focused on efficient data mining of biomedical data and on the identification of small, accurate genomic, proteomic, and metabolomic multivariate markers for medical diagnosis, prognosis, individualized medicine, and drug discovery. He is a consultant in bioinformatics and the author of the MbMD data mining software system for biomarker discovery, which has been used successfully in multiple projects in the USA and in Europe. Currently, at Central Connecticut State University (CCSU), Dr. Dziuda is focusing on developing and teaching new online courses on Data Mining for Genomics and Proteomics and on Biomarker Discovery. CCSU is the only university in the world offering an online Master of Science degree in data mining. Dr. Dziuda is also a co-author of thirteen US patent applications in the area of bioinformatics and drug discovery. His recent and ongoing collaborations include research projects with Baylor College of Medicine and Virginia Bioinformatics Institute.
Multivariate Biomarkers Discovery
Darius M. Dziuda, Central Connecticut State University
The life sciences are rapidly changing from disciplines that dealt with relatively small data sets into research areas ‘bombarded’ with very large ones. As a result of research sparked by the Human Genome Project, we now have a growing library of organisms whose genomes have been sequenced. Data sets generated by current microarray technologies consist of tens of thousands of gene expression variables. When protein chip technology matures, we may even see data sets with more than a million variables. The traditional “one-gene-at-a-time”, or univariate, approach, which has dominated the life sciences for a very long time, is no longer sufficient. Different approaches are necessary, and multivariate analysis should become standard. There is nothing wrong with univariate analysis, but if the research stops there (and this, unfortunately, is still the case in many studies), the huge amount of generated data may be heavily underused, and potentially important biomedical knowledge contained in it may never be extracted.
Biomarker discovery means selecting an optimal subset of variables which together, as a set, can significantly differentiate the phenotypic classes of interest. Because the number of variables, p, is large, an exhaustive search that would guarantee finding the best subset cannot be implemented: the search space contains 2^p candidate subsets, so its size is O(2^p). More generally, many problems related to feature selection have been shown to be NP-hard. Because of these difficulties, many studies reported in the literature largely neglect this step and select, more or less arbitrarily, the features then used to build classification models. The usual approach is to produce an ordered list of features (using simple univariate methods such as ANOVA) and then to use some number of features from the top of the list. Such a univariate approach not only neglects correlations between variables but also usually removes important discriminatory information from consideration.
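To make the contrast concrete, the following is a minimal sketch, not the method used in the presentation or in MbMD. It assumes a hypothetical toy data set of three binary features in which the class is the exclusive-or of the first two, and it uses two stand-in criteria chosen only for illustration: a difference of class means as the univariate score, and the resubstitution accuracy of a majority-vote lookup table as the multivariate score. All function names are illustrative.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical toy data: rows are samples, columns are binary features a, b, c.
# The class label is a XOR b; feature c merely agrees with the label in 6 of 8 samples.
X = [
    (0, 0, 0), (0, 0, 0),
    (0, 1, 1), (0, 1, 1),
    (1, 0, 1), (1, 0, 0),
    (1, 1, 0), (1, 1, 1),
]
y = [0, 0, 1, 1, 1, 1, 0, 0]


def search_space_size(p):
    """Number of feature subsets (including the empty one) for p variables."""
    return 2 ** p


def univariate_score(X, y, j):
    """Absolute difference between the two class means of feature j."""
    ones = [row[j] for row, label in zip(X, y) if label == 1]
    zeros = [row[j] for row, label in zip(X, y) if label == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))


def subset_accuracy(X, y, subset):
    """Resubstitution accuracy of a majority-vote lookup table built on `subset`.

    This is an optimistic toy criterion; a real study would estimate
    performance on data not used for training.
    """
    votes = defaultdict(Counter)
    for row, label in zip(X, y):
        votes[tuple(row[j] for j in subset)][label] += 1
    correct = sum(max(counts.values()) for counts in votes.values())
    return correct / len(y)


def top_k_univariate(X, y, k):
    """Rank features one at a time and keep the k highest-scoring ones."""
    p = len(X[0])
    ranked = sorted(range(p), key=lambda j: univariate_score(X, y, j), reverse=True)
    return tuple(sorted(ranked[:k]))


def best_subset_exhaustive(X, y, k):
    """Score every subset of size k; feasible only for very small p."""
    p = len(X[0])
    return max(combinations(range(p), k), key=lambda s: subset_accuracy(X, y, s))


if __name__ == "__main__":
    print("subsets for p = 3:", search_space_size(3))
    ranked_pair = top_k_univariate(X, y, 2)
    best_pair = best_subset_exhaustive(X, y, 2)
    print("top-2 univariate:", ranked_pair, subset_accuracy(X, y, ranked_pair))
    print("best pair:", best_pair, subset_accuracy(X, y, best_pair))
```

In this toy setting, features a and b each have a univariate score of zero, so a ranked list places c first, yet the pair (a, b) is the only pair that separates the classes completely. The exhaustive search finds it only because p = 3; with tens of thousands of variables the same enumeration is out of reach, which is why heuristic multivariate search is needed.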
The presentation will outline and discuss current issues in biomarker discovery, especially the multivariate approach to the identification of genomic, proteomic, or metabolomic biomarkers for medical diagnosis, prognosis, individualized medicine, and drug discovery.
Keywords: Biomarkers, Multivariate Analysis, Data Mining, Bioinformatics, Genomics, Proteomics, Metabolomics, Differential Diagnosis, Drug Discovery, Individualized Medicine, Classification and Prediction Systems.