Research on Data Mining Modeling Theories and Their Applications in Bioinformatics
Shen Hong-Bin
ABSTRACT
Over the past decades, vast amounts of data have been accumulated through the rapid development of science, the economy, and society. How to find the valuable knowledge and rules hidden in these data is a critical problem and an active research topic in both theory and practice. At the same time, biological data have grown exponentially with the development of various biological instruments. Under such conditions, handling data of this size with conventional biological experiments alone is both very expensive and time-consuming. Bridging the gap between the volume of newly generated data and our understanding of the knowledge they contain has become a major challenge. Bioinformatics is a very young research field that tries to uncover the knowledge and rules behind biological data by combining information science, computer science, and physics with the life sciences, so that the findings can be further used to explain biological processes. Bioinformatics research is expected to speed up both life science research and drug discovery. This work focuses on theoretical and applied research in data mining and bioinformatics.
Clustering analysis is one of the most important research areas in data mining. In the real world, we often have to deal with high-dimensional datasets in which, in most cases, different attributes contribute differently to each cluster. To address this problem, an attribute-weighted fuzzy kernel clustering algorithm is proposed. The new kernel clustering algorithm properly reflects the importance of each attribute for each cluster and hence yields much higher clustering accuracy than conventional clustering algorithms. Another situation often encountered in the real world is that a dataset is independent of other datasets yet also cooperates with them. Based on such cooperative constraints, a new information-based collaborative clustering algorithm is proposed. This collaborative clustering algorithm takes the influence of the other datasets into account, and the resulting clustering is more flexible.
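The idea of letting each cluster weight the attributes differently can be illustrated with a small sketch. It is not the algorithm proposed in this work: it omits the kernel mapping and the weight-learning step and shows only how per-cluster attribute weights change a standard fuzzy c-means membership update. The function names, the weight values, and the fuzzifier m = 2 are assumptions chosen for illustration.

```python
def weighted_sq_dist(x, center, weights):
    """Squared Euclidean distance with one weight per attribute."""
    return sum(w * (xi - ci) ** 2 for xi, ci, w in zip(x, center, weights))


def fuzzy_memberships(x, centers, weights, m=2.0):
    """Fuzzy c-means style memberships of point x in each cluster.

    centers[i] is the center of cluster i and weights[i] holds the
    attribute weights of cluster i (hypothetical values, not learned here).
    """
    d2 = [weighted_sq_dist(x, c, w) for c, w in zip(centers, weights)]
    # A point that coincides with one or more centers belongs fully to them.
    zeros = [i for i, d in enumerate(d2) if d == 0.0]
    if zeros:
        return [1.0 / len(zeros) if i in zeros else 0.0 for i in range(len(d2))]
    exponent = 1.0 / (m - 1.0)
    inverse = [(1.0 / d) ** exponent for d in d2]
    total = sum(inverse)
    return [v / total for v in inverse]


centers = [[0.0, 0.0], [2.0, 2.0]]
point = [0.0, 2.0]

# Equal attribute weights: the point is equally far from both centers.
uniform = fuzzy_memberships(point, centers, [[1.0, 1.0], [1.0, 1.0]])

# Cluster 0 mostly ignores the second attribute, so the point moves toward it.
weighted = fuzzy_memberships(point, centers, [[0.9, 0.1], [0.5, 0.5]])
```

With equal weights the point is split evenly between the two clusters. Once cluster 0 down-weights the attribute on which the point differs from its center, the membership shifts toward cluster 0.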
Predicting protein folding patterns goes one level deeper than predicting protein structural classes, and hence is much more complicated and difficult. To deal with this challenging problem, an ensemble classifier was introduced. It is formed from a set of basic classifiers, each trained on a different feature set extracted from a training dataset, such as predicted secondary structure, hydrophobicity, van der Waals volume, polarity, polarizability, and pseudo amino acid composition of different dimensions. Their outcomes are combined through weighted voting to give the final classification of a query protein. The recognition task is to find the true fold among 27 possible patterns. The overall success rate thus obtained was 62% on a testing dataset in which most of the proteins have less than 25% sequence identity with the proteins used to train the classifier. This rate is 6-21% higher than the corresponding rates obtained by various existing NN (Neural Network) and SVM (Support Vector Machine) approaches, suggesting that the ensemble classifier is very promising and might become a useful vehicle in protein science, as well as in proteomics and bioinformatics.
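The weighted-voting step can be sketched as follows. This is a minimal illustration, not the implementation used in this work: the fold labels, the number of basic classifiers, and the weights are hypothetical, and the tie-breaking rule (the smallest label wins) is an assumption, since the abstract does not say how ties or weights are determined.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine the labels predicted by several basic classifiers.

    predictions[k] is the label chosen by classifier k and weights[k] is
    the (hypothetical) weight assigned to that classifier.
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    # Iterate over sorted labels so that ties resolve deterministically.
    return max(sorted(scores), key=lambda label: scores[label])


# Four basic classifiers vote on a fold index in the range 1..27.
votes = [3, 7, 7, 12]
by_weight = weighted_vote(votes, [0.5, 0.2, 0.2, 0.1])
by_count = weighted_vote(votes, [1.0, 1.0, 1.0, 1.0])
```

In this toy case the single heavily weighted classifier outvotes the two lighter classifiers that agree on fold 7. With equal weights the vote reduces to a simple majority and fold 7 wins.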
The structural class is an important attribute for characterizing the overall folding type of a protein. Proteins often have quite similar or even identical folding patterns even when they consist of very different sequences or carry out different biological functions. In view of this, Levitt and Chothia classified proteins into four structural classes: (1) all-α, (2) all-β, (3) α/β, and (4) α+β. Predicting the structural class of a protein from its sequence is both an important and an appealing topic in protein science, not only because the knowledge thus obtained provides useful information about the overall structure of a query protein, but also because the practice itself can stimulate the development of novel predictors that may be applied directly to many other relevant areas. In this work, a novel approach, the so-called “supervised fuzzy clustering approach”, is introduced, whose key feature is that it uses class label information during training. Based on this approach, a set of “if-then” fuzzy rules for predicting protein structural classes is extracted from a training dataset. Tests on three different working datasets show that the overall prediction success rates obtained by the supervised fuzzy clustering approach are all higher than those of the unsupervised fuzzy c-means approach introduced by a previous investigator. It is anticipated that the current predictor may play an important complementary role to other existing predictors in this area, further strengthening the power to predict the structural classes of proteins and their other characteristic attributes.
As a “building block of life”, the cell is deemed the most basic structural and functional unit of all living organisms. According to cellular anatomy, it is highly organized into many functional units, or organelles. Most of these units are “enveloped” by one or more membranes, which are the structural basis for many important biological functions. Membrane proteins are a special group among the protein families: they account for ~30% of all proteins, yet solved membrane protein structures represent <1% of known protein structures to date. This class of proteins constitutes the majority of ion channels, transporters, and receptors in living organisms; for example, phospholamban is an integral membrane protein that regulates the Ca2+ pump in the heart. Membrane proteins are so important that they are the targets of approximately 80% of the drugs on the market. Hence, solving the structures of membrane proteins plays a key role in modern life science research. Owing to the intrinsic structural plasticity of many of these proteins, the chance of obtaining crystals suitable for X-ray or electron diffraction studies is small. Although helical membrane proteins pose a higher degree of experimental difficulty, their conformation is, in a number of ways, more predictable than that of water-soluble proteins. In this work, we propose a novel discrete model of protein sequences, PsePSSM, together with an ensemble classifier framework, to predict the topology of membrane proteins in the cell membrane. Experimental results on a stringent dataset show that the prediction accuracy for membrane protein topology over the 8 classes exceeds 85%, about 30% higher than that of conventional methods.
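The abstract does not define PsePSSM. The sketch below follows a formulation commonly given in the literature, which is an assumption here rather than something stated in this text: the column means of a normalized position-specific scoring matrix (PSSM), followed by, for each lag, the mean squared difference between the scores of residues that many positions apart. The function name, the toy matrix, and the choice to concatenate several lags are illustrative, and the normalization of raw PSSM scores is assumed to have been done beforehand.

```python
def pse_pssm(pssm, max_lag):
    """Build a fixed-length feature vector from an L x C score matrix.

    pssm[i][j] is the (already normalized) score of residue position i for
    amino acid column j; a real PSSM has C = 20 columns.
    """
    length = len(pssm)
    if length <= max_lag:
        raise ValueError("sequence must be longer than max_lag")
    n_cols = len(pssm[0])
    # First block: the average score of each column over the whole sequence.
    features = [sum(row[j] for row in pssm) / length for j in range(n_cols)]
    # Remaining blocks: sequence-order correlation terms, one block per lag.
    for lag in range(1, max_lag + 1):
        for j in range(n_cols):
            total = sum(
                (pssm[i][j] - pssm[i + lag][j]) ** 2 for i in range(length - lag)
            )
            features.append(total / (length - lag))
    return features


# Toy matrix with 3 residues and 2 columns, only to make the arithmetic visible.
toy = [[0.0, 1.0], [1.0, 1.0], [0.0, 0.5]]
vector = pse_pssm(toy, max_lag=1)
```

The output length depends only on the number of columns and on max_lag, not on the sequence length, which is what allows proteins of different lengths to be fed to the same classifier.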
Knowledge of where a protein is located in the cell is closely related to its function. Even when the functional characteristics of a protein are known, it is still critical to know where in the cell the protein functions. One of the fundamental goals in molecular cell biology and proteomics is to identify the subcellular locations or environments of proteins, because the function of a protein and its role in a cell are closely correlated with the compartment or organelle in which it resides. Meanwhile, the number of known sequences has grown rapidly: in 1986 the SWISS-PROT databank contained only 3,939 protein sequence entries, whereas the version released in June 2006 at http://www.ebi.ac.uk/swissprot/ contained 223,100, more than 56 times the number in 1986. With the avalanche of protein sequences generated in the post-genomic era, it is highly desirable to develop automated methods for annotating the subcellular locations of uncharacterized proteins quickly and reliably. The knowledge thus obtained can help us make timely use of these newly found protein sequences for both basic research and drug discovery. In this work, (a) we propose, for the first time in the literature, a prediction algorithm that captures the dynamic feature that proteins may simultaneously exist at, or move between, two or more different subcellular locations, i.e., a model that can deal with proteins having multiple subcellular location sites; (b) we propose the first model for the protein sub-subcellular location problem, i.e., predicting protein subnuclear locations; and (c) we extend, for the first time, the prediction scope to cover 22 subcellular locations, which greatly improves the practical value of the computational models. We also propose combining “high-level” gene ontology information with “ab initio” sequence features to predict protein subcellular locations. In addition, we introduce the “organism-specific” idea in developing protein subcellular location prediction models. Experimental results on stringent datasets show that the performance of the new models proposed in this work is 35% higher than that of conventional methods. All of this work has been accepted and used by other international researchers.
In the course of this research, we have constructed 15 online bioinformatics servers at http://www.csbio.sjtu.edu.cn/bioinf/, to which biologists all over the world can easily submit their biological data and from which they obtain an immediate response. According to the usage statistics, these web servers have been accessed and used more than 1,100,000 times, indicating that they are genuinely useful in life science research. Furthermore, many results computed by these web servers have already been published by other biologists. We believe that such user-friendly online web servers will play important roles in modern life science research and drug discovery.
Key words: Data mining, Clustering analysis, Bioinformatics, Machine learning, Information theory, Evidence theory, Ensemble classifier, Protein structure prediction, Protein subcellular location prediction, Membrane protein type recognition, Cellular network, Protein evolution theory