Projects done/doing by HYUK CHO
(Organized in reverse chronological order)

Current Projects

Co-clustering Algorithms
June 2003 ~ Current.
Preprocessing for Co-clustering Algorithms
Initialization for Co-clustering Algorithms
New Co-clustering Algorithms
Applications of Co-clustering Algorithms
Evaluation of Co-clustering Algorithms

Research Projects

Minimum Squared Residue Co-clustering of Gene Expression Data
June 2003 ~ December 2003.
With Dr. Inderjit S. Dhillon, Yuqiang Guan, and Suvrit Sra.
Datasets, Reports, and Programs are available upon request.

MDL-Based Formulation of Distributional Clustering
September 2002 ~ May 2003.
With Dr. Inderjit S. Dhillon, Manyam, and Dr. Byron Dom (IBM).
Apply the MDL (Minimum Description Length) formulation to predict the optimal number of feature clusters from Distributional Clustering for better classification of text documents.
Use NEWS20 data and some other artificial data for experiments.
Datasets, Reports, and Programs are available upon request.

Semisupervised Learning for Classification of Large Text Data using Feature Clustering
May 2002 ~ August 2002.
CS395T - Conference Course (Dr. Inderjit Dhillon)
Do feature clustering for better classification of text documents when labels are available for only a small portion of the training data, and make use of the label information to enhance classification accuracy.
Apply different distance (or similarity or divergence) measures.
Use NEWS20 data for experiments.
Datasets, Reports, and Programs are available upon request.

Comparisons on Classification Algorithms
January 2001 ~ December 2001.
CS395T - Conference Course (Dr. Inderjit Dhillon). With Yancong Zhou.
Implement well-known classification algorithms in MATLAB: Naive Bayesian (NB), K-Nearest Neighbor (KNN), Centroid-based (CB), and Support Vector Machine (SVM).
Propose several variations of the algorithms and compare their classification performance in terms of accuracy, precision, and recall.
Use CLASSIC3 and NEWS20 data for experiments.
Project Report Webpage
Datasets, Reports, and Programs are available upon request.

Text Mining: Clustering and Querying
January 2001 ~ December 2001.
CS395T - Conference Course (Dr. Inderjit Dhillon)
Do feature clustering for better query retrieval.
Apply different normalizations and query expansions to Keyword Matching (KM) and Generalized Vector Space Model (VSM) for query retrieval.
Use FBIS and LATIMES data (in TREC) for experiments.
Datasets, Reports, and Programs are available upon request.

Comparisons on Partitioning Algorithms
January 2000 ~ May 2000.
CS395T - Conference Course (Dr. Inderjit Dhillon)
Use METIS and hMETIS for text document clustering (here, clustering is done by partitioning).
Modifying METIS is essential.
Convert the VSM into a graph model.
Compare with other graph partitioning algorithms such as CHACO and FM.
Datasets, Reports, and Programs are available upon request.

Spectral Graph Partitioning
August 1999 ~ May 2000.
CS395T - Conference Course (Dr. Inderjit Dhillon)
Use existing Lanczos-based software for computing eigenvectors of both adjacency and Laplacian matrices.
Some experience with the Lanczos algorithm (in FORTRAN and C) is essential.
Apply spectral algorithms to special graphs such as Clique and Roach graphs.
Datasets, Reports, and Programs are available upon request.

Design and Application of Intelligent System Using Clustering Technique and Evolution Program
July 1997 ~ July 1998.
With Dr. Daihee Park and Dr. Jooyoung Park.
Korea University Research Foundation.

Efficient Clustering Algorithm
April 1997 ~ March 1998.
With Dr. Daihee Park and Dr. Jooyoung Park.
Korea University Research Foundation.

An Optimal Design Procedure for BSB (Brain-State-in-a-Box) Neural Networks
August 1996 ~ July 1997.
With Dr. Daihee Park and Dr. Jooyoung Park.
Korea Research Foundation.

Class Projects

Feature Clustering on Clustering and Classification of .GOV TRECWeb Data
January 2003 ~ May 2003.
EE380L - Practicum in Data Mining (Dr. Joydeep Ghosh). With Alex (EE) and Rajal (ME).
Evaluate how feature clustering affects both clustering and classification of huge text datasets.
Apply different distance (or similarity) measures of document clustering algorithms and compare feature selection vs. feature clustering.
Use .GOV TRECWeb data and some other artificial data for experiments.
Datasets, Reports, and Programs are available upon request.

SCHEME Interpreter using Java Language
January 2002 ~ May 2002.
EE386L - Programming Languages (Dr. Greg Lavender)
Datasets, Reports, and Programs are available upon request.

Classification Algorithms on Gene Expression Data
August 2001 ~ December 2001.
CH391L - Bioinformatics (Dr. Edward M. Marcotte)
Compare the performance of supervised machine learning algorithms for classification based on gene expression data.
Evaluate the feasibility and performance of traditional classification algorithms.
Project Report Webpage
Datasets, Reports, and Programs are available upon request.

Evaluation of Algorithms on Document Retrieval
August 2000 ~ December 2000.
EE380L - Data Mining (Dr. Joydeep Ghosh)
Use SVDPACKC for Singular Value Decomposition (SVD) of the Vector Space Model (VSM) for query retrieval.
Modifying SVDPACKC is essential to get the sparse matrix storage format of Compressed Column Storage (CCS).
Use CLASSIC3 data (CISI, CRAN, MED) for experiments.
Datasets, Reports, and Programs are available upon request.
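
For readers unfamiliar with Compressed Column Storage (CCS), the sketch below shows the general idea on a tiny term-by-document matrix. It is only an illustration and not the project's code: the function names `dense_to_ccs` and `ccs_column`, the zero-based indexing, and the example matrix are all assumptions made here, and the exact file layout expected by SVDPACKC may differ.

```python
def dense_to_ccs(matrix):
    """Convert a dense matrix (list of rows) into CCS arrays.

    Returns (values, row_idx, col_ptr), using zero-based indices
    (an assumption of this sketch):
      values  - nonzero entries, stored column by column
      row_idx - row index of each entry in `values`
      col_ptr - col_ptr[j] is where column j starts in `values`;
                the final element equals the number of nonzeros
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    values, row_idx, col_ptr = [], [], [0]
    for j in range(n_cols):
        for i in range(n_rows):
            v = matrix[i][j]
            if v != 0:
                values.append(v)
                row_idx.append(i)
        # Close column j by recording where the next column begins.
        col_ptr.append(len(values))
    return values, row_idx, col_ptr


def ccs_column(values, row_idx, col_ptr, j):
    """Return the nonzeros of column j as (row, value) pairs."""
    start, end = col_ptr[j], col_ptr[j + 1]
    return list(zip(row_idx[start:end], values[start:end]))


# Hypothetical 3-term x 3-document count matrix (rows = terms).
term_doc = [
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
]
values, row_idx, col_ptr = dense_to_ccs(term_doc)
```

Because each document is one column of the matrix, CCS keeps all nonzero term weights of a document contiguous in memory, so `ccs_column(values, row_idx, col_ptr, j)` retrieves document `j` without scanning the other columns.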