A Study on Automatic Assignment of Descriptors Using Machine Learning (기계학습을 통한 디스크립터 자동부여에 관한 연구)

Journal of the Korean Society for Information Management (정보관리학회지), pISSN 1013-0799; eISSN 2586-2073
2006, vol. 23, no. 1, pp. 279-299
https://doi.org/10.3743/KOSIM.2006.23.1.279
김판준 (Silla University)

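Abstract (translated from the Korean)

A machine learning approach was applied to automatically assign descriptors to journal articles. A document collection was built from the articles published over the past 11 years in core journals selected from the field of information science, and performance was examined with respect to feature selection and training set size. For feature selection, the best performance was obtained by reducing the feature set with the chi-square statistic (CHI) and with criteria that favor high-frequency features (COS, GSS, JAC), and then training a support vector machine (SVM). For training set size, the support vector machine (SVM) and the voted perceptron (VPT) were considerably affected, whereas naive Bayes (NB) was hardly affected.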
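The feature selection step can be illustrated with a small sketch. It scores each term against one descriptor from a 2x2 term/descriptor contingency table and keeps the top-ranked terms. The formulas are the textbook definitions of the chi-square statistic, the GSS coefficient, the Jaccard coefficient, and the cosine coefficient; the abstract does not give the exact variants used in the study, so treating JAC and COS this way is an assumption. The toy documents, descriptor names, and function names are hypothetical.

```python
import math

def contingency(docs, labels, term, descriptor):
    """Count the 2x2 table (a, b, c, d) for one term and one descriptor."""
    a = b = c = d = 0
    for tokens, doc_labels in zip(docs, labels):
        has_term = term in tokens
        assigned = descriptor in doc_labels
        if has_term and assigned:
            a += 1  # term present, descriptor assigned
        elif has_term:
            b += 1  # term present, descriptor not assigned
        elif assigned:
            c += 1  # term absent, descriptor assigned
        else:
            d += 1  # term absent, descriptor not assigned
    return a, b, c, d

def chi_square(a, b, c, d):
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def gss(a, b, c, d):
    n = a + b + c + d
    return (a * d - c * b) / (n * n) if n else 0.0

def jaccard(a, b, c, d):
    return a / (a + b + c) if (a + b + c) else 0.0

def cosine(a, b, c, d):
    denom = math.sqrt((a + b) * (a + c))
    return a / denom if denom else 0.0

def select_features(docs, labels, descriptor, score, k):
    """Rank the vocabulary by `score` for one descriptor and keep the top k terms."""
    vocab = sorted(set().union(*docs))
    ranked = sorted(
        vocab,
        key=lambda t: score(*contingency(docs, labels, t, descriptor)),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy collection: each document is a set of terms,
# each label set holds the descriptors assigned to that document.
docs = [
    {"index", "thesaurus", "term"},
    {"index", "retrieval"},
    {"svm", "kernel", "term"},
    {"svm", "retrieval"},
]
labels = [{"indexing"}, {"indexing"}, {"classification"}, {"classification"}]

print(select_features(docs, labels, "indexing", chi_square, 2))
print(select_features(docs, labels, "indexing", gss, 2))
```

Because chi-square squares the association term, a term that is strongly tied to the absence of a descriptor scores as high as one tied to its presence, while GSS, Jaccard, and cosine reward only positive association. This is one reason the criteria can select different feature sets from the same collection.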

Keywords
descriptors, automatic indexing, machine learning, feature selection, classifier, text categorization

Abstract

This study applies several machine learning approaches to the automatic assignment of descriptors to journal articles. After selecting core journals in the field of information science and building a test collection from the articles of the past 11 years, the effects of feature selection and training set size on performance were examined. With regard to feature selection, a Support Vector Machine (SVM) trained after reducing the feature set by the χ² statistic (CHI) and by criteria that prefer high-frequency features (COS, GSS, JAC) performed best. With respect to the size of the training set, it significantly influences the performance of the Support Vector Machine (SVM) and the Voted Perceptron (VPT), but it scarcely affects that of Naive Bayes (NB).
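Of the three classifiers named above, the voted perceptron is the least widely known, so a minimal sketch may help. It follows the voted perceptron as commonly described after Freund and Schapire: every intermediate weight vector is stored with the number of examples it survived, and prediction is a survival-weighted vote. This is not the implementation used in the study, which the abstract does not describe. The binary features, the bias column, the toy labels, and the choice to count a zero score as a mistake are all assumptions made for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sign(z):
    return 1 if z > 0 else -1

def train_voted_perceptron(X, y, epochs=10):
    """Return a list of (weight_vector, survival_count) pairs."""
    v = [0.0] * len(X[0])
    c = 0
    history = []
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * dot(v, x) <= 0:  # mistake; a zero score counts as one
                if c > 0:
                    history.append((v, c))  # retire the current vector
                v = [vi + label * xi for vi, xi in zip(v, x)]
                c = 1
            else:
                c += 1  # current vector survives one more example
    history.append((v, c))
    return history

def predict(history, x):
    """Each stored vector votes with a weight equal to its survival count."""
    vote = sum(c * sign(dot(v, x)) for v, c in history)
    return sign(vote)

# Hypothetical toy data for one descriptor: columns are
# [contains "index", contains "svm", bias]; label +1 means "assign the descriptor".
X = [[1, 0, 1], [1, 1, 1], [0, 1, 1], [0, 0, 1]]
y = [1, 1, -1, -1]

model = train_voted_perceptron(X, y, epochs=10)
print([predict(model, x) for x in X])
```

Under these assumptions, one way to probe sensitivity to training set size would be to train on progressively larger prefixes of a shuffled training set and evaluate each model on the same held-out documents.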


