학습문헌집합에 기 부여된 범주의 정확성과 문헌 범주화 성능

The Effect of the Quality of Pre-Assigned Subject Categories on the Text Categorization Performance

정보관리학회지 / Journal of the Korean Society for Information Management, (P)1013-0799; (E)2586-2073
2006, v.23 no.2, pp.265-285
https://doi.org/10.3743/KOSIM.2006.23.2.265
심경 (Systems R&D Center, Iris.Net)
정영미 (Yonsei University)

초록 (Abstract)

Text categorization assumes that the subject categories assigned to the training document set are correct to a certain degree. This assumption, however, is made without knowledge of actual document collections. This study examines how correct the pre-assigned subject categories in a real-world collection actually are, and investigates the relationship between the correctness of the categories assigned to a training set and text categorization performance. In particular, it asks to what extent categorization performance can be improved by raising the quality of the subject categories assigned to the training documents through manual re-indexing. To this end, 1,150 abstract records in the field of science and technology were re-indexed by a group of experts; after 15 duplicate documents were removed, the collection was divided into a training set of 907 documents and a test set of 227 documents. The categorization performance of the collections before and after re-indexing, that is, the Initial set and the Recat-1 and Recat-2 sets, was compared using a kNN classifier. The average correctness of the category assignments in the Initial set was 16%, and the categorization performance with this set was an F1 value of 17%. In contrast, the Recat-1 set, in which the correctness of the subject categories had been improved, achieved an F1 value of 61%, 3.6 times that of the Initial set.
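The F1 values quoted here follow the standard definition (the harmonic mean of precision and recall), and the stated improvement factor is simply the ratio of the two reported scores; the formula below is the standard one, not notation taken from the paper itself.

```latex
% F1 is the harmonic mean of precision (P) and recall (R).
F_1 = \frac{2 P R}{P + R}

% Improvement factor of the Recat-1 set over the Initial set, as reported above:
\frac{F_1(\text{Recat-1})}{F_1(\text{Initial})} = \frac{0.61}{0.17} \approx 3.6
```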

keywords
Text categorization, document categorization, test collections, kNN classifier, training sets

Abstract

In text categorization, a certain level of correctness is assumed for the category labels assigned to training documents, even though little solid knowledge exists about label quality in real-world collections. Our research attempts to explore the quality of pre-assigned subject categories in a real-world collection and to identify the relationship between the quality of category assignment in the training set and text categorization performance. In particular, we are interested in the extent to which performance can be improved by enhancing the quality (i.e., correctness) of category assignment in the training documents. A collection of 1,150 abstracts in computer science is re-classified by an expert group and, after 15 duplicates are removed, divided into 907 training documents and 227 test documents. Categorization performance before and after re-classification, i.e., with the Initial set and with the Recat-1 and Recat-2 sets, is compared using a kNN classifier. The average correctness of the subject categories in the Initial set is 16%, and categorization with this set yields an F1 value of 17%. In contrast, the Recat-1 set achieves an F1 value of 61%, 3.6 times that of the Initial set.
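As an illustration only, the sketch below shows how a kNN categorization run with F1 evaluation of this general kind might be set up with scikit-learn. The vectorizer, the value of k, the distance measure, and the sample documents and category labels are assumptions made for the sketch, not details taken from the paper.

```python
# A minimal sketch of kNN text categorization over abstracts, evaluated with F1.
# TF-IDF features, k = 3, and cosine distance are illustrative assumptions;
# the documents and subject categories below are hypothetical placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training and test documents with pre-assigned subject categories.
train_texts = [
    "query expansion improves retrieval effectiveness in online catalogs",
    "support vector machines for automatic text classification",
    "relevance feedback and ranking in bibliographic retrieval systems",
    "boosting and ensemble methods for document categorization",
]
train_labels = ["information retrieval", "text categorization",
                "information retrieval", "text categorization"]
test_texts = ["a kNN classifier assigns subject categories to new documents"]
test_labels = ["text categorization"]

# Represent documents as TF-IDF vectors learned from the training set only.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# kNN: a test document receives the majority category of its k nearest
# training documents under cosine distance.
knn = KNeighborsClassifier(n_neighbors=3, metric="cosine")
knn.fit(X_train, train_labels)
predicted = knn.predict(X_test)

# Macro-averaged F1 over categories; with noisier training labels this score
# drops, which is the effect the study measures.
print(f1_score(test_labels, predicted, average="macro"))
```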

