“What can Hyperplane-Classifiers tell us about Texts?”
Edda Leopold
GMD German National Research Center for Information Technology
Institute for Autonomous Intelligent Systems
Jörg Kindermann
GMD German National Research Center for Information Technology
Institute for Autonomous Intelligent Systems
We want to report on our results with Support Vector Machines for text classification in order to promote interdisciplinary dialogue. Our research group consists mainly of statisticians and computer scientists and focuses on the algorithmic side of text classification. We want to discuss our experiences with researchers working in other fields of linguistic computing, and to ask about the implications of our results for linguistic approaches that use vector space representations, such as "semantic spaces" and "latent semantic indexing".
The Support Vector Machine (SVM) algorithm can briefly be described as follows (a more detailed description can be found in Vapnik 1998):
- 1. A set of labeled documents is needed for training. Documents are mapped to their type-frequency vectors. These vectors span a high-dimensional input space (every type represents one dimension). This kind of abstraction from syntagmatic structures is often referred to as the "bag-of-words" approach.
- 2. The algorithm searches for a hyperplane in input space which optimally separates the training documents.
- 3. Documents of a test set are assigned to one of the classes depending on the side of the hyperplane on which they are located (see the sketch below).

SVMs have proven to be an effective means of text classification across different languages (English and German), different textual domains (English: Reuters news, Ohsumed medical abstracts, e-mail newsgroups; German: the newspapers taz, FR, BZ, e-mail newsgroups), and different tasks (topic identification, authorship attribution, classification according to newspaper issues of different years) (Joachims 1997; Joachims 1998; Drucker et al. 1999; Dumais et al. 1998; Diederich, Kindermann, Leopold & Paaß 2000; Leopold & Kindermann 2001).
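To make the three steps concrete, here is a minimal sketch in Python using scikit-learn's CountVectorizer and LinearSVC. It is an illustration under our own assumptions, not the implementation used in the experiments cited above; the example documents and class labels are invented for demonstration.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.svm import LinearSVC

    # Step 1: map labeled training documents to type-frequency
    # ("bag-of-words") vectors; every type is one dimension of
    # the input space.
    train_docs = [
        "the market rallied after the rate cut",   # invented examples
        "the striker scored twice in the final",
    ]
    train_labels = ["economy", "sports"]
    vectorizer = CountVectorizer()
    X_train = vectorizer.fit_transform(train_docs)

    # Step 2: search for a hyperplane in input space that
    # separates the training documents.
    svm = LinearSVC()
    svm.fit(X_train, train_labels)

    # Step 3: a test document is assigned to a class according to
    # the side of the hyperplane it falls on, i.e. the sign of
    # the decision value <w, x> + b.
    test_doc = ["another striker scored in the final"]
    X_test = vectorizer.transform(test_doc)
    print(svm.decision_function(X_test))  # signed decision value
    print(svm.predict(X_test))            # -> ['sports']

Here the linear kernel makes the hyperplane explicit as a weight vector over types; the same three steps carry over to nonlinear kernels, where the hyperplane lives in an implicit feature space.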
References
Joachim Diederich, Jörg Kindermann, Edda Leopold, and Gerhard Paaß. “Authorship Attribution with Support Vector Machines.” Poster presented at The Learning Workshop, 4-7 April 2000, Snowbird, Utah, 2000.
H. Drucker, D. Wu, and V. Vapnik. “Support vector machines for spam categorization.” IEEE Transactions on Neural Networks 10 (1999): 1048-1054.
Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. “Inductive Learning Algorithms and Representations for Text Categorization.” Proceedings of ACM-CIKM-98, 7th International Conference on Information and Knowledge Management, 1998. 148-155.
Thorsten Joachims. “Text categorization with support vector machines: learning with many relevant features.” Proceedings of ECML-98, 10th European Conference on Machine Learning. Lecture Notes in Computer Science 1398. Heidelberg: Springer Verlag, 1998. 137-142.
Jörg Kindermann, Edda Leopold, and Gerhard Paaß. “Multiclass Classification with Error Correcting Codes.” Treffen der GI-Fachgruppe 1.1.3 Maschinelles Lernen. Ed. Edda Leopold and Mathias Kirsten. 2000. 56-64.
Edda Leopold and Jörg Kindermann. “Text Categorization with Support Vector Machines. How to Represent Texts in Input Space?” Machine Learning, 2001.
M. Stricker. “Réseaux de neurones pour le traitement automatique du langage : conception et réalisation de filtres d'informations” [Neural networks for natural language processing: design and implementation of information filters]. School of Library, Archive and Information Studies, University College London, 2000.
Vladimir Vapnik. Statistical Learning Theory. Wiley & Sons, 1998.