
Text Mining Within the GridMiner Framework
Ivan Janciak 1), Peter Brezany 1) and Martin Sarnovsky 2)
1) Vienna University of Technology, Institute of Scientific Computing
2) Technical University of Kosice, Department of Cybernetics and AI
www.gridminer.org
… Intelligent Grid Solutions
Outline
•Motivation
•Text Mining Workflow
•Implementation
•OGSA-DAI & Text Mining in GridMiner
•Future work
•Summary
2nd DIALOGUE Workshop, 9. Feb. 06, Edinburgh
GridMiner Framework
University of Vienna
Target: provide tools to discover and access relevant knowledge and information from different distributed and heterogeneous data sources
Application area: medicine (treatment of Traumatic Brain Injury; predicting the outcome of seriously ill patients)
[Figure: GridMiner in a Virtual Organization. Labels: Data provider, User, Data Exploration Services, Pre-processing Services, Data Mining Services; CRISP-DM phases (Business understanding, Data understanding, Data Preparation, Modeling, Evaluation, Deployment); source: CRISP-DM, SPSS]
GridMiner
•Data Mining Services
  •Clustering
  •Classification
  •Association rules
  •Sequences
  •OLAP
•Pre-Processing
  •Data Cleaning
  •Data Integration
•Visualization
•Job Control
•GUI
•Text Mining Services
Text Mining - Use Case
[Figure: Grid Information Retrieval System. Labels: Grid/Internet, Information Extraction System, XML Files, Collection Managers, XMLDB, Indexers, Query Processor (query, result), Text Mining Tasks]
GridIR - Official working group of the Globus Alliance: http://www.gir-wg.org
Motivation
•Support for IR systems
•Goals:
  •Find possible groups of documents based on their similarities
  •Find appropriate categories for selected documents
•Text documents in various formats (plain text, HTML, XML, PDF) and languages
  •Integration of text documents
  •Text extraction
  •Transformation
•Pre-process (potentially) large collections of documents
GridMiner and Text Mining
•JBowl (Java Bag-of-Words Library)
  •Modular framework for pre-processing and indexing of large text collections
  •System developed in Java to support:
    •Information retrieval
    •Text mining tasks
  •Creates and evaluates supervised and unsupervised text-mining models
  •Produces output in Predictive Model Markup Language (PMML)
•GridMiner
  •Workflow execution and control
  •Visualization
Text Mining Workflow
Workflow steps: Pre-processing, Building Text Model, Data Mining, Evaluation
Step: Pre-processing
Pre-processing: a suitable way to transform data into numbers
Document analysis
•Text tokenization (lexical units)
•Filters (stop words, stemming)
Indexing
•Computing and storing statistics (term frequencies, document frequencies, etc.)
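The pre-processing and indexing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JBowl implementation: the stop-word list, the crude suffix-stripping `naive_stem` helper, and the sample documents are all hypothetical stand-ins.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); a real system would use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "as", "in", "is", "of", "the", "to"}

def tokenize(text):
    """Split text into lowercase alphabetic tokens (lexical units)."""
    return re.findall(r"[a-z]+", text.lower())

def naive_stem(token):
    """Crude suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Tokenize, drop stop words, then stem what remains."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]

def build_index(docs):
    """Return per-document term frequencies and collection-wide document frequencies."""
    term_freqs = {doc_id: Counter(preprocess(text)) for doc_id, text in docs.items()}
    doc_freqs = Counter()
    for counts in term_freqs.values():
        doc_freqs.update(counts.keys())  # each document counts once per term
    return term_freqs, doc_freqs

docs = {
    "d1": "The patients are treated in the clinic",
    "d2": "Mining of text documents",
    "d3": "Text mining is treating documents as bags of words",
}
term_freqs, doc_freqs = build_index(docs)
```

The two returned structures correspond to the statistics named on the slide: `term_freqs` holds term frequencies per document, and `doc_freqs` holds the number of documents in which each term occurs.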
Text Mining Workflow
Step: Building Text Model
•Vector-space model
  •Most frequently used model
  •Bag-of-words representation
•Represents the document collection by a document/term matrix
  •columns -> terms
  •rows -> documents
•TF-IDF weighting (term frequency, inverse document frequency)
  •Interprets local and global aspects of the terms
  •w_ij = tf(d_i, t_j) * idf(t_j)
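A minimal sketch of how a TF-IDF weighted document/term matrix can be computed is given here. It assumes raw term counts for tf and idf = log(N / df), which is one common variant; the slides do not say which variant GridMiner or JBowl uses, and the function name `tfidf_matrix` and the sample documents are hypothetical.

```python
import math

def tfidf_matrix(docs):
    """docs: list of token lists. Returns (terms, matrix); rows = documents, columns = terms."""
    terms = sorted({t for doc in docs for t in doc})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term (global aspect).
    df = {t: sum(1 for doc in docs if t in doc) for t in terms}
    # idf = log(N / df) is an assumption of this sketch; smoothed variants also exist.
    idf = {t: math.log(n_docs / df[t]) for t in terms}
    # Weight = raw count of the term in the document (local aspect) times idf.
    matrix = [[doc.count(t) * idf[t] for t in terms] for doc in docs]
    return terms, matrix

docs = [
    ["grid", "text", "mining"],
    ["text", "mining", "text"],
    ["grid", "service"],
]
terms, W = tfidf_matrix(docs)
```

Here `W[i][j]` plays the role of the weight of term j in document i. A term that occurs in every document would get weight 0 under this variant, which is why the idf factor is said to capture the global aspect.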
Representation of Text Model
PMML (Predictive Model Markup Language) allows a Text Model to be defined in an XML representation divided into six major parts:
•Model attributes
•Dictionary of terms
•Corpus of text documents
•Document-term matrix
•Text model normalization
•Text model similarity
The complete model is a huge XML file (all terms, document-term matrix, ...)
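To make the six-part layout concrete, the following sketch builds an XML skeleton with Python's standard `xml.etree.ElementTree`. The element and attribute names are illustrative placeholders that mirror the six parts listed above; they are not taken from the PMML schema, and the result is not a valid PMML document.

```python
import xml.etree.ElementTree as ET

def text_model_skeleton(terms, doc_ids, matrix):
    # All element/attribute names below are placeholders (assumption), NOT the PMML schema.
    # Part 1: model attributes, stored on the root element.
    root = ET.Element("TextModel", {
        "numberOfTerms": str(len(terms)),
        "numberOfDocuments": str(len(doc_ids)),
    })
    # Part 2: dictionary of terms.
    dictionary = ET.SubElement(root, "Dictionary")
    for term in terms:
        ET.SubElement(dictionary, "Term").text = term
    # Part 3: corpus of text documents.
    corpus = ET.SubElement(root, "Corpus")
    for doc_id in doc_ids:
        ET.SubElement(corpus, "Document", {"id": doc_id})
    # Part 4: document-term matrix, one row per document.
    dtm = ET.SubElement(root, "DocumentTermMatrix")
    for row in matrix:
        ET.SubElement(dtm, "Row").text = " ".join(str(v) for v in row)
    # Parts 5 and 6: normalization and similarity settings (left empty here).
    ET.SubElement(root, "Normalization")
    ET.SubElement(root, "Similarity")
    return root

root = text_model_skeleton(["grid", "text"], ["d1", "d2", "d3"],
                           [[1, 0], [0, 2], [1, 1]])
xml_text = ET.tostring(root, encoding="unicode")
```

Even in this toy form the matrix part grows with the number of documents times the number of terms, which suggests why the complete model file becomes so large.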
Text Mining Workflow
Step: Data Mining
Supported tasks:
•Classification (supervised learning)
  •Classifies documents into a number of predefined categories (multi-label classification)
  •Target is 'document category'
  •Algorithms: C4.5, RIPPER, kNN, SVM
  •Output: Classification Model (set of decision trees)
•Clustering (unsupervised learning)
  •Groups similar documents together
  •Algorithms: k-means, SOM
  •Output: Clustering Model (clusters with similar documents)
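As an illustration of the clustering task, here is a small plain-Python k-means sketch over rows of a document/term matrix. It is a textbook version under simplifying assumptions (squared Euclidean distance, centroids initialised from the first k rows so that the run is deterministic), not the algorithm configuration used in GridMiner; the toy vectors are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(vectors, k, iterations=10):
    """Plain k-means; centroids start at the first k vectors (assumption, for determinism)."""
    centroids = [list(v) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach each document to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Rows are documents, columns are term weights (toy values).
doc_vectors = [
    [1.0, 0.9, 0.0, 0.0],
    [0.0, 0.1, 1.0, 0.8],
    [0.9, 1.0, 0.1, 0.0],
    [0.0, 0.0, 0.9, 1.0],
]
assignment, centroids = k_means(doc_vectors, k=2)
```

Documents 0 and 2 share their dominant terms and end up in one cluster, documents 1 and 3 in the other, which is the "groups similar documents together" behaviour described above.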
Representation of Mined Models
Represented in PMML
•Classification
  •Mined Schema
  •Model Stats
  •TreeModel (for each category, a long binary tree)
•Clustering
  •Mined Schema
  •Model Stats
  •ClusteringModel
    •Clustering field
    •Cluster
Text Mining Workflow
Step: Evaluation
•Estimation of model accuracy
•Goal: to predict (with a high degree of accuracy) the correct class (or cluster) to which a new document belongs
•Is done on a set of previously unseen documents (testing set)
•Output: statistics (precision, recall, F-measure)
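The three output statistics can be computed from true and predicted labels on the testing set as in the following sketch. It assumes a single positive category and the balanced F-measure (F1); the function name `evaluate` and the label values are hypothetical.

```python
def evaluate(true_labels, predicted_labels, positive):
    """Return (precision, recall, F1) for one category treated as the positive class."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Labels for five previously unseen test documents (toy values).
true_labels = ["med", "med", "other", "med", "other"]
predicted = ["med", "other", "other", "med", "med"]
precision, recall, f1 = evaluate(true_labels, predicted, positive="med")
```

With multi-label classification, the same computation would be repeated per category and the results averaged; the averaging scheme is not specified in the slides.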
Text Mining Service – First implementation
[Figure: GridMiner-TM Service. Inputs: Training set (XML file), Testing set (XML file). Steps: Build Text Model, Terms Reduction, Categorization & Clustering, Model evaluation. Outputs: Text Model, Reduced Text Model, Classification/Clustering Model, Statistics (PMML)]
Text Mining Service & OGSA-DAI
[Figure: Text Mining Service combined with OGSA-DAI. Input: Training/Testing sets (XML files, XMLDB). OGSA-DAI labels: Grid Data Mediation (Documents Integration), Text Mining Activity (Build Text Model), Delivery (Deliver Model), Text Model. GridMiner-TM Service labels: Terms Reduction, Text Categorization & Clustering, Model evaluation]
Future Work
•Distributed version of Text Mining service
  •Distribution of Classification Task
[Figure: distributed classification. Labels: TM-Service, Slave Service (several), categories, decision trees, Classification Model, Document Term Matrix, OGSA-DAI Delivery Activity]
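One way to picture the distribution of the classification task is to train one model per category in parallel and merge the results. The sketch below makes several assumptions that are not in the slides: worker threads stand in for slave services, a one-level decision stump (the single term whose presence best predicts the category) stands in for a full decision tree, and the names `train_stump` and `train_distributed` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def train_stump(category, matrix, doc_labels):
    """Pick the term column whose presence best predicts membership in `category`.
    A one-level stand-in for a per-category decision tree (assumption)."""
    targets = [category in labels for labels in doc_labels]
    best_term, best_correct = None, -1
    for j in range(len(matrix[0])):
        # Count documents where "term j is present" agrees with "document is in category".
        correct = sum((row[j] > 0) == target for row, target in zip(matrix, targets))
        if correct > best_correct:
            best_term, best_correct = j, correct
    return category, best_term

def train_distributed(categories, matrix, doc_labels, workers=2):
    """Train one stump per category; each worker thread plays the role of a slave service."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(train_stump, c, matrix, doc_labels) for c in categories]
        # Merging the per-category results mimics assembling the classification model.
        return dict(f.result() for f in futures)

# Document-term matrix (rows = documents) and multi-label category sets (toy values).
matrix = [
    [2, 0, 1],
    [1, 0, 0],
    [0, 3, 1],
    [0, 1, 0],
]
doc_labels = [{"grid"}, {"grid"}, {"medicine"}, {"medicine"}]
model = train_distributed(["grid", "medicine"], matrix, doc_labels)
```

Because each category is trained independently on the same document-term matrix, the work splits cleanly by category, which is presumably what makes the classification task a natural first candidate for distribution.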
Summary
•GridIR architecture extended by Text Mining capabilities
•Stand-alone Text Mining Service
  •Classification
  •Clustering
•Implementation of Building Text Model activity in OGSA-DAI
•Distributed version of the Text Mining service
  •Classification
TextMining Video