Class SimpleNaiveBayesDocumentClassifier
- java.lang.Object
- org.apache.lucene.classification.SimpleNaiveBayesClassifier
- org.apache.lucene.classification.document.SimpleNaiveBayesDocumentClassifier
- All Implemented Interfaces:
Classifier<BytesRef>, DocumentClassifier<BytesRef>
public class SimpleNaiveBayesDocumentClassifier extends SimpleNaiveBayesClassifier implements DocumentClassifier<BytesRef>
A simplistic Lucene based NaiveBayes classifier, see http://en.wikipedia.org/wiki/Naive_Bayes_classifier
Field Summary
Fields
Modifier and Type | Field | Description
protected java.util.Map<java.lang.String,Analyzer> | field2analyzer | Analyzer to be used for tokenizing document fields
- Fields inherited from class org.apache.lucene.classification.SimpleNaiveBayesClassifier
analyzer, classFieldName, indexReader, indexSearcher, query, textFieldNames
Constructor Summary
Constructors
Constructor | Description
SimpleNaiveBayesDocumentClassifier(IndexReader indexReader, Query query, java.lang.String classFieldName, java.util.Map<java.lang.String,Analyzer> field2analyzer, java.lang.String... textFieldNames) | Creates a new NaiveBayes classifier.
Method Summary
Modifier and Type | Method | Description
private void | analyzeSeedDocument(Document inputDocument, java.util.Map<java.lang.String,java.util.List<java.lang.String[]>> fieldName2tokensArray, java.util.Map<java.lang.String,java.lang.Float> fieldName2boost) | Performs the analysis for the seed document and extracts the boosts, if present.
ClassificationResult<BytesRef> | assignClass(Document document) | Assign a class (with score) to the given Document.
private java.util.List<ClassificationResult<BytesRef>> | assignNormClasses(Document inputDocument) |
private double | calculateLogLikelihood(java.lang.String[] tokenizedText, java.lang.String fieldName, Term term, int docsWithClass) |
private double | calculateLogPrior(Term term, int docsWithClassSize) |
private int | docCount(Term term) |
java.util.List<ClassificationResult<BytesRef>> | getClasses(Document document) | Get all the classes (sorted by score, descending) assigned to the given Document.
java.util.List<ClassificationResult<BytesRef>> | getClasses(Document document, int max) | Get the first max classes (sorted by score, descending) assigned to the given Document.
private double | getTextTermFreqForClass(Term term, java.lang.String fieldName) | Returns the average number of unique terms times the number of docs belonging to the input class.
protected java.lang.String[] | getTokenArray(TokenStream tokenizedText) | Returns a token array from the input TokenStream.
private int | getWordFreqForClass(java.lang.String word, java.lang.String fieldName, Term term) | Returns the number of documents of the input class (from the whole index or from a subset) that contain the word (in a specific field, or in all the fields if none is selected).
- Methods inherited from class org.apache.lucene.classification.SimpleNaiveBayesClassifier
assignClass, assignClassNormalizedList, countDocsWithClass, getClasses, getClasses, normClassificationResults, tokenize
Constructor Detail
-
SimpleNaiveBayesDocumentClassifier
public SimpleNaiveBayesDocumentClassifier(IndexReader indexReader, Query query, java.lang.String classFieldName, java.util.Map<java.lang.String,Analyzer> field2analyzer, java.lang.String... textFieldNames)
Creates a new NaiveBayes classifier.
- Parameters:
indexReader - the reader on the index to be used for classification
query - a Query to optionally filter the docs used for training the classifier, or null if all the indexed docs should be used
classFieldName - the name of the field used as the output for the classifier. NOTE: it must not be heavily analyzed, as the returned class will be a token indexed for this field
textFieldNames - the names of the fields used as the inputs for the classifier; they can contain a boosting indication, e.g. title^10
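The boosting indication in textFieldNames (e.g. title^10) can be pictured with the following minimal sketch. FieldBoostSketch and parseBoosts are hypothetical names, not part of Lucene, and the default boost of 1.0 for unboosted fields is an assumption made for illustration; the class may handle these cases differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not a Lucene class): splits "name^boost" field specs. */
class FieldBoostSketch {

  /** Maps each field name to its boost; fields without a "^boost" suffix get 1.0 (assumption). */
  static Map<String, Float> parseBoosts(String... textFieldNames) {
    Map<String, Float> boosts = new LinkedHashMap<>();
    for (String spec : textFieldNames) {
      int caret = spec.lastIndexOf('^');
      if (caret > 0 && caret < spec.length() - 1) {
        // Text before '^' is the field name, text after it is the boost factor.
        boosts.put(spec.substring(0, caret), Float.parseFloat(spec.substring(caret + 1)));
      } else {
        boosts.put(spec, 1.0f);
      }
    }
    return boosts;
  }

  public static void main(String[] args) {
    System.out.println(parseBoosts("title^10", "body")); // prints {title=10.0, body=1.0}
  }
}
```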
-
-
Method Detail
-
assignClass
public ClassificationResult<BytesRef> assignClass(Document document) throws java.io.IOException
Description copied from interface: DocumentClassifier
Assign a class (with score) to the given Document.
- Specified by:
assignClass in interface DocumentClassifier<BytesRef>
- Parameters:
document - a Document to be classified. Fields are considered features for the classification.
- Returns:
a ClassificationResult holding the assigned class of type T and its score
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(Document document) throws java.io.IOException
Description copied from interface: DocumentClassifier
Get all the classes (sorted by score, descending) assigned to the given Document.
- Specified by:
getClasses in interface DocumentClassifier<BytesRef>
- Parameters:
document - a Document to be classified. Fields are considered features for the classification.
- Returns:
the whole list of ClassificationResult, the classes and scores. Returns null if the classifier can't make lists.
- Throws:
java.io.IOException - If there is a low-level I/O error.
-
getClasses
public java.util.List<ClassificationResult<BytesRef>> getClasses(Document document, int max) throws java.io.IOException
Description copied from interface: DocumentClassifier
Get the first max classes (sorted by score, descending) assigned to the given Document.
- Specified by:
getClasses in interface DocumentClassifier<BytesRef>
- Parameters:
document - a Document to be classified. Fields are considered features for the classification.
max - the maximum number of elements in the returned list
- Returns:
the list of ClassificationResult, the classes and scores, truncated to at most max elements. Returns null if the classifier can't make lists.
- Throws:
java.io.IOException - If there is a low-level I/O error.
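The "sorted by score, descending" and "first max" behaviour described above can be sketched on plain Java collections. TopClassesSketch and top are hypothetical names, and the map of class names to scores stands in for a list of ClassificationResult; this illustrates the contract only and does not reproduce the Lucene implementation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration (not a Lucene class) of "top max classes by score". */
class TopClassesSketch {

  /** Sorts the class/score pairs by score, descending, and keeps at most max of them. */
  static List<Map.Entry<String, Double>> top(Map<String, Double> scores, int max) {
    List<Map.Entry<String, Double>> sorted = new ArrayList<>(scores.entrySet());
    sorted.sort(Map.Entry.<String, Double>comparingByValue().reversed());
    int limit = Math.max(0, Math.min(max, sorted.size()));
    return new ArrayList<>(sorted.subList(0, limit));
  }

  public static void main(String[] args) {
    Map<String, Double> scores = new HashMap<>();
    scores.put("sports", 0.2);
    scores.put("politics", 0.7);
    scores.put("tech", 0.1);
    System.out.println(top(scores, 2));
  }
}
```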
-
assignNormClasses
private java.util.List<ClassificationResult<BytesRef>> assignNormClasses(Document inputDocument) throws java.io.IOException
- Throws:
java.io.IOException
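This page does not describe assignNormClasses. Assuming, from its name and from the log-based helpers in this class, that it produces normalized scores out of per-class log scores, one common way to do that is a log-sum-exp normalization. The sketch below uses hypothetical names and is not taken from the Lucene source.

```java
/** Hypothetical illustration (not a Lucene class) of normalizing log scores. */
class LogScoreNormalizationSketch {

  /** Converts log scores into values that sum to 1, using log-sum-exp for numerical stability. */
  static double[] normalize(double[] logScores) {
    double max = Double.NEGATIVE_INFINITY;
    for (double s : logScores) {
      max = Math.max(max, s);
    }
    double sum = 0d;
    for (double s : logScores) {
      sum += Math.exp(s - max); // shifting by max keeps exp() from underflowing
    }
    double logTotal = max + Math.log(sum);
    double[] normalized = new double[logScores.length];
    for (int i = 0; i < logScores.length; i++) {
      normalized[i] = Math.exp(logScores[i] - logTotal);
    }
    return normalized;
  }

  public static void main(String[] args) {
    double[] p = normalize(new double[] {-1000.0, -1001.0, -1005.0});
    System.out.println(p[0] + " " + p[1] + " " + p[2]);
  }
}
```

Subtracting the maximum before exponentiating matters here because log scores summed over many tokens are often far below the range where exp() returns a non-zero double.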
-
analyzeSeedDocument
private void analyzeSeedDocument(Document inputDocument, java.util.Map<java.lang.String,java.util.List<java.lang.String[]>> fieldName2tokensArray, java.util.Map<java.lang.String,java.lang.Float> fieldName2boost) throws java.io.IOException
Performs the analysis for the seed document and extracts the boosts, if present. This is done only once for the seed document.
- Parameters:
inputDocument - the unseen seed document
fieldName2tokensArray - a map that associates each field name with the list of token arrays for all its values
fieldName2boost - a map that associates each field with its boost
- Throws:
java.io.IOException - If there is a low-level I/O error
-
getTokenArray
protected java.lang.String[] getTokenArray(TokenStream tokenizedText) throws java.io.IOException
Returns a token array from the input TokenStream.
- Parameters:
tokenizedText - the tokenized content of a field
- Returns:
a String array of the resulting tokens
- Throws:
java.io.IOException - If tokenization fails because there is a low-level I/O error
-
calculateLogLikelihood
private double calculateLogLikelihood(java.lang.String[] tokenizedText, java.lang.String fieldName, Term term, int docsWithClass) throws java.io.IOException
- Parameters:
tokenizedText - the tokenized content of a field
fieldName - the input field name
term - the Term referring to the class to calculate the score of
docsWithClass - the total number of docs that have a class
- Returns:
a normalized score for the class
- Throws:
java.io.IOException - If there is a low-level I/O error
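A normalized log likelihood of this kind can be sketched as an add-one (Laplace) smoothed naive Bayes score averaged over the tokens. Everything in the sketch is an assumption made for illustration: the class and parameter names are hypothetical, the counts come from an in-memory map rather than an index, and the exact numerator and denominator used by the class may differ.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration (not a Lucene class) of a smoothed, length-normalized log likelihood. */
class LogLikelihoodSketch {

  /**
   * @param tokens              tokens of one field of the unseen document
   * @param docsWithWordInClass assumed lookup: token -> number of class documents containing it
   * @param termCountForClass   assumed term count for the class in this field
   * @param docsWithClass       total number of docs that have a class
   */
  static double logLikelihood(String[] tokens, Map<String, Integer> docsWithWordInClass,
                              double termCountForClass, int docsWithClass) {
    if (tokens.length == 0) {
      return 0d; // nothing to score; avoids dividing by zero below
    }
    double sum = 0d;
    for (String token : tokens) {
      double num = docsWithWordInClass.getOrDefault(token, 0) + 1; // add-one smoothing
      double den = termCountForClass + docsWithClass;
      sum += Math.log(num / den);
    }
    // Averaging per token keeps long fields from dominating short ones.
    return sum / tokens.length;
  }

  public static void main(String[] args) {
    Map<String, Integer> counts = new HashMap<>();
    counts.put("lucene", 3);
    System.out.println(logLikelihood(new String[] {"lucene", "search"}, counts, 6d, 2));
  }
}
```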
-
getTextTermFreqForClass
private double getTextTermFreqForClass(Term term, java.lang.String fieldName) throws java.io.IOException
Returns the average number of unique terms times the number of docs belonging to the input class.
- Parameters:
term - the class term
- Returns:
the average number of unique terms
- Throws:
java.io.IOException - If there is a low-level I/O error
-
getWordFreqForClass
private int getWordFreqForClass(java.lang.String word, java.lang.String fieldName, Term term) throws java.io.IOException
Returns the number of documents of the input class (from the whole index or from a subset) that contain the word (in a specific field, or in all the fields if none is selected).
- Parameters:
word - the token produced by the analyzer
fieldName - the field the word is coming from
term - the class term
- Returns:
the number of documents of the input class
- Throws:
java.io.IOException - If there is a low-level I/O error
-
calculateLogPrior
private double calculateLogPrior(Term term, int docsWithClassSize) throws java.io.IOException
- Throws:
java.io.IOException
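This page gives no description for calculateLogPrior. In a naive Bayes classifier the log prior of a class is usually the log of the fraction of labelled documents that belong to it, which the sketch below computes. The names are hypothetical, and mapping the two arguments to docCount(term) and docsWithClassSize is an assumption.

```java
/** Hypothetical illustration (not a Lucene class) of a naive Bayes log prior. */
class LogPriorSketch {

  /**
   * @param docsInClass      number of documents labelled with the class (assumed to match docCount(term))
   * @param docsWithAnyClass number of documents that have any class (assumed to match docsWithClassSize)
   */
  static double logPrior(int docsInClass, int docsWithAnyClass) {
    // log(a / b) written as log(a) - log(b)
    return Math.log((double) docsInClass) - Math.log((double) docsWithAnyClass);
  }

  public static void main(String[] args) {
    System.out.println(logPrior(25, 100));
  }
}
```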
-
docCount
private int docCount(Term term) throws java.io.IOException
- Throws:
java.io.IOException
-
-