I used LDA to build a topic model for 2 text documents, A and B. Document A is highly related to computer science, and document B is highly related to geo-science. I trained an LDA model using these commands:
    text <- c(A, B)                                    # introduced above
    r <- Corpus(VectorSource(text))                    # create corpus object
    r <- tm_map(r, tolower)                            # convert text to lower case
    r <- tm_map(r, removePunctuation)
    r <- tm_map(r, removeNumbers)
    r <- tm_map(r, removeWords, stopwords("english"))
    r.dtm <- TermDocumentMatrix(r, control = list(minWordLength = 3))
    my_lda <- LDA(r.dtm, 2)
Now I want to use my_lda to predict the context of a new document C, to see whether it is related to computer science or geo-science. I know that I can use this code for prediction:
    x <- C                                             # a new document (a long string) introduced above, for prediction
    rp <- Corpus(VectorSource(x))                      # create corpus object
    rp <- tm_map(rp, tolower)                          # convert text to lower case
    rp <- tm_map(rp, removePunctuation)
    rp <- tm_map(rp, removeNumbers)
    rp <- tm_map(rp, removeWords, stopwords("english"))
    rp.dtm <- TermDocumentMatrix(rp, control = list(minWordLength = 3))
    test.topics <- posterior(my_lda, rp.dtm)
It gives me a label of 1 or 2, but I have no idea what 1 or 2 represents. How can I tell whether it means computer science related or geo-science related?
You can extract the most likely terms from your LDA topic model and replace the black-box numeric names with as many of those terms as you like. Your example isn't reproducible, so here is an example illustrating how you can do this:
    > library(topicmodels)
    > data(AssociatedPress)
    >
    > train <- AssociatedPress[1:100]
    > test <- AssociatedPress[101:150]
    >
    > train.lda <- LDA(train,2)
    >
    > #returns black box names
    > test.topics <- posterior(train.lda,test)$topics
    > head(test.topics)
                  1           2
    [1,] 0.57245696 0.427543038
    [2,] 0.56281568 0.437184320
    [3,] 0.99486888 0.005131122
    [4,] 0.45298547 0.547014530
    [5,] 0.72006712 0.279932882
    [6,] 0.03164725 0.968352746
    > #extract top 5 terms for each topic and assign them as variable names
    > colnames(test.topics) <- apply(terms(train.lda,5),2,paste,collapse=",")
    > head(test.topics)
         percent,year,i,new,last new,people,i,soviet,states
    [1,]              0.57245696                0.427543038
    [2,]              0.56281568                0.437184320
    [3,]              0.99486888                0.005131122
    [4,]              0.45298547                0.547014530
    [5,]              0.72006712                0.279932882
    [6,]              0.03164725                0.968352746
    > #round to one topic if you'd prefer
    > test.topics <- apply(test.topics,1,function(x) colnames(test.topics)[which.max(x)])
    > head(test.topics)
    [1] "percent,year,i,new,last"    "percent,year,i,new,last"    "percent,year,i,new,last"
    [4] "new,people,i,soviet,states" "percent,year,i,new,last"    "new,people,i,soviet,states"
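To apply the same relabelling to the two-document setup from the question, something like the following might work. It is only a sketch under several assumptions: the strings A, B and C are made-up placeholders (the real documents were not posted), the helper clean() is hypothetical, and recent versions of tm and topicmodels are installed. Two choices differ from the question's code. It builds a DocumentTermMatrix, since LDA() in topicmodels treats rows as documents and a TermDocumentMatrix is the transpose. It also restricts the new document to the training vocabulary through the dictionary control option, so that its columns line up with the fitted model.

```r
library(tm)
library(topicmodels)

# Placeholder texts standing in for the asker's documents A, B and C (assumed content).
A <- "computer algorithm software compiler network programming database algorithm software"
B <- "rock mineral earthquake sediment volcano geology mineral rock fault"
C <- "the compiler and the database software run an algorithm"

# Hypothetical helper: the same cleaning steps as in the question.
clean <- function(txt) {
  corp <- VCorpus(VectorSource(txt))
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  tm_map(corp, removeWords, stopwords("english"))
}

# Rows are documents, columns are terms of at least 3 characters.
train.dtm <- DocumentTermMatrix(clean(c(A, B)),
                                control = list(wordLengths = c(3, Inf)))
my_lda <- LDA(train.dtm, k = 2, control = list(seed = 1))

# Count only terms seen in training so the columns match the model's vocabulary.
new.dtm <- DocumentTermMatrix(clean(C),
                              control = list(dictionary = Terms(train.dtm),
                                             wordLengths = c(3, Inf)))

# Posterior topic proportions for C, with columns named by each topic's top 5 terms.
new.topics <- posterior(my_lda, new.dtm)$topics
colnames(new.topics) <- apply(terms(my_lda, 5), 2, paste, collapse = ",")
print(new.topics)
```

Reading the column names then tells you which topic number corresponds to which subject. With only two training documents the topics may not separate cleanly, so the top terms are worth checking before trusting the label.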