R topic modeling: LDA model labeling function


I used LDA to build a topic model for 2 text documents, A and B. Document A is highly related to computer science, and document B is highly related to geo-science. I trained an LDA model using these commands:

     library(tm)
     library(topicmodels)

     text <- c(A, B)  # A and B are the two documents introduced above
     r <- Corpus(VectorSource(text))  # create corpus object
     r <- tm_map(r, content_transformer(tolower))  # convert text to lower case
     r <- tm_map(r, removePunctuation)
     r <- tm_map(r, removeNumbers)
     r <- tm_map(r, removeWords, stopwords("english"))
     # LDA() expects documents in rows, so build a DocumentTermMatrix
     r.dtm <- DocumentTermMatrix(r, control = list(wordLengths = c(3, Inf)))

     my_lda <- LDA(r.dtm, 2)

Now I want to use my_lda to predict the context of a new document C, and I want to see whether it is related to computer science or geo-science. I know that I can use this code for prediction:

     x <- C  # new document (a long string) introduced above, for prediction
     rp <- Corpus(VectorSource(x))  # create corpus object
     rp <- tm_map(rp, content_transformer(tolower))  # convert text to lower case
     rp <- tm_map(rp, removePunctuation)
     rp <- tm_map(rp, removeNumbers)
     rp <- tm_map(rp, removeWords, stopwords("english"))
     # same matrix orientation as the training data: documents in rows
     rp.dtm <- DocumentTermMatrix(rp, control = list(wordLengths = c(3, Inf)))

     test.topics <- posterior(my_lda, rp.dtm)

It will give me a label of 1 or 2, but I don't have any idea what 1 or 2 represents... How can I tell whether it means computer science related or geo-science related?

You can extract the most likely terms from your LDA topic model and replace those black-box numeric names with as many of them as you'd like. Your example isn't reproducible, so here is an example illustrating how you can do this:

     > library(topicmodels)
     > data(AssociatedPress)
     >
     > train <- AssociatedPress[1:100]
     > test <- AssociatedPress[101:150]
     >
     > train.lda <- LDA(train,2)
     >
     > #returns black box names
     > test.topics <- posterior(train.lda,test)$topics
     > head(test.topics)
                   1           2
     [1,] 0.57245696 0.427543038
     [2,] 0.56281568 0.437184320
     [3,] 0.99486888 0.005131122
     [4,] 0.45298547 0.547014530
     [5,] 0.72006712 0.279932882
     [6,] 0.03164725 0.968352746
     > #extract top 5 terms for each topic and assign them as variable names
     > colnames(test.topics) <- apply(terms(train.lda,5),2,paste,collapse=",")
     > head(test.topics)
          percent,year,i,new,last new,people,i,soviet,states
     [1,]              0.57245696                0.427543038
     [2,]              0.56281568                0.437184320
     [3,]              0.99486888                0.005131122
     [4,]              0.45298547                0.547014530
     [5,]              0.72006712                0.279932882
     [6,]              0.03164725                0.968352746
     > #round to 1 topic if you'd prefer
     > test.topics <- apply(test.topics,1,function(x) colnames(test.topics)[which.max(x)])
     > head(test.topics)
     [1] "percent,year,i,new,last"    "percent,year,i,new,last"    "percent,year,i,new,last"
     [4] "new,people,i,soviet,states" "percent,year,i,new,last"    "new,people,i,soviet,states"
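If you want to reuse this, the two steps above (naming each topic by its top terms, then picking the most probable topic per document) could be wrapped in a small function. This is only a sketch in base R: `label_topics` is a hypothetical helper, not part of topicmodels, and the input matrices are stand-ins copied from the output above so that the snippet runs without the package. In practice you would pass `posterior(train.lda, test)$topics` and `terms(train.lda, 5)` instead.

```r
# Hypothetical helper (not part of topicmodels): name each topic by its top
# terms, then report the most probable topic for every document.
label_topics <- function(topic_probs, top_terms) {
  # top_terms: character matrix with one column per topic
  topic_names <- apply(top_terms, 2, paste, collapse = ",")
  # max.col() returns, for each row, the column index of the largest value
  best <- max.col(topic_probs, ties.method = "first")
  data.frame(
    topic = best,
    label = unname(topic_names[best]),
    prob  = topic_probs[cbind(seq_len(nrow(topic_probs)), best)],
    stringsAsFactors = FALSE
  )
}

# Stand-in inputs copied from the output above (rows 1 and 6).
topic_probs <- matrix(
  c(0.57245696, 0.427543038,
    0.03164725, 0.968352746),
  ncol = 2, byrow = TRUE
)
top_terms <- cbind(
  c("percent", "year", "i", "new", "last"),
  c("new", "people", "i", "soviet", "states")
)

labels <- label_topics(topic_probs, top_terms)
print(labels)
```

Keeping the winning probability next to the label makes it easier to spot documents such as row 4 above, where the two topics are nearly tied and a hard label would be misleading.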
