Digital Humanities Questions & Answers » Topic: Topic Modeling (MALLET) with JSTOR Data For Research

http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research (feed dated Wed, 14 Sep 2016 13:34:43 +0000)

Ben Marwick, Thu, 30 Jan 2014 19:50:12 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-2142

Replying to @Ben Marwick's <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1842">post</a>: Just a short follow-up: I have now bundled my snippets into a more complete R package for working with JSTOR DFR data. It takes DFR output and does ngrams, word correlations over time, document clustering and topic modelling (with MALLET or in R, and including hot and cold topic identification): <a href="https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr" rel="nofollow">https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr</a>

johnlaudun, Wed, 13 Mar 2013 18:43:40 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1932

Replying to @Michael Widner's <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1872">post</a>: I didn't know about GenSim. What a great library for Python, and the site has some really nice explanatory material on it, too. Thanks for the link.
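A note on the gensim suggestion in this thread: gensim represents each document as a list of (token_id, count) pairs, so DfR word counts can in principle be mapped to that layout without first rebuilding any text. What follows is a minimal Python sketch, not a tested pipeline. The function names `read_dfr_counts` and `build_corpus` and the sample data are made up for illustration; the `WORDCOUNTS,WEIGHT` header is taken from the snippets posted elsewhere in the thread. It uses only the standard library, and the gensim call appears in a comment and is not run here.

```python
import csv
import io


def read_dfr_counts(csv_text):
    """Parse one DfR wordcount file (header WORDCOUNTS,WEIGHT) into a dict."""
    reader = csv.DictReader(io.StringIO(csv_text))
    counts = {}
    for row in reader:
        word = row["WORDCOUNTS"]
        if word:
            counts[word] = counts.get(word, 0) + int(row["WEIGHT"])
    return counts


def build_corpus(docs):
    """Turn a list of {word: count} dicts into (id2word, corpus).

    corpus is a list of documents, each a sorted list of (token_id, count)
    pairs, which is the bag-of-words layout gensim's models take as input.
    """
    word2id = {}
    corpus = []
    for counts in docs:
        bow = []
        for word, n in sorted(counts.items()):
            # Assign the next free integer id the first time a word is seen.
            token_id = word2id.setdefault(word, len(word2id))
            bow.append((token_id, n))
        corpus.append(sorted(bow))
    id2word = {i: w for w, i in word2id.items()}
    return id2word, corpus


# Two tiny made-up files standing in for DfR downloads.
sample_a = "WORDCOUNTS,WEIGHT\nthe,3\npottery,2\n"
sample_b = "WORDCOUNTS,WEIGHT\nthe,1\nlithic,4\n"
id2word, corpus = build_corpus(
    [read_dfr_counts(sample_a), read_dfr_counts(sample_b)]
)
print(corpus)

# With gensim installed, these could then be handed to its LDA model, e.g.
#   from gensim.models import LdaModel
#   lda = LdaModel(corpus=corpus, id2word=id2word, num_topics=2)
```

Keeping the counts as counts avoids writing one reconstituted text file per article, at the cost of doing the modeling in Python rather than with the MALLET command-line tool.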

Michael Widner, Fri, 01 Feb 2013 15:11:14 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1872

If you're planning on working in Python anyway, you might want to look at the gensim library: <a href="http://radimrehurek.com/gensim/" rel="nofollow">http://radimrehurek.com/gensim/</a> It would let you perform some topic modeling in the code, so it wouldn't require the stop-gap between word frequencies and mallet.

Ben Marwick, Mon, 31 Dec 2012 18:35:36 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1842

Replying to @<a href='/profile/cforster'>cforster</a>'s <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1767">post</a>: Here are a few lines of R that I'm working with for topic modelling. These lines should:

1. read in the JSTOR CSV wordcount files to R
2. convert them from a table of words and their counts to a 'bag of words'
3. for each CSV file, create a txt file of the 'bag of words' ready for MALLET

For a sample of 1000 articles from 'American Antiquity', reading in the CSV files takes about 7 sec and writing the 'bag-of-words' txt files about 6 sec.

<pre class="r" style="font-family:monospace;"># set working directory, i.e. location of JSTOR DfR CSV
# files on the computer
setwd("C:\\some directory with JSTOR DfR CSV files")

# create a list of all the CSV files
# (pattern is a regular expression, so match the extension explicitly)
myFiles <- list.files(pattern = "\\.csv$", ignore.case = TRUE)

# read in all the CSV files to an R data object
myData <- lapply(myFiles, read.csv, stringsAsFactors = FALSE)

# assign file names to each dataframe in the list
names(myData) <- myFiles

# Here's the step where we turn the JSTOR DfR 'wordcount' into
# the 'bag of words' that's typically needed for topic modelling
# The R process is 'untable-ing' each CSV file into a
# list of character vectors, one vector per file
# (lapply rather than sapply, so the result is always a list)
myUntabledData <- lapply(myData, function(x) {
  rep(x$WORDCOUNTS, times = x$WEIGHT)
})

# And here's the step where we create individual txt files
# for each vector (formerly a CSV file) that should be suitable for
# input into MALLET.
invisible(lapply(myFiles, function(x) {
  writeLines(paste(myUntabledData[[x]], collapse = " "),
             con = paste(x, "txt", sep = "."))
}))

# Look in the working directory to find the txt files</pre>

I have a few more snippets that use the citations CSV for filtering and attaching biblio data to the R data and topic modelling using R (both packages) and MALLET. Some of these are here: <a href="https://gist.github.com/benmarwick" rel="nofollow">https://gist.github.com/benmarwick</a>

Andrew Goldstone, Mon, 12 Nov 2012 17:14:44 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1788

Replying to @<a href='/profile/cforster'>cforster</a>'s <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1773">post</a>: Looked again at this, am embarrassed by that snippet.
Here's a whole script for making bags of words from jstor's wordcount csv files: <a href="https://github.com/agoldst/dfr-analysis/blob/master/count2txt" rel="nofollow">https://github.com/agoldst/dfr-analysis/blob/master/count2txt</a>. It is still in perl, so there. I couldn't get <code>mallet train-topics</code> to work on an instance file produced from <code>mallet import-svmlight</code>, but I didn't try very hard.

cforster, Mon, 29 Oct 2012 09:48:31 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1773

Replying to @Andrew Goldstone's <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1771">post</a>: Thanks very much; I'll likely cobble together a similar script in Python (Perl? What is this, the nineties? I kid... I kid...). Very much appreciate it. And thanks for the "fla" filter tip. I may revise my query with that in mind. I'm a little suspicious of the data because I ran a quick word count for one file against the data they provided and it didn't seem to match up. I'll look into that at greater length and then get in touch with the DfR folks should my suspicion be borne out.

cforster, Mon, 29 Oct 2012 09:45:36 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1772

Replying to @<a href='/profile/tedunderwood'>tedunderwood</a>'s <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1768">post</a>: Excellent; thanks very much.
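The quick check cforster describes, counting words in one article and comparing against the supplied counts, might look roughly like the Python sketch below. Everything here is an assumption for illustration: the names `count_words` and `compare_counts`, the sample data, and above all the tokenizer (lowercase runs of the letters a-z), since the thread does not say how DfR tokenizes or whether it filters any words. A reported difference may therefore reflect tokenization or filtering choices rather than missing text.

```python
import re
from collections import Counter


def count_words(text):
    """Lowercase the text and count runs of letters (an assumed tokenizer)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def compare_counts(supplied, text):
    """Return {word: (supplied_count, own_count)} for words whose counts differ."""
    own = count_words(text)
    diffs = {}
    for word in sorted(set(supplied) | set(own)):
        a, b = supplied.get(word, 0), own.get(word, 0)
        if a != b:
            diffs[word] = (a, b)
    return diffs


# Made-up counts and text; "from", "the" and "more" are absent from the counts.
supplied = {"pottery": 2, "site": 1}
text = "Pottery from the site. More pottery."
print(compare_counts(supplied, text))
```

Summing the supplied counts and comparing the total with the article's known length is a cruder variant of the same check.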

Andrew Goldstone, Mon, 29 Oct 2012 09:25:53 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1771

Replying to @<a href='/profile/cforster'>cforster</a>'s <a href="http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1767">post</a>: Glad you're trying out DFR too. Lots of treasures buried in there, I feel sure. As Ted says, when I applied mallet to the data from jstor, I just reconstituted a bag of words from each word-count file in csv format, basically with this perl:

<pre class="perl" style="font-family:monospace;">use strict; use warnings;

# paths are passed on the command line: input csv, output txt
open(INFILE, '<', $ARGV[0]) or die "can't read $ARGV[0]: $!";
open(OUTFILE, '>', $ARGV[1]) or die "can't write $ARGV[1]: $!";

# DfR wordcount files start with this header line
my $header = <INFILE>;
die unless $header =~ /^WORDCOUNTS,WEIGHT/;

while(<INFILE>) {
    chomp;
    my ($word,$count) = split ',';
    # write each word as many times as it was counted
    if(defined $word && length $word) {
        print OUTFILE "$word " for (1..$count);
    }
}</pre>

and then passed the resulting files to <code>mallet import-dir</code>. It was quite cheap in time and space for the set of about 10^4 articles we were working on.

I don't think the mallet command-line tool can take a file of word counts. The MALLET java library must operate on word counts in the end, and I think if you interface with MALLET through java you can feed it the counts directly: cf. <a href="http://mallet.cs.umass.edu/import-devel.php">MALLET: Data Import for Java Developers</a>. I went with the fast-and-dumb method because my java is too weak to figure this out in short order and I wanted instant gratification.

If your data looks funky, write to the DfR support e-mail address; they were quite helpful to Ted and me. As Ted says it's a beta, but you are supposed to get wordcounts for the full articles. Remember that lots of jstor items are not articles but reviews, front and back matter, etc.
You can filter by item type "fla" (full-length article) to get only articles (put <code>ty:fla</code> in your search field).

edit: Or possibly you could convert the csv wordcounts to "SVMLight-style" feature:value pairs? Didn't try this, but see the bottom of: <a href="http://mallet.cs.umass.edu/import.php">http://mallet.cs.umass.edu/import.php</a>.

edit: One last note; see below.

tedunderwood, Sun, 28 Oct 2012 22:05:49 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1768

Hey, Andrew Goldstone and I topic modeled data from DfR. Andrew used MALLET, and I believe he did it the way you describe: reconstituted "text" from the word counts. Perhaps you could ping him for his script. I use my own Java script for topic modeling, and it works directly with word counts, so I didn't have to do that. But I haven't packaged my script for distribution, and MALLET is faster, anyway, so you'd want to use it -- even with the weird input quirk you mention.

My experience suggests that they're full word counts for the whole article. Anyway, for PMLA they were. But we were missing some metadata, as I recall, and had to reorder it. DfR is still semi kinda in beta mode.

cforster, Sun, 28 Oct 2012 21:56:19 +0000
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1767

Does anyone have any experience using topic modeling to analyze data from JSTOR's <a href="http://dfr.jstor.org/">"Data for Research"</a>? DFR lets you request datasets based on queries of the JSTOR database. Full-text of the JSTOR material, however, is not available.
Instead one can request keywords, various ngrams (bi, tri, or quad), or word counts; requesting the word counts, one gets a set of files: one file per article with the word counts (CSV or XML), and a manifest (connecting filenames to complete[ish] citations). (Minor note: Looking at the raw counts, it seems like these may be samples of the articles, not the full word counts for the whole article, though I'm not totally sure; anyone have any experience with DFR?) My question: is there a way to get MALLET to take word counts as input rather than raw text? Since topic modeling treats texts/documents as bags of words, it should be able to work with the frequency counts as effectively as with raw text, right? I could write a script to reassemble texts in the proportions described by the word frequencies, but that seems so utterly absurd that I suspect (hope) I may be missing something. Anyone have any experience here?
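
The reassembly script the question imagines, and which the replies in this thread end up using, can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not anyone's actual script: the names `counts_to_bag` and `convert_dir` are hypothetical, the sample data is made up, and the `WORDCOUNTS,WEIGHT` header is taken from the R and Perl snippets posted in the replies. Word order in the output is arbitrary, which should not matter to a model that treats documents as bags of words.

```python
import csv
import io
from pathlib import Path


def counts_to_bag(csv_text):
    """Expand a DfR wordcount CSV (header WORDCOUNTS,WEIGHT) into a string
    in which each word is repeated as many times as it was counted."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader, None)
    if header is None or header[:2] != ["WORDCOUNTS", "WEIGHT"]:
        raise ValueError("not a DfR wordcount file: unexpected header")
    words = []
    for row in reader:
        if len(row) < 2 or not row[0]:
            continue  # skip blank or malformed lines
        words.extend([row[0]] * int(row[1]))
    return " ".join(words)


def convert_dir(src_dir, out_dir):
    """Write one .txt bag of words per .csv file found in src_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for csv_path in sorted(Path(src_dir).iterdir()):
        if csv_path.suffix.lower() != ".csv":
            continue
        target = out / (csv_path.stem + ".txt")
        bag = counts_to_bag(csv_path.read_text(encoding="utf-8"))
        target.write_text(bag, encoding="utf-8")
        written.append(target)
    return written


# A tiny made-up file standing in for one DfR download.
sample = "WORDCOUNTS,WEIGHT\nthe,3\ntopic,2\nmodel,1\n"
print(counts_to_bag(sample))
```

The directory written by `convert_dir` holds one plain-text file per article, which is the kind of input the replies describe passing to <code>mallet import-dir</code>.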