<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="bbPress/1.0.2" -->
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom">
	<channel>
		<title>Digital Humanities Questions &#38; Answers &#187; Tag: text analysis - Recent Posts</title>
		<link>http://digitalhumanities.org/answers/tags/text-analysis</link>
		<description>Digital Humanities Questions &amp; Answers &#187; Tag: text analysis - Recent Posts</description>
		<language>en-US</language>
		<pubDate>Tue, 21 May 2013 17:45:22 +0000</pubDate>
		<generator>http://bbpress.org/?v=1.0.2</generator>
		<textInput>
			<title><![CDATA[Search]]></title>
			<description><![CDATA[Search all topics from these forums.]]></description>
			<name>q</name>
			<link>http://digitalhumanities.org/answers/search.php</link>
		</textInput>
		<atom:link href="http://digitalhumanities.org/answers/rss/tags/text-analysis" rel="self" type="application/rss+xml" />

		<item>
			 
				<title>johnlaudun on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1932</link>
			<pubDate>Wed, 13 Mar 2013 22:43:40 +0000</pubDate>
			<dc:creator>johnlaudun</dc:creator>
			<guid isPermaLink="false">1932@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @Michael Widner's &#60;a href=&#34;http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1872&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;I didn't know about GenSim. What a great library for Python, and the site has some really nice explanatory material on it, too. Thanks for the link.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Josh Honn on "How does one prepare and use data for network analysis with Gephi?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1885</link>
			<pubDate>Fri, 15 Feb 2013 03:02:35 +0000</pubDate>
			<dc:creator>Josh Honn</dc:creator>
			<guid isPermaLink="false">1885@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;John: Thanks for the links! As for more on what I want to do, I'll try my best to summarize it (though the idea is still taking shape, and I don't yet know everything I want to do, look for, etc.). Also, the corpus, as it stands, is only 4 novels at around 800 pages, but this will increase as this author's novels are translated into English (and, in the future, I'd also like to add works from similar authors to the corpus). My ideas:&#60;/p&#62;
&#60;p&#62;1. Extract names, places, and titles of works from the text&#60;br /&#62;
2. Perform some kind of frequency rankings (within works and across corpus)&#60;br /&#62;
3. Visualize connections: e.g. authors &#38;amp; their works mentioned in relation to each other, which authors are most mentioned in relation to each other, groupings of real and imaginary authors (something important and often ambiguous in these texts), and, eventually, locate overlaps between this network and similar networks created by other authors.&#60;/p&#62;
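&#60;p&#62;A minimal Python sketch of steps 2 and 3, assuming the names have already been extracted (step 1) and that two names count as connected when they occur in the same passage. The function names, the sample passages, and the placeholder names are hypothetical; the Source/Target/Weight header is the edge-list layout that Gephi's spreadsheet import reads.&#60;/p&#62;

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(passages, names):
    """Count how often each pair of names appears in the same passage."""
    edges = Counter()
    for passage in passages:
        # Plain substring matching; real texts would need smarter matching.
        present = sorted(name for name in names if name in passage)
        for pair in combinations(present, 2):
            edges[pair] += 1
    return edges


def write_edge_csv(edges, handle):
    """Write one Source,Target,Weight row per connected pair."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# Hypothetical sample data: invented passages and placeholder names.
passages = [
    "Alpha and Gamma are mentioned beside Beta.",
    "Gamma again, this time with Alpha.",
    "Beta alone.",
]
names = ["Alpha", "Beta", "Gamma"]

edges = cooccurrence_edges(passages, names)
buffer = io.StringIO()
write_edge_csv(edges, buffer)
print(buffer.getvalue())
```

&#60;p&#62;Saving that CSV text to a file gives an edge table to load into Gephi. How the novels are split into passages (paragraphs, pages, chapters) is a modeling choice that changes the resulting network.&#60;/p&#62;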
&#60;p&#62;In other words, these novels, especially taken as a whole, embody a fairly vast historical (and sometimes fictional) network of literary figures and their works, and I'd like to extract these from the corpus in order to better access and analyze them, while at the same time exposing this labyrinthine network that might get lost (in its overwhelming totality) within each narrative and even more easily across multiple &#34;distinct&#34; works. Other questions: can narrative arrive just from, say, a network of authors? What makes metafiction more than that + annotation? etc. &#60;/p&#62;
&#60;p&#62;Again, these are just my initial ideas, and this is an intellectual side project but one that has implications in my daily work as a librarian working in the digital humanities; often times we need a project of our own to really acquire and retain substantive new skills. For more insight into Vila-Matas' metafictions, &#60;a href=&#34;http://bit.ly/XHkPQ1&#34;&#62;here's a good essay&#60;/a&#62;.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>johnlaudun on "How does one prepare and use data for network analysis with Gephi?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1884</link>
			<pubDate>Fri, 15 Feb 2013 01:34:39 +0000</pubDate>
			<dc:creator>johnlaudun</dc:creator>
			<guid isPermaLink="false">1884@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @Josh Honn's &#60;a href=&#34;http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1883&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Depending on what you are doing, and depending on how much data we are talking about, you may not need to go the TEI route. If you are on a Mac, I've written about setting up NLTK here: &#60;a href=&#34;http://johnlaudun.org/20121230-macports-the-key-to-python-happiness/&#34; rel=&#34;nofollow&#34;&#62;http://johnlaudun.org/20121230-macports-the-key-to-python-happiness/&#60;/a&#62;. The TL;DR version is here: &#60;a href=&#34;http://johnlaudun.org/20121230-macports-for-nltk/&#34; rel=&#34;nofollow&#34;&#62;http://johnlaudun.org/20121230-macports-for-nltk/&#60;/a&#62;. &#60;/p&#62;
&#60;p&#62;Write back with more info and I'm sure more people will kick in with help.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Josh Honn on "How does one prepare and use data for network analysis with Gephi?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1883</link>
			<pubDate>Thu, 14 Feb 2013 22:53:08 +0000</pubDate>
			<dc:creator>Josh Honn</dc:creator>
			<guid isPermaLink="false">1883@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;Oh great, another Python book to read! Just kidding. Thanks, Korey (and Justin)! NLTK looks like the way to go. Just glancing over the extracting information chapter, and from others who have suggested TEI, it's clear that I need to, in some way, go from unstructured to structured text before I do anything else. And now back to my irregularly scheduled Python reading/learning.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Korey Jackson on "How does one prepare and use data for network analysis with Gephi?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1881</link>
			<pubDate>Thu, 14 Feb 2013 21:10:17 +0000</pubDate>
			<dc:creator>Korey Jackson</dc:creator>
			<guid isPermaLink="false">1881@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;Hey Josh,&#60;br /&#62;
Just talking to a friend of mine (Justin Joque at the spatial and numeric data library here at Michigan). One thing he mentioned was that Cytoscape might be a little more user- and spreadsheet-friendly...though one strength of Gephi is its ability to animate networks over time, if that's something you're looking for.&#60;/p&#62;
&#60;p&#62;More generally, he thought it sounded like you were needing to do some named entity extraction, and recommended the Python-based &#60;a href=&#34;http://nltk.org&#34;&#62;nltk.org&#60;/a&#62; for that. They also host a great resource--&#60;em&#62;Natural Language Processing with Python&#60;/em&#62;--at &#60;a href=&#34;http://nltk.org/book/&#34;&#62;nltk.org/book&#60;/a&#62;.&#60;/p&#62;
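&#60;p&#62;For a feel of what the extraction step produces, here is a rough stand-in written with only the Python standard library. It is not NLTK's named entity chunker: it simply collects runs of capitalized words and counts them. &#60;code&#62;SENTENCE_OPENERS&#60;/code&#62;, &#60;code&#62;candidate_names&#60;/code&#62;, and the sample sentences are made up for illustration.&#60;/p&#62;

```python
import re
from collections import Counter

# Words that often start a sentence capitalized but are not names
# (a tiny illustrative list, not a real stop list).
SENTENCE_OPENERS = {"The", "Then", "In", "A", "An", "He", "She", "It"}

# One or more capitalized words separated by single spaces.
CAPITALIZED_RUN = re.compile(r"\b[A-Z][\w'-]*(?: [A-Z][\w'-]*)*")


def candidate_names(text):
    """Return runs of capitalized words, minus leading sentence openers."""
    candidates = []
    for match in CAPITALIZED_RUN.finditer(text):
        words = match.group().split()
        while words and words[0] in SENTENCE_OPENERS:
            words = words[1:]
        if words:
            candidates.append(" ".join(words))
    return candidates


# Hypothetical sample text with placeholder names.
text = "Ann Lee wrote to Bob Ray. Then Ann Lee left Paris."
ranking = Counter(candidate_names(text))
for name, count in ranking.most_common():
    print(name, count)
```

&#60;p&#62;A heuristic like this skips names that begin with an accented capital or a lowercase particle and picks up any capitalized noun, which is the gap a trained tagger such as NLTK's is meant to close.&#60;/p&#62;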
&#60;p&#62;Hope that helps!
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Josh Honn on "How does one prepare and use data for network analysis with Gephi?"</title>
						<link>http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1880</link>
			<pubDate>Thu, 14 Feb 2013 16:09:22 +0000</pubDate>
			<dc:creator>Josh Honn</dc:creator>
			<guid isPermaLink="false">1880@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;There are plenty of tutorials on the web that are useful for learning Gephi, but I've encountered a much steeper learning curve for the steps prior—such as extracting and preparing data from literary texts to be used with Gephi. In my example, I am interested in performing a network analysis of the social networks (both real and imaginary) in the works of Enrique Vila-Matas (I'd eventually like to expand this corpus to other authors). Acquiring the digital text and working with Gephi I can do (or the latter learn), but it's the very important intermediary steps of preparing the data (extraction of names and places to getting those in a form—database? spreadsheet?—analyzable by Gephi) that I need help with. Any good reading, tutorials, or other resources out there? Other recommendations? I'm working on a Mac and have a limited knowledge of Python, so any Mac-friendly software would be a big help. Thanks!
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Michael Widner on "Stopword list for Old English"</title>
						<link>http://digitalhumanities.org/answers/topic/stopword-list-for-old-english#post-1874</link>
			<pubDate>Fri, 01 Feb 2013 21:14:50 +0000</pubDate>
			<dc:creator>Michael Widner</dc:creator>
			<guid isPermaLink="false">1874@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;Thanks for the response. That fits what I was expecting given the variability of the language.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>scottkleinman on "Stopword list for Old English"</title>
						<link>http://digitalhumanities.org/answers/topic/stopword-list-for-old-english#post-1873</link>
			<pubDate>Fri, 01 Feb 2013 20:36:23 +0000</pubDate>
			<dc:creator>scottkleinman</dc:creator>
			<guid isPermaLink="false">1873@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;I've tried to generate Old English stop word lists for various experiments from time to time, but never to my satisfaction. Spelling variation is a problem, though not as bad as Middle English. But there is also a tremendous amount of homography--so much that it is almost better to generate a list of homographs and tag forms with separate meanings in the source text(s) before doing anything else. After that, you can just generate a list of the most frequent forms in the corpus. You'll probably then have to tailor this list by deleting forms like &#34;wæs&#34; or &#34;cwom&#34; if they're not relevant to your experiment.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Michael Widner on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1872</link>
			<pubDate>Fri, 01 Feb 2013 19:11:14 +0000</pubDate>
			<dc:creator>Michael Widner</dc:creator>
			<guid isPermaLink="false">1872@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;If you're planning on working in Python anyway, you might want to look at the gensim library: &#60;a href=&#34;http://radimrehurek.com/gensim/&#34; rel=&#34;nofollow&#34;&#62;http://radimrehurek.com/gensim/&#60;/a&#62; It would let you perform some topic modeling in the code, so wouldn't require the stop-gap between word frequencies and mallet.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Michael Widner on "Stopword list for Old English"</title>
						<link>http://digitalhumanities.org/answers/topic/stopword-list-for-old-english#post-1870</link>
			<pubDate>Fri, 01 Feb 2013 19:03:28 +0000</pubDate>
			<dc:creator>Michael Widner</dc:creator>
			<guid isPermaLink="false">1870@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;I haven't been able to turn up an existing stop word list for Old English. Does anyone know of one? Thanks!
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Ben Marwick on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1842</link>
			<pubDate>Mon, 31 Dec 2012 22:35:36 +0000</pubDate>
			<dc:creator>Ben Marwick</dc:creator>
			<guid isPermaLink="false">1842@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @&#60;a href='http://digitalhumanities.org/answers/profile/cforster'&#62;cforster&#60;/a&#62;'s &#60;a href=&#34;http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1767&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Here are a few lines of R that I'm working with for topic modelling. These lines should:&#60;/p&#62;
&#60;p&#62;1. read in the JSTOR CSV wordcount files to R&#60;br /&#62;
2. convert them from a table of words and their counts to a 'bag of words'&#60;br /&#62;
3. for each CSV file, create a txt file of the 'bag of words' ready for MALLET&#60;/p&#62;
&#60;p&#62;A sample of 1000 articles from 'American Antiquity' takes about 7 sec to read in the CSV files and about 6 sec to write the 'bag-of-words' txt files&#60;/p&#62;


&#60;div class=&#34;bb_syntax&#34;&#62;&#60;table&#62;&#60;tr&#62;&#60;td class=&#34;code&#34;&#62;&#60;pre class=&#34;r&#34; style=&#34;font-family:monospace;&#34;&#62;# set working directory, i.e. location of JSTOR DfR CSV
# files on the computer
setwd(&#38;quot;C:\\some directory with JSTOR DfR CSV files&#38;quot;)
&#38;nbsp;
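# create a list of all the CSV files; pattern is a regular expression,
# so match a literal .csv extension, in upper or lower case
myFiles &#38;lt;- list.files(pattern=&#38;quot;\\.csv$&#38;quot;, ignore.case=TRUE)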
&#38;nbsp;
# read in all the CSV files to an R data object
myData &#38;lt;-  lapply(myFiles, read.csv)
&#38;nbsp;
# assign file names to each dataframe in the list
names(myData) &#38;lt;- myFiles
&#38;nbsp;
# Here's the step where we turn the JSTOR DfR 'wordcount' into
# the 'bag of words' that's typically needed for topic modelling
# The R process is 'untable-ing' each CSV file into a
# list of word vectors, one vector per file (lapply always
# returns a list, whereas sapply may simplify to a matrix)
myUntabledData &#38;lt;- lapply(seq_along(myData),
  function(x) {rep(myData[[x]]$WORDCOUNTS, times = myData[[x]]$WEIGHT)})
&#38;nbsp;
# And here's the step where we create individual txt files
# for each data frame (formerly a CSV file) that should be suitable for
# input into MALLET.
names(myUntabledData) &#38;lt;- myFiles
sapply(myFiles,
  function (x) write.table(myUntabledData[x], file=paste(x, &#38;quot;txt&#38;quot;, sep=&#38;quot;.&#38;quot;),
                          quote = FALSE, row.names = FALSE, col.names = FALSE, eol = &#38;quot; &#38;quot; ))
&#38;nbsp;
# Look in the working directory to find the txt files&#60;/pre&#62;&#60;/td&#62;&#60;/tr&#62;&#60;/table&#62;&#60;/div&#62;



&#60;p&#62;I have a few more snippets that use the citations CSV for filtering and attaching biblio data to the R data and topic modelling using R (both packages) and MALLET. Some of these are here: &#60;a href=&#34;https://gist.github.com/benmarwick&#34; rel=&#34;nofollow&#34;&#62;https://gist.github.com/benmarwick&#60;/a&#62;
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Andrew Goldstone on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1788</link>
			<pubDate>Mon, 12 Nov 2012 21:14:44 +0000</pubDate>
			<dc:creator>Andrew Goldstone</dc:creator>
			<guid isPermaLink="false">1788@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @&#60;a href='http://digitalhumanities.org/answers/profile/cforster'&#62;cforster&#60;/a&#62;'s &#60;a href=&#34;http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1773&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Looked again at this, am embarrassed by that snippet. Here's a whole script for making bags of words from jstor's wordcount csv files: &#60;a href=&#34;https://github.com/agoldst/dfr-analysis/blob/master/count2txt&#34; rel=&#34;nofollow&#34;&#62;https://github.com/agoldst/dfr-analysis/blob/master/count2txt&#60;/a&#62; . It is still in perl, so there.&#60;/p&#62;
&#60;p&#62;I couldn't get &#60;code&#62;mallet train-topics&#60;/code&#62; to work on an instance file produced from &#60;code&#62;mallet import-svmlight&#60;/code&#62;, but I didn't try very hard.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>cforster on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1773</link>
			<pubDate>Mon, 29 Oct 2012 13:48:31 +0000</pubDate>
			<dc:creator>cforster</dc:creator>
			<guid isPermaLink="false">1773@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @Andrew Goldstone's &#60;a href=&#34;http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1771&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Thanks very much; I'll likely cobble together a similar script in Python (Perl? What is this, the nineties? I kid... I kid...). Very much appreciate it.&#60;/p&#62;
&#60;p&#62;And thanks for the &#34;fla&#34; filter tip. I may revise my query with that in mind. I'm a little suspicious of the data because I ran a quick word count for one file against the data they provided and it didn't seem to match up. I'll look into that at greater length and then get in touch with the DfR folks should my suspicion be borne out.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>cforster on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1772</link>
			<pubDate>Mon, 29 Oct 2012 13:45:36 +0000</pubDate>
			<dc:creator>cforster</dc:creator>
			<guid isPermaLink="false">1772@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @&#60;a href='http://digitalhumanities.org/answers/profile/tedunderwood'&#62;tedunderwood&#60;/a&#62;'s &#60;a href=&#34;http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1768&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Excellent; thanks very much.
&#60;/p&#62;</description>
		</item>
		<item>
			 
				<title>Andrew Goldstone on "Topic Modeling (MALLET) with JSTOR Data For Research"</title>
						<link>http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1771</link>
			<pubDate>Mon, 29 Oct 2012 13:25:53 +0000</pubDate>
			<dc:creator>Andrew Goldstone</dc:creator>
			<guid isPermaLink="false">1771@http://digitalhumanities.org/answers/</guid>
			<description>&#60;p&#62;&#60;em&#62;Replying to @&#60;a href='http://digitalhumanities.org/answers/profile/cforster'&#62;cforster&#60;/a&#62;'s &#60;a href=&#34;http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1767&#34;&#62;post&#60;/a&#62;:&#60;/em&#62;&#60;/p&#62;
&#60;p&#62;Glad you're trying out DFR too. Lots of treasures buried in there, I feel sure.&#60;/p&#62;
&#60;p&#62;As Ted says, when I applied mallet to the data from jstor, I just reconstituted a bag of words from each word-count file in csv format, basically with this perl:&#60;/p&#62;


&#60;div class=&#34;bb_syntax&#34;&#62;&#60;table&#62;&#60;tr&#62;&#60;td class=&#34;code&#34;&#62;&#60;pre class=&#34;perl&#34; style=&#34;font-family:monospace;&#34;&#62;&#60;span style=&#34;color: #b1b100;&#34;&#62;my&#60;/span&#62; &#60;span style=&#34;color: #0000ff;&#34;&#62;$header&#60;/span&#62; &#60;span style=&#34;color: #339933;&#34;&#62;=&#60;/span&#62; &#60;span style=&#34;color: #009999;&#34;&#62;&#38;lt;INFILE&#38;gt;&#60;/span&#62;&#60;span style=&#34;color: #339933;&#34;&#62;;&#60;/span&#62;
&#60;span style=&#34;color: #000066;&#34;&#62;die&#60;/span&#62; &#60;span style=&#34;color: #b1b100;&#34;&#62;unless&#60;/span&#62; &#60;span style=&#34;color: #0000ff;&#34;&#62;$header&#60;/span&#62; &#60;span style=&#34;color: #339933;&#34;&#62;=~&#60;/span&#62; &#60;span style=&#34;color: #009966; font-style: italic;&#34;&#62;/^WORDCOUNTS,WEIGHT/&#60;/span&#62;&#60;span style=&#34;color: #339933;&#34;&#62;;&#60;/span&#62;
&#38;nbsp;
&#60;span style=&#34;color: #b1b100;&#34;&#62;while&#60;/span&#62;&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#40;&#60;/span&#62;&#60;span style=&#34;color: #009999;&#34;&#62;&#38;lt;INFILE&#38;gt;&#60;/span&#62;&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#41;&#60;/span&#62; &#60;span style=&#34;color: #009900;&#34;&#62;&#38;#123;&#60;/span&#62;
    &#60;span style=&#34;color: #b1b100;&#34;&#62;my&#60;/span&#62; &#60;span style=&#34;color: #009900;&#34;&#62;&#38;#40;&#60;/span&#62;&#60;span style=&#34;color: #0000ff;&#34;&#62;$word&#60;/span&#62;&#60;span style=&#34;color: #339933;&#34;&#62;,&#60;/span&#62;&#60;span style=&#34;color: #0000ff;&#34;&#62;$count&#60;/span&#62;&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#41;&#60;/span&#62; &#60;span style=&#34;color: #339933;&#34;&#62;=&#60;/span&#62; &#60;span style=&#34;color: #000066;&#34;&#62;split&#60;/span&#62; &#60;span style=&#34;color: #ff0000;&#34;&#62;','&#60;/span&#62;&#60;span style=&#34;color: #339933;&#34;&#62;;&#60;/span&#62;
    &#60;span style=&#34;color: #b1b100;&#34;&#62;if&#60;/span&#62;&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#40;&#60;/span&#62;&#60;span style=&#34;color: #0000ff;&#34;&#62;$word&#60;/span&#62;&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#41;&#60;/span&#62; &#60;span style=&#34;color: #009900;&#34;&#62;&#38;#123;&#60;/span&#62;
        &#60;span style=&#34;color: #000066;&#34;&#62;print&#60;/span&#62; OUTFILE &#60;span style=&#34;color: #ff0000;&#34;&#62;&#38;quot;$word &#38;quot;&#60;/span&#62; &#60;span style=&#34;color: #b1b100;&#34;&#62;for&#60;/span&#62; &#60;span style=&#34;color: #009900;&#34;&#62;&#38;#40;&#60;/span&#62;1&#60;span style=&#34;color: #339933;&#34;&#62;..&#60;/span&#62;&#60;span style=&#34;color: #0000ff;&#34;&#62;$count&#60;/span&#62;&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#41;&#60;/span&#62;&#60;span style=&#34;color: #339933;&#34;&#62;;&#60;/span&#62;
    &#60;span style=&#34;color: #009900;&#34;&#62;&#38;#125;&#60;/span&#62;
&#60;span style=&#34;color: #009900;&#34;&#62;&#38;#125;&#60;/span&#62;&#60;/pre&#62;&#60;/td&#62;&#60;/tr&#62;&#60;/table&#62;&#60;/div&#62;



&#60;p&#62;and then passed the resulting files to mallet import-dir. It was quite cheap in time and space for the set of about 10^4 articles we were working on.&#60;/p&#62;
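&#60;p&#62;For anyone who would rather do this step in Python, a minimal sketch of the same idea follows. It assumes the &#60;code&#62;WORDCOUNTS,WEIGHT&#60;/code&#62; header that the Perl above checks for; the function name and the two sample rows are hypothetical.&#60;/p&#62;

```python
import csv
import io


def bag_of_words(handle):
    """Repeat each word WEIGHT times, reading WORDCOUNTS,WEIGHT rows."""
    reader = csv.reader(handle)
    header = next(reader)
    if header[:2] != ["WORDCOUNTS", "WEIGHT"]:
        raise ValueError("unexpected header: %r" % (header,))
    words = []
    for row in reader:
        # Ignore blank or malformed rows and rows with an empty word.
        if len(row) == 2 and row[0]:
            words.extend([row[0]] * int(row[1]))
    return " ".join(words)


# Hypothetical stand-in for one DfR wordcount file.
sample = io.StringIO("WORDCOUNTS,WEIGHT\nthe,3\nnetwork,2\n")
text = bag_of_words(sample)
print(text)
```

&#60;p&#62;Writing each returned string to its own txt file gives one document per article, which is the shape of input that the import-dir step described here expects.&#60;/p&#62;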
&#60;p&#62;I don't think the mallet command-line tool can take a file of word counts. The MALLET java library must operate on word counts in the end, and I think if you interface with MALLET through java you can feed it the counts directly: cf. &#60;a href=&#34;http://mallet.cs.umass.edu/import-devel.php&#34;&#62;MALLET: Data Import for Java Developers&#60;/a&#62;. I went with the fast-and-dumb method because my java is too weak to figure this out in short order and I wanted instant gratification.&#60;/p&#62;
&#60;p&#62;If your data looks funky, write the dfr support e-mail address; they were quite helpful to Ted and me. As Ted says it's a beta, but you are supposed to get wordcounts for the full articles. Remember that lots of jstor items are not articles but reviews, front and back matter, etc. You can filter by item type &#34;fla&#34; (full-length article) to get only articles (put &#60;code&#62;ty:fla&#60;/code&#62; in your search field).&#60;/p&#62;
&#60;p&#62;&#60;strong&#62;edit:&#60;/strong&#62; Or possibly you could convert the csv wordcounts to &#34;SVMLight-style&#34; feature:value pairs? Didn't try this, but see the bottom of: &#60;a href=&#34;http://mallet.cs.umass.edu/import.php&#34;&#62;http://mallet.cs.umass.edu/import.php&#60;/a&#62;.&#60;/p&#62;
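&#60;p&#62;A sketch of what that conversion might look like in Python, assuming one line per document of the form &#60;code&#62;label feature:value feature:value ...&#60;/code&#62;. The function name, the label, and the counts are hypothetical, and the output has not been run through &#60;code&#62;mallet import-svmlight&#60;/code&#62; here.&#60;/p&#62;

```python
def svmlight_line(label, counts):
    """Format one document as: label word:count word:count ..."""
    pairs = []
    for word, count in sorted(counts.items()):
        # Assumed restriction: a feature name containing a colon or
        # whitespace would be ambiguous in this format, so skip it.
        if ":" in word or any(ch.isspace() for ch in word):
            continue
        pairs.append("%s:%d" % (word, count))
    return " ".join([label] + pairs)


# Hypothetical label and word counts for a single article.
line = svmlight_line("article1", {"network": 2, "the": 3, "a:b": 1})
print(line)
```

&#60;p&#62;This keeps the counts as counts instead of repeating each word, so the file stays small; whether the resulting instances then work for topic training is a separate question.&#60;/p&#62;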
&#60;p&#62;&#60;strong&#62;edit:&#60;/strong&#62; One last note; see below.
&#60;/p&#62;</description>
		</item>

	</channel>
</rss>
