Digital Humanities Questions & Answers » Tag: text analysis

Recent Posts: http://digitalhumanities.org/answers/tags/text-analysis (en-US, Wed, 20 Apr 2016 22:32:14 +0000)

johnlaudun on "Which novels to assign for seminar in algorithmic criticism / distant reading?"
http://digitalhumanities.org/answers/topic/which-novels-to-assign-for-seminar-in-algorithmic-criticism-distant-reading#post-2330
Wed, 20 May 2015 17:26:12 +0000

Replying to @ullyot's post (http://digitalhumanities.org/answers/topic/which-novels-to-assign-for-seminar-in-algorithmic-criticism-distant-reading#post-2329):

Sure. Right now I have no travel plans for July. When you're ready, drop me a note. I'm going to be working on the repo some in June, and I'm making a note to add the PDF. Part of my plan is also to turn chunks of it into an IPython notebook (now Jupyter), which would allow a mix of exposition (written in Markdown) and code that can be executed live. (I have a pretty complete run-down on how to set all this up on a Mac, if anyone is interested.)

ullyot on "Which novels to assign for seminar in algorithmic criticism / distant reading?"
http://digitalhumanities.org/answers/topic/which-novels-to-assign-for-seminar-in-algorithmic-criticism-distant-reading#post-2329
Wed, 20 May 2015 16:14:43 +0000

Replying to @johnlaudun's post (http://digitalhumanities.org/answers/topic/which-novels-to-assign-for-seminar-in-algorithmic-criticism-distant-reading#post-2328):

Thanks very much, John! I'd love to review the PDF version, too, and talk to you about exercises. Perhaps in July when I start developing the course?
You can reach me at ullyot[@]ucalgary[.]ca or @ullyot on Twitter.

johnlaudun on "Which novels to assign for seminar in algorithmic criticism / distant reading?"
http://digitalhumanities.org/answers/topic/which-novels-to-assign-for-seminar-in-algorithmic-criticism-distant-reading#post-2328
Fri, 15 May 2015 16:39:41 +0000

If you want something intermediate between Codecademy's Python tutorials and Jockers' R, as well as a small text through which you can work methodically with your students, I've got a small collection of commented Python scripts in a GitHub repository. It comes with a plain text version of "The Most Dangerous Game." I can also include a PDF version I've long used, which gives you a paginated source for comparison. If you're interested, I can also add notes on the exercises/activities that I use in conjunction with the text. (Essentially, I use it as a way to get students thinking about both the advantages of working through the particularities of texts computationally -- e.g., KWiC -- and some of the underlying statistics behind things like topic modeling.) Here's the repo: https://github.com/johnlaudun/upst

ullyot on "Which novels to assign for seminar in algorithmic criticism / distant reading?"
http://digitalhumanities.org/answers/topic/which-novels-to-assign-for-seminar-in-algorithmic-criticism-distant-reading#post-2327
Fri, 15 May 2015 13:29:58 +0000

I’m designing the graduate seminar I’ll teach in the Department of English this fall (2015) on the subject of ‘Algorithmic Criticism,’ a title I took from the subtitle of Stephen Ramsay’s 2011 book, Reading Machines. It’s an introduction to computational text-analysis for students of literature, from word frequency to topic modelling.
By the end of the course, students will be comfortable moving between close reading and distant reading, or what Matthew Jockers calls micro-, meso-, and macro-analysis. (Along with Ramsay’s book, Jockers’ 2013 study Macroanalysis and his 2014 guide to Text Analysis with R for Students of Literature will be required readings.) Students will learn and implement some programming basics using Python and R, so they can see what happens when natural-language processing and other tools parse and rearrange the words in both individual texts and larger corpora. I haven’t developed more detailed course outcomes than that. We’ll use Codecademy’s Python tutorials alongside Jockers’ book on R.

So which literary texts do you assign for old-fashioned linear close readings in a course like this? They should be long enough to have a lot of words to work with, and complex enough that they contain a lot of topics. They should provide good contrasts with each other (that is, contain a lot of different words and topics) yet be close enough in time that the comparison makes sense. And they should be in the public domain, so we have texts to manipulate in whatever repository we’re drawing them from.

petris.it@googlemail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2293
Sat, 28 Feb 2015 10:12:42 +0000

If you have tagged the chapters and the number of chapters is not that large, you could query the results for every chapter like this:

    tag="%" where tag="chapter1" boundary

This one assumes you have a tag for each chapter.

    tag="%" where tag="chapter" property="number" value="1" boundary

This one assumes you have only one chapter tag and a property that holds the chapter number.

But you are right, there should be a way of extracting the positions of each tag.
You could, as a workaround, extract the KWIC for each tag into its own CSV file, which gives you the positions of each instance that belongs to the tag you selected. If you add a tag column manually, you will then be able to merge the contents of the per-tag files into one file and get tags with positions.

aliciapeaker@gmail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2292
Fri, 27 Feb 2015 09:38:11 +0000

Wonderful! Thank you all for your replies! I've used CATMA to return the tag frequencies and then exported a CSV file with the compiled results. This gives me everything I need except the location in the text of each tag, which would enable me to track frequencies by chapter (for which I have a list of CATMA locations). Is there a way I could search CATMA for tags within a set of location ranges to output a set of results for each chapter?

petris.it@googlemail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2291
Fri, 27 Feb 2015 05:47:36 +0000

You could simply use the CATMA Analyzer to count and extract the tagged information. Assuming you have loaded text and annotations into the Tagger:

1. Click on "Analyze Document".
2. Type tag="%" into the query box and hit "Execute query".
3. Select the tab "Result by markup".

You'll see all tags with the frequency counts there. You can also export the results to a CSV file for further processing. So there is no need for painful XML/XSLT hacking so far.

Ondine on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2290
Thu, 26 Feb 2015 16:55:32 +0000

I can't pretend to be the most knowledgeable person about XML and especially not about querying it for data analysis purposes. But I have used the TEI for markup and delivery a good bit, generally in oXygen and generally with a fairly constrained set of the TEI P5 tags. From that, I can tell you that I rarely encounter anything as complicated as the markup CATMA is giving you. It seems far more complex than XML for most humanities encoding purposes would need to be, esp given that part of the point of XML, and esp TEI, is that it is human readable.

Of course, for some of the more complex content analysis goals that some DHers are pursuing with enormous corpora of humanities texts, this kind of markup may be necessary. But based on what I *think* you're trying to do, the complexity here might be unnecessarily mystifying your markup of your content. If you simply need to measure the frequency of specific tags that appear in the text, based on--I assume--your own criteria for how those tags should be applied, then it may be that a straightforward TEI document in a transparent editor (oXygen would be my choice) would give you far more control.

Simply counting the number of uses of a particular tag could be done in oXygen using an XPath query, which you can refine according to attributes, hierarchy, position, etc. The XPath wouldn't generate a new product from your XML, but it would give you a results list (plain text) with a count, showing where all the instances are. If you want a new product, you can use XSLT to generate a new XML document that retains just the elements you want and/or that adds sequential numbers to them, again based on attributes, hierarchy, position, etc., as a way to select exactly what you want.
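For anyone who would rather script the count than run it inside oXygen, here is a minimal sketch of the same XPath idea in Python's standard library. It is only an illustration under stated assumptions: the sample document, the use of `<seg type="...">` for the analytical tags, and `<div type="chapter" n="...">` for chapters are all hypothetical encoding choices, not something taken from the CATMA export, and the helper name `tag_counts_by_chapter` is made up.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical "straightforward" TEI-style fragment (invented sample text).
# A real TEI file normally declares the namespace http://www.tei-c.org/ns/1.0,
# in which case every element name in the queries below needs that prefix.
SAMPLE = """<text>
  <body>
    <div type="chapter" n="1">
      <p>Small feckless <seg type="weather">clouds</seg> were hurried across the sky.</p>
    </div>
    <div type="chapter" n="2">
      <p>The <seg type="weather">rain</seg> fell on the <seg type="plant">heather</seg>.</p>
    </div>
  </body>
</text>"""


def tag_counts_by_chapter(root):
    """Return {chapter number: Counter of seg/@type values} for each chapter div."""
    counts = {}
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    for chapter in root.findall(".//div[@type='chapter']"):
        counts[chapter.get("n")] = Counter(
            seg.get("type") for seg in chapter.findall(".//seg")
        )
    return counts


root = ET.fromstring(SAMPLE)
counts = tag_counts_by_chapter(root)

# Overall count of one tag, refined by attribute as described above.
weather_total = len(root.findall(".//seg[@type='weather']"))

for number, tally in counts.items():
    print(number, dict(tally))
print("weather overall:", weather_total)
```

The per-chapter loop is the part that matters for chapter-level frequencies: the same query is simply run inside each chapter element instead of over the whole document.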
The CATMA document looks so complicated that I would expect it to be very difficult to parse with XSLT, but parsing a more straightforward TEI P5 document for a count of specific tags shouldn't be so difficult. All that said, I don't use either tool often enough--and haven't recently enough--to be able to offer concrete direction. For that, I recommend going on the TEI discussion list, which you can sign up for here: http://www.tei-c.org/Support/#tei-l

Ethan Gruber on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2287
Thu, 26 Feb 2015 13:54:52 +0000

To clarify, is there also an fsDecl for the "non-living" category which contains fDecls for "weather" and other tags? It's doable in XSLT. I don't think you need to count the segs in the XSLT because your TEI (presumably) contains an <fs> for every annotation you've created in your body. You'll need to iterate through every fsDecl and perform a count of every fs that occurs elsewhere in the document that has a @type that is equal to the @xml:id of the fsDecl. You'd have to tweak this somewhat to include counts of the total tagset and to initiate the counts per chapter instead of overall. Without seeing more, it's difficult to construct XPath to handle the document chapter by chapter. See this gist for a basic bit of XSLT: https://gist.github.com/ewg118/6b0b99d953ae1f4d8eaf

aliciapeaker@gmail.com on "How to extract tagged data and text from TEI file?"
http://digitalhumanities.org/answers/topic/how-to-extract-tagged-data-and-text-from-tei-file#post-2286
Thu, 26 Feb 2015 12:56:54 +0000

I’ve been using CATMA (http://www.catma.de/) to mark up a text with some analytical tags I’ve created. I then exported the file in TEI, and I’m now trying to extract the data I’ve marked up in order to measure tag frequencies, but am finding it quite difficult. Rather than tagging text with the labels I’ve created, CATMA has established a somewhat complicated (though likely necessary) system of identifiers. So, for example, I’ve tagged the word “clouds” in my text with the tag “weather,” which is a child of the tagset “non-living.” CATMA represents the tag in the text like this:

    <text>
      <body>
        <ab type="catma">
          Small feckless <seg ana="#CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26">clouds</seg> were hurried across the vast untroubled sky...
        </ab>
      </body>
    </text>

The identifier then points to this feature statement after the body of the text:

    <text>
      <body>
      </body>
      <fs xml:id="CATMA_0036983F-4D37-48C2-8BC7-5846A8364D26" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
        <f name="catma_displaycolor">
          <string>-16710765</string>
        </f>
        <f name="catma_markupauthor">
          <string>name@email</string>
        </f>
      </fs>
    </text>

The id for the type of the fs then points back up to the feature statement declaration in the header:

    <teiHeader>
      <encodingDesc>
        <fsDecl xml:id="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE" n="2014-12-16T13:30:36.000+0000" type="CATMA_3CDE1FE4-CA5D-4460-9BFF-739537D753DE">
          <fsDescr>Weather</fsDescr>
          <fDecl xml:id="CATMA_699BAC76-8D15-408E-A30A-984849115A71" name="catma_displaycolor">
            <vRange>
              <vColl>
                <string>-16710765</string>
              </vColl>
            </vRange>
          </fDecl>
          <fDecl xml:id="CATMA_8653855B-B611-48E8-AE9D-00E0160A37DB" name="catma_markupauthor">
            <vRange>
              <vColl>
                <string>name@email</string>
              </vColl>
            </vRange>
          </fDecl>
        </fsDecl>
      </encodingDesc>
    </teiHeader>

I need to extract the text and data, perhaps in a CSV file (or other output format, if it’s easier), into something that lists the tagged text (e.g. “clouds”) in one column, the first tag applied to it in the next column (e.g. "weather"), and the tagset or category to which that tag belongs in the next (e.g. "non-living"). Or perhaps there’s a better way: really, what I’d like to be able to do is get the frequencies of each tag & tagset for each chapter. If there’s an easier way to mark up the text in TEI that would better allow for what I need, I’m open to re-encoding manually.

I’ve also tried playing around a bit with some XSLT and a Python script (http://www.rdegges.com/quickly-extract-xml-data-with-python/), but with very little experience with either, I find myself quickly out of my depth. Open to suggestions, and thanks in advance for your help!

Ben Marwick on "Topic Modeling (MALLET) with JSTOR Data For Research"
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-2142
Thu, 30 Jan 2014 19:50:12 +0000

Replying to @Ben Marwick's post (http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1842):

Just a short follow-up: I have now bundled my snippets into a more complete R package for working with JSTOR DFR data.
It takes DFR output and does ngrams, word correlations over time, document clustering and topic modelling (with MALLET or in R, and including hot and cold topic identification): https://github.com/UW-ARCHY-textual-macroanalysis-lab/JSTORr

johnlaudun on "Topic Modeling (MALLET) with JSTOR Data For Research"
http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1932
Wed, 13 Mar 2013 18:43:40 +0000

Replying to @Michael Widner's post (http://digitalhumanities.org/answers/topic/topic-modeling-mallet-with-jstor-data-for-research#post-1872):

I didn't know about GenSim. What a great library for Python, and the site has some really nice explanatory material on it, too. Thanks for the link.

Josh on "How does one prepare and use data for network analysis with Gephi?"
http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1885
Thu, 14 Feb 2013 23:02:35 +0000

John: Thanks for the links! As for more on what I want to do, I'll try my best to summarize it (though the idea is still in early formation, and I don't yet know everything I want to do, look for, etc.). Also, the corpus, as it stands, is only 4 novels at around 800 pages, but this will increase as this author's novels are translated into English (and, in the future, I'd also like to add works from similar authors to the corpus). My ideas:

1. Extract names, places, and titles of works from the text
2. Perform some kind of frequency rankings (within works and across corpus)
3. Visualize connections: e.g.
authors & their works mentioned in relation to each other, which authors are most mentioned in relation to each other, groupings of real and imaginary authors (something important and often ambiguous in these texts), and, eventually, overlaps between this network and similar networks created by other authors.

In other words, these novels, especially taken as a whole, embody a fairly vast historical (and sometimes fictional) network of literary figures and their works, and I'd like to extract these from the corpus in order to (a) better access and analyze them, and (b) expose this labyrinthine network that might get lost (in its overwhelming totality) within each narrative and even more easily across multiple "distinct" works. Other questions: can narrative arise just from, say, a network of authors? What makes metafiction more than that + annotation? etc.

Again, these are just my initial ideas, and this is an intellectual side project, but one that has implications in my daily work as a librarian working in the digital humanities; oftentimes we need a project of our own to really acquire and retain substantive new skills. For more insight into Vila-Matas' metafictions, here's a good essay: http://bit.ly/XHkPQ1

johnlaudun on "How does one prepare and use data for network analysis with Gephi?"
http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1884
Thu, 14 Feb 2013 21:34:39 +0000

Replying to @Josh Honn's post (http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1883):

Depending on what you are doing, and depending on how much data we are talking about, you may not need to go the TEI route.
If you are on a Mac, I've written about setting up NLTK here: http://johnlaudun.org/20121230-macports-the-key-to-python-happiness/. The TL;DR version is here: http://johnlaudun.org/20121230-macports-for-nltk/. Write back with more info and I'm sure more people will kick in with help.

Josh on "How does one prepare and use data for network analysis with Gephi?"
http://digitalhumanities.org/answers/topic/how-does-one-prepare-and-use-data-for-network-analysis-with-gephi#post-1883
Thu, 14 Feb 2013 18:53:08 +0000

Oh great, another Python book to read! Just kidding. Thanks, Korey (and Justin)! NLTK looks like the way to go. Just glancing over the extracting information chapter, and from others who have suggested TEI, it's clear that I need to, in some way, go from unstructured to structured text before I do anything else. And now back to my irregularly scheduled Python reading/learning.
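To make the "unstructured to structured" step in this thread a little more concrete, here is a rough sketch of one possible route to Gephi: count which names co-occur in the same paragraph and write the pairs out as an edge list with Source, Target and Weight columns, which Gephi's spreadsheet import can read as an edges table. Everything specific here is an assumption for illustration: the sample paragraphs are invented, the list of names is hand-compiled rather than extracted with NLTK, plain substring matching stands in for real name recognition, and the helper names `cooccurrence_edges` and `write_edge_csv` are hypothetical.

```python
import csv
import io
from collections import Counter
from itertools import combinations


def cooccurrence_edges(paragraphs, names):
    """Count how often each pair of names appears in the same paragraph."""
    edges = Counter()
    for paragraph in paragraphs:
        # Naive substring match; a stand-in for proper named-entity extraction.
        present = sorted(name for name in names if name in paragraph)
        edges.update(combinations(present, 2))
    return edges


def write_edge_csv(edges, handle):
    """Write an undirected, weighted edge list with Gephi-style column headers."""
    writer = csv.writer(handle)
    writer.writerow(["Source", "Target", "Weight"])
    for (source, target), weight in sorted(edges.items()):
        writer.writerow([source, target, weight])


# Invented sample data, not taken from any of the novels discussed.
paragraphs = [
    "Kafka and Walser are mentioned together here.",
    "Walser appears again, this time next to Kafka and Melville.",
    "Melville appears alone.",
]
names = ["Kafka", "Walser", "Melville"]

edges = cooccurrence_edges(paragraphs, names)

# Written to an in-memory buffer here; pass an open file handle to save a CSV.
buffer = io.StringIO()
write_edge_csv(edges, buffer)
print(buffer.getvalue())
```

The unit of co-occurrence (paragraph, page, chapter) is a modelling decision rather than a technical one, and changing it will change the shape of the resulting network.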