Comments on: Mine your own business http://virginia2012.thatcamp.org/04/20/mine-your-own-business/ The Humanities and Technology Camp Thu, 24 May 2012 00:58:51 +0000

By: cforster http://virginia2012.thatcamp.org/04/20/mine-your-own-business/#comment-416 Fri, 20 Apr 2012 23:48:56 +0000 http://virginia2012.thatcamp.org/?p=1125#comment-416 I looked at some Conrad texts in MALLET today; I wrote a quick Ruby script (it’s very stupid and liable to break, but I’m happy to share) to pull words out of a text file (proper names, mostly), and then fed that into MALLET. I’m happy to share my results.
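
For anyone who wants to try the same kind of thing, here is a minimal sketch of one way to pull candidate proper names out of a text. It is not the Ruby script mentioned above; it is a Python stand-in built on an assumed heuristic (a capitalized word that never shows up in lowercase elsewhere in the text is probably a name). The function name, the `min_count` parameter, and the sample sentences are all made up for illustration.

```python
import re
from collections import Counter


def candidate_proper_names(text, min_count=2):
    """Return capitalized words that never appear in lowercase in the text.

    Heuristic only: sentence-initial words such as "The" are filtered out
    because "the" also occurs in lowercase, but rare capitalized words can
    still slip through, so the result needs a manual check.
    """
    tokens = re.findall(r"[A-Za-z]+", text)
    lowercase_seen = {t for t in tokens if t.islower()}
    # Count tokens shaped like "Marlow": first letter upper, rest lower.
    counts = Counter(t for t in tokens if t[0].isupper() and t[1:].islower())
    return sorted(
        word
        for word, n in counts.items()
        if n >= min_count and word.lower() not in lowercase_seen
    )


# Invented sample text, not taken from any real edition.
sample = (
    "Marlow sat apart. The river was dark, and the men waited. "
    "Kurtz was ill, Marlow said. The men thought Kurtz would die."
)
print(candidate_proper_names(sample))
```

Raising `min_count` trades recall for fewer false positives, which may matter more on a long novel than on a short sample like this one.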

I have ~no~ idea how one would tune MALLET. I just went with the defaults. I’d love to hear from Lisa about her experience.

By: bpasanek http://virginia2012.thatcamp.org/04/20/mine-your-own-business/#comment-415 Fri, 20 Apr 2012 22:16:06 +0000 http://virginia2012.thatcamp.org/?p=1125#comment-415 Justin and I were thinking we might start by playing (for ten minutes or so) with this GUI to get everyone up to speed on what topics’n’texts look like and then we could plunge into MALLET.
Here’s the GUI: code.google.com/p/topic-modeling-tool/

I spent today trying to force Shakespeare’s plays through MALLET. And can share some of my experiences.
The plays went through neatly enough on my first go, but I realized the topics were crowded with character names. More interesting topics, I thought, might be engineered by removing all the proper names from the plays. I then, in my lame way, destroyed today trying to create a stop-name list to add to MALLET’s en.txt list. After I finally got the new list together I tried to import my Shakespeare plays again. I got an IllegalArgumentException for the import-dir command. A typical moment of defeat after lots of work: the candle snuffs out, the canary croaks.
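
A rough sketch of the stop-name step, for discussion rather than as a fix for the exception above: it merges a base stop list with a list of character names, lowercases and de-duplicates them, and writes one word per line. The function names and sample words are hypothetical. Lowercasing assumes MALLET lowercases tokens before removing stop words, which I believe is the default, but check it against your version. MALLET's import commands also document an `--extra-stopwords` option that takes a separate file, which may be easier than editing en.txt directly.

```python
from pathlib import Path


def merge_stoplists(base_words, extra_names):
    """Combine two word lists into one sorted, lowercased, de-duplicated list."""
    merged = {w.strip().lower() for w in base_words if w.strip()}
    merged.update(n.strip().lower() for n in extra_names if n.strip())
    return sorted(merged)


def write_stoplist(words, path):
    """Write one word per line as UTF-8 (the path is whatever file you choose)."""
    Path(path).write_text("\n".join(words) + "\n", encoding="utf-8")


# Invented examples: a few ordinary stop words plus some character names.
base = ["the", "and", "of"]
names = ["Hamlet", "Ophelia", "hamlet", " "]
stoplist = merge_stoplists(base, names)
print(stoplist)
```

Keeping the names in their own file, instead of pasting them into the default list, also makes it easy to rerun the import with and without them and compare the topics.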

In some ways, what Justin and I were envisioning was a crash course in pre- and post-processing of texts and datasets. (Clearly, I haven’t gotten near the post-processing of the Shakespeare yet.) So, it might be nice for the computationally savvy to talk about how one writes shell scripts and walk us through how best to prep a found text (how to remove html tags, white space, etc. with sed/awk/grep and so on).
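
As a starting point for the prep conversation, here is a minimal sketch of stripping HTML tags and collapsing white space from a found text. It uses Python's standard library rather than the sed/awk/grep route mentioned above, so treat it as a swapped-in alternative. The class name, function name, and sample markup are invented for illustration.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of an HTML document, skipping script/style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()  # character references such as &amp; are decoded
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_plain_text(html):
    """Return the visible text of `html` with runs of white space collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered after the last tag
    # Joining with a space keeps words in adjacent tags apart, at the cost of
    # splitting a word that has markup in the middle of it.
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Invented sample markup.
sample = (
    "<html><style>p{color:red}</style>"
    "<p>To be,   or not\n to be &amp; so on.</p></html>"
)
print(html_to_plain_text(sample))
```

A real parser is slower to write than a one-line regex, but it copes with attributes, entities, and embedded styles that a pattern like `<[^>]*>` tends to mangle.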

But maybe that’s too pedestrian? And we should all come with polished datasets to share and work directly with MALLET? We could also bring both raw and some cooked datasets. And do all of the above.

I like the idea of working with the modernist journals. And would love to hear about what Lisa’s been up to.

I’ll bring the Shakespeare… It may be raw or it may be cooked. We’ll see.

By: jet9r http://virginia2012.thatcamp.org/04/20/mine-your-own-business/#comment-411 Fri, 20 Apr 2012 20:09:39 +0000 http://virginia2012.thatcamp.org/?p=1125#comment-411 Brad and I are very much dealing with preliminaries too, so there is certainly scope for others describing their experiences with MALLET and other similar tools. I first learned about WEKA earlier this week, but have yet to use it. It would be a good outcome if we were able to figure out whether there are particular tools that are suited to various levels of technical expertise or data-mining experience.

By: Lisa Rhody http://virginia2012.thatcamp.org/04/20/mine-your-own-business/#comment-410 Fri, 20 Apr 2012 19:57:43 +0000 http://virginia2012.thatcamp.org/?p=1125#comment-410 I’ve been working with MALLET and can show some of the process if you are interested… but I have to say that the kinds of things we pull up are going to be very preliminary. I have a bit of refining to go. I would also be interested in looking at another data-mining tool that I’ve just started with called WEKA. I’ll try to create a WEKA-ready dataset so we can look at that one too; however, I just have to add the caveat that I am not an expert in MALLET or WEKA…. Also, I’m specifically and purposefully working with a limited dataset. It does change the kinds of results you get… but… with all those caveats, I’m happy to show what I’ve got with the understanding that I’m still learning myself. 🙂

By: Eric Rettberg http://virginia2012.thatcamp.org/04/20/mine-your-own-business/#comment-409 Fri, 20 Apr 2012 19:52:52 +0000 http://virginia2012.thatcamp.org/?p=1125#comment-409 count me in, too. it might also be a good chance to see what we can do with topic modeling in the Modernist Journals Project data that Chris was so interested in earlier this year (see cforster.com/2011/11/playing-with-mods/)

By: cforster http://virginia2012.thatcamp.org/04/20/mine-your-own-business/#comment-391 Fri, 20 Apr 2012 14:56:00 +0000 http://virginia2012.thatcamp.org/?p=1125#comment-391 Yes, yes, a thousand times yes. I just referenced Ted Underwood’s intro to topic modeling in a DHAnswers post, and I’ll be returning to it this evening in anticipation of tomorrow. I recommend it to others as a good place to start thinking. Let’s play with MALLET and see what we can do.

Underwood, Topic Modeling Made Just Simple Enough.
