• Mine your own business

    Yesterday afternoon, Brad Pasanek and I decided to play at text-mining. We started working with MALLET and this GUI tool but were soon lost in the mine, buried in code, with nary a respirating canary, shafted.

    Our proposal includes two potential approaches:

    (1) A session could look at how a scholar might begin to use topic modeling in the humanities. What do those of us with limited technical nous need to know in order to begin this type of work? We imagine a walk-through, cooking-show-like presentation that goes from A (here are some texts) to B (here is a visualization). Between A and B there are many difficult and perilous interactions with shell scripts, MALLET extrusions, statistics, spreadsheets, and graphing tools. While we two are probably not capable of getting from A to B with elegance, flailing about in a group, roughing out a workflow, getting advice from sundry THATCampers, and making time for questions would be generally instructive, or so we submit.

    (2) An alternative approach assumes some basic success with topic modeling, and focuses instead on working with the cooked results. How can my-mine-mein data (we would bring something to the session and invite others to do the same) be interpreted, processed, and visualized? This secondary concern may even be included in the visualization session that has already been proposed.
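    As a starting point for the "cooked results" end of things, here is a minimal Python sketch that reads a topic-keys listing into something a spreadsheet or graphing tool could take further. It assumes each line has the form topic id, tab, weight, tab, space-separated words, which is our understanding of what MALLET writes when asked for topic keys; the sample data and the names `Topic` and `parse_topic_keys` are made up for illustration.

```python
from typing import List, NamedTuple


class Topic(NamedTuple):
    topic_id: int
    weight: float
    words: List[str]


def parse_topic_keys(text: str) -> List[Topic]:
    """Parse lines of the assumed form '<id>\\t<weight>\\t<word word ...>'."""
    topics = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        topic_id, weight, words = line.split("\t", 2)
        topics.append(Topic(int(topic_id), float(weight), words.split()))
    return topics


# Invented sample in the assumed format; not real model output.
SAMPLE = "0\t0.25\tlove heart eyes sweet\n1\t0.25\tking crown lord england\n"

if __name__ == "__main__":
    for topic in parse_topic_keys(SAMPLE):
        # Print each topic id with its first three words.
        print(topic.topic_id, " ".join(topic.words[:3]))
```

    If the real file turns out to be laid out differently, only the `split` line should need to change.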

    Both bits assume a willingness to wield the MALLET and do some topic modeling. We aim primarily at a how-to and hack-and-help, and not a discussion of the pros and cons of topic modeling or text-mining in general.


  1. cforster says:

    Yes, yes, a thousand times yes. I just referenced Ted Underwood’s intro to topic modeling in a DHAnswers post, and I’ll be returning to it this evening in anticipation of tomorrow. I recommend it to others as a good place to start thinking. Let’s play with MALLET and see what we can do.

    Underwood, Topic Modeling Made Just Simple Enough.

  2. Eric Rettberg says:

    count me in, too. it might also be a good chance to see what we can do with topic modeling in the Modernist Journals Project data that Chris was so interested in earlier this year (see cforster.com/2011/11/playing-with-mods/)

  3. Lisa Rhody says:

    I’ve been working with MALLET and can show some of the process if you are interested… but I have to say that the kinds of things we pull up are going to be very preliminary. I have a bit of refining to go. I would also be interested in looking at another data-mining tool that I’ve just started with called WEKA. I’ll try to create a WEKA-ready dataset so we can look at that one too; however, I just have to add the caveat that I am not an expert in MALLET or WEKA… Also, I’m specifically and purposefully working with a limited dataset. It does change the kinds of results you get… but… with all those caveats, I’m happy to show what I’ve got with the understanding that I’m still learning myself. 🙂

  4. jet9r says:

    Brad and I are very much dealing with preliminaries too, so there is certainly scope for others describing their experiences with MALLET and other similar tools. I first learned about WEKA earlier this week, but have yet to use it. It would be a good outcome if we were able to figure out whether there are particular tools that are suited to various levels of technical expertise or data-mining experience.

  5. bpasanek says:

    Justin and I were thinking we might start by playing (for ten minutes or so) with this GUI to get everyone up to speed on what topics’n’texts look like and then we could plunge into MALLET.
    Here’s the GUI: code.google.com/p/topic-modeling-tool/

    I spent today trying to force Shakespeare’s plays through MALLET, and can share some of my experiences.
    The plays went through neatly enough on my first go, but I realized the topics were crowded with character names. More interesting topics, I thought, might be engineered by removing all the proper names from the plays. I, then, in my lame way, destroyed today trying to create a stop-name list to add to MALLET’s en.txt list. After I finally got the new list together I tried to import my Shakespeare plays again. I got an IllegalArgumentException for the import-dir command. A typical moment of defeat after lots of work: the candle snuffs out, the canary croaks.
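    For anyone attempting the same thing, here is one rough way a stop-name list might be drafted in Python. It is a sketch of a simple heuristic, not the method described above: it treats a word as a probable name if it shows up capitalized at least twice and never in lowercase. The function name `candidate_names` and the sample sentence are invented, and the output is lowercased on the assumption that the texts are lowercased on import.

```python
import re
from collections import Counter
from typing import List

WORD = re.compile(r"[A-Za-z]+")


def candidate_names(text: str, min_count: int = 2) -> List[str]:
    """Return words seen capitalized >= min_count times and never in lowercase."""
    capitalized = Counter()
    lowercase = set()
    for word in WORD.findall(text):
        if word[0].isupper():
            capitalized[word.lower()] += 1
        else:
            lowercase.add(word.lower())
    return sorted(
        w for w, n in capitalized.items() if n >= min_count and w not in lowercase
    )


# Invented sample text, just to exercise the heuristic.
SAMPLE = (
    "Hamlet speaks. The king hears Hamlet. "
    "Then the queen and Horatio leave. Horatio weeps."
)

if __name__ == "__main__":
    # One name per line, ready to be saved as a plain-text list.
    print("\n".join(candidate_names(SAMPLE)))
```

    The heuristic will miss names that double as common words and will catch sentence-initial words that never recur in lowercase, so the list wants a read-through by hand. As far as we know, MALLET's importer can take such a file through an `--extra-stopwords` option, which would avoid editing en.txt directly; treat that as something to check against the documentation.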

    In some ways, what Justin and I were envisioning was a crash course in pre- and post-processing of texts and datasets. (Clearly, I haven’t gotten near the post-processing of the Shakespeare yet.) So, it might be nice for the computationally savvy to talk about how one writes shell scripts and walk us through how best to prep a found text (how to remove html tags, white space, etc. with sed/awk/grep and so on).
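    To make the prep step concrete, here is a minimal sketch of stripping HTML tags and collapsing white space from a found text. It uses Python's standard library rather than sed/awk/grep, purely to keep the example self-contained; the class and function names (`TextExtractor`, `html_to_text`) and the sample markup are hypothetical.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html: str) -> str:
    """Drop tags, decode entities, and collapse runs of white space."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


# Invented sample markup.
SAMPLE = (
    "<html><body><h1>Act  I</h1>\n"
    "<p>Who's   there?</p><script>var x = 1;</script></body></html>"
)

if __name__ == "__main__":
    print(html_to_text(SAMPLE))
```

    One trade-off to be aware of: text chunks are joined with a space, so a word split by inline markup (for example, half of it in italics) comes out as two tokens. For plain transcriptions that is rarely a problem, but it is worth spot-checking the output before modeling.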

    But maybe that’s too pedestrian? And we should all come with polished datasets to share and work directly with MALLET? We could also bring both raw and some cooked datasets. And do all of the above.

    I like the idea of working with the modernist journals. And would love to hear about what Lisa’s been up to.

    I’ll bring the Shakespeare… It may be raw or it may be cooked. We’ll see.

  6. cforster says:

    I looked at some Conrad texts in MALLET today; I wrote a quick ruby script (it’s very stupid and liable to break, but I’m happy to share) to pull all the words in a text file out (proper names mostly) and then fed that into MALLET. I’m happy to share my results.

    I have ~no~ idea how one would tune MALLET. I just went with the defaults. I’d love to hear from Lisa about her experience.
