MiTextExplorer

Website: brenocon.com/mte

The Mutual information Text Explorer (MTE) is a tool for interactive exploration of text data and document covariates. See the paper or slides for more information. The system currently available is experimental and very buggy, so use it with caution. Contact brenocon@gmail.com (http://brenocon.com) with questions.

How to run

Get the application: mte.jar

Get one of the example datasets: bible.zip or sotu.zip.

Launch it with one argument, the configuration file of the corpus you want to view. For example:

java -jar mte.jar sotu/config.conf

This requires Java version 8 or later to be accessible from the command line. Check the version with java -version; it must be at least "1.8.0". (Sometimes you might have to give a flag to specify memory usage, like java -Xmx2g, placed before -jar. I'm not sure when this is necessary.)
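As a rough illustration of the version check and the memory flag, here is a minimal Python sketch. The helper names java_major_version and build_launch_command are made up for this example and are not part of MTE. It assumes the version string is the quoted value printed by java -version, which looks like "1.8.0_291" under the numbering used up to Java 8 and like "11.0.2" from Java 9 onward. JVM options such as -Xmx2g go before -jar; anything after the jar file is passed to the application as an argument.

```python
import re


def java_major_version(version):
    """Return the major Java version from a string like "1.8.0_291" or "11.0.2".

    Hypothetical helper for illustration; not part of MTE.
    """
    parts = re.findall(r"\d+", version)
    if not parts:
        raise ValueError("no digits in version string: %r" % version)
    first = int(parts[0])
    # Legacy numbering (Java 8 and earlier): "1.8.0" means major version 8.
    if first == 1 and len(parts) > 1:
        return int(parts[1])
    return first


def build_launch_command(config_path, jar_path="mte.jar", max_heap=None):
    """Build the argument list for launching the jar with one config file.

    max_heap is an optional heap size such as "2g"; when given, it becomes
    an -Xmx option placed before -jar so that the JVM (not the application)
    receives it.
    """
    command = ["java"]
    if max_heap:
        command.append("-Xmx" + max_heap)
    command += ["-jar", jar_path, config_path]
    return command


print(java_major_version("1.8.0_291") >= 8)  # True
print(build_launch_command("sotu/config.conf", max_heap="2g"))
```

To actually launch the application, the returned list could be passed to subprocess.run.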

Data format

Each line is one document, encoded as a JSON object. There is one mandatory key:

There is one optional special key:

Other keys in the JSON object are covariates. They have to be listed in the schema configuration to be used.
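Before loading your own data, it may help to confirm that every line really is a single JSON object. The following is a minimal sketch in Python, not something MTE provides. The function name check_documents and the key names in the sample data are placeholders invented for the example. It also treats blank lines as problems, which is an assumption and may be stricter than the tool itself.

```python
import json


def check_documents(lines):
    """Return a list of (line_number, problem) for lines that are not JSON objects.

    Hypothetical helper for illustration; not part of MTE.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            # Assumption: a blank line is not a valid document.
            problems.append((number, "blank line"))
            continue
        try:
            doc = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, "invalid JSON: " + err.msg))
            continue
        if not isinstance(doc, dict):
            # Valid JSON, but an array, string, or number rather than an object.
            problems.append((number, "not a JSON object"))
    return problems


# Placeholder data; "example_key" and "example_covariate" are not real MTE key names.
sample = [
    '{"example_key": "some document text", "example_covariate": 1990}',
    '["an", "array", "is", "not", "an", "object"]',
    '{"example_key": "truncated line',
]
for number, problem in check_documents(sample):
    print(number, problem)
```

For a real file, pass the open file handle, for example check_documents(open(path, encoding="utf-8")), where path points to your data file.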

TODO: automatic type detection

TODO: covariates in separate file and CSV

Configuration options

The application is launched by giving it the full path to the main config file. For an example to adapt to your own data, start with bible/config.conf.

Required configuration parameters include:

Other configuration parameters include:

In the schema object (or schema config file), every key is the name of a covariate, and the type is given. Legal types are

TODO: reconcile with / extend to http://dataprotocols.org/json-table-schema/ which seems to be a moderately sensible data column typing system (are there better ones? there are certainly many worse ones.)

The format for the config file is a lax form of JSON, described here. Any legal JSON can be used for the config file; the lax format adds a few niceties, such as comments starting with #, sometimes being able to skip quoting, and leaving off commas when each entry is on its own line.

Source code

License is GPL v2 or later. I'd be happy to do BSD/MIT or something, but the software uses some GPL'd libraries which I find convenient.

Code is at github.com/brendano/mte.

Dependencies have to be placed in lib/ for ./build.sh to work. For development in an IDE, I just manually add them to the build path. I've placed a copy of them here: mte-deps.zip. The dependencies are currently:

config-1.2.1.jar
docking-frames-common.jar
docking-frames-core.jar
guava-13.0.1.jar
jackson-all-1.9.11.jar
myutil.jar
stanford-corenlp-3.2.0.jar
trove-3.0.3.jar