Bookworm 0.4, with new features and usability improvements

Sep 18 2015

Bookworm 0.4 is now released on GitHub. It contains a number of improvements made to the code over the summer. Drawing on the experience of the many people who have used it so far, it makes the existing code much, much more sensible for anyone wanting to build a bookworm on their own collection of texts. All the stages (installation, configuration, and testing) are now a lot easier. So if you have a collection of texts you wish to explore, I welcome you to test it out. (I’ll explain at more length later, but for the absolute lowest investment of time you can just run a prebuilt bookworm virtual machine using Vagrant.)

Installation is easy

The most obvious change has to do with installation. Rather than a collection of scripts that you run in a specific cloned repository, Bookworm is now a Python module that can be invoked through a system-wide command line utility. I haven’t put it on pip just yet, but it’s easy enough to install system-wide by downloading a zip of the repo from GitHub and running python setup.py install from inside the expanded directory. Anyone who plans to edit the code should instead clone the repo directly and run python setup.py develop rather than python setup.py install; this lets you edit the code in place.

The external wrapper is now a bundled Python executable rather than a Makefile.

Command-line documentation

Perhaps most importantly, this means that the executable now has its own documentation.

Type in bookworm --help, and you now get this useful page, which explains what arguments like “--log-level” do and lists the actions that represent the actual commands.

➜  ~  bookworm --help
usage: bookworm [-h] [--configuration CONFIGURATION] [--database DATABASE]
                [--log-level {warning,info,debug}]
                {build,add_metadata,reload_memory,extension,query,tokenize,prep,init,serve,config}
                ...

Build and maintain a Bookworm database.

optional arguments:
  -h, --help            show this help message and exit
  --configuration CONFIGURATION, -c CONFIGURATION
                        The name of the configuration file to read options
                        from: by default, 'bookworm.cnf' in the current
                        directory.
  --database DATABASE, -d DATABASE
                        The name of the bookworm database in MySQL to connect
                        to: by default, read from the active configuration
                        file.
  --log-level {warning,info,debug}, -l {warning,info,debug}
                        The logging detail to use for errors. Default is
                        'warning', only significant problems; info gives a
                        fuller record, and 'debug' dumps many MySQL queries,
                        etc.

action:
  {build,add_metadata,reload_memory,extension,query,tokenize,prep,init,serve,config}
                        The commands to run with Bookworm
    build               Build up the component parts of a Bookworm. This is a
                        wrapper around `Make`; if you specify something far
                        along the line (for instance, the linechart GUI), it
                        will build all prior files as well.
    add_metadata        Supplement the metadata with new items. They can be
                        keyed to any field already in the database.
    reload_memory       Reload the memory tables for the designated Bookworm;
                        this must be done after every MySQL restart
    extension           Install Extensions to the current directory
    query               Run a query using the Bookworm API
    tokenize            tokenize (and optionally, encode) text. Requires a
                        stream to stdin as input.
    prep                Build individual components: primarily used by the
                        Makefile.
    init                Initialize the current directory as a bookworm
                        directory
    serve               Launch a webserver on the current bookworm. This is
                        much easier than configuring apache, but considerably
                        less secure.
    config              Some helpers to configure a running bookworm, or to
                        manage your server-wide configuration.

Each of the “actions” has additional help of its own. For example, bookworm reload_memory --help prints:

usage: bookworm reload_memory [-h] [--force-reload] [--skip-reload] [--all]

optional arguments:
  -h, --help      show this help message and exit
  --force-reload  Force reload on all memory tables. Use '--skip-reload' for
                  faster execution. On by default .
  --skip-reload   Don't reload memory tables which have at least one entry in
                  them. Significantly faster, but may produce bad results if
                  the underlying tables have been changed. Good for
                  maintenance, bad for actively updated installations.
  --all           Search for all bookworm installations on the server, and
                  reload memory tables for each of them.
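
The layout of these help pages (global options first, then a list of actions that each have a help page of their own) is the kind of output Python's argparse module produces when subparsers are used. Here is a minimal sketch of that pattern, assuming argparse. It borrows a few option names from the output above for illustration, but it is not Bookworm's actual source.

```python
import argparse


def build_parser():
    """Build a parser with global options plus one subparser per action."""
    parser = argparse.ArgumentParser(
        prog="bookworm", description="Build and maintain a Bookworm database."
    )
    # Global options come before the action name on the command line.
    parser.add_argument(
        "--log-level", "-l", choices=["warning", "info", "debug"], default="warning"
    )
    # Each action is a subparser, which gives it a --help page of its own.
    actions = parser.add_subparsers(title="action", dest="action")
    reload_memory = actions.add_parser(
        "reload_memory", help="Reload the memory tables for the designated Bookworm"
    )
    reload_memory.add_argument("--skip-reload", action="store_true")
    reload_memory.add_argument("--all", action="store_true")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["--log-level", "info", "reload_memory", "--all"])
print(args.action, args.log_level, args.all, args.skip_reload)
```

Calling build_parser().parse_args(["reload_memory", "--help"]) would print the help for that action alone and exit.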

As a result, certain processes are now accessible system-wide: if you want to tokenize something consistently with Bookworm’s complicated tokenization regex, you can pipe input into the command bookworm tokenize token_stream. There are still a few kinks, but I’ve found this extremely useful even when loading text into other applications.
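
For calling the tokenizer from another program, here is a rough sketch that pipes a string to bookworm tokenize token_stream with Python's subprocess module. The helper name tokenize is my own, the sketch assumes Python 3, and it makes no assumption about the format of the tokenizer's output; it simply returns whatever the command writes to stdout.

```python
import shutil
import subprocess

# The command named in the text; input is supplied on stdin.
TOKENIZE_COMMAND = ["bookworm", "tokenize", "token_stream"]


def tokenize(text):
    """Pipe text to the Bookworm tokenizer and return its raw stdout.

    Returns None when the bookworm executable is not on the PATH.
    """
    if shutil.which(TOKENIZE_COMMAND[0]) is None:
        return None
    result = subprocess.run(
        TOKENIZE_COMMAND,
        input=text,
        stdout=subprocess.PIPE,
        universal_newlines=True,  # send and receive str rather than bytes
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout


if __name__ == "__main__":
    output = tokenize("Call me Ishmael. Some years ago...\n")
    print(output if output is not None else "bookworm is not installed")
```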

This also makes system scripting for things like memory-table reloads significantly easier.
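
As an illustration of that kind of scripting, the sketch below wraps bookworm reload_memory so it can be called from a maintenance script. It uses only the --all and --skip-reload flags shown in the help output above; the function names are hypothetical, and the sketch assumes Python 3.

```python
import shutil
import subprocess


def reload_command(all_bookworms=True, skip_reload=False):
    """Build the argument list for `bookworm reload_memory`."""
    command = ["bookworm", "reload_memory"]
    if all_bookworms:
        command.append("--all")  # reload every bookworm found on the server
    if skip_reload:
        command.append("--skip-reload")  # leave already-populated tables alone
    return command


def reload_memory_tables(all_bookworms=True, skip_reload=False):
    """Run the reload and return the exit status, or None if bookworm is missing."""
    command = reload_command(all_bookworms, skip_reload)
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command).returncode


if __name__ == "__main__":
    print(reload_memory_tables())
```

Because the memory tables must be reloaded after every MySQL restart, a script like this is a natural candidate to run from whatever starts or restarts the database.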

On-board API

The API was previously bundled as a separate module. While there is still an (automatically installed) CGI script to handle Apache calls, you can also run the API from the command line. This means that for data analysis purposes you don’t need a webserver: you can create arbitrary subsets of data using the API with calls like bookworm query '{"format":"tsv",[...]}'.
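
For analysis scripts, a call like that can be wrapped in a few lines of Python. This is only a sketch under assumptions: "format": "tsv" is the one key taken from the example above, while the remaining keys and the database name in the sample query are illustrative placeholders that should be checked against the API documentation. The function names are my own.

```python
import csv
import io
import json
import shutil
import subprocess


def query_command(query):
    """Return the argument list for `bookworm query`, with the query as JSON."""
    return ["bookworm", "query", json.dumps(query)]


def run_query(query):
    """Run the query and return the TSV output as a list of rows.

    Returns None when the bookworm executable is not on the PATH.
    """
    command = query_command(query)
    if shutil.which(command[0]) is None:
        return None
    result = subprocess.run(
        command, stdout=subprocess.PIPE, universal_newlines=True, check=True
    )
    return list(csv.reader(io.StringIO(result.stdout), delimiter="\t"))


# Hypothetical query: every key except "format" is an assumption for illustration.
sample_query = {
    "format": "tsv",
    "database": "my_bookworm",
    "groups": ["year"],
    "search_limits": {"word": ["whale"]},
}

if __name__ == "__main__":
    rows = run_query(sample_query)
    print(rows if rows is not None else "bookworm is not installed")
```

Passing the query as a single list element avoids shell quoting problems with the JSON.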

On-board webserver

Of course you may want a webserver, since visualization has always been one of the primary outputs. I still recommend you use Apache for any public-facing installations. But for local testing and data exploration, Bookworm can now take advantage of Python’s CGI Server module to put up some charts for you to explore locally or share with trusted collaborators over the web. Just run bookworm serve after building a database.

Cleaner syntaxes for adding metadata

I’m putting this last, but I actually think it is one of the most important and useful features we have. It is now extremely simple to add metadata to an existing bookworm using bookworm add_metadata and a TSV or JSON file containing any new information. If you already have an “author” field, for example, and you have a TSV that contains additional information about some or all of the authors, you can instantly make those additional fields accessible for queries. More information is available in the docs.
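
To make the author example concrete, here is a sketch of preparing such a TSV with Python's csv module. The column names birth_year and nationality and the helper name are illustrative assumptions, as is the idea that the first row names the columns. The exact way to hand the file to bookworm add_metadata is best taken from bookworm add_metadata --help or the docs.

```python
import csv
import os
import tempfile


def write_author_metadata(path, rows, key="author"):
    """Write rows (dicts) to a TSV whose first column is the existing key field."""
    extra_fields = sorted({name for row in rows for name in row if name != key})
    fieldnames = [key] + extra_fields
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(
            handle,
            fieldnames=fieldnames,
            delimiter="\t",
            lineterminator="\n",
            restval="",  # leave a blank cell when an author lacks a field
        )
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames


# Illustrative rows keyed on an "author" field already in the bookworm.
rows = [
    {"author": "Melville, Herman", "birth_year": 1819, "nationality": "American"},
    {"author": "Austen, Jane", "birth_year": 1775},
]
path = os.path.join(tempfile.mkdtemp(), "author_metadata.tsv")
columns = write_author_metadata(path, rows)
print(columns)
```

Authors that are missing a field simply get an empty cell, which matches the case where the extra information covers only some of the authors.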

Test suite

There is not yet a full test suite, but there are elements.