Bookworm 0.4, with new features and usability improvements
Bookworm 0.4 is now released on github. It contains a number of improvements to the code from over the summer. It makes the existing code much, much more sensible for anyone wanting to build a bookworm on their own collections of texts based on the experience of many using it so far. All the stages: installation, configuration, and testing are now a lot easier. So if you have a collection of texts you wish to explore, I welcome you to test it out. (I’ll explain at more length later, but for the absolute lowest investment of time you can just run a prebuilt bookworm virtual machine using vagrant.)
Installation is easy
The most obvious change has to do with installation. Rather than a collection of scripts that you run in a specific clone repository, Bookworm is now a python module that can be invoked through a system-wide command line utility. I haven’t put it on pip just yet, but it’s easy enough to install system-wide by downloading a zip of the repo from github and running python setup.py install
from inside the expanded directory.
Anyone who plans to edit the code should clone the repo directly and install not by running python setup.py install
, but python setup.py develop
which lets you edit the code in-place.
The external wrapping is now a bundled executable in python rather than a Makefile.
Command-line documentation
Perhaps most importantly, this means that the executable now has its own documentation.
Type in bookworm --help
, and you now get this useful page that tells you what arguments like “–log-level” do, as well as a series of actions that represent the actual commands.
➜ ~ bookworm --help
usage: bookworm [-h] [--configuration CONFIGURATION] [--database DATABASE]
[--log-level {warning,info,debug}]
{build,add_metadata,reload_memory,extension,query,tokenize,prep,init,serve,config}
...
Build and maintain a Bookworm database.
optional arguments:
-h, --help show this help message and exit
--configuration CONFIGURATION, -c CONFIGURATION
The name of the configuration file to read options
from: by default, 'bookworm.cnf' in the current
directory.
--database DATABASE, -d DATABASE
The name of the bookworm database in MySQL to connect
to: by default, read from the active configuration
file.
--log-level {warning,info,debug}, -l {warning,info,debug}
The logging detail to use for errors. Default is
'warning', only significant problems; info gives a
fuller record, and 'debug' dumps many MySQL queries,
etc.
action:
{build,add_metadata,reload_memory,extension,query,tokenize,prep,init,serve,config}
The commands to run with Bookworm
build Build up the component parts of a Bookworm. This is a
wrapper around `Make`; if you specify something far
along the line (for instance, the linechart GUI), it
will build all prior files as well.
add_metadata Supplement the metadata with new items. They can be
keyed to any field already in the database.
reload_memory Reload the memory tables for the designated Bookworm;
this must be done after every MySQL restart
extension Install Extensions to the current directory
query Run a query using the Bookworm API
tokenize tokenize (and optionally, encode) text. Requires a
stream to stdin as input.
prep Build individual components: primarily used by the
Makefile.
init Initialize the current directory as a bookworm
directory
serve Launch a webserver on the current bookworm. This is
much easier than configuring apache, but considerably
less secure.
config Some helpers to configure a running bookworm, or to
manage your server-wide configuration.
Each of the “actions” has additional help. For example:
usage: bookworm reload_memory [-h] [--force-reload] [--skip-reload] [--all]
optional arguments:
-h, --help show this help message and exit
--force-reload Force reload on all memory tables. Use '--skip-reload' for
faster execution. On by default .
--skip-reload Don't reload memory tables which have at least one entry in
them. Significantly faster, but may produce bad results if
the underlying tables have been changed. Good for
maintenance, bad for actively updated installations.
--all Search for all bookworm installations on the server, and
reload memory tables for each of them.
As a result certain processes are now accessible system-wide; if you want to tokenize something consistently with Bookworm’s complicated tokenization regex, that’s accessible by piping input into the command bookworm tokenize token_stream
. There are still a few kinks, but I’ve found this to be extremely useful even when loading text in other applications.
This also makes syste scripting for things like memory-table reloads significantly easier.
On-board API
The API was previously bundled as a separate module. While there is still an (automatically-installed) CGI script to handle apache calls, you can also run the API from the command line. This means that for data analysis purposes you don’t need a webserver. You can just create any arbitrary subsets of data using the API with calls like bookworm query '{"format":"tsv",[...]}
.
On-board webserver.
Of course you may want a webserver, since visualization has always been one of the primary outputs. I still recommend you use Apache for any public-facing installations. But for local testing and data exploration, Bookworm now can take advantage of Python’s CGI Server module to put up some charts for you to explore locally or share with trusted collaborators over the web. Just run bookworm serve
after build a database.
Cleaner syntaxes for adding metadata
I’m putting this last, but I actually think it is one of the most important and useful features we have. It is now extremely simple to add metadata to an existing bookworm using bookworm add_metadata
and a tsv or json file containing any new information. If you already have an “author” field, for example, and you have a TSV that contains additional information about some or all of the authors, you can instantly make those additional fields accessible for queries. More information is available in the docs.
Test suite
There is not yet a full test suite, but there are elements.