This repository was archived by the owner on Jul 3, 2024. It is now read-only.

DEV-667: stage item in repo & index with catalog & full-text #5

Merged
merged 3 commits into from
Apr 6, 2023
21 changes: 21 additions & 0 deletions .gitignore
@@ -0,0 +1,21 @@
vendor
.bundle
.env
stage-item/*.xml
stage-item/*.zip

# other repositories

catalog/
common/
hathitrust_catalog_indexer/
ht-pairtree/
imgsrv-sample-data/
imgsrv/
lss_solr_configs/
pt/
sample-data/
slip/
ssd/
logs/
cache/
219 changes: 0 additions & 219 deletions Dockerfile

This file was deleted.

111 changes: 80 additions & 31 deletions README.md
@@ -8,64 +8,113 @@ Clone all the repositories in a working directory.
We're going to be running docker from this working directory,
so `babel-local-dev` has access to the other repositories.

There's a lot, because we're replicating running on the
dev servers with `debug_local=1` enabled.

```
$ mkdir workdir
$ cd workdir
$ git clone git@github.com:hathitrust/babel-local-dev.git
$ git clone git@github.com:hathitrust/catalog.git
$ git clone git@github.com:hathitrust/common.git
$ git clone git@github.com:hathitrust/imgsrv.git
$ git clone git@github.com:hathitrust/pt.git
$ git clone git@github.com:hathitrust/mdp-lib.git
$ git clone git@github.com:hathitrust/slip-lib.git
$ git clone git@github.com:hathitrust/plack-lib.git
$ git clone git@github.com:hathitrust/imgsrv-sample-data.git
# more to come
First clone this repository:
```bash
git clone git@github.com:hathitrust/babel-local-dev.git babel
```

## Step 2: initialize all the submodules
Then run:

*Insert fancy one liner if available.*
```bash
cd babel
./setup.sh
```

This will check out the other repositories along with their submodules.
There's a lot, because we're replicating running on the dev servers with
`debug_local=1` enabled.

## Step 3: build the `babel-local-dev` environment

In your workdir:

```
docker-compose -f ./babel-local-dev/docker-compose.yml build
docker-compose build
```

## Step 4: run `babel-local-dev`:

In your workdir:

```
docker-compose -f ./babel-local-dev/docker-compose.yml up
docker-compose up
```

In your browser:

* http://localhost:8080/Search/Home
* http://localhost:8080/cgi/pt?id=test.pd_open
* catalog: `http://localhost:8080/Search/Home`
* catalog solr: `http://localhost:9033`
* full-text solr: `http://localhost:8983`

PageTurner & imgsrv:

* `http://localhost:8080/cgi/pt?id=test.pd_open`
* `http://localhost:8080/cgi/imgsrv/cover?id=test.pd_open`
* `http://localhost:8080/cgi/imgsrv/image?id=test.pd_open&seq=1`
* `http://localhost:8080/cgi/imgsrv/html?id=test.pd_open&seq=1`
* `http://localhost:8080/cgi/imgsrv/download/pdf?id=test.pd_open&seq=1&attachment=0`
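
Once the containers are up, a quick way to confirm that the services listed above are answering is to request each URL and print the HTTP status. The following is a minimal sketch, not part of this repository: the function names `endpoints` and `check_endpoints` are made up for illustration, and the URLs are simply copied from the lists above.

```bash
#!/usr/bin/env bash
# Smoke test sketch for the local services.
# Assumption: the port mappings match the URLs listed in this README.

# Print the URLs to check, one per line.
endpoints() {
  cat <<'EOF'
http://localhost:8080/Search/Home
http://localhost:9033
http://localhost:8983
http://localhost:8080/cgi/pt?id=test.pd_open
EOF
}

# Request each URL and print "<status>  <url>".
check_endpoints() {
  endpoints | while IFS= read -r url; do
    # -w '%{http_code}' prints the status code; curl reports 000 when it cannot connect.
    status=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$url")
    printf '%s  %s\n' "$status" "$url"
  done
}
```

After sourcing the file, run `check_endpoints`. A status of `000` means nothing is listening on that port yet; any other status means the service answered, although redirects or error codes may still need a look.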

mysql is exposed at 127.0.0.1:3307. The default username & password with write
access is `mdp-admin` / `mdp-admin` (needless to say, do not use this image in
production!)

```bash
mysql -h 127.0.0.1 -P 3307 -u mdp-admin -p
```
Huzzah!

Not yet configured:
* `http://localhost:8080/cgi/mb`
* `http://localhost:8080/cgi/ls`
* `http://localhost:8080/cgi/whoami`
* `http://localhost:8080/cgi/ping`
* etc

## How this works (for now)

The `docker-compose` configuration provides a custom catalog configuration to the `nginx`
service so that it proxies `babel` CGI requests to the `apache-cgi` service and serves
`common` requests from the local `common` checkout.
* catalog runs nginx + php
* babel cgi apps run under apache in a single container
* imgsrv plack/psgi process runs in its own container

## Staging an Item

`apache-cgi` is there because `nginx` can only speak FastCGI/HTTP and running *all* the babel
apps under FastCGI/HTTP is still aspirational.
First, get a HathiTrust ZIP and METS. The easiest way to do this is probably to
use the [Data API client](https://babel.hathitrust.org/cgi/htdc) to download
a public domain item unencumbered by any contractual restrictions, for example
`uc2.ark:/13960/t4mk66f1d`. Select "Download", then select "Item METS file" and
"entire item" in turn, submitting the form each time; this downloads the METS
and the ZIP respectively.

Running the stage item script requires a Ruby runtime. The script automates
putting the item in the appropriate location under `imgsrv-sample-data`,
fetching the bibliographic data, and extracting and indexing the full text.

First make sure all the dependencies are running:

```bash
docker-compose build
docker-compose up
```

Then, install dependencies for the `stage-item` script and run it with the
downloaded zip and METS:

```bash
docker-compose run traject bundle install
cd stage-item
bundle config set --local path 'vendor/bundle'
bundle install
bundle exec ruby stage_item.rb uc2.ark:/13960/t4mk66f1d ark+=13960=t4mk66f1d.zip ark+=13960=t4mk66f1d.mets.xml
```

Note that the zip and METS must be named as they are in the actual
repository -- if you name them "foo.zip" or "foo.xml" they will not be renamed,
and full-text indexing and PageTurner will not be able to find the item.
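
If the downloaded files have lost their names, the expected base name can be worked out from the item ID. The sketch below assumes the naming follows the Pairtree single-character substitutions (`/` to `=`, `:` to `+`, `.` to `,`) applied to the part of the ID after the namespace prefix, which matches the `uc2.ark:/13960/t4mk66f1d` example above. `pairtree_name` is a hypothetical helper, not something `stage_item.rb` provides, and it does not handle the hex-encoding that full Pairtree applies to some other characters.

```bash
#!/usr/bin/env bash
# Sketch: derive the repository file name for a HathiTrust item ID.
# Assumption: only the three Pairtree single-character substitutions are needed.

pairtree_name() {
  # Drop the namespace prefix (everything up to the first "."), e.g. "uc2."
  object_id=${1#*.}
  # Pairtree substitutions: "/" -> "=", ":" -> "+", "." -> ","
  printf '%s\n' "$object_id" | tr '/:.' '=+,'
}

id='uc2.ark:/13960/t4mk66f1d'
name=$(pairtree_name "$id")
echo "$name.zip"       # ark+=13960=t4mk66f1d.zip
echo "$name.mets.xml"  # ark+=13960=t4mk66f1d.mets.xml
```

Renaming the downloads to these names before running the script should let full-text indexing and PageTurner locate the item.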

## TODO

- [ ] merge the `imgsrv` DEV-231-grok branch and update the `Dockerfile`s to include `grok`
- [ ] update `slip-lib/Searcher.pm` to set `wt=xml` because the new solr defaults return JSON
- [ ] adding `pt` requires filling out more of the `ht_web` tables (namely `mb_*`)
- [ ] add `mb` and `ls`
- [ ] ensure database user can write to relevant tables
- [ ] link to documentation for important tasks - e.g. running apps under debugging, updating css/js, etc
- [ ] easy mechanism to generate placeholder volumes in `imgsrv-sample-data` that correspond to the records in the catalog

- [ ] make it easier to fetch real volumes
Empty file added cache/.keep
Empty file.