diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8046394..b9f1935 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,14 @@
 # Changelog 
 
+## 2020.3.3 - Pre-release
+
+NOTE: This version changes both the way that model artefacts are packaged and saved, and the way that data are loaded and parsed from tsv files. This results in a significantly faster training time (c.14 hours -> c.0.5 hour), but older models will no longer be compatible. For compatibility you must use multitask models >= 2020.3.19, splitting models >= 2020.3.6, and parsing models >= 2020.3.8. These models currently perform less well than previous versions, but performance is expected to improve with more data and experimentation, predominantly around sequence length.
+
+* Adds support for Multitask models as in the original Rodrigues paper.
+* Combines artefacts into a single `indices.pickle` rather than the several previous pickles. Now the model just requires the embedding, `indices.pickle`, and `weights.h5`.
+* Updates load_tsv to better handle quoting.
+
+
 ## 2020.3.2 - Pre-release
 
 * Adds parse command that can be called with `python -m deep_reference_parser parse` 
diff --git a/Makefile b/Makefile
index 313b8a2..b86f0a1 100644
--- a/Makefile
+++ b/Makefile
@@ -83,7 +83,10 @@ datasets = data/splitting/2019.12.0_splitting_train.tsv \
            data/splitting/2019.12.0_splitting_valid.tsv \
 		   data/parsing/2020.3.2_parsing_train.tsv \
            data/parsing/2020.3.2_parsing_test.tsv \
-           data/parsing/2020.3.2_parsing_valid.tsv
+           data/parsing/2020.3.2_parsing_valid.tsv \
+           data/multitask/2020.3.19_multitask_train.tsv \
+           data/multitask/2020.3.19_multitask_test.tsv \
+           data/multitask/2020.3.19_multitask_valid.tsv
 
 
 rodrigues_datasets = data/rodrigues/clean_train.txt \
diff --git a/README.md b/README.md
index 680d699..3ba0aef 100644
--- a/README.md
+++ b/README.md
@@ -2,54 +2,77 @@
 
 # Deep Reference Parser
 
-Deep Reference Parser is a Bi-direction Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF) for identifying references from text. It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) tool to replace a number of existing machine learning models which find references, and extract the constituent parts (e.g. author, year, publication, volume, etc).
+Deep Reference Parser is a Deep Learning Model for recognising references in free text. In this context we mean references to other works, for example an academic paper or a book. Given an arbitrary block of text (nominally a section containing references), the model will extract the limits of the individual references, and identify key information such as authors, year published, and title.
 
-The BiLSTM model is based on Rodrigues et al. (2018), and like this project, the intention is to implement a MultiTask model which will complete three tasks simultaneously: reference span detection (splitting), reference component detection (parsing), and reference type classification (classification) in a single neural network and stacked CRF.
+The model itself is a Bi-directional Long Short Term Memory (BiLSTM) Deep Neural Network with a stacked Conditional Random Field (CRF). It is designed to be used in the [Reach](https://github.com/wellcometrust/reach) application to replace a number of existing machine learning models which find references, and extract the constituent parts.
+
+The BiLSTM model is based on [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing) who developed a model to find (split) references, parse them into constituent parts, and classify them according to the type of reference (e.g. primary reference, secondary reference, etc). This implementation of the model covers the first two tasks and is intended for use in the medical field. Three models are implemented here: individual splitting and parsing models, and a combined multitask model which both splits and parses. We have not yet attempted to include reference type classification, but this may be done in the future.
 
 ### Current status:
 
 |Component|Individual|MultiTask|
 |---|---|---|
-|Spans (splitting)|✔️ Implemented|❌ Not Implemented|
-|Components (parsing)|✔️ Implemented|❌ Not Implemented|
+|Spans (splitting)|✔️ Implemented|✔️ Implemented|
+|Components (parsing)|✔️ Implemented|✔️ Implemented|
 |Type (classification)|❌ Not Implemented|❌ Not Implemented|
 
 ### The model
 
 The model itself is based on the work of [Rodrigues et al. (2018)](https://github.com/dhlab-epfl/LinkedBooksDeepReferenceParsing), although the implemention here differs significantly. The main differences are:
 
-* We use a combination of the training data used by Rodrigues, et al. (2018) in addition to data that we have labelled ourselves. No Rodrigues et al. data are included in the test and validation sets.
-* We also use a new word embedding that has been trained on documents relevant to the medicine.
+* We use a combination of the training data used by Rodrigues, et al. (2018) in addition to data that we have annotated ourselves. No Rodrigues et al. data are included in the test and validation sets.
+* We also use a new word embedding that has been trained on documents relevant to the field of medicine.
 * Whereas Rodrigues at al. split documents on lines, and sent the lines to the model, we combine the lines of the document together, and then send larger chunks to the model, giving it more context to work with when training and predicting.
-* Whilst the model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
+* Whilst the splitter model makes predictions at the token level, it outputs references by naively splitting on these tokens ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/tokens_to_references.py)).
 * Hyperparameters are passed to the model in a config (.ini) file. This is to keep track of experiments, but also because it is difficult to save the model with the CRF architecture, so it is necesary to rebuild (not re-train!) the model object each time you want to use it. Storing the hyperparameters in a config file makes this easier.
-* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2019.12.0.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
+* The package ships with a [config file](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/configs/2020.3.19_multitask.ini) which defines the latest, highest performing model. The config file defines where to find the various objects required to build the model (index dictionaries, weights, embeddings), and will automatically fetch them when run, if they are not found locally.
 * The model includes a command line interface inspired by [SpaCy](https://github.com/explosion/spaCy); functions can be called from the command line with `python -m deep_reference_parser` ([source](https://github.com/wellcometrust/deep_reference_parser/blob/master/deep_reference_parser/predict.py)).
-* Python version updated to 3.7, along with dependencies (although more to do)
+* Python version updated to 3.7, along with dependencies (although more to do).
 
 ### Performance
 
 On the validation set.
 
-#### Span detection (splitting)
+#### Finding reference spans (splitting)
 
-|token|f1|support|
-|---|---|---|
-|b-r|0.9364|2472|
-|e-r|0.9312|2424|
-|i-r|0.9833|92398|
-|o|0.9561|32666|
-|weighted avg|0.9746|129959|
+Current model version: *2020.3.6_splitting*
 
-#### Components (parsing)
+|token|f1|
+|---|---|
+|b-r|0.8146|
+|e-r|0.7075|
+|i-r|0.9623|
+|o|0.8463|
+|weighted avg|0.9326|
 
-|token|f1|support|
-|---|---|---|
-|author|0.9467|2818|
-|title|0.8994|4931|
-|year|0.8774|418|
-|o|0.9592|13685|
-|weighted avg|0.9425|21852|
+#### Identifying reference components (parsing)
+
+Current model version: *2020.3.8_parsing*
+
+|token|f1|
+|---|---|
+|author|0.9053|
+|title|0.8607|
+|year|0.8639|
+|o|0.9340|
+|weighted avg|0.9124|
+
+#### Multitask model (splitting and parsing)
+
+Current model version: *2020.3.19_multitask*
+
+|token|f1|
+|---|---|
+|author|0.9102|
+|title|0.8809|
+|year|0.7469|
+|o|0.8892|
+|parsing weighted avg|0.8869|
+|b-r|0.8254|
+|e-r|0.7908|
+|i-r|0.9563|
+|o|0.7560|
+|splitting weighted avg|0.9240|
 
 #### Computing requirements
 
@@ -57,8 +80,9 @@ Models are trained on AWS instances using CPU only.
 
 |Model|Time Taken|Instance type|Instance cost (p/h)|Total cost|
 |---|---|---|---|---|
-|Span detection|16:02:00|m4.4xlarge|$0.88|$14.11|
-|Components|11:02:59|m4.4xlarge|$0.88|$9.72|
+|Span detection|00:26:41|m4.4xlarge|$0.88|$0.39|
+|Components|00:17:22|m4.4xlarge|$0.88|$0.25|
+|MultiTask|00:19:56|m4.4xlarge|$0.88|$0.29|
 
 ## tl;dr: Just get me to the references!
 
@@ -77,15 +101,20 @@ cat > references.txt <<EOF
 EOF
 
 
-# Run the splitter model. This will take a little time while the weights and 
+# Run the MultiTask model. This will take a little time while the weights and 
 # embeddings are downloaded. The weights are about 300MB, and the embeddings 
 # 950MB.
 
-python -m deep_reference_parser split "$(cat references.txt)"
+python -m deep_reference_parser split_parse -t "$(cat references.txt)"
 
 # For parsing:
 
 python -m deep_reference_parser parse "$(cat references.txt)"
+
+# For splitting:
+
+python -m deep_reference_parser split "$(cat references.txt)"
+
 ```
 
 ## The longer guide
@@ -106,7 +135,9 @@ A [config file](https://github.com/wellcometrust/deep_reference_parser/blob/mast
 
 ```
 [DEFAULT]
-version = 2019.12.0
+version = 2020.3.19_multitask
+description = Same as 2020.3.13 but with adam rather than rmsprop
+deep_reference_parser_version = b61de984f95be36445287c40af4e65a403637692
 
 [data]
 test_proportion = 0.25
@@ -114,14 +145,14 @@ valid_proportion = 0.25
 data_path = data/
 respect_line_endings = 0
 respect_doc_endings = 1
-line_limit = 250
-policy_train = data/2019.12.0_train.tsv
-policy_test = data/2019.12.0_test.tsv
-policy_valid = data/2019.12.0_valid.tsv
+line_limit = 150
+policy_train = data/multitask/2020.3.19_multitask_train.tsv
+policy_test = data/multitask/2020.3.19_multitask_test.tsv
+policy_valid = data/multitask/2020.3.19_multitask_valid.tsv
 s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
 
 [build]
-output_path = models/2020.2.0/
+output_path = models/multitask/2020.3.19_multitask/
 output = crf
 word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
 pretrained_embedding = 0
@@ -133,13 +164,10 @@ char_embedding_type = BILSTM
 optimizer = rmsprop
 
 [train]
-epochs = 10
+epochs = 60
 batch_size = 100
 early_stopping_patience = 5
 metric = val_f1
-
-[evaluate]
-out_file = evaluation_data.tsv
 ```
 
 ### Getting help
@@ -198,21 +226,21 @@ Data must be prepared in the following tab separated format (tsv). We use [prodi
 You must provide the train/test/validation data splits in this format in pre-prepared files that are defined in the config file.
 
 ```
-References  o
-1   o
-The	b-r
-potency	i-r
-of	i-r
-history	i-r
-was	i-r
-on	i-r
-display	i-r
-at	i-r
-a	i-r
-workshop	i-r
-held	i-r
-in	i-r
-February	i-r
+References	o	o
+1	o	o
+The	b-r	title
+potency	i-r	title
+of	i-r	title
+history	i-r	title
+was	i-r	title
+on	i-r	title
+display	i-r	title
+at	i-r	title
+a	i-r	title
+workshop	i-r	title
+held	i-r	title
+in	i-r	title
+February	i-r	title
 ```
 
 ### Making predictions
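
Reviewer note on the README hunk above: the example data now carries three tab-separated columns per row (token, span label, component label). A minimal sketch of reading that layout follows. It is not the package's own `load_tsv`; the helper name `read_multitask_tsv` and the choice of `csv.QUOTE_NONE` are assumptions made for illustration only.

```python
import csv
import io


def read_multitask_tsv(handle):
    """Read token/label rows into three parallel lists (hypothetical helper)."""
    tokens, span_labels, component_labels = [], [], []
    # QUOTE_NONE makes the reader treat quote characters in a token as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 3:
            # Skip blank or malformed rows rather than guessing at their meaning.
            continue
        token, span, component = row
        tokens.append(token)
        span_labels.append(span)
        component_labels.append(component)
    return tokens, span_labels, component_labels


# A tiny in-memory sample; the third token starts with a stray quote character.
sample = 'References\to\to\nThe\tb-r\ttitle\n"potency\ti-r\ttitle\n'
tokens, spans, components = read_multitask_tsv(io.StringIO(sample))
print(tokens)
print(spans)
print(components)
```

Keeping the two label columns in separate lists mirrors the idea of one target sequence per task, which is how a multitask model would typically consume them.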
diff --git a/deep_reference_parser/__main__.py b/deep_reference_parser/__main__.py
index 1040272..bd91cc4 100644
--- a/deep_reference_parser/__main__.py
+++ b/deep_reference_parser/__main__.py
@@ -12,11 +12,13 @@
     from .train import train
     from .split import split
     from .parse import parse
+    from .split_parse import split_parse
 
     commands = {
         "split": split,
         "parse": parse,
         "train": train,
+        "split_parse": split_parse,
     }
 
     if len(sys.argv) == 1:
diff --git a/deep_reference_parser/__version__.py b/deep_reference_parser/__version__.py
index 2ccb55f..328f8b7 100644
--- a/deep_reference_parser/__version__.py
+++ b/deep_reference_parser/__version__.py
@@ -1,9 +1,10 @@
 __name__ = "deep_reference_parser"
-__version__ = "2020.3.2"
+__version__ = "2020.3.3"
 __description__ = "Deep learning model for finding and parsing references"
 __url__ = "https://github.com/wellcometrust/deep_reference_parser"
 __author__ = "Wellcome Trust DataLabs Team"
 __author_email__ = "Grp_datalabs-datascience@Wellcomecloud.onmicrosoft.com"
 __license__ = "MIT"
-__splitter_model_version__ = "2019.12.0_splitting"
-__parser_model_version__ = "2020.3.2_parsing"
+__splitter_model_version__ = "2020.3.6_splitting"
+__parser_model_version__ = "2020.3.8_parsing"
+__splitparser_model_version__ = "2020.3.19_multitask"
diff --git a/deep_reference_parser/common.py b/deep_reference_parser/common.py
index 14b8714..9cd3e14 100644
--- a/deep_reference_parser/common.py
+++ b/deep_reference_parser/common.py
@@ -5,8 +5,12 @@
 from logging import getLogger
 from urllib import parse, request
 
+from .__version__ import (
+    __parser_model_version__,
+    __splitparser_model_version__,
+    __splitter_model_version__,
+)
 from .logger import logger
-from .__version__ import __splitter_model_version__, __parser_model_version__
 
 
 def get_path(path):
@@ -15,6 +19,7 @@ def get_path(path):
 
 SPLITTER_CFG = get_path(f"configs/{__splitter_model_version__}.ini")
 PARSER_CFG = get_path(f"configs/{__parser_model_version__}.ini")
+MULTITASK_CFG = get_path(f"configs/{__splitparser_model_version__}.ini")
 
 
 def download_model_artefact(artefact, s3_slug):
@@ -47,13 +52,8 @@ def download_model_artefacts(model_dir, s3_slug, artefacts=None):
     if not artefacts:
 
         artefacts = [
-            "char2ind.pickle",
-            "ind2label.pickle",
-            "ind2word.pickle",
-            "label2ind.pickle",
-            "maxes.pickle",
+            "indices.pickle",
             "weights.h5",
-            "word2ind.pickle",
         ]
 
     for artefact in artefacts:
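
Reviewer note on the artefact list above: the separate index pickles are replaced by a single `indices.pickle`. The sketch below shows one way such a combined pickle could be written and read back with the standard library. The helper names `write_indices` and `read_indices` and the toy dictionary values are hypothetical; only the dictionary keys are taken from this change set.

```python
import os
import pickle
import tempfile


def write_indices(indices, path):
    """Serialise the combined lookup tables to one file (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(indices, handle)


def read_indices(path):
    """Load the combined lookup tables back into a dict (hypothetical helper)."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Toy values only; label2ind and ind2label hold one mapping per task.
indices = {
    "word2ind": {"the": 1, "potency": 2},
    "ind2word": {1: "the", 2: "potency"},
    "label2ind": [{"b-r": 1, "i-r": 2}, {"title": 1, "author": 2}],
    "ind2label": [{1: "b-r", 2: "i-r"}, {1: "title", 2: "author"}],
    "char2ind": {"t": 1, "h": 2, "e": 3},
    "max_char": 10,
    "max_len": 150,
}

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "indices.pickle")
    write_indices(indices, pickle_path)
    restored = read_indices(pickle_path)

print(restored["max_len"])
```

A single file means there is only one artefact to version and download alongside the weights, at the cost of having to rewrite the whole file if any one table changes.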
diff --git a/deep_reference_parser/configs/2019.12.0_splitting.ini b/deep_reference_parser/configs/2019.12.0_splitting.ini
deleted file mode 100644
index 1fc02a3..0000000
--- a/deep_reference_parser/configs/2019.12.0_splitting.ini
+++ /dev/null
@@ -1,35 +0,0 @@
-[DEFAULT]
-version = 2019.12.0_splitting
-
-[data]
-test_proportion = 0.25
-valid_proportion = 0.25
-data_path = data/
-respect_line_endings = 0
-respect_doc_endings = 1
-line_limit = 250
-policy_train = data/splitting/2019.12.0_splitting_train.tsv
-policy_test = data/splitting/2019.12.0_splitting_test.tsv
-policy_valid = data/splitting/2019.12.0_splitting_valid.tsv
-s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
-
-[build]
-output_path = models/splitting/2019.12.0_splitting/
-output = crf
-word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
-pretrained_embedding = 0
-dropout = 0.5
-lstm_hidden = 400
-word_embedding_size = 300
-char_embedding_size = 100
-char_embedding_type = BILSTM
-optimizer = rmsprop
-
-[train]
-epochs = 10
-batch_size = 100
-early_stopping_patience = 5
-metric = val_f1
-
-[evaluate]
-out_file = evaluation_data.tsv
diff --git a/deep_reference_parser/configs/2020.3.19_multitask.ini b/deep_reference_parser/configs/2020.3.19_multitask.ini
new file mode 100644
index 0000000..0228ecc
--- /dev/null
+++ b/deep_reference_parser/configs/2020.3.19_multitask.ini
@@ -0,0 +1,37 @@
+[DEFAULT]
+version = 2020.3.19_multitask
+description = Same as 2020.3.13 but with adam rather than rmsprop
+deep_reference_parser_version = b61de984f95be36445287c40af4e65a403637692
+
+[data]
+# Note that test and valid proportion are only used for data creation steps, 
+# not when running the train command.
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 150
+policy_train = data/multitask/2020.3.19_multitask_train.tsv
+policy_test = data/multitask/2020.3.19_multitask_test.tsv
+policy_valid = data/multitask/2020.3.19_multitask_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/multitask/2020.3.19_multitask/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 60
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+
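
Reviewer note on the config above: a short sketch of how values from an `.ini` file like this one can be read with the standard-library `configparser`. The inline `CONFIG_TEXT` is an abbreviated copy of the new multitask config used for illustration, and this is not necessarily how the package itself loads its settings.

```python
import configparser

# Abbreviated copy of the multitask config, for illustration only.
CONFIG_TEXT = """
[DEFAULT]
version = 2020.3.19_multitask

[data]
line_limit = 150
policy_train = data/multitask/2020.3.19_multitask_train.tsv

[build]
output_path = models/multitask/2020.3.19_multitask/
dropout = 0.5

[train]
epochs = 60
metric = val_f1
"""

cfg = configparser.ConfigParser()
cfg.read_string(CONFIG_TEXT)

# Every value is stored as a string; the typed getters convert on read.
line_limit = cfg.getint("data", "line_limit")
dropout = cfg.getfloat("build", "dropout")
epochs = cfg.getint("train", "epochs")

# Keys in [DEFAULT] are visible from every other section.
version = cfg["train"]["version"]

print(version, line_limit, dropout, epochs)
```

Because `[DEFAULT]` keys are inherited by all sections, fields such as `version` can be looked up from whichever section is being processed.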
diff --git a/deep_reference_parser/configs/2020.3.2_parsing.ini b/deep_reference_parser/configs/2020.3.2_parsing.ini
deleted file mode 100644
index c3f9dfb..0000000
--- a/deep_reference_parser/configs/2020.3.2_parsing.ini
+++ /dev/null
@@ -1,39 +0,0 @@
-[DEFAULT]
-version = 2020.3.2_parsing
-description = First experiment which includes Reach labelled data in the
-    training set. All annotated parsing data were combined, and then split using
-    a 50% (train), 25% (test), 25% (valid) split. The Rodrigues data is then
-    added to the training set to bulk it out.
-
-[data]
-test_proportion = 0.25
-valid_proportion = 0.25
-data_path = data/
-respect_line_endings = 0
-respect_doc_endings = 1
-line_limit = 250
-policy_train = data/parsing/2020.3.2_parsing_train.tsv
-policy_test = data/parsing/2020.3.2_parsing_test.tsv
-policy_valid = data/parsing/2020.3.2_parsing_valid.tsv
-s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
-
-[build]
-output_path = models/parsing/2020.3.2_parsing/
-output = crf
-word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
-pretrained_embedding = 0
-dropout = 0.5
-lstm_hidden = 400
-word_embedding_size = 300
-char_embedding_size = 100
-char_embedding_type = BILSTM
-optimizer = rmsprop
-
-[train]
-epochs = 10
-batch_size = 100
-early_stopping_patience = 5
-metric = val_f1
-
-[evaluate]
-out_file = evaluation_data.tsv
diff --git a/deep_reference_parser/configs/2020.3.6_splitting.ini b/deep_reference_parser/configs/2020.3.6_splitting.ini
new file mode 100644
index 0000000..a678cdf
--- /dev/null
+++ b/deep_reference_parser/configs/2020.3.6_splitting.ini
@@ -0,0 +1,39 @@
+[DEFAULT]
+version = 2020.3.6_splitting
+description = Splitting model trained on a combination of Reach and Rodrigues 
+    data. The Rodrigues data have been concatenated into a single continuous
+    document and then cut into sequences of length=line_limit, so that the
+    Rodrigues data and Reach data have the same lengths without need for much
+    padding or truncating.
+deep_reference_parser_version = e489f7efa31072b95175be8f728f1fcf03a4cabb
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 250
+policy_train = data/splitting/2020.3.6_splitting_train.tsv
+policy_test = data/splitting/2020.3.6_splitting_test.tsv
+policy_valid = data/splitting/2020.3.6_splitting_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/splitting/2020.3.6_splitting/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 30
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
+
diff --git a/deep_reference_parser/configs/2020.3.8_parsing.ini b/deep_reference_parser/configs/2020.3.8_parsing.ini
new file mode 100644
index 0000000..77fc78c
--- /dev/null
+++ b/deep_reference_parser/configs/2020.3.8_parsing.ini
@@ -0,0 +1,38 @@
+[DEFAULT]
+version = 2020.3.8_parsing
+description = Parsing model trained on a combination of Reach and Rodrigues 
+    data. The Rodrigues data have been concatenated into a single continuous
+    document and then cut into sequences of length=line_limit, so that the
+    Rodrigues data and Reach data have the same lengths without need for much
+    padding or truncating.
+deep_reference_parser_version = e489f7efa31072b95175be8f728f1fcf03a4cabb
+
+[data]
+test_proportion = 0.25
+valid_proportion = 0.25
+data_path = data/
+respect_line_endings = 0
+respect_doc_endings = 1
+line_limit = 100
+policy_train = data/parsing/2020.3.8_parsing_train.tsv
+policy_test = data/parsing/2020.3.8_parsing_test.tsv
+policy_valid = data/parsing/2020.3.8_parsing_valid.tsv
+s3_slug = https://datalabs-public.s3.eu-west-2.amazonaws.com/deep_reference_parser/
+
+[build]
+output_path = models/parsing/2020.3.8_parsing/
+output = crf
+word_embeddings = embeddings/2020.1.1-wellcome-embeddings-300.txt
+pretrained_embedding = 0
+dropout = 0.5
+lstm_hidden = 400
+word_embedding_size = 300
+char_embedding_size = 100
+char_embedding_type = BILSTM
+optimizer = rmsprop
+
+[train]
+epochs = 30
+batch_size = 100
+early_stopping_patience = 5
+metric = val_f1
diff --git a/deep_reference_parser/deep_reference_parser.py b/deep_reference_parser/deep_reference_parser.py
index b9b28cd..c658c6e 100644
--- a/deep_reference_parser/deep_reference_parser.py
+++ b/deep_reference_parser/deep_reference_parser.py
@@ -13,6 +13,10 @@
 
 import numpy as np
 
+
+from functools import partial
+import h5py
+from keras.engine import saving
 from keras.callbacks import EarlyStopping
 from keras.layers import (
     LSTM,
@@ -30,7 +34,6 @@
 from keras.models import Model
 from keras.optimizers import Adam, RMSprop
 from keras_contrib.layers import CRF
-from keras_contrib.utils import save_load_utils
 from sklearn_crfsuite import metrics
 
 from deep_reference_parser.logger import logger
@@ -47,7 +50,7 @@
     save_confusion_matrix,
     word2vec_embeddings,
 )
-from .io import load_tsv, read_pickle, write_pickle, write_to_csv
+from .io import read_pickle, write_pickle, write_to_csv, write_tsv
 
 
 class DeepReferenceParser:
@@ -72,6 +75,7 @@ def __init__(
         y_train=None,
         y_test=None,
         y_valid=None,
+        max_len=250,
         digits_word="$NUM$",
         ukn_words="out-of-vocabulary",
         padding_style="pre",
@@ -126,9 +130,8 @@ def __init__(
         self.X_validation = list()
         self.X_testing = list()
 
-        self.max_len = int()
+        self.max_len = max_len
         self.max_char = int()
-        self.max_words = int()
 
         # Defined in prepare_data
 
@@ -156,24 +159,22 @@ def prepare_data(self, save=False):
             Save(bool): If True, then data objects will be saved to
                 `self.output_path`.
         """
-        self.max_len = max([len(xx) for xx in self.X_train])
+        # NOTE: max_len is now passed to __init__ rather than computed from X_train.
 
         self.X_train_merged, self.X_test_merged, self.X_valid_merged = merge_digits(
             [self.X_train, self.X_test, self.X_valid], self.digits_word
         )
 
-        # Compute indexes for words+labels in the training data
+        # Compute indices for words+labels in the training data
 
         self.word2ind, self.ind2word = index_x(self.X_train_merged, self.ukn_words)
-        self.label2ind, ind2label = index_y(self.y_train)
 
-        # NOTE: The original code expected self.ind2label to be a list,
-        # in case you are training a multi-task model. For this reason,
-        # self.index2label is wrapped in a list.
+        y_labels = list(map(index_y, self.y_train))
 
-        self.ind2label.append(ind2label)
+        self.ind2label = [ind2label for _, ind2label in y_labels]
+        self.label2ind = [label2ind for label2ind, _ in y_labels]
 
-        # Convert data into indexes data
+        # Convert data into indices data
 
         # Encode X variables
 
@@ -209,33 +210,53 @@ def prepare_data(self, save=False):
 
         # Encode y variables
 
-        self.y_train_encoded = encode_y(
-            self.y_train, self.label2ind, self.max_len, self.padding_style
-        )
+        for i, labels in enumerate(self.y_train):
+            self.y_train_encoded.append(
+                encode_y(
+                    labels,
+                    self.label2ind[i],
+                    self.max_len,
+                    self.padding_style
+                )
+            )
 
-        self.y_test_encoded = encode_y(
-            self.y_test, self.label2ind, self.max_len, self.padding_style
-        )
+        for i, labels in enumerate(self.y_test):
+            self.y_test_encoded.append(
+                encode_y(
+                    labels,
+                    self.label2ind[i],
+                    self.max_len,
+                    self.padding_style
+                )
+            )
 
-        self.y_valid_encoded = encode_y(
-            self.y_valid, self.label2ind, self.max_len, self.padding_style
-        )
+        for i, labels in enumerate(self.y_valid):
+            self.y_valid_encoded.append(
+                encode_y(
+                    labels,
+                    self.label2ind[i],
+                    self.max_len,
+                    self.padding_style
+                )
+            )
+
+
+        logger.debug("Training target dimensions: %s", self.y_train_encoded[0].shape)
+        logger.debug("Test target dimensions: %s", self.y_test_encoded[0].shape)
+        logger.debug("Validation target dimensions: %s", self.y_valid_encoded[0].shape)
 
-        logger.debug("Training target dimensions: %s", self.y_train_encoded.shape)
-        logger.debug("Test target dimensions: %s", self.y_test_encoded.shape)
-        logger.debug("Validation target dimensions: %s", self.y_valid_encoded.shape)
 
         # Create character level data
 
         # Create the character level data
-        self.char2ind, self.max_words, self.max_char = character_index(
+        self.char2ind, self.max_char = character_index(
             self.X_train, self.digits_word
         )
 
         self.X_train_char = character_data(
             self.X_train,
             self.char2ind,
-            self.max_words,
+            self.max_len,
             self.max_char,
             self.digits_word,
             self.padding_style,
@@ -244,7 +265,7 @@ def prepare_data(self, save=False):
         self.X_test_char = character_data(
             self.X_test,
             self.char2ind,
-            self.max_words,
+            self.max_len,
             self.max_char,
             self.digits_word,
             self.padding_style,
@@ -253,7 +274,7 @@ def prepare_data(self, save=False):
         self.X_valid_char = character_data(
             self.X_valid,
             self.char2ind,
-            self.max_words,
+            self.max_len,
             self.max_char,
             self.digits_word,
             self.padding_style,
@@ -265,45 +286,38 @@ def prepare_data(self, save=False):
 
         if save:
 
-            # Save intermediate objects to data
-
-            write_pickle(self.word2ind, "word2ind.pickle", path=self.output_path)
-            write_pickle(self.ind2word, "ind2word.pickle", path=self.output_path)
-            write_pickle(self.label2ind, "label2ind.pickle", path=self.output_path)
-            write_pickle(self.ind2label, "ind2label.pickle", path=self.output_path)
-            write_pickle(self.char2ind, "char2ind.pickle", path=self.output_path)
-
-            maxes = {
-                "max_words": self.max_words,
+            indices = {
+                "word2ind": self.word2ind,
+                "ind2word": self.ind2word,
+                "label2ind": self.label2ind,
+                "ind2label": self.ind2label,
+                "char2ind": self.char2ind,
                 "max_char": self.max_char,
                 "max_len": self.max_len,
             }
 
-            write_pickle(maxes, "maxes.pickle", path=self.output_path)
+            # Save intermediate objects to data
+
+            write_pickle(indices, "indices.pickle", path=self.output_path)
 
     def load_data(self, out_path):
         """
         Loads the intermediate model objects created that are created and saved
         out by prepare_data. But not the data used to train the model.
-
-        NOTE: This method is not yet fully tested.
         """
 
-        self.word2ind = read_pickle("word2ind.pickle", path=out_path)
-        self.ind2word = read_pickle("ind2word.pickle", path=out_path)
-        self.label2ind = read_pickle("label2ind.pickle", path=out_path)
-        self.ind2label = read_pickle("ind2label.pickle", path=out_path)
-        self.char2ind = read_pickle("char2ind.pickle", path=out_path)
-
-        maxes = read_pickle("maxes.pickle", path=out_path)
+        indices = read_pickle("indices.pickle", path=out_path)
 
-        self.max_len = maxes["max_len"]
-        self.max_char = maxes["max_char"]
-        self.max_words = maxes["max_words"]
+        self.word2ind = indices["word2ind"]
+        self.ind2word = indices["ind2word"]
+        self.label2ind = indices["label2ind"]
+        self.ind2label = indices["ind2label"]
+        self.char2ind = indices["char2ind"]
+        self.max_len = indices["max_len"]
+        self.max_char = indices["max_char"]
 
         logger.debug("Setting max_len to %s", self.max_len)
         logger.debug("Setting max_char to %s", self.max_char)
-        logger.debug("Setting max_words to %s", self.max_words)
 
     def build_model(
         self,
@@ -352,7 +366,7 @@ def build_model(
 
         if word_embeddings:
 
-            word_input = Input((self.max_words,))
+            word_input = Input((self.max_len,))
             inputs.append(word_input)
 
             # TODO: More sensible handling of options for pretrained embedding.
@@ -388,7 +402,7 @@ def build_model(
 
         if self.max_char != 0:
 
-            character_input = Input((self.max_words, self.max_char,))
+            character_input = Input((self.max_len, self.max_char,))
 
             char_embedding = self.character_embedding_layer(
                 char_embedding_type=char_embedding_type,
@@ -456,7 +470,7 @@ def build_model(
 
         self.model = model
 
-#        logger.debug(self.model.summary(line_length=150))
+        #logger.debug(self.model.summary(line_length=150))
 
     def train_model(
         self, epochs=25, batch_size=100, early_stopping_patience=5, metric="val_f1"
@@ -481,10 +495,8 @@ def train_model(
 
         # Use custom classification scores callback
 
-        # NOTE: X lists are important for input here
-
         classification_scores = Classification_Scores(
-            [self.X_training, [self.y_train_encoded]], self.ind2label, self.weights_path
+            [self.X_training, self.y_train_encoded], self.ind2label, self.weights_path
         )
 
         callbacks.append(classification_scores)
@@ -503,12 +515,12 @@ def train_model(
 
         hist = self.model.fit(
             x=self.X_training,
-            y=[self.y_train_encoded],
-            validation_data=[self.X_testing, [self.y_test_encoded]],
+            y=self.y_train_encoded,
+            validation_data=[self.X_testing, self.y_test_encoded],
             epochs=epochs,
             batch_size=batch_size,
             callbacks=callbacks,
-            verbose=2,
+            verbose=1,
         )
 
         logger.info(
@@ -541,7 +553,6 @@ def evaluate(
         test_set=False,
         validation_set=False,
         print_padding=False,
-        out_file=None,
     ):
         """
         Evaluate model results
@@ -553,8 +564,6 @@ def evaluate(
                 validation set.
             print_padding(bool): Should the confusion matrix include the
                 the prediction of padding characters?
-            out_file(str): File into which the predictions and targets will be
-                saved. Defaults to `None` which saves nothing if not set.
         """
 
         if load_weights:
@@ -569,7 +578,7 @@ def evaluate(
             # under a multi-task scenario. This will need adjusting when
             # using this syntax for a multi-task model.
 
-            for i, y_target in enumerate([self.y_test_encoded]):
+            for i, y_target in enumerate(self.y_test_encoded):
 
                 # Compute predictions, flatten
 
@@ -600,11 +609,7 @@ def evaluate(
 
             # Compute classification report
 
-            # NOTE: self.y_valid_encoded goes in a list here, as it would
-            # under a multi-task scenario. This will need adjusting when
-            # using this syntax for a multi-task model.
-
-            for i, y_target in enumerate([self.y_valid_encoded]):
+            for i, y_target in enumerate(self.y_valid_encoded):
 
                 # Compute predictions, flatten
 
@@ -658,65 +663,58 @@ def evaluate(
                         figure_path=figure_path,
                     )
 
-                    if out_file:
+                    # Save out the predictions
 
-                        tokens = list(itertools.chain.from_iterable(self.X_valid))
+                    tokens = list(itertools.chain.from_iterable(self.X_valid))
 
-                        # Strip out the padding
+                    # Strip out the padding
 
-                        target_len = np.mean([len(line) for line in target])
-                        prediction_len = np.mean([len(line) for line in predictions])
+                    target_len = np.mean([len(line) for line in target])
+                    prediction_len = np.mean([len(line) for line in predictions])
 
-                        # Strip out the nulls from the target
+                    # Strip out the nulls from the target
 
-                        clean_target = [
-                            [label for label in line if label != "null"]
-                            for line in target
-                        ]
-
-                        # Strip out the nulls in the predictions that match the
-                        # nulls in the target
-
-                        clean_predictions = remove_padding_from_predictions(
-                            clean_target, predictions, self.padding_style
-                        )
+                    clean_target = [
+                        [label for label in line if label != "null"]
+                        for line in target
+                    ]
 
-                        # Record any token length mismatches.
+                    # Strip out the nulls in the predictions that match the
+                    # nulls in the target
 
-                        num_mismatches = len(clean_target) - np.sum(
-                            [
-                                len(x) == len(y)
-                                for x, y in zip(clean_target, clean_predictions)
-                            ]
-                        )
-
-                        logger.info("Number of mismatches: %s", num_mismatches)
+                    clean_predictions = remove_padding_from_predictions(
+                        clean_target, predictions, self.padding_style
+                    )
 
-                        # Flatten the target and predicted into one list.
+                    # Record any token length mismatches.
 
-                        clean_target = list(itertools.chain.from_iterable(clean_target))
-                        clean_predictions = list(
-                            itertools.chain.from_iterable(clean_predictions)
-                        )
-                        # NOTE: this needs some attention. The current outputs
-                        # seem to have different lengths and will therefore be
-                        # offset unequally. - Don't trust them!
-
-                        logger.info("tokens: %s", len(tokens))
-                        logger.info("target: %s", len(clean_target))
-                        logger.info("predictions: %s", len(clean_predictions))
+                    num_mismatches = len(clean_target) - np.sum(
+                        [
+                            len(x) == len(y)
+                            for x, y in zip(clean_target, clean_predictions)
+                        ]
+                    )
 
-                        out = list(zip(tokens, clean_target, clean_predictions))
+                    logger.debug("Number of mismatches: %s", num_mismatches)
 
-                        out_file_path = os.path.join(self.output_path, out_file)
+                    # Flatten the target and predicted into one list.
 
-                        logger.info("Writing results to %s", out_file_path)
+                    clean_target = list(itertools.chain.from_iterable(clean_target))
+                    clean_predictions = list(
+                        itertools.chain.from_iterable(clean_predictions)
+                    )
 
-                        with open(out_file_path, "w") as fb:
-                            writer = csv.writer(fb, delimiter="\t")
+                    logger.debug("tokens: %s", len(tokens))
+                    logger.debug("target: %s", len(clean_target))
+                    logger.debug("predictions: %s", len(clean_predictions))
 
-                            for i in out:
-                                writer.writerow(i)
+                    out = list(zip(tokens, clean_target, clean_predictions))
+                    out_file_path = os.path.join(
+                        self.output_path,
+                        f"validation_predictions_{i}.tsv"
+                    )
+
+                    write_tsv(out, out_file_path)
 
     def character_embedding_layer(
         self,
@@ -928,7 +926,7 @@ def compute_predictions(self, X, y, labels, nbrTask=-1):
 
     def prepare_X_data(self, X):
         """
-        Convert data to encoded word and character indexes
+        Convert data to encoded word and character indices
 
         TODO: Create a more generic function that can also be used in
         `self.prepare_data()`.
@@ -964,7 +962,7 @@ def prepare_X_data(self, X):
         X_char = character_data(
             X,
             self.char2ind,
-            self.max_words,
+            self.max_len,
             self.max_char,
             self.digits_word,
             self.padding_style,
@@ -981,26 +979,17 @@ def load_weights(self):
 
         if not self.model:
 
-            # Assumes that model has been buit with build_model!
-
             logger.exception(
                 "No model. you must build the model first with build_model"
             )
 
-        # NOTE: This is not required if incldue_optimizer is set to false in
-        # load_all_weights.
-
-        # Run the model for one epoch to initialise network weights. Then load
-        # full trained weights
-
-        # self.model.fit(x=self.X_testing, y=self.y_test_encoded,
-        #    batch_size=2500, epochs=1)
-
         logger.debug("Loading weights from %s", self.weights_path)
 
-        save_load_utils.load_all_weights(
-            self.model, self.weights_path, include_optimizer=False
-        )
+        with h5py.File(self.weights_path, mode='r') as f:
+            saving.load_weights_from_hdf5_group(
+                f['model_weights'], self.model.layers
+            )
+
 
     def predict(self, X, load_weights=False):
         """
@@ -1032,35 +1021,38 @@ def predict(self, X, load_weights=False):
 
         _, X_combined = self.prepare_X_data(X)
 
-        pred = self.model.predict(X_combined)
-
-        pred = np.asarray(pred)
-
         # Compute validation score
 
+        pred = self.model.predict(X_combined)
+        pred = np.asarray(pred)
         pred_index = np.argmax(pred, axis=-1)
 
-        # NOTE: indexing ind2label[0] will only work in the case of making
-        # predictions with a single task model.
 
-        ind2labelNew = self.ind2label[0].copy()
+        # Add 0 to labels to account for padding
 
-        # Index 0 in the predictions refers to padding
+        # Copy each task's mapping so that self.ind2label is not mutated
+        ind2labelNew = [{**labels, 0: "null"} for labels in self.ind2label]
 
-        ind2labelNew.update({0: "null"})
+        # Compute the labels for each prediction for each task
 
-        # Compute the labels for each prediction
-        pred_label = [[ind2labelNew[x] for x in a] for a in pred_index]
+        # If running a single task model, wrap pred_index in a list so that it
+        # can use the same logic as multitask models.
+
+        if len(ind2labelNew) == 1:
+            pred_index = [pred_index]
 
+        pred_label = []
+        for i in range(len(ind2labelNew)):
+            out = [[ind2labelNew[i][x] for x in a] for a in pred_index[i]]
+            pred_label.append(out)
         # Flatten data
 
         # Remove the padded tokens. This is done by counting the number of
         # tokens in the input example, and then removing the additional padded
-        # tokens that are added before this. It has to be done this way because
-        # the model can predict padding tokens, and sometimes it gets it wrong
-        # so if we remove all padding tokens, then we end up with mismatches in
-        # the length of input tokens and the length of predictions.
+        # tokens that are added before this.
+
+        # This is performed on each set of predictions relating to each task
 
-        out = remove_padding_from_predictions(X, pred_label, self.padding_style)
+        out = [remove_padding_from_predictions(X, p, self.padding_style) for p in pred_label]
 
         return out
diff --git a/deep_reference_parser/io/io.py b/deep_reference_parser/io/io.py
index 3f03102..92d1a69 100644
--- a/deep_reference_parser/io/io.py
+++ b/deep_reference_parser/io/io.py
@@ -98,7 +98,7 @@ def load_tsv(filepath, split_char="\t"):
         filepath.
 
     """
-    df = pd.read_csv(filepath, delimiter=split_char, header=None, skip_blank_lines=False)
+    df = pd.read_csv(filepath, delimiter=split_char, header=None, skip_blank_lines=False, encoding="utf-8", quoting=csv.QUOTE_NONE, engine="python")
 
     tuples = _split_list_by_linebreaks(df.to_records(index=False))
 
@@ -110,7 +110,7 @@ def load_tsv(filepath, split_char="\t"):
 
     out = _unpack(unpacked_tuples)
 
-    logger.info("Loaded %s training examples", len(out[0]))
+    logger.debug("Loaded %s training examples", len(out[0]))
 
     return tuple(out)
 
diff --git a/deep_reference_parser/model_utils.py b/deep_reference_parser/model_utils.py
index 74fa504..fe3892d 100644
--- a/deep_reference_parser/model_utils.py
+++ b/deep_reference_parser/model_utils.py
@@ -154,9 +154,9 @@ def encode_y(y, label2ind, max_len, padding_style):
 
     # Encode y (with pad)
 
-    # Transform each label into its index in the data
+    # Transform each label into its index and add "pre" padding
 
-    y_pad = [[0] * (max_len - len(ey)) + [label2ind[c] for c in ey] for ey in y]
+    y_pad = [[0] * (max_len - len(yi)) + [label2ind[label] for label in yi] for yi in y]
 
     # One-hot-encode label
 
@@ -205,10 +205,9 @@ def character_index(X, digits_word):
 
     # For padding
 
-    max_words = max([len(s) for s in X])
     max_char = max([len(w) for s in X for w in s])
 
-    return char2ind, max_words, max_char
+    return char2ind, max_char
 
 
 def character_data(X, char2ind, max_words, max_char, digits_word, padding_style):
@@ -457,7 +456,7 @@ def compute_epoch_training_F1(self):
         """
 
         in_length = len(self.model._input_layers)
-        out_length = len(self.model.layers)
+        out_length = len(self.model._output_layers)
         predictions = self.model.predict(self.train_data[0])
 
         if len(predictions) != out_length:
@@ -552,7 +551,7 @@ def on_epoch_end(self, epoch, logs={}):
 
         # Number of tasks
 
-        out_length = len(self.model.layers)
+        out_length = len(self.model._output_layers)
 
         # Compute the model predictions
 
diff --git a/deep_reference_parser/parse.py b/deep_reference_parser/parse.py
index be32d45..58ad803 100644
--- a/deep_reference_parser/parse.py
+++ b/deep_reference_parser/parse.py
@@ -33,26 +33,26 @@ def __init__(self, config_file):
 
         cfg = get_config(config_file)
 
+        # Build config
+        try:
+            OUTPUT_PATH = cfg["build"]["output_path"]
+            S3_SLUG = cfg["data"]["s3_slug"]
+        except KeyError:
+            config_dir, missing_config = os.path.split(config_file)
+            files = os.listdir(config_dir)
+            other_configs = [f for f in files if os.path.isfile(os.path.join(config_dir, f))]
+            msg.fail(f"Could not find config {missing_config}, perhaps you meant one of {other_configs}")
+
         msg.info(
             f"Attempting to download model artefacts if they are not found locally in {cfg['build']['output_path']}. This may take some time..."
         )
 
-        # Build config
-
-        OUTPUT_PATH = cfg["build"]["output_path"]
-        S3_SLUG = cfg["data"]["s3_slug"]
-
         # Check whether the necessary artefacts exists and download them if
         # not.
 
         artefacts = [
-            "char2ind.pickle",
-            "ind2label.pickle",
-            "ind2word.pickle",
-            "label2ind.pickle",
-            "maxes.pickle",
+            "indices.pickle",
             "weights.h5",
-            "word2ind.pickle",
         ]
 
         for artefact in artefacts:
@@ -63,7 +63,7 @@ def __init__(self, config_file):
                     msg.good(f"Found {artefact}")
                 except:
                     msg.fail(f"Could not download {S3_SLUG}{artefact}")
-                    logger.exception()
+                    logger.exception("Could not download %s%s", S3_SLUG, artefact)
 
         # Check on word embedding and download if not exists
 
@@ -75,7 +75,7 @@ def __init__(self, config_file):
                 msg.good(f"Found {WORD_EMBEDDINGS}")
             except:
                 msg.fail(f"Could not download {S3_SLUG}{WORD_EMBEDDINGS}")
-                logger.exception()
+                logger.exception("Could not download %s", WORD_EMBEDDINGS)
 
         OUTPUT = cfg["build"]["output"]
         PRETRAINED_EMBEDDING = cfg["build"]["pretrained_embedding"]
@@ -116,7 +116,7 @@ def parse(self, text, verbose=False):
 
         preds = self.drp.predict(tokens, load_weights=True)
 
-        flat_predictions = list(itertools.chain.from_iterable(preds))
+        flat_predictions = list(itertools.chain.from_iterable(preds[0]))
         flat_X = list(itertools.chain.from_iterable(tokens))
         rows = [i for i in zip(flat_X, flat_predictions)]
 
diff --git a/deep_reference_parser/split.py b/deep_reference_parser/split.py
index eccabb1..425924e 100644
--- a/deep_reference_parser/split.py
+++ b/deep_reference_parser/split.py
@@ -37,26 +37,26 @@ def __init__(self, config_file):
 
         cfg = get_config(config_file)
 
+        # Build config
+        try:
+            OUTPUT_PATH = cfg["build"]["output_path"]
+            S3_SLUG = cfg["data"]["s3_slug"]
+        except KeyError:
+            config_dir, missing_config = os.path.split(config_file)
+            files = os.listdir(config_dir)
+            other_configs = [f for f in files if os.path.isfile(os.path.join(config_dir, f))]
+            msg.fail(f"Could not find config {missing_config}, perhaps you meant one of {other_configs}")
+
         msg.info(
             f"Attempting to download model artefacts if they are not found locally in {cfg['build']['output_path']}. This may take some time..."
         )
 
-        # Build config
-
-        OUTPUT_PATH = cfg["build"]["output_path"]
-        S3_SLUG = cfg["data"]["s3_slug"]
-
         # Check whether the necessary artefacts exists and download them if
         # not.
 
         artefacts = [
-            "char2ind.pickle",
-            "ind2label.pickle",
-            "ind2word.pickle",
-            "label2ind.pickle",
-            "maxes.pickle",
+            "indices.pickle",
             "weights.h5",
-            "word2ind.pickle",
         ]
 
         for artefact in artefacts:
@@ -67,7 +67,7 @@ def __init__(self, config_file):
                     msg.good(f"Found {artefact}")
                 except:
                     msg.fail(f"Could not download {S3_SLUG}{artefact}")
-                    logger.exception()
+                    logger.exception("Could not download %s%s", S3_SLUG, artefact)
 
         # Check on word embedding and download if not exists
 
@@ -79,7 +79,7 @@ def __init__(self, config_file):
                 msg.good(f"Found {WORD_EMBEDDINGS}")
             except:
                 msg.fail(f"Could not download {S3_SLUG}{WORD_EMBEDDINGS}")
-                logger.exception()
+                logger.exception("Could not download %s", WORD_EMBEDDINGS)
 
         OUTPUT = cfg["build"]["output"]
         PRETRAINED_EMBEDDING = cfg["build"]["pretrained_embedding"]
@@ -124,7 +124,7 @@ def split(self, text, return_tokens=False, verbose=False):
 
         if return_tokens:
 
-            flat_predictions = list(itertools.chain.from_iterable(preds))
+            flat_predictions = list(itertools.chain.from_iterable(preds[0]))
             flat_X = list(itertools.chain.from_iterable(tokens))
             rows = [i for i in zip(flat_X, flat_predictions)]
 
@@ -145,7 +145,7 @@ def split(self, text, return_tokens=False, verbose=False):
 
             # Otherwise convert the tokens into references and return
 
-            refs = tokens_to_references(tokens, preds)
+            refs = tokens_to_references(tokens, preds[0])
 
             if verbose:
 
diff --git a/deep_reference_parser/split_parse.py b/deep_reference_parser/split_parse.py
new file mode 100644
index 0000000..390ee11
--- /dev/null
+++ b/deep_reference_parser/split_parse.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+# coding: utf-8
+"""
+Run predictions from a pre-trained model
+"""
+
+import itertools
+import json
+import os
+
+import en_core_web_sm
+import plac
+import spacy
+import wasabi
+
+import warnings
+
+with warnings.catch_warnings():
+    warnings.filterwarnings("ignore", category=DeprecationWarning)
+
+    from deep_reference_parser import __file__
+    from deep_reference_parser.__version__ import __splitter_model_version__
+    from deep_reference_parser.common import MULTITASK_CFG, download_model_artefact
+    from deep_reference_parser.deep_reference_parser import DeepReferenceParser
+    from deep_reference_parser.logger import logger
+    from deep_reference_parser.model_utils import get_config
+    from deep_reference_parser.reference_utils import break_into_chunks
+    from deep_reference_parser.tokens_to_references import tokens_to_references
+
+msg = wasabi.Printer(icons={"check": "\u2023"})
+
+
+class SplitParser:
+    def __init__(self, config_file):
+
+        msg.info(f"Using config file: {config_file}")
+
+        cfg = get_config(config_file)
+
+        try:
+            OUTPUT_PATH = cfg["build"]["output_path"]
+            S3_SLUG = cfg["data"]["s3_slug"]
+        except KeyError:
+            config_dir, missing_config = os.path.split(config_file)
+            files = os.listdir(config_dir)
+            other_configs = [f for f in files if os.path.isfile(os.path.join(config_dir, f))]
+            msg.fail(f"Could not find config {missing_config}, perhaps you meant one of {other_configs}")
+
+        # Check whether the necessary artefacts exists and download them if
+        # not.
+
+        artefacts = [
+            "indices.pickle",
+            "weights.h5",
+        ]
+
+        for artefact in artefacts:
+            with msg.loading(f"Could not find {artefact} locally, downloading..."):
+                try:
+                    artefact = os.path.join(OUTPUT_PATH, artefact)
+                    download_model_artefact(artefact, S3_SLUG)
+                    msg.good(f"Found {artefact}")
+                except:
+                    msg.fail(f"Could not download {S3_SLUG}{artefact}")
+                    logger.exception("Could not download %s%s", S3_SLUG, artefact)
+
+        # Check on word embedding and download if not exists
+
+        WORD_EMBEDDINGS = cfg["build"]["word_embeddings"]
+
+        with msg.loading(f"Could not find {WORD_EMBEDDINGS} locally, downloading..."):
+            try:
+                download_model_artefact(WORD_EMBEDDINGS, S3_SLUG)
+                msg.good(f"Found {WORD_EMBEDDINGS}")
+            except:
+                msg.fail(f"Could not download {S3_SLUG}{WORD_EMBEDDINGS}")
+                logger.exception("Could not download %s", WORD_EMBEDDINGS)
+
+        OUTPUT = cfg["build"]["output"]
+        PRETRAINED_EMBEDDING = cfg["build"]["pretrained_embedding"]
+        DROPOUT = float(cfg["build"]["dropout"])
+        LSTM_HIDDEN = int(cfg["build"]["lstm_hidden"])
+        WORD_EMBEDDING_SIZE = int(cfg["build"]["word_embedding_size"])
+        CHAR_EMBEDDING_SIZE = int(cfg["build"]["char_embedding_size"])
+
+        self.MAX_WORDS = int(cfg["data"]["line_limit"])
+
+        # Evaluate config
+
+        self.drp = DeepReferenceParser(output_path=OUTPUT_PATH)
+
+        # Encode data and load required mapping dicts. Note that the max word and
+        # max char lengths will be loaded in this step.
+
+        self.drp.load_data(OUTPUT_PATH)
+
+        # Build the model architecture
+
+        self.drp.build_model(
+            output=OUTPUT,
+            word_embeddings=WORD_EMBEDDINGS,
+            pretrained_embedding=PRETRAINED_EMBEDDING,
+            dropout=DROPOUT,
+            lstm_hidden=LSTM_HIDDEN,
+            word_embedding_size=WORD_EMBEDDING_SIZE,
+            char_embedding_size=CHAR_EMBEDDING_SIZE,
+        )
+
+    def split_parse(self, text, return_tokens=False, verbose=False):
+
+        nlp = en_core_web_sm.load()
+        doc = nlp(text)
+        chunks = break_into_chunks(doc, max_words=self.MAX_WORDS)
+        tokens = [[token.text for token in chunk] for chunk in chunks]
+
+        preds = self.drp.predict(tokens, load_weights=True)
+
+        # If tokens argument passed, return the labelled tokens
+
+        if return_tokens:
+
+            flat_preds_list = list(map(itertools.chain.from_iterable, preds))
+            flat_X = list(itertools.chain.from_iterable(tokens))
+            rows = [i for i in zip(*[flat_X] + flat_preds_list)]
+
+            if verbose:
+
+                msg.divider("Token Results")
+
+                header = tuple(["token"] + ["label"] * len(flat_preds_list))
+                aligns = tuple(["r"] + ["l"] * len(flat_preds_list))
+                formatted = wasabi.table(
+                    rows, header=header, divider=True, aligns=aligns
+                )
+                print(formatted)
+
+            return rows
+
+        else:
+
+            # TODO: return references with attributes (author, title, year)
+            # in json format. For now just return predictions as they are to
+            # allow testing of endpoints.
+
+            return preds
+
+        #    # Otherwise convert the tokens into references and return
+
+        #    refs = tokens_to_references(tokens, preds)
+
+        #    if verbose:
+
+        #        msg.divider("Results")
+
+        #        if refs:
+
+        #            msg.good(f"Found {len(refs)} references.")
+        #            msg.info("Printing found references:")
+
+        #            for ref in refs:
+        #                msg.text(ref, icon="check", spaced=True)
+
+        #        else:
+
+        #            msg.fail("Failed to find any references.")
+
+        #    out = refs
+
+        #return out
+
+
+@plac.annotations(
+    text=("Plaintext from which to extract references", "positional", None, str),
+    config_file=("Path to config file", "option", "c", str),
+    tokens=("Output tokens instead of complete references", "flag", "t", str),
+    outfile=("Path to json file to which results will be written", "option", "o", str),
+)
+def split_parse(text, config_file=MULTITASK_CFG, tokens=False, outfile=None):
+    """
+    Runs the default multitask model and pretty prints results to console unless
+    --outfile is passed with a path. Files output to the path specified in
+    --outfile will be valid json. Can output either tokens (with -t|--tokens)
+    or the raw predictions for each task (default).
+
+    NOTE: this function is provided for examples only and should not be used
+    in production as the model is instantiated each time the command is run. To
+    use in a production setting, a more sensible approach would be to replicate
+    the split or parse functions within your own logic.
+    """
+    mt = SplitParser(config_file)
+    if outfile:
+        out = mt.split_parse(text, return_tokens=tokens, verbose=True)
+
+        try:
+            with open(outfile, "w") as fb:
+                json.dump(out, fb)
+            msg.good(f"Wrote model output to {outfile}")
+        except:
+            msg.fail(f"Failed to write output to {outfile}")
+
+    else:
+        out = mt.split_parse(text, return_tokens=tokens, verbose=True)
diff --git a/deep_reference_parser/train.py b/deep_reference_parser/train.py
index 929321c..ebf118d 100644
--- a/deep_reference_parser/train.py
+++ b/deep_reference_parser/train.py
@@ -47,7 +47,7 @@ def train(config_file):
             msg.good(f"Found {WORD_EMBEDDINGS}")
         except:
             msg.fail(f"Could not download {WORD_EMBEDDINGS}")
-            logger.exception()
+            logger.exception("Could not download %s", WORD_EMBEDDINGS)
 
     OUTPUT = cfg["build"]["output"]
     WORD_EMBEDDINGS = cfg["build"]["word_embeddings"]
@@ -56,6 +56,7 @@ def train(config_file):
     LSTM_HIDDEN = int(cfg["build"]["lstm_hidden"])
     WORD_EMBEDDING_SIZE = int(cfg["build"]["word_embedding_size"])
     CHAR_EMBEDDING_SIZE = int(cfg["build"]["char_embedding_size"])
+    MAX_LEN = int(cfg["data"]["line_limit"])
 
     # Train config
 
@@ -64,19 +65,33 @@ def train(config_file):
     EARLY_STOPPING_PATIENCE = int(cfg["train"]["early_stopping_patience"])
     METRIC = cfg["train"]["metric"]
 
-    # Evaluate config
+    # Load policy data
 
-    OUT_FILE = cfg["evaluate"]["out_file"]
+    train_data = load_tsv(POLICY_TRAIN)
+    test_data = load_tsv(POLICY_TEST)
+    valid_data = load_tsv(POLICY_VALID)
 
-    # Load policy data
+    X_train, y_train = train_data[0], train_data[1:]
+    X_test, y_test = test_data[0], test_data[1:]
+    X_valid, y_valid = valid_data[0], valid_data[1:]
+
+    import statistics
+
+    logger.debug("Max token length (train) %s", max([len(i) for i in X_train]))
+    logger.debug("Min token length (train) %s", min([len(i) for i in X_train]))
+    logger.debug("Median token length (train) %s", statistics.median([len(i) for i in X_train]))
+
+    logger.debug("Max token length (test) %s", max([len(i) for i in X_test]))
+    logger.debug("Min token length (test) %s", min([len(i) for i in X_test]))
+    logger.debug("Median token length (test) %s", statistics.median([len(i) for i in X_test]))
 
-    X_train, y_train = load_tsv(POLICY_TRAIN)
-    X_test, y_test = load_tsv(POLICY_TEST)
-    X_valid, y_valid = load_tsv(POLICY_VALID)
+    logger.debug("Max token length (valid) %s", max([len(i) for i in X_valid]))
+    logger.debug("Min token length (valid) %s", min([len(i) for i in X_valid]))
+    logger.debug("Median token length (valid) %s", statistics.median([len(i) for i in X_valid]))
 
-    logger.info("X_train, y_train examples: %s, %s", len(X_train), len(y_train))
-    logger.info("X_test, y_test  examples: %s, %s", len(X_test), len(y_test))
-    logger.info("X_valid, y_valid  examples: %s, %s", len(X_valid), len(y_valid))
+    logger.info("X_train, y_train examples: %s, %s", len(X_train), list(map(len, y_train)))
+    logger.info("X_test, y_test examples: %s, %s", len(X_test), list(map(len, y_test)))
+    logger.info("X_valid, y_valid examples: %s, %s", len(X_valid), list(map(len, y_valid)))
 
     drp = DeepReferenceParser(
         X_train=X_train,
@@ -85,6 +100,7 @@ def train(config_file):
         y_train=y_train,
         y_test=y_test,
         y_valid=y_valid,
+        max_len=MAX_LEN,
         output_path=OUTPUT_PATH,
     )
 
@@ -121,5 +137,4 @@ def train(config_file):
         test_set=True,
         validation_set=True,
         print_padding=False,
-        out_file=cfg["evaluate"]["out_file"],
     )
diff --git a/setup.py b/setup.py
index ffb031f..ddbdb9d 100644
--- a/setup.py
+++ b/setup.py
@@ -35,6 +35,7 @@
         "deep_reference_parser": [
             f"configs/{about['__splitter_model_version__']}.ini",
             f"configs/{about['__parser_model_version__']}.ini",
+            f"configs/{about['__splitparser_model_version__']}.ini",
         ]
     },
     classifiers=[
diff --git a/tests/test_deep_reference_parser.py b/tests/test_deep_reference_parser.py
index ba61936..8f5bf31 100644
--- a/tests/test_deep_reference_parser.py
+++ b/tests/test_deep_reference_parser.py
@@ -23,13 +23,8 @@ def cfg():
     cfg = get_config(TEST_CFG)
 
     artefacts = [
-        "char2ind.pickle",
-        "ind2label.pickle",
-        "ind2word.pickle",
-        "label2ind.pickle",
-        "maxes.pickle",
+        "indices.pickle",
         "weights.h5",
-        "word2ind.pickle",
     ]
 
     S3_SLUG = cfg["data"]["s3_slug"]
@@ -68,7 +63,7 @@ def test_DeepReferenceParser_train(tmpdir, cfg):
     X_test, y_test = load_tsv(TEST_TSV_TRAIN)
 
     X_test = X_test[0:100]
-    y_test = y_test[0:100]
+    y_test = [y_test[0:100]]
 
     drp = DeepReferenceParser(
         X_train=X_test,
@@ -77,7 +72,9 @@ def test_DeepReferenceParser_train(tmpdir, cfg):
         y_train=y_test,
         y_test=y_test,
         y_valid=y_test,
+        max_len=250,
         output_path=tmpdir,
+
     )
 
     # Prepare the data
@@ -149,7 +146,7 @@ def test_DeepReferenceParser_predict(tmpdir, cfg):
         "And so is this".split(" "),
     ]
 
-    preds = drp.predict(examples, load_weights=True)
+    preds = drp.predict(examples, load_weights=True)[0]
 
     assert len(preds) == len(examples)
 
diff --git a/tests/test_deep_reference_parser_entrypoints.py b/tests/test_deep_reference_parser_entrypoints.py
index c74c0f8..6936cbf 100644
--- a/tests/test_deep_reference_parser_entrypoints.py
+++ b/tests/test_deep_reference_parser_entrypoints.py
@@ -3,8 +3,9 @@
 
 import pytest
 
-from deep_reference_parser.split import Splitter
 from deep_reference_parser.parse import Parser
+from deep_reference_parser.split import Splitter
+from deep_reference_parser.split_parse import SplitParser
 
 from .common import TEST_CFG, TEST_REFERENCES
 
@@ -19,6 +20,11 @@ def parser():
     return Parser(TEST_CFG)
 
 
+@pytest.fixture
+def split_parser():
+    return SplitParser(TEST_CFG)
+
+
 @pytest.fixture
 def text():
     with open(TEST_REFERENCES, "r") as fb:
@@ -53,17 +59,18 @@ def test_parser_list_output(text, parser):
     assert isinstance(out, list)
 
 
-# Allow to xfail as this depends on the model
-@pytest.mark.xfail
-def test_splitter_output_length(text, splitter):
+@pytest.mark.slow
+def test_split_parser_list_output(text, split_parser):
     """
-    For now use a minimal set of weights which may fail to predict anything
-    useful. Hence this test is xfailed.
+    Test that the split_parser entrypoint works as expected.
+
+    If the model artefacts and embeddings are not present this test will
+    download them, which can be slow.
     """
-    out = splitter.split(text, return_tokens=False, verbose=False)
+    out = split_parser.split_parse(text, verbose=False)
+    print(out)
 
-    assert isinstance(out[0], str)
-    assert len(out) == 3
+    assert isinstance(out, list)
 
 
 def test_splitter_tokens_output(text, splitter):
@@ -88,3 +95,18 @@ def test_parser_tokens_output(text, parser):
     assert len(out[0]) == 2
     assert isinstance(out[0][0], str)
     assert isinstance(out[0][1], str)
+
+
+def test_split_parser_tokens_output(text, split_parser):
+    """
+    """
+    out = split_parser.split_parse(text, verbose=False)
+
+    assert isinstance(out, list)
+
+    # NOTE: full functionality of split_parse is not yet implemented.
+
+    # assert isinstance(out[0], tuple)
+    # assert len(out[0]) == 2
+    # assert isinstance(out[0][0], str)
+    # assert isinstance(out[0][1], str)