A Modern Revisit to the TREC Dataset Using Deep Learning Approaches
Dataset: TREC Question Classification (TREC-6)
Paper Reference: Li & Roth, 2002
Question classification is a foundational task in natural language understanding, with applications in question answering systems, chatbots, and information retrieval. This project revisits the classic TREC dataset using modern deep learning models, starting with a convolutional baseline (TextCNN) and expanding to more complex architectures in subsequent phases. Our approach emphasizes rigorous experimentation, hyperparameter tuning, and semantic error analysis.
We use the TREC 6-way classification dataset, consisting of:
- 5,500 training questions
- 500 test questions

Preprocessing steps:
- Duplicate removal from both the training and test sets
- Tokenization and numerical encoding via the Keras tokenizer
- Padding sequences to a fixed length
- 80/20 training-validation split for model tuning
- Framework: PyTorch
- Optimization Tools: Optuna (for hyperparameter tuning), MLflow (for experiment tracking)
- Hardware: NVIDIA GeForce RTX 3060 (CUDA-enabled)
- Objective: Minimize validation loss
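A minimal sketch of the preprocessing steps listed above, assuming `train_texts`/`train_labels` hold the raw questions and integer class labels, with an illustrative `MAX_LEN`:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# train_texts / train_labels: raw question strings and integer class labels (assumed given)
# Remove duplicate (question, label) pairs while preserving order
pairs = list(dict.fromkeys(zip(train_texts, train_labels)))
train_texts, train_labels = [p[0] for p in pairs], [p[1] for p in pairs]

# Tokenize and encode; out-of-vocabulary words map to <UNK>
tokenizer = Tokenizer(oov_token="<UNK>")
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)

MAX_LEN = 30  # illustrative fixed sequence length
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# 80/20 training-validation split for model tuning
X_train, X_val, y_train, y_val = train_test_split(X, train_labels, test_size=0.2, random_state=42)
```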
The first model implemented is a multi-kernel convolutional neural network (TextCNN) inspired by Kim (2014). It captures n-gram level semantics through parallel convolutional filters of various sizes.
After 25 tuning trials, the best configuration was:
```python
{
    'embedding_dim': 512,
    'num_filters': 128,
    'kernels': '1,3,5'
}
```
This architecture yielded the lowest validation loss: 0.0646
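A hedged sketch of how such a search can be wired together with Optuna and MLflow; the `build_textcnn` and `train_and_evaluate` helpers are hypothetical placeholders, not the project's actual code:

```python
import optuna
import mlflow

def objective(trial):
    params = {
        "embedding_dim": trial.suggest_categorical("embedding_dim", [128, 256, 512]),
        "num_filters": trial.suggest_categorical("num_filters", [64, 128, 256]),
        "kernels": trial.suggest_categorical("kernels", ["1,3,5", "3,5,7"]),
    }
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        model = build_textcnn(**params)        # hypothetical model factory
        val_loss = train_and_evaluate(model)   # hypothetical training/validation helper
        mlflow.log_metric("val_loss", val_loss)
    return val_loss  # Optuna minimizes validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)  # e.g. {'embedding_dim': 512, 'num_filters': 128, 'kernels': '1,3,5'}
```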
Loss Function: CrossEntropyLoss
Optimizer: Adam (lr=0.001)
```
TextCNN(
  (embedding): Embedding(8482, 512)
  (conv1): Conv1d(512, 128, kernel_size=(1,), padding=same)
  (conv2): Conv1d(512, 128, kernel_size=(3,), padding=same)
  (conv3): Conv1d(512, 128, kernel_size=(5,), padding=same)
  (fc): Linear(384 → 6)
)
```
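A minimal PyTorch sketch consistent with the summary above, using a `ModuleList` for the three parallel branches; the final lines instantiate the stated loss and optimizer:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=8482, embedding_dim=512, num_filters=128,
                 kernel_sizes=(1, 3, 5), num_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # One Conv1d branch per kernel size, applied over the embedded sequence
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, num_filters, kernel_size=k, padding="same")
            for k in kernel_sizes
        ])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)  # 384 -> 6

    def forward(self, x):                          # x: (batch, seq_len) token ids
        emb = self.embedding(x).permute(0, 2, 1)   # -> (batch, embedding_dim, seq_len)
        # ReLU + global max-pooling over time for each branch, then concatenate
        pooled = [torch.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, num_classes) logits

model = TextCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```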
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
ABBR | 1.00 | 1.00 | 1.00 | 9 |
DESC | 1.00 | 1.00 | 1.00 | 138 |
ENTY | 0.90 | 0.95 | 0.92 | 94 |
HUM | 1.00 | 1.00 | 1.00 | 65 |
LOC | 0.92 | 0.95 | 0.93 | 81 |
NUM | 1.00 | 0.93 | 0.96 | 113 |
accuracy | | | 0.97 | 500 |
macro avg | 0.97 | 0.97 | 0.97 | 500 |
weighted avg | 0.97 | 0.97 | 0.97 | 500 |
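The layout above matches scikit-learn's `classification_report`; a sketch of producing it from the trained model, where the tensor/array names and the alphabetical label ordering are assumptions:

```python
import torch
from sklearn.metrics import classification_report

label_names = ["ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"]  # assumes alphabetical label encoding

model.eval()
with torch.no_grad():
    logits = model(torch.as_tensor(X_test, dtype=torch.long))  # X_test: padded test question ids
    preds = logits.argmax(dim=1).cpu().numpy()

print(classification_report(y_test, preds, target_names=label_names))
```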
- Kernels [1, 3, 5] outperform [3, 5, 7]
- The 1-gram kernel captures strong lexical cues (e.g., "who" -> HUM, "where" -> LOC)
- Higher embedding dimensions (up to 512) lead to better performance
- Suggests richer semantic representation helps in capturing question intent
- 64 vs 128 filters perform similarly
- Adding more filters beyond 128 may yield diminishing returns
Based on a semantic review of the misclassified samples, the errors generally fall into four key categories:
- Overreliance on Surface-Level Keywords
CNNs capture local patterns (n-grams), but when questions use generic structures like "What is the..." or "How many..." without strong class-indicative keywords (e.g., "who", "where", "how much"), the model tends to guess the most likely class seen during training, commonly ENTY.
See Sample 4: the true label is NUM, but the model predicted ENTY due to the abstract phrasing and missing numeric hints.
- Named Entity Confusion
The presence of named places or countries (e.g., "New York", "Minnesota", "Madrid", "Canada") misleads the model into predicting LOC, even when the actual question is about:
- sales tax (NUM)
- energy output (ENTY)
- population size (NUM)
The model associates geographic entities with LOC, regardless of the actual intent of the question.
- Information Loss via `<UNK>` Tokens
Several samples contain `<UNK>` tokens, representing words not in the vocabulary. These often occur in critical semantic positions, like:
- key nouns: "melting point of `<UNK>`"
- disambiguating words: "line between `<UNK>` and `<UNK>`"
The CNN fails to form a meaningful n-gram when part of it is unknown, weakening its semantic grasp.
- Ambiguous or Overlapping Categories
Some real-world questions inherently straddle multiple labels, e.g.:
- "What is the temperature of the sun?" Could be NUM (value) or ENTY (property).
- "What are Canada’s two territories?" Could be LOC or ENTY, depending on interpretation.
The hard class boundaries in TREC labels don't always reflect natural question semantics, causing difficulty in edge cases.
No | Text | True Label | Predicted |
---|---|---|---|
1 | other what is the longest major league baseball winning streak | ENTY | LOC |
2 | other what imaginary line is between the north and south | LOC | ENTY |
3 | other what is the life expectancy of a dollar bill | NUM | ENTY |
4 | other what is the life expectancy for | NUM | ENTY |
5 | temp the sun 's what is the temperature | NUM | ENTY |
6 | other what is the major line near kentucky | ENTY | LOC |
7 | other what is the world 's population | NUM | LOC |
8 | other what is the electrical output in madrid spain | ENTY | LOC |
9 | other what is the point of gold | NUM | ENTY |
10 | other what is the sales tax in minnesota | ENTY | LOC |
11 | mount which mountain range in north america from maine to georgia | LOC | ENTY |
12 | money mexican are worth what in u s dollars | NUM | ENTY |
13 | other what are canada 's two territories | LOC | ENTY |
14 | other what is the sales tax rate in new york | NUM | LOC |
15 | other what is the point of copper | NUM | ENTY |
16 | other what is the source of natural gas | ENTY | LOC |
17 | other in the late 's british were used to which colony | LOC | ENTY |
Observation: the model misclassifies when literal keyword cues are missing or when the question's semantics rely on external knowledge.
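A sketch of how such misclassified samples can be collected for manual review, reusing the `tokenizer`, `preds`, and `label_names` from the sketches above:

```python
import numpy as np

y_true = np.asarray(y_test)
wrong = np.where(preds != y_true)[0]

# Map token ids back to words; id 0 is padding in Keras, unknown ids fall back to <UNK>
index_to_word = {i: w for w, i in tokenizer.word_index.items()}

for rank, i in enumerate(wrong, start=1):
    text = " ".join(index_to_word.get(t, "<UNK>") for t in X_test[i] if t != 0)
    print(rank, text, label_names[y_true[i]], label_names[preds[i]])
```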
The TextCNN baseline provides a strong foundation for question classification, achieving high accuracy and effectively capturing local textual patterns. However, it lacks deeper contextual understanding, especially when dealing with ambiguous phrasing or entities lost to `<UNK>` tokens. Future models will aim to address these limitations using sequential and attention-based architectures.