A Modern Revisit to the TREC Dataset Using Deep Learning Approaches
Dataset: TREC Question Classification (TREC-6)
Paper Reference: Li & Roth, 2002
Question classification is a foundational task in natural language understanding, with applications in question answering systems, chatbots, and information retrieval. This project revisits the classic TREC dataset using modern deep learning models, starting with a convolutional baseline (TextCNN) and expanding to more complex architectures in subsequent phases. Our approach emphasizes rigorous experimentation, hyperparameter tuning, and semantic error analysis.
We use the TREC 6-way classification dataset, consisting of:
- 5,500 training questions
- 500 test questions

Preprocessing steps:
- Duplicate removal from both the training and test sets
- Tokenization and numerical encoding via the Keras tokenizer
- Padding sequences to a fixed length
- 80/20 training-validation split for model tuning
- Framework: PyTorch
- Optimization Tools: Optuna (for hyperparameter tuning), MLflow (for experiment tracking)
- Hardware: NVIDIA GeForce RTX 3060 (CUDA-enabled)
- Objective: Minimize validation loss
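A minimal sketch of the preprocessing steps listed above, assuming `train_texts`/`train_labels` hold the raw questions and integer class labels, with an illustrative `MAX_LEN`:

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.model_selection import train_test_split

# train_texts / train_labels: raw question strings and integer class labels (assumed given)
# Remove duplicate (question, label) pairs while preserving order
pairs = list(dict.fromkeys(zip(train_texts, train_labels)))
train_texts, train_labels = [p[0] for p in pairs], [p[1] for p in pairs]

# Tokenize and encode; out-of-vocabulary words map to <UNK>
tokenizer = Tokenizer(oov_token="<UNK>")
tokenizer.fit_on_texts(train_texts)
sequences = tokenizer.texts_to_sequences(train_texts)

MAX_LEN = 30  # illustrative fixed sequence length
X = pad_sequences(sequences, maxlen=MAX_LEN, padding="post")

# 80/20 training-validation split for model tuning
X_train, X_val, y_train, y_val = train_test_split(X, train_labels, test_size=0.2, random_state=42)
```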
The first model implemented is a multi-kernel convolutional neural network (TextCNN) inspired by Kim (2014). It captures n-gram level semantics through parallel convolutional filters of various sizes.
After 25 tuning trials, the best configuration was:
```python
{
    'embedding_dim': 512,
    'num_filters': 128,
    'kernels': '1,3,5'
}
```
This architecture yielded the lowest validation loss: 0.0646
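A hedged sketch of how such a search can be wired together with Optuna and MLflow; the `build_textcnn` and `train_and_evaluate` helpers are hypothetical placeholders, not the project's actual code:

```python
import optuna
import mlflow

def objective(trial):
    params = {
        "embedding_dim": trial.suggest_categorical("embedding_dim", [128, 256, 512]),
        "num_filters": trial.suggest_categorical("num_filters", [64, 128, 256]),
        "kernels": trial.suggest_categorical("kernels", ["1,3,5", "3,5,7"]),
    }
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        model = build_textcnn(**params)        # hypothetical model factory
        val_loss = train_and_evaluate(model)   # hypothetical training/validation helper
        mlflow.log_metric("val_loss", val_loss)
    return val_loss  # Optuna minimizes validation loss

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=25)
print(study.best_params)  # e.g. {'embedding_dim': 512, 'num_filters': 128, 'kernels': '1,3,5'}
```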
Loss Function: CrossEntropyLoss
Optimizer: Adam (lr=0.001)
```
TextCNN(
  (embedding): Embedding(8482, 512)
  (conv1): Conv1d(512, 128, kernel_size=(1,), padding=same)
  (conv2): Conv1d(512, 128, kernel_size=(3,), padding=same)
  (conv3): Conv1d(512, 128, kernel_size=(5,), padding=same)
  (fc): Linear(384 → 6)
)
```
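A minimal PyTorch sketch consistent with the summary above, using a `ModuleList` for the three parallel branches; the final lines instantiate the stated loss and optimizer:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size=8482, embedding_dim=512, num_filters=128,
                 kernel_sizes=(1, 3, 5), num_classes=6):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # One Conv1d branch per kernel size, applied over the embedded sequence
        self.convs = nn.ModuleList([
            nn.Conv1d(embedding_dim, num_filters, kernel_size=k, padding="same")
            for k in kernel_sizes
        ])
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)  # 384 -> 6

    def forward(self, x):                          # x: (batch, seq_len) token ids
        emb = self.embedding(x).permute(0, 2, 1)   # -> (batch, embedding_dim, seq_len)
        # ReLU + global max-pooling over time for each branch, then concatenate
        pooled = [torch.relu(conv(emb)).max(dim=2).values for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))   # (batch, num_classes) logits

model = TextCNN()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
```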
Class | Precision | Recall | F1-Score | Support |
---|---|---|---|---|
ABBR | 1.00 | 1.00 | 1.00 | 9 |
DESC | 1.00 | 1.00 | 1.00 | 138 |
ENTY | 0.90 | 0.95 | 0.92 | 94 |
HUM | 1.00 | 1.00 | 1.00 | 65 |
LOC | 0.92 | 0.95 | 0.93 | 81 |
NUM | 1.00 | 0.93 | 0.96 | 113 |
accuracy | | | 0.97 | 500 |
macro avg | 0.97 | 0.97 | 0.97 | 500 |
weighted avg | 0.97 | 0.97 | 0.97 | 500 |
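The layout above matches scikit-learn's `classification_report`; a sketch of producing it from the trained model, where the tensor/array names and the alphabetical label ordering are assumptions:

```python
import torch
from sklearn.metrics import classification_report

label_names = ["ABBR", "DESC", "ENTY", "HUM", "LOC", "NUM"]  # assumes alphabetical label encoding

model.eval()
with torch.no_grad():
    logits = model(torch.as_tensor(X_test, dtype=torch.long))  # X_test: padded test question ids
    preds = logits.argmax(dim=1).cpu().numpy()

print(classification_report(y_test, preds, target_names=label_names))
```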
- Kernels [1, 3, 5] outperform [3, 5, 7]
- The 1-gram kernel captures strong lexical cues (e.g., "who" -> HUM, "where" -> LOC)
- Higher embedding dimensions (up to 512) lead to better performance
- Suggests richer semantic representation helps in capturing question intent
- 64 vs 128 filters perform similarly
- Adding more filters beyond 128 may yield diminishing returns
Based on a semantic review of the misclassified samples, the errors generally fall into four key categories:
- Overreliance on Surface-Level Keywords
CNNs capture local patterns (n-grams), but when questions use generic structures like "What is the..." or "How many..." without strong class-indicative keywords (e.g., "who", "where", "how much"), the model tends to guess the most likely class seen during training, commonly ENTY.
See Sample 4: the true label is NUM, but the model predicted ENTY due to the abstract phrasing and missing numeric hints.
- Named Entity Confusion
The presence of named places or countries (e.g., "New York", "Minnesota", "Madrid", "Canada") misleads the model into predicting LOC, even when the actual question is about:
- sales tax (NUM)
- energy output (ENTY)
- population size (NUM)
The model associates geographic entities with LOC, regardless of the actual intent of the question.
- Information Loss via `<UNK>` Tokens
Several samples contain `<UNK>` tokens, representing words not in the vocabulary. These often occur in critical semantic positions, like:
- key nouns: "melting point of `<UNK>`"
- disambiguating words: "line between `<UNK>` and `<UNK>`"
The CNN fails to form a meaningful n-gram when part of it is unknown, weakening its semantic grasp.
- Ambiguous or Overlapping Categories
Some real-world questions inherently straddle multiple labels, e.g.:
- "What is the temperature of the sun?" Could be NUM (value) or ENTY (property).
- "What are Canada’s two territories?" Could be LOC or ENTY, depending on interpretation.
The hard class boundaries in TREC labels don't always reflect natural question semantics, causing difficulty in edge cases.
No | Text | True Label | Predicted |
---|---|---|---|
1 | other what is the longest major league baseball winning streak | ENTY | LOC |
2 | other what imaginary line is between the north and south | LOC | ENTY |
3 | other what is the life expectancy of a dollar bill | NUM | ENTY |
4 | other what is the life expectancy for | NUM | ENTY |
5 | temp the sun 's what is the temperature | NUM | ENTY |
6 | other what is the major line near kentucky | ENTY | LOC |
7 | other what is the world 's population | NUM | LOC |
8 | other what is the electrical output in madrid spain | ENTY | LOC |
9 | other what is the point of gold | NUM | ENTY |
10 | other what is the sales tax in minnesota | ENTY | LOC |
11 | mount which mountain range in north america from maine to georgia | LOC | ENTY |
12 | money mexican are worth what in u s dollars | NUM | ENTY |
13 | other what are canada 's two territories | LOC | ENTY |
14 | other what is the sales tax rate in new york | NUM | LOC |
15 | other what is the point of copper | NUM | ENTY |
16 | other what is the source of natural gas | ENTY | LOC |
17 | other in the late 's british were used to which colony | LOC | ENTY |
Observation: the model misclassifies when literal keyword cues are missing or when the question's semantics rely on external knowledge.
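A sketch of how such misclassified samples can be collected for manual review, reusing the `tokenizer`, `preds`, and `label_names` from the sketches above:

```python
import numpy as np

y_true = np.asarray(y_test)
wrong = np.where(preds != y_true)[0]

# Map token ids back to words; id 0 is padding in Keras, unknown ids fall back to <UNK>
index_to_word = {i: w for w, i in tokenizer.word_index.items()}

for rank, i in enumerate(wrong, start=1):
    text = " ".join(index_to_word.get(t, "<UNK>") for t in X_test[i] if t != 0)
    print(rank, text, label_names[y_true[i]], label_names[preds[i]])
```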
The TextCNN baseline provides a strong foundation for question classification, achieving high accuracy and effectively capturing local textual patterns. However, it lacks deeper contextual understanding, especially when dealing with ambiguous phrasing or entities lost to `<UNK>` tokens. Future models will aim to address these limitations using sequential and attention-based architectures.