Objective: Build a deep learning model for sequential sentence classification, so that “Harder to Read” abstracts can be presented as “Easier to Read” text.
PubMed 20k RCT is a dataset for sequential sentence classification built from PubMed. It consists of approximately 20,000 abstracts of randomized controlled trials. Each sentence of each abstract is labeled with its role in the abstract, using one of the following classes: background, objective, method, result, or conclusion. The dataset is intended to improve tools for efficiently skimming the literature, particularly in the medical field.
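To make the "one label per sentence" structure concrete, here is a minimal parsing sketch. It assumes the layout of the public text release of the dataset: each abstract starts with a `###<id>` line, each sentence line is `LABEL<TAB>sentence`, and a blank line ends the abstract. The function name `parse_abstracts`, the record keys, and the demo text are hypothetical and only for illustration.

```python
from typing import Dict, List


def parse_abstracts(text: str) -> List[Dict[str, object]]:
    """Turn raw file contents into one record per labeled sentence."""
    samples: List[Dict[str, object]] = []
    current = []  # (label, sentence) pairs of the abstract being read
    # The extra "" guarantees the last abstract is flushed even without a trailing blank line.
    for line in text.splitlines() + [""]:
        line = line.strip()
        if line.startswith("###"):  # assumed abstract-ID marker
            current = []
        elif not line:  # a blank line closes the current abstract
            for position, (label, sentence) in enumerate(current):
                samples.append({
                    "target": label,
                    "text": sentence,
                    "line_number": position,       # position of the sentence in its abstract
                    "total_lines": len(current),   # number of sentences in the abstract
                })
            current = []
        else:
            label, _, sentence = line.partition("\t")
            current.append((label, sentence))
    return samples


# Made-up demo text, not taken from the dataset.
demo = (
    "###1\n"
    "OBJECTIVE\tTo test drug A .\n"
    "METHODS\tWe randomized 40 patients .\n"
    "\n"
    "###2\n"
    "RESULTS\tPain decreased .\n"
)
rows = parse_abstracts(demo)
```

Keeping the sentence position alongside the text preserves the sequential nature of the task, since a sentence's role depends partly on where it sits in the abstract.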
- |V| represents the vocabulary size.
- The training, validation, and test sets are detailed with the number of abstracts and sentences.
Distribution Insights:
- Fig 1: Most abstracts have between 7 and 15 sentences.
- Fig 2: Most sentences contain between 0 and 55 tokens.
- Fig 3: Most character sequences are between 0 and 200 characters long.
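Length distributions like these are often used to pick a fixed sequence length for padding or truncation. The sketch below shows how per-sentence token and character lengths, and a coverage cutoff, could be computed. Whether the ranges above were used this way in this project is an assumption. The helper `length_at_percentile` and the sample sentences are hypothetical.

```python
import math
from typing import List


def length_at_percentile(lengths: List[int], pct: float) -> int:
    """Smallest length covering at least `pct` percent of the items (nearest-rank method)."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


# Made-up sentences, whitespace-tokenized for simplicity.
sentences = [
    "Patients were randomized into two groups .",
    "Pain scores improved .",
    "No adverse events were reported in either arm of the trial .",
    "Background",
]
token_lengths = [len(s.split()) for s in sentences]  # tokens per sentence
char_lengths = [len(s) for s in sentences]           # characters per sentence

token_cutoff = length_at_percentile(token_lengths, 75)
char_cutoff = length_at_percentile(char_lengths, 75)
```

A cutoff near the upper end of the distribution (for example the 95th percentile) keeps most sentences intact while avoiding padding every sequence to the length of the longest outlier.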
- Naïve Bayes with TF-IDF Encoder (Baseline Model)
- Conv1D with Token Embedding
- Pretrained Feature Extractor (Using Universal Sentence Encoder)
- Conv1D with Character Encoding
- Naïve Bayes with TF-IDF Encoder: Uses TF-IDF to convert sentences into numerical feature vectors and a multinomial Naïve Bayes classifier (MultinomialNB) for classification.
- Conv1D with Token Embedding: Converts text inputs into numerical sequences using token embedding and applies a 1D convolution layer.
- Pretrained Feature Extractor (USE): Uses the pretrained Universal Sentence Encoder to tokenize and embed sentences, followed by a convolution layer.
- Conv1D with Character Encoding: Splits sequences into characters, creates a feature vector for each character, and applies a character embedding layer.
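The token-level and character-level models above differ mainly in how a sentence is split before it reaches an embedding layer. The following sketch shows both encodings for one sentence using only the standard library. The integer ids it produces are what an embedding layer would look up; the embedding and Conv1D layers themselves are not shown. The reserved ids, the vocabulary scheme, and the helper names (`split_chars`, `build_vocab`, `encode`) are assumptions for illustration, not the project's actual preprocessing.

```python
from typing import Dict, List

PAD, UNK = 0, 1  # reserved ids: padding and unknown symbol (assumed convention)


def split_chars(text: str) -> List[str]:
    """Character-level view of a sentence, spaces included."""
    return list(text)


def build_vocab(sequences: List[List[str]]) -> Dict[str, int]:
    """Map each distinct symbol to an integer id, starting after the reserved ids."""
    symbols = sorted({s for seq in sequences for s in seq})
    return {s: i + 2 for i, s in enumerate(symbols)}


def encode(seq: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Convert symbols to ids, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(s, UNK) for s in seq][:max_len]
    return ids + [PAD] * (max_len - len(ids))


sentence = "pain fell"
token_seq = sentence.split()        # ["pain", "fell"]
char_seq = split_chars(sentence)    # ["p", "a", "i", "n", " ", ...]

token_vocab = build_vocab([token_seq])
char_vocab = build_vocab([char_seq])

token_ids = encode(token_seq, token_vocab, max_len=4)
char_ids = encode(char_seq, char_vocab, max_len=12)
```

Character sequences are much longer than token sequences for the same sentence, but the character vocabulary is tiny and there are no out-of-vocabulary words, which is the usual trade-off between the two encodings.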
- Conv1D with Token Embedding outperforms the other models, with the highest accuracy (82.45), precision (82.22), recall (82.45), and F1 score (82.15).
- Conv1D with Token Embedding is the top performer in the test set as well.
- Conv1D with Token Embedding shows the highest F1 score, indicating strong performance in both validation and test sets.
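The reported accuracy and recall are identical (82.45), which is consistent with support-weighted averaging across classes: under that scheme, weighted recall always equals accuracy. That weighted averaging was used here is an assumption. The sketch below computes the four metrics that way on made-up labels; `weighted_scores` is a hypothetical helper.

```python
from collections import Counter
from typing import Dict, List


def weighted_scores(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Accuracy plus support-weighted precision, recall, and F1."""
    support = Counter(y_true)      # true samples per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)

    precision = recall = f1 = 0.0
    for label, count in support.items():
        p = correct[label] / predicted[label] if predicted[label] else 0.0
        r = correct[label] / count
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = count / n         # class weight = share of true samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f

    return {
        "accuracy": sum(correct.values()) / n,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Made-up labels for illustration only.
y_true = ["Methods", "Methods", "Results", "Background", "Objective", "Results"]
y_pred = ["Methods", "Results", "Results", "Background", "Background", "Results"]
scores = weighted_scores(y_true, y_pred)
```

Because each class's recall is weighted by its share of the true samples, the weights cancel the per-class denominators and the sum reduces to the fraction of correct predictions.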
Observations:
- The model excels in classifying Conclusions, Methods, and Results.
- The difficulty in distinguishing Background from Objective may be due to fewer samples of these classes and potential similarities in their embeddings.
- Top 50 Inaccurate Predictions: Highlights instances where the model struggles, potentially due to dataset biases or sample imbalances.
- Conv1D with Token Embedding is the best-performing model on both validation and test datasets.
- Challenges include distinguishing Background and Objective categories, potentially due to sample size and embedding similarities.
- The confusion matrix supports these findings, with notable confusion between Background and Objective classes, and consistent behavior in Methods and Results classes.
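For readers who want to reproduce this kind of analysis, a confusion matrix can be built from paired true and predicted labels as sketched below. The class names follow the ones used in this summary. The toy labels are made up and do not reflect the reported results, and `confusion_matrix` and `most_confused` are hypothetical helpers.

```python
from collections import Counter
from typing import List, Tuple

LABELS = ["Background", "Objective", "Methods", "Results", "Conclusions"]


def confusion_matrix(y_true: List[str], y_pred: List[str], labels: List[str]) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    counts = Counter(zip(y_true, y_pred))
    return [[counts[(t, p)] for p in labels] for t in labels]


def most_confused(matrix: List[List[int]], labels: List[str]) -> Tuple[str, str, int]:
    """Off-diagonal cell with the largest count, as (true, predicted, count)."""
    cells = [
        (matrix[i][j], labels[i], labels[j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j
    ]
    count, true_label, pred_label = max(cells)
    return true_label, pred_label, count


# Made-up labels for illustration only.
y_true = ["Objective", "Objective", "Background", "Methods", "Results", "Conclusions"]
y_pred = ["Background", "Background", "Background", "Methods", "Results", "Conclusions"]

matrix = confusion_matrix(y_true, y_pred, LABELS)
worst_pair = most_confused(matrix, LABELS)
```

Large off-diagonal counts point to the class pairs the model mixes up, while a strong diagonal for a class indicates consistent predictions.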
Feel free to explore the provided figures and tables for more detailed insights into the model performance and dataset characteristics.