Automated BIDS dataset citation tracking system with AI-powered confidence scoring for 300+ neuroscience datasets.
Track and analyze citations for OpenNeuro datasets with a complete pipeline from discovery to interactive dashboards. Features Google Scholar integration, semantic similarity scoring, network analysis, and automated monthly updates via GitHub Actions.
Key Features: Dataset discovery • Citation tracking • AI confidence scoring • Network analysis • Interactive dashboards • JSON/CSV export • GitHub Actions automation
```bash
git clone https://github.com/sccn/nemar-citations.git
cd nemar-citations
pip install -e ".[dev,test]"
```
Requirements: Python 3.11+ • ScraperAPI key • GitHub token (optional)
```bash
# 1. Set up the environment (choose .env or .secrets)
# Option A: .env file
echo "SCRAPERAPI_KEY=your_key_here" > .env
echo "GITHUB_TOKEN=your_token_here" >> .env

# Option B: .secrets file (auto-loaded by the workflow script)
echo "SCRAPERAPI_KEY=your_key_here" > .secrets
echo "GITHUB_TOKEN=your_token_here" >> .secrets

# 2. Run the complete pipeline
chmod +x run_end_to_end_workflow.sh
./run_end_to_end_workflow.sh test             # Test mode (no API calls)
./run_end_to_end_workflow.sh full             # Full pipeline (recommended)
./run_end_to_end_workflow.sh local-ci-test    # Test the CI test workflow locally
./run_end_to_end_workflow.sh local-ci-update  # Test the CI update workflow locally
```
The repository includes several shell scripts for different workflows:
| Script | Purpose | Runtime | When to Use |
|---|---|---|---|
| `run_end_to_end_workflow.sh` | Complete pipeline from discovery to dashboard | 1-3 hours | Production updates, full analysis |
| `run_full_analysis.sh` | Analysis and dashboard generation only | 10-30 min | When citations already exist |
| `migrate_to_json.sh` | Convert pickle files to JSON format | 1-2 min | One-time migration |
The `run_end_to_end_workflow.sh` script automates the entire workflow:
| Mode | Description | Runtime | API Calls | Steps Executed | Branch/PR |
|---|---|---|---|---|---|
| `test` | Controlled test data (3-8 citations) | ~1 min | None | 4-5 only (Analyze, Generate) | No |
| `full` | Recommended: direct pipeline execution | 1-3 hours | Google Scholar, GitHub | 1-5 (all steps) | Yes (auto) |
| `local-ci-test` | Test the GitHub Actions test workflow via Docker | ~5-10 min | None | Runs test suite | No |
| `local-ci-update` | Test the GitHub Actions update workflow via Docker | ~10-30 min | Real API calls | 1-5 (all steps) | Yes (auto) |
Workflow Steps (numbered to match the "Steps Executed" column above; see the sketch after this list):
1. Discover → Find BIDS datasets (EEG/MEG/iEEG)
2. Collect → Fetch citations from Google Scholar
3. Enhance → Add metadata & AI confidence scores
4. Analyze → Network, temporal, theme analysis
5. Generate → Interactive HTML dashboard
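Each step maps to one of the CLI entry points listed later in this README. As a rough sketch, the equivalent manual sequence looks like this (the workflow script wires these together; the exact flags and the ordering within steps 3-4 are assumptions):

```bash
dataset-citations-discover                     # 1. Discover BIDS datasets
dataset-citations-update                       # 2. Collect citations from Google Scholar
dataset-citations-retrieve-metadata            # 3. Enhance: fetch dataset metadata
dataset-citations-score-confidence             # 3. Enhance: AI confidence scores
dataset-citations-analyze-temporal             # 4. Analyze: temporal trends
dataset-citations-analyze-networks             # 4. Analyze: citation networks
dataset-citations-create-interactive-reports   # 5. Generate the HTML dashboard
```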
Mode Selection Guide:
- Use `test` for quick validation during development
- Use `full` for actual citation updates (runs natively, faster)
- Use `local-ci-test` to test or debug the GitHub Actions test workflow
- Use `local-ci-update` to test or debug the GitHub Actions update workflow
Branch Protection: Both `full` and `local-ci-update` modes automatically create a feature branch and pull request to protect the main branch from direct commits.
```bash
# 1. Create the update script
cat > ~/update_citations.sh << 'EOF'
#!/bin/bash
cd /path/to/dataset_citations
source ~/miniconda3/etc/profile.d/conda.sh
conda activate dataset-citations
./run_end_to_end_workflow.sh full
EOF
chmod +x ~/update_citations.sh

# 2. Add ONE of these entries via crontab -e
0 2 1 * * ~/update_citations.sh >> ~/citations.log 2>&1   # Monthly
0 3 * * 0 ~/update_citations.sh >> ~/citations.log 2>&1   # Weekly
0 4 * * * ~/update_citations.sh >> ~/citations.log 2>&1   # Daily

# 3. Monitor the log
tail -f ~/citations.log
```
```python
from dataset_citations.core import citation_utils
from dataset_citations.quality.confidence_scoring import CitationConfidenceScorer
from dataset_citations.quality.dataset_metadata import DatasetMetadataRetriever

# Convert a legacy pickle file to JSON
json_path = citation_utils.migrate_pickle_to_json(
    'citations/pickle/ds002718.pkl',
    'citations/json',
    'ds002718',
)

# Load the citation data
citations = citation_utils.load_citation_json(json_path)
print(f"Dataset {citations['dataset_id']} has {citations['num_citations']} citations")

# Retrieve dataset metadata (needed for confidence scoring below)
retriever = DatasetMetadataRetriever()
dataset_metadata = retriever.get_dataset_metadata('ds002718')

# Calculate AI confidence scores against the dataset metadata
scorer = CitationConfidenceScorer()
confidence_scores = scorer.score_citations_for_dataset(
    'ds002718', citations, dataset_metadata
)
```
```bash
# Discovery & updates
dataset-citations-discover                      # Find datasets
dataset-citations-update                        # Fetch citations
dataset-citations-migrate                       # Pickle → JSON

# Quality & analysis
dataset-citations-retrieve-metadata             # Get GitHub metadata
dataset-citations-score-confidence              # AI confidence scoring
dataset-citations-analyze-temporal              # Temporal trends
dataset-citations-analyze-networks              # Citation networks

# Dashboards
dataset-citations-create-interactive-reports    # Generate HTML dashboard

# All commands support --help for detailed usage
```
```jsonc
{
  "dataset_id": "ds002718",
  "num_citations": 13,
  "citation_details": [{
    "title": "Paper title",
    "author": "Authors",
    "year": 2021,
    "confidence_score": 0.82  // AI similarity score
  }]
}
```
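Because the JSON schema is stable, downstream filtering is straightforward. For example, a minimal snippet that keeps only high-confidence citations (the file path and the 0.7 cutoff are illustrative assumptions, not project defaults):

```python
import json

# Load one dataset's citation file (path assumed for illustration)
with open("citations/json/ds002718.json") as f:
    data = json.load(f)

# Keep citations above an arbitrary confidence cutoff
high_conf = [
    c for c in data["citation_details"]
    if c.get("confidence_score", 0.0) >= 0.7
]
print(f"{len(high_conf)}/{data['num_citations']} citations pass the cutoff")
```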
AI-powered relevance scoring (0.0-1.0) using sentence transformers to compare dataset metadata with citation abstracts. Helps filter high-confidence citations and identify misattributions.
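The exact pipeline lives in `dataset_citations.quality.confidence_scoring`; as a rough illustration of the underlying idea, here is a minimal sentence-transformers sketch (the model name and input strings are assumptions, not the package's actual configuration):

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical illustration: embed dataset metadata and a citation abstract,
# then compare them with cosine similarity. The real CitationConfidenceScorer
# handles batching, device selection, and score normalization.
model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice

dataset_text = "EEG study of face recognition in 18 healthy adults ..."
citation_abstract = "We reanalyzed an open EEG face-processing dataset ..."

embeddings = model.encode([dataset_text, citation_abstract])
similarity = float(util.cos_sim(embeddings[0], embeddings[1]))
print(f"relevance: {similarity:.2f}")  # higher = more likely a true citation
```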
```bash
# Setup
git clone https://github.com/sccn/nemar-citations.git
cd nemar-citations
conda create -n dataset-citations python=3.11
conda activate dataset-citations
pip install -e ".[dev,test]"

# Testing
pytest tests/ -v                  # Fast tests
pytest --cov=dataset_citations    # With coverage

# Code quality
black src/ tests/                 # Format
ruff check --fix src/ tests/      # Lint
```
Core Components:
- Discovery: Find BIDS datasets via GitHub API
- Collection: Google Scholar citation fetching with proxy rotation
- Processing: Parallel processing, format conversion, validation
- Analysis: Network graphs, temporal trends, theme clustering
- Dashboard: Interactive HTML with D3.js visualizations
Data Flow: Discovery → Fetching → Processing → Analysis → Dashboard
| Issue | Solution |
|---|---|
| ScraperAPI key not found | Add `SCRAPERAPI_KEY` to `.env` |
| Google Scholar rate limit | Wait for proxy rotation |
| GitHub API rate limit | Add `GITHUB_TOKEN` to `.env` |
| MPS memory error (macOS) | Use `--device cpu` |
| Import errors | Reinstall: `pip install -e ".[dev,test]"` |
Debug: Add the `--verbose` flag to any command
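For example, combining the flags above to work around the macOS MPS issue while debugging (whether every command accepts both flags is an assumption):

```bash
dataset-citations-score-confidence --device cpu --verbose
```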
Support: GitHub Issues
- Fork the repo & create a feature branch
- Make changes with tests
- Run `pytest` and `black`
- Submit a PR with an issue reference
Guidelines: Type hints • Docstrings • Tests • No mocks
CC BY-NC-SA 4.0 - Attribution, NonCommercial, ShareAlike
If you use this software in your research, please cite:
```bibtex
@software{shirazi2025nemarcitations,
  title        = {NEMAR Citations: Automated BIDS Dataset Citation Tracking System},
  author       = {Shirazi, Seyed Yahya},
  year         = {2025},
  url          = {https://github.com/sccn/nemar-citations},
  organization = {Swartz Center for Computational Neuroscience (SCCN)}
}
```
- Author: Seyed Yahya Shirazi
- Organization: Swartz Center for Computational Neuroscience (SCCN)
- Project: NEMAR - NeuroElectroMagnetic Archive
- GitHub: @neuromechanist
Built with ❤️ for NEMAR and the neuroscience open science community.
Last updated: September 19, 2025