- 1995-lecun-convolutional
- convolutional networks, sigmoid, average pooling
- precursor of R-CNN for multi-object recognition
- digits and handwriting
- 2013-krizhevsky-imagenet
- ReLU, GPU training, local response normalization, pooling layers, dropout
- ImageNet dataset
- 2014-srivastava-dropout
- dropout as an ensemble of networks
- intended to prevent overfitting, improve generalization
- standard test cases (CIFAR, MNIST, etc.)
- 2014-simonyan-maxpool-very-deep
- 19 weight layers, multicrop evaluation, "VGG team" ILSVRC-2014 challenge
- 2015-ioffe-batch-normalization
- introduces batch normalization for faster training
- 2015-szegedy-rethinking-inception
- label smoothing, separable convolutions
- 2015-szegedy-going-deeper
- "inception modules", modular construction
- 2016-szegedy-inception
- "inception modules", modular construction
- 2015-he-resnet
- introduces the ResNet architecture
- 2015-jaderberg-spatial-transformer
- adds spatial transformations/distortions to learnable primitives
- 2017-dai-deformable
- adds deformable convolutions to learnable primitives
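
Label smoothing, noted above under 2015-szegedy-rethinking-inception, can be sketched in a few lines. This is a minimal illustration assuming a uniform prior over classes; the function name `smooth_labels` and the default `eps=0.1` are choices made for this example, not taken from the notes.

```python
def smooth_labels(target, num_classes, eps=0.1):
    """Return a smoothed target distribution for one training example.

    The one-hot vector for `target` is mixed with a uniform distribution:
    q(k) = (1 - eps) * [k == target] + eps / num_classes
    """
    uniform = eps / num_classes
    return [
        (1.0 - eps) * (1.0 if k == target else 0.0) + uniform
        for k in range(num_classes)
    ]


# Example: class 2 out of 4 classes
q = smooth_labels(2, 4, eps=0.1)
```

The smoothed vector still sums to one, so it can replace the one-hot target in a cross-entropy loss without further changes.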
OCR:
- 2013-breuel-high-performance-ocr-lstm
- LSTM for printed OCR
- 2013-goodfellow-multidigit
- Google SVHN digits, 200k numbers with bounding boxes
- 8 layer convnet, ad-hoc sequence modeling
- 2017-breuel-lstm-ocr
- comparison of different convnet+LSTM architectures for OCR
- 2015-dong-superresolution
- explicit upscaling of images
- 2015-ronneberger-unet
- general U-net architecture for image-to-image mappings
- 2015-byeon-mdlstm-segmentation
- MDLSTM for image segmentation
- 2015-stollenga-pyramid-lstm
- pyramid LSTM architecture
- 2015-long-convnet-semantic-segmentation
- semantic segmentation with convolutional networks
- 2015-girshick-rich-feature-hierarchies
- semantic segmentation with convolutional networks (multitask)
- 2015-noh-deconvolutional-networks
- deconvolution (transposed convolution) and unpooling layers for semantic segmentation
- 2017-blogpost-semantic-segmentation
- survey of semantic segmentation architectures
- 2016-chen-deeplab
- 2017-chen-deeplab-atrous
- 2017-chen-rethinking-atrous
- adds atrous convolutions to learnable primitives, DeepLab v3
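
As a reminder of what "atrous" (dilated) convolution means in the DeepLab entries above, here is a one-dimensional sketch in plain Python. It assumes no padding and uses the cross-correlation convention common in deep learning; `atrous_conv1d` and its `rate` parameter are names chosen for this example.

```python
def atrous_conv1d(x, w, rate=1):
    """Valid (unpadded) 1-D atrous convolution.

    y[i] = sum_k x[i + rate * k] * w[k]
    With rate=1 this is an ordinary convolution; larger rates spread the
    same filter taps over a wider input span without adding weights.
    """
    span = rate * (len(w) - 1)  # distance between first and last tap
    return [
        sum(x[i + rate * k] * w[k] for k in range(len(w)))
        for i in range(len(x) - span)
    ]


x = [0, 1, 2, 3, 4, 5, 6, 7]
w = [1, 0, -1]
dense = atrous_conv1d(x, w, rate=1)    # taps at offsets 0, 1, 2
dilated = atrous_conv1d(x, w, rate=2)  # taps at offsets 0, 2, 4
```

The filter has three weights in both calls; only the receptive field changes (3 inputs versus 5).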
OCR:
- 2015-afzal-binarization-mdlstm
- MDLSTM for binarization (image-to-image transformation)
- 2017-breuel-mdlstm-layout
- layout analysis with MDLSTM
- 2017-chen-convnet-page-segmentation
- layout analysis with convolutional networks
- 2017-he-semantic-page-segmentation
- layout analysis with convolutional networks
- 2018-mohan-layout-error-correction-using-dnn
- layout analysis with convolutional networks
- 2014-lecun-overfeat
- convolutional network, generic feature extraction
- sliding window at multiple scales across image
- regression network
- 2015-liu-multibox
- input image and ground truth boxes
- 2015-ren-faster-rcnn-v3
- region proposal network (object/not object, box coords at each loc)
- translation invariant anchors
OCR:
- 2014-jaderberg-convnet-ocr-wild
- convnet, R-CNN, bounding box regression
- synthetic, ICDAR scene text, IIIT Scene Text, IIIT 5k-word, IIIT Sports-10k, BBC News
- no bounding boxes in general; initial detector trained on positive word samples, negative images
- 10k proposals per image
- 2014-jiang-saliency
- explicit computation of salience
- 2015-zhou-class-attention-mapping
- class activation mapping (CAM): class-related feature maps from global average pooling weights
- 2016-selvaraju-gradient-mapping
- gradient-based mapping of class-related features
- 2013-zeiler-visualizing-cnns
- learns inverses to layers via unpooling, transposed convolutions
- 2016-yu-visualizing-vgg
- applied to VGG16
- 2018-li-pyramid-attention
- combines multiresolution and attention
- 1999-gers-lstm
- introduces forget gates for LSTM (the now-standard LSTM variant)
- 2005-graves-bdlstm
- introduces bidirectional LSTM
- 2006-graves-ctc
- introduces CTC alignment (a kind of forward-backward algorithm)
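
The "forward-backward" remark for CTC can be made concrete with a small sketch of the forward pass. This is an illustrative plain-Python version working in probability space (real implementations work in log space); it assumes the blank symbol has index 0, and the function names are chosen for this example.

```python
BLANK = 0  # assumed index of the CTC blank symbol


def ctc_collapse(path, blank=BLANK):
    """Map a frame-wise path to a label string: merge repeats, drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != blank:
            out.append(s)
        prev = s
    return out


def ctc_forward(probs, target, blank=BLANK):
    """Total probability of all paths that collapse to `target`.

    probs[t][k] is the probability of label k at frame t.
    """
    # Extended target: a blank before, between, and after the labels.
    ext = [blank]
    for s in target:
        ext += [s, blank]
    n = len(ext)
    alpha = [0.0] * n
    alpha[0] = probs[0][ext[0]]
    if n > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * n
        for s in range(n):
            total = alpha[s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[s - 1]        # advance by one position
            # Skip over a blank only between two different labels.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    # Valid paths end on the last label or on the trailing blank.
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)
```

For short inputs the result can be checked against a brute-force sum over every path whose `ctc_collapse` equals the target, which is a convenient sanity test.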
OCR:
- 2012-elaguni-ocr-in-video
- manually labeled training data on small dataset
- multiscale, convnet features, BLSTM, CTC
- 2014-bluche-comparison-sequence-trained
- HMM, GMM-HMM, MLP-HMM, LSTM
- Rimes, IAM; decoding with Kaldi (ASR toolkit)
- 2016-he-reading-scene-text
- large CNN, Maxout units, LSTM, CTC
- Street View Text, IIIT 5k-word, PhotoOCR, etc., using bounding boxes for training
- 2017-wang-gru-ocr
- 2009-graves-multidimensional
- applies LSTM to multidimensional problems
- 2014-byeon-supervised-texture
- supervised image segmentation using multidimensional LSTM
- 2016-visin-reseg
- separable multidimensional LSTMs for image segmentation
- 2015-sonderby-convolutional
- convolutional LSTM architecture and attention
- 2016-shi-convolutional-lstm
- convolutional LSTM architecture
OCR:
- 2015-visin-renet
- separable multidimensional LSTMs for OCR
- 2012-graves-sequence-transduction
- introduces sequence transduction as an alternative to CTC
- 2015-bahdanau-attention
- content-based attention mechanisms for sequence to sequence tasks
- 2015-zhang-character-level-convnets-text
- simple use of convolutional networks as alternatives to n-grams, sequence models
- 2016-chorowski-better-decoding
- label smoothing and beam search
- 2017-vaswani-attention-is-all-you-need
- high performance sequence-to-sequence with attention
- masked, multi-head attention
- 2017-prabhavalkar-s2s-comparison
- a comparison of different sequence-to-sequence approaches
- 2017-gehring-convolutional-s2s
- purely convolutional sequence-to-sequence with attention
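
For the attention entries above, a single-head scaled dot-product attention with an optional causal mask can be sketched as follows. This is a plain-Python illustration on lists of lists, not an efficient implementation; `attention` and its `causal` flag are names chosen for this example, and multi-head attention would run several such heads on projected inputs and concatenate the results.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]


def attention(Q, K, V, causal=False):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V.

    Q, K, V are lists of row vectors. With causal=True, query i may only
    attend to keys 0..i (masked positions get a score of -inf).
    """
    d = len(K[0])
    out = []
    for i, q in enumerate(Q):
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        if causal:
            scores = [s if j <= i else float("-inf")
                      for j, s in enumerate(scores)]
        weights = softmax(scores)
        out.append([
            sum(wj * v[c] for wj, v in zip(weights, V))
            for c in range(len(V[0]))
        ])
    return out


Q = [[0.0, 0.0], [0.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
full = attention(Q, K, V)                  # every query sees both keys
masked = attention(Q, K, V, causal=True)   # query 0 sees only key 0
```

With zero queries the weights are uniform over the visible keys, so the unmasked output is the mean of the rows of `V`, while the first masked row equals `V[0]`.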
OCR:
- 2015-sahu-s2s-ocr
- standard seq2seq encoder/decoder approach
- t-SNE visualizations of encoded word images
- word images from scanned books
- 2017-nam-dual-attention
- joint visual and text attention networks
OCR:
- 2016-bluche-end-to-end-hw-mdlstm-attention
- full paragraph handwriting recognition without explicit segmentation
- MDLSTM plus attention, tracking, etc.
- IAM database, pretraining LSTM+CTC, curriculum learning
- 2016-lee-recursive-recurrent-attention-wild
- recursive convolutional layers, tied weights, followed by attention, character level modeling
- ICDAR 2003, 2013, SVT, IIIT 5k-word, Synth90k using bounding boxes for training