HashingVectorizer fails when data has None values #12347

@mrocklin

Description

HashingVectorizer fails when the input contains None values, which commonly arise from missing data. This may be out of scope for HashingVectorizer, with users expected to handle missing values separately, but I (surprisingly) didn't find a related issue on this topic, so I thought I'd bring it up just in case.

Steps/Code to Reproduce

from sklearn.feature_extraction.text import HashingVectorizer
HashingVectorizer().transform(['a', 'b', None, 'c'])

Expected Results

A scipy.sparse matrix with None treated as another value, possibly 0 or maxval for interpretability.

Actual Results

It fails during preprocessing (traceback below), but presumably we would want to short-circuit before it gets to that point.

AttributeError                            Traceback (most recent call last)
<ipython-input-2-52ba2560fd22> in <module>()
----> 1 HashingVectorizer().transform(['a', 'b', None, 'c'])

~/workspace/scikit-learn/sklearn/feature_extraction/text.py in transform(self, X)
    601 
    602         analyzer = self.build_analyzer()
--> 603         X = self._get_hasher().transform(analyzer(doc) for doc in X)
    604         if self.binary:
    605             X.data.fill(1)

~/workspace/scikit-learn/sklearn/feature_extraction/hashing.py in transform(self, raw_X)
    166         indices, indptr, values = \
    167             _hashing_transform(raw_X, self.n_features, self.dtype,
--> 168                                self.alternate_sign)
    169         n_samples = indptr.shape[0] - 1
    170 

~/workspace/scikit-learn/sklearn/feature_extraction/_hashing.pyx in sklearn.feature_extraction._hashing.transform()

~/workspace/scikit-learn/sklearn/feature_extraction/hashing.py in <genexpr>(.0)
    163             raw_X = (_iteritems(d) for d in raw_X)
    164         elif self.input_type == "string":
--> 165             raw_X = (((f, 1) for f in x) for x in raw_X)
    166         indices, indptr, values = \
    167             _hashing_transform(raw_X, self.n_features, self.dtype,

~/workspace/scikit-learn/sklearn/feature_extraction/text.py in <genexpr>(.0)
    601 
    602         analyzer = self.build_analyzer()
--> 603         X = self._get_hasher().transform(analyzer(doc) for doc in X)
    604         if self.binary:
    605             X.data.fill(1)

~/workspace/scikit-learn/sklearn/feature_extraction/text.py in <lambda>(doc)
    306                                                tokenize)
    307             return lambda doc: self._word_ngrams(
--> 308                 tokenize(preprocess(self.decode(doc))), stop_words)
    309 
    310         else:

~/workspace/scikit-learn/sklearn/feature_extraction/text.py in <lambda>(x)
    254 
    255         if self.lowercase:
--> 256             return lambda x: strip_accents(x.lower())
    257         else:
    258             return strip_accents

AttributeError: 'NoneType' object has no attribute 'lower'
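
The traceback shows the failure comes from the default preprocessor calling x.lower() on None. Until this is handled in the library (if it is in scope at all), a user-side workaround is to replace None before calling transform. The following is only a minimal sketch: fill_missing and its fill parameter are hypothetical names introduced for illustration, not part of scikit-learn, and mapping None to an empty string is just one possible convention.

```python
from sklearn.feature_extraction.text import HashingVectorizer


def fill_missing(docs, fill=''):
    """Hypothetical helper (not a scikit-learn API): replace None entries.

    With fill='' a missing document produces no tokens, so its row in the
    hashed matrix is all zeros. Passing a sentinel token instead would make
    missing values hash to a bucket of their own.
    """
    return [fill if doc is None else doc for doc in docs]


docs = ['a', 'b', None, 'c']

# Replace None before the vectorizer's preprocessor ever sees it.
X = HashingVectorizer().transform(fill_missing(docs))

print(X.shape)      # one row per input document, including the missing one
print(X[2].nnz)     # number of non-zero entries in the row for None
```

Whether an all-zero row or a dedicated sentinel bucket is the better representation of a missing document is a design choice this sketch does not settle.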

Versions

System:
    python: 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19)  [GCC 7.2.0]
executable: /home/mrocklin/Software/anaconda/bin/python
   machine: Linux-4.13.0-26-generic-x86_64-with-debian-stretch-sid

BLAS:
    macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
  lib_dirs: /home/mrocklin/Software/anaconda/lib
cblas_libs: mkl_rt, pthread

Python deps:
       pip: 9.0.1
setuptools: 40.4.3
   sklearn: 0.21.dev0
     numpy: 1.15.1
     scipy: 1.1.0
    Cython: 0.27.3
    pandas: 0.23.4
