Closed
Description
HashingVectorizer fails when the data contains None values, as happens with missing values. This may be out of scope for HashingVectorizer, and users may be expected to handle it separately, but (surprisingly) I didn't find a related issue on the topic, so I thought I'd bring it up just in case.
Steps/Code to Reproduce
from sklearn.feature_extraction.text import HashingVectorizer
HashingVectorizer().transform(['a', 'b', None, 'c'])
Expected Results
A scipy.sparse matrix with None treated as another value, possibly 0 or maxval for interpretability.
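As a rough illustration of treating None as "another value", missing entries could be mapped to a placeholder string before vectorizing. This is only a sketch of a user-side workaround, not scikit-learn behaviour: `fill_missing` and its `placeholder` argument are hypothetical names introduced here.

```python
def fill_missing(docs, placeholder=""):
    """Replace None entries with a placeholder string.

    Hypothetical helper, not part of scikit-learn. With the default
    empty placeholder, a missing document contributes no tokens.
    """
    return [placeholder if doc is None else doc for doc in docs]


docs = ["a", "b", None, "c"]
cleaned = fill_missing(docs)
print(cleaned)
```

The cleaned list contains only strings, so it can be passed to `HashingVectorizer().transform(...)` without reaching the failing `x.lower()` call. An empty placeholder produces an all-zero row for the missing document, while a sentinel token would hash it to its own feature.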
Actual Results
It fails during preprocessing (traceback below), but presumably we would want to short-circuit this before it gets to that point.
AttributeError Traceback (most recent call last)
<ipython-input-2-52ba2560fd22> in <module>()
----> 1 HashingVectorizer().transform(['a', 'b', None, 'c'])
~/workspace/scikit-learn/sklearn/feature_extraction/text.py in transform(self, X)
601
602 analyzer = self.build_analyzer()
--> 603 X = self._get_hasher().transform(analyzer(doc) for doc in X)
604 if self.binary:
605 X.data.fill(1)
~/workspace/scikit-learn/sklearn/feature_extraction/hashing.py in transform(self, raw_X)
166 indices, indptr, values = \
167 _hashing_transform(raw_X, self.n_features, self.dtype,
--> 168 self.alternate_sign)
169 n_samples = indptr.shape[0] - 1
170
~/workspace/scikit-learn/sklearn/feature_extraction/_hashing.pyx in sklearn.feature_extraction._hashing.transform()
~/workspace/scikit-learn/sklearn/feature_extraction/hashing.py in <genexpr>(.0)
163 raw_X = (_iteritems(d) for d in raw_X)
164 elif self.input_type == "string":
--> 165 raw_X = (((f, 1) for f in x) for x in raw_X)
166 indices, indptr, values = \
167 _hashing_transform(raw_X, self.n_features, self.dtype,
~/workspace/scikit-learn/sklearn/feature_extraction/text.py in <genexpr>(.0)
601
602 analyzer = self.build_analyzer()
--> 603 X = self._get_hasher().transform(analyzer(doc) for doc in X)
604 if self.binary:
605 X.data.fill(1)
~/workspace/scikit-learn/sklearn/feature_extraction/text.py in <lambda>(doc)
306 tokenize)
307 return lambda doc: self._word_ngrams(
--> 308 tokenize(preprocess(self.decode(doc))), stop_words)
309
310 else:
~/workspace/scikit-learn/sklearn/feature_extraction/text.py in <lambda>(x)
254
255 if self.lowercase:
--> 256 return lambda x: strip_accents(x.lower())
257 else:
258 return strip_accents
AttributeError: 'NoneType' object has no attribute 'lower'
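The traceback shows the failure surfacing deep inside the preprocessor. A short-circuit could look something like the check below, which rejects None up front with a clearer message. It is a sketch under assumptions: `check_no_missing` is a hypothetical helper, and the choice of `ValueError` is illustrative rather than what scikit-learn does.

```python
def check_no_missing(docs):
    """Return docs as a list, raising early if any entry is None.

    Hypothetical helper, not part of scikit-learn.
    """
    checked = []
    for i, doc in enumerate(docs):
        if doc is None:
            # Fail before any preprocessing or hashing happens.
            raise ValueError(
                f"document at index {i} is None; expected a string"
            )
        checked.append(doc)
    return checked


try:
    check_no_missing(["a", "b", None, "c"])
except ValueError as exc:
    print(exc)
```

Running the check before `transform` reports the offending position directly instead of an `AttributeError` from inside a lambda.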
Versions
System:
python: 3.6.4 |Anaconda custom (64-bit)| (default, Jan 16 2018, 18:10:19) [GCC 7.2.0]
executable: /home/mrocklin/Software/anaconda/bin/python
machine: Linux-4.13.0-26-generic-x86_64-with-debian-stretch-sid
BLAS:
macros: SCIPY_MKL_H=None, HAVE_CBLAS=None
lib_dirs: /home/mrocklin/Software/anaconda/lib
cblas_libs: mkl_rt, pthread
Python deps:
pip: 9.0.1
setuptools: 40.4.3
sklearn: 0.21.dev0
numpy: 1.15.1
scipy: 1.1.0
Cython: 0.27.3
pandas: 0.23.4