We’ve been using the text data from the 20 Newsgroups dataset. Now that we have these numeric representations of this textual data, there is so much we can do that we couldn’t do before! Well, for one, we could do a bunch of analysis: we could look at term frequency, remove stop words, visualize things, and try to cluster. But let’s make this more concrete.
So, you may be wondering: what now? We know how to vectorize these things based on counts, but what can we actually do with any of this information? Our output:

Vocabulary: (abbreviated)
Sample 0 (vectorized): (abbreviated)
Sample 0 (vectorized) length: 101631
Sample 0 (vectorized) sum: 85
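A printout in that shape could be produced along these lines. This is a minimal sketch using scikit-learn's CountVectorizer on a tiny toy corpus rather than the full 20 Newsgroups data, so the vocabulary size and sums will differ from the numbers above:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Stand-in corpus; the original post used the 20 Newsgroups documents.
corpus = [
    "the quick brown fox",
    "the lazy dog",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)  # sparse matrix: one row per document

print("Vocabulary:", vectorizer.vocabulary_)       # token -> column index
print("Sample 0 (vectorized):", X[0].toarray()[0])
print("Sample 0 (vectorized) length:", X.shape[1])  # vocabulary size
print("Sample 0 (vectorized) sum:", X[0].sum())     # total tokens counted
```

On the full dataset, the vector length equals the corpus-wide vocabulary size (101,631 in the output above), and the sum is the number of counted tokens in that one document.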
We will be creating vectors with a dimensionality equal to the size of our vocabulary, and if the text data contains a given vocabulary word, we will put a one in that dimension.
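To make the mechanics concrete, here is a minimal pure-Python sketch of that idea. The `count_vectorize` helper and its toy vocabulary are hypothetical, not from the original post:

```python
def count_vectorize(doc, vocabulary):
    """Map a document to a vector with one slot per vocabulary word.

    Each slot holds the number of times that word occurs in the document
    (with binary one-hot encoding, you would store 1 instead of a count).
    """
    vector = [0] * len(vocabulary)
    for token in doc.lower().split():
        if token in vocabulary:
            vector[vocabulary[token]] += 1
    return vector

# Hypothetical vocabulary mapping each word to its dimension.
vocab = {"news": 0, "group": 1, "data": 2, "text": 3}
print(count_vectorize("Text data from the news group", vocab))
```

Words outside the vocabulary (here "from" and "the") simply contribute nothing, which is why the vector's length depends only on the vocabulary size, not on the document.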
Today, we will be looking at one of the most basic ways we can represent text data numerically: one-hot encoding (or count vectorization).