How to Handle Features from Unstructured Text Data?
Preface
Text is one of the most common forms of raw data. Knowing how to extract information from unstructured data like text can greatly improve data utilization and model performance.
Process of Handling Unstructured Text
Text Cleaning -> Tokenization -> Vectorization
Text Cleaning
- Stemming
  - Apply fixed rules, such as removing suffixes like "ing" or "s", to derive a base word
  - ex
    - working -> work
    - dogs -> dog
- Lemmatization
  - Use knowledge of the language to derive the base word (lemma)
  - ex
    - running -> run
- Stop Word Removal
  - Remove common words that carry little information
  - ex
    - the, and, is
- Noise Removal
  - Remove redundant whitespace and punctuation such as commas
- Normalization
  - Synonym replacement
  - Convert uppercase to lowercase
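A minimal sketch of most of the cleaning steps above, using NLTK (assuming the nltk package and its stopwords/wordnet corpora are available; the sample sentence and regex rules are illustrative):

```python
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

text = "The dogs  were running ,  and working!"

# Noise removal + normalization: lowercase, strip punctuation, squeeze whitespace
text = re.sub(r"[^\w\s]", " ", text.lower())
text = re.sub(r"\s+", " ", text).strip()

# Stop word removal: drop low-information words like "the" and "and"
stop_words = set(stopwords.words("english"))
words = [w for w in text.split() if w not in stop_words]

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print([stemmer.stem(w) for w in words])                   # rule-based: working -> work, dogs -> dog
print([lemmatizer.lemmatize(w, pos="v") for w in words])  # dictionary-based: running -> run
```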
Tokenization
- Word Tokenization
  - Split the text into tokens by whitespace
- N-gram Tokenization
  - Based on corpus statistics: group the N consecutive tokens that are most likely to appear together
  - 2-gram ex
    - "Natural language processing is fun." -> [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fun.')]
- WordPiece
  - Subword tokenization used by BERT
  - ex
    - hugging -> hu + ##gging
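A minimal sketch of the tokenization styles above: whitespace split and 2-grams in plain Python, and WordPiece via the HuggingFace transformers library (an assumed dependency; the exact subword split depends on the pretrained vocabulary):

```python
sentence = "Natural language processing is fun."

# Word tokenization: split on whitespace
tokens = sentence.split()

# 2-gram tokenization: pair each token with the one that follows it
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('Natural', 'language'), ('language', 'processing'),
#  ('processing', 'is'), ('is', 'fun.')]

# WordPiece, as used by BERT
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("hugging"))  # subword pieces, e.g. ['hu', '##gging'], depending on the vocabulary
```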
Vectorization
- BOW (Bag of Words)
  - Vectorize a sentence by mapping each token in the vocabulary to its frequency in the sentence (see Code Examples below)
  - ex
    - Vocabulary: {"I": 0, "dog": 1, "want": 2, "pen": 3}
    - Sentence: "I want a pen with pen"
    - Vector -> [1, 0, 1, 2]
- TF-IDF
  - Down-weights tokens that appear frequently across the corpus (see Code Examples below)
  - ex
    - Corpus
      - "Natural language processing is fun."
      - "Text processing with Python is easy."
    - Assume the tokens come from a 1-gram tokenizer
    - The weight of "Python" would be larger than that of "processing", since "processing" appears in both documents
- BM25
  - Penalizes tokens that appear too frequently, with diminishing returns for repeated occurrences (see Code Examples below)
  - The default scoring function in Elasticsearch
- BERT Encoder
  - Encode the text into a dense vector that captures its semantic meaning (see Code Examples below)
Code Examples
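BOW: a minimal sketch using scikit-learn's CountVectorizer with the fixed vocabulary from the example above (the token_pattern override is needed because the default pattern drops one-character tokens like "I"):

```python
from sklearn.feature_extraction.text import CountVectorizer

# Fixed vocabulary matching the example above; CountVectorizer lowercases input
vectorizer = CountVectorizer(
    vocabulary={"i": 0, "dog": 1, "want": 2, "pen": 3},
    token_pattern=r"(?u)\b\w+\b",  # keep one-character tokens such as "i"
)
vector = vectorizer.transform(["I want a pen with pen"]).toarray()
print(vector)  # [[1 0 1 2]]
```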
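TF-IDF: a sketch using scikit-learn's TfidfVectorizer on the two-document corpus above, confirming that "python" gets a larger weight than "processing", which occurs in both documents:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Natural language processing is fun.",
    "Text processing with Python is easy.",
]
vectorizer = TfidfVectorizer()
vectorizer.fit(corpus)

# "processing" appears in both documents, so its inverse document
# frequency (and thus its weight) is lower than "python"'s
vocab = vectorizer.vocabulary_
print(vectorizer.idf_[vocab["python"]])      # ~1.41
print(vectorizer.idf_[vocab["processing"]])  # 1.0
```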
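BM25: a sketch using the rank_bm25 package (an assumed third-party dependency, separate from Elasticsearch itself) to score the same corpus against a query:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Natural language processing is fun.",
    "Text processing with Python is easy.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
scores = bm25.get_scores("python processing".split())
print(scores)  # one relevance score per document; the second document scores higher
```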
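BERT encoder: a sketch using HuggingFace transformers; mean pooling over the last hidden states is one common way (assumed here) to collapse per-token embeddings into a single sentence vector:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("Natural language processing is fun.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the per-token embeddings into a single semantic vector
sentence_vector = outputs.last_hidden_state.mean(dim=1)
print(sentence_vector.shape)  # torch.Size([1, 768])
```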