How to Handle Features from Unstructured Text Data?

Preface

Text is one of the most common forms of raw data. Knowing how to extract information from unstructured data like text can greatly improve data utilization and model performance.

Process of Handling Unstructured Text

Text Cleaning -> Tokenization -> Vectorization

Text Cleaning

  • Stemming
    • Apply fixed rules, such as stripping "ing" or "s", to derive a base word (see the sketch after this list)
    • ex
      • working -> work
      • dogs -> dog
  • Lemmatization
    • Use knowledge of the language (e.g. a dictionary of word forms) to derive the base word
    • ex
      • running -> run
  • Stop Word Removal
    • Remove words that carry little information
    • ex
      • the, and, is
  • Noise Removal
    • Remove redundant whitespace, punctuation, etc.
  • Normalization
    • Synonym replacement
    • Convert upper case to lower case
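
A minimal sketch of these cleaning steps with NLTK (stemming, lemmatization, stop word removal, normalization); it assumes the stopwords and wordnet corpora can be downloaded, as in the full example at the end of this post:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: fixed rules strip suffixes
print(stemmer.stem("working"), stemmer.stem("dogs"))   # work dog

# Lemmatization: uses WordNet knowledge (pos="v" marks the word as a verb)
print(lemmatizer.lemmatize("running", pos="v"))        # run

# Normalization, noise removal, and stop word removal on a small sentence
sentence = "The dogs  are working,   and running"
words = sentence.lower().replace(",", " ").split()
words = [w for w in words if w not in stopwords.words("english")]
print(words)  # ['dogs', 'working', 'running']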

Tokenization

  • Word Tokenization
    • Tokenize by splitting on white space
  • N-gram Tokenization
    • Split the text into contiguous sequences of N tokens; a statistical model can then estimate how likely each sequence is to appear together (see the sketch after this list)
    • 2-gram ex
      • Natural language processing is fun. -> [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fun.')]
  • WordPiece
    • Subword tokenization used by BERT: rare words are split into smaller, more frequent subword units
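
A minimal sketch of word and 2-gram tokenization in plain Python. The commented WordPiece lines assume the Hugging Face transformers package and the bert-base-uncased vocabulary are available; the exact subword split depends on that vocabulary:

sentence = "Natural language processing is fun."

# Word tokenization: split on whitespace
tokens = sentence.split()
print(tokens)

# 2-gram tokenization: pairs of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fun.')]

# WordPiece (subword) tokenization, assuming transformers is installed:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# print(tokenizer.tokenize("unbelievable"))  # rare words split into '##'-prefixed subwords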

Vectorization

  • BOW (Bag of Words)
    • Vectorize a sentence by counting how many times each token in the vocabulary appears in it
    • ex
      • Tokenizer: {"I": 0, "dog": 1, "want": 2, "pen": 3}
      • I want a pen with pen
      • Vector -> [1, 0, 1, 2]
  • TF-IDF
    • Down-weights tokens that appear frequently across the corpus
    • ex
      • Corpus
        • "Natural language processing is fun."
        • "Text processing with Python is easy."
      • Assume the tokens come from a 1-gram tokenizer
      • The weight of "Python" would be larger than that of "processing", since "processing" appears in both documents
  • BM25
    • Also penalizes tokens that appear too frequently in the corpus (see the sketch after this list)
    • The default ranking function in Elasticsearch
  • BERT Encoder
    • Encodes the semantic meaning of the text into a dense vector
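
BOW and TF-IDF are shown in the full code example below. As a rough sketch of BM25 scoring, assuming the third-party rank_bm25 package (pip install rank-bm25) is installed; the commented lines at the end likewise assume the sentence-transformers package for BERT-style embeddings:

from rank_bm25 import BM25Okapi

corpus = [
    "Natural language processing is fun.",
    "Text processing with Python is easy.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "python processing".split()

# One relevance score per document; "processing" appears in both documents,
# so it contributes less than the rarer "python"
print(bm25.get_scores(query))

# BERT sentence embeddings, assuming sentence-transformers is installed:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = model.encode(corpus)  # one dense semantic vector per sentence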

Code Examples

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')


texts = [
    "This is a sample sentence.!! I love sentence",
    "Text processing in Python is fun sentence.",
    "Data science is an interesting field sentence.",
]

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Normalization: upper case to lower case
    text = text.lower()
    # Noise removal: keep only letters and whitespace
    text = ''.join([char for char in text if char.isalpha() or char.isspace()])
    # Word tokenization by whitespace
    words = text.split()
    # Stop word removal
    words = [word for word in words if word not in stop_words]
    # Lemmatization
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Text cleaning
cleaned_texts = [clean_text(text) for text in texts]
print("Before cleaning:")
print(texts)
print("After cleaning:")
print(cleaned_texts)

# Bag of Words
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(cleaned_texts)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(cleaned_texts)

# Showing the result
print("Bag of Words Features:\n", bow_features.toarray())
print("\nTF-IDF Features:\n", tfidf_features.toarray())