How to Handle Features from Unstructured Text Data?

Preface

Text is one of the most common forms of raw data. Knowing how to extract information from unstructured data like text can greatly improve data utilization and model performance.

Process of Handling Unstructured Text

Text Cleaning -> Tokenization -> Vectorization

Text Cleaning

  • Stemming
    • Apply fixed rules, such as stripping "ing" or "s", to derive a base word (see the sketch after this list)
    • ex
      • working -> work
      • dogs -> dog
  • Lemmatization
    • Use knowledge of the language (e.g. a dictionary of word forms) to derive the base word
    • ex
      • running -> run
  • Stop Word Removal
    • Remove words that carry little information
    • ex
      • the, and, is
  • Noise Removal
    • Remove redundant whitespace, punctuation, etc.
  • Normalization
    • Synonym replacement
    • Convert upper case to lower case
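
A minimal sketch of these cleaning steps with NLTK (stemming, lemmatization, stop word removal, normalization); it assumes the stopwords and wordnet corpora can be downloaded, as in the full example at the end of this post:

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download('stopwords')
nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

# Stemming: fixed rules strip suffixes
print(stemmer.stem("working"), stemmer.stem("dogs"))   # work dog

# Lemmatization: uses WordNet knowledge (pos="v" marks the word as a verb)
print(lemmatizer.lemmatize("running", pos="v"))        # run

# Normalization, noise removal, and stop word removal on a small sentence
sentence = "The dogs  are working,   and running"
words = sentence.lower().replace(",", " ").split()
words = [w for w in words if w not in stopwords.words("english")]
print(words)  # ['dogs', 'working', 'running']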

Tokenization

  • Word Tokenization
    • Tokenize by splitting on white space
  • N-gram Tokenization
    • Split the text into contiguous sequences of N tokens; a statistical model can then estimate how likely each sequence is to appear together (see the sketch after this list)
    • 2-gram ex
      • Natural language processing is fun. -> [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fun.')]
  • WordPiece
    • Subword tokenization used by BERT: rare words are split into smaller, more frequent subword units
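
A minimal sketch of word and 2-gram tokenization in plain Python. The commented WordPiece lines assume the Hugging Face transformers package and the bert-base-uncased vocabulary are available; the exact subword split depends on that vocabulary:

sentence = "Natural language processing is fun."

# Word tokenization: split on whitespace
tokens = sentence.split()
print(tokens)

# 2-gram tokenization: pairs of adjacent tokens
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)
# [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fun.')]

# WordPiece (subword) tokenization, assuming transformers is installed:
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
# print(tokenizer.tokenize("unbelievable"))  # rare words split into '##'-prefixed subwords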

Vectorization

  • BOW (Bag of Words)
    • Vectorize a sentence by counting how many times each token in the vocabulary appears in it
    • ex
      • Tokenizer: {"I": 0, "dog": 1, "want": 2, "pen": 3}
      • I want a pen with pen
      • Vector -> [1, 0, 1, 2]
  • TF-IDF
    • Down-weights tokens that appear frequently across the corpus
    • ex
      • Corpus
        • "Natural language processing is fun."
        • "Text processing with Python is easy."
      • Assume the tokens come from a 1-gram tokenizer
      • The weight of "Python" would be larger than that of "processing", since "processing" appears in both documents
  • BM25
    • Also penalizes tokens that appear too frequently in the corpus (see the sketch after this list)
    • The default ranking function in Elasticsearch
  • BERT Encoder
    • Encodes the semantic meaning of the text into a dense vector
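
BOW and TF-IDF are shown in the full code example below. As a rough sketch of BM25 scoring, assuming the third-party rank_bm25 package (pip install rank-bm25) is installed; the commented lines at the end likewise assume the sentence-transformers package for BERT-style embeddings:

from rank_bm25 import BM25Okapi

corpus = [
    "Natural language processing is fun.",
    "Text processing with Python is easy.",
]
tokenized_corpus = [doc.lower().split() for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)
query = "python processing".split()

# One relevance score per document; "processing" appears in both documents,
# so it contributes less than the rarer "python"
print(bm25.get_scores(query))

# BERT sentence embeddings, assuming sentence-transformers is installed:
# from sentence_transformers import SentenceTransformer
# model = SentenceTransformer("all-MiniLM-L6-v2")
# embeddings = model.encode(corpus)  # one dense semantic vector per sentence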

Code Examples

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer


# Download the NLTK resources used below
nltk.download('stopwords')
nltk.download('wordnet')


texts = [
    "This is a sample sentence.!! I love sentence",
    "Text processing in Python is fun sentence.",
    "Data science is an interesting field sentence.",
]

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Normalization: upper case to lower case
    text = text.lower()
    # Noise removal: keep only letters and whitespace
    text = ''.join([char for char in text if char.isalpha() or char.isspace()])
    # Word tokenization by whitespace
    words = text.split()
    # Stop word removal
    words = [word for word in words if word not in stop_words]
    # Lemmatization
    words = [lemmatizer.lemmatize(word) for word in words]
    return ' '.join(words)

# Text cleaning
cleaned_texts = [clean_text(text) for text in texts]
print("Before cleaning:")
print(texts)
print("After cleaning:")
print(cleaned_texts)

# Bag of Words
vectorizer = CountVectorizer()
bow_features = vectorizer.fit_transform(cleaned_texts)

# TF-IDF
tfidf_vectorizer = TfidfVectorizer()
tfidf_features = tfidf_vectorizer.fit_transform(cleaned_texts)

# Showing the result
print("Bag of Words Features:\n", bow_features.toarray())
print("\nTF-IDF Features:\n", tfidf_features.toarray())