Natural Language Data Augmentation using nlpaug

By Prasad Kulkarni, Dec 09, 2022, 23:25

Machine Learning Algorithms learn from data, and with the advent of Deep Learning techniques this dependence is stronger than ever. Given large compute resources, Deep Learning Algorithms thrive on massive amounts of data. However, the datasets available for training are often limited in size. Moreover, they may be curated, which can narrow their variety. So how can you increase the volume and variety of the training data without acquiring more data? The answer is Data Augmentation.

Data Augmentation creates additional synthetic data from your existing data. It is quite common in Computer Vision tasks, since image data is hard to acquire. For images, Data Augmentation is typically performed in the following ways:

Cropping
Flipping
Rotation
Noise Injection
Others

Data Augmentation is less common for text/natural language data. For text data, the following methods are applicable:

Synonym Replacement
Random Insertion
Random Swapping
Random Deletion
Many More…
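As a rough illustration of the methods listed above, here is a minimal pure-Python sketch of synonym replacement, random deletion and random swapping. The SYNONYMS table and the function names are hypothetical and written only for this example; they are not part of nlpaug or any other library.

```python
import random

# Hypothetical, hand-written synonym table used only for this illustration.
SYNONYMS = {"quick": ["fast", "speedy"], "happy": ["glad", "cheerful"]}


def synonym_replace(words, rng):
    """Replace every word that has an entry in SYNONYMS with a random synonym."""
    return [rng.choice(SYNONYMS[w]) if w in SYNONYMS else w for w in words]


def random_delete(words, p, rng):
    """Drop each word with probability p, but never return an empty list."""
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def random_swap(words, rng):
    """Swap two randomly chosen positions (no-op for fewer than two words)."""
    if len(words) < 2:
        return list(words)
    i, j = rng.sample(range(len(words)), 2)
    swapped = list(words)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return swapped


rng = random.Random(0)  # fixed seed so repeated runs give the same output
sentence = "the quick dog is happy".split()
print(" ".join(synonym_replace(sentence, rng)))
print(" ".join(random_delete(sentence, 0.3, rng)))
print(" ".join(random_swap(sentence, rng)))
```

Each function returns a new list and leaves the input untouched, so the original sentence can be reused to produce several augmented variants.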

Moreover, Natural Language Data Augmentation could be performed at the following levels:

Character level
Word level
Sentence level
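To make the three levels concrete, the sketch below applies the same operation, swapping two neighbours, to characters, words and sentences. It is plain Python with hypothetical helper names, and the sentence splitter is deliberately naive; a real pipeline would likely use a proper tokenizer.

```python
import random
import re


def swap_adjacent(items, rng):
    """Return a copy of items with one random adjacent pair swapped."""
    items = list(items)
    if len(items) < 2:
        return items
    i = rng.randrange(len(items) - 1)
    items[i], items[i + 1] = items[i + 1], items[i]
    return items


def augment_characters(text, rng):
    """Character level: swap two neighbouring characters inside one word."""
    words = text.split()
    candidates = [k for k, w in enumerate(words) if len(w) > 1]
    if not candidates:
        return text
    k = rng.choice(candidates)
    words[k] = "".join(swap_adjacent(words[k], rng))
    return " ".join(words)


def augment_words(text, rng):
    """Word level: swap two neighbouring words."""
    return " ".join(swap_adjacent(text.split(), rng))


def augment_sentences(text, rng):
    """Sentence level: swap two neighbouring sentences (naive split after . ! ?)."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(swap_adjacent(sentences, rng))


rng = random.Random(42)  # fixed seed so repeated runs give the same output
text = "Data is scarce. Augmentation helps."
print(augment_characters(text, rng))
print(augment_words(text, rng))
print(augment_sentences(text, rng))
```

The only thing that changes between the three functions is the unit the text is split into, which is the essence of the character/word/sentence distinction.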

Fortunately, pre-built libraries for this exist in programming languages like Python. One of them is nlpaug. Read more about nlpaug in its documentation.

Spelling Augmentation

Demographic data such as names and addresses is a prominent kind of Natural Language/Text data. In this example, we show a word-level augmentation using the SpellingAug class of the nlpaug library. To know more about SpellingAug, see the nlpaug documentation.

SpellingAug augments the spellings of text data. It takes the following parameters:

dict_path (str) – Path of the misspelling dictionary.
aug_p (float) – Fraction of words that will be augmented (for example, 0.3 for 30%).
aug_min (int) – Minimum number of words that will be augmented.
aug_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmentations is calculated from aug_p. If the result calculated from aug_p is smaller than aug_max, the calculated result is used; otherwise, aug_max is used.
stopwords (list) – List of words that will be skipped by the augment operation.
stopwords_regex (str) – Regular expression matching words that will be skipped by the augment operation.
tokenizer (func) – Custom tokenization function.
reverse_tokenizer (func) – Custom function that reverses the tokenization.
name (str) – Name of this augmenter.
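The interplay of aug_p, aug_min and aug_max is easier to see in code. The function below is only a sketch of the rule as described in the parameter list, not nlpaug's actual source. The name words_to_augment, the rounding up, and the default values (0.3, 1 and 10) are assumptions made for this illustration.

```python
import math


def words_to_augment(n_words, aug_p=0.3, aug_min=1, aug_max=10):
    """Sketch of the count rule described above (not nlpaug's source code)."""
    count = math.ceil(aug_p * n_words)  # assumption: round the fraction up
    if aug_min is not None:
        count = max(count, aug_min)  # never fewer than aug_min
    if aug_max is not None:
        count = min(count, aug_max)  # never more than aug_max
    return min(count, n_words)  # cannot augment more words than exist


for n in (2, 5, 40, 100):
    print(n, words_to_augment(n))
```

Under these assumptions, short inputs such as a two-word name are governed by aug_min, while long inputs are capped by aug_max.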

The main method of SpellingAug is augment, which takes the following parameters:

data (object/list) – Data for augmentation. It can be a list of data (e.g. a list of strings or numpy arrays) or a single element (e.g. a string or a numpy array). The numpy format only supports audio or spectrogram data. For text data, only strings or a list of strings are supported.
n (int) – Default is 1. Number of unique augmented outputs. Forced to 1 if the input is a list of data.
num_thread (int) – Number of threads for data augmentation. Use this option when you are using the CPU and n is larger than 1.
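The results later in this article show augment returning a list even for a single input string. Downstream code can normalize the return value so it does not depend on that shape. The helper below is a hypothetical convenience function, not part of nlpaug, and the possibility that some versions return a bare string is an assumption.

```python
def as_list(result):
    """Normalize an augment() result to a list of strings."""
    if result is None:
        return []
    if isinstance(result, str):
        return [result]  # assumption: some versions may return a bare string
    return list(result)


print(as_list("Stiven Craig"))
print(as_list(["Stiven Craig"]))
```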

Fake Demographic data generation and Data Augmentation

We use the Faker library to generate synthetic demographic data. To know more about Faker, read its documentation. For this demo, we generate fake names and addresses using the faker.name and faker.address methods. Here is the script that generates the fake names and addresses and augments them:

from faker import Faker
import nlpaug.augmenter.word as naw

faker = Faker()

# Instantiate the spelling augmenter once and reuse it.
# aug_p is a fraction, so 0.2 targets about 20% of the words.
aug = naw.SpellingAug(aug_min=1, aug_p=0.2)

for i in range(5):
    name = faker.name()
    augname = aug.augment(name, n=1)  # augmented name(s)
    address = faker.address()
    augaddress = aug.augment(address, n=1)  # augmented address(es)
    print(f'{name},{augname}')
    print(f'{address},{augaddress}')

Below are the results.

Steven Craig,['Stiven Craig']
833 Patrick Springs Apt. 075
Katherinehaven, AS 44768,['833 Patrik Springs Apt. 075 Katherinehaven, As 44768']

Frank Singleton,['Frsnk Singleton']
58935 Natasha Bridge Suite 866
Maryfurt, MI 73057,['58935 Natasha Bridge Suite 866 Maryfurt, mi 73057']

Mrs. Rhonda Miller,['Mx. Rhonda Miller']
121 Marie Well
Lewisborough, AK 36198,['121 marie Weel Lewisborough, AK 36198']

Cindy Parker,['Cindy Parker']
USNS Schmidt
FPO AE 40157,['USNS Schmidt\nFPO AE 40157']

Ryan Willis,['RYan Willis']
8692 Stone Neck Suite 538
Wendyshire, KS 18657,['8692 Stone Neck Suite 538\nWendyshire, KS 18657']

Note the limitation that not every instance is augmented, due to the probabilistic nature of the algorithm. For example, the name Cindy Parker above is returned unchanged.
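One way to handle this limitation is to measure how many instances actually changed and keep only those. The sketch below uses the five name pairs from the results above. The changed_pairs helper is hypothetical and written only for this example.

```python
# (original, augmented) name pairs copied from the results above.
pairs = [
    ("Steven Craig", "Stiven Craig"),
    ("Frank Singleton", "Frsnk Singleton"),
    ("Mrs. Rhonda Miller", "Mx. Rhonda Miller"),
    ("Cindy Parker", "Cindy Parker"),
    ("Ryan Willis", "RYan Willis"),
]


def changed_pairs(pairs):
    """Keep only the pairs where augmentation altered the text."""
    return [(orig, aug) for orig, aug in pairs if orig != aug]


changed = changed_pairs(pairs)
rate = len(changed) / len(pairs)
print(f"{len(changed)} of {len(pairs)} names changed ({rate:.0%})")
```

Filtering out unchanged pairs avoids adding exact duplicates of the original records to the training set.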

Conclusion

We hope this article is useful to readers. Note that it is for information purposes only. We do not claim any guarantees regarding its accuracy or completeness.
