Natural Language Data Augmentation using nlpaug

Machine Learning algorithms learn from data, and with the advent of Deep Learning techniques this dependence is more pronounced than ever. Given large compute resources, Deep Learning algorithms thrive on massive amounts of data. However, the datasets available for training are often limited in size. Moreover, they may be curated, which can narrow their variety. So what can you do to increase the volume and variety of the training data without necessarily acquiring more data? The answer is Data Augmentation.

Data Augmentation creates additional synthetic data from your existing data. This is quite common in Computer Vision tasks, since image data is hard to acquire. For images, Data Augmentation is typically performed in the following ways:

  • Cropping
  • Flipping
  • Rotation
  • Noise Injection
  • Others

Data Augmentation is less common for text/natural language data. For text data, the following methods are applicable:

  • Synonym Replacement
  • Random Insertion
  • Random Swapping
  • Random Deletion
  • Many More…
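
To make the four methods above concrete, here is a minimal sketch in plain Python. It is not how nlpaug implements them; the function names, the toy synonym table and the inserted word are all assumptions made up for illustration.

```python
import random

def synonym_replacement(words, synonyms, rng):
    """Replace one word that has an entry in the `synonyms` mapping."""
    words = list(words)
    positions = [i for i, w in enumerate(words) if w in synonyms]
    if positions:
        i = rng.choice(positions)
        words[i] = rng.choice(synonyms[words[i]])
    return words

def random_insertion(words, candidates, rng):
    """Insert one word drawn from `candidates` at a random position."""
    words = list(words)
    words.insert(rng.randint(0, len(words)), rng.choice(candidates))
    return words

def random_swap(words, rng):
    """Swap two randomly chosen positions (no-op for fewer than 2 words)."""
    words = list(words)
    if len(words) < 2:
        return words
    i, j = rng.sample(range(len(words)), 2)
    words[i], words[j] = words[j], words[i]
    return words

def random_deletion(words, p, rng):
    """Drop each word with probability p, but keep at least one word."""
    if not words:
        return []
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]

rng = random.Random(0)  # seeded so that runs are repeatable
tokens = "the quick brown fox jumps".split()
toy_synonyms = {"quick": ["fast", "rapid"]}  # hypothetical hand-written table

replaced = synonym_replacement(tokens, toy_synonyms, rng)
inserted = random_insertion(tokens, ["very"], rng)
swapped = random_swap(tokens, rng)
deleted = random_deletion(tokens, 0.3, rng)
print(replaced, inserted, swapped, deleted, sep="\n")
```

Each function returns a new list and leaves the input untouched, so the same sentence can be augmented several times to produce different variants.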

Moreover, Natural Language Data Augmentation could be performed at the following levels:

  • Character level
  • Word level
  • Sentence level
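
The three levels differ only in what counts as a unit. The sketch below applies one and the same operation (swapping two neighbouring units) to characters, words and sentences. The helper name `swap_adjacent` and the naive sentence split on full stops are assumptions for illustration, not part of any library.

```python
import random
import re

def swap_adjacent(units, rng):
    """Swap one randomly chosen pair of neighbouring units."""
    units = list(units)
    if len(units) < 2:
        return units
    i = rng.randrange(len(units) - 1)
    units[i], units[i + 1] = units[i + 1], units[i]
    return units

text = "Data is scarce. Augmentation helps. Models improve."
rng = random.Random(42)

# Character level: the units are the characters of a single word.
char_aug = "".join(swap_adjacent("scarce", rng))

# Word level: the units are whitespace-separated tokens.
word_aug = " ".join(swap_adjacent(text.split(), rng))

# Sentence level: the units are sentences (naive split after a full stop).
sentences = re.split(r"(?<=\.)\s+", text)
sent_aug = " ".join(swap_adjacent(sentences, rng))

print(char_aug, word_aug, sent_aug, sep="\n")
```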

Fortunately, several pre-built libraries exist for programming languages like Python. One of them is nlpaug. Read more about nlpaug here.

Spelling Augmentation

Demographic data such as names and addresses is a prominent kind of natural language/text data. In this example, we show a word-level augmentation using the SpellingAug class of the nlpaug library. To know more about SpellingAug, read this.

SpellingAug augments the spellings of text data. It takes the following parameters:

Parameters:
  • dict_path (str) – Path of misspelling dictionary
  • aug_p (float) – Percentage of words that will be augmented.
  • aug_min (int) – Minimum number of words that will be augmented.
  • aug_max (int) – Maximum number of words that will be augmented. If None is passed, the number of augmentations is calculated via aug_p. If the result calculated from aug_p is smaller than aug_max, the calculated result is used; otherwise, aug_max is used.
  • stopwords (list) – List of words which will be skipped from augment operation.
  • stopwords_regex (str) – Regular expression for matching words which will be skipped from augment operation.
  • tokenizer (func) – Customize tokenization process
  • reverse_tokenizer (func) – Customize reverse of tokenization process
  • name (str) – Name of this augmenter
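
To see how these parameters could fit together, here is a toy spelling augmenter in plain Python. It is only a sketch of the behaviour described in the list above, not nlpaug's code. The names `augmentation_count` and `toy_spelling_aug`, the in-memory dictionary (standing in for the file at dict_path) and the way aug_min is applied as a floor are all assumptions.

```python
import random

def augmentation_count(n_words, aug_p, aug_min=1, aug_max=None):
    """How many words to change, following the rule described above."""
    count = int(n_words * aug_p)        # share of words requested by aug_p
    count = max(count, aug_min)         # assumption: aug_min acts as a floor
    if aug_max is not None:
        count = min(count, aug_max)     # aug_max caps the calculated value
    return min(count, n_words)

def toy_spelling_aug(text, misspellings, aug_p=0.3, aug_min=1, aug_max=10,
                     stopwords=(), rng=None):
    """Replace some words with a misspelling taken from `misspellings`."""
    if rng is None:
        rng = random.Random()
    words = text.split()                # stands in for `tokenizer`
    skip = set(stopwords)
    # Only words with a dictionary entry that are not stopwords can change.
    candidates = [i for i, w in enumerate(words)
                  if w in misspellings and w not in skip]
    k = min(augmentation_count(len(words), aug_p, aug_min, aug_max),
            len(candidates))
    for i in rng.sample(candidates, k):
        words[i] = rng.choice(misspellings[words[i]])
    return " ".join(words)              # stands in for `reverse_tokenizer`

# Hypothetical misspelling dictionary, written by hand for this example.
toy_dict = {"Patrick": ["Patrik", "Partick"], "Springs": ["Sprngs"]}
address = "833 Patrick Springs Apt. 075"

augmented = toy_spelling_aug(address, toy_dict, aug_p=0.2, rng=random.Random(1))
untouched = toy_spelling_aug(address, toy_dict, stopwords=["Patrick", "Springs"],
                             rng=random.Random(1))
print(augmented)
print(untouched)
```

Note that a word is only eligible if the dictionary has an entry for it. Under this sketch, a text whose words are all missing from the dictionary (or all listed as stopwords) comes back unchanged, whatever the values of aug_p and aug_min.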

The main method of SpellingAug is augment, which takes the following parameters:

  • data (object/list) – Data for augmentation. It can be a list of data (e.g. list of strings or numpy arrays) or a single element (e.g. a string or a numpy array). The numpy format only supports audio or spectrogram data. For text data, only strings or lists of strings are supported.
  • n (int) – Default is 1. Number of unique augmented outputs. It will be forced to 1 if the input is a list of data.
  • num_thread (int) – Number of threads for data augmentation. Use this option when you are using CPU and n is larger than 1

Fake Demographic data generation and Data Augmentation

We use the Faker library to generate synthetic demographic data. To know more about Faker, read the documentation. For this demo, we generate fake names and addresses using the faker.name and faker.address methods. The following script generates fake names and addresses and augments them:

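from faker import Faker
import nlpaug.augmenter.word as naw

faker = Faker()

# Instantiate the spelling augmenter once and reuse it for every record.
# aug_p is a fraction, so 20% is written as 0.2; aug_min=1 asks for at
# least one word to be changed.
aug = naw.SpellingAug(aug_min=1, aug_p=0.2)

for i in range(5):
    name = faker.name()
    augname = aug.augment(name, n=1)        # returns a list of strings
    address = faker.address()
    augaddress = aug.augment(address, n=1)
    print(f'{name},{augname}')
    print(f'{address},{augaddress}')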

Below are the results.

Steven Craig,['Stiven Craig']
833 Patrick Springs Apt. 075
Katherinehaven, AS 44768,['833 Patrik Springs Apt. 075 Katherinehaven, As 44768']

Frank Singleton,['Frsnk Singleton']
58935 Natasha Bridge Suite 866
Maryfurt, MI 73057,['58935 Natasha Bridge Suite 866 Maryfurt, mi 73057']

Mrs. Rhonda Miller,['Mx. Rhonda Miller']
121 Marie Well
Lewisborough, AK 36198,['121 marie Weel Lewisborough, AK 36198']

Cindy Parker,['Cindy Parker']
USNS Schmidt
FPO AE 40157,['USNS Schmidt\nFPO AE 40157']

Ryan Willis,['RYan Willis']
8692 Stone Neck Suite 538
Wendyshire, KS 18657,['8692 Stone Neck Suite 538\nWendyshire, KS 18657']

Note the limitation: because the algorithm is probabilistic, not every instance is augmented. For example, the name Cindy Parker above is returned unchanged.
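
One possible way to deal with unchanged outputs is to retry a few times and record whether anything changed. The sketch below is an assumption, not an nlpaug feature: `augment_until_changed` and the stub augmenter are hypothetical names. The helper accepts any callable that returns a string or a list of strings, which is the shape of the outputs shown above.

```python
def augment_until_changed(text, augment_fn, max_tries=5):
    """Call augment_fn until its first output differs from text.

    Returns (result, changed). If every attempt equals the input,
    the original text is returned with changed=False.
    """
    for _ in range(max_tries):
        result = augment_fn(text)
        if isinstance(result, list):
            candidate = result[0] if result else text
        else:
            candidate = result
        if candidate != text:
            return candidate, True
    return text, False

def make_stub_augmenter(unchanged_calls):
    """Hypothetical augmenter: returns the input unchanged for the first
    `unchanged_calls` calls, then swaps the first 'a' for an 'e'."""
    calls = {"n": 0}
    def augment(text):
        calls["n"] += 1
        if calls["n"] <= unchanged_calls:
            return [text]
        return [text.replace("a", "e", 1)]
    return augment

result, changed = augment_until_changed("Frank Singleton", make_stub_augmenter(2))
gave_up, gave_up_changed = augment_until_changed("Frank Singleton",
                                                 make_stub_augmenter(10))
print(result, changed)
print(gave_up, gave_up_changed)
```

With the script above, aug.augment could be passed as augment_fn. The max_tries limit matters because some strings may contain no word that the augmenter can change, in which case retrying never helps.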

Conclusion

We hope this article is useful to readers. Note that it is for information purposes only. We do not claim any guarantees regarding its accuracy or completeness.



I am a Data Scientist with 6+ years of experience.
