Character-Level Text Generation with TensorFlow
Welcome to this tutorial on building a character-level text generation model using TensorFlow. This is a fascinating application of Recurrent Neural Networks (RNNs), specifically Long Short-Term Memory (LSTM) networks, that allows us to generate new text that mimics the style and content of a given corpus.
By the end of this tutorial, you will understand:
How to preprocess text data for character-level modeling.
The architecture of an RNN/LSTM model for sequence generation.
How to train a model to predict the next character in a sequence.
Techniques for generating new text from a trained model.
Prerequisites: Basic Python programming, familiarity with TensorFlow and Keras, and an understanding of basic deep learning concepts.
1. Setting Up and Data Preparation
First, let's import the necessary libraries and load our dataset. For this example, we'll use a sample text file. You can replace this with any text corpus you like (e.g., Shakespearean plays, novel excerpts, code snippets).
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense, Embedding, Dropout
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical
import numpy as np
import re
# Load and clean the text data
def load_and_clean_text(filepath):
    with open(filepath, 'r', encoding='utf-8') as f:
        text = f.read()
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, punctuation mark, or space
    text = re.sub(r'[^a-z0-9.,!?;:\'" ]', ' ', text)
    # Collapse runs of whitespace into single spaces
    text = re.sub(r'\s+', ' ', text).strip()
    return text
# Example: Replace 'sample.txt' with your text file path
try:
    text_data = load_and_clean_text('sample.txt')
except FileNotFoundError:
    print("sample.txt not found. Using a placeholder text for demonstration.")
    text_data = "this is a sample text for demonstrating character level text generation with tensorflow. lstm networks are powerful for sequence modeling."
# Create a set of unique characters
chars = sorted(list(set(text_data)))
char_to_int = dict((c, i) for i, c in enumerate(chars))
int_to_char = dict((i, c) for i, c in enumerate(chars))
vocab_size = len(chars)
print(f"Corpus length: {len(text_data)}")
print(f"Unique characters: {vocab_size}")
# Prepare sequences
seq_length = 100 # Length of input sequences
dataX = []
dataY = []
for i in range(0, len(text_data) - seq_length, 1):
    seq_in = text_data[i:i + seq_length]
    seq_out = text_data[i + seq_length]
    dataX.append([char_to_int[char] for char in seq_in])
    dataY.append(char_to_int[seq_out])
n_patterns = len(dataX)
print(f"Total patterns: {n_patterns}")
# Keep X as a 2-D integer array of shape (n_patterns, seq_length).
# The Embedding layer defined below expects raw integer indices, so no scaling is needed.
X = np.array(dataX)
# (Without an Embedding layer, you would instead reshape to (n_patterns, seq_length, 1)
#  and scale by vocab_size to bring the values into [0, 1].)
# Convert the output variable (y) into a one-hot (binary) matrix
y = to_categorical(dataY, num_classes=vocab_size)
In this step, we load our text, convert it to lowercase, and remove unnecessary characters. We then create mappings between characters and integers, and prepare input-output sequences of a fixed length. Notice how we use `to_categorical` for the output, which is standard for multi-class classification tasks.
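To make the mappings concrete, here is a tiny, self-contained illustration using a hypothetical four-character vocabulary (the `toy_` names are introduced only for this sketch and are not part of the tutorial code):
from tensorflow.keras.utils import to_categorical

toy_chars = [' ', 'a', 'b', 'c']
toy_char_to_int = {c: i for i, c in enumerate(toy_chars)}
encoded = [toy_char_to_int[c] for c in "ab c"]
print(encoded)                                                            # [1, 2, 0, 3]
print(to_categorical(toy_char_to_int['a'], num_classes=len(toy_chars)))   # [0. 1. 0. 0.]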
2. Building the LSTM Model
Now, let's define our neural network architecture. We'll use an Embedding layer to represent characters, followed by LSTM layers to capture sequential dependencies, and a Dense output layer to predict the next character.
# Define the model
embedding_dim = 128
rnn_units = 256
model = Sequential([
    # Embedding layer: maps each integer character index to a dense vector
    Embedding(vocab_size, embedding_dim, input_length=seq_length),
    # First LSTM layer; return_sequences=True because another LSTM follows
    LSTM(rnn_units, return_sequences=True),
    Dropout(0.2),
    # Second LSTM layer returns only its final output
    LSTM(rnn_units),
    Dropout(0.2),
    # Output layer: one probability per character in the vocabulary
    Dense(vocab_size, activation='softmax')
])
# Compile the model
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
model.summary()
The `Embedding` layer maps each character (represented by its integer index) to a dense vector. The `LSTM` layers learn patterns over time. `Dropout` layers are added to prevent overfitting. The final `Dense` layer with `softmax` activation outputs probabilities for each character in the vocabulary being the next character.
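As a quick sanity check, you could push a batch of random indices through the untrained model and confirm the output shapes; `dummy_batch` is a throwaway name introduced for this sketch:
import numpy as np

dummy_batch = np.random.randint(0, vocab_size, size=(2, seq_length))  # two fake input sequences
probs = model.predict(dummy_batch, verbose=0)
print(probs.shape)        # (2, vocab_size): one probability distribution per sequence
print(probs.sum(axis=1))  # each row sums to ~1.0 because of the softmax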
3. Training the Model
With the model defined, we can now train it on our prepared data. Training can take a significant amount of time depending on the dataset size and model complexity.
# Training parameters
epochs = 50 # Adjust based on your needs and dataset size
batch_size = 128
# Train the model
print("Starting training...")
history = model.fit(X, y, epochs=epochs, batch_size=batch_size, verbose=1)
print("Training finished.")
# You might want to save the model
# model.save('char_generation_model.h5')
We use `categorical_crossentropy` as the loss function because our output is a probability distribution over the character vocabulary. The `Adam` optimizer is a good default choice. You can monitor the accuracy and loss during training to gauge performance.
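If you want to keep the best weights and stop automatically when the loss plateaus, one option is the standard Keras `ModelCheckpoint` and `EarlyStopping` callbacks. The sketch below is optional, and the filename `char_generation_best.h5` is just a placeholder:
from tensorflow.keras.callbacks import ModelCheckpoint, EarlyStopping

callbacks = [
    ModelCheckpoint('char_generation_best.h5', monitor='loss', save_best_only=True),
    EarlyStopping(monitor='loss', patience=5, restore_best_weights=True),
]
# history = model.fit(X, y, epochs=epochs, batch_size=batch_size, callbacks=callbacks, verbose=1)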
4. Generating Text
Once the model is trained, we can use it to generate new text. This involves picking a starting sequence, predicting the next character, adding it to the sequence, and repeating the process.
# Function to generate text
def generate_text(model, start_sequence, num_generate=500):
    # Assumes every character in start_sequence appears in the training vocabulary
    generated_text = start_sequence
    print(f"--- Starting Seed: '{start_sequence}' ---")
    # Convert the start sequence to integers
    input_sequence = [char_to_int[char] for char in start_sequence]
    # Pad the sequence to match the model's input length
    padded_sequence = pad_sequences([input_sequence], maxlen=seq_length, padding='pre')
    for _ in range(num_generate):
        # Predict the next-character probabilities
        predictions = model.predict(padded_sequence, verbose=0)[0]
        # Renormalize in float64 so the probabilities sum to exactly 1 for np.random.choice
        predictions = predictions.astype('float64')
        predictions /= predictions.sum()
        # Sample the next character from the probability distribution
        predicted_id = np.random.choice(len(predictions), p=predictions)
        # Convert the predicted integer back to a character
        predicted_char = int_to_char[predicted_id]
        # Append the predicted character to the generated text
        generated_text += predicted_char
        # Update the input sequence, keeping only the last seq_length characters
        input_sequence.append(predicted_id)
        input_sequence = input_sequence[-seq_length:]
        padded_sequence = pad_sequences([input_sequence], maxlen=seq_length, padding='pre')
    return generated_text
# Generate text using a seed
start_seed = "the " # Example seed
generated_output = generate_text(model, start_seed, num_generate=300)
print("\n--- Generated Text ---")
print(generated_output)
print("----------------------")
The `generate_text` function takes the trained model and a starting seed. It iteratively predicts the next character, appends it, and updates the input sequence. The use of `np.random.choice` with probabilities allows for varied output each time you run the generation.
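One caveat: when the seed is shorter than `seq_length`, `pad_sequences` fills the front with zeros, and index 0 is itself a real character in our vocabulary. A possible workaround, sketched below with the throwaway names `random_start` and `corpus_seed`, is to seed with a full-length excerpt taken directly from the corpus:
random_start = np.random.randint(0, len(text_data) - seq_length)
corpus_seed = text_data[random_start:random_start + seq_length]
print(generate_text(model, corpus_seed, num_generate=300))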
Further Exploration
This is a foundational example. You can explore numerous enhancements:
Larger Datasets: Train on more extensive text corpora for richer generation.
Model Architectures: Experiment with GRUs, more LSTM layers, or attention mechanisms.
Hyperparameter Tuning: Optimize `embedding_dim`, `rnn_units`, `dropout_rate`, learning rate, and `seq_length`.
Sampling Strategies: Implement temperature sampling for more controlled randomness (a sketch follows this list).
Fine-tuning: Start with a pre-trained model and fine-tune it on your specific task.
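For the sampling-strategies item above, here is a minimal temperature-sampling sketch; `sample_with_temperature` is a helper name introduced here, not part of the tutorial code. Lower temperatures make the output more conservative, higher temperatures make it more surprising.
import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-9) / temperature  # rescale the log-probabilities
    scaled = np.exp(logits)
    scaled /= scaled.sum()                       # renormalize to a valid distribution
    return np.random.choice(len(scaled), p=scaled)

# Inside generate_text, you could replace the np.random.choice call with:
# predicted_id = sample_with_temperature(predictions, temperature=0.7)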
Character-level generation is a powerful tool for creative writing, code completion, and understanding sequence modeling. Dive deeper and experiment!