Style Transfer in Text - Exploration and Evaluation

Introduction

This paper addresses the challenge of changing the style of text without parallel data, that is, without matching input-output examples.
It introduces two evaluation metrics:
- Transfer Strength: How well the style is changed.
- Content Preservation: How much of the original content is kept.
The methods are tested on two tasks:
- Turning academic paper titles into news-style titles.
- Changing reviews from positive to negative.
The results indicate that the proposed models change style while retaining much of the original content.

The paper reviews earlier work on style transfer in images and text.
It identifies two main difficulties in text style transfer: the lack of matching input-output (parallel) examples and the lack of reliable evaluation methods.
Existing approaches, including adversarial networks and multi-task learning, are also discussed.
The authors point out the need for better ways to preserve content and to separate style from content.

Two models are proposed:
- The multi-decoder model, which has separate decoders for each style.
- The style-embedding model, which adds style information during generation.
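
The structural difference between the two models can be sketched in a few lines of Python. This is a toy illustration, not the authors' implementation: the content representation is a plain list of floats, the decoders are stand-in functions, and the names `multi_decoder_generate`, `style_embedding_generate`, `DECODERS`, and `STYLE_TABLE` are hypothetical. Passing the style vector to the decoder by concatenation is an assumption; the summary only says that style information is added during generation.

```python
from typing import Callable, Dict, List

Vector = List[float]
Decoder = Callable[[Vector], str]


def multi_decoder_generate(content: Vector, style: str,
                           decoders: Dict[str, Decoder]) -> str:
    """Multi-decoder idea: one decoder per style, chosen by the target style."""
    return decoders[style](content)


def style_embedding_generate(content: Vector, style: str,
                             style_table: Dict[str, Vector],
                             decoder: Decoder) -> str:
    """Style-embedding idea: one shared decoder sees content plus a style vector."""
    # Assumption: the style vector is concatenated to the content representation.
    return decoder(content + style_table[style])


def make_toy_decoder(label: str) -> Decoder:
    """Hypothetical stand-in for a trained decoder; reports its input size."""
    def decode(vec: Vector) -> str:
        return f"{label}:{len(vec)}"
    return decode


DECODERS = {"paper": make_toy_decoder("paper"), "news": make_toy_decoder("news")}
STYLE_TABLE = {"paper": [1.0, 0.0], "news": [0.0, 1.0]}
shared_decoder = make_toy_decoder("shared")

content = [0.2, -0.1, 0.7]
multi_out = multi_decoder_generate(content, "news", DECODERS)
embed_out = style_embedding_generate(content, "news", STYLE_TABLE, shared_decoder)
```

In the first function the style choice selects which parameters are used; in the second it only changes the input to a single set of parameters.
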
To test these models, two metrics are used:
- Transfer strength: Measures how well the new style is applied.
- Content preservation: Measures how much content stays the same.
Tests on two datasets (paper-news titles and positive-negative reviews) show that both models change style while retaining content.
The multi-decoder model favors style change, while the style-embedding model keeps more content.
Human evaluations show that the content preservation metric aligns well with human judgments.
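
The two metrics can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's exact procedure: transfer strength is taken to be the fraction of outputs that a style classifier assigns to the target style, and content preservation is taken to be the cosine similarity between mean-pooled word embeddings of each source sentence and its output. The classifier `toy_classifier`, the embedding table `TOY_EMB`, and all function names are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def transfer_strength(outputs: Sequence[str], target_style: str,
                      classify: Callable[[str], str]) -> float:
    """Fraction of outputs that the classifier labels with the target style."""
    if not outputs:
        raise ValueError("outputs must not be empty")
    hits = sum(1 for s in outputs if classify(s) == target_style)
    return hits / len(outputs)


def sentence_vector(sentence: str, emb: Dict[str, Vector]) -> Vector:
    """Mean-pool the embeddings of known words (an assumed pooling choice)."""
    vecs = [emb[w] for w in sentence.split() if w in emb]
    if not vecs:
        raise ValueError(f"no known words in: {sentence!r}")
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def content_preservation(pairs: Sequence[Tuple[str, str]],
                         emb: Dict[str, Vector]) -> float:
    """Average cosine similarity between each source and its transferred output."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    scores = [cosine(sentence_vector(src, emb), sentence_vector(out, emb))
              for src, out in pairs]
    return sum(scores) / len(scores)


# Toy data, invented for illustration only.
TOY_EMB = {"good": [1.0, 0.0], "bad": [-1.0, 0.0],
           "phone": [0.0, 1.0], "battery": [0.0, 0.8]}


def toy_classifier(sentence: str) -> str:
    return "negative" if "bad" in sentence.split() else "positive"


pairs = [("good phone", "bad phone"), ("good battery", "good battery")]
outputs = [out for _, out in pairs]
strength = transfer_strength(outputs, "negative", toy_classifier)
preservation = content_preservation(pairs, TOY_EMB)
```

The toy pairs show the tension between the two scores: the output that changes style scores lower on content preservation, while the unchanged output preserves content perfectly but contributes nothing to transfer strength.
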

Both models change style while keeping content, each to a different degree.
The content preservation metric correlates with human judgments, which supports its use for evaluation.
The paper notes a trade-off between style change and content retention, which suggests that the better model depends on the task.

Two datasets are used:
- A paper-news title dataset with academic paper titles and their news-style versions.
- A positive-negative review dataset with Amazon reviews labeled as positive or negative.
The datasets were split into training, validation, and test sets.
Preprocessing steps included:
- Filtering sentence lengths.
- Converting text to lowercase.
- Replacing numbers with placeholders.
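
The preprocessing steps above might look like this in Python. The length bounds (`min_len`, `max_len`), the whitespace tokenization, and the placeholder token `<num>` are assumptions for illustration; the summary does not give the actual values.

```python
import re
from typing import Iterable, List

# Matches integers and simple decimals such as "12", "12.5", or "1,000".
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")


def preprocess(sentences: Iterable[str], min_len: int = 3, max_len: int = 20,
               placeholder: str = "<num>") -> List[str]:
    """Lowercase, replace numbers with a placeholder, and filter by token count."""
    cleaned = []
    for sentence in sentences:
        # Lowercase first so the placeholder itself is left untouched.
        text = NUMBER.sub(placeholder, sentence.lower())
        tokens = text.split()
        if min_len <= len(tokens) <= max_len:
            cleaned.append(" ".join(tokens))
    return cleaned


sample = [
    "Deep Learning Improves Results by 12.5 Percent",
    "Too short",
    "A Survey of 3 Methods",
]
result = preprocess(sample)
```
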
Different settings were tested to ensure fair evaluation:
- Word embedding size: Values such as 64 and 128.
- Encoder hidden size: 16, 32, 64, and 128.
- Style embedding size: 32, 64, and 128.
- Batch size: Set to 128.
- Optimizer: Used Adadelta with a learning rate of 0.0001.
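
One way to enumerate such settings is a small grid. The sketch below uses only the values named above; because the word embedding sizes are not listed in full, only 64 and 128 appear. The names `GRID`, `FIXED`, and `iter_configs` are hypothetical, and whether the authors searched the full cross product is not stated.

```python
from itertools import product
from typing import Any, Dict, Iterator, List

# Values that were varied, as listed in the summary.
GRID: Dict[str, List[int]] = {
    "word_embedding_size": [64, 128],
    "encoder_hidden_size": [16, 32, 64, 128],
    "style_embedding_size": [32, 64, 128],
}

# Values that were held fixed.
FIXED: Dict[str, Any] = {
    "batch_size": 128,
    "optimizer": "adadelta",
    "learning_rate": 1e-4,
}


def iter_configs(grid: Dict[str, List[int]],
                 fixed: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield one configuration per combination of the varied values."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield {**dict(zip(keys, values)), **fixed}


configs = list(iter_configs(GRID, FIXED))
```
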

The content preservation metric correlated with human judgments, with a Spearman’s correlation of 0.5656 (p < 0.0001).
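
For reference, Spearman's correlation is the Pearson correlation computed on ranks. A minimal pure-Python sketch follows; tied values receive average ranks, and the scores at the bottom are made-up toy values, not data from the paper.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


# Toy values for illustration only.
metric_scores = [0.91, 0.85, 0.78, 0.95, 0.60]
human_scores = [4, 3, 3, 5, 1]
rho = spearman(metric_scores, human_scores)
```
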
In the paper-news title task:
- The style-embedding model achieved content preservation scores of 0.89 to 0.95 and transfer strength scores of 0.2 to 0.6, outperforming the baseline.
In the positive-negative review task:
- The multi-decoder model achieved a transfer strength of 0.8 and a content preservation of 0.85, outperforming the style-embedding model, which scored 0.6 and 0.75, respectively.
The multi-decoder model improved transfer strength by 50% over the auto-encoder in this task, reaching 0.6.
Overall, the models improved style change and content retention by 20 to 50% over the baseline.