{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Sequence Generation Under Constraints\n",
"\n",
"Generating a sequence of values respecting some constraints can be easy for some constraints, and very hard for others. These exercises aim at showing examples of sequence generation and their complexity.\n",
"This tutorial complements these [slides](http://perezguillau.me/document/combi_tutorial_2018.pdf).\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 1) Random Sequence\n",
"Let's start with a simple constraint. Let's generate a sequence of 10 words, but all the words must come from a corpus. Here our corpus is the book made by Sun Tzu *the art of war*. The text is available [here](https://suntzusaid.com/download.php). You may need to remove few of the first lines."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"import random \n",
"import string\n",
"import re\n",
"import os\n",
"\n",
"# Init and datafile\n",
"corpus_file = \"./the_art_of_war.txt\"\n",
"text = None\n",
"\n",
"# reading the datafile\n",
"with open(corpus_file,'r') as f:\n",
" text = f.read()\n",
"\n",
"# extracting the words\n",
"words_order = [s.lower() for s in re.compile('\\w+').findall(text)]\n",
"words = list(set(words_order))\n",
"\n",
"# print a random sequence from the corpus\n",
"print(\"Random sentence of size 10:\")\n",
"print(' '.join( [words[random.randint(0,len(words))] for i in range(10)] ))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Each word being independent of each other, the generation is simply done by sampling words one by one from the corpus.\n",
"The constraint that we just satisfied was a simple unary constraint applied to each variable, restricting their values to a set of allowed ones. \n",
"\n",
"Since we are doing a standalone process of sampling here, we could extract the frequency of each word in the text and use them in our sampling. While this would not change the set of possible solutions, it biases the generation and allows us to reach solutions whose word frequency is closer to the corpus more frequently. \n",
"Imagine the impact if you apply then NLP using bags of words to your generated sequences. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2) Sequence with Rhymes\n",
"The ability to generate text with rhymes is an important question in lyrics and poetry generation. In poetry, a [couplet](https://en.wikipedia.org/wiki/Couplet) is a set of two successive lines that rhyme and have the same meter.\n",
"Thus, our next constraint is to be able to generate a sequence of 4 lines, such that the endings of the first two and last two lines rhyme. \n",
"\n",
"\n",
"### How to know if two words rhymes\n",
"There exists a large number of different [rhymes](https://en.wikipedia.org/wiki/Rhyme). \n",
"In order to simplify the extraction of the rhyme information, we will consider that two words rhyme if their last four letters are the same. \n",
"Even if that is not necessarily true ('equally', 'ally'), the exactness of the group of rhyming words does not change the mathematical model used to resolve it and an exact extractor of rhymes is not one of the goals of this tutorial."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# define rhymes set\n",
"rhyme_groups = {}\n",
"for w in words:\n",
" if len(w) < 4:\n",
" continue\n",
" end_of_word = w[-4:]\n",
" if end_of_word in rhyme_groups:\n",
" rhyme_groups[end_of_word].append(w)\n",
" else:\n",
" rhyme_groups[end_of_word] = [w]\n",
" \n",
"# Remove small sets\n",
"key_to_del = []\n",
"for key in rhyme_groups:\n",
" value = rhyme_groups[key]\n",
" if len(value) <= 3:\n",
" key_to_del.append(key)\n",
"for key in key_to_del:\n",
" del rhyme_groups[key]\n",
" \n",
"# 1) How to get a rhyme\n",
"rhyme1 = random.sample([*rhyme_groups],1)[0]\n",
"print(rhyme1 + \"+>\" + str(rhyme_groups[rhyme1]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generate a sequence with a rhyme\n",
"\n",
"We now have at our disposition a *rhyme dictionary*. \n",
"Use this dictionary to generate a sequence of 4 lines of 5 words with words from our corpus with the contraint that the endings of the first two and last two lines rhyme."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# 2) generate\n",
"n_random_words = 4\n",
"n_sentences = 2\n",
"\n",
"sequence = []"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO\n",
"#Generate a sequence with rhymes and store it into the sequence list\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n",
"print(\"Final text\")\n",
"for i, w in enumerate(sequence):\n",
" print(w + \" \", end='' if i%(n_random_words+1) != n_random_words else '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Generating sequence with rhymes was not that hard either. \n",
"But we now have the capability of generating a sequence of values such that some of them are related.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3) Sequence with style\n",
"High school students are often asked to write stories or poems or other kinds of creative documents in the same style as the authors that they are currently studying. Most of the time this task consists of extracting characteristics of a certain text, like the special use of some hyperbole or even the family of the objectives themselves.\n",
"When listening to music the style may mean something different, and so too with paintings. \n",
"\n",
"Interest in learning and/or using the style of an artist is one of the fundamental questions of art based generative methods.\n",
"Recently impressive results have been accomplished, for example with deep-learning-based style transfer methods, which can recreate paintings with the *style* of another.\n",
"\n",
"\n",
"### What do we use as style\n",
"In this tutorial, we consider the *style* of a body of text (or music to be the succession of words (or musical notes) used in the given corpus. \n",
"So, if we are able to generate new content using the already seen successions of pairs of words, then the *style* will be the same. \n",
"\n",
"### Extracting this Style\n",
"Our first step is to extract the transitions between words from our corpus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Reminder: words_order was storing all the text, word by word.\n",
"# extracting the transitions\n",
"transition = {}\n",
"for i, w in enumerate(words_order[:-1]):\n",
" if w in transition:\n",
" transition[w].append(words_order[i+1])\n",
" else:\n",
" transition[w] = [words_order[i+1]]\n",
"# just removing duplicates\n",
"transition = {word: list(set(nexts)) for word, nexts in transition.items()}\n",
"\n",
"# Example of transition\n",
"word = random.sample([*transition],1)[0]\n",
"print(\"the word '\"+word+ \"' can be followed by \"+ str(transition[word]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Generating using the transition function\n",
"\n",
"Now that we have extracted the *style* from our corpus, we can use it to generate some sequences.\n",
"Generate a sequence of 10 words with the same style of our corpus."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"sequence_length = 10\n",
"# TODO\n",
"# Generate a sequence with style and store it into the sequence list"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"print(' '.join(sequence))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 4) Generating sequence with Style and Rhymes\n",
"\n",
"The first two steps were not really hard to implement and a simple sampling algorithms was able to handle them.\n",
"\n",
"What do you think about the complexity of sampling a sequence subject to the combination of the two constraints?\n",
"\n",
"More precisely: What is the complexity of generating a sequence, following a transition function, and by putting dependency (here equality => rhyme) between variables?\n",
"\n",
"The answer to this question has been established by several different works, and is that it is hard, like __NP-complete__ hard just to prove that there exists a solution. It has also been proven that sampling sequences with exact probability (using the transition frequency for example) is **#P-complete**.\n",
"\n",
"While we can easily find solutions with the two constraints independently, finding a solution that respect both of them simulaneously is hard. While this statement may seem like bad news, its counterpart is the core of model-based problem solving: **hard problem can often be defined by the intersection of some easy problems**.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 5) Model based problem definition\n",
"\n",
"Many problems can be defined by the intersection of several subproblems.\n",
"We create a __model__ by defining a finite set of __variables__, their __domains__ of definition (which is typically a finite set of values), and their __constraints__. \n",
"Constraints, relationships between variables, are a way of expressing subproblems, just like the rhyme and style constraints were able to define our problem. \n",
"They usually consist of a subset of the Cartesian product of the variables' domain. The intersection of the subsets of all the constraints of a problem is thus the set of valid solutions.\n",
"\n",
"### Constraint Programming model\n",
"\n",
"Many different frameworks exist for modeling problems, here we are going to use one of the most flexible one, in which we will be able to define our model easily. \n",
"Many constraints, like the transition constraints or the equality constraints, are native to this framework.\n",
"\n",
"We use the solver OR-tools made by google because of its efficient python binding. \n",
"Another advantage is that it includes both a CP and a CP/SAT solver with good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The installation of this solver can be done using pip:\n",
"\n",
" pip install ortools\n",
" \n",
"Here follows the definition of the basic requirement for our model, its variables and their associated domain: here, these are the indices of the words in the list *word*."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"#import the solver\n",
"from ortools.constraint_solver import pywrapcp\n",
"\n",
"# Create the solver.\n",
"solver = pywrapcp.Solver(\"one-over-f\")\n",
"\n",
"sequence_length = 20\n",
"min_dom = 0\n",
"max_dom = len(words)-1\n",
"\n",
"# sequence variables\n",
"seq = [solver.IntVar(min_dom, max_dom, \"seq(%i)\" % i) for i in range(sequence_length)]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"#### The transition constraint\n",
"\n",
"Transitions between variables is a well-studied problem in constraint programming.\n",
"In this example, the transition is even simpler since it only depends on the value of the previous variable.\n",
"Thus using the *transition* dictionary, we can create a binary table constraint (i.e. constraints applied to two variables), which is a constraint defined by the list of allowed combination (constraints named AllowedAssignement).\n",
"\n",
" solver.Add(solver.AllowedAssignments(variable_list, tuple_list))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Extract the conversion between words and indices\n",
"word_to_i = {w:i for i, w in enumerate(words)}\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO \n",
"# Build the table\n",
"transition_table = []"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO\n",
"# create the constraints\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The Rhyme constraint\n",
"\n",
"Now that the transition constraint is done, we can create our *rhyme* constraint.\n",
"From our previous definition, we choose to say that the words that must rhyme must belong to the same bag of *rhyming* words.\n",
"\n",
"Once again, this can be easily done using our *rhyme_groups* dictionary and table constraint.\n",
"Here we just want to enforce that any combination of the two words from the same list is allowed."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO\n",
"# Build the rhyme table\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# TODO\n",
"# Create the constraints\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Running the solver and extracting solutions\n",
"\n",
"Finally, once we have defined all our constraints, we need to ask the solver to find a (or many) solution for us. \n",
"This is an example of black box optimization, we do not necessarily know how the solver works, but only how to define a model.\n",
"\n",
"The following code set the search for a solution to a random selection of the values. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"db = solver.Phase(seq,\n",
" solver.CHOOSE_FIRST_UNBOUND,\n",
" solver.ASSIGN_RANDOM_VALUE)\n",
"\n",
"solver.NewSearch(db)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The search for a solution is done by calling the *NextSolution* method from the solver object. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"num_solution=0\n",
"while solver.NextSolution():\n",
" num_solution +=1\n",
" print([v.Value() for v in seq])\n",
" break\n",
"for i, w in enumerate(seq):\n",
" print(words[w.Value()] + \" \", end='' if i%(5) != 4 else '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is our solution, with rhymes and style from our corpus. Even if the problem is hard, we get a solution in an instant.\n",
"This seems fast so let's try to generate a bunch of solutions, say 50 thousand."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"while solver.NextSolution():\n",
" num_solution +=1\n",
" if num_solution > 50000:\n",
" break\n",
"print(num_solution)\n",
"for i, w in enumerate(seq):\n",
" print(words[w.Value()] + \" \", end='' if i%(5) != 4 else '\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Results\n",
"\n",
"Let's recall what we just did. \n",
"First, we defined a set of variables with domain the interval [0,D] where D is the maximum index in the words list.\n",
"Second, we have defined the constraints.\n",
"One constraint uses the transition list and ensures that each two consecutive values appear consecutively in the text.\n",
"Another ensures that the words at the end of each 5-word sentence rhymes.\n",
"Finally, we ran the solver and obtained a solution of our model. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 6) Generation without plagiarism\n",
"\n",
"The CP solver API doesn't seem to have the forbidden table constraint, here follows the model considering that the constraint is available. In the next section, this problem is used using the CP/SAT solver. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Create the solver.\n",
"solver = pywrapcp.Solver(\"one-over-f\")\n",
"\n",
"# sequence variables (use previous init variables)\n",
"seq = [solver.IntVar(min_dom, max_dom, \"seq(%i)\" % i) for i in range(sequence_length)]\n",
"\n",
"# create the transition constraints\n",
"for i, v in enumerate(seq[:-1]):\n",
" solver.Add(solver.AllowedAssignments( # also named table constraint, or extensional constraint\n",
" (seq[i], seq[i + 1]), # variables\n",
" transition_table)) # table\n",
"\n",
"\n",
"# create the rhymes constraints\n",
"solver.Add(solver.AllowedAssignments((seq[4], seq[9]), rhymes_table)) # First rhyme\n",
"solver.Add(solver.AllowedAssignments((seq[14], seq[19]), rhymes_table)) # second one"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true,
"scrolled": true
},
"outputs": [],
"source": [
"plagiarism_table = [[word_to_i[words_order[i]],word_to_i[words_order[i+1]],word_to_i[words_order[i+2]],word_to_i[words_order[i+3]]] \n",
" for i, w in enumerate(words_order[:-3])]\n",
"\n",
"# not accessible\n",
"#for i, w in enumerate(seq[:-3]):\n",
"# solver.Add(solver.ForbiddenAssignments( # also named table constraint, or extensional constraint\n",
"# (seq[i], seq[i + 1], seq[i + 2], seq[i + 3]), # variables\n",
"# plagiarism_table)) # table"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": true
},
"source": [
"### CP / Sat solver model\n",
"\n",
"Since the forbidden table constraint is not easily available in the ortools solver, we are using the CP/SAT solver that they provide. Here is the new model."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from ortools.sat.python import cp_model\n",
"# Create the solver.\n",
"model = cp_model.CpModel()\n",
"\n",
"# sequence variables (use previous init variables)\n",
"seq = [model.NewIntVar(min_dom, max_dom, \"seq(%i)\" % i) for i in range(sequence_length)]\n",
"\n",
"# create the transition constraints\n",
"for i, v in enumerate(seq[:-1]):\n",
" model.AddAllowedAssignments( # also named table constraint, or extensional constraint\n",
" (seq[i], seq[i + 1]), # variables\n",
" transition_table) # table\n",
" \n",
"# create the rhymes constraints\n",
"model.AddAllowedAssignments((seq[4], seq[9]), rhymes_table) # First rhyme\n",
"model.AddAllowedAssignments((seq[14], seq[19]), rhymes_table) # second one\n",
"\n",
"# create plagiarism constraint\n",
"for i, w in enumerate(seq[:-3]):\n",
" model.AddForbiddenAssignments( # also named table constraint, or extensional constraint\n",
" (seq[i], seq[i + 1], seq[i + 2], seq[i + 3]), # variables\n",
" plagiarism_table) # table\n",
"\n",
"model.AddAllDifferent(seq) # All the values are pairwise different.\n",
" \n",
"solver = cp_model.CpSolver()\n",
"\n",
"solver.Solve(model)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"for i, w in enumerate(seq):\n",
" print(words[solver.Value(w)] + \" \", end='' if i%(5) != 4 else '\\n')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.1"
}
},
"nbformat": 4,
"nbformat_minor": 2
}