<?xml version="1.0" encoding='utf-8'?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="card1" title="Large language model - Page 17 - Wikipedia">
<p>
<a accesskey="1" href="page.php?w=Large_language_model&amp;p=16">1.Previous</a><br />
<a accesskey="3" href="page.php?w=Large_language_model&amp;p=18">3.Next</a>
</p>
<p>(i.e. initial set of uni-grams). Next, the most frequent pair of adjacent characters is merged into a bi-gram, and every instance of that pair is replaced by it. The process then repeats: the most frequent adjacent pair of (previously merged) n-grams is merged into a longer n-gram, until a vocabulary of the prescribed size is obtained. Once a tokenizer is trained, it can tokenize any text, provided the text contains no characters absent from the initial set of uni-grams.</p>
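<p>The merge procedure described above can be illustrated with a minimal Python sketch. It is an illustration under simplifying assumptions, not a reference implementation: it works on characters rather than bytes, ignores word boundaries, and breaks ties between equally frequent pairs arbitrarily. The function names merge_pair, train_bpe and tokenize are hypothetical and chosen only for this example.</p>
<p>
```python
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each adjacent occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    n = len(tokens)
    while n > i:
        if tuple(tokens[i:i + 2]) == pair:
            merged.append(pair[0] + pair[1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def train_bpe(text, vocab_size):
    """Learn merges until the vocabulary reaches `vocab_size` entries.

    Assumption: the initial vocabulary is the set of characters in `text`.
    """
    tokens = list(text)
    vocab = sorted(set(tokens))  # initial set of uni-grams
    merges = []
    while vocab_size > len(vocab):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break  # nothing left to merge
        pair, _count = pairs.most_common(1)[0]
        merges.append(pair)
        new_token = pair[0] + pair[1]
        if new_token not in vocab:
            vocab.append(new_token)
        tokens = merge_pair(tokens, pair)
    return vocab, merges


def tokenize(text, merges, vocab):
    """Apply the learned merges, in training order, to new text."""
    unknown = set(text) - set(vocab)
    if unknown:
        # Characters outside the initial uni-grams cannot be tokenized.
        raise ValueError("unknown characters: " + repr(sorted(unknown)))
    tokens = list(text)
    for pair in merges:
        tokens = merge_pair(tokens, pair)
    return tokens
```
</p>
<p>For example, train_bpe("aaabdaaabac", 5) starts from the four uni-grams a, b, c and d and learns a single merge of the pair (a, a), because that pair is the most frequent. Calling tokenize on a text containing a character outside the initial uni-grams raises an error, which mirrors the limitation noted above.</p>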

<p><big>Dataset cleaning</big></p>
<p>In</p>

<do type="prev" label="Search">
        <go href="search.wml"/>
</do>

</card>
</wml>
