<?xml version="1.0" encoding='utf-8'?>
<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" "http://www.wapforum.org/DTD/wml_1.1.xml">
<wml>
<card id="card1" title="Large language model - Page 16 - Wikipedia">
<p>
<a accesskey="1" href="page.php?w=large_language_model&amp;p=15">1.Previous</a><br />
<a accesskey="3" href="page.php?w=large_language_model&amp;p=17">3.Next</a>
</p>

<p>Tokenization also <a href="page.php?w=Data_compression">compress</a>es the datasets. Because LLMs generally require input to be an <a href="page.php?w=Array_%28data_structure%29">array</a> that is not <a href="page.php?w=Jagged_array">jagged</a>, the shorter texts must be "padded" until they match the length of the longest one.</p>
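<p>The padding step can be sketched in a few lines of Python. This is only an illustration: the token IDs, the name pad_batch, and the choice of 0 as the padding ID are assumptions made for the example, not values from any particular tokenizer.</p>
<p>
```python
# Assumed padding ID; real tokenizers define their own reserved value.
PAD_ID = 0


def pad_batch(sequences, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    longest = max(len(seq) for seq in sequences)
    return [seq + [pad_id] * (longest - len(seq)) for seq in sequences]


# Hypothetical token IDs for three texts of different lengths (a jagged list).
batch = [[5, 8, 2], [7], [3, 9, 4, 1, 6]]
padded = pad_batch(batch)

# Every row now has the same length, so the batch is a rectangular array.
for row in padded:
    print(row)
```
</p>
<p>After padding, every row has the length of the longest text, so the batch is no longer jagged.</p>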

<p><big> Byte-pair encoding </big></p>
<p>As an example, consider a tokenizer based on byte-pair encoding. In the first step, all unique characters (including blanks and <a href="page.php?w=punctuation_mark">punctuation mark</a>s)</p>
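<p>A minimal sketch of how a byte-pair-encoding tokenizer can begin, assuming a tiny made-up sample text. The function names are hypothetical. The first function collects the unique characters described above; the merge of the most frequent adjacent pair is the standard next step of byte-pair encoding in general and is not taken from this page.</p>
<p>
```python
from collections import Counter


def initial_symbols(text):
    """Step one: every unique character, blanks and punctuation included."""
    return sorted(set(text))


def most_frequent_pair(symbols):
    """Return the adjacent pair of symbols that occurs most often."""
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0]


def merge_pair(symbols, pair):
    """Replace each occurrence of `pair` with a single merged symbol."""
    merged, i = [], 0
    while i != len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            merged.append(pair[0] + pair[1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


text = "ab ab, abc"  # made-up sample text
vocab = initial_symbols(text)
symbols = list(text)
pair = most_frequent_pair(symbols)
symbols = merge_pair(symbols, pair)
print(vocab, pair, symbols)
```
</p>
<p>Repeating the last two calls would keep adding merged symbols to the vocabulary until a chosen size is reached.</p>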


</card>
</wml>
