How to force jsPDF.js to correctly generate PDF with Cyrillic characters?

Sergey Ozeransky, 2013-11-20 10:13:21

Good afternoon. I ran into a problem with jsPDF.js: the library does not handle UTF-8 text. Has anyone come across this and knows how to make it generate PDFs with Cyrillic characters correctly?

10 answer(s)

max7 M7, 2013-11-20
@max7

If I understand correctly which jsPDF you mean, it will not work out of the box. The reason is the way PDF supports Unicode, plus the differences between PDF versions. A PDF (up to version 1.4 for sure) has to print on printers that have no fonts installed, so the fonts must be embedded in the document. There are a lot of nuances. Still,
try adding the following functions to your jsPDF code.
A variant of the pdfEscape function:

var padz = 
[
   "",
   "0",
   "00",
   "000",
   "0000"
];
var pdfEscape16 = function(text) 
{
   var ar = ["FEFF"];
   for(var i = 0, l = text.length, t; i < l; ++i)
   {
      t = text.charCodeAt(i).toString(16).toUpperCase();
      ar.push(padz[4 - t.length], t);
   }
   return ar.join("");
};

Add this function at the end of the object:

text16: function(x, y, text) 
{
   // need page height
   if(pageFontSize != fontSize) 
   {
      out('BT /F1 ' + parseInt(fontSize) + '.00 Tf ET');
      pageFontSize = fontSize;
   }
   var str = sprintf('BT %.2f %.2f Td <%s> Tj ET', x * k, (pageHeight - y) * k, pdfEscape16(text));
   out(str);
}

Then add text by calling the text16 function.
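To see what this kind of escaping produces, here is a minimal standalone sketch. The name toHexUtf16 and the withBOM argument are made up for illustration and are not part of jsPDF; the function only mirrors the idea of pdfEscape16, turning each UTF-16 code unit into four uppercase hex digits.

```javascript
// Hypothetical helper (not a jsPDF function): hex-encode a string as
// big-endian UTF-16 code units, optionally prefixed with the FEFF marker.
function toHexUtf16(text, withBOM) {
  var parts = withBOM ? ["FEFF"] : [];
  for (var i = 0; i < text.length; i++) {
    var hex = text.charCodeAt(i).toString(16).toUpperCase();
    parts.push(("0000" + hex).slice(-4)); // left-pad to 4 hex digits
  }
  return parts.join("");
}

// A PDF hex string is written between angle brackets before the Tj operator.
console.log("<" + toHexUtf16("Привет", true) + "> Tj");
// prints "<FEFF041F04400438043204350442> Tj"
```

Whether a viewer actually shows Cyrillic glyphs for such a string still depends on the font and encoding declared in the document, so treat this only as a look at the bytes being written.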

max7 M7, 2013-11-20
@max7

There is a similar question on RSDN: "Outputting Unicode text to PDF".

max7 M7, 2013-11-20
@max7

You are right, I was looking at an old version...
Let's look at the current code on git and trace through it:
Line 1149: API.text = function (text, x, y, flags) {
(note the flags parameter)
Line 1202: str = pdfEscape(text, flags);
(this calls the pdfEscape function)
Line 801: pdfEscape = function (text, flags) {
(the flags parameter again)
Line 811: return to8bitStream(text, flags)
(this calls the to8bitStream function)
Line 659: to8bitStream = function (text, flags) {
(the flags parameter again)
Lines 660-707: the comment about Unicode:

/* PDF 1.3 spec:
"For text strings encoded in Unicode, the first two bytes must be 254 followed by
255, representing the Unicode byte order marker, U+FEFF. (This sequence conflicts
with the PDFDocEncoding character sequence thorn ydieresis, which is unlikely
to be a meaningful beginning of a word or phrase.) The remainder of the
string consists of Unicode character codes, according to the UTF-16 encoding
specified in the Unicode standard, version 2.0. Commonly used Unicode values
are represented as 2 bytes per character, with the high-order byte appearing first
in the string."

In other words, if there are chars in a string with char code above 255, we
recode the string to UCS2 BE - string doubles in length and BOM is prepended.

HOWEVER!
Actual *content* (body) text (as opposed to strings used in document properties etc)
does NOT expect BOM. There, it is treated as a literal GID (Glyph ID)

Because of Adobe's focus on "you subset your fonts!" you are not supposed to have
a font that maps directly Unicode (UCS2 / UTF16BE) code to font GID, but you could
fudge it with "Identity-H" encoding and custom CIDtoGID map that mimics Unicode
code page. There, however, all characters in the stream are treated as GIDs,
including BOM, which is the reason we need to skip BOM in content text (i.e. that
that is tied to a font).

To signal this "special" PDFEscape / to8bitStream handling mode,
API.text() function sets (unless you overwrite it with manual values
given to API.text(.., flags) )
flags.autoencode = true
flags.noBOM = true

*/

 /*
`flags` properties relied upon:
.sourceEncoding = string with encoding label.
"Unicode" by default. = encoding of the incoming text.
pass some non-existing encoding name
(ex: 'Do not touch my strings! I know what I am doing.')
to make encoding code skip the encoding step.
.outputEncoding = Either valid PDF encoding name
(must be supported by jsPDF font metrics, otherwise no encoding)
or a JS object, where key = sourceCharCode, value = outputCharCode
missing keys will be treated as: sourceCharCode === outputCharCode
.noBOM
See comment higher above for explanation for why this is important
.autoencode
See comment higher above for explanation for why this is important
*/

That is as far as I have dug so far; I'll keep looking...
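The recoding rule described in that comment can be sketched roughly like this. It is not jsPDF's actual to8bitStream, just an illustration under the assumption that the rule is exactly what the comment says: if any char code is above 255, write every code unit as two bytes, high byte first, and prepend the BOM unless noBOM is set. The name toUcs2BE is hypothetical.

```javascript
// Hypothetical sketch of the rule from the quoted comment (not jsPDF code).
function toUcs2BE(text, noBOM) {
  var needsRecode = false;
  for (var i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) > 255) { needsRecode = true; break; }
  }
  if (!needsRecode) return text; // plain 8-bit text is left untouched

  var bytes = noBOM ? [] : [254, 255]; // BOM = FE FF
  for (var j = 0; j < text.length; j++) {
    var code = text.charCodeAt(j);
    bytes.push(code >> 8, code & 0xff); // high-order byte first
  }
  return String.fromCharCode.apply(null, bytes);
}

// "Я" is U+042F, so it becomes the two bytes 04 2F (preceded by FE FF with a BOM).
console.log(toUcs2BE("Я", false).length); // prints 4
console.log(toUcs2BE("Я", true).length);  // prints 2
```

The noBOM branch is the one that matters for body text, since there every pair of bytes is read as a glyph ID, including the BOM if it were left in.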

Sergey Ozeransky, 2013-11-20
@KREGI

The code specifies the Helvetica font, but does it support Unicode? It is not in Wikipedia's list of fonts that support Unicode.

max7 M7, 2013-11-20
@max7

Here is what they write in "Unicode in PDF":

See Appendix D (page 995) of the PDF specification. There is a limited number of fonts and
character sets pre-defined in a PDF consumer application. To display other characters you 
need to embed a font that contains them. It is also preferable to embed only a subset of 
the font, including only required characters, in order to reduce file size. 
I am also working on displaying Unicode characters in PDF and it is a major hassle.
adobe.com/devnet/pdf/pdf_reference.html

The standard fonts contain no Cyrillic glyphs.
jsPDF has no font embedding.
mPDF (PHP), for example, does exactly that: it embeds a font subset.
Without embedding, the PDF viewer itself has to contain or provide Unicode fonts.
Suppose our viewer does provide such fonts.
Add the following code to jspdf.js before line 1912 (the return API; statement of the jsPDF function):

var padz = 
[
   "",
   "0",
   "00",
   "000",
   "0000"
];
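var pdfEscape16 = function(text, flags) 
{
   // Hex-encode each UTF-16 code unit as 4 digits, high byte first.
   // Per the jsPDF comment on to8bitStream, content text must not carry
   // a BOM, so FEFF is prepended only when flags.noBOM is not set.
   var ar = (flags && flags.noBOM) ? [] : ["FEFF"];
   for(var i = 0, l = text.length, t; i < l; ++i)
   {
      t = text.charCodeAt(i).toString(16).toUpperCase();
      ar.push(padz[4 - t.length], t);
   }
   return ar.join("");
};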

API.text16 = function (text, x, y, flags) 
{
   /**
   * Inserts something like this into PDF
   BT
   /F1 16 Tf % Font name + size
   16 TL % How many units down for next line in multiline text
   0 g % color
   28.35 813.54 Td % position
   (line one) Tj
   T* (line two) Tj
   T* (line three) Tj
   ET
   */

   var undef, _first, _second, _third, newtext, str, i;
   // Pre-August-2012 the order of arguments was function(x, y, text, flags)
   // in effort to make all calls have similar signature like
   // function(data, coordinates... , miscellaneous)
   // this method had its args flipped.
   // code below allows backward compatibility with old arg order.
   if (typeof text === 'number') {
       _first = y;
       _second = text;
       _third = x;

       text = _first;
       x = _second;
       y = _third;
   }

   // If there are any newlines in text, we assume
   // the user wanted to print multiple lines, so break the
   // text up into an array. If the text is already an array,
   // we assume the user knows what they are doing.
   if (typeof text === 'string' && text.match(/[\n\r]/)) {
       text = text.split(/\r\n|\r|\n/g);
   }

   if (typeof flags === 'undefined') {
       flags = {'noBOM': true, 'autoencode': true};
   } else {

       if (flags.noBOM === undef) {
           flags.noBOM = true;
       }

       if (flags.autoencode === undef) {
           flags.autoencode = true;
       }

   }

   if (typeof text === 'string') {
       str = pdfEscape16(text, flags);
   } else if (text instanceof Array) { /* Array */
       // we don't want to destroy original text array, so cloning it
       newtext = text.concat();
       // we do array.join('text that must not be PDFescaped")
       // thus, pdfEscape each component separately
       for (i = newtext.length - 1; i !== -1; i--) {
           newtext[i] = pdfEscape16(newtext[i], flags);
       }
       str = newtext.join("> Tj\nT* <");
   } else {
       throw new Error('Type of text must be string or Array. "' + text + '" is not recognized.');
   }
   // Using "'" ("go next line and render text" mark) would save space but would complicate our rendering code, templates

   // BT .. ET does NOT have default settings for Tf. You must state that explicitly every time for BT .. ET
   // if you want text transformation matrix (+ multiline) to work reliably (which reads sizes of things from font declarations)
   // Thus, there is NO useful, *reliable* concept of "default" font for a page.
   // The fact that "default" (reuse font used before) font worked before in basic cases is an accident
   // - readers dealing smartly with brokenness of jsPDF's markup.
   out(
       'BT\n/' +
           activeFontKey + ' ' + activeFontSize + ' Tf\n' + // font face, style, size
           (activeFontSize * lineHeightProportion) + ' TL\n' + // line spacing
           textColor +
           '\n' + f2(x * k) + ' ' + f2((pageHeight - y) * k) + ' Td\n<' +
           str +
           '> Tj\nET'
   );
   
   return this;
};

Then add text by calling the text16 function.
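For reference, here is a rough standalone sketch of the operator block that a function like text16 writes for multiline text. Everything in it is an assumption for illustration: buildHexTextBlock and its opts fields are made-up names, the coordinates are taken as already converted to PDF units, and the BOM is left out, following the jsPDF comment about content text.

```javascript
// Hypothetical sketch (not jsPDF code): build a BT .. ET block whose
// lines are hex strings of big-endian UTF-16 code units, without a BOM.
function buildHexTextBlock(lines, opts) {
  var hexLines = lines.map(function (line) {
    var hex = "";
    for (var i = 0; i < line.length; i++) {
      hex += ("0000" + line.charCodeAt(i).toString(16).toUpperCase()).slice(-4);
    }
    return "<" + hex + "> Tj";
  });
  return [
    "BT",
    "/" + opts.fontKey + " " + opts.fontSize + " Tf", // font key and size
    (opts.fontSize * opts.lineHeight) + " TL",        // leading used by T*
    opts.x.toFixed(2) + " " + opts.y.toFixed(2) + " Td", // start position
    hexLines.join("\nT* "),                           // T* moves to the next line
    "ET"
  ].join("\n");
}

var block = buildHexTextBlock(["Да", "Нет"], {
  fontKey: "F1", fontSize: 16, lineHeight: 1.15, x: 28.35, y: 813.54
});
console.log(block);
```

Printing the block makes it easy to compare against what the patched jsPDF actually puts in the file when debugging.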

max7 M7, 2013-11-20
@max7

For example, here is how pdfjs (for node.js) adds a font to a document:
see the function TTFFont.Subset.prototype.embed at line 1065,
then lines 1099-1143 (the Unicode map),
and so on.

max7 M7, 2013-11-20
@max7

Try this project: Create PDFs in your browser (port of libharu)

6yp9T, 2013-11-20
@6yp9T

We once faced a similar problem and solved it as follows:
1 - draw the text on a canvas;
2 - convert the canvas to an image;
3 - add the image to the PDF with jsPDF.

Andrey Shevtsov, 2016-11-08
@phoenix2006

6yp9T, do you have a good example?

Dmaw, 2020-01-29
@Dmaw

Take a screenshot of the text using html2canvas and transfer the image to PDF.
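The canvas route from the answers above might look roughly like the sketch below. It assumes `doc` is a jsPDF instance and `canvas` is an HTMLCanvasElement created by the caller; the function name addTextAsImage and the opts fields (fontPx, fontFamily, x, y, scale) are hypothetical. The text ends up as a bitmap, so it cannot be selected or searched in the resulting PDF.

```javascript
// Hypothetical sketch: draw text on a canvas, then put the bitmap into the PDF.
// Both objects are passed in, so the function itself uses no browser globals.
function addTextAsImage(doc, canvas, text, opts) {
  var ctx = canvas.getContext("2d");
  var font = opts.fontPx + "px " + opts.fontFamily;

  // Measure with the target font, then size the canvas to fit the text.
  ctx.font = font;
  canvas.width = Math.ceil(ctx.measureText(text).width);
  canvas.height = Math.ceil(opts.fontPx * 1.5);

  // Resizing a canvas resets its context state, so set everything again.
  ctx.font = font;
  ctx.textBaseline = "top";
  ctx.fillStyle = "#fff";
  ctx.fillRect(0, 0, canvas.width, canvas.height); // opaque white background
  ctx.fillStyle = "#000";
  ctx.fillText(text, 0, 0);

  // Convert the canvas to a PNG data URL and place it in the document;
  // scale converts canvas pixels to document units.
  var dataUrl = canvas.toDataURL("image/png");
  doc.addImage(dataUrl, "PNG", opts.x, opts.y,
               canvas.width * opts.scale, canvas.height * opts.scale);
  return doc;
}
```

In a browser it could be called as addTextAsImage(doc, document.createElement("canvas"), "Привет", { fontPx: 20, fontFamily: "sans-serif", x: 10, y: 20, scale: 0.25 }), followed by doc.save(...).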