I got a comment yesterday. The reader wrote:
Hi, I just install Cygwin and pdfgrep for windows and thanks for your job man, it works almost perfectly but I would like to grep some string containing accent (é,è,à…) and I see in the manage, that’s possible with the --unac option if pdfgrep is compiled with unac, but I’m from France and not really comfortable the terminal so I don’t know what you are saying by “only available if pdfgrep is compiled with unac support”, so I try to use --unac and of course I get the error : “unknown option --unac” so i would like to know what I have to do to use this option because I found it nowhere on the all web ?
So I built a new version of pdfgrep with unac support. In this version you can use pdfgrep --unac. This option removes accents and ligatures from both the search pattern and the PDF documents. It is useful if you want to search for a word containing ‘ae’ but the PDF uses the single character ‘æ’ instead, or if you want to find the word Hotel in both English (Hotel) and French (Hôtel). In that case you just need:
pdfgrep --unac Hotel accent.pdf
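To show what "removing accents and ligatures from both sides" means, here is a rough Python sketch of the idea. It is not how pdfgrep or the unac library is implemented; the function names strip_accents and matches and the small EXTRA table are made up for this illustration.

```python
import re
import unicodedata

# Characters that Unicode NFKD normalization does not split on its own.
# This tiny table is an assumption for illustration, not unac's real mapping.
EXTRA = {"æ": "ae", "Æ": "AE", "œ": "oe", "Œ": "OE"}


def strip_accents(text: str) -> str:
    """Return text with accents removed and common ligatures expanded."""
    # Expand the ligatures that normalization leaves untouched.
    text = "".join(EXTRA.get(ch, ch) for ch in text)
    # NFKD splits 'ô' into 'o' plus a combining circumflex.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks and keep the base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def matches(pattern: str, line: str) -> bool:
    """Search after stripping accents from both the pattern and the line."""
    return re.search(strip_accents(pattern), strip_accents(line)) is not None


if __name__ == "__main__":
    print(strip_accents("Hôtel"))
    print(matches("Hotel", "Un Hôtel à Paris"))
    print(matches("aether", "the æther theory"))
```

Because the same stripping is applied to the pattern and to the text, searching for Hotel finds Hôtel, and searching for Hôtel finds Hotel too.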
By the way, the Windows version of pdfgrep works like any other Windows console application; it does not need Cygwin or MinGW32.
PS. While compiling the new version of pdfgrep, I got a by-product: unaccent for Windows, a tool that removes accents from an input stream or a string.
PostgreSQL also has an unaccent function, starting with version 9. From its documentation:
unaccent is a text search dictionary that removes accents (diacritic signs) from lexemes. It’s a filtering dictionary, which means its output is always passed to the next dictionary (if any), unlike the normal behavior of dictionaries. This allows accent-insensitive processing for full text search.
The current implementation of unaccent cannot be used as a normalizing dictionary for the thesaurus dictionary.