The Unreasonable Effectiveness of Pattern Generation
Zpravodaj Československého sdružení uživatelů TeXu, Vol. 29 (2019), no. 1-4, pp. 73-86
This article was harvested from the Czech Digital Mathematics Library.


Languages are constantly evolving, and so are their hyphenation rules and needs. The effectiveness and utility of TeX's hyphenation have been proven by its usage in almost all typesetting systems in use today. The current Czech hyphenation patterns were generated in 1995, and no hyphenated word database was freely available. We have developed a new Czech word database, used the patgen program to generate new, effective Czech hyphenation patterns efficiently, and evaluated their generalization qualities. We have achieved full coverage on the training dataset of 3,000,000 words and developed a validation procedure for new Czech patterns based on a testing database of 105,000 words approved by linguists of the Czech Academy of Sciences. Our pattern generation case study exemplifies a practical solution to the widespread dictionary problem. The study has proven the versatility, effectiveness, and extensibility of Liang's approach to hyphenation developed for TeX. The unreasonable effectiveness of the pattern technology has led to applications that are in use today, nearly 40 years after its inception, and that will be used even more widely.
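As a rough illustration of how patterns in the style of Liang's approach are applied to a word, the following Python sketch may help. It is not the patgen program and not the authors' code. The function names `parse_pattern` and `hyphenate`, the `left_min`/`right_min` parameters, and the two toy patterns in the usage note are assumptions made for this example only; they are not taken from any Czech pattern set.

```python
import re


def parse_pattern(pattern):
    """Split a pattern such as '2tt' into its letters 'tt' and levels [2, 0, 0]."""
    letters = re.sub(r"\d", "", pattern)
    levels = []
    pending = 0
    for ch in pattern:
        if ch.isdigit():
            pending = int(ch)       # level for the gap before the next letter
        else:
            levels.append(pending)
            pending = 0
    levels.append(pending)          # gap after the last letter
    return letters, levels


def hyphenate(word, patterns, left_min=2, right_min=2):
    """Apply competing patterns: the highest level wins, odd levels allow a break."""
    table = dict(parse_pattern(p) for p in patterns)
    dotted = "." + word.lower() + "."       # dots mark the word boundaries
    scores = [0] * (len(dotted) + 1)        # scores[i] = gap before dotted[i]
    for start in range(len(dotted)):
        for end in range(start + 1, len(dotted) + 1):
            levels = table.get(dotted[start:end])
            if levels:
                for offset, level in enumerate(levels):
                    position = start + offset
                    scores[position] = max(scores[position], level)
    pieces = []
    last = 0
    # The gap before word[k] corresponds to scores[k + 1] because of the leading dot.
    for k in range(left_min, len(word) - right_min + 1):
        if scores[k + 1] % 2 == 1:
            pieces.append(word[last:k])
            last = k
    pieces.append(word[last:])
    return "-".join(pieces)
```

With the hypothetical toy patterns `["1t", "2tt"]`, the call `hyphenate("pattern", ["1t", "2tt"])` gives `pat-tern`: the pattern `1t` proposes a break before every `t`, and the higher, even level in `2tt` suppresses the break before the first of two consecutive `t`s. Real pattern sets contain thousands of such competing patterns, generated from a hyphenated word list.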
DOI : 10.5300/2019-1-4/73
Keywords: hyphenation patterns; patgen; unreasonable effectiveness; Czech (in Czech: vzory dělení slov; nepochopitelná efektivita; čeština)
@article{10_5300_2019_1_4_73,
     author = {Sojka, Petr and Sojka, Ond\v{r}ej},
     title = {The {Unreasonable} {Effectiveness} of {Pattern} {Generation}},
     journal = {Zpravodaj \v{C}eskoslovensk\'eho sdru\v{z}en{\'\i} u\v{z}ivatel\r{u} TeXu},
     pages = {73--86},
     year = {2019},
     volume = {29},
     number = {1-4},
     doi = {10.5300/2019-1-4/73},
     language = {en},
     url = {http://geodesic.mathdoc.fr/articles/10.5300/2019-1-4/73/}
}
TY  - JOUR
AU  - Sojka, Petr
AU  - Sojka, Ondřej
TI  - The Unreasonable Effectiveness of Pattern Generation
JO  - Zpravodaj Československého sdružení uživatelů TeXu
PY  - 2019
SP  - 73
EP  - 86
VL  - 29
IS  - 1-4
UR  - http://geodesic.mathdoc.fr/articles/10.5300/2019-1-4/73/
DO  - 10.5300/2019-1-4/73
LA  - en
ID  - 10_5300_2019_1_4_73
ER  - 
%0 Journal Article
%A Sojka, Petr
%A Sojka, Ondřej
%T The Unreasonable Effectiveness of Pattern Generation
%J Zpravodaj Československého sdružení uživatelů TeXu
%D 2019
%P 73-86
%V 29
%N 1-4
%U http://geodesic.mathdoc.fr/articles/10.5300/2019-1-4/73/
%R 10.5300/2019-1-4/73
%G en
%F 10_5300_2019_1_4_73
Sojka, Petr; Sojka, Ondřej. The Unreasonable Effectiveness of Pattern Generation. Zpravodaj Československého sdružení uživatelů TeXu, Vol. 29 (2019), no. 1-4, pp. 73-86. doi: 10.5300/2019-1-4/73

