The Chvátal–Sankoff problem: Understanding random string comparison through stochastic processes
Zapiski Nauchnykh Seminarov POMI, Representation theory, dynamical systems, combinatorial methods. Part XXXIV, Tome 517 (2022), pp. 191-224 Cet article a éte moissonné depuis la source Math-Net.Ru

Voir la notice du chapitre de livre

Given two equally long, uniformly random binary\break strings, the expected length of their longest common subsequence (LCS) is asymptotically proportional to the strings' length. Finding the proportionality coefficient $\gamma$, i.e. the limit of the normalised LCS length for two random binary strings of length $n \to \infty$, is a very natural problem, first posed by Chvátal and Sankoff in 1975, and as yet unresolved. This problem has relevance to diverse fields ranging from combinatorics and algorithm analysis to coding theory and computational biology. Using methods of statistical mechanics, as well as some existing results on the combinatorial structure of LCS, we link constant $\gamma$ to the parameters of a certain stochastic particle process. These parameters are determined by a specific (large) system of polynomial equations with integer coefficients, which implies that $\gamma$ is an algebraic number. Short of finding an exact closed-form solution for such a polynomial system, which appears to be unlikely, our approach essentially resolves the Chvátal–Sankoff problem, albeit in a somewhat unexpected way with a rather negative flavour.
@article{ZNSL_2022_517_a11,
     author = {A. Tiskin},
     title = {The {Chv\'atal{\textendash}Sankoff} problem: {Understanding} random string comparison through stochastic processes},
     journal = {Zapiski Nauchnykh Seminarov POMI},
     pages = {191--224},
     year = {2022},
     volume = {517},
     language = {en},
     url = {http://geodesic.mathdoc.fr/item/ZNSL_2022_517_a11/}
}
TY  - JOUR
AU  - A. Tiskin
TI  - The Chvátal–Sankoff problem: Understanding random string comparison through stochastic processes
JO  - Zapiski Nauchnykh Seminarov POMI
PY  - 2022
SP  - 191
EP  - 224
VL  - 517
UR  - http://geodesic.mathdoc.fr/item/ZNSL_2022_517_a11/
LA  - en
ID  - ZNSL_2022_517_a11
ER  - 
%0 Journal Article
%A A. Tiskin
%T The Chvátal–Sankoff problem: Understanding random string comparison through stochastic processes
%J Zapiski Nauchnykh Seminarov POMI
%D 2022
%P 191-224
%V 517
%U http://geodesic.mathdoc.fr/item/ZNSL_2022_517_a11/
%G en
%F ZNSL_2022_517_a11
A. Tiskin. The Chvátal–Sankoff problem: Understanding random string comparison through stochastic processes. Zapiski Nauchnykh Seminarov POMI, Representation theory, dynamical systems, combinatorial methods. Part XXXIV, Tome 517 (2022), pp. 191-224. http://geodesic.mathdoc.fr/item/ZNSL_2022_517_a11/

[1] A. Abboud, A. Backurs, V. Vassilevska Williams, “Tight hardness results for LCS and other sequence similarity measures”, Proceedings of FOCS, 2015, 59–78 | MR

[2] K. S. Alexander, “The rate of convergence of the mean length of the longest common subsequence”, Annal. Appl. Probab., 4:4 (1994), 1074–1082 | MR

[3] C. E. R. Alves, E. N. Cáceres, S. W. Song, “An all-substrings common subsequence algorithm”, Discrete Appl. Math., 156:7 (2008), 1025–1035 | DOI | MR

[4] R. A. Baeza-Yates, R. Gavaldà, G. Navarro, R. Scheihing, “Bounding the expected length of longest common subsequences and forests”, Theory of Computing Systems, 32:4 (1999), 435–452 | DOI | MR

[5] K. E. Batcher, “Sorting networks and their applications”, Proceedings of AFIPS, 32 (1968), 307–314

[6] A. Borodin, I. Corwin, V. Gorin, “Stochastic six-vertex model”, Duke Math. J., 165:3 (2016), 563–624 | DOI | MR

[7] K. Bringmann, M. K{ü}nnemann, “Multivariate fine-grained complexity of longest common subsequence”, Proceedings of ACM-SIAM SODA, 2018, 1216–1235 | MR

[8] B. Bukh, C. Cox, “Periodic words, common subsequences and frogs”, Annals Appl. Probab., 32:2 (2022), 1295–1332 | DOI | MR

[9] B. Bukh, V. Guruswami, J. Hastad, “An improved bound on the fraction of correctable deletions”, IEEE Transactions on Information Theory, 63:1 (2017), 93–103 | DOI | MR

[10] R. Bundschuh, “High precision simulations of the longest common subsequence problem”, European Phys. J. B, 22:4 (2001), 533–541 | DOI

[11] J. Casse, “Probabilistic cellular automata with general alphabets possessing a Markov chain as an invariant distribution”, Adv. Appl. Probab., 48:2 (2016), 369–391 | DOI | MR

[12] C C Chang, H J Keisler, Model Theory, Studies in Logic and the Foundations of Mathematics, 73, third edition, North-Holland, 1990 | MR

[13] P. Charalampopoulos, P. Gawrychowski, S. Mozes, O. Weimann, “An Almost Optimal Edit Distance Oracle”, Proceedings of ICALP, Leibniz International Proceedings in Informatics, 198, 2021, 48:1–48:20 | MR

[14] P. Charalampopoulos, T. Kociumaka, Sh. Mozes, “Dynamic string alignment”, Proceedings of CPM, LIPIcs, 161, 2020, 9:1–9:13 | MR

[15] V. Chvátal, D. Sankoff, “Longest common subsequences of two random sequences”, J. Appl. Probab., 12:2 (1975), 306–315 | DOI | MR

[16] M. Crochemore, C. S. Iliopoulos, Y. J. Pinzon, J. F. Reid, “A fast and practical bit-vector algorithm for the Longest Common Subsequence problem”, Inform. Proc. Letters, 80 (2001), 279–285 | DOI | MR

[17] M. Crochemore, G. M. Landau, M. Ziv-Ukelson, “A subquadratic sequence alignment algorithm for unrestricted score matrices”, SIAM J. Comput., 32 (2003), 1654–1673 | DOI | MR

[18] V. Dančík, Expected Length of Longest Common Subsequences, PhD thesis, University of Warwick, 1994

[19] J. Boutet De Monvel, “Extensive simulations for longest common subsequences: Finite size scaling, a cavity solution, and configuration space properties”, Europ. Phys. J. B, 7:2 (1999), 293–308 | DOI

[20] J. G. Deken, “Some limit results for longest common subsequences”, Discrete Math., 26:1 (1979), 17–31 | DOI | MR

[21] Pablo A. Ferrari, “TASEP hydrodynamics using microscopic characteristics”, Probab. Surveys, 15 (2018), 1–27 | MR

[22] P. Gawrychowski, “Faster algorithm for computing the edit distance between SLP-compressed strings”, Proceedings of SPIRE, Lecture Notes in Computer Science, 7608, 2012, 229–236 | DOI | MR

[23] D. Hermelin, G. M. Landau, S. Landau, O. Weimann, “Unified Compression-Based Acceleration of Edit-Distance Computation”, Algorithmica, 65:2 (2013), 339–353 | DOI | MR

[24] H. Hyyrö, “Mining bit-parallel LCS-length algorithms”, Proceedings of SPIRE, Lecture Notes in Computer Science, 10508, 2017, 214–220 | DOI | MR

[25] C U Jensen and H Lenzing, Model Theoretic Algebra, Algebra, Logic and Applications, 2, Gordon and Breach Science Publishers, 1989 | MR

[26] D. E. Knuth, The Art of Computer Programming: Sorting and Searching, v. 3, Addison Wesley, 1998 | MR

[27] Th. Kriecherbauer, J. Krug, “A pedestrian's view on interacting particle systems, KPZ universality and random matrices”, J. Phys. A: Math. Theor., 43:40 (2010), 403001 | DOI | MR

[28] P. Krusche, A. Tiskin, “String comparison by transposition networks”, London Algorithmics 2008 Theory and Practice, Texts in Algorithmics, 11, College Publications, 2009 | MR

[29] B. F. Logan, L. A. Shepp, “A variational problem for random Young tableaux”, Adv. Math., 26:2 (1977), 206–222 | DOI | MR

[30] G. S. Lueker, “Improved bounds on the average length of longest common subsequences”, J. ACM, 56:3 (2009), 17:1–17:38 | DOI | MR

[31] J. Mairesse, I. Marcovici, “Around probabilistic cellular automata”, Theor. Comput. Sci., 559 (2014), 42–72 | DOI | MR

[32] S. N. Majumdar, S. Nechaev, “Exact asymptotic results for the Bernoulli matching model of sequence alignment”, Phys. Review E: Statistical, Nonlinear, and Soft Matter Physics, 72:2 (2005), 020901 | DOI | MR

[33] J. Martin, P. Schmidt, “Multi-type TASEP in discrete time”, Latin Amer. J. Probab. Math. Statist., 8 (2011), 303–333 | MR

[34] W. J. Masek, M. S. Paterson, “A faster algorithm computing string edit distances”, J. Comput. System Sci., 20:1 (1980), 18–31 | DOI | MR

[35] U. Matarazzo, D. Tsur, M. Ziv-Ukelson, “Efficient all path score computations on grid graphs”, Theor Comput. Sci., 525 (2014), 138–149 | DOI | MR

[36] M. Paterson, V. Dančík, “Longest common subsequences”, Proceedings of MFCS, Lecture Notes in Computer Science, 841, 127–142 | DOI | MR

[37] P. A. Pevzner, M. S. Waterman, “Open combinatorial problems in computational molecular biology”, Proceedings of ISTCS, 158–173 | MR

[38] V. B. Priezzhev, G. M. Sch{ü}tz, “Exact solution of the Bernoulli matching model of sequence alignment”, J. Statist. Mech.: Theory and Experiment, 09 (2008), P09007

[39] N. Rajewsky, L. Santen, A. Schadschneider, M. Schreckenberg, “The asymmetric exclusion process: comparison of update procedures”, J. Statist. Phys., 92 (1998), 151–194 | DOI | MR

[40] D. Romik, The Surprising Mathematics of Longest Increasing Subsequences, Cambridge University Press, Cambridge, 2014 | MR

[41] H. Rost, “Non-equilibrium behaviour of a many particle process: Density profile and local equilibria”, Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 58:1 (1981), 41–53 | DOI | MR

[42] Y. Sakai, “An Almost Quadratic Time Algorithm for Sparse Spliced Alignment”, Theory Comput. Systems, 48:1 (2011), 189–210 | DOI | MR

[43] Y. Sakai, “A substring–substring LCS data structure”, Theor. Comput. Sci., 753:2 (2019), 16–34 | DOI | MR

[44] S. Salsa, Partial Differential Equations in Action, UNITEXT, 99, Springer International Publishing, 2016 | MR

[45] M. Schimd, G. Bilardi, “Bounds and Estimates on the Average Edit Distance”, Proceedings of SPIRE, 2019, 91–106 | MR

[46] J. P. Schmidt, “All highest scoring paths in weighted grid graphs and their application to finding all approximate repeats in strings”, SIAM J. Comput., 27:4 (1998), 972–992 | DOI | MR

[47] J. M. Steele, “An Efron-Stein inequality for nonsymmetric statistics”, Annals Statist., 14:2 (1986), 753–758 | DOI | MR

[48] J. M. Steele, Probability Theory and Combinatorial Optimization, CBMS-NSF regional conference series in applied mathematics, 69, SIAM, 1997 | MR

[49] A. Tiskin, “Semi-local longest common subsequences in subquadratic time”, J. Discrete Algorithms, 6:4 (2008), 570–581 | DOI | MR

[50] A. Tiskin, “Semi-local string comparison: Algorithmic techniques and applications”, Math. Comput. Sci., 1:4 (2008), 571–603 | DOI | MR

[51] A. Tiskin, “Periodic String Comparison”, Proceedings of CPM, Lecture Notes in Computer Science, 5577, 2009, 193–206 | DOI | MR

[52] A. Tiskin, “Towards Approximate Matching in Compressed Strings: Local Subsequence Recognition”, Proceedings of CSR, Lecture Notes in Computer Science, 6651, 2011, 401–414 | DOI | MR

[53] A. Tiskin, “Fast distance multiplication of unit-monge matrices”, Algorithmica, 71 (2015), 859–888 | DOI | MR

[54] A. Tiskin, “Bounded-length Smith-Waterman alignment”, Proceedings of WABI, Leibniz International Proceedings in Informatics, 143, 2019, 16:1–16:12

[55] A. Tiskin, “Communication vs synchronisation in parallel string comparison”, Proceedings of SPAA, 2020, 479–489

[56] A. M. Vershik, S. V. Kerov, “Asymptotics of the Plancherel measure of the symmetric group and the limiting form of Young tableaux”, Dokl. Akad. Nauk, 233:6 (1977), 1024–1027 | MR

[57] R. A. Wagner, M. J. Fischer, “The string-to-string correction problem”, J. ACM, 21:1 (1974), 168–173 | DOI | MR