diff --git a/latex/popl2024/popl.pdf b/latex/popl2024/popl.pdf index 220fcfa4..3d895fbd 100644 Binary files a/latex/popl2024/popl.pdf and b/latex/popl2024/popl.pdf differ diff --git a/latex/popl2024/popl.tex b/latex/popl2024/popl.tex index 1efe2efa..646646b1 100644 --- a/latex/popl2024/popl.tex +++ b/latex/popl2024/popl.tex @@ -157,19 +157,19 @@ \section{Problem statement}\label{sec:problem} In either case, it follows $\Delta(\err{\sigma}, \ell) \leq \max(m, n)$ and $\ell$ is always reachable via a finite nonempty set of Levenshtein edits, i.e., $0 < \Delta(\err{\sigma}, \ell) < \infty$. \end{proof} -Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$. +Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneqq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneqq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$. \begin{lemma}\label{lemma:interleaving} - The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + 1} \choose c}(|\Sigma| + 1)^c\right)$. + The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + c} \choose c}(|\Sigma| + 1)^c\right)$. \end{lemma} \begin{proof} - We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overaproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + 1$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + 1$ positions to substitute $(|\Sigma_\varepsilon|)$ tokens, i.e., ${{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$. + We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overapproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneqq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + c$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneqq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution.
Thus, we have $cn + n + c$ positions at which to substitute one of $|\Sigma_\varepsilon|$ tokens, i.e., ${{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as it overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$. \end{proof} We note that the above bound applies to all strings and languages, and relates to the Hamming bound on $H_q(\err{\sigma}_\varepsilon)$, which only considers substitutions.~\footnote{This reflects our general approach, which builds a surjection from the interleaved Hamming ball onto the Levenshtein ball.} In practice, much tighter bounds may be obtained by considering the structure of $\ell$ and $\err{\sigma}$. For example, based on an empirical evaluation from a dataset of human errors and repairs in Python code snippets ($|\Sigma| = 50, |\err{\sigma}| < 40, \Delta(\err{\sigma}, \ell) \in [1, 3]$), we estimate the \textit{filtration rate}, i.e., the density of the admissible set relative to the Levenshtein ball, $D=|A|/|\Delta_q(\err{\sigma})|$ to have empirical mean $E_\sigma[D] \approx 2.6\times 10^{-4}$, and variance $\mathrm{Var}_\sigma[D] \approx 3.8\times10^{-7}$. -In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneq$ ``Bears like to eat plastic'' is not a valid sentence, e.g., $\tilde{\sigma}\coloneq$``Bears like to eat'' is $(\Delta^*=1)$, however there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings nearby $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible. +In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneqq$ ``Bears like to eat plastic'' is not a valid sentence, a nearby one, e.g., $\tilde{\sigma}\coloneqq$ ``Bears like to eat'', is $(\Delta^*=1)$; however, there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings near $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible. \section{Matrix Theory}\label{sec:matrix} @@ -388,7 +388,7 @@ \subsection{Tensor sparsification}\label{sec:sparsity} Regardless of $\sigma$, it is always possible to discharge nonterminals inhabiting $\Lambda_{i, j}$ bitvectors outside the $(n-j+i)$-step neighborhood of $S$, or the $(j-i)$-step neighborhood of any terminal due to unreachability. This constrains several of the least- and greatest- upper diagonals of $\mathcal{M}_\sigma$. Further improvements to sparsity can be achieved by considering the specific structure of $\sigma$ and $\mathcal{G}$.
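To make the first pruning step concrete, the $k$-step neighborhood of $S$ is a bounded fixpoint over the production rules. Below is a minimal standalone Kotlin sketch under an illustrative grammar encoding (a map from each nonterminal to its right-hand sides); the names and representation are assumptions for exposition, not the library's actual API:

fun reachable(prods: Map<String, List<List<String>>>, start: String, k: Int): Set<String> {
  // Symbols derivable from `start` in at most k rewrite steps; a nonterminal absent
  // from reachable(prods, "S", n - j + i) may be discharged from the bitvector at (i, j).
  var frontier = setOf(start)
  val seen = frontier.toMutableSet()
  repeat(k) {
    frontier = frontier
      .flatMap { prods[it].orEmpty().flatten() } // every symbol on any RHS of a frontier symbol
      .filter { seen.add(it) }                   // keep only newly discovered symbols
      .toSet()
  }
  return seen
}

The $(j-i)$-step neighborhood of a terminal is symmetric, running the same bounded search over the inverted production relation.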
\begin{definition}[Parikh image] - Let $\pi: \underline\Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh vector~\cite{parikh1966context}, which counts the number of times each terminal appears in a string. We define the Parikh image as the set of all terminals indexed a nonzero element of the Parikh vector, i.e., $\Pi(\sigma:\underline\Sigma^*) \coloneq \{\sigma': \Sigma \mid 0 < \pi(\sigma)[\sigma']\}$. %Ordered by inclusion, the set of all minimal Parikh images, $\Pi^*(\sigma)\coloneqq \min_{\bm\sigma : 2^V \mid \forall \sigma' \in \bm\sigma \sigma' \in \text{H}(\sigma) \text{ and } }$ forms a partial order. + Let $\pi: \underline\Sigma^*\rightarrow\mathbb{N}^{|\Sigma|}$ be the Parikh vector~\cite{parikh1966context}, which counts the number of times each terminal appears in a string. We define the Parikh image as the set of all terminals indexed by a nonzero element of the Parikh vector, i.e., $\Pi(\sigma:\underline\Sigma^*) \coloneqq \{\sigma': \Sigma \mid 0 < \pi(\sigma)[\sigma']\}$. %Ordered by inclusion, the set of all minimal Parikh images, $\Pi^*(\sigma)\coloneqq \min_{\bm\sigma : 2^V \mid \forall \sigma' \in \bm\sigma \sigma' \in \text{H}(\sigma) \text{ and } }$ forms a partial order. \end{definition} \begin{lemma} diff --git a/latex/tacas2023/nfa_cfg.tex b/latex/tacas2023/nfa_cfg.tex new file mode 100644 index 00000000..b2836e36 --- /dev/null +++ b/latex/tacas2023/nfa_cfg.tex @@ -0,0 +1,139 @@ +\begin{minipage}[c]{0.45\textwidth} + \centering + \underline{NFA}\vspace{10pt} + \resizebox{\textwidth}{!}{ + \begin{tikzpicture}[ +%->, % makes the edges directed + >=stealth', + node distance=2.5cm, % specifies the minimum distance between two nodes. Change if necessary. +% every state/.style={thick, fill=gray!10}, % sets the properties for each ’state’ node + initial text=$ $, % sets the text that appears on the start arrow + ] + \node[state, initial] (00) {$q_{0,0}$}; + \node[state, right of=00] (10) {$q_{1,0}$}; + \node[state, right of=10] (20) {$q_{2,0}$}; + \node[state, right of=20] (30) {$q_{3,0}$}; + \node[right of=30] (40) {$\vphantom{\vdots}\cdots$}; + \node[accepting, state, right of=40] (n0) {$q_{n,0}$}; + + \node[state, above of=00] (01) {$q_{0,1}$}; + \node[state, right of=01] (11) {$q_{1,1}$}; + \node[state, right of=11] (21) {$q_{2,1}$}; + \node[state, right of=21] (31) {$q_{3,1}$}; + \node[right of=31] (41) {$\vphantom{\vdots}\cdots$}; + \node[accepting, state, right of=41] (n1) {$q_{n,1}$}; + + \node[above of=01] (0j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$}; +\node[right of=0j] (1j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$}; +\node[right of=1j] (2j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$}; +\node[right of=2j] (3j) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$}; +\node[right of=3j] (4j) {$\iddots$}; +\node[accepting, right of=4j] (nj) {$\mathmakebox[\widthof{$\cdots$}]{\vdots}$}; + +\node[state, above of=0j] (0k) {$q_{0,k}$}; +\node[state, right of=0k] (1k) {$q_{1,k}$}; +\node[state, right of=1k] (2k) {$q_{2,k}$}; +\node[state, right of=2k] (3k) {$q_{3,k}$}; +\node[right of=3k] (4k) {$\vphantom{\vdots}\cdots$}; +\node[accepting, state, right of=4k] (nk) {$q_{n,k}$}; + +\draw [->] (00) edge[below] node{$\sigma_1$} (10); +\draw [->] (10) edge[below] node{$\sigma_2$} (20); +\draw [->] (20) edge[below] node{$\sigma_3$} (30); +\draw [->] (30) edge[below] node{$\sigma_4$} (40); +\draw [->] (40) edge[below] node{$\sigma_n$} (n0); + +\draw [->] (01) edge[below] node{$\sigma_1$} (11); +\draw [->] (11) edge[below] node{$\sigma_2$} (21); +\draw [->] (21) edge[below]
node{$\sigma_3$} (31); +\draw [->] (31) edge[below] node{$\sigma_4$} (41); +\draw [->] (41) edge[below] node{$\sigma_n$} (n1); + +\draw [->] (0j) edge[below] node{$\sigma_1$} (1j); +\draw [->] (1j) edge[below] node{$\sigma_2$} (2j); +\draw [->] (2j) edge[below] node{$\sigma_3$} (3j); +\draw [->] (3j) edge[below] node{$\sigma_4$} (4j); +\draw [->] (4j) edge[below] node{$\sigma_n$} (nj); + +\draw [->] (0k) edge[below] node{$\sigma_1$} (1k); +\draw [->] (1k) edge[below] node{$\sigma_2$} (2k); +\draw [->] (2k) edge[below] node{$\sigma_3$} (3k); +\draw [->] (3k) edge[below] node{$\sigma_4$} (4k); +\draw [->] (4k) edge[below] node{$\sigma_n$} (nk); + +\draw [->] (00) edge[left] node{$*$} (11); +\draw [->] (10) edge[left] node{$*$} (21); +\draw [->] (20) edge[left] node{$*$} (31); +\draw [->] (30) edge[left] node{$*$} (41); +\draw [->] (30) edge[bend right, below] node{$\sigma_5$} (41); +\draw [->] (40) edge[ right] node{$\sigma_n$} (n1); +\draw [->] (40) edge[bend right, left] node{$*$} (n1); + +\draw [->] (01) edge[left] node{$*$} (1j); +\draw [->] (11) edge[left] node{$*$} (2j); +\draw [->] (21) edge[left] node{$*$} (3j); +\draw [->] (31) edge[left] node{$*$} (4j); +\draw [->] (31) edge[bend right, below] node{$\sigma_5$} (4j); +\draw [->] (41) edge[ right] node{$\sigma_n$} (nj); +\draw [->] (41) edge[bend right, left] node{$*$} (nj); + +\draw [->] (0j) edge[left] node{$*$} (1k); +\draw [->] (1j) edge[left] node{$*$} (2k); +\draw [->] (2j) edge[left] node{$*$} (3k); +\draw [->] (3j) edge[left] node{$*$} (4k); +\draw [->] (3j) edge[bend right, below] node{$\sigma_5$} (4k); +\draw [->] (4j) edge[ right] node{$\sigma_n$} (nk); +\draw [->] (4j) edge[bend right, left] node{$*$} (nk); + +\draw [->] (00) edge[bend left, left] node{$*$} (01); +\draw [->] (10) edge[bend left, left] node{$*$} (11); +\draw [->] (20) edge[bend left, left] node{$*$} (21); +\draw [->] (30) edge[bend left, left] node{$*$} (31); +\draw [->] (40) edge[right] node{$*$} (41); +\draw [->] (n0) edge[bend right, right] node{$*$} (n1); + +\draw [->] (01) edge[bend left, left] node{$*$} (0j); +\draw [->] (11) edge[bend left, left] node{$*$} (1j); +\draw [->] (21) edge[bend left, left] node{$*$} (2j); +\draw [->] (31) edge[bend left, left] node{$*$} (3j); +\draw [->] (41) edge[right] node{$*$} (4j); +\draw [->] (n1) edge[bend right, right] node{$*$} (nj); + +\draw [->] (0j) edge[bend left, left] node{$*$} (0k); +\draw [->] (1j) edge[bend left, left] node{$*$} (1k); +\draw [->] (2j) edge[bend left, left] node{$*$} (2k); +\draw [->] (3j) edge[bend left, left] node{$*$} (3k); +\draw [->] (4j) edge[right] node{$*$} (4k); +\draw [->] (nj) edge[bend right, right] node{$*$} (nk); + +\draw [->] (00) edge[below] node{$\sigma_2$} (21); +\draw [->] (10) edge[below] node{$\sigma_3$} (31); +\draw [->] (20) edge[below] node{$\sigma_4$} (41); + +\draw [->] (01) edge[below] node{$\sigma_2$} (2j); +\draw [->] (11) edge[below] node{$\sigma_3$} (3j); +\draw [->] (21) edge[below] node{$\sigma_4$} (4j); + +\draw [->] (0j) edge[below] node{$\sigma_2$} (2k); +\draw [->] (1j) edge[below] node{$\sigma_3$} (3k); +\draw [->] (2j) edge[below] node{$\sigma_4$} (4k); + +%https://tex.stackexchange.com/a/20986/139648 +\draw [decorate,decoration={brace,amplitude=10pt,raise=10pt,mirror}] (00.south west) -- (n0.south east) node[midway,yshift=-3em]{\textbf{String length}}; +\draw [decorate,decoration={brace,amplitude=10pt,raise=20pt}] (00.south west) -- (0k.north west) node[midway,xshift=-40pt,rotate=90]{\textbf{Levenshtein edit distance}}; +\end{tikzpicture} +} 
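+% Reading the NFA: horizontal $\sigma_i$-edges match the next token of $\err{\sigma}$;
+% vertical $*$-edges consume an inserted token, diagonal $*$-edges substitute one, and
+% the $\sigma_{i+2}$-labeled edges skipping a column delete $\sigma_{i+1}$ before matching
+% $\sigma_{i+2}$. Ascending one row spends one unit of edit distance.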
+\end{minipage} +\hfill +\begin{minipage}[l]{5 cm} +\centering +\underline{CFG}\vspace{7pt} +\begin{align*} +S &\Rightarrow \{\cdot \in Q \mid \delta(\cdot, q_{n,0}) \leq k\}\\ +* &\Rightarrow \{\cdot \in \Sigma\}\\ +\big\{q_{i, j} &\Rightarrow \{q_{i, j-1}*\} \mid i, j \in [1, n]\times[1, k]\big\}\\ +\big\{q_{i, j} &\Rightarrow \{q_{i-1, j-1}*\}\mid i, j\in[1, n]\times [1, k]\big\}\\ +\big\{q_{i, j} &\Rightarrow \{q_{i-1, j} \sigma_i \}\mid i, j \in [1, n]\times[0, k]\big\} \\ +\big\{q_{i, j} &\Rightarrow \{q_{i-2, j-1} \sigma_i\} \mid i, j \in [2, n]\times[1, k] \big\}\\ +\end{align*} +\end{minipage} \ No newline at end of file diff --git a/latex/tacas2023/tacas.pdf b/latex/tacas2023/tacas.pdf index accc05a5..7f510a2b 100644 Binary files a/latex/tacas2023/tacas.pdf and b/latex/tacas2023/tacas.pdf differ diff --git a/latex/tacas2023/tacas.tex b/latex/tacas2023/tacas.tex index 0dcb93f1..76a07846 100644 --- a/latex/tacas2023/tacas.tex +++ b/latex/tacas2023/tacas.tex @@ -187,11 +187,11 @@ \subsection{Example}\label{sec:example} } \end{small} -The same procedure can be translated, without loss of generality, into the bit domain ($\mathbb{Z}_2^{|V|}$) using a lexicographic ordering, however these both are recognizers. That is to say, $[S\in V_{0, 3}]\Leftrightarrow [V_{0, 3, 3}=1] \Leftrightarrow [A(\sigma) \neq \varnothing]$. Since $V_{0, 3} = \{S\}$, we know there is at least one $\sigma' \in A(\sigma)$, but $M_\infty$ does not reveal its identity. +The same procedure can be translated, without loss of generality, into the bit domain ($\mathbb{Z}_2^{|V|}$) using a lexicographic ordering; however, both of these are recognizers. That is to say, $[S\in V_{0, 3}]\Leftrightarrow [V_{0, 3, 3}=\bs] \Leftrightarrow [A(\sigma) \neq \varnothing]$. Since $V_{0, 3} = \{S\}$, we know there is at least one $\sigma' \in A(\sigma)$, but $M_\infty$ does not reveal its identity. %$\{\text{xor}, \land, \top\}$ is a functionally complete set is equivalent to $\mathbb{Z}_2$ $\top := 1, \land := \times, \text{xor} := +$. We can define $=$ as $(a = b) \Leftrightarrow (a \text{ xor } b) \text{ xor } \top \Leftrightarrow (a + b) + \top$. -In order to extract the inhabitants, we can translate the bitwise procedure into an equation with free variables. Here, we can encode the idempotency constraint directly as $M = M^2$. We first define $X \boxtimes Z \coloneqq [X_2 \land Z_1, \bot, \bot, X_1 \land Z_0]$ and $X \boxplus Z \coloneqq [X_i \lor Z_i]_{i \in [0, |V|)}$. Since the unit nonterminals $O, N$ can only occur on the superdiagonal, they may be safely ignored by $\otimes$. To solve for $M_\infty$, we proceed by first computing $V_{0, 2}, V_{0, 3}$ as follows: +In order to extract the inhabitants, we can translate the bitwise procedure into an equation with free variables. Here, we can encode the idempotency constraint directly as $M = M^2$. We first define $X \boxtimes Z \coloneqq [X_2 \land Z_1, \bot, \bot, X_1 \land Z_0]$ and $X \boxplus Z \coloneqq [X_i \lor Z_i]_{i \in [0, |V|)}$. Since the unit nonterminals $O, N$ can only occur on the superdiagonal, they may be safely ignored by $\otimes$.
To solve for $M_\infty$, we proceed by first computing $V_{0, 2}, V_{1, 3}$ as follows: \begin{align} V_{0, 2} &= V_{0, j} \cdot V_{j, 2} = V_{0, 1} \boxtimes V_{1, 2}\\ @@ -226,9 +226,129 @@ \subsection{Example}\label{sec:example} Now we know that $\sigma =$ 1 \underline{O} \underline{N} is a valid solution, and therefore we can take the product $\{1\}\times \sigma_r^{-1}(O) \times \sigma_r^{-1}(N)$ to recover the admissible set, yielding $A(\sigma)=\{1+0, 1+1, 1\times 0, 1\times 1\}$. In this case, since $G$ is unambiguous, there is only one parse tree satisfying $V_{0, |\sigma|, |\sigma|}$, but in general, there can be multiple valid parse trees, in which case we can decode them incrementally. +The question naturally arises: where should one put the holes? One solution is to interleave $\varepsilon^c$ around every token as $\err{\sigma}_\varepsilon\coloneqq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, augment the grammar to admit $\varepsilon^+$, then sample hole configurations without replacement from all possible locations. Below, we sketch the enumeration in code, then illustrate the procedure on a single Python snippet. + +%A well-known result in FL theory is that the class of context-free languages are closed under intersection with regular languages, i.e., +% +%\begin{align} +% \ell_1:\textsc{Reg}, \ell_2: \textsc{Cfl} \vdash \text{ there exists } G \text{ s.t. } L(G): \textsc{Cfl} \text{ and } L(G) = \ell_1\cap\ell_2 +%\end{align} +% +%To compute the intersection between a CFG and a regular grammar, there is a standard construction from Beigel and Gasarch~\cite{beigelproof}, which allows us to compute the intersection grammar. We can then use the same procedure as before to compute the fixpoint, $M_\infty$, and extract the inhabitants. +% +%Alternatively, the procedure we use is to generate the contents of the grammar incrementally by sampling holes without replacement. We illustrate the procedure below, then discuss the theory.
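+Before the worked example, here is a minimal sketch of the enumeration itself as standalone Kotlin. The ranking scheme is the combinatorial number system; the hole-punching and solver steps are left abstract, and none of these names are the repo's actual API:
+
+import java.math.BigInteger
+
+fun binom(n: Int, k: Int): BigInteger {
+  if (k < 0 || k > n) return BigInteger.ZERO
+  var r = BigInteger.ONE
+  // Multiplicative formula; every intermediate value is itself an exact binomial
+  for (i in 1..k) r = r * BigInteger.valueOf((n - k + i).toLong()) / BigInteger.valueOf(i.toLong())
+  return r
+}
+
+// Decodes a rank into the rank-th d-element subset of {0..n-1} (colex order), so hole
+// configurations can be visited by index, without replacement and without storing draws.
+fun decodeCombination(rank: BigInteger, n: Int, d: Int): List<Int> {
+  var r = rank
+  var c = n
+  val holes = mutableListOf<Int>()
+  for (k in d downTo 1) {
+    do { c-- } while (binom(c, k) > r) // largest c with C(c, k) <= r
+    holes += c
+    r -= binom(c, k)
+  }
+  return holes.reversed()
+}
+
+fun main() {
+  val n = 15; val d = 2 // 15 interleaved slots, 2 holes
+  var rank = BigInteger.ZERO
+  while (rank < binom(n, d)) {
+    val holes = decodeCombination(rank, n, d) // punch holes at these indices,
+    println(holes)                            // then query the solver for admissible fillings
+    rank += BigInteger.ONE
+  }
+}
+
+In the full pipeline, ranks are visited in the shuffled, full-period order of \S\ref{sec:dsi} rather than sequentially.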
+ +\begin{enumerate}[leftmargin=.23\linewidth] + \item \texttt{d = sum([foo(i\err{]} for i in vals))} + \item \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} + \hline + \texttt{d} & \texttt{=} & \texttt{sum} & \texttt{(} & \texttt{[} & \texttt{foo} & \texttt{(} & \texttt{i} & \texttt{]} & \texttt{for} & \texttt{i} & \texttt{in} & \texttt{vals} & \texttt{)} & \texttt{)} \\\hline + \end{tabular} + \item \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} + \hline + \texttt{w} & \texttt{=} & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline + \end{tabular} + \item \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||} + \hline + \texttt{w} & \texttt{=} & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline + \end{tabular} + \item \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||} + \hline + \cellcolor{black!15}\texttt{\_} & \texttt{=} & \cellcolor{black!15}\texttt{\_} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \cellcolor{black!15}\texttt{\_} \\\hline + \end{tabular}\\ + \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||} + \hline + \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{w} & \texttt{(} & \texttt{[} & \cellcolor{black!15}\texttt{\_} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline +% \cellcolor{black!15}\texttt{\_} & \texttt{=} & \cellcolor{black!15}\texttt{\_} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \cellcolor{black!15}\texttt{\_} \\\hline +% & & & & & & & & & & & & & & \\\hline + \end{tabular}\\$\cdots$ + \item \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c||c|c|||c|||c|||c|||c|||c|||c|||c|||} + \hline +% \texttt{w} & \texttt{=} & & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline + \texttt{w} & \texttt{=} & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \cellcolor{black!15}\texttt{\_} & \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{)} \\\hline + & & & & & & & \cellcolor{green!25}\texttt{+} & & \cellcolor{orange!25}\texttt{)} & & & & & \cellcolor{orange!25}\texttt{]} & \\\hline + \end{tabular} + \item \texttt{d = sum([foo(\hlgreen{+}i\hlorange{)} for i in vals\hlorange{]})} + \item \texttt{d = sum([foo(i\hlorange{)} for i in vals\hlorange{]})} +\end{enumerate} + +The initial broken string, \texttt{d = sum([foo(i\err{]} for i in vals))} (1), is first tokenized using a lexer to obtain the sequence in (2). Lexical tokens containing identifiers are abstracted in step (3), and interleaved with the empty token in step (4). We then sample hole configurations without replacement in step (5), many of which will have no admissible solutions. Eventually, the solver will discover an admissible solution, as seen in step (6). 
This solution is then used to generate a patch, which is applied to the original string in step (7), then reduced to its minimal form in step (8), and sampling is repeated until all possibilities are exhausted or a predetermined timeout expires. + +To admit variable-length edits and enable deletion, we first define an $\varepsilon^+$-production and introduce it to the right and left of each terminal occurring in a unit production of our grammar, $\mathcal{G}$:\vspace{5pt} + +\begin{prooftree} + \AxiomC{$\mathcal{G} \vdash \varepsilon \in \Sigma$} + \RightLabel{$\varepsilon\textsc{-dup}$} + \UnaryInfC{$\mathcal{G} \vdash (\varepsilon^+ \rightarrow \varepsilon \mid \varepsilon\:\varepsilon^+) \in P$} + \DisplayProof + \hskip 1.5em + \AxiomC{$\mathcal{G} \vdash (A \rightarrow B) \in P$} + \RightLabel{$\varepsilon^+\textsc{-int}$} + \UnaryInfC{$\mathcal{G} \vdash (A \rightarrow B\:\varepsilon^+ \mid \varepsilon^+\:B \mid B) \in P$} +\end{prooftree} + +Finally, to sample $\sigma\sim\Delta_{q}(\err{\sigma})$, we first interleave $\err\sigma$ as $\err\sigma_\varepsilon$ (see Lemma~\ref{lemma:interleaving}), then enumerate hole templates $\text{H}(\err\sigma_\varepsilon, i) = \sigma_{1\ldots i-1}\:\text{\_ \_}\:\sigma_{i+1\ldots n}$ for each $i \in I \in \stirlingii{n}{d}$ and $d \in 1\ldots q$, then solve for $\tilde{\sigma} \in \text{H}(\err\sigma_\varepsilon, i)$ satisfying $[S \in \Lambda^*_{\tilde\sigma, \mathcal{G}}] \Leftrightarrow [\tilde\sigma \in \mathcal{L}(\mathcal{G})]$. If $\bm\sigma \coloneqq \text{H}(\err\sigma_\varepsilon, i)$ is nonempty, then each edit from each patch in each $\tilde{\sigma} \in \bm\sigma$ will match one of the following patterns, covering all three Levenshtein edits:\vspace{-10pt} + +\begin{align*} + \text{Deletion}&=\begin{cases} + \,\ldots\sigma_{i-1}\:\text{\hlred{$\gamma_1$}\:\hlred{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_{1, 2} = \varepsilon\label{eq:del} + \end{cases}\\ + \text{Substitution}&=\begin{cases} + \ldots\sigma_{i-1}\:\text{\hlorange{$\gamma_1$}\:\hlred{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 \neq \varepsilon \land \gamma_2 = \varepsilon\\ + \ldots\sigma_{i-1}\:\text{\hlred{$\gamma_1$}\:\hlorange{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 = \varepsilon \land \gamma_2 \neq \varepsilon\\ + \ldots\sigma_{i-1}\:\text{\hlorange{$\gamma_1$}\:\hlorange{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\{\gamma_1, \gamma_2\}\cap\{\varepsilon, \sigma_i\} = \varnothing + \end{cases}\\ + \text{Insertion}&=\begin{cases} + \ldots\sigma_{i-1}\:\text{\hlgreen{$\gamma_1$}\:\hlorange{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 = \sigma_i \land \gamma_2 \notin \{\varepsilon, \sigma_i\}\label{eq:ins2}\\ + \ldots\sigma_{i-1}\:\text{\hlorange{$\gamma_1$}\:\hlgreen{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 \notin \{\varepsilon, \sigma_i\} \land \gamma_2 = \sigma_i\label{eq:ins1}\\ + \ldots\sigma_{i-1}\:\text{\hlgreen{$\gamma_1$}\:\hlgreen{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_{1,2} = \sigma_i\label{eq:copy} + \end{cases} +\end{align*} + +\subsection{Bar-Hillel Construction} + +Manually generating the edits as above is the more controllable way to synthesize them, but can be unnecessarily expensive if the goal is to synthesize all edits within a fixed edit distance. The second approach, via the Bar-Hillel construction, is more efficient, but requires generating a large grammar.
We now describe the construction, which allows us to generate a grammar from a finite automaton, and then use that grammar to generate the edits without enumerating holes. + +\begin{definition} +A finite state automaton is a tuple $\mathcal{A} = \langle Q, \Sigma, \delta, I, F\rangle$, where $Q$ is a finite set of states, $\Sigma$ is a finite alphabet, $\delta \subseteq Q \times \Sigma \times Q$ is the transition relation, and $I, F \subseteq Q$ are the sets of initial and final states, respectively. +\end{definition} + +\begin{lemma}\label{lemma:bar-hillel} +For any context-free language $\ell$ and finite state automaton $\alpha$, there exists a context-free grammar $G^\cap$ such that $\mathcal{L}(G^\cap) = \ell \cap \mathcal{L}(\alpha)$. See Bar-Hillel~\cite{bar1961formal}. +\end{lemma} + +\noindent Beigel and Gasarch~\cite{beigelproof} provide one explicit way to construct $G^\cap$: + +\begin{prooftree} + \AxiomC{$i \in I, f \in F\vphantom{\overset{a}{\rightarrow}}$} +% \RightLabel{$\varepsilon^+\textsc{-int}$} + \UnaryInfC{$\big(S\rightarrow (i,S,f)\big) \in P^\cap$} + \DisplayProof + \hskip 1.5em + \AxiomC{$(q\overset{a}{\rightarrow}r) \in \delta$} +% \RightLabel{$\varepsilon^+\textsc{-int}$} + \UnaryInfC{$\big((q,a,r)\rightarrow a\big)\in P^\cap$} +\end{prooftree} +\begin{prooftree} + \AxiomC{$(w \rightarrow xz) \in P$} + \AxiomC{$p, q, r \in Q$} + \BinaryInfC{$\big((p,w,r)\rightarrow (p,x,q)(q,z,r)\big) \in P^\cap$} +\end{prooftree} + +Conveniently, the Levenshtein ball is recognized by a nondeterministic finite automaton (NFA). From Lemma~\ref{lemma:bar-hillel}, we know the intersection of any context-free language and NFA is context-free, and therefore we can construct a single context-free grammar $G^\cap$ that recognizes and generates the admissible set. + +\begin{figure}[H] + \begin{center} + \resizebox{.9\textwidth}{!}{\input{nfa_cfg.tex}} + \end{center} + \caption{Levenshtein reachability from $\Sigma^n$ can be described as an NFA, or a left-linear CFG.} +\end{figure} + +\noindent These rewrite rules can generate a large grammar. We can approximate the cardinality of the intersected grammar as $|P^\cap|\approx|I||F| + |\delta| + |P||Q|^3$. Although the grammar can be large, it significantly constrains the number of edits that must be checked compared with the enumerative approach. To generate edits from it, we can use the same procedure as before, but instead of interleaving $\err\sigma$ with $\varepsilon$ and introducing holes, we simply use $A\big((\_)^{|\err{\sigma}| + d}, G^\cap\big)$. + \subsection{Semiring Algebras} -There are a number of alternate semirings which solve the same problem. A first approach requires solving for $A(\sigma)$ using a semiring algebra, and propagating the values from the bottom-up as a string to a list of strings. Letting $D = V \rightarrow \mathcal{P}(\Sigma^*)$, we define $\oplus, \otimes: D \times D \rightarrow D$. Initially, we have $p(s: \Sigma) \coloneqq \{v \mid [v \rightarrow s]\in P\}$ and $p(\_) := \bigcup_{s\in \Sigma} p(s)$, then we compute the fixpoint using the algebra: +There are a number of alternate semirings which can be used to solve for $A(\sigma)$. A first approach propagates values from the bottom up, mapping nonterminals to sets of strings. Letting $D = V \rightarrow \mathcal{P}(\Sigma^*)$, we define $\oplus, \otimes: D \times D \rightarrow D$.
Initially, we have $p(s: \Sigma) \coloneqq \{v \mid [v \rightarrow s]\in P\}$ and $p(\_) := \bigcup_{s\in \Sigma} p(s)$, then we compute the fixpoint using the algebra: \begin{equation} X \oplus Z := \{v \rightarrow \big(X(v) \cup Z(v)\big) \mid v \in V\} \end{equation} @@ -261,10 +381,10 @@ \subsection{Semiring Algebras} \subsection{Complexity} -Let us consider some loose bounds on the the complexity of BCFL. To do, we first consider the complexity of computing language-edit distance, which is a lower-bound on the complexity of BCFL. +Let us consider some loose bounds on the complexity of BCFLR. To do so, we first consider the complexity of computing language edit distance, which is a lower bound on BCFLR complexity. \begin{definition} - Language edit distance (LED) is the problem of computing the minimum number of edits required to transform an invalid string into a valid one, where validity is defined as containment in a context-free language, $\ell: \mathcal{L}$, i.e., $\Delta^*(\err{\sigma}, \ell) \coloneqq \min_{\sigma \in \ell}\Delta(\err{\sigma}, \sigma)$, and $\Delta$ is the Levenshtein distance. LED is known to have a subcubic complexity.~\cite{bringmann2019truly} + Language edit distance (LED) is the problem of computing the minimum number of edits required to transform an invalid string into a valid one, where validity is defined as containment in a context-free language, $\ell: \mathcal{L}$, i.e., $\Delta^*(\err{\sigma}, \ell) \coloneqq \min_{\sigma \in \ell}\Delta(\err{\sigma}, \sigma)$, and $\Delta$ is the Levenshtein distance. LED is known to have subcubic complexity~\cite{bringmann2019truly}. \end{definition} We seek to find the set of strings $S$ such that $\forall \tilde{\sigma}\in S, \Delta(\err{\sigma}, \tilde{\sigma}) \leq q$, where $q$ is the maximum number of edits greater than or equal to the language edit distance. We call this set the \textit{Levenshtein ball} of $\err{\sigma}$ and denote it $\Delta_q(\err{\sigma})$. Since $1 \leq \Delta^*(\err{\sigma}, \ell) \leq q$, we have $1 \leq q$. We now consider an upper bound on $\Delta^*(\err{\sigma}, \ell)$, i.e., the greatest lower bound on $q$ such that $\Delta_q(\err{\sigma}) \cap \ell \neq \varnothing$. @@ -287,67 +407,17 @@ \subsection{Complexity} Let us now consider the maximum growth rate of the \textit{admissible set}, $A \coloneqq \Delta_q(\err{\sigma}) \cap \ell$, as a function of $q$ and $n$. Let $\bar\ell \coloneqq \{\err{\sigma}\}$. Since $\bar\ell$ is finite and thus regular, $\ell = \Sigma^* \setminus \{\err{\sigma}\}$ is regular by the closure of regular languages under complementation, and thus context-free a fortiori. Since $\ell$ accepts every string except $\err{\sigma}$, it represents the worst CFL in terms of asymptotic growth of $A$. \begin{lemma}\label{lemma:interleaving} -BCFL complexity is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + 1} \choose c}(|\Sigma| + 1)^c\right)$. +The complexity of searching $A$ is upper bounded by $\mathcal{O}\left(\sum_{c=1}^q{{cn + n + c} \choose c}(|\Sigma| + 1)^c\right)$. \end{lemma} \begin{proof} - We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overaproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneqq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + 1$, and only considering substitution.
We augment $\Sigma_\varepsilon \coloneqq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + 1$ positions to substitute $(|\Sigma_\varepsilon|)$ tokens, i.e., ${{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + 1} \choose c}|\Sigma_\varepsilon|^c$. + We can overestimate the size of $A$ by considering the number of unique ways to insert, delete, or substitute $c$ terminals into a string $\err{\sigma}$ of length $n$. This can be overapproximated by interleaving $\varepsilon^c$ around every token, i.e., $\err{\sigma}_\varepsilon\coloneqq \left(\varepsilon^c\err{\sigma}_i\right)_{i=1}^n\varepsilon^c$, where $|\err{\sigma}_\varepsilon| = cn + n + c$, and only considering substitution. We augment $\Sigma_\varepsilon \coloneqq \Sigma \cup \{\varepsilon\}$ so that deletions and insertions may be treated as special cases of substitution. Thus, we have $cn + n + c$ positions at which to substitute one of $|\Sigma_\varepsilon|$ tokens, i.e., ${{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$ ways to edit $\err{\sigma}_\varepsilon$ for each $c \in [1, q]$. This upper bound is not tight, as it overcounts many identical edits w.r.t. $\err{\sigma}$. Nonetheless, it is sufficient to show $|A| < \sum_{c=1}^q{{cn + n + c} \choose c}|\Sigma_\varepsilon|^c$. \end{proof} We note that the above bound applies to all strings and languages, and relates to the Hamming bound on $H_q(\err{\sigma}_\varepsilon)$, which only considers substitutions.~\footnote{This reflects our general approach, which builds a surjection from the interleaved Hamming ball onto the Levenshtein ball.} In practice, much tighter bounds may be obtained by considering the structure of $\ell$ and $\err{\sigma}$. For example, based on an empirical evaluation from a dataset of human errors and repairs in Python code snippets ($|\Sigma| = 50, |\err{\sigma}| < 40, \Delta(\err{\sigma}, \ell) \in [1, 3]$), we estimate the \textit{filtration rate}, i.e., the density of the admissible set relative to the Levenshtein ball, $D=|A|/|\Delta_q(\err{\sigma})|$ to have empirical mean $E_\sigma[D] \approx 2.6\times 10^{-4}$, and variance $\mathrm{Var}_\sigma[D] \approx 3.8\times10^{-7}$. In practice, this problem is ill-posed even when $q = \Delta^*(\err{\sigma}, \ell) \approx 1$. For example, consider the language of ursine dietary preferences. Although $\err{\sigma}\coloneqq$ ``Bears like to eat plastic'' is not a valid sentence, a nearby one, e.g., $\tilde{\sigma}\coloneqq$ ``Bears like to eat'', is $(\Delta^*=1)$; however, there are many others with roughly the same edit distance, e.g., ``Bears like to eat \{\hlorange{berries}, \hlorange{honey}, \hlorange{fish}\}'', or ``\{\hlgreen{Polar}, \hlgreen{Panda}\} bears like to eat \{\hlgreen{seals}, \hlgreen{bamboo}\}''. In general, there are usually many strings near $\err{\sigma}$, and we seek to find those among them which are both syntactically valid and semantically plausible as quickly as possible. -\subsection{Our solution to bounded CFL reachability} - -Now, let us return to the problem of bounded CFL reachability. The question naturally arises, of should one put the holes? One solution is to sample holes without replacement from all possible locations. Below is an example illustrating this procedure on a single Python snippet.
- -%A well-known result in FL theory is that the class of context-free languages are closed under intersection with regular languages, i.e., -% -%\begin{align} -% \ell_1:\textsc{Reg}, \ell_2: \textsc{Cfl} \vdash \text{ there exists } G \text{ s.t. } L(G): \textsc{Cfl} \text{ and } L(G) = \ell_1\cap\ell_2 -%\end{align} -% -%To compute the intersection between a CFG and a regular grammar, there is a standard construction from Beigel and Gasarch~\cite{beigelproof}, which allows us to compute the intersection grammar. We can then use the same procedure as before to compute the fixpoint, $M_\infty$, and extract the inhabitants. -% -%Alternatively, the procedure we use is to generate the contents of the grammar incrementally by sampling holes without replacement. We illustrate the procedure below, then discuss the theory. - -\begin{enumerate}[leftmargin=.23\linewidth] - \item \texttt{d = sum([foo(i\err{]} for i in vals))} - \item \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} - \hline - \texttt{d} & \texttt{=} & \texttt{sum} & \texttt{(} & \texttt{[} & \texttt{foo} & \texttt{(} & \texttt{i} & \texttt{]} & \texttt{for} & \texttt{i} & \texttt{in} & \texttt{vals} & \texttt{)} & \texttt{)} \\\hline - \end{tabular} - \item \begin{tabular}{|c|c|c|c|c|c|c|c|c|c|c|c|c|c|c|} - \hline - \texttt{w} & \texttt{=} & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline - \end{tabular} - \item \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||} - \hline - \texttt{w} & \texttt{=} & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline - \end{tabular} - \item \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||} - \hline - \cellcolor{black!15}\texttt{\_} & \texttt{=} & \cellcolor{black!15}\texttt{\_} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \cellcolor{black!15}\texttt{\_} \\\hline - \end{tabular}\\ - \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||c|||} - \hline - \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{w} & \texttt{(} & \texttt{[} & \cellcolor{black!15}\texttt{\_} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline -% \cellcolor{black!15}\texttt{\_} & \texttt{=} & \cellcolor{black!15}\texttt{\_} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \cellcolor{black!15}\texttt{\_} \\\hline -% & & & & & & & & & & & & & & \\\hline - \end{tabular}\\$\cdots$ - \item \begin{tabular}{|||c|||c|||c|||c|||c|||c|||c||c|c|||c|||c|||c|||c|||c|||c|||c|||} - \hline -% \texttt{w} & \texttt{=} & & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \texttt{w} & \texttt{]} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \texttt{)} & \texttt{)} \\\hline - \texttt{w} & \texttt{=} & \texttt{w} & \texttt{(} & \texttt{[} & \texttt{w} & \texttt{(} & \cellcolor{black!15}\texttt{\_} & \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{for} & \texttt{w} & \texttt{in} & \texttt{w} & \cellcolor{black!15}\texttt{\_} & \texttt{)} \\\hline - & & & & & & & 
\cellcolor{green!25}\texttt{+} & & \cellcolor{orange!25}\texttt{)} & & & & & \cellcolor{orange!25}\texttt{]} & \\\hline - \end{tabular} - \item \texttt{d = sum([foo(\hlgreen{+}i\hlorange{)} for i in vals\hlorange{]})} - \item \texttt{d = sum([foo(i\hlorange{)} for i in vals\hlorange{]})} -\end{enumerate} - -The initial broken string, \texttt{d = sum([foo(i\err{]} for i in vals))} (1), is first tokenized using a lexer to obtain the sequence in (2). Lexical tokens containing identifiers are abstracted in step (3), and interleaved with the empty token in step (4). We then sample hole configurations without replacement in step (5), many of which will have no admissible solutions. Eventually, the solver will discover an admissible solution, as seen in step (6). This solution is then used to generate a patch, which is applied to the original string in step (7), then reduced to its minimal form in step (8), and sampling is repeated until all possibilities are exhausted or a predetermined timeout expires. - \subsection{Sampling the Levenshtein ball without replacement in $\mathcal{O}(1)$}\label{sec:dsi} Now that we have a reliable method to synthesize admissible completions for strings containing holes, i.e., fix \textit{localized} errors, $F: (\mathcal{G} \times \underline\Sigma^n) \rightarrow \{\Sigma^n\}\subseteq \mathcal{L}(\mathcal{G})$, how can we use $F$ to repair some unparseable string, i.e., $\err{\sigma_1\ldots\:\sigma_n}: \Sigma^n \cap\mathcal{L}(\mathcal{G})^\complement$ where the holes' locations are unknown? Three questions stand out in particular: how many holes are needed to repair the string, where should we put those holes, and how ought we fill them to obtain a parseable $\tilde{\sigma} \in \mathcal{L}(\mathcal{G})$? @@ -356,7 +426,7 @@ \subsection{Sampling the Levenshtein ball without replacement in $\mathcal{O}(1) To implement this strategy, we first construct a surjection $\varphi^{-1}: \mathbb{Z}_2^m\twoheadrightarrow\Delta_{q}(\err{\sigma})$ from bitvectors to Levenshtein edits over $\err\sigma, \Sigma$, sample bitvectors without replacement using a characteristic polynomial, then decode the resulting bitvectors into Levenshtein edits. This ensures the sampler eventually visits every Levenshtein edit at least exactly once and at most approximately once, without needing to store any samples, and discovers a steady stream of admissible edits throughout the solving process, independent of the grammar or string under repair. -More specifically, we employ a pair of [un]tupling functions $\kappa, \rho: \mathbb{N}^k \leftrightarrow \mathbb{N}$ which are (1) bijective (2) maximally compact (3) computationally tractable (i.e., closed form inverses). $\kappa$ will be used to index $\stirlingii{n}{k}$\footnote[2]{\text{Following Stirling, we use $\stirlingii{n}{d}$ to denote the set of all $d$-element subsets of $\{1,\ldots, n\}$.}}-combinations via the Maculay representation~\cite{knuth2005taocp} and $\rho$ will index $\Sigma^k$ tuples, but is slightly more tricky to define. To maximize compactness, there is an elegant pairing function courtesy of Szudzik~\cite{szudzik2006elegant}, which enumerates concentric square shells over the plane $\mathbb{N}^2$ and can be generalized to hypercubic shells in $\mathbb{N}^k$. +More specifically, we employ a pair of [un]tupling functions $\kappa, \rho: \mathbb{N}^k \leftrightarrow \mathbb{N}$ which are (1) bijective (2) maximally compact (3) computationally tractable (i.e., closed form inverses). 
$\kappa$ will be used to index $\stirlingii{n}{k}$\footnote[2]{\text{Following Stirling, we use $\stirlingii{n}{d}$ to denote the set of all $d$-element subsets of $\{1,\ldots, n\}$.}}-combinations and $\rho$ will index $\Sigma^k$ tuples, but is slightly more tricky to define. To maximize compactness, there is an elegant pairing function courtesy of Szudzik~\cite{szudzik2006elegant}, which enumerates concentric square shells over the plane $\mathbb{N}^2$ and can be generalized to hypercubic shells in $\mathbb{N}^k$. Although $\langle\kappa, \rho\rangle$ could be used directly to exhaustively search the Levenshtein ball, they are temporally biased samplers due to lexicographic ordering. Rather, we would prefer a path that uniformly visits every fertile subspace of the Levenshtein ball over time regardless of the grammar or string in question: subsequences of $\langle\kappa, \rho\rangle$ should discover valid repairs with frequency roughly proportional to the filtration rate, i.e., the density of the admissible set relative to the Levenshtein ball. These additional constraints give rise to two more criteria: (4) ergodicity and (5) periodicity. @@ -411,37 +481,6 @@ \subsection{Sampling the Levenshtein ball without replacement in $\mathcal{O}(1) In addition to its statistically desirable properties, our sampler has the practical benefit of being trivially parallelizable using the leapfrog method, i.e., given $p$ independent processors, each one can independently check $[\varphi^{-1}(\langle\kappa, \rho\rangle^{-1}(\mathbf{R}_{i}), \err{\sigma}) \in \mathcal{L}(\mathcal{G})]$ where $i \equiv p_j \pmod p$. This procedure linearly scales with the number of processors, exhaustively searching $\Delta_{q}(\err{\sigma})$ in $p^{-1}$ of the time required by a single processor, or alternately drawing $p$ times as many samples in the same time. -To admit variable-length edits and enable deletion, we first define a $\varepsilon^+$-production and introduce it to the right- and left-hand side of each terminal unit production contained in the grammar, $\mathcal{G}$, as follows:\vspace{5pt} - -\begin{prooftree} - \AxiomC{$\mathcal{G} \vdash \varepsilon \in \Sigma$} - \RightLabel{$\varepsilon\textsc{-dup}$} - \UnaryInfC{$\mathcal{G} \vdash (\varepsilon^+ \rightarrow \varepsilon \mid \varepsilon\:\varepsilon^+) \in P$} - \DisplayProof - \hskip 1.5em - \AxiomC{$\mathcal{G} \vdash (A \rightarrow B) \in P$} - \RightLabel{$\varepsilon^+\textsc{-int}$} - \UnaryInfC{$\mathcal{G} \vdash (A \rightarrow B\:\varepsilon^+ \mid \varepsilon^+\:B \mid B) \in P$} -\end{prooftree} - -Finally, to sample $\sigma\sim\Delta_{q}(\err{\sigma})$, we first interleave $\err\sigma$ as $\err\sigma_\varepsilon$, then enumerate hole templates $\text{T}(\err\sigma_\varepsilon, i) = \sigma_{1\ldots i-1}\:\text{\_ \_}\:\sigma_{i+1\ldots n}$ for each $i \in \cdot \in \stirlingii{n}{d}$ and $d \in 1\ldots q$, then solve for $A(\err\sigma_\varepsilon)$. 
If $\bm\sigma \coloneqq \text{A}(\err\sigma_\varepsilon)$ is nonempty, then each edit from each patch in each $\tilde{\sigma} \in \bm\sigma$ will match one of the following patterns, covering all three Levenshtein edits:\vspace{-10pt} - -\begin{align*} - \text{Deletion}&=\begin{cases} - \,\ldots\sigma_{i-1}\:\text{\hlred{$\gamma_1$}\:\hlred{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_{1, 2} = \varepsilon\label{eq:del} - \end{cases}\\ - \text{Substitution}&=\begin{cases} - \ldots\sigma_{i-1}\:\text{\hlorange{$\gamma_1$}\:\hlred{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 \neq \varepsilon \land \gamma_2 = \varepsilon\\ - \ldots\sigma_{i-1}\:\text{\hlred{$\gamma_1$}\:\hlorange{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 = \varepsilon \land \gamma_2 \neq \varepsilon\\ - \ldots\sigma_{i-1}\:\text{\hlorange{$\gamma_1$}\:\hlorange{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\{\gamma_1, \gamma_2\}\cap\{\varepsilon, \sigma_i\} = \varnothing - \end{cases}\\ - \text{Insertion}&=\begin{cases} - \ldots\sigma_{i-1}\:\text{\hlgreen{$\gamma_1$}\:\hlorange{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 = \sigma_i \land \gamma_2 \notin \{\varepsilon, \sigma_i\}\label{eq:ins2}\\ - \ldots\sigma_{i-1}\:\text{\hlorange{$\gamma_1$}\:\hlgreen{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_1 \notin \{\varepsilon, \sigma_i\} \land \gamma_2 = \sigma_i\label{eq:ins1}\\ - \ldots\sigma_{i-1}\:\text{\hlgreen{$\gamma_1$}\:\hlgreen{$\gamma_2$}}\:\sigma_{i+1}\ldots\hspace{0.2cm}\gamma_{1,2} = \sigma_i\label{eq:copy} - \end{cases} -\end{align*} - \noindent Although complete with respect to $\Delta_{q}(\err{\sigma})$, this approach can produce patches containing more Levenshtein edits than are strictly necessary to repair $\err\sigma$. To ensure patches are both minimal and syntactically valid, we first introduce a simple technique to minimize the repairs in \S\ref{sec:minimization}. By itself, uniformly sampling minimal repairs $\tilde\sigma\sim\Delta_{q}(\err{\sigma})\cap\mathcal{L}(\mathcal{G})$ is sufficient but can be quite time-consuming. To further reduce sample complexity and enable real-time repairs, we will then introduce a more efficient density estimator based on adaptive resampling (\S\ref{sec:adaptive}). \subsection{Patch minimization}\label{sec:minimization} @@ -517,7 +556,7 @@ \subsection{Probabilistic reachability}\label{sec:adaptive} % \delta_{\textsc{KR}}(\mu, \nu) \coloneqq \inf_{\pi\in \Pi(\mu, \nu)}\int_{\Omega\times \Omega} \delta(x, y)d\pi(x, y) %\end{align} -More specifically in our setting, we want to sample from a discrete product space that factorizes into (1) the specific edit locations (e.g., informed by caret position, historical edit locations, or a static analyzer), (2) probable completions (e.g., from a Markov chain or neural language model) and (3) an accompanying \textit{cost model}, $C: (\Sigma^* \times \Sigma^*) \rightarrow \mathbb{R}$, which may be any number of suitable distance metrics, such as language edit distance, finger travel distance on a physical keyboard, weighted Levenshtein distance, or stochastic contextual edit distance~\cite{cotterell+al.acl14} in the case of probabilistic edits. Our goal then, is to discover repairs which minimize $C(\err{\sigma}, \tilde{\sigma})$, subject to the given grammar and latency constraints. 
+More specifically, we want to sample from a discrete product space that factorizes into (1) the edit locations (e.g., informed by caret position, historical edit locations, etc.), (2) probable completions (e.g., from a Markov chain or neural language model), and (3) an accompanying \textit{cost model}, $C: (\Sigma^* \times \Sigma^*) \rightarrow \mathbb{R}$, which may be any of a number of suitable distance metrics, such as language edit distance, weighted Levenshtein distance, or stochastic contextual edit distance~\cite{cotterell+al.acl14} in the case of probabilistic edits. Our goal, then, is to discover repairs minimizing $C(\err{\sigma}, \tilde{\sigma})$, subject to the given grammar and latency constraints. %\subsection{Ranking} % diff --git a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SortValiant.kt b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SortValiant.kt index 4bc7f294..64b3401c 100644 --- a/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SortValiant.kt +++ b/src/commonMain/kotlin/ai/hypergraph/kaliningraph/parsing/SortValiant.kt @@ -49,7 +49,7 @@ fun CFG.sortwiseAlgebra(metric: ChoiceMetric): Ring = times = { x, y -> join(x, y, metric) }, ) -var MAX_SORT_CAPACITY = 1000 +var MAX_SORT_CAPACITY = 50 // X ⊗ Z := { w | ⟨x, z⟩ ∈ X × Z, (w -> xz) ∈ P } // Greedily selects candidate string fragments according to ChoiceMetric fun CFG.join(X: Sort, Z: Sort, metric: ChoiceMetric = { it.weight }): Sort = diff --git a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt index fd2f4054..dff40517 100644 --- a/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt +++ b/src/commonTest/kotlin/ai/hypergraph/kaliningraph/parsing/SetValiantTest.kt @@ -846,7 +846,7 @@ Yield_Arg -> From_Keyword Test | Testlist_Endcomma O -> * O -> + L -> O N - """.parseCFG() + """.parseCFG().noNonterminalStubs val fsa = """ INIT: 1, 3 FINAL: 4 @@ -862,7 +862,7 @@ Yield_Arg -> From_Keyword Test | Testlist_Endcomma 3 -> 1 + 3 -> 1 * 1 -> a | 1 a | START + | START * - """.parseCFG() + """.parseCFG().noNonterminalStubs println("Grammar size: ${bhcfg.size}") println("Solutions:") @@ -880,12 +880,12 @@ Yield_Arg -> From_Keyword Test | Testlist_Endcomma val clock = TimeSource.Monotonic.markNow() measureTime { -// bhcfg.solveSeq(template) - bhcfg.solve(template) { it.weight } +// cfg.solveSeq(template) + cfg.solve(template) { it.weight } .onEach { println("${clock.elapsedNow().inWholeMilliseconds}ms: " + it) - assertTrue { it in bhcfg.language } - assertTrue { it in fsaCfg.language } +// assertTrue { it in bhcfg.language } +// assertTrue { it in fsaCfg.language } assertTrue { it in cfg.language } }.toList().also { println("Found ${it.size} solutions.") } }.also { println("Sequential solver took: ${it.inWholeMilliseconds}ms") }