Generating text for utterances

Aug 7, 2020

1 Make source text.

Here is a list of sentences submitted to the Common Voice project's sentence collection tool. It includes sentences that have yet to be validated, but it is as good a start as any.

This JSON can be parsed down to lines of text by using jq to select the sentence element and sed to remove the surrounding quote marks. We want to remove those so the model doesn't get confused by them. It might be worth randomising the order of the lines too.

jq '.data[].sentence' < mozilla.sentence.collector.20200527.records.json \
  | sed -e 's/^"//' -e 's/"$//' \
  | head > mozilla.sentence.collector.20200527.records.utts.txt

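Or maybe this version, which strips the escaped word joiner (\u2060), skips sentences marked as not approved, normalises the quote characters and transliterates everything to ASCII. SINGQ and DUBQ are shell variables that need to be set beforehand, presumably to the curly single and double quote characters: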

gsed -e 's/\\\u2060//g' < mozilla.sentence.collector.20200527.records.json \
  | jq '.data[] | select(.approved!=false).sentence' \
  | gsed "s/[$SINGQ]/'/g; s/[$DUBQ]/\"/g" \
  | iconv -f utf-8 -t ascii//translit \
  | sed -e 's/^"//' -e 's/"$//' \
  | sed -e 's/\\\"/"/g' \
  > mozilla.sentence.collector.20200527.records.utts.clean.txt
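
Neither command above randomises the order of the lines, which was suggested earlier. Here is a minimal sketch of one way to do it. The function name shuffle_lines is made up for illustration, and it assumes that either GNU shuf (installed as gshuf by Homebrew coreutils on macOS) or a plain awk is available.

```shell
# shuffle_lines is a hypothetical helper, not part of the original pipeline.
# It prints the lines of the given file (or stdin) in random order.
shuffle_lines() {
  if command -v shuf >/dev/null 2>&1; then
    # GNU coreutils shuf
    shuf "$@"
  elif command -v gshuf >/dev/null 2>&1; then
    # Same tool under its Homebrew name on macOS
    gshuf "$@"
  else
    # Fallback: prefix each line with a random key, sort on it, drop the key
    awk 'BEGIN { srand() } { printf "%.17f\t%s\n", rand(), $0 }' "$@" \
      | sort -k1,1n \
      | cut -f2-
  fi
}
```

It could then be run on the cleaned file, for example: shuffle_lines mozilla.sentence.collector.20200527.records.utts.clean.txt > shuffled.txt (shuffled.txt is just a placeholder name).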