I am aware of the issue, but thank you for raising it. It's a problem with the underlying text tagging library (spaCy). I have tried developing workarounds, so far without success, but I'll look into it further.
Huh... I thought I had posted a suggested "dirty" fix, but I don't see it here in the chat. The workaround is to pad any em-dashes with spaces before conversion, then remove the spaces after conversion. This is simple enough to do to my source text before running it through latin2shaw.
Yes, this should work, but it will (I think) mean that if the original text had spaces on either side of the dashes, they will be removed. I believe there is a way to tell the spaCy tagger to recognise a dash next to punctuation as punctuation, but I have so far struggled to make it work. This is what I plan to look into again. If that fails, I'll add your suggested workaround.
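For reference, the pad-then-strip workaround discussed above could look roughly like the sketch below. It is only an illustration, not latin2shaw's actual code: the helper names are made up, and `convert` is a placeholder for whatever callable performs the real conversion (here `str.upper` stands in for it). As already noted, em-dashes that were spaced in the original text lose their spaces.

```python
import re
from typing import Callable

EM_DASH = "\u2014"  # the em-dash character

def pad_em_dashes(text: str) -> str:
    """Put exactly one space on each side of every em-dash."""
    return re.sub(rf"[ \t]*{EM_DASH}[ \t]*", f" {EM_DASH} ", text)

def unpad_em_dashes(text: str) -> str:
    """Remove spaces and tabs around every em-dash.

    Caveat: dashes that were spaced in the original text are closed up too.
    """
    return re.sub(rf"[ \t]*{EM_DASH}[ \t]*", EM_DASH, text)

def convert_with_padding(text: str, convert: Callable[[str], str]) -> str:
    """Pad em-dashes, run the (placeholder) converter, then strip the padding."""
    return unpad_em_dashes(convert(pad_em_dashes(text)))

if __name__ == "__main__":
    sample = f"Where's the other ladder?{EM_DASH}Why, I hadn't to bring but one"
    # str.upper is a stand-in for the real converter, used only for illustration.
    print(convert_with_padding(sample, str.upper))
```

Matching on the em-dash together with any surrounding spaces keeps both helpers idempotent, so running the padding step twice does not accumulate extra spaces.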
The issue arises when an em-dash joins two words and another punctuation mark sits directly next to it, i.e. when the text before the em-dash ends in an exclamation mark, a question mark, etc.
Merriam-Webster has examples showing this particular usage of the em-dash here:
https://www.merriam-webster.com/grammar/em-dash-en-dash-how-to-use
Here is an "OK" sample, where the em-dash is the only punctuation between the two words (no other adjacent punctuation):
Within its first year, Mabel and Harry had sampled all of the bakery’s offerings—all 62 items—and had also decided that the exercise was worth repeating.
𐑢𐑦𐑞𐑦𐑯 𐑦𐑑𐑕 𐑓𐑻𐑕𐑑 𐑘𐑽, ·𐑥𐑱𐑚𐑩𐑤 𐑯 ·𐑣𐑨𐑮𐑦 𐑣𐑨𐑛 𐑕𐑭𐑥𐑐𐑩𐑤𐑛 𐑷𐑤 𐑝 𐑞 𐑚𐑱𐑒𐑼𐑦𐑟 𐑪𐑓𐑼𐑦𐑙𐑟—𐑷𐑤 62 𐑲𐑑𐑩𐑥𐑟—𐑯 𐑣𐑨𐑛 𐑷𐑤𐑕𐑴 𐑛𐑦𐑕𐑲𐑛𐑩𐑛 𐑞𐑨𐑑 𐑞 𐑧𐑒𐑕𐑼𐑕𐑲𐑟 𐑢𐑪𐑟 𐑢𐑻𐑔 𐑮𐑦𐑐𐑰𐑑𐑦𐑙.
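Here is a particularly bad sample from Alice. The word immediately before each em-dash that directly follows other punctuation is left untransliterated (ladder, lad, bear, crash, that, chimney, it, then):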
She waited for some time without hearing anything more: at last came a rumbling of little cartwheels, and the sound of a good many voices all talking together: she made out the words: “Where’s the other ladder?—Why, I hadn’t to bring but one; Bill’s got the other—Bill! fetch it here, lad!—Here, put ’em up at this corner—No, tie ’em together first—they don’t reach half high enough yet—Oh! they’ll do well enough; don’t be particular—Here, Bill! catch hold of this rope—Will the roof bear?—Mind that loose slate—Oh, it’s coming down! Heads below!” (a loud crash)—“Now, who did that?—It was Bill, I fancy—Who’s to go down the chimney?—Nay, I shan’t! You do it!—That I won’t, then!—Bill’s to go down—Here, Bill! the master says you’re to go down the chimney!”
𐑖𐑰 𐑢𐑱𐑑𐑩𐑛 𐑓 𐑕𐑳𐑥 𐑑𐑲𐑥 𐑢𐑦𐑞𐑬𐑑 𐑣𐑽𐑦𐑙 𐑧𐑯𐑦𐑔𐑦𐑙 𐑥𐑹: 𐑨𐑑 𐑤𐑭𐑕𐑑 𐑒𐑱𐑥 𐑩 𐑮𐑳𐑥𐑚𐑤𐑦𐑙 𐑝 𐑤𐑦𐑑𐑩𐑤 𐑒𐑸𐑑𐑢𐑰𐑤𐑟, 𐑯 𐑞 𐑕𐑬𐑯𐑛 𐑝 𐑩 𐑜𐑫𐑛 𐑥𐑧𐑯𐑦 𐑝𐑶𐑕𐑩𐑟 𐑷𐑤 𐑑𐑷𐑒𐑦𐑙 𐑑𐑩𐑜𐑧𐑞𐑼: 𐑖𐑰 𐑥𐑱𐑛 𐑬𐑑 𐑞 𐑢𐑻𐑛𐑟: «𐑢𐑺𐑟 𐑞 𐑳𐑞𐑼 ladder?—𐑢𐑲, 𐑲 𐑣𐑨𐑛𐑩𐑯𐑑 𐑑 𐑚𐑮𐑦𐑙 𐑚𐑳𐑑 𐑢𐑳𐑯; ·𐑚𐑦𐑤𐑟 𐑜𐑪𐑑 𐑞 𐑳𐑞𐑼—·𐑚𐑦𐑤! 𐑓𐑧𐑗 𐑦𐑑 𐑣𐑽, lad!—𐑣𐑽, 𐑐𐑫𐑑 𐑩𐑥 𐑳𐑐 𐑨𐑑 𐑞𐑦𐑕 𐑒𐑹𐑯𐑼—𐑯𐑴, 𐑑𐑲 𐑩𐑥 𐑑𐑩𐑜𐑧𐑞𐑼 𐑓𐑻𐑕𐑑—𐑞𐑱 𐑛𐑴𐑯𐑑 𐑮𐑰𐑗 𐑣𐑭𐑓 𐑣𐑲 𐑦𐑯𐑳𐑓 𐑘𐑧𐑑—𐑴! 𐑞𐑱𐑤 𐑛𐑵 𐑢𐑧𐑤 𐑦𐑯𐑳𐑓; 𐑛𐑴𐑯𐑑 𐑚𐑰 𐑐𐑼𐑑𐑦𐑒𐑘𐑩𐑤𐑼—𐑣𐑽, ·𐑚𐑦𐑤! 𐑒𐑨𐑗 𐑣𐑴𐑤𐑛 𐑝 𐑞𐑦𐑕 𐑮𐑴𐑐—𐑢𐑦𐑤 𐑞 𐑮𐑵𐑓 bear?—𐑥𐑲𐑯𐑛 𐑞𐑨𐑑 𐑤𐑵𐑕 𐑕𐑤𐑱𐑑—𐑴, 𐑦𐑑𐑕 𐑒𐑳𐑥𐑦𐑙 𐑛𐑬𐑯! 𐑣𐑧𐑛𐑟 𐑚𐑦𐑤𐑴!» (𐑩 𐑤𐑬𐑛 crash)—»𐑯𐑬, 𐑣𐑵 𐑛𐑦𐑛 that?—𐑦𐑑 𐑢𐑪𐑟 ·𐑚𐑦𐑤, 𐑲 𐑓𐑨𐑯𐑕𐑦—𐑣𐑵𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 chimney?—𐑯𐑱, 𐑲 𐑖𐑭𐑯𐑑! 𐑿 𐑛𐑵 it!—𐑞𐑨𐑑 𐑲 𐑢𐑴𐑯𐑑, then!—·𐑚𐑦𐑤𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯—𐑣𐑽, ·𐑚𐑦𐑤! 𐑞 𐑥𐑭𐑕𐑑𐑼 𐑕𐑧𐑟 𐑿𐑼 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦!»