Issue - double punctuation between words (em-dash usage) #66

Open
burgerdroid opened this issue Apr 9, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@burgerdroid

The issue arises when an em-dash is used between two words and another punctuation mark sits directly next to the em-dash, e.g. where the text before the em-dash is an exclamation or a question.

Merriam-Webster does have examples showing this particular usage of the em-dash here:
https://www.merriam-webster.com/grammar/em-dash-en-dash-how-to-use

Here is an "OK" sample, where the em-dash is the only character (no other adjacent punctuation):

Within its first year, Mabel and Harry had sampled all of the bakery’s offerings—all 62 items—and had also decided that the exercise was worth repeating.

𐑢𐑦𐑞𐑦𐑯 𐑦𐑑𐑕 𐑓𐑻𐑕𐑑 𐑘𐑽, ·𐑥𐑱𐑚𐑩𐑤 𐑯 ·𐑣𐑨𐑮𐑦 𐑣𐑨𐑛 𐑕𐑭𐑥𐑐𐑩𐑤𐑛 𐑷𐑤 𐑝 𐑞 𐑚𐑱𐑒𐑼𐑦𐑟 𐑪𐑓𐑼𐑦𐑙𐑟—𐑷𐑤 62 𐑲𐑑𐑩𐑥𐑟—𐑯 𐑣𐑨𐑛 𐑷𐑤𐑕𐑴 𐑛𐑦𐑕𐑲𐑛𐑩𐑛 𐑞𐑨𐑑 𐑞 𐑧𐑒𐑕𐑼𐑕𐑲𐑟 𐑢𐑪𐑟 𐑢𐑻𐑔 𐑮𐑦𐑐𐑰𐑑𐑦𐑙.

Here is a particularly bad sample from Alice, where the word next to each "?—" or "!—" is left unconverted:

She waited for some time without hearing anything more: at last came a rumbling of little cartwheels, and the sound of a good many voices all talking together: she made out the words: “Where’s the other ladder?—Why, I hadn’t to bring but one; Bill’s got the other—Bill! fetch it here, lad!—Here, put ’em up at this corner—No, tie ’em together first—they don’t reach half high enough yet—Oh! they’ll do well enough; don’t be particular—Here, Bill! catch hold of this rope—Will the roof bear?—Mind that loose slate—Oh, it’s coming down! Heads below!” (a loud crash)—“Now, who did that?—It was Bill, I fancy—Who’s to go down the chimney?—Nay, I shan’t! You do it!—That I won’t, then!—Bill’s to go down—Here, Bill! the master says you’re to go down the chimney!”

𐑖𐑰 𐑢𐑱𐑑𐑩𐑛 𐑓 𐑕𐑳𐑥 𐑑𐑲𐑥 𐑢𐑦𐑞𐑬𐑑 𐑣𐑽𐑦𐑙 𐑧𐑯𐑦𐑔𐑦𐑙 𐑥𐑹: 𐑨𐑑 𐑤𐑭𐑕𐑑 𐑒𐑱𐑥 𐑩 𐑮𐑳𐑥𐑚𐑤𐑦𐑙 𐑝 𐑤𐑦𐑑𐑩𐑤 𐑒𐑸𐑑𐑢𐑰𐑤𐑟, 𐑯 𐑞 𐑕𐑬𐑯𐑛 𐑝 𐑩 𐑜𐑫𐑛 𐑥𐑧𐑯𐑦 𐑝𐑶𐑕𐑩𐑟 𐑷𐑤 𐑑𐑷𐑒𐑦𐑙 𐑑𐑩𐑜𐑧𐑞𐑼: 𐑖𐑰 𐑥𐑱𐑛 𐑬𐑑 𐑞 𐑢𐑻𐑛𐑟: «𐑢𐑺𐑟 𐑞 𐑳𐑞𐑼 ladder?—𐑢𐑲, 𐑲 𐑣𐑨𐑛𐑩𐑯𐑑 𐑑 𐑚𐑮𐑦𐑙 𐑚𐑳𐑑 𐑢𐑳𐑯; ·𐑚𐑦𐑤𐑟 𐑜𐑪𐑑 𐑞 𐑳𐑞𐑼—·𐑚𐑦𐑤! 𐑓𐑧𐑗 𐑦𐑑 𐑣𐑽, lad!—𐑣𐑽, 𐑐𐑫𐑑 𐑩𐑥 𐑳𐑐 𐑨𐑑 𐑞𐑦𐑕 𐑒𐑹𐑯𐑼—𐑯𐑴, 𐑑𐑲 𐑩𐑥 𐑑𐑩𐑜𐑧𐑞𐑼 𐑓𐑻𐑕𐑑—𐑞𐑱 𐑛𐑴𐑯𐑑 𐑮𐑰𐑗 𐑣𐑭𐑓 𐑣𐑲 𐑦𐑯𐑳𐑓 𐑘𐑧𐑑—𐑴! 𐑞𐑱𐑤 𐑛𐑵 𐑢𐑧𐑤 𐑦𐑯𐑳𐑓; 𐑛𐑴𐑯𐑑 𐑚𐑰 𐑐𐑼𐑑𐑦𐑒𐑘𐑩𐑤𐑼—𐑣𐑽, ·𐑚𐑦𐑤! 𐑒𐑨𐑗 𐑣𐑴𐑤𐑛 𐑝 𐑞𐑦𐑕 𐑮𐑴𐑐—𐑢𐑦𐑤 𐑞 𐑮𐑵𐑓 bear?—𐑥𐑲𐑯𐑛 𐑞𐑨𐑑 𐑤𐑵𐑕 𐑕𐑤𐑱𐑑—𐑴, 𐑦𐑑𐑕 𐑒𐑳𐑥𐑦𐑙 𐑛𐑬𐑯! 𐑣𐑧𐑛𐑟 𐑚𐑦𐑤𐑴!» (𐑩 𐑤𐑬𐑛 crash)—»𐑯𐑬, 𐑣𐑵 𐑛𐑦𐑛 that?—𐑦𐑑 𐑢𐑪𐑟 ·𐑚𐑦𐑤, 𐑲 𐑓𐑨𐑯𐑕𐑦—𐑣𐑵𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 chimney?—𐑯𐑱, 𐑲 𐑖𐑭𐑯𐑑! 𐑿 𐑛𐑵 it!—𐑞𐑨𐑑 𐑲 𐑢𐑴𐑯𐑑, then!—·𐑚𐑦𐑤𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯—𐑣𐑽, ·𐑚𐑦𐑤! 𐑞 𐑥𐑭𐑕𐑑𐑼 𐑕𐑧𐑟 𐑿𐑼 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦!»

@Shavian-info Shavian-info added the bug Something isn't working label Apr 20, 2024
@Shavian-info
Owner

I am aware of the issue but thank you for raising it. It's a problem with the underlying text tagging library (SpaCy). I have tried developing workarounds but so far without success. But I'll look into it further.

@burgerdroid
Author

Huh... I thought I had posted a suggested "dirty" fix, but I don't see it here in the thread. The workaround is to pad every em-dash with spaces before conversion, then remove the spaces after conversion. Though this is simple enough to do to my source text myself before running it through latin2shaw.

def latin2shaw(text):
+    text = text.replace('—', ' — ')

    ...

+    text_shaw = text_shaw.replace(' — ', '—')
     return text_shaw

Confirmed that re-running latin2shaw with the above 2 lines added fixes it:

𐑖𐑰 𐑢𐑱𐑑𐑩𐑛 𐑓 𐑕𐑳𐑥 𐑑𐑲𐑥 𐑢𐑦𐑞𐑬𐑑 𐑣𐑽𐑦𐑙 𐑧𐑯𐑦𐑔𐑦𐑙 𐑥𐑹: 𐑨𐑑 𐑤𐑭𐑕𐑑 𐑒𐑱𐑥 𐑩 𐑮𐑳𐑥𐑚𐑤𐑦𐑙 𐑝 𐑤𐑦𐑑𐑩𐑤 𐑒𐑸𐑑𐑢𐑰𐑤𐑟, 𐑯 𐑞 𐑕𐑬𐑯𐑛 𐑝 𐑩 𐑜𐑫𐑛 𐑥𐑧𐑯𐑦 𐑝𐑶𐑕𐑩𐑟 𐑷𐑤 𐑑𐑷𐑒𐑦𐑙 𐑑𐑩𐑜𐑧𐑞𐑼: 𐑖𐑰 𐑥𐑱𐑛 𐑬𐑑 𐑞 𐑢𐑻𐑛𐑟: «𐑢𐑺𐑟 𐑞 𐑳𐑞𐑼 𐑤𐑨𐑛𐑼?—𐑢𐑲, 𐑲 𐑣𐑨𐑛𐑩𐑯𐑑 𐑑 𐑚𐑮𐑦𐑙 𐑚𐑳𐑑 𐑢𐑳𐑯; ·𐑚𐑦𐑤𐑟 𐑜𐑪𐑑 𐑞 𐑳𐑞𐑼—·𐑚𐑦𐑤! 𐑓𐑧𐑗 𐑦𐑑 𐑣𐑽, 𐑤𐑨𐑛!—𐑣𐑽, 𐑐𐑫𐑑 𐑩𐑥 𐑳𐑐 𐑨𐑑 𐑞𐑦𐑕 𐑒𐑹𐑯𐑼—𐑯𐑴, 𐑑𐑲 𐑩𐑥 𐑑𐑩𐑜𐑧𐑞𐑼 𐑓𐑻𐑕𐑑—𐑞𐑱 𐑛𐑴𐑯𐑑 𐑮𐑰𐑗 𐑣𐑭𐑓 𐑣𐑲 𐑦𐑯𐑳𐑓 𐑘𐑧𐑑—𐑴! 𐑞𐑱𐑤 𐑛𐑵 𐑢𐑧𐑤 𐑦𐑯𐑳𐑓; 𐑛𐑴𐑯𐑑 𐑚𐑰 𐑐𐑼𐑑𐑦𐑒𐑘𐑩𐑤𐑼—𐑣𐑽, ·𐑚𐑦𐑤! 𐑒𐑨𐑗 𐑣𐑴𐑤𐑛 𐑝 𐑞𐑦𐑕 𐑮𐑴𐑐—𐑢𐑦𐑤 𐑞 𐑮𐑵𐑓 𐑚𐑺?—𐑥𐑲𐑯𐑛 𐑞𐑨𐑑 𐑤𐑵𐑕 𐑕𐑤𐑱𐑑—𐑴, 𐑦𐑑𐑕 𐑒𐑳𐑥𐑦𐑙 𐑛𐑬𐑯! 𐑣𐑧𐑛𐑟 𐑚𐑦𐑤𐑴!» (𐑩 𐑤𐑬𐑛 𐑒𐑮𐑨𐑖)—«𐑯𐑬, 𐑣𐑵 𐑛𐑦𐑛 𐑞𐑨𐑑?—𐑦𐑑 𐑢𐑪𐑟 ·𐑚𐑦𐑤, 𐑲 𐑓𐑨𐑯𐑕𐑦—𐑣𐑵𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦?—𐑯𐑱, 𐑲 𐑖𐑭𐑯𐑑! 𐑿 𐑛𐑵 𐑦𐑑!—𐑞𐑨𐑑 𐑲 𐑢𐑴𐑯𐑑, 𐑞𐑧𐑯!—·𐑚𐑦𐑤𐑟 𐑑 𐑜𐑴 𐑛𐑬𐑯—𐑣𐑽, ·𐑚𐑦𐑤! 𐑞 𐑥𐑭𐑕𐑑𐑼 𐑕𐑧𐑟 𐑿𐑼 𐑑 𐑜𐑴 𐑛𐑬𐑯 𐑞 𐑗𐑦𐑥𐑯𐑦!»

@Shavian-info
Owner

Yes, this should work, but it will (I think) mean that if the original text had spaces on either side of a dash, they will be removed. I believe there is a way to tell the SpaCy tagger to recognise a dash next to punctuation as punctuation, but I have so far struggled to make it work. That is what I was planning to look into again. If that fails, I'll add your suggested workaround.
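
One possible way around the spacing concern, as a rough sketch only: record the spaces each em-dash originally had before padding, then put exactly those back after conversion. The helper names below (`pad_dashes`, `restore_dashes`, `convert_preserving_dash_spacing`) are hypothetical and not part of the project, and the sketch assumes the converter keeps every em-dash and keeps them in the same order.

```python
import re

DASH = "\u2014"  # the em-dash character


def pad_dashes(text):
    """Pad every em-dash with one space per side and record its original spacing."""
    spacing = []

    def repl(match):
        # Remember the spaces found before and after this dash.
        spacing.append((match.group(1), match.group(2)))
        return f" {DASH} "

    padded = re.sub(rf"( *){DASH}( *)", repl, text)
    return padded, spacing


def restore_dashes(text, spacing):
    """Put back the recorded spacing, dash by dash, in order of appearance."""
    recorded = iter(spacing)

    def repl(match):
        # If the converter added dashes (not expected), fall back to no spaces.
        before, after = next(recorded, ("", ""))
        return f"{before}{DASH}{after}"

    return re.sub(rf" *{DASH} *", repl, text)


def convert_preserving_dash_spacing(text, convert):
    """Run `convert` (e.g. latin2shaw) on dash-padded text, then undo the padding."""
    padded, spacing = pad_dashes(text)
    return restore_dashes(convert(padded), spacing)
```

Called as `convert_preserving_dash_spacing(text, latin2shaw)`, this would leave a dash that was written with spaces around it spaced, and a tight one tight. It does not address the underlying tokenizer behaviour.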
