Domain Transfer for Punctuation Retrieval

I would like to thank Dr. Chieu Hai Leong for his guidance throughout the four months of my internship; without him, this project would not have been possible.

About the problem

The problem of punctuation retrieval is simple to comprehend, being something we encounter daily in speech or writing.

Many researchers have looked into the issue, and working models are able to generalise to unseen examples within the training domain, achieving decent levels of performance on the four most frequently occurring punctuation marks: Period (.), Comma (,), Question (?) and Space ( ).
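One common way to frame the task is as per-word classification: each word is labelled with the mark that follows it. The sketch below illustrates that framing for the four classes above. The label names, the lowercasing step and both helper functions are my own assumptions for illustration, not the preprocessing used in this project.

```python
# Hypothetical label names for the four classes discussed above.
LABELS = {".": "PERIOD", ",": "COMMA", "?": "QUESTION"}
SPACE = "SPACE"  # used when a word is followed by no punctuation mark


def to_tokens_and_labels(text):
    """Turn punctuated text into lowercase words plus one label per word."""
    tokens, labels = [], []
    for word in text.split():
        mark = word[-1] if word[-1] in LABELS else None
        core = word.rstrip(".,?").lower()
        if not core:
            continue  # stray punctuation with no word attached
        tokens.append(core)
        labels.append(LABELS[mark] if mark else SPACE)
    return tokens, labels


def restore(tokens, labels):
    """Rebuild punctuated (still lowercase) text from words and labels."""
    inverse = {name: mark for mark, name in LABELS.items()}
    return " ".join(tok + inverse.get(lab, "") for tok, lab in zip(tokens, labels))


tokens, labels = to_tokens_and_labels("Are you there? Hi. Thank you, Ray.")
print(list(zip(tokens, labels)))
print(restore(tokens, labels))
```

A round trip through both helpers also makes a cheap sanity check on the data pipeline.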

The main focus of this project was to look into ways to improve performance on low-resource domains.

This would allow our trained models to capture the specific punctuation patterns associated with each domain, while still maximising the sharing of common knowledge between domains.

Proposed Model

Here is a brief look at the model I ended my project with. It was created with two main aims.

  1. To maximise sharing of information across domains by identifying features that are common across all datasets

This is achieved by using a gradient reversal layer to maximise the domain invariance of the shared features within the transformer layers.

This layer penalises features that strongly differentiate one domain from another, and encourages the model to use features common to all domains in the earlier layers.

  2. To allow the model to capture key characteristics of each domain.

The second goal is achieved by allowing the model to account for differences in punctuation features across domains, using the domain logits along with the transformer hidden states as input to the punctuation classifier.

There are various ways this can be done. My approach has some similarities with training a distinct classifier for each domain, and using the domain classifier to give a different weightage to the punctuation classifier for each domain.
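The two aims above can be sketched in PyTorch as follows. This is a minimal illustration under my own assumptions, not the exact architecture from the project: the class names, the mean pooling used for the domain classifier, the single linear layers, and the choice to detach the domain logits before they reach the punctuation classifier are all hypothetical.

```python
import torch
from torch import nn


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -scale on the way back."""

    @staticmethod
    def forward(ctx, x, scale):
        ctx.scale = scale
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # One gradient per forward input; `scale` is a plain float, so it gets None.
        return -ctx.scale * grad_output, None


class DomainAwareHead(nn.Module):
    """Hypothetical head: a domain classifier behind a gradient reversal layer,
    plus a punctuation classifier that sees hidden states and domain logits."""

    def __init__(self, hidden_size, num_domains, num_punct=4, grl_scale=1.0):
        super().__init__()
        self.grl_scale = grl_scale
        self.domain_classifier = nn.Linear(hidden_size, num_domains)
        self.punct_classifier = nn.Linear(hidden_size + num_domains, num_punct)

    def forward(self, hidden):
        # hidden: (batch, seq_len, hidden_size), e.g. transformer outputs.
        reversed_hidden = GradReverse.apply(hidden, self.grl_scale)
        pooled = reversed_hidden.mean(dim=1)  # assumption: mean pooling over tokens
        domain_logits = self.domain_classifier(pooled)  # (batch, num_domains)

        # Assumption: detach so the punctuation loss does not flow back through
        # the reversed path; then broadcast the logits to every token position.
        per_token = domain_logits.detach().unsqueeze(1).expand(-1, hidden.size(1), -1)
        punct_logits = self.punct_classifier(torch.cat([hidden, per_token], dim=-1))
        return punct_logits, domain_logits


head = DomainAwareHead(hidden_size=8, num_domains=3)
punct_logits, domain_logits = head(torch.randn(2, 5, 8))
print(punct_logits.shape, domain_logits.shape)
```

During training, the domain classifier is optimised to tell domains apart, while the reversed gradient pushes the layers underneath it towards features that make that task harder.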

  • Despite having a similar architecture to BERT, the ELECTRA model performed substantially better in my experiments, likely due to its improved pretraining procedure, which uses Replaced Token Detection instead of Masked Language Modelling.
  • I explored the use of a weighted Dice Loss to target the problem of punctuation class imbalance. It is represented by this formula:
$$L_{DL}=\frac{1}{\sum_k w_k}\sum_{i,k} w_k\left(1-\frac{2\cdot p_{ik}\cdot y_{ik}+\lambda}{p_{ik}+y_{ik}+\lambda}\right)^{\alpha}$$

where $\alpha$ is a tunable focal parameter.
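A direct PyTorch transcription of the formula might look like the sketch below. Only $\alpha$ is defined in the text, so the reading of the other symbols is my assumption: $p_{ik}$ is the softmax probability of class $k$ for token $i$, $y_{ik}$ is the one-hot target, $w_k$ is a per-class weight and $\lambda$ is a smoothing constant. The function name and default values are likewise hypothetical.

```python
import torch
import torch.nn.functional as F


def weighted_dice_loss(logits, targets, class_weights, smooth=1.0, alpha=1.0):
    """Weighted Dice loss following the formula above.

    logits:        (N, K) unnormalised scores for N tokens and K classes
    targets:       (N,) integer class indices
    class_weights: (K,) per-class weights w_k
    smooth:        the lambda term; alpha: the focal exponent
    """
    probs = torch.softmax(logits, dim=-1)                                   # p_ik
    onehot = F.one_hot(targets, num_classes=logits.size(-1)).to(probs.dtype)  # y_ik
    dice = (2 * probs * onehot + smooth) / (probs + onehot + smooth)
    per_term = class_weights * (1 - dice) ** alpha  # w_k broadcasts over tokens
    # The formula sums over all tokens and classes, then divides by sum_k w_k.
    return per_term.sum() / class_weights.sum()


logits = torch.tensor([[2.0, 0.5, 0.1, -1.0]])
targets = torch.tensor([0])
print(weighted_dice_loss(logits, targets, torch.ones(4)))
```

As written, the sum over tokens is not averaged, so the value scales with batch size; dividing by the number of tokens would be a reasonable variation.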

Sample prediction

Just to show that my model works decently, here are some test samples from various sources. The pre-trained model and its configuration, the training code, and all training data can be found on my GitHub. The model was trained only on the smaller set of 4 punctuation classes with 6 unfrozen layers, and is able to perform relatively well on datasets from other domains.

Query : My name's Mary Dell, and live in the Dallas, Texas area where there's a lot of pollution . Okay, and I'm up in Wisconsin … Oh. … uh, my name is Terry … Uh-huh. … and, uh, in the small town we don't, but, uh, we're not that far from the city where there's tons of pollution. Yes. Okay, I'll go ahead and start recording that, Okay. Okay, um, just in particular here in the Twin Cities we have a lot of big corporations and, um … Oh. … I'm sure there's a lot of pollution. We, uh, before moving to Wisconsin lived across from where they were, um, gravel pits … Uh-huh. … and also where they were making tar, and so we would occasionally if the wind was blowing the right direction would get the smell of tar and … Ooh. … it would, uh, smell the continuous, you knew that you were also breathing that into your lungs, so … Right. … and it was like miles away, but just the idea of having that come across the way it did in the wind … Uh-huh. … kind of made you know that there was a lot of pollution, and that was just one small corporation that was doing the pollution of that, but … Uh-huh. … also, um, where we also lived, too, it was very close to a highway, so got a lot of fumes from, uh, trucks … Uh-huh. … and cars and you would literally see the pollution on your furniture, I mean, it was … Oh, gosh, … on your tables, you would dust every day, and it was dirty, um, type of dust, I mean, it wasn't the typical dust that you get way out in the country … Uh-huh. … um … Uh-huh. … for, live in a small, um, town now quite a ways from the Twin Cities, about an hour away, and may dust once a week, and that's all it needs versus every day when lived in the city …
Result: my name's mary dell, and live in the dallas, texas area, where there's a lot of pollution. okay. and i'm up in wisconsin. oh, uh, my name is terry. uh, huh. and, uh, in the small town, we don't, but, uh, we're not that far from the city where there's tons of pollution. yes. okay. i'll go ahead and start recording that. okay. okay. um, just in particular, here in the twin cities, we have a lot of big corporations and, um, oh, i'm sure there's a lot of pollution. we, uh, before moving to wisconsin, lived across from where they were, um, gravel pits. uh, huh. and// also where they were making tar, and so we would, occasionally, if the wind was blowing the right direction, would get the smell of tar and, ooh, it would, uh, smell the continuous, you knew that you were also breathing that into your lungs. so, right. and it was like miles away, but just the idea of having that come across the way it did in the wind. uh, huh. kind of made, you know that there was a lot of pollution, and that was just one small corporation that was doing the pollution of that, but, uh, huh. also, um, where we also lived, too, it was very close to a highway, so got a lot of fumes from, uh, trucks. uh, huh.// and cars, and you would literally see the pollution on your furniture, i mean, it was, oh, gosh, on your tables. you would dust every day, and it was dirty, um, type of dust, i mean, it wasn't the typical dust that you get way out in the country. uh, huh. um, uh, huh. for, live in a small, um, town now, quite a ways from the twin cities, about an hour away, and may dust once a week, and that's all it needs versus every day when lived in the city.

Query : Are you there? Hi. Thank you so much for being here, Ray. Do you see me? I see you. You look great. Thank you so much for being here, we really appreciate it. My pleasure. OK, so lay it on us, Ray, how bad is it, how worried should we be? Well, I think you could look at this like a tsunami that's hit — the virus itself and the social distancing. And then what are the consequences in terms of the wreckage, when you look at it? And I think you have to think of that as incomes and balance sheets, you know? So it was a tremendous income hit. And then the balance sheets losses. And who has what savings and so on. And then how is that dealt with. In order to understand that, you have to realize that there are these holes. These holes in income, and the holes in the balance sheets. And then you have to realize that there is the production of money and credit. And who produces that money and credit. OK, the money and credit comes in different flavors. There is US dollar money and credit. There is Euro-dollar money and credit. And so when you look at the world, and you're seeing it, you're seeing a situation that is the same as existed, really, in the 1930 to '45 period, in that now we're seeing the production of a lot of debt, a lot of borrowing by the government. We're seeing zero interest rates and not the traditional kind of monetary policy. But the producing of a lot of money and credit — so the Federal Reserve is buying the Treasury's debt. And the Treasury is getting that money to, mostly, Americans, in some imperfect but remarkably large way.
Result: are you there? hi. thank you so much for being here. ray. do you see me? i see you. you look great. thank you so much for being here. we really appreciate it. my pleasure. ok, so, lay it on us, ray. how bad is it? how worried should we be? well, i think you could look at this like a tsunami that's hit. the virus itself and the social distancing, and then what are the consequences in terms of the wreckage? when you look at it? and i think you have to think of that as incomes and balance sheets, you know. so it was a tremendous income hit, and then the balance sheets losses, and who has what savings? and// so on. and then, how is that dealt with? in order to understand that, you have to realize that there are these holes, these holes in income and the holes in the balance sheets, and then you have to realize that there is the production of money and credit, and who produces that money and credit? ok, the money and credit comes in different flavors, there is us, dollar money and credit, there is euro, dollar money and credit. and so when you look at the world and you're seeing it, you're seeing a situation that is the same as existed, really in the 1930 to'45 period, in that now we're seeing the production of a lot of// debt, a lot of borrowing by the government. we're seeing zero interest rates, and not the traditional kind of monetary policy, but the producing of a lot of money and credit. so the federal reserve is buying the treasury's debt, and the treasury is getting that money to mostly americans, in some imperfect, but remarkably large way.

Query : Previously on The Last Man on Earth… Last man on Earth. Last woman on Earth. I'm not gonna have sex with you unless we're married. I do? Oh… Oh, my God. Hey. Look, seven people left, and two of them are named Phil Miller ? You were gonna leave Todd in the desert. You're done here, Tandy. Tucson's my home. Don't even think about coming back to Tucson. So, where should we go? You're staying with me? You had a brother? I didn't know that. Phil, I got the tequila! All right. Can I drive? Carol, it's very complicated. We should go back and get that bomb. Carol… Phil… I knew you were gonna say that. I don't know how to put a bomb back in that little thingy. We're Americans. We put a man on the Moon. Fine, if you want to go back and get the bomb, we'll go back and get the bomb. That won't be necessary, Phil. It's fine. Just the fact that you offered is good enough for me. You know what? How about tomorrow I go back to that parking lot and I drop a little caution cone in front of it? Thank you, Phil. And that's why you're the bomb. Come here, give me some sugar. Oh! Insurance company is gonna hear about that one. Oh. USA! USA! USA! USA! So lonely. Wish I had somebody to make out with. I may be able to help you with that. Ben Franklin? You just made me discover electricity in my shorts. Get in here. Stick your copper tongue down my mouth! And, in short, my position on Syria is, uh… don't know. The situation is very Syri-ous. Boom, I still got it. Next question. Yes. Brice. My position on Tucson and its inhabitants remains the same.
Result: previously on the last man on earth, last man on earth, last woman on earth. i'm not gonna have sex with you unless we're married. i do. oh. oh, my god. hey, look, seven people left, and two of them are named phil miller. you were gonna leave todd in the desert? you're done here. tandy. tucson's my home. don't even think about coming back to tucson. so where should we go? you're staying with me? you had a brother? i didn't know that, phil. i got the tequila. all right. can i drive, carol? it's very complicated. we should go back and get that bomb. carol. phil. i// knew you were gonna say that, i don't know how to put a bomb back in that little thingy. we're americans. we put a man on the moon. fine. if you want to go back and get the bomb, we'll go back and get the bomb. that won't be necessary, phil. it's fine. just the fact that you offered is good enough for me. you know? what? how about tomorrow? i go back to that parking lot, and i drop a little caution cone in front of it? thank you, phil. and that's why you're the bomb. come here, give me some sugar. oh, insurance company is gonna hear about that.// one. oh, usa, usa, usa, usa. so lonely. wish i had somebody to make out with. i may be able to help you with that. ben franklin, you just made me discover electricity in my shorts. get in here, stick your copper tongue down my mouth. and in short, my position on syria is, uh, don't know, the situation is very syri, ous. boom. i still got it. next question. yes, brice. my position on tucson and its inhabitants remains the same.

Query : Among the 205 confirmed cases reported from 28 May to 3 June, 66 cases have tested positive for their serology tests, 106 have tested negative, and 33 serology test results are pending.
Result: among the 205 confirmed cases reported from 28 may to 3 june, 66 cases have tested positive for their serology tests, 106 have tested negative, and 33 serology test results are pending.

Query : So we won the election and we have the right to do it, Chris. President Trump, thank you. Same question to you, Vice President Biden. You have two minutes. Well, first of all, thank you for doing this and looking forward to this, Mr. President. Thank you, Joe. The American people have a right to have a say in who the Supreme Court nominee is and that say occurs when they vote for United States Senators and when they vote for the President of United States. They’re not
Result: so we won the election, and we have the right to do it. chris. president trump. thank you. same question to you, vice president biden, you have two minutes. well, first of all, thank you for doing this, and looking forward to this, mr. president. thank you, joe. the american people have a right to have a say in who the supreme court nominee is, and that say, occurs when they vote for united states senators, and when they vote for the president of united states, they ’ re not.

Takeaways

  • Ensure that the data processing / generation is done properly, writing tests where necessary to ensure this.
  • Before writing any code, find a strong existing baseline, or models that work for similar problems.
  • Test each part of the model thoroughly to catch minor bugs early!