Posts by Tags

amazon ec2


co-reference graph

[Paper Explained] [AAAI2021] Structured Co-reference Graph Attention for Video-grounded Dialogue

9 minute read


Following up on the brief introduction to video dialogue in my previous post, today I will go into detail on one of the state-of-the-art approaches to this topic. The paper I want to introduce is Structured Co-reference Graph Attention for Video-grounded Dialogue, Junyeong et al., published at AAAI 2021. At a high level, the authors propose a bipartite co-reference structure to connect information across multiple modalities (visual, linguistic), and then capture information from the complex spatial and temporal dynamics of video via an attention graph. By representing the underlying dependencies between modalities, this design moves one step forward in reasoning over language and vision.


my thoughts

Some random thoughts (My July rewind)

7 minute read


It is 2:02 am right now and I am still not sleepy, so I think it is a good chance to write a few words. Since my first COVID vaccine dose in the middle of June, I have found it much harder to fall asleep; I am not sure whether the vaccine is the main reason or whether it comes from some other problem of mine. Anyway, with that in mind, today’s post is not a technical paper review or an algorithm implementation; it is about my thoughts at the moment.


Older Blog Posts

less than 1 minute read


For my older posts, please visit my page at Viblo (unfortunately, they are all written in Vietnamese). I wrote all of them while I was starting to learn about AI.


[Paper Explained] [AAAI2021] Structured Co-reference Graph Attention for Video-grounded Dialogue

9 minute read


Following up on the brief introduction to video dialogue in my previous post, today I will go into detail on one of the state-of-the-art approaches to this topic. The paper I want to introduce is Structured Co-reference Graph Attention for Video-grounded Dialogue, Junyeong et al., published at AAAI 2021. At a high level, the authors propose a bipartite co-reference structure to connect information across multiple modalities (visual, linguistic), and then capture information from the complex spatial and temporal dynamics of video via an attention graph. By representing the underlying dependencies between modalities, this design moves one step forward in reasoning over language and vision.

Video dialogue: Introduction

7 minute read


Historically, having a system that can discuss and interact with you about football matches, movies, and so on, using its own knowledge, has been considered a very ambitious goal. Going beyond today's visual AI models, such a system must be able to infer the past, describe the present, and predict the future from video. In other words, our system’s capacity must be enough to reproduce human-level intelligence in video understanding.