Welcome to the first-panel conversation in the Expert Talks series presented by the Data Science Council of America (DASCA) on the topic ‘Natural language processing for sentiment analysis’. We're joined by an esteemed panel of Dr. Victor Maia and Prof. Carmine Buono.
Dr. Victor Maia, PDS™ holds a PhD in Economics and a Master's in Statistics, complemented by a specialization in Deep Learning and Machine Learning from MIT. Certified as a Principal Data Scientist by DASCA, he has earned a top-rated and expert-vetted reputation, successfully completing over 100 projects. Currently, he serves as the Chief Artificial Intelligence Officer for Principia, a leading FinTech company in Latin America that recently secured a $40 million Series A funding round.
Prof. Carmine Buono is a Technical Manager and Principal Solution Architect at Sferanet and, since 2021, a Faculty Member of the DASCA Accredited Master's in Data Science at Rome Business School. He has more than 25 years of experience across a broad range of IT disciplines and several industries, including Cyber Security and Data Mining. Passionate about Computer Science since the age of 8, he started studying C++ as his first programming language. Today, after a degree in IT engineering, a PMP certification, and several projects using Big Data, Machine Learning and Artificial Intelligence, alongside other achievements, he still carries the same enthusiasm he had in the late 1980s.
The moderator, Kshama Malavalli, holds a bachelor's degree in physics from Cornell University and has a keen interest in exploring the various facets of data science, especially as applied to physics.
Q: Can you give us more insight into your work, past and present, in data science?
Dr. Victor Maia: I began my career in banking and statistical trading, then founded my own company after solving a classification problem with machine learning, using a logistic regression algorithm written in C++. After selling the company, I returned to my true passion as a data science consultant and earned the PMP and DASCA’s Principal Data Scientist (PDS™) certifications. I now run AI projects for a fintech startup.
Prof. Carmine Buono: I started studying programming languages like C++ as a child during the IT revolution. During my university years, I became more focused on IT security and advanced analytics techniques. Today, I help colleagues choose the right data ingestion processor or data flow, define valuable dataset features, and select the best machine-learning algorithms for specific use cases.
Q: What is your take on how Natural Language Processing, used with Big Data, can transform both the back end and user experience in different applications (like speech recognition and chatbots)?
CB: I recall a project back in 2008, working on fax message analysis for an Italian ministry's crisis department. It involved lexical, syntactic, and semantic analysis to classify and route documents. At that time, it was a formidable challenge because the task kept evolving: we struggled to fit, change, and tune the model, its features, and its rewards. I recently tackled a similar project involving PEC (Posta Elettronica Certificata) automation, again a classification problem. In just two months, we achieved an impressive 92% F-score. Big Data's role in NLP's evolution cannot be overstated: it provides vast, specialized document sets for model training. The initial disruptive phase has passed.
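For readers unfamiliar with the metric CB mentions, the F-score (in its common F1 form) is the harmonic mean of precision and recall. The snippet below is a minimal sketch of that calculation. The counts are made up purely for illustration and are not the figures from the PEC project, and the function name `f1_score` is a hypothetical helper rather than any particular library's API.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1 score (harmonic mean of precision and recall).

    tp: documents correctly assigned to the class (true positives)
    fp: documents wrongly assigned to the class (false positives)
    fn: documents of the class that were missed (false negatives)
    """
    if tp == 0:
        # No correct positives: precision and recall are both zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, chosen only so the arithmetic is easy to follow.
score = f1_score(tp=92, fp=8, fn=8)
print(f"F1 = {score:.2f}")
```

With these illustrative counts, precision and recall are both 0.92, so their harmonic mean is 0.92 as well. A classifier with unbalanced precision and recall would score lower than the simple average of the two, which is why the metric is popular for routing and classification tasks.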
Q: What is a classification problem?
VM: Take a chatbot as an example. Framed as a classification problem, the chatbot must distinguish issues it can solve from issues it cannot. An effective chatbot minimizes two errors: escalating simple problems it could have solved (a Type 1 error) and failing to escalate complex issues (a Type 2 error). Just as in the justice system, where it is better to let a potentially guilty person go than to mistakenly punish an innocent one, it is better to escalate a potentially solvable problem to a human (Type 1) than to frustrate a user by keeping an unsolvable one with the bot (Type 2).
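The trade-off VM describes can be sketched in a few lines. The example assumes, hypothetically, that the chatbot produces a score between 0 and 1 for how likely an issue is to need a human, and escalates when the score reaches a threshold. The scores, labels, and the name `count_errors` are all invented for illustration. They are not Principia's data or method.

```python
from typing import List, Tuple


def count_errors(
    scores: List[float], needs_human: List[bool], threshold: float
) -> Tuple[int, int]:
    """Count (Type 1, Type 2) errors for a given escalation threshold.

    Type 1: the bot escalates an issue it could have solved itself.
    Type 2: the bot keeps an issue that actually needed a human.
    """
    type1 = 0
    type2 = 0
    for score, truth in zip(scores, needs_human):
        escalate = score >= threshold
        if escalate and not truth:
            type1 += 1
        elif not escalate and truth:
            type2 += 1
    return type1, type2


# Hypothetical model scores and ground-truth labels for six conversations.
scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
needs_human = [False, False, True, False, True, True]

strict = count_errors(scores, needs_human, threshold=0.6)   # escalate rarely
lenient = count_errors(scores, needs_human, threshold=0.3)  # escalate readily
print("strict threshold  (Type 1, Type 2):", strict)
print("lenient threshold (Type 1, Type 2):", lenient)
```

On this toy data, the strict threshold makes no unnecessary escalations but leaves one user stuck with the bot, while the lenient threshold sends two solvable issues to humans and strands nobody. Lowering the threshold is one simple way to express the preference for Type 1 over Type 2 errors.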
Q: How, if at all, in your experience has the recent explosion of large language models, at least in the public sphere, affected the broader field of NLP and your work?
VM: The recent rise of large language models like ChatGPT has notably impacted the field of NLP and my work. One significant change is the ability to handle multimodal interactions, including audio and image processing (e.g., interpreting a photo of a receipt or a gesture that indicates a sentiment of dissatisfaction). This allows us to interact with users through different mediums and has improved bot interactions by 20%, reducing the workload on human teams.
This has also enabled our unique new approach to credit assessment which involves focusing on public sentiments and specific areas to better handle the problem of clustering data. Though the solution looks like a black box, it has yielded promising results in back-testing scenarios and should go into production next year.
Q: What insights can sentiment analysis provide to businesses, governments, institutions, and individuals? How can sentiment analysis be applied effectively in educational institutions like the Rome Business School?
CB: Sentiment analysis can now be considered one of the key tools behind decision support systems. In business, it aids in marketing, sales, and opinion-mining campaigns, helping decide product launches and discontinuations. Government institutions use it to identify areas for improvement in policies and programs. In education, such as at the Rome Business School, we teach about sentiment analysis with real-world use cases, and I consider it a vital tool for training future managers to be global leaders.
Q: How might NLP for sentiment analysis work at Principia? Additionally, how is unstructured data pre-processed in this context?
VM (alongside slides): At Principia, the NLP sentiment analysis process starts with collecting a large amount of unstructured data from our chatbot interactions, including text, audio, and images. The next step is unsupervised machine learning, paired with a feedback and validation mechanism. Audio is converted to text, images are analyzed for sentiment, and text is cleaned by removing stop words and special characters and by applying stemming and lemmatization to obtain the root forms of words. Advanced language models like BERT are then used to determine sentiment and nuanced emotions in user messages. A feedback loop is established for continuous improvement.
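The text-cleaning step VM outlines can be illustrated with a small standard-library sketch. It is not Principia's pipeline: the stop-word list and suffix rules below are tiny hypothetical stand-ins, the stemmer is a deliberately naive suffix stripper rather than a real algorithm such as Porter's, and lemmatization (which needs a dictionary) is left out. Production systems would more likely rely on an established NLP library.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "is", "was", "and", "to", "my", "i", "it", "of"}

# Toy suffix rules, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "ly", "s")


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(message: str) -> List[str]:
    """Lowercase, drop special characters and stop words, then stem."""
    text = message.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # remove digits and punctuation
    tokens = text.split()
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The app crashed twice and I was charged!!")
print(tokens)  # ['app', 'crash', 'twice', 'charg']
```

The output shows why stemming alone is crude: "charged" becomes "charg", which is not a word. A lemmatizer would return "charge" instead, which is one reason the two techniques are often discussed together.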
Q: How crucial is it to incorporate quality control measures into projects like the one above, as well as other sentiment analysis projects?
CB: Quality control measures are essential in all projects in the IT world, especially in NLP where accuracy is crucial due to the complexity of natural language and linguistics. We have hundreds of languages, and understanding both the meaning and sentiment is a heavy task, making a high level of quality control important.
It's crucial to ensure that NLP models are accurate and reliable, given their increasing use in high-stakes applications like healthcare, finance, and criminal justice, which carry significant risks. Reducing bias in NLP models is another vital aspect. Without proper control, NLP could unintentionally produce unfair or discriminatory outcomes.
Q: What are the primary challenges in industry and academia regarding NLP, sentiment analysis, and large language models, including in an educational context?
VM & CB: In industry, primary challenges include data availability and quality, bias mitigation, and handling ambiguity in real-world applications like chatbots. For researchers, finding satisfactory non-proprietary data is a challenge. Models must understand user intent despite varying expressions of sentiment, and there are concerns about model interpretability and high computational costs. We must also factor in the costs of model construction, process setup, and data management.
Q: What are the boundaries being pushed right now in research, NLP, and sentiment analysis? And where are the resources being invested?
CB: Reducing bias is an ongoing concern among researchers who continuously work on developing new and more accurate models to improve reliability. While NLP is extensively utilized in various applications, additional possibilities are being explored, especially in healthcare and education. In terms of investment, LLMs and generative tasks appear to be promising areas for significant investment.
Q: Where should a student of data science, a young professional, or someone looking to explore different subfields within the field start?
VM: Sentiment analysis using LLMs relies on machine learning principles, deep learning, neural networks, computer vision and audio processing. Reinforcement learning improves chatbot responses based on customer satisfaction. Anomaly detection is vital for identifying patterns and outliers, aiding in fraud and security threat detection. Cloud computing and distributed systems are essential due to high computational demands. For beginners, Python is a more accessible choice than C++ for self-testing and practical use.
CB: There are various career paths you can pursue, such as becoming a data scientist, IT engineer or linguistic scientist, each with its unique demands and specialties. You can approach it from a bottom-up perspective or start with a project or work opportunities and then delve deeper into NLP through advanced degrees. Today, there are numerous opportunities available, and what matters most is your determination to achieve your goals.
Q: What relationship do you see between our topic today and other subfields or interesting applications of data science?
CB: The field of NLP is vast and lies at the intersection of various disciplines, including linguistics, IT, statistics, and more. All subfields within data science are important for most such applications, but individuals (as part of a team) can choose different subfields for specialization. In the future, we anticipate a new generation of artificial intelligence where many topics will converge into a single application, like a Singleton in programming. Participation is key to building this unified application.
And with that, I will thank our panelists and we will see you in the next one.