makerday ft chris 09/07
Summary
TLDR: The speaker emphasizes the importance of maintaining clean and compliant code repositories, particularly on platforms like GitHub, to ensure security and avoid legal issues. They discuss the use of tools like GitHub Copilot for generating compliant code and the necessity of understanding how repositories work. The speaker also touches on the role of data cleansing and the use of vector and graph databases in enhancing model understanding. They advocate for a holistic approach to problem-solving, combining scientific and engineering thinking, and stress the value of good documentation and output analysis in machine learning projects.
Takeaways
- Ensure repositories are clean and follow rules, especially when using platforms like GitHub.
- Understand how repositories work and perform checks to ensure code is compliant and secure.
- For large institutions like Assurance, code must be in compliance with government regulations to avoid rework.
- Use tools like GitHub Copilot for code recommendations, which automatically censor sensitive information like SSH keys.
- Developers should consider compliance when coding, even if they initially prioritize freedom and creativity.
- Maintain good documentation in repositories, such as README files, to aid language models in understanding and using the code.
- Use 'key vault' or secure databases to store sensitive information that should not be exposed in repositories.
- Consider using vector and graph databases to enhance how models understand and interact with your data.
- Think holistically about problem-solving, starting with a scientific approach and then applying engineering to implement solutions.
- Analyze model outputs to understand performance and determine if adjustments in methodology or post-processing are needed.
Q & A
Why is it important to keep repositories clean?
-Repositories need to be clean because they are often subject to compliance checks and rules, especially when they are public on platforms like GitHub. Clean code ensures that there are no security risks or violations of privacy, which is crucial for both the developers and the users of the code.
What does the speaker mean by 'repositories' in the context of GitHub?
-In the context of GitHub, 'repositories' refers to the projects or collections of files that developers use to store their code. These repositories can be public or private and are a central part of version control and collaboration in software development.
What is the significance of compliance in the context of the script?
-Compliance is significant because it ensures that the code and practices followed by the developers adhere to legal and regulatory standards, especially important for large institutions and when dealing with sensitive data or government-related projects.
What is the role of 'Assurance' mentioned in the script?
-Assurance is likely a B2B firm mentioned in the script, which may be involved in providing compliance checks and ensuring that the models and code developed by the team are in line with government regulations.
Why is it necessary to keep production code intact and how does scanning contribute to this?
-Production code needs to be kept intact to ensure reliability and security. Scanning the code helps identify any vulnerabilities or compliance issues, thus preventing potential risks before the code is deployed or handed off to other entities like the government.
What is the purpose of 'key vault' as mentioned in the script?
-A 'key vault' is a secure storage mechanism used to safeguard sensitive information like API keys and passwords. It ensures that only authorized personnel can access these critical pieces of information, enhancing security within the development environment.
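As a minimal sketch of the idea, assuming secrets are injected into the process environment by whatever vault or secret store a team uses (the talk does not describe a specific setup), code can look a secret up by name instead of hardcoding it in the repository. The names `load_secret`, `MissingSecretError`, and `DEMO_API_KEY` are hypothetical and chosen only for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not available at runtime."""


def load_secret(name: str) -> str:
    """Return the secret stored under `name`, failing loudly if it is absent.

    Assumption: the deployment environment (a vault agent, CI system, etc.)
    has already exported the secret as an environment variable.
    """
    value = os.environ.get(name)
    if not value:
        # Never fall back to a hardcoded default: that is how keys leak into repos.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

A caller would write something like `api_key = load_secret("DEMO_API_KEY")`, so the repository only ever contains the name of the secret and never its value.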
How does GitHub Copilot provide recommendations while ensuring compliance?
-GitHub Copilot provides recommendations by using the context of the code and the developer's intent. According to the speaker, it avoids exposing sensitive information such as SSH keys by masking ('hashing out') potentially sensitive values before presenting suggestions to the user.
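How Copilot filters its suggestions internally is not described in the talk. The following is only an illustrative sketch of the kind of post-processing the speaker mentions: masking secret-looking values with `xxxxx` before text is shown to a user. The two patterns and the `redact` helper are assumptions made for this example, and real secret scanners use much larger rule sets.

```python
import re

# Hypothetical patterns for illustration only.
# Matches assignments such as  api_key = 'value'  or  token=value.
_SECRET_ASSIGNMENT = re.compile(
    r"(?i)\b(api[_-]?key|secret|token|password)\b(\s*[:=]\s*)(['\"]?)[^\s'\"]+\3"
)
# Matches PEM-style private key blocks, e.g. SSH private keys.
_PRIVATE_KEY_BLOCK = re.compile(
    r"-----BEGIN [A-Z ]*PRIVATE KEY-----.*?-----END [A-Z ]*PRIVATE KEY-----",
    re.DOTALL,
)


def redact(text: str, mask: str = "xxxxx") -> str:
    """Replace secret-looking values in `text` with `mask`."""
    text = _PRIVATE_KEY_BLOCK.sub(mask, text)
    # Keep the variable name, separator, and quotes; swap only the value.
    return _SECRET_ASSIGNMENT.sub(
        lambda m: f"{m.group(1)}{m.group(2)}{m.group(3)}{mask}{m.group(3)}",
        text,
    )
```

Running `redact` over generated text before display keeps the structure of a suggestion (which variable needs a secret) while hiding any concrete value.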
What is the significance of using proper documentation like README files in repositories?
-Proper documentation, such as README files, is significant because it provides clear instructions and information about the project, which is essential for understanding the project's purpose, dependencies, and how to run the code. This information is crucial for both humans and machine learning models that may use the repository.
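One way to make a README usable as model input, sketched here under the assumption that the README is Markdown with `#`-style headings, is to split it into named sections (project name, dependencies, how to run) that can be passed to a model or indexed separately. The helper name `split_readme` is hypothetical, and the sketch does not handle `#` lines inside fenced code blocks.

```python
import re

# A Markdown ATX heading: one to six '#' characters followed by the title.
_HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_readme(markdown: str) -> dict[str, str]:
    """Map each lowercased heading to the text that follows it."""
    sections: dict[str, list[str]] = {}
    current = "preamble"  # text that appears before the first heading
    for line in markdown.splitlines():
        match = _HEADING.match(line)
        if match:
            current = match.group(2).lower()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}
```

With sections separated this way, a query about installation can be answered from the dependencies section alone instead of the whole file.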
Why is it important to think like a scientist when approaching a problem in the context of the script?
-Thinking like a scientist is important because it encourages a holistic and innovative approach to problem-solving. It involves considering the process flow and desired outcomes without being limited by current technological constraints, which can lead to more effective and creative solutions.
What does the speaker suggest about the role of output analysis in understanding model performance?
-The speaker suggests that output analysis is crucial for understanding why a model is performing in a certain way. By analyzing the output, developers can determine if the issue lies with the underlying methodology of the large language model or if additional post-processing is required.
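A simple starting point for output analysis, assuming the same prompt has been run several times and the raw text outputs collected, is to measure how often the runs agree. This is a sketch rather than an established metric from the talk; `consistency_report` is a hypothetical helper, and exact-match agreement after whitespace and case normalization is a deliberately crude stand-in for a real comparison.

```python
from collections import Counter


def consistency_report(outputs: list[str]) -> dict:
    """Summarize how consistent repeated model outputs are."""
    if not outputs:
        raise ValueError("need at least one output")
    # Normalize case and whitespace so trivial differences do not count.
    normalized = [" ".join(o.lower().split()) for o in outputs]
    counts = Counter(normalized)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "distinct": len(counts),
        "top_answer": top_answer,
        "agreement": top_count / len(normalized),
    }
```

Low agreement at a nonzero temperature points to sampling randomness, while consistently wrong answers point to the underlying methodology or to missing post-processing.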
Why is it recommended to collect repositories that are useful and relevant to the specific query?
-Collecting repositories that are useful and relevant ensures that the data and code used are directly applicable to the problem at hand. This targeted approach can lead to more efficient and effective solutions, as opposed to using a broad and potentially irrelevant dataset.
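The selection step can be sketched as a toy ranking, assuming each repository has a short text description (for example its README summary). A real vector database would use learned embeddings; the bag-of-words cosine similarity below is a swapped-in simplification, and `rank_repositories` and the repository names are hypothetical.

```python
import math
import re
from collections import Counter


def _bag_of_words(text: str) -> Counter:
    """Count lowercase alphanumeric tokens in `text`."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def rank_repositories(query: str, descriptions: dict[str, str], top_k: int = 3):
    """Return the `top_k` (name, score) pairs most similar to `query`."""
    q = _bag_of_words(query)
    scored = [
        (name, cosine_similarity(q, _bag_of_words(text)))
        for name, text in descriptions.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Repositories that score zero against a query share no vocabulary with it and are candidates to leave out of the collection.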
Outlines
Repository Compliance and Code Quality
The speaker emphasizes the importance of maintaining clean and compliant repositories, particularly on platforms like GitHub. They discuss the necessity of following rules and ensuring code is cleansed before it's handed off to authorities, such as the government. The speaker also touches on the use of tools like GitHub Copilot for generating compliant code and the importance of guarding against data theft and ensuring privacy. They provide examples of how to use GitHub Copilot effectively and securely, including the use of SDKs and handling of secrets.
Data Privacy and Internal Repositories
This paragraph delves into the concept of data privacy within an internal ecosystem, such as the Assurance ecosystem mentioned. The speaker discusses the use of private repositories and the importance of having similar data for effective machine learning models. They explain the use of key vaults for secure data storage and the significance of proper documentation in repositories, such as README files, for aiding language models in understanding project requirements. The speaker also highlights the importance of thinking like a scientist when approaching problems and the role of documentation in machine learning projects.
Thinking Holistically in Problem Solving
The speaker encourages a holistic approach to problem-solving, suggesting that one should first think like a scientist to conceptualize a solution and then like an engineer to implement it. They advocate for not limiting oneself with preconceived notions of what technology can or cannot do, and instead, to think creatively and outside the box. The speaker also discusses the importance of algorithm design and the potential pitfalls of relying too heavily on patches rather than creating robust solutions. They suggest analyzing output to understand model performance and to identify whether additional post-processing or changes to the underlying methodology are needed.
Effective Repository Utilization and Continuous Learning
In the final paragraph, the speaker advises on how to effectively utilize repositories by selecting those that are useful and relevant to the problem at hand. They recommend unit testing and output analysis to understand the results produced by language models. The speaker also stresses the importance of continuous learning, as the field of language models is relatively new and constantly evolving. They offer to share resources on measuring the effectiveness of language model programs and encourage reaching out for help, emphasizing the importance of a comprehensive and open-minded approach to learning and problem-solving.
Keywords
Repositories
Compliance
Code Cleansing
GitHub
Copilot
Key Vaults
Vector Database
Graph Database
Documentation
Output Analysis
Highlights
Emphasis on maintaining clean repositories due to compliance with rules.
The importance of understanding how repositories work, especially on platforms like GitHub.
The necessity of code cleansing to ensure it is compliant before being handed off to the government.
Use of scans to keep production code compliant at large institutions like Assurance.
The role of GitHub Copilot in providing recommendations while adhering to compliance standards.
Explanation of how GitHub Copilot avoids exposing sensitive information like SSH keys.
The concept of key vaults as a secure way to store sensitive information.
The significance of having similar data in private repositories for model training.
The use of vector and graph databases to enhance language model understanding.
The value of good documentation in helping language models learn and reducing data cleansing time.
The shift from data imputation to allowing models to handle it themselves.
Encouragement to think like a scientist first, then an engineer when approaching problems.
The importance of algorithm design and thinking holistically about solutions.
Advice on not limiting oneself with preconceived notions of what technology can or cannot do.
The significance of analyzing output to understand model performance and identify necessary adjustments.
The challenge of ensuring consistent results from language models and the role of randomness.
Recommendation to collect repositories that are useful and relevant to specific queries.
The importance of unit testing and output analysis in the development process.
Offer to share articles on measuring the effectiveness of language model programs.
Transcripts
um yes you do want to make sure you
understand that our repositories
are definitely very clean because we
do have to follow certain rules right we
do have when you go to like GitHub for
example github.com
um you look at the repositories
right oh my God let me sign let me sign
my account first
sorry uh
okay so
like definitely we go to like different
repositories for
example um people already started doing
different checks on repositories and
they already make sure their code is
cleansed and good to go because first
understand how a repository works okay I
know that we're all very smart we know how
repositories work right but even for
me I had I actually learned something
new yesterday I learned about
environments yesterday this new um thing
of how we keep our our production code
intact these days most of our production
code for example has certain
um scans from it they scan our code make
sure it's in compliance because um think
about a large institution like Assurance
um I know some of you guys still don't
really know what Assurance is it's quite um it's
a B2B firm but so we do report both of
our models and and everything through
your government at the end of the day so
when you go to like a Banking Company
you go to any companies out there
definitely you have to make sure you're
in compliance right there and how do we
make sure we're in compliance of course
we want to make sure our code is in
compliance before we hand it off to the
government and then if they tell us that
oh all these errors are wrong then we
have to go back and do all the rework
right just just just in general right
like um that's why you could hear people
say oh my God we need to make sure that
our code is fit to criteria it's like
open source it it's not going to be
stealing people's informations all the
stuff you hear about all these different
types of guard rails that you hear from
people these days right realistically
that makes sense you know as a developer
you sometimes like who cares right
that's probably the first you ask that's
true um you should definitely have the
freedom to develop as much as you want
but you also consider that in their
mindsets of these developers they always
make sure that their code is in
compliance so when you want those
recommendations from Copilot you want
recommendations that are in compliance
that actually helps your customer at the
end of the day you don't want just some
random code out there right so for
example let me give you a couple
examples when you use GitHub Copilot
for example and let's just do one
interaction real quick GitHub Copilot
um when they offer you
recommendations um
for example let me see have
here like they will they will always
hash out those keys for you they will
never
um they will never put those um SSH keys
for you because they don't want you to
see other people's codes right however
they understand the structure because
they designed the algorithm for you
already right so you might want to ask
like how do I connect to Azure um
machine Learning Studio for example
right you might ask something like this
because you want to understand how can I
connect to this type of service so that
I can run my code for example
right and
then you will wait for Copilot to tell
you what to do right so let's make this
a little
bigger Copilot will tell you okay Chris
what you need to do
install the
SDK um the software development kit and
then run your code okay Second Step
import this third step they'll tell you
is you need to include these different
types of Secrets right back right here
right back in the day what they did
before this Copilot got updated they put
xxxxx so they put these guard rails
already for your program so that people
because sometimes it's very dangerous
because as human beings we don't
take it for granted that um putting
these secrets sometimes is very
dangerous but most of the times in your
programs in these um GPT programs
there's post-processing work before the
agent actually sends you the result
and gives it to you as its output they
always do a little cleansing to it they
censor the program and they make sure that
these are hashed out already so
from here you might not see it but
definitely um it could be there for
example so this is just how for example
how they provide you the best
recommendation you know but for your
case right this is specifically
generated for the specific generality of
your customer right but your customer is
only me it's only internal so if your
data is already already private and we
live in the same ecosystem the same
Assurance ecosystem then you shouldn't
have to worry about that okay so that is
one thing about how there's a difference
between Copilot versus using private
repositories because you need to have
the mindset that like this is going to
be sharable with me and most of the
times these things are already in the
what it's called is
a key vault key vault is just a
fancy word for a database a way for me
to store this information so people
cannot view it unless you're admins for
example that's just a fancy way to say
that right there um so for going back to
your problem right here is after you
create that um private repository make
sure you have some type of data that's
um that's very similar to each other too
you know because you have to understand
a model right when a model actually
um learns from all this information and
we expose the API right they're
basically you know using this as like a
vector database right like how you guys
were describing where I think it was you
guys right like treat this like a vector
database for example and querying that
information right and then using that
ChatGPT model to help incorporate that
information right so this is how I think
it was a great idea when your team
members talked about how we can use RAG this is
a great way too
another great way too is also use graph
databases for example when you go to
a certain repository this is just a demo
repository there's certain keywords
that you would put inside of here for
example and those keywords are
typically what you would want your um
Vector database to understand and come
up with right for example so actually
let me give you a more realistic example
this is from one of my Capstone projects
I work for our team
oh please do not hack me okay that's
what I say please do not
hack me
see GI I don't know my password so I
need to look it up h
sorry I'm just signing in you guys it's
quite slow I apologize yeah no worries
yeah um I was sort of like doing a bit
of research on my end like beforehand um
something I was looking into was sort of
like getting that data from the repos so
I was looking at like documentation for
um the GitHub rest API right so you can
get like all the different information
like you can get information from like
the readme.md or if you're doing like a
python like project get Thea from appp
yeah yeah so yep definitely and that's
where we would get that Source
information because if you create a
really good README file like this is one
of my Capstone projects I can share with
you guys um that we work with Dr Tech on
um this is one of their products that
they help us develop um this is you
know they could incorporate little
information that's fine but at least you
see how like README files should look
like they tell you the project name for
example they tell you like specific
keywords like about um what type of
packages you need to be installed right
these are all important information your
language model needs to understand right
dependencies um how to run the pipeline
for example these are all information
you're feeding to your model for example
right and parameters explanation is just
another way for us to identify those
certain parameters that needed in order
to run your program for example right so
you have to understand like um a human
how we think you know and how does a
computer think there actually very
similar because like when we think yes
we have that contextual awareness
because we're able to look at one
problem but we're also able to um cross
reference other materials that from our
past experience right now you have think
of a computer a computer Only Knows yes
or no so how can you help your computer
understand what's happening right here
that's why people are like oh let's
create a vector database let's create a
graph database so we can incorporate
that information that our computer
doesn't know because at the end of the
day it's true that model that language
large language model can only understand
a certain amount of information but how
can you help your model understand more
this is how you can if yes definitely
having good documentation is very good
because it can help your model to learn
it and another thing that's very
important too is um you spend less time
on
um on trying to cleanse your data because
I know lots of us spend lots of time
during our machine learning courses to
cleanse our data right clean our data
make sure our data is clean make sure
that our data is imputed and stuff like
that right I will tell you today that
nobody I don't I don't even do
imputation anymore okay I let the model
do it itself okay that's what I do okay
because I'm lazy I'm a very lazy person
I if I can find something that can help
me solve the problem faster I would do
it these days this is just more for you
to have think like a first think like a
scientist okay so first put your
scientist hat on don't think about
engineering don't think about the
limitations of what your program can do
that's the that's the first thing don't
think about that first have an idea how
the process flow and what you want now
after you have that put down your
scientist hat put back on your
engineering hat on in your engineering
hat how can you make it happen that's
why I would say best for you because
sometimes as human beings we do have a
lot of bias because we're like oh we
were taught that we cannot do it this
way because the technology doesn't exist
but to be honest if you want something
to happen you can make it happen it's
just whether or not you don't want to
make it happen so that's my best tip to
you guys is don't think like that it's
very if you want to be a scientist a
scientist we have to think outside the
box like right right now for example you
could tell me that um to gather data is
using a gury right and you're saying
Chris how can I do this what I would
tell you is Anything Can Happen what you
can do is simulate the gur yourself
write the write the code yourself and
simulate it you know so that's where I
would say know don't
because I know people always put these
limitations on how you think but I was
just say probably don't think like that
because that will give you more more
trouble for yourself because then you'll
be trying to find different types of
tools to put them together and you won't
be able to create a holistic solution so
that's some things I've noticed from
some people in my past Capstone projects that
they're all fixated on these fancy tools
but they're not thinking like a
scientist they're thinking more like an
engineer if something breaks How can I
how can I put a patch on it every single
time but I guarantee you putting patches
is a good thing for short-term purposes
but long-term perspective something is
going to break so you have to understand
algorithm design is very important so
it's very important for you guys to
think holistically how the solution
should look like so this is the reason
why people have now started investing
investing companies have started
investing so much money on cleansing
this data making a format very pretty so
you can use it and that's the reason why
your professors would teach you okay
guys in order to do professional
documentation you need to write
something like this right here because
not only I mean of course your professor
didn't back in the day they didn't do it
because they want they wanted for you to
use your large language model but the
idea is now flowing that you do need it
okay so that's what I'm trying to make
you guys think like outside a box and
see how everything fits together and
don't think just like a like um in one
side think about from all perspectives
how this really helps you right here um
and then you know some of your keyword
from your program when you when you send
that query to um co-pilot you want to
have you want to understand certain
scenarios right so in your codebase
there could be different types of um
known bugs for example so these are
release notes that they put under here
right there are special keywords that
they tell you what your program can and
cannot do so yes there could be certain
programs in your um co-pilot program
there are certain keywords that can help
your co-pilot pick up these words and
send a quick query to your to your
customer for example that's where you
have to think about as a human being
this is how I think but now how can I
help my computer my machine to think
like how I want to think like right
there so this is one example I can give
you guys for example of how to look at
the problem differently if that makes
sense um because another way too also um
I can tell you guys that this problem
might be a little difficult is um
after you look at the result you're like
the result is wrong I have to change
something but I want you to spend a
little bit of time to analyze your
output analyze your output if you do
some output
analysis understand why your model is
performing this way after you understand
why your model is performing this way it
could either be you need to add an extra
postprocessing to your program or is it
because the underlying LLM methodology
is wrong so spend a little time to
analyze that part because that part is
the hard part right there of how to
analyze your results because they could
have hallucinations and stuff like that
right so this is where I do see some
difficulty in this problem right there
is um how do you know your results would
be consistent you don't know because you
add a temperature to your problem you
add some Randomness and you want that
Randomness because you want something
like that so just make sure that um
just make sure about that those are also
some some tips I can tell you guys um
about how you can approach this problem
a little bit okay like um just go
through some very good and also just
because you're collecting a lot of
repositories make sure to collect some
repositories that are actually useful okay
that can help you solve the problem okay
pick some like how I would pose a
problem is I would probably have like
um just a scratch note if I have
repositories 1a 1b 1c for example
and we have up to like what 1z for
example at least pick the ones that
correlate to your specific query that
you want to and test it do those unit
testing okay so how can you test it do
some unit
testing and
see what results you get you
know you get and then analyze it and
then do then do output
analysis and see what's happening and
then you can understand what this will
help you do is understand why your
output is like
that is it my llm
methodology or
um um an extra an interpret an extra
post processing postprocessing I get
it so these are the things I would tell
you to think about you know when you
start when you start you know
implementation on your next step let's
think about these things right here and
then you then it will help you to
understand what type of repositories you
actually need for your program right
here and I'll send I'll send you guys
this don't don't take a picture I'll
send it to you guys okay don't worry
okay I'll send you guys this okay um but
yeah and like I said you need any help
feel free to reach out okay um but these
are the high-level things I could think of
off the top of my head about what
could be possible and I'll send you a
few articles about um how
people measure the effectiveness of
their LLM program right
here because this this space is actually
quite new too even to me it's quite new
okay so yeah I I'm still learning this
part too okay um so hopefully this will
help you guys I know you guys have
another session but um like I said you
need anything feel free to reach out to
me
okay well hopefully you guys have a nice
rest of your weekend okay yeah