Sunday, October 29, 2023

AI Opportunities in 2023 - Lecture by Dr. Andrew Ng

AI as a general purpose tech...

  • is useful for lots of different applications, just as electricity is good for a lot of things
  • AI collection: Supervised learning and Generative AI (in focus today) + Unsupervised learning and Reinforcement learning
    • Supervised learning: Good for labelling things e.g. 
                    e-mail >> spam or not, ship-route >> fuel consumed, Ad & user-info >> will click
      • Workflow of Supervised learning apps: e.g. restaurant reviews classification

                            Collect dataset >> label data >> train a model >> deploy >> run

  • Last decade was the decade of large-scale supervised learning. Small AI models could be built on not-very-powerful computers; they performed well on modest amounts of data, but their performance flattened out even as data grew. With large AI models, however, performance keeps scaling better and better with more data
  • This decade is adding to it the excitement of Generative AI
    • When we train a very large AI model on a lot of data, we get an LLM like ChatGPT
    • RLHF and other techniques tune AI output to be more helpful, honest and harmless
    • And at the heart of Generative AI is (supervised learning again) the repeated prediction of the next sub-part (word/token) given the data it has seen
  • The power of LLMs as a developer (not programmer) tool: 
    • With prompt based AI - the workflow is:

                    Specify prompt >> Deploy to cloud (e.g. build a restaurant review system in a few days)

  • Opportunities: massive value will be created with Supervised learning and Generative AI together, by identifying and executing concrete use cases
    • Supervised learning will double in size and Generative AI will much more than double
    • for new start-ups and for large enterprises / companies
  • Lensa was an indefensible use case as it did not add durable value; Airbnb and Uber are defensible because they create value 
  • The work ahead is to find the many diverse, value adding and defensible use cases

  • Refer to the "Potential AI projects space curve"
    • Advertising and web search are the only large money-making domains, with millions of users
    • As we go to the right of the curve, some example projects of interest may be:
      • Food inspection: cheese spread evenly on a pizza
      • Wheat harvesting: how tall is the wheat crop, at what height should it be chopped off
      • Materials grading, cloth grading...
    • Clearly industries other than advertising and web-search have a very long tail of $5 mn projects but with a very high cost of customisation
      • So, the AI community needs to continue building better tools to help aggregate such use cases and make it easy for end users to do the customisations at affordable costs
        Instead of needing to worry about pictures of pizza itself, the AI community will create tools that enable the IT department of the pizza factory to train an AI system on their own pizzas - thus realising the $5 mn of value by leveraging low/no-code AI tools

Referring to the AI Stack...

  • H/W semi-conductor layer at bottom is very capital intensive and very concentrated
  • Infrastructure layer above the semi-conductor layer is also highly capital intensive and very concentrated
  • Developer tools layer is hyper-competitive, and only a few will be winners
  • All the above said layers can be successful only if the application layer on top is even more successful e.g. Amorai - app for romantic relationships coaching
  • The recipe for building startups, "don't rush to solutions", has been inverted: with AI we can now rush to a solution and test it while still keeping it cost effective
              Ideas >> Validate >> Get CEO >> Prototype (early users) >> Pre-seed Growth (MVP) >> Seed, Growth & Scale
  • Concrete ideas can be validated or falsified efficiently
  • Even highly profitable projects that are low on ethics will / should be killed
  • AGI is still decades away
  • Other areas of interest may be predicting next pandemic, climate change predictions... 

Saturday, October 28, 2023

Materialized Views

Introduction

Common, frequent queries against a database can become expensive. When the same query is run again and again, it makes sense to compute it once and reuse the results. Materialized views address this need by enabling common queries to be represented by a database object that is continuously updated as data changes. 

A View...

  • is a derived relation defined in terms of stored base relations (generally tables) 
  • defines a SQL transformation from a set of base tables to a derived table; this transformation is typically recomputed / re-compiled every time the view is referenced in a query 
  • when created, does not compute any results nor does it change how data is stored or indexed
  • is a saved query on tables of a DB 
  • is referenced in queries as if it were a table

Example:

CREATE VIEW user_purchase_summary AS SELECT
  u.id as user_id,
  COUNT(*) as total_purchases,
  SUM(p.amount) as lifetime_value
FROM users u
JOIN purchases p ON p.user_id = u.id
GROUP BY u.id;

Every time a query referencing a view is executed, the database first computes the results of the view, and then computes the rest of the query using those results.
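
For instance, the view above can then be referenced as if it were a table; the filter below is just illustrative, and the join and aggregation are re-run each time it executes:

SELECT user_id, lifetime_value
FROM user_purchase_summary
WHERE lifetime_value > 1000;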

A Materialized View...

  • takes a regular view and materializes it by computing its results upfront and storing them in a “virtual” table 
  • is like a cache, i.e. a copy of the data that can be accessed quickly
  • is a regular view “materialized” by storing tuples of the view in the database
  • can have index structures and hence database access to materialized views can be much faster than recomputing the view

Example:

CREATE MATERIALIZED VIEW user_purchase_summary AS SELECT
  u.id as user_id,
  COUNT(*) as total_purchases,
  SUM(CASE WHEN p.status = 'cancelled' THEN 1 ELSE 0 END) as cancelled_purchases
FROM users u
JOIN purchases p ON p.user_id = u.id
GROUP BY u.id;

A regular view is a saved query; a materialized view is a saved query along with its results stored as a table.

Implications of materializing a view

  1. When referenced in a query, a materialized view is not recomputed, as the results are pre-stored; hence querying materialized views tends to be faster
  2. Because it’s stored as if it were a table, indexes can be built on the columns of a materialized view (see the sketch below)
  3. Once a view is materialized, it is only accurate until the underlying base relations are modified. The process of updating a materialized view in response to changes in the underlying base relations is called view maintenance.
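
For implication 2, a minimal sketch in PostgreSQL syntax (the index name is illustrative):

CREATE INDEX idx_user_purchase_summary_user_id
ON user_purchase_summary (user_id);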

A “view” is an anchored perspective on changing inputs: its results are constantly changing as the underlying data changes. Materialization just implies that the transformation is done proactively, so "materialized views" should update automatically.

However, in practice, some databases need materialized views to be manually refreshed and others have implemented automatic updates, albeit with limitations. 
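
In PostgreSQL, for example, the refresh is an explicit statement; the CONCURRENTLY variant keeps the view readable during the refresh but requires a unique index on it:

REFRESH MATERIALIZED VIEW user_purchase_summary;
-- or, to keep serving reads while recomputing:
REFRESH MATERIALIZED VIEW CONCURRENTLY user_purchase_summary;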

Note: MySQL does not support materialized views as of now. Oracle, Snowflake, MongoDB, Redshift, PostgreSQL, and others do.

Materialized views are used...

  • when the SQL query is known ahead of time and needs to be repeatedly recalculated
  • primarily for caching the results of extremely heavy and complex queries that cannot be run frequently as regular views
  • as the ability to define (using SQL) any complex transformation of data in the DB, and let the DB maintain the results in a “virtual” table
  • when low end-to-end latency is required between when data originates and when it is reflected in a query
  • when low-latency query response times are expected under high concurrency or a high volume of queries

Use of materialized views in...

Applications: Incrementally updated materialized views can be used to replace the caching and denormalization traditionally done to “guard” OLTP databases from read-side latency and overload. Instead of waiting for a query and doing computation to get the answer, we are now asking for the query upfront and doing the computation to update the results as the writes (creates, updates and deletes) come in. This inverts the constraints of traditional database architectures, allowing developers to build data-intensive applications without complex cache invalidation or denormalization.

Analytics: ELT bulk-loads raw data into a warehouse and then transforms it via complex SQL. The transformation may use regular views (i.e. no caching - used when recomputation is not overly slow), or cached tables built from the results of a SELECT query (used when regular views slow queries down due to re-computation), or incrementally updated tables (but the user is responsible for writing the update strategy).

OR, use the fourth option, i.e. materialized views: they remain more up-to-date and more automated, and are less error-prone than cached tables (the end-user burden of deciding when and how to update is minimized). 

Monday, July 10, 2017

Ethereum mining on AWS


  • Ethereum mining works only on g2.2xlarge or g2.8xlarge instances with Ubuntu 14.04 or later
  • Port 30303 must be opened for both TCP and UDP connections from `anywhere` (in security group settings)
  • The default Ubuntu available with ec2 is minimal, i.e. some DRM kernel modules required for the OS to see the GPU drivers are missing. SSH into your machine and run the following steps to fix this:
> sudo apt-get install linux-generic  
(Click OK for default option/s when prompted)
> sudo reboot
  • Download the CUDA drivers for ec2 instances (they use Nvidia units). Working with the .deb package (instead of .run) is easier (local or network makes no difference) 
> wget http://developer.download.nvidia.com/compute/cuda/7_0/Prod/local_installers/rpmdeb/cuda-repo-ubuntu1404-7-0-local_7.0-28_amd64.deb

(Newer versions are available here)

> sudo dpkg -i <cuda repo package>
> sudo apt-get update
> sudo apt-get install cuda
  • Run the following command to check that the driver is installed: 
> lshw -c video
  • A line that starts with "Configuration:" should mention "...driver=nvidia...". If it doesn't, search carefully or try a reboot. 
  • If you see "...driver=nouveau..." instead of "...driver=nvidia...", then something is wrong - google how to get rid of nouveau and reinstall CUDA.
  • Build geth from source, refer here 
  • run geth to allow it to catch up on the chain: 
> ~/go-ethereum/build/bin/geth
  • install ethminer from cpp-ethereum dev PPAs, refer here 
  • Use the following command to benchmark ethminer, checking the current hash rate (~6 MH/s) to confirm that your system is in order: 
> ethminer -G -M 
  • When geth catches up on the blockchain, use the following command to generate a new account: 
> ~/go-ethereum/build/bin/geth account new
  • start geth again with RPC enabled by using command-line below 
> ~/go-ethereum/build/bin/geth --rpc
  • Execute the following command to start ethminer:
> ethminer -G
  • If using the larger g2 instance with 4 GPUs, ethminer needs to be started 4 times, each time adding a "--opencl-device <0..3>" argument, i.e. "ethminer -G --opencl-device 0" through "ethminer -G --opencl-device 3"
  • Check the logs carefully; ethminer should be getting work packages from geth and be "mining a block"

Sunday, December 27, 2015

How to...

Make YouTube videos run faster

Open Google Chrome
Ctrl + Shift + J - opens Developer Tools; ensure you are on the Console tab
At the prompt, copy and paste the script below:

document.getElementsByTagName("video")[0].playbackRate = 2.5

Instead of playbackRate = 2.5, you can set any other floating-point number between 1.00 and 4.00

Create a rss feed for "The Thlog"

https://thethlog.blogspot.com/feeds/posts/default?alt=rss

Replace "thethlog" with name of any blog on blogspot, use it for another blog hosted on blogspt. Now, add it to a feed URL to an RSS reader (e.g. Feedly) OR be more creative with GPT.


Saturday, March 21, 2015

Sybase 12.5 to Sybase 15.5 migration - differences that I learnt about

1. Login triggers were introduced in ASE 15.0. A regular ASE stored procedure can be automatically executed in the background on successful login by any user.
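
A minimal sketch of wiring one up (the procedure and the login_audit table are hypothetical; per the ASE docs, the trigger is attached via sp_modifylogin's "login script" option):

create procedure login_audit_proc
as
    -- record who logged in and when (login_audit is a hypothetical table)
    insert into login_audit (login_name, login_time)
    select suser_name(), getdate()
go

exec sp_modifylogin "jsmith", "login script", "login_audit_proc"
go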

2. Fast bcp is allowed for indexed tables in ASE 15.0.2 and above. bcp works in one of two modes

  • Slow bcp - logs all the row inserts made; is slower and is used for tables that have one or more indexes
  • Fast bcp - only page allocations are logged; used for tables without indexes when the fastest possible speed is required; can also be used for tables with non-clustered indexes (15.0.2 and above)
3. sp_displaylogin displays when and why a login was locked & also when you last logged in. 

4. Semantic partitions/smart partitioning: ASE 15 makes large databases easy to manage and more efficient by allowing you to divide tables into smaller partitions which can be individually managed. You can run maintenance tasks on selected partitions to avoid slowing overall performance, and queries run faster because ASE 15's smart query optimizer bypasses partitions that don't contain relevant data 
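
A sketch of the range-partitioning syntax (table, column, and partition names are illustrative):

create table sales
(sale_id int, sale_date datetime, amount money)
partition by range (sale_date)
(p2014 values <= ('Dec 31 2014'),
 p2015 values <= ('Dec 31 2015'))
go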

5. With large data sets, filing through a mountain of results data can be difficult. ASE 15's bi-directional scrollable cursors make it convenient to work with large result sets because your application can easily move backward and forward through a result set, one row at a time. This especially helps with Web applications that need to process large result sets but present the user with subsets of those results 
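
A sketch of the scrollable cursor syntax (names are illustrative):

declare sales_cur scroll cursor for
    select sale_id, amount from sales order by sale_id
go
open sales_cur
fetch next sales_cur    -- move forward one row
fetch prev sales_cur    -- move backward one row
fetch last sales_cur    -- jump to the final row
close sales_cur
deallocate cursor sales_cur
go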

6. Computed columns: Often applications repeat the same calculation over and over for the same report or query. ASE 15 supports both virtual and materialized columns based on server calculations. Columns can be the computed result of other data in the table, saving that result for future repeated queries 
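
A sketch of a computed column (table and columns are illustrative); "materialized" stores the computed result in the row, while the default virtual form recomputes it on each read:

create table order_lines
(qty        int,
 unit_price money,
 line_total as qty * unit_price materialized)
go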

7. Functional indexes: When applications need to search tables based on the result of a function, performance can suffer. Functional indexes allow the server to build indexes on a table based on the result of a function. When repeated searches use that function, the results do not need to be computed from scratch 
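
A sketch of a function-based index (table and index names are illustrative):

create index idx_cust_lower_name
on customers (lower(cust_name))
go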

8. Plan viewer in the form of a GUI: Plans for solving complicated queries can become very complex and make troubleshooting performance issues difficult. To make debugging queries simpler, ASE 15 provides a graphical query plan viewer that lets you visualize the query solution selected by ASE's optimizer.

9. In ASE 15.0, update statistics and sp_recompile are not necessary after an index rebuild

10. ASE 15 allows you to assign two billion logical devices to a single server, with each device up to 4 TB in size. It supports up to 32,767 databases, and the maximum size limit for an individual database is 32 terabytes, extending the maximum storage per ASE server to over 1 million terabytes!

11. As of release 12.5.1, all changes to the data cache are dynamic

12. ASE 15.0 and later versions no longer use vdevno, i.e. the disk init syntax doesn't need to mention the vdevno parameter. 

13. The disk init syntax in 12.5 expects the size parameter in K, M, or G only. From 15.0 onwards, T (terabytes) can also be specified. 
Also, pre-15.0, the maximum size of a device was 32 GB 
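
A sketch of the post-15.0 disk init syntax, with no vdevno (device name, path, and size are illustrative):

disk init
    name = "data_dev1",
    physname = "/sybase/devices/data_dev1.dat",
    size = "8G"
go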

14. The configuration parameter 'default database size' was static in ASE 12. In ASE 12.5, it was made dynamic. 
For ASE 15.0, the table below is specified by Sybase. 

Logical page size               2K        4K        8K        16K
Initial default database size   3 MB      4 MB      8 MB      16 MB
All system tables, initially    1.2 MB    2.4 MB    4.7 MB    9.4 MB

15. Auto database extension was introduced in 12.5.1 and is supported in later versions.

16. The dump/load database and dump/load tran syntax differs between version 12.5.0.3 and 12.5.2 (and hence later versions). (See sybooks for more information. Compression levels 1-9 were introduced.) 

17. ASE 12.5.0.3 and earlier versions allowed only one tempdb in the server, but all later versions allow the creation of multiple temporary databases. 

18. Before 15.0, after changing a database option you needed to use that database and run checkpoint on it. ASE 15.0 no longer requires this. 

19. Restricting proxy authorization is available in 12.5.2 and later releases only. 

20. From version 12.5.2 onwards, cache creation is dynamic (sp_cacheconfig [cachename [, "cache_size[P|K|M|G]"]]). It was static earlier. 

21. Until 12.5.2, backing up a database with a password was not possible. ASE 12.5.2 and later allow dump database with passwd. 
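
A sketch of the syntax (database name, dump path, and password are illustrative):

dump database mydb to "/dumps/mydb.dmp"
    with passwd = "s3cret"
go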

22. Cross platform dumps and loads were introduced in ASE 12.5.3 

23. MDA tables (Monitoring and Diagnostic Tables) are available in 12.5.3 and later releases. 

24. Row Level Locking: In ASE 15.0 all the system tables were converted into datarows format.

25. Group by without order by: in ASE 15 a group by no longer implicitly sorts the results (the optimizer may use hash-based grouping), so add an explicit order by wherever ordering matters.