people cannot distinguish gpt-4 from a human in a turing test, by c. jones et al (2024)

bt3gl 2024-11-21 11:49:58 -08:00
parent 5a83a44f8e
commit 592f593e44
21 changed files with 78 additions and 53 deletions

deep_learning/README.md
@@ -0,0 +1,20 @@
## ai agents
<br>
* **[deep learning](deep_learning.md)**
* **[reinforcement learning](reinforcement_learning.md)**
<br>
---
### cool resources
<br>
* **[cursor ai editor](https://www.cursor.com/)**
* **[microsoft notes on ai agents](https://github.com/microsoft/generative-ai-for-beginners/tree/main/17-ai-agents)**
* **[google's jax (composable transformations of numpy programs)](https://github.com/google/jax)**
* **[machine learning engineering open book](https://github.com/stas00/ml-engineering)**
* **[advances in financial machine learning](books/advances_in_financial_machine_learning.pdf)**

@@ -0,0 +1,164 @@
## deep learning
<br>
### timeline tl; dr
<br>
* **[2012: imagenet and alexnet](https://github.com/tensorflow/models/blob/master/research/slim/nets/alexnet.py)**
* **[2013: atari with deep reinforcement learning](https://www.tensorflow.org/agents/tutorials/1_dqn_tutorial)**
* **[2014: seq2seq](https://www.tensorflow.org/addons/tutorials/networks_seq2seq_nmt)**
* **[2014: adam optimizer](https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/optimizer_v2/adam.py#L32-L281)**
* **[2015: gans](https://www.tensorflow.org/tutorials/generative/dcgan)**
* **[2015: resnets](https://github.com/tensorflow/tensorflow/blob/v2.2.0/tensorflow/python/keras/applications/resnet.py)**
* **[2017: transformers](https://github.com/huggingface/transformers)**
* **[2018: bert](https://arxiv.org/abs/1810.04805)**
<br>
---
### deep reinforcement learning for trading
<br>
* a markov decision process (mdp) consists of a set of states, a set of actions, a transition function that describes the probability of moving from one state to another after taking an action, and a reward function that assigns a numerical reward to each state-action pair
* the agent's goal in an mdp is to maximize its expected cumulative reward over a sequence of actions chosen according to a policy.
* a policy is a function that maps each state to a probability distribution over actions. the optimal policy is the one that maximizes the expected cumulative reward.
* the problem of reinforcement learning can be formalized using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes.
* as opposed to supervised learning, the agent must be able to learn from its own experience; and as opposed to unsupervised learning, reinforcement learning tries to maximize a reward signal instead of trying to find hidden structure.
* the agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future. on a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward.
* beyond the agent and the environment, one can identify four main subelements of a reinforcement learning system: a policy, a reward signal, a value function, and, optionally, a model of the environment.
* traditional reinforcement learning problems can be formulated as a markov decision process (MDP):
* we have an agent acting in an environment
    * at each step *t* the agent receives as input the current state S_t, takes an action A_t, and receives a reward R_{t+1} and the next state S_{t+1}
    * the agent chooses the action based on some policy pi: A_t = pi(S_t)
    * our goal is to find a policy that maximizes the cumulative reward sum_t R_t over some finite or infinite time horizon (see the loop sketch after the diagram below)
<br>
<img width="500" src="https://user-images.githubusercontent.com/1130416/227799494-d62aab7f-d6cf-419f-be03-1d2dbdee1853.png">
<br>
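a minimal sketch of this interaction loop in code, using the gymnasium api; the environment name and the random policy are placeholders for illustration, not a trading setup:

```python
# minimal sketch of the S_t, A_t, R_{t+1} loop above, using the gymnasium api.
# "CartPole-v1" is just a placeholder environment; the random policy stands in for pi(S_t).
import gymnasium as gym

env = gym.make("CartPole-v1")
state, info = env.reset(seed=42)

total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()                              # A_t = pi(S_t)
    state, reward, terminated, truncated, info = env.step(action)   # receive R_{t+1}, S_{t+1}
    total_reward += reward                                          # accumulate sum_t R_t
    done = terminated or truncated

env.close()
print(f"episode return: {total_reward}")
```
<br>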
#### agent
<br>
* the agent is the trading agent (e.g. the human trader who opens the gui of an exchange and makes trading decisions based on the current state of the exchange and their account)
<br>
#### environment
<br>
* the exchange and other agents are the environment, and they are not something we can control
* by putting other agents together into some big complex environment, we lose the ability to explicitly model them
* trying to reverse-engineer the algorithms and strategies that other traders are running puts us into a multi-agent reinforcement learning (MARL) problem setting
<br>
#### state
<br>
* in the case of trading on an exchange, we don't observe the complete state of the environment (e.g. other agents), so we are dealing with a partially observable markov decision process (pomdp).
* what the agents observe is not the actual state S_t of the environment, but some derivation of that.
* we can call the observation X_t, which is calculated using some function of the full state X_t ~ O(S_t)
* the observation at each timestep t is simply the history of all exchange events received up to time t.
* this event history can be used to build up the current exchange state; however, in order for our agent to make decisions, extra info such as account balance and open limit orders needs to be included.
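
a rough sketch of what an observation X_t could bundle at each step; the field names below are illustrative assumptions, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    """hypothetical observation X_t derived from the exchange event history up to time t."""
    # market state reconstructed from the event stream
    best_bid: float
    best_ask: float
    bid_volume: float
    ask_volume: float
    recent_trades: list = field(default_factory=list)
    # extra account info the agent needs in order to act
    quote_balance: float = 0.0
    base_balance: float = 0.0
    open_orders: list = field(default_factory=list)
```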
<br>
#### time scale
<br>
* hft techniques: decisions are based almost entirely on market microstructure signals, are made on nanosecond timescales, and rely on dedicated connections to exchanges and extremely fast but simple algorithms running on fpga hardware.
* neural networks are slow: they can't make predictions on nanosecond time scales, so they can't compete with the speed of hft algorithms.
* guess: the optimal time scale is between a few milliseconds and a few minutes.
* can deep rl algorithms pick up hidden patterns?
<br>
#### action space
<br>
* the simplest approach has 3 actions: buy, hold, and sell. this works but limits us to placing market orders and to investing a deterministic amount of money at each step.
* in the next level we would let our agents learn how much money to invest, based on the uncertainty of our model, putting us into a continuous action space.
* in the next level, we would introduce limit orders, and the agent needs to decide the level (price) and quantity of the order, and be able to cancel orders that have not yet been matched (see the sketch below).
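
a hedged sketch of these three levels expressed as gymnasium spaces; the bounds and the integer encodings are assumptions for illustration:

```python
import numpy as np
from gymnasium import spaces

# level 1: market orders only -- a discrete choice
simple_actions = spaces.Discrete(3)  # 0 = sell, 1 = hold, 2 = buy

# level 2: continuous position sizing; sign gives direction, magnitude the fraction of capital
sized_actions = spaces.Box(low=-1.0, high=1.0, shape=(1,), dtype=np.float32)

# level 3: limit orders -- order type, price level, and quantity (plus a cancel action)
limit_actions = spaces.Dict({
    "order_type": spaces.Discrete(4),  # 0 = hold, 1 = limit buy, 2 = limit sell, 3 = cancel
    "price": spaces.Box(low=0.0, high=np.finfo(np.float32).max, shape=(1,), dtype=np.float32),
    "quantity": spaces.Box(low=0.0, high=np.finfo(np.float32).max, shape=(1,), dtype=np.float32),
})
```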
<br>
#### reward function
<br>
* there are several possible reward functions; an obvious one would be realized pnl (profit and loss), where the agent receives a reward whenever it closes a position.
* the net profit is either negative or positive, and this is the reward signal.
* as the agent maximizes the total cumulative reward, it learns to trade profitably. the reward function leads to the optimal policy in the limit.
* however, buy and sell actions are rare compared to doing nothing; the agent needs to learn without receiving frequent feedback.
* an alternative is unrealized pnl, which is the net profit the agent would get if it were to close all of its positions immediately.
* because the unrealized pnl may change at each time step, it gives the agent more frequent feedback signals. however, the direct feedback may bias the agent towards short-term actions.
* both naively optimize for profit, but a trader may want to minimize risk (lower volatility)
* using the sharpe ratio is one simple way to take risk into account; another is maximum drawdown.
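
rough sketches of the reward signals discussed above (realized pnl, unrealized pnl, and a rolling sharpe ratio); the function signatures are assumptions, not a standard api:

```python
import numpy as np

def realized_pnl_reward(entry_price: float, exit_price: float, quantity: float, fees: float = 0.0) -> float:
    """reward only when a position is closed: net profit of the round trip."""
    return (exit_price - entry_price) * quantity - fees

def unrealized_pnl_reward(position: float, entry_price: float, mark_price: float) -> float:
    """per-step reward: profit if all open positions were closed at the current mark price."""
    return position * (mark_price - entry_price)

def rolling_sharpe_reward(step_returns, eps: float = 1e-8) -> float:
    """risk-adjusted reward computed over a rolling window of per-step returns."""
    r = np.asarray(step_returns, dtype=np.float64)
    return float(r.mean() / (r.std() + eps))
```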
<br>
<img width="505" src="https://user-images.githubusercontent.com/1130416/227811225-9af06c79-3f86-48e8-899c-ee5a80bc91e1.png">
<br>
#### learned policies
<br>
* instead of needing to hand-code a rule-based policy, rl directly learns a policy
<br>
#### trained directly in simulation environments
<br>
* with traditional strategies we need separate backtesting and parameter optimization steps because it is difficult for them to take into account environmental factors: order book liquidity, fee structures, and latencies.
* in reinforcement learning, getting around environmental limitations is part of the optimization process: if we simulate the latency in the reinforcement learning environment and this results in the agent making a mistake, the agent will get a negative reward, forcing it to learn to work around the latencies (see the wrapper sketch after this list).
* by learning a model of the environment and performing rollouts using techniques like a monte carlo tree search (mcts), we could take into account potential reactions of the market (other agents)
* by being smart about the data we collect from the live environment, we can continuously improve our model
* do we act optimally in the live environment to generate profits, or do we act suboptimally to gather interesting information that we can use to improve the model of our environment and other agents?
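
a toy sketch of baking one environmental factor (latency) into the simulator as a wrapper; the delay model and `hold_action=0` encoding are assumptions, and the wrapped env is a hypothetical gymnasium-style trading environment:

```python
import random
import gymnasium as gym

class LatencyWrapper(gym.Wrapper):
    """toy sketch: delay each submitted action by a random number of steps, so latency
    becomes part of the environment the agent has to learn to work around."""

    def __init__(self, env, max_delay_steps: int = 3, hold_action=0):
        super().__init__(env)
        self.max_delay_steps = max_delay_steps
        self.hold_action = hold_action   # assumed encoding for "do nothing"
        self._pending = []               # list of [steps_remaining, action]

    def step(self, action):
        # queue the new action with a random delay instead of executing it immediately
        self._pending.append([random.randint(1, self.max_delay_steps), action])
        for item in self._pending:
            item[0] -= 1
        ready = [a for d, a in self._pending if d <= 0]
        self._pending = [[d, a] for d, a in self._pending if d > 0]
        # execute the oldest delayed action, or hold if nothing has arrived yet
        effective = ready[0] if ready else self.hold_action
        return self.env.step(effective)
```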
<br>
#### learning to adapt to market conditions
<br>
* some strategies may work better in a bearish environment but lose money in a bullish environment.
* because rl agents learn powerful policies parameterized by neural networks, they can also learn to adapt to market conditions by seeing them in historical data, given that they are trained over long time horizons and have sufficient memory.
<br>
#### trading as research
<br>
* the trading environment is a multiplayer game with thousands of agents acting simultaneously
* understanding how to build models of other agents is only one possibility; we can also choose to perform actions in a live environment with the goal of maximizing the information gain with respect to the kinds of policies the other agents may be following.
* trading agents receive sparse rewards from the market. naively applying reward-hungry rl algorithms will fail.
* this opens up the possibility for new algorithms and techniques that can efficiently deal with sparse rewards.
* many of today's standard algorithms, such as dqn or a3c, use a very naive approach to exploration: basically adding random noise to the policy. however, in the trading case, most states in the environment are bad and there are only a few good ones; a naive random approach to exploration will almost never stumble upon good state-action pairs.
* the trading environment is inherently nonstationary. market conditions change and other agents join, leave, and constantly change their strategies.
* can we train an agent that can transition from bear to bull and then back to bear, without needing to be re-trained?

@@ -0,0 +1,119 @@
## reinforcement learning
<br>
### tl; dr
<br>
* reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal
* an autonomous agent is a software program or system that can operate independently and make decisions on its own, without direct intervention from a human
<br>
---
### overview
<br>
* we formalize the problem of reinforcement learning using ideas from dynamical systems theory, as the optimal control of incompletely-known Markov decision processes.
* a learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state.
* markov decision processes are intended to include just these three aspects: sensation, action, and goal.
* the agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.
* on stochastic tasks, each action must be tried many times to gain a reliable estimate of its expected reward.
<br>
---
### elements of reinforcement learning
<br>
* beyond the agent and the environment, 4 more elements belong to a reinforcement learning system: a policy, a reward signal, a value function, and a model of the environment.
* a policy defines the learning agent's way of behaving at a given time. it's a mapping from perceived states of the environment to actions to be taken when in those states. in general, policies may be stochastic (specifying probabilities for each action).
* a reward signal defines the goal of a reinforcement learning problem: on each time step, the environment sends to the reinforcement learning agent a single number called the reward. the agent's sole objective is to maximize the total reward over the run.
* a value function specifies what is good in the long run: the value of a state is the total amount of reward an agent can expect to accumulate over the future, starting from that state.
* a model of the environment is something that mimics the behavior of the environment, allowing inferences to be made about how the environment will behave (models are used for planning).
* the most important feature distinguishing reinforcement learning from other types of learning is that it uses training information that evaluates the actions taken rather than instructs by giving correct actions.
<br>
---
### finite markov decision processes (mdps)
<br>
* the problem involves evaluative feedback and choosing different actions in different situations.
* mdps are a classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations.
* mdps involve delayed reward and the need to trade off immediate and delayed reward.
<br>
##### the agent-environment interface
* mdps are meant to be a straightforward framing of the problem of learning from interaction to achieve a goal.
* the learner and decision maker is called the agent.
* the thing it interacts with, comprising everything outside the agent, is called the environment.
* the environment gives rise to rewards, numerical values that the agent seeks to maximize over time through its choice of actions.
<br>
<img width="466" src="https://user-images.githubusercontent.com/1130416/228971927-3c574911-d0ca-4d2d-b795-8b0776599952.png">
<br>
* the agent and the environment interact at each of a sequence of discrete steps, t = 0, 1, 2, 3...
* at each time step t, the agent receives some representation of the environment's state S_t
* on that basis, the agent selects an action A_t
* one step later, in part as a consequence of its action, the agent receives a numerical reward R_{t+1} and finds itself in a new state S_{t+1}.
* the mdp and the agent together give rise to a sequence (trajectory)
* in a finite mdp, the sets of states, actions, and rewards all have a finite number of elements. in this case, the random variables R and S have well-defined discrete probability distributions dependent only on the preceding state and action.
* in a markov decision process, the probabilities given by p completely characterize the environment's dynamics.
* the state must include information about all aspects of the past agent-environment interaction that make a difference for the future.
* anything that cannot be changed arbitrarily by the agent is considered to be outside of it and thus part of its environment.
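
in the standard finite-mdp notation, the trajectory and the dynamics function p mentioned above are:

```latex
% trajectory generated by the agent-environment interaction
S_0, A_0, R_1, S_1, A_1, R_2, S_2, A_2, R_3, \dots

% dynamics of a finite mdp: the probability of landing in state s' with reward r,
% given the preceding state s and action a
p(s', r \mid s, a) \doteq \Pr\{ S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a \}
```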
<br>
##### goals and rewards
* each episode ends in a special state called the terminal state, followed by a reset to a standard starting state or to a sample from a standard distribution of starting states.
* almost all reinforcement learning algorithms involve estimating value functions: functions of states (or of state-action pairs) that estimate how good it is for the agent to be in a given state (or how good it is to perform a given action in a given state).
* the Bellman equation averages over all the possibilities, weighting each by its probability of occurring. it states that the value of the start state must equal the (discounted) value of the expected next state, plus the reward expected along the way (written out below).
* solving a reinforcement learning task means finding a policy that achieves a lot of reward over the long run.
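
written out, the bellman equation for the state-value function v_pi (averaging over actions, next states, and rewards, and discounting by gamma) is:

```latex
v_\pi(s) \doteq \mathbb{E}_\pi\!\left[ R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s \right]
        = \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\big[ r + \gamma\, v_\pi(s') \big]
```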
<br>
---
### dynamic programming
* collection of algorithms that can be used to compute optimal policies given a perfect model of the environment as an mdp.
* a common way of obtaining approximate solutions for tasks with continuous states and actions is to quantize the state and action spaces and then apply finite-state DP methods.
* the reason for computing the value function for a policy is to help find better policies.
* asynchronous DP algorithms are in-place iterative DP algorithms that are not organized in terms of systematic sweeps of the state set. these algorithms update the values of states in any order whatsoever, using whatever values of other states happen to be available. the values of some states may be updated several times before the values of others are updated once.
* policy evaluation refers to the (typically) iterative computation of the value function for a given policy (sketched below).
* policy improvement refers to the computation of an improved policy given the value function for that policy.
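
as a rough sketch of the evaluation step, assuming the finite-mdp dynamics are available as a python dictionary of (prob, next_state, reward) transitions (an illustrative layout, not a standard api), iterative policy evaluation could look like:

```python
def policy_evaluation(states, actions, policy, dynamics, gamma=0.9, theta=1e-6):
    """iterative policy evaluation: compute v_pi for a fixed policy on a finite mdp.

    policy[s][a]   -> probability of taking action a in state s
    dynamics[s][a] -> list of (prob, next_state, reward) transitions
    """
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # bellman expectation backup: average over actions, next states, and rewards
            v_new = sum(
                policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a])
                for a in actions
            )
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:   # stop when the largest update in a sweep is small enough
            return V
```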
<br>
##### generalized policy iteration
* policy iteration consists of two simultaneous, interacting processes, one making the value function consistent with the current policy (policy evaluation), and the other making the policy greedy with respect to the current value function (policy improvement).
* generalized policy iteration (GPI) refers to the general idea of letting policy-evaluation and policy-improvement processes interact, independent of the granularity and other details of the two processes.
* DP is sometimes thought to be of limited applicability because of the curse of dimensionality, the fact that the number of states often grows exponentially with the number of state variables.
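
a sketch of the policy-iteration flavor of gpi, reusing the `policy_evaluation` sketch above and the same assumed dictionary layout for the dynamics:

```python
def policy_improvement(states, actions, V, dynamics, gamma=0.9):
    """make the policy greedy with respect to the current value function."""
    policy = {}
    for s in states:
        q = {a: sum(p * (r + gamma * V[s2]) for p, s2, r in dynamics[s][a]) for a in actions}
        best = max(q, key=q.get)
        policy[s] = {a: 1.0 if a == best else 0.0 for a in actions}
    return policy

def policy_iteration(states, actions, dynamics, gamma=0.9):
    """alternate evaluation and improvement until the policy stops changing."""
    policy = {s: {a: 1.0 / len(actions) for a in actions} for s in states}
    while True:
        V = policy_evaluation(states, actions, policy, dynamics, gamma)
        new_policy = policy_improvement(states, actions, V, dynamics, gamma)
        if new_policy == policy:
            return policy, V
        policy = new_policy
```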
<br>
---
### cool resources
<br>
* **[gymnasium api](https://gymnasium.farama.org/)**
* **[reinforcement learning with unsupervised auxiliary tasks, by jaderberg et al.](https://arxiv.org/abs/1611.05397)**