Learning theme:
Competition link: https://data.xm.gov.cn/contest-series-api/promote/register/3/UrnA69nb.
Competition description

Bike sharing extends urban public transport and solves the "last kilometer" problem for citizens. However, as the sharing-economy model has been adopted by more and more residents and become a travel habit, a tidal phenomenon has emerged. The daily rhythm of working by day and resting by night, combined with concentrated commuting hours, produces a supply-demand contradiction during the morning and evening peaks: a bike is "hard to find" in some places while there is "nowhere to park" in others. Through comprehensive analysis of vehicle data, this competition aims to locate the tidal points on Xiamen Island during the morning peak, design crowd-intelligence optimization schemes for peak hours, and relieve the supply-demand imbalance at those points, thereby providing data support for the urban management department and bike-sharing operators as they formulate the next round of optimization measures.
Competition tasks

Task 1: To better grasp how the morning-peak tidal phenomenon varies, participants must analyze the data provided by the organizer and build a calculation model that identifies the 40 areas where the tidal phenomenon is most prominent between 07:00 and 09:00 on weekdays, listing the ID and name of the shared-bike parking spots contained in each area. A description of the calculation method and the model itself must be provided to support subsequent optimization measures.

Task 2: Based on the Top-40 areas from Task 1, participants further design a peak-hour optimization scheme for shared-bike tidal points that actively guides users to park at adjacent spots, shaving peaks and filling valleys to relieve parking congestion at tidal points (such as subway entrances). Participants may bring their own training data, but its source and usage must be explained in the entry and its legal compliance guaranteed. (Operators of urban public bicycles call the morning/evening-peak problem of bikes that "cannot be borrowed and cannot be returned" the "tidal phenomenon". This competition focuses on the "cannot be returned" side: identifying the 40 areas where shared bikes accumulate most heavily during the morning and evening peaks.)
Code
```python
import os, codecs
import pandas as pd
import numpy as np

PATH = './data/'
```
```python
# Shared-bike track data
bike_track = pd.concat([
    pd.read_csv(PATH + 'gxdc_gj20201221.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201222.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201223.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201224.csv'),
    pd.read_csv(PATH + 'gxdc_gj20201225.csv')
])

# Sort by bicycle ID and locating time
bike_track = bike_track.sort_values(['BICYCLE_ID', 'LOCATING_TIME'])
```
```python
import folium

# Plot one bicycle's trajectory on a map centered on Xiamen
m = folium.Map(location=[24.482426, 118.157606], zoom_start=12)
my_PolyLine = folium.PolyLine(
    locations=bike_track[bike_track['BICYCLE_ID'] == '000152773681a23a7f2d9af8e8902703'][['LATITUDE', 'LONGITUDE']].values,
    weight=5
)
m.add_child(my_PolyLine)  # add_children is a deprecated alias of add_child
```
```python
def bike_fence_format(s):
    # Parse the fence string into a 5x2 array of (longitude, latitude) vertices
    s = s.replace('[', '').replace(']', '').split(',')
    s = np.array(s).astype(float).reshape(5, -1)
    return s

# Shared-bike parking spot (electronic fence) data
bike_fence = pd.read_csv(PATH + 'gxdc_tcd.csv')
bike_fence['FENCE_LOC'] = bike_fence['FENCE_LOC'].apply(bike_fence_format)
```
```python
import folium

m = folium.Map(location=[24.482426, 118.157606], zoom_start=12)

# FENCE_LOC stores (longitude, latitude) pairs, so each point is reversed
# before being handed to folium, which expects (latitude, longitude)
for data in bike_fence['FENCE_LOC'].values[:100]:
    folium.Marker(list(data[0, ::-1])).add_to(m)
m
```
```python
# Shared-bike order data
bike_order = pd.read_csv(PATH + 'gxdc_dd.csv')
bike_order = bike_order.sort_values(['BICYCLE_ID', 'UPDATE_TIME'])
```
```python
import folium

# Plot the order locations of one bicycle
m = folium.Map(location=[24.482426, 118.157606], zoom_start=12)
my_PolyLine = folium.PolyLine(
    locations=bike_order[bike_order['BICYCLE_ID'] == '0000ff105fd5f9099b866bccd157dc50'][['LATITUDE', 'LONGITUDE']].values,
    weight=5
)
m.add_child(my_PolyLine)
```python
# Inbound passenger flow of rail stations
# (note: the files carry a .csv extension but are parsed with read_excel here)
rail_inflow = pd.read_excel(PATH + 'gdzdtjsj_jzkl.csv')
rail_inflow = rail_inflow.drop(0)

# Outbound passenger flow of rail stations
rail_outflow = pd.read_excel(PATH + 'gdzdtjsj_czkl.csv')
rail_outflow = rail_outflow.drop(0)

# Gate device codes of rail stations
rail_device = pd.read_excel(PATH + 'gdzdkltj_zjbh.csv')
rail_device.columns = [
    'LINE_NO', 'STATION_NO', 'STATION_NAME',
    'A_IN_MANCHINE', 'A_OUT_MANCHINE',
    'B_IN_MANCHINE', 'B_OUT_MANCHINE'
]
rail_device = rail_device.drop(0)
```
```python
# Latitude range of each parking spot
bike_fence['MIN_LATITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.min(x[:, 1]))
bike_fence['MAX_LATITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.max(x[:, 1]))

# Longitude range of each parking spot
bike_fence['MIN_LONGITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.min(x[:, 0]))
bike_fence['MAX_LONGITUDE'] = bike_fence['FENCE_LOC'].apply(lambda x: np.max(x[:, 0]))

from geopy.distance import geodesic

# Approximate each spot's size by the geodesic length (in meters) of its
# bounding-box diagonal; this is a size proxy rather than a true area
bike_fence['FENCE_AREA'] = bike_fence.apply(lambda x: geodesic(
    (x['MIN_LATITUDE'], x['MIN_LONGITUDE']),
    (x['MAX_LATITUDE'], x['MAX_LONGITUDE'])
).meters, axis=1)

# Center (latitude, longitude) of each parking spot; the last vertex
# repeats the first, so it is dropped before averaging
bike_fence['FENCE_CENTER'] = bike_fence['FENCE_LOC'].apply(
    lambda x: np.mean(x[:-1, ::-1], 0)
)
```
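Note that the geodesic call measures only the bounding-box diagonal. If a true polygon area is wanted instead, a planar shoelace formula on locally projected coordinates is accurate enough at city scale. The sketch below is our own alternative, not part of the competition code; the helper name `fence_polygon_area` is illustrative.

```python
import math

def fence_polygon_area(points):
    """Approximate area in m^2 of a small (longitude, latitude) polygon.

    points: list of (lon, lat) vertices; the polygon is closed implicitly.
    Uses a local equirectangular projection plus the shoelace formula,
    which is a good approximation for fences the size of a city block.
    """
    mean_lat = sum(lat for _, lat in points) / len(points)
    m_per_deg_lat = 111_320.0  # meters per degree of latitude
    m_per_deg_lon = 111_320.0 * math.cos(math.radians(mean_lat))
    xy = [(lon * m_per_deg_lon, lat * m_per_deg_lat) for lon, lat in points]
    area = 0.0
    # Shoelace sum over consecutive vertex pairs, wrapping at the end
    for (x1, y1), (x2, y2) in zip(xy, xy[1:] + xy[:1]):
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
```

Applied to the first four vertices of a fence (`bike_fence['FENCE_LOC'].iloc[0][:-1]`), this would give an area in square meters rather than a diagonal length.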
Geohash encodes a latitude/longitude pair into a short base-32 string; points that are close together usually share a common prefix, which makes it convenient for bucketing coordinates into grid cells.
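To make the encoding concrete, here is a minimal, dependency-free sketch of the standard geohash algorithm (the `encode` function below is illustrative only; the pipeline uses the `geohash` package):

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def encode(lat, lon, precision=6):
    """Standard geohash: interleave longitude/latitude bisection bits,
    then pack each group of 5 bits into one base-32 character."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    bits = []
    use_lon = True  # geohash starts with a longitude bit
    while len(bits) < precision * 5:
        rng, val = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if val >= mid:
            bits.append(1)
            rng[0] = mid
        else:
            bits.append(0)
            rng[1] = mid
        use_lon = not use_lon
    return ''.join(
        _BASE32[int(''.join(map(str, bits[i:i + 5])), 2)]
        for i in range(0, len(bits), 5)
    )

print(encode(57.64911, 10.40744, precision=11))  # -> u4pruydqqvj
```

By construction, a lower-precision hash is always a prefix of the higher-precision hash of the same point, which is what makes prefix matching useful for coarse-to-fine grouping.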
```python
import geohash  # pip install python-geohash

bike_order['geohash'] = bike_order.apply(
    lambda x: geohash.encode(x['LATITUDE'], x['LONGITUDE'], precision=6),
    axis=1)

# FENCE_CENTER is already (latitude, longitude)
bike_fence['geohash'] = bike_fence['FENCE_CENTER'].apply(
    lambda x: geohash.encode(x[0], x[1], precision=6)
)
```
```python
# Inspect the orders that fall in one geohash cell
bike_order[bike_order['geohash'] == 'ws7gx9']
```
Regional flow and tide statistics

Having matched coordinates to cells, the next step is to compute per-region flow statistics, i.e. the inflow and outflow of each area at different times.

First, extract the date and hour from the order timestamps:
```python
bike_order['UPDATE_TIME'] = pd.to_datetime(bike_order['UPDATE_TIME'])
bike_order['DAY'] = bike_order['UPDATE_TIME'].dt.day.astype(object)
bike_order['DAY'] = bike_order['DAY'].apply(str)

bike_order['HOUR'] = bike_order['UPDATE_TIME'].dt.hour.astype(object)
bike_order['HOUR'] = bike_order['HOUR'].apply(str)
bike_order['HOUR'] = bike_order['HOUR'].str.pad(width=2, side='left', fillchar='0')

# Concatenate day and hour, e.g. day 21, hour 07 -> '2107'
bike_order['DAY_HOUR'] = bike_order['DAY'] + bike_order['HOUR']
```
Use a pivot table to count the inflow (lock events) and outflow (unlock events) of each area at different times:
```python
# LOCK_STATUS == 1 marks a lock event (bike parked, i.e. inflow)
bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1],
                             values='LOCK_STATUS', index=['geohash'], columns=['DAY_HOUR'],
                             aggfunc='count', fill_value=0)

# LOCK_STATUS == 0 marks an unlock event (bike ridden away, i.e. outflow)
bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0],
                              values='LOCK_STATUS', index=['geohash'], columns=['DAY_HOUR'],
                              aggfunc='count', fill_value=0)
```
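As a sanity check on what this pivot produces, here is a tiny made-up example (toy data, not from the competition set):

```python
import pandas as pd

# Toy orders: geohash cell, day-hour bucket, lock status
toy = pd.DataFrame({
    'geohash':     ['aaa', 'aaa', 'aaa', 'bbb'],
    'DAY_HOUR':    ['2107', '2107', '2108', '2107'],
    'LOCK_STATUS': [1, 1, 0, 1],
})

inflow = pd.pivot_table(toy[toy['LOCK_STATUS'] == 1],
                        values='LOCK_STATUS', index=['geohash'],
                        columns=['DAY_HOUR'], aggfunc='count', fill_value=0)

# Cell 'aaa' received two bikes in bucket 2107, cell 'bbb' received one
print(inflow.loc['aaa', '2107'])  # -> 2
```

Each row of the pivot is a cell, each column a time bucket, and each value a count of lock (or unlock) events, which is exactly the inflow/outflow matrix used below.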
```python
import matplotlib.pyplot as plt

# Hourly inflow vs outflow for one cell
bike_inflow.loc['wsk593'].plot()
bike_outflow.loc['wsk593'].plot()
plt.xticks(list(range(bike_inflow.shape[1])), bike_inflow.columns, rotation=40)
plt.legend(['Inflow', 'Outflow'])
```

```python
bike_inflow.loc['wsk52r'].plot()
bike_outflow.loc['wsk52r'].plot()
plt.xticks(list(range(bike_inflow.shape[1])), bike_inflow.columns, rotation=40)
plt.legend(['Inflow', 'Outflow'])
```
Method 1: Geohash matching to calculate the tide

Since the task asks for tidal statistics during the weekday morning peak, we can aggregate bike flow per day:
```python
bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1],
                             values='LOCK_STATUS', index=['geohash'], columns=['DAY'],
                             aggfunc='count', fill_value=0)

bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0],
                              values='LOCK_STATUS', index=['geohash'], columns=['DAY'],
                              aggfunc='count', fill_value=0)
```
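Note that this pivot aggregates whole days. To restrict the statistics to the 07:00-09:00 morning peak that the task actually asks about, one option (our own suggestion, using the zero-padded `HOUR` column built earlier) is to filter the orders first:

```python
import pandas as pd

def morning_peak(df, hours=('07', '08')):
    """Keep only orders whose zero-padded HOUR column falls in the peak window.

    hours ('07', '08') covers 07:00-08:59; adjust if 09:00 itself should count.
    """
    return df[df['HOUR'].isin(hours)]

# Toy illustration: only the 07 and 08 rows survive the filter
toy = pd.DataFrame({'HOUR': ['06', '07', '08', '09'], 'LOCK_STATUS': [1, 1, 0, 1]})
print(len(morning_peak(toy)))  # -> 2
```

The filtered frame can then be fed into the same `pivot_table` calls as above, e.g. `morning_peak(bike_order)` in place of `bike_order`.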
From the inflow and outflow, the volume of bikes retained at each location can be calculated:
```python
bike_remain = (bike_inflow - bike_outflow).fillna(0)

# Where more bikes ride away than arrive, clip the deficit to zero
bike_remain[bike_remain < 0] = 0

# Sum across days
bike_remain = bike_remain.sum(1)
```
From this we can find the streets with the most severe tidal build-up, export the results, and submit them for evaluation:
```python
# There are 993 streets in total
bike_fence['STREET'] = bike_fence['FENCE_ID'].apply(lambda x: x.split('_')[0])

# Divide each street's total retained volume by its total parking-spot size
# to obtain a density; .get(x, 0) guards against cells that have no orders
bike_density = bike_fence.groupby(['STREET'])['geohash'].unique().apply(
    lambda hs: np.sum([bike_remain.get(x, 0) for x in hs])
) / bike_fence.groupby(['STREET'])['FENCE_AREA'].sum()

# Sort by density in descending order
bike_density = bike_density.sort_values(ascending=False).reset_index()
bike_density.to_csv('./result.txt', index=None, sep='|')
```
Method 2: Distance matching to calculate the tide

Geohash-based statistics have a drawback: the grid cells are coarse, so the results are only accurate at street level. This section instead matches orders to parking spots by latitude/longitude distance. The idea is to find the parking spot nearest to each order, then compute the tidal situation per spot.

For the distance computation you can use NearestNeighbors from sklearn directly; with the haversine metric it is easy to find the nearest parking spot. (Strictly speaking, the haversine metric expects coordinates in radians, so converting with np.radians keeps the returned distances meaningful.)
```python
from sklearn.neighbors import NearestNeighbors

# https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html
knn = NearestNeighbors(metric='haversine', n_jobs=-1, algorithm='brute')

# haversine expects radians; query points must be converted the same way
knn.fit(np.radians(np.stack(bike_fence['FENCE_CENTER'].values)))
```
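For reference, the haversine great-circle distance itself is short enough to write out. This standalone sketch (our own helper, not part of the pipeline) makes the radians convention explicit:

```python
import math

def haversine_m(lat1, lon1, lat2, lon2, r=6_371_000.0):
    """Great-circle distance in meters between two coordinates given in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = p2 - p1
    dl = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

One degree of latitude is about 111 km, so `haversine_m(24.0, 118.0, 25.0, 118.0)` should land near 111,195 m, which is a quick way to sanity-check any distance pipeline built on this metric.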
However, NearestNeighbors with a brute-force search is slow; on the full order data it may take a long time. hnswlib can be used instead for approximate nearest-neighbor search, which is much faster at a small cost in accuracy.
```python
import hnswlib
import numpy as np

# Approximate nearest-neighbor index over the fence centers;
# l2 on raw degrees is a workable local approximation at city scale
p = hnswlib.Index(space='l2', dim=2)
p.init_index(max_elements=300000, ef_construction=1000, M=32)
p.set_ef(1024)
p.set_num_threads(14)
p.add_items(np.stack(bike_fence['FENCE_CENTER'].values))
```
Find the nearest parking spot for every order:

```python
index, dist = p.knn_query(bike_order[['LATITUDE', 'LONGITUDE']].values, k=1)
```
Compute the tidal flow at every parking spot:

```python
bike_order['fence'] = bike_fence.iloc[index.flatten()]['FENCE_ID'].values

bike_inflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 1],
                             values='LOCK_STATUS', index=['fence'], columns=['DAY'],
                             aggfunc='count', fill_value=0)

bike_outflow = pd.pivot_table(bike_order[bike_order['LOCK_STATUS'] == 0],
                              values='LOCK_STATUS', index=['fence'], columns=['DAY'],
                              aggfunc='count', fill_value=0)

bike_remain = (bike_inflow - bike_outflow).fillna(0)
bike_remain[bike_remain < 0] = 0
bike_remain = bike_remain.sum(1)
```
Calculate the density of parking spots:
```python
bike_density = bike_remain / bike_fence.set_index('FENCE_ID')['FENCE_AREA']
bike_density = bike_density.sort_values(ascending=False).reset_index()
bike_density = bike_density.fillna(0)
```
Final submission result:
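The export step for Method 2 is not shown in the original. Assuming the same pipe-separated format as Method 1, a plausible sketch (the frame below is a toy stand-in for the real `bike_density`):

```python
import pandas as pd

# Toy stand-in for the real bike_density frame of Method 2
bike_density = pd.DataFrame({'FENCE_ID': ['A_1', 'B_2'], 0: [3.5, 1.2]})

# Hypothetical export, mirroring Method 1's result format
bike_density.to_csv('./result.txt', index=False, sep='|')
```

With the real `bike_density` from the previous cell, the same `to_csv` call produces the submission file.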